Objectives of the CRADA
This project involved the evaluation of prototype computer architectures developed by IBM, in particular the POWER/4 and POWER/2 architectures. The goal was to determine the strengths and weaknesses of these architectures and their associated software in order to assess their commercial viability. As part of the project, the PVM [3] and LAPACK [2] software packages, which ORNL helped develop, were tested on the architectures. The associated beta software that we evaluated included a Fortran 90 compiler, a C++ compiler, Ultimedia, AIX 4.1, and ESSL.
In addition, we supplied documentation, expertise, and early software to assist IBM in porting PVM to the SP-1 architecture as part of our cooperative research.
Task 2. ORNL will port PVM to the IBM shared memory machine and evaluate the performance of PVM on this system. ORNL will supply advanced specifications and access to the PVM development team to assist IBM's port of PVM to SP-1.
Task Summary
The port of PVM to POWER/4 uncovered a fatal flaw in the POWER/4 operating system. IBM decided not to make POWER/4 a product, so the flaw was never fixed. We delivered assistance and materials for the SP-1 port.
Task 3. ORNL will evaluate the performance of the IBM machines for solving applications of interest to DOE.
We found that our DOE applications ran poorly on POWER/4, well below expectations, but the same applications ran better than expected on POWER/2. Results on a POWER/4 at a different site were moderately better, suggesting that performance is sensitive to OS level and setup.
Task 4. ORNL will port and verify compatibility and evaluate performance of the LAPACK linear algebra library executing on POWER/2. ORNL will inform IBM of any problems in porting or testing and write a report describing the POWER/2 evaluation.
The LAPACK port to POWER/2 was completed. The ORNL technical report ORNL/TM-11768 describes the performance of the RS/6000 system (general name for the POWER series) on linear algebra operations. Performance was excellent and attributed to the optimizing Fortran compiler. The C++, Fortran 90, and multicast kernel software packages were evaluated on the POWER/2.
IBM was kept informed of all problems that were found.
Task 5. IBM will provide ORNL with a POWER/2 system and the technical resources required to maintain the hardware and software in an operational mode. IBM will provide ORNL with the following software: AIX 3.2.5 (beta), X-windows, TCP/IP, C, C++, and Fortran compilers. IBM will use problem reports to correct product deficiencies and will make fixes available to ORNL as time permits.
The POWER/2 system and the listed software were provided to ORNL. Technical assistance to maintain the systems was provided. Fixes were made available to ORNL in a timely manner.
Benefit to DOE
The cooperative research venture has benefited both parties. IBM received a steady stream of evaluation/problem reports that it used to improve the commercial acceptance of its products. ORNL had the opportunity to evaluate DOE computational problems on state-of-the-art prototype hardware.
There are many benefits of this research to DOE. First, part of DOE's High Performance Computing and Communication (HPCC) initiative involves the evaluation of state-of-the-art computer hardware. This CRADA provided an opportunity for ORNL to gain early access to three different IBM architectures and evaluate them. Second, ORNL was able to get access to beta software packages to apply to DOE research. Research often progresses because of the availability of new technology. By having access to beta technology, ORNL has the opportunity to be "first" in research breakthroughs.
Having early access allows ORNL to influence software developers to make sure the software has the features required by DOE problems. Finally, by having advanced knowledge of the stability and performance of certain software packages, the ORNL PI has acted in the role of advisor to production staff members at the Lab about whether a particular software product should be upgraded lab-wide.
Technical discussion
The goal of this project was to determine the commercial viability of prototype architectures, discover and correct software problems in the supporting software prior to general market availability of the product, and enhance the marketability of the architectures by porting PVM and LAPACK.
POWER/4
The IBM POWER/4 architecture consists of four RS/6000 model 530 workstations interconnected by an atomic memory complex and several megabytes of shared memory. We evaluated whether such a shared memory architecture could effectively solve the computational problems of the user community.
The IBM POWER/4 architecture as conceived in 1991 is a hybrid of the symmetric multiprocessor system, where everything is shared, and the cluster multiprocessor system, where nothing is shared. In the POWER/4 architecture, nothing is shared except where explicitly needed. It integrates existing uniprocessor complexes, processors, and I/O, together with a shared, global memory. An atomic complex is designed to provide interprocessor communication, process synchronization, and a monotonic counter/incrementer for event ordering. In addition, the POWER/4 memory subsystem provides local memory for each processor to reduce accesses to the shared global memory. The local memory feature is designed to enhance the system scalability.
The theory behind the POWER/4 design is that, with its shared memory support and the load/store model preserved, it can inherit most of the shared memory processor (SMP) software technology, in particular parallel compilers. At the same time, with aggressive use of local memory, the POWER/4 can scale beyond the current limits of SMP, particularly if the shared memory is used only as a fast communication path.
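The role of a monotonic counter/incrementer in event ordering can be illustrated with a small software analogue. The sketch below is not the POWER/4 atomic complex or its programming interface, which this report does not describe; it assumes Python threads standing in for the four processors and a lock standing in for the hardware atomicity. The names `MonotonicCounter`, `fetch_and_increment`, and `run_workers` are hypothetical.

```python
import threading


class MonotonicCounter:
    """Software stand-in for a hardware fetch-and-increment counter."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()  # assumption: a lock emulates hardware atomicity

    def fetch_and_increment(self):
        # Return the current value and advance it as one indivisible step.
        with self._lock:
            ticket = self._value
            self._value += 1
            return ticket


def run_workers(num_workers=4, events_per_worker=1000):
    """Each worker stamps its events with tickets drawn from one shared counter."""
    counter = MonotonicCounter()
    logs = [[] for _ in range(num_workers)]

    def worker(rank):
        for _ in range(events_per_worker):
            logs[rank].append(counter.fetch_and_increment())

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return logs
```

Because every ticket is handed out exactly once and never decreases, sorting all events by ticket gives a single global order that every processor agrees on, which is the property an event-ordering counter is meant to provide.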
Shared memory as a communication channel has low latency, and it can have bandwidth close to memory copy speeds.
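The idea of passing a message through shared memory rather than over a network can be sketched as follows. This is only an illustration using the Python standard library's `multiprocessing.shared_memory` module on a modern system; it is not the PVM shared memory port described in this report. The 4-byte length header and the function name `roundtrip_through_shared_memory` are assumptions made for the example, and both ends run in one process for simplicity.

```python
import struct
from multiprocessing import shared_memory

# Assumed message layout: 4-byte little-endian length, then the payload bytes.
HEADER = struct.Struct("<I")


def roundtrip_through_shared_memory(payload: bytes) -> bytes:
    """Write a message into a shared segment and read it back via a second handle."""
    # Sender side: create a segment large enough for header plus payload.
    sender = shared_memory.SharedMemory(create=True, size=HEADER.size + len(payload))
    try:
        HEADER.pack_into(sender.buf, 0, len(payload))
        sender.buf[HEADER.size:HEADER.size + len(payload)] = payload

        # Receiver side: attach to the same segment by name, as a second
        # process would, and copy the message out.
        receiver = shared_memory.SharedMemory(name=sender.name)
        try:
            (length,) = HEADER.unpack_from(receiver.buf, 0)
            return bytes(receiver.buf[HEADER.size:HEADER.size + length])
        finally:
            receiver.close()
    finally:
        sender.close()
        sender.unlink()  # release the segment once both sides are done
```

The transfer consists of one copy into the segment and one copy out, which is why the achievable bandwidth of such a channel can approach memory copy speed. The length header is needed because the operating system may round the segment size up.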
The POWER/4 system is composed of four POWERserver processors, each running at 42 MHz, in a single box. Each box has a potential peak of 320 Mflops. The operating system is AIX 3.2 with kernel extensions to accommodate the atomic complex.
Local memory can be configured from 16 MB to 64 MB per processor, and the global memory can be configured from 64 MB to 448 MB per box.
Initially, ORNL obtained accounts on a half-speed POWER/4 system located at Florida State University. This system was used to learn about the architecture and to create the initial design for a PVM port to the POWER/4. There was a nine-month delay in the manufacture of full-speed systems due to hardware problems and the time IBM lawyers needed to approve access for outside beta testers. ORNL then got an account on a full-speed POWER/4 system located at the IBM site in Austin, Texas.
A full shared memory port of PVM was completed for the POWER/4 architecture.
Several OS problems arose during this development. They were reported to IBM, which fixed them. The beta version of AIX 3.2 used on the POWER/4 underwent several updates over the course of the CRADA. By the end of the POWER/4 evaluation period, the PVM port was finished and available to IBM, but there was still one outstanding bug in the AIX version running on the Austin POWER/4.
Two DOE application codes were run on the Austin POWER/4 to evaluate its compiler performance, CPU performance, and interactive performance. One code calculated the electronic density of states of high-temperature superconductors. These calculations were done from first principles, requiring the solution of thousands of linear systems per iteration until self-consistency was reached. The other code simulated a laser ablation process. In laser ablation, a beam of light heats and then vaporizes the surface of one material, forming a plasma plume that deposits a thin layer of the material on nearby surfaces. The simulation studied how the shape, direction, and density of the plume could be controlled using the shape and intensity of the laser, thereby indirectly controlling the deposited film. The ability to coat a surface with a material that normally would not adhere to it makes laser ablation of great interest to DOE.
One conclusion drawn from these tests was that the Austin POWER/4 had a problem in its OS version or perhaps a hidden compute-intensive process running on it continuously.
The performance of the POWER/4 shared memory system was found to be much lower than expected. This and other production problems led to IBM's decision not to make the POWER/4 a commercial product.
POWER/2
The POWER/2 architecture is the follow-on of the RIOS-1 design. The design has been enhanced by adding dual fixed point execution units, dual floating point execution units, and larger instruction and data caches of 32 KB and 256 KB, respectively. The POWER/2 has the ability to do quad-word floating point load and store and hardware
The POWER/2 has a four-way set associative, dual-port 256 KB data cache divided into four separate data cache unit chips. The cache line size is 256 bytes, and the cache is implemented as a store-back cache to minimize the memory bus traffic.
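The data cache figures quoted above (256 KB, four-way set associative, 256-byte lines) determine the number of lines and sets. The sketch below works through that arithmetic. The address split in `split_address` assumes the conventional mapping in which the low-order bits of the line number select the set. The report does not state the actual POWER/2 indexing scheme, so that part is an assumption, and the function names are hypothetical.

```python
CACHE_BYTES = 256 * 1024  # data cache size from the text
LINE_BYTES = 256          # cache line size from the text
WAYS = 4                  # four-way set associative


def cache_geometry(cache_bytes=CACHE_BYTES, line_bytes=LINE_BYTES, ways=WAYS):
    """Derive line and set counts from the cache parameters."""
    lines = cache_bytes // line_bytes
    sets = lines // ways
    return {
        "lines": lines,
        "sets": sets,
        "offset_bits": line_bytes.bit_length() - 1,  # valid for power-of-two sizes
        "index_bits": sets.bit_length() - 1,
        "doubles_per_line": line_bytes // 8,         # 8-byte floating point words
    }


def split_address(addr, line_bytes=LINE_BYTES, sets=CACHE_BYTES // LINE_BYTES // WAYS):
    """Split a byte address into (tag, set index, line offset).

    Assumes conventional low-order indexing, not a documented POWER/2 detail.
    """
    offset = addr % line_bytes
    index = (addr // line_bytes) % sets
    tag = addr // (line_bytes * sets)
    return tag, index, offset
```

Under these figures the cache holds 1024 lines in 256 sets, and each line holds 32 double-precision values, so a unit-stride sweep through an array touches a new line only once every 32 elements.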
The storage control chip contains the controls and configuration registers for the system memory, controls for the system I/O bus, and controls for the interface between the system memory and the data cache units. This chip also controls the interface to the system read-only memory used for initialization of the CPU.
The first tests we ran on the POWER/2 were matrix kernels and scientific benchmarks. The theoretical peak performance of the POWER/2 is 266 Mflops, given that it can do four operations per cycle and has a clock rate of 66.6 MHz. Amazingly, we were able to achieve performance of 236 Mflops from the Linpack benchmark solving a linear system with 1000 unknowns. The benchmark is written completely in Fortran 77.
Normally, hand tuning and assembly language routines would be required to get this close to peak performance on a workstation. The POWER/2 was able to achieve this through extensive optimization in the xlf Fortran compiler. This performance was achieved only after a process of evaluating several beta compiler options and helping IBM decide which options should be available and which should be default settings. A detailed study of the POWER architecture on linear algebra operations was documented in an ORNL technical report [1].
Scientific applications from materials science, climate modeling, and contaminant transport simulations were compiled and run on the POWER/2. The high quality compiler optimization made these applications run in the 50-150 Mflops range, which was much better than any other workstation at the time.
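The peak and achieved figures above can be checked with a few lines of arithmetic. The sketch below uses only the numbers stated in the text (four operations per cycle, 66.6 MHz, 236 Mflops, 1000 unknowns) together with the standard Linpack operation count of 2n^3/3 + 2n^2. The implied solve time is derived from those numbers and is not a measurement reported by the authors.

```python
def peak_mflops(ops_per_cycle, clock_mhz):
    """Theoretical peak: operations per cycle times millions of cycles per second."""
    return ops_per_cycle * clock_mhz


def linpack_flops(n):
    """Standard Linpack operation count for solving a dense n x n system."""
    return 2.0 * n**3 / 3.0 + 2.0 * n**2


PEAK = peak_mflops(4, 66.6)       # about 266 Mflops, matching the text
ACHIEVED = 236.0                  # Mflops reported for Linpack, n = 1000
EFFICIENCY = ACHIEVED / PEAK      # fraction of peak actually reached

# Derived, not reported: time a 236 Mflops run would need for n = 1000.
IMPLIED_SECONDS = linpack_flops(1000) / (ACHIEVED * 1.0e6)

if __name__ == "__main__":
    print(f"peak       = {PEAK:.1f} Mflops")
    print(f"efficiency = {EFFICIENCY:.1%}")
    print(f"implied solve time = {IMPLIED_SECONDS:.2f} s")
```

The achieved rate works out to roughly 89 percent of peak, which is the basis for the remark that such results normally require hand tuning.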
Inventions
No inventions were produced as part of this cooperative research effort.
Commercialization
The POWER/2 architecture is now being marketed by IBM across a full range of products, from desktop computers like the Model 390 up to massively parallel computers like the 512-processor SP-2 at Cornell.
Future collaboration
Our relationship with IBM was quite good at the end of the CRADA period. IBM asked if we would be interested in continuing to evaluate prototype hardware and beta software as part of our mutual research interests. IBM shipped us $80,000 worth of hardware and software to continue the research. We drafted a CRADA extension to match their contributions, describing our work in evaluating the IBM 530 symmetric multiprocessor architecture. Due to staff turnover at IBM, it was several months before IBM had evaluated, negotiated, and signed the agreement. The signed agreement was
Since this letter was sent after MMES should have had the signed agreement, the impression it gave of ORNL's (lack of) desire to establish CRADAs was much worse than if the letter had been sent during the long negotiation phase. Neither the ORNL PI nor the IBM PI had knowledge that such a letter was being drafted by the Office of Technology Transfer. They learned of it only after it had been sent, so they were not able to soften the impact of this impression. I particularly felt for the IBM PI, who had worked for months pushing this agreement through the slow IBM legal system.
Our relationship with IBM cooled significantly after the letter was sent. The ORNL and IBM PIs have remained on good terms, and IBM has recently asked if ORNL could help in the research and evaluation of their AIX 4.2 operating system. This research would involve no exchange of funds and would be completed under a standard IBM beta tester agreement.
Conclusions
Due to problems with the POWER/4 architecture, some of which were brought to light by this CRADA, IBM decided not to develop the POWER/4 architecture as a product.
Instead, IBM is focusing its marketing on the POWER/2 architecture, which our evaluations showed performed even better than expected. The cooperative research venture has benefited both parties. IBM received a steady stream of evaluation/problem reports that it used to improve the commercial acceptance of its products. ORNL had the opportunity to evaluate DOE computational problems on state-of-the-art prototype hardware.
In contrast, the POWER/2 architecture has exceeded performance expectations and has demonstrated its ability to solve large computational problems of interest to the user community. IBM is now focused on marketing POWER/2 systems and developing an entire line of computers based on this technology.
