Is real-world gaming technology the next big thing in the more academically based high-performance computing arena? The authors put the PlayStation 3 to the test.
The heart of the Sony PlayStation 3, the Cell processor, wasn't originally intended for scientific number crunching, just as the PlayStation 3 itself wasn't meant primarily to serve such purposes. Yet both of these could impact the high-performance computing world in significant ways. This introductory article takes a closer look at their potential to do so; an extended version of it is published as a University of Tennessee technical report (www.cs.utk.edu/~library/2008).
The Cell in a Nutshell
The Cell processor's main control unit is the Power Processing Element (PPE), which is a 64-bit, two-way simultaneous multithreading (SMT) processor that's binary-compliant with the PowerPC 970 architecture. The PPE consists of the Power Processing Unit (PPU), 32 Kbytes of L1 cache, and 512 Kbytes of L2 cache.
Although the PPU uses the PowerPC 970 instruction set, it has a relatively simple architecture with in-order execution, which results in considerably less circuitry than its out-of-order execution counterparts as well as lower energy consumption. The high clock rate, high memory bandwidth, and dual-threading capabilities make up for the potential performance deficiencies stemming from the PPU's in-order execution architecture. The SMT feature, which comes at a small 5 percent increase in the hardware's cost, can deliver up to a 30 percent increase in performance. The PPU also includes a short-vector single-instruction, multiple-data (SIMD) engine called VMX, which is an incarnation of the PowerPC's AltiVec.
However, the Cell processor's real power lies in the eight Synergistic Processing Elements (SPEs) that accompany the PPE. Each SPE consists of a Synergistic Processing Unit (SPU), 256 Kbytes of private memory (referred to as the local store), and a memory-flow controller that delivers powerful direct memory-access capabilities to the SPU. The SPEs are the Cell's short-vector SIMD workhorses and possess a large 128-entry, 128-bit vector register file as well as a range of SIMD instructions that can operate simultaneously on two double-precision values, four single-precision values, eight 16-bit integers, or 16 8-bit characters. Most instructions are pipelined and can complete one vector operation in each clock cycle, including fused multiplication-addition in single precision, which means that the SPU can accomplish two floating-point operations on four values in each clock cycle. This translates to a peak of 2 × 4 × 3.2 GHz = 25.6 Gflop/s for each SPE and adds up to a staggering peak of 8 × 25.6 Gflop/s = 204.8 Gflop/s for the entire chip. Figure 1 shows a schematic of the Cell processor's design.
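The peak figures above are simple products, and it can help to see them spelled out. The following is a minimal sketch in Python that only restates the arithmetic from the text; the function and constant names are illustrative and don't come from any Cell toolkit.

```python
REGISTER_BITS = 128   # width of one SPE vector register (from the text)
CLOCK_GHZ = 3.2       # SPE clock rate (from the text)
NUM_SPES = 8          # SPEs on the full Cell chip (from the text)


def simd_lanes(element_bits):
    """Number of elements of the given width that fit in one vector register."""
    return REGISTER_BITS // element_bits


def peak_gflops(element_bits, flops_per_lane=2, units=1):
    """Peak rate, assuming one fully pipelined vector instruction per cycle.

    flops_per_lane=2 models a fused multiplication-addition
    (one multiplication plus one addition per element).
    """
    return flops_per_lane * simd_lanes(element_bits) * CLOCK_GHZ * units


if __name__ == "__main__":
    for bits in (64, 32, 16, 8):
        print(f"{bits:2d}-bit elements: {simd_lanes(bits)} per register")
    print(f"one SPE:    {peak_gflops(32):.1f} Gflop/s")
    print(f"whole chip: {peak_gflops(32, units=NUM_SPES):.1f} Gflop/s")
```

Note that the one-instruction-per-cycle assumption holds for single precision, as described above; it shouldn't be applied to the current chip's double-precision arithmetic.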
All the Cell processor's components, including the PPE, the SPEs, the main memory, and the I/O system, are connected via the element interconnection bus, which has four unidirectional rings (two in each direction) and a token-based arbitration mechanism that plays the role of a traffic light. Each participant is hooked up to the bus with a bandwidth of 25.6 Gbytes/s; the bus has an internal bandwidth of 204.8 Gbytes/s, which means that for all practical purposes, you shouldn't be able to saturate it.
The Cell chip draws its power from the fact that it's a parallel machine with eight small, fast, specialized number-crunching processing elements. The SPEs, in turn, rely on a simple design with short pipelines, a huge register file, and a powerful SIMD instruction set.
The Cell is essentially a distributed-memory system on a chip, on which each SPE possesses its private memory stripped of any indirection mechanisms to make it faster. This puts explicit control over data motion in the hands of the programmer, who must use techniques closely resembling message passing, a model that some might think is challenging but is the only one known to be scalable today.
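To make the idea of explicit data motion concrete, here is a minimal sketch in plain Python. It isn't Cell code and uses no real SPE interface; the function name and the list-based "main memory" are purely illustrative. It only mimics the pattern the programmer must follow: copy a chunk into a small private buffer, compute on it, and copy the result back.

```python
LOCAL_STORE_BYTES = 256 * 1024  # SPE local store size (from the text)
BYTES_PER_FLOAT = 4             # single precision


def scale_in_chunks(main_memory, factor, chunk_elems):
    """Scale main_memory in place, one local-store-sized chunk at a time."""
    if chunk_elems * BYTES_PER_FLOAT > LOCAL_STORE_BYTES:
        raise ValueError("chunk does not fit in the local store")
    for start in range(0, len(main_memory), chunk_elems):
        # "Get": explicitly copy one chunk from main memory to a private buffer.
        local = main_memory[start:start + chunk_elems]
        # Compute on the private copy only; main memory isn't touched here.
        local = [factor * x for x in local]
        # "Put": explicitly copy the results back to main memory.
        main_memory[start:start + len(local)] = local
    return main_memory


if __name__ == "__main__":
    data = [float(i) for i in range(10)]
    print(scale_in_chunks(data, 2.0, chunk_elems=4))
```

On the real hardware, the local store also holds the program code, so a usable chunk would be smaller than the full 256 Kbytes, and the transfers would typically be overlapped with computation rather than run one after the other as in this sketch.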
The PlayStation 3
The PlayStation 3 is probably the cheapest Cell-based system on the market: it contains a Cell processor (with the number of SPEs reduced to six), 256 Mbytes of main memory, an NVIDIA graphics card with 256 Mbytes of its own memory, and a gigabit Ethernet (GigE) network card.
Sony made several convenient provisions for installing Linux on the PlayStation 3 in a dual-boot setup. Installation instructions are plentiful on the Web, but the basic gist is that a virtualization layer, also called the hypervisor, separates the Linux kernel from the hardware. Devices and other system resources are virtualized, but Linux device drivers can work with them. The Cell processor in the PlayStation 3 is identical to the one you would find in high-end IBM or Mercury blade servers, with the exception that two SPEs aren't available. One is disabled for chip yield reasons: a Cell with one defective SPE still passes as a good chip in the PlayStation 3, and if all the SPEs are nondefective, a good one is disabled during manufacturing. The other SPE is hidden from the application by the operating system's hypervisor.
The GigE card is accessible to the Linux kernel through the hypervisor, which both makes it possible to turn the PlayStation 3 into a networked workstation and facilitates building PlayStation 3 clusters via network switches. You can program such installations by using the Message Passing Interface (MPI). The network card has a direct memory-access unit, which you can set up via dedicated hypervisor calls that enable data transfers without requiring the main processor's intervention.
Programming
All Linux distributions for the PlayStation 3 come with the standard GNU compiler suite, including C (GCC), C++ (G++), and Fortran 95 (GFORTRAN), which now also provides support for OpenMP through the GNU GOMP library. The programmer can use OpenMP to exploit the PPE's SMT capabilities. IBM's software development kit for Cell delivers a similar set of GNU tools, along with an IBM compiler suite that includes C/C++ and, more recently, Fortran (with support for Fortran 95 and partial support for Fortran 2003). The kit is available for installation on Cell or x86-based systems, with code compiled and built in cross-compilation mode, a method often preferred by experts. These tools practically guarantee compilation of any existing C, C++, or Fortran code on the Cell processor, which makes the initial port of any existing software basically effortless.
As Table 1 shows, several programming models and environments have emerged for the Cell processor; it seems to have ignited similar enthusiasm in the scientific high-performance computing, embedded systems, and graphics communities as well. Naturally, the programming techniques proposed for the Cell are as diverse as the communities involved: they include shared-memory, distributed-memory, and stream-processing models and represent both data- and task-parallel approaches.
A separate problem is related to programming for a cluster of PlayStation 3s: such a cluster is essentially a distributed-memory machine, and there's almost no programming
Scientific Computing
In spite of its power, the PlayStation 3 has severe limitations for scientific computing. First, it can only achieve its astounding peak of 153.6 Gflop/s for compute-intensive tasks in single-precision arithmetic, which, besides delivering less precision, isn't compliant with the IEEE floating-point standard (its double-precision peak is less than 11 Gflop/s). Second, it only implements truncation rounding, flushes denormalized numbers to zero, and treats NaNs ("not a number") as normal numbers. Finally, memory-bound problems are limited by the main memory's bandwidth of 25.6 Gbytes/s. This is a very respectable value compared to cutting-edge heavy-iron processors, but it sets the upper limit of memory-intensive single-precision calculations to 12.8 Gflop/s and double-precision calculations to 6.4 Gflop/s, assuming two operations are performed on one data element. However, the largest disproportion in the PlayStation 3's performance is between the Cell processor's speed and that of the GigE interconnection. GigE isn't crafted for performance and, in practice, only about 65 percent of its peak bandwidth can be achieved in MPI communication. Also, because of the extra layer of indirection between the operating system and the hardware (specifically, the hypervisor), the incurred latency can be as big as 200 μs, which is at least an order of magnitude worse than today's standards for high-performance interconnections. Even if you could lower the latency and gain a larger fraction of the peak bandwidth, the communication capacity of 1 Gbit/s is way too small to keep the Cell processor busy. A common remedy for slow interconnections is to run larger problems, but in this case, the main memory's small size (256 Mbytes) turns out to be the limiting factor.
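The compute peak and the memory-bound limits quoted above follow from a few multiplications and divisions. The sketch below, in Python, reproduces them from the figures in the text (six available SPEs at 25.6 Gflop/s each, 25.6 Gbytes/s of memory bandwidth, two operations per data element); the function names are illustrative only.

```python
SPE_PEAK_GFLOPS = 25.6          # single-precision peak of one SPE
AVAILABLE_SPES = 6              # SPEs visible to applications on the PlayStation 3
MEMORY_BANDWIDTH_GBYTES = 25.6  # main memory bandwidth in Gbytes/s


def compute_peak_gflops(spes=AVAILABLE_SPES):
    """Single-precision compute peak for the given number of SPEs."""
    return spes * SPE_PEAK_GFLOPS


def memory_bound_gflops(bytes_per_element, ops_per_element=2):
    """Upper limit when every element must be streamed from main memory."""
    giga_elements_per_second = MEMORY_BANDWIDTH_GBYTES / bytes_per_element
    return giga_elements_per_second * ops_per_element


if __name__ == "__main__":
    print(f"compute peak:           {compute_peak_gflops():.1f} Gflop/s")
    print(f"memory-bound, 4 bytes:  {memory_bound_gflops(4):.1f} Gflop/s")
    print(f"memory-bound, 8 bytes:  {memory_bound_gflops(8):.1f} Gflop/s")
```

The gap between the first number and the other two is the point: a kernel that performs only two operations per element loaded from main memory can use less than a tenth of the single-precision peak.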
Taken all together, even simple examples of compute-intensive workloads such as matrix multiplication can't benefit from running in parallel on more than two PlayStation 3s. Rather, only the extremely compute-intensive, embarrassingly parallel problems have a fair chance of success in scaling to PlayStation 3 clusters. Such distributed computing problems, often referred to as screensaver computing, have gained popularity in recent years: the trend initiated by the SETI@Home project had many followers, including the very successful Folding@Home project.
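A rough back-of-the-envelope estimate suggests why matrix multiplication scales so poorly here. The Python sketch below is an illustration under stated assumptions, not a measurement and not the authors' analysis: it assumes three square single-precision matrices fill the whole 256 Mbytes of main memory, computation at the full 153.6 Gflop/s peak, the nominal gigabit Ethernet rate of 1 Gbit/s at 65 percent efficiency, and no latency.

```python
import math

MAIN_MEMORY_BYTES = 256 * 1024**2  # 256 Mbytes of main memory (from the text)
PEAK_FLOPS = 153.6e9               # single-precision peak (from the text)
GIGE_BITS_PER_S = 1e9              # nominal gigabit Ethernet rate (assumption)
MPI_EFFICIENCY = 0.65              # achievable fraction in MPI (from the text)
BYTES_PER_ELEMENT = 4              # single precision


def largest_matrix_order():
    """Largest n such that three n x n matrices fit in main memory."""
    return math.isqrt(MAIN_MEMORY_BYTES // (3 * BYTES_PER_ELEMENT))


def compute_seconds(n):
    """Time for one n x n matrix multiplication (2 * n**3 flops) at peak."""
    return 2 * n**3 / PEAK_FLOPS


def transfer_seconds(n):
    """Time to send one n x n matrix at the achievable MPI bandwidth."""
    bytes_per_second = GIGE_BITS_PER_S / 8 * MPI_EFFICIENCY
    return n * n * BYTES_PER_ELEMENT / bytes_per_second


if __name__ == "__main__":
    n = largest_matrix_order()
    print(f"largest order n:     {n}")
    print(f"multiply at peak:    {compute_seconds(n):.2f} s")
    print(f"send one matrix:     {transfer_seconds(n):.2f} s")
```

Under these assumptions, sending a single matrix over the network takes roughly as long as the entire multiplication, and the small main memory rules out the usual fix of growing the problem until computation dominates.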
The idea of many-core processors reaching hundreds, if not thousands, of processing elements per chip is emerging, with some researchers aiming for distributed-memory systems on a chip, an inherently more scalable solution than shared-memory setups. Owing to this, the technology delivered by the PlayStation 3 through its Cell processor provides a unique opportunity to gain experience, which is likely to be priceless in the near future.
But a major shortcoming of the current Cell processor for numerical applications is the relatively slow speed of double-precision arithmetic. The next incarnation promises to include a fully pipelined double-precision unit that will deliver the speed of 12.8 Gflop/s from a single SPE clocked at 3.2 GHz and 102.4 Gflop/s from an eight-SPE system, which is going to make the chip a very serious competitor in the world of scientific and engineering computing. Although in agony, Moore's law is still alive, and we're entering the era of billion-transistor processors. Given this, the current Cell processor uses a rather modest number of transistors (234 million). It isn't hard to envision a Cell processor with more than one PPE and many more SPEs, perhaps exceeding the performance of one teraflop/s for a single chip.