The Cell Broadband Enginee (Cell/B.E.) processor is the first implementation of the Cell Broadband Engine Architecture (CBEA), developed jointly by Sony, Toshiba, and IBM. In addition to use of the Cell/B.E. processor in the Sony Computer Entertainment PLAYSTATIONt3 system, there is much interest in using it for workstations, media-rich electronics devices, and video and image processing systems. The Cell/B.E. processor includes one PowerPCt processor element (PPE) and eight synergistic processor elements (SPEs). The CBEA is designed to be well suited for a wide variety of programming models, and it allows for partitioning of work between the PPE and the eight SPEs. In this paper we show that the Cell/B.E. processor can outperform other modern processors by approximately an order of magnitude and by even more in some cases.
Introduction
Until recently, improvements in the performance of general-purpose processor systems were derived primarily from higher processor clock frequencies and wider issue superscalar and deeper super-pipelined designs. However, without a commensurate increase in the memory speed, these approaches led only to relatively increased memory latencies and even more complex logic to hide those latencies. Furthermore, because of hardware limits on the number of concurrent accesses to memory, complex processor cores often ended up underutilizing the execution pipelines and memory bandwidth.
The approach taken by the Cell Broadband Engine (Cell/B.E.) processor designers was to focus on improving performance/area and performance/power ratios [1] . These goals are largely achieved using simple, yet powerful cores that use area more efficiently with less power dissipation. Supported by a high-bandwidth interconnection bus, these cores can work both independently and cooperatively. By supporting a large number of simultaneous memory accesses from the direct memory access (DMA) engines, which can move data with negligible processor assistance, the Cell/B.E. processor design allows for effective use of the memory bandwidth as well. Architecturally, the design philosophy resulted from the recent trend of having multiple general-purpose cores in the same chip; in the Cell/B.E. processor, the cores are simple and are designed to be able to work together efficiently and in novel ways. Extensive documentation on the Cell/B.E. processor and its programming environments can be found in [2] .
In Section 2, we introduce the performance characteristics of the Cell/B.E. processor, focusing on the PowerPC * processor element (PPE), the synergistic processor elements (SPEs), the element interconnect bus (EIB), the Rambus XDR ** dynamic random access memory (DRAM), and the input/output interfaces (IOIFs). Finally, in Section 3, we characterize the performance of several applications that exploit the Cell/B.E. processor features and compare the results with those of a few other general-purpose processors.
2. Cell/B.E. Architecture, bandwidths, and latencies Figure 1 shows a high-level view of the first implementation of the Cell/B.E. processor. It includes a general-purpose 64-bit PPE. In addition, the Cell/B.E. processor incorporates eight SPEs interconnected by a high-speed, memory-coherent EIB. The initial ÓCopyright 2007 by International Business Machines Corporation. Copying in printed form for private use is permitted without payment of royalty provided that (1) each reproduction is done without alteration and (2) the Journal reference and IBM copyright notice are included on the first page. The title and abstract, but no other portions, of this paper may be copied or distributed royalty free without further permission by computer-based and other information-service systems. Permission to republish any other portion of this paper must be obtained from the Editor.
implementation of the Cell/B.E. processor is targeted to run at 3.2 GHz. The execution units on the SPEs follow the single-instruction multiple-data (SIMD) execution model of vector processors and account for much of the computational power of the Cell/B.E. processor. When single-precision (SP) floating-point (FP) fused multiply-add instructions are in use, the eight SPEs in the first-generation Cell/B.E. chip can perform up to 64 FP operations per processor cycle.
The integrated memory controller provides a peak bandwidth of 25.6 GB/s to an external Rambus XDR DRAM, while the integrated input/output (I/O) controller provides an aggregate peak raw bandwidth of 25 GB/s on inbound links and 35 GB/s on outbound links. The EIB supports a peak bandwidth of 204.8 GB/s for intrachip data transfers among the PPE, the SPEs, the memory interface controller (MIC), and the IOIF controllers.
PowerPC processor element
The PPE is a dual-issue, in-order implementation of the IBM PowerPC Architecture*, with multithreading capability and integrated vector multimedia extensions (VMX). The PPE is responsible for overall control of the chip and is typically used in the management and allocation of synergistic processor units (SPUs) and their tasks.
As shown in Figure 2 , the PPE consists of three main units: the instruction unit (IU), the execution unit (XU), and the vector/scalar execution unit (VSU), which contains the VMX and floating-point unit (FPU). The IU contains the Level 1 (L1) instruction cache (ICache), branch prediction hardware, instruction buffers, and dependency checking logic. The main division between the IU and the rest of the system is at the instruction issue 3 (IS3) stage, which is the main stall point for the PPE. The XU contains the integer execution units (FXUs) and the load-store unit (LSU). The VSU contains all of the execution resources for FP and VMX instructions, as well as separate VMX and FP instruction queues in order to increase overall processor throughput. A pipeline timing diagram of the PPE is presented in Figure 3 .
Although the PPE is considered an in-order machine, several mechanisms allow it to achieve some of the benefits of out-of-order execution, without the associated complexity of instruction or memory access reordering hardware. First the processor can make forward progress on a thread even when a load from that thread misses the cache. The processor continues to execute past the load miss, stopping only when there is an instruction that is actually dependent on the load. This allows the processor to send up to eight requests to the L2 cache without stopping. This can be a great benefit to FP and SIMD code, since these typically have a very high data cache miss rate, and it is often easy to identify independent loads. In addition to allowing loads to be performed out of order, the PPE uses ''delayed execution pipelines'' to achieve some of the benefits of out-of-order execution. Delayed execution pipelines allow instructions that normally would cause a stall at the issue stage to move to a special ''delay pipe'' to be executed later at a specific point.
Synergistic processor element
The SPE comprises an SPU and a memory flow controller (MFC). The SPU [3] is a compute engine that supports the SIMD execution paradigm with 256-KB of dedicated local storage. The MFC consists of a DMA controller with an associated memory management unit to aid effective-to-real address translation, as well as an atomic unit to handle synchronization operations with other SPUs and the PowerPC processor unit (PPU).
The SPU is a dual-issue, in-order machine with a large 128-entry, 128-bit register file used for FP, integer, and branch operations. It operates directly on instructions and data from its dedicated local store and relies on a channel interface to the DMA controller to access the main memory and other local stores. The channel interface, which is in the MFC, runs independently of the SPU and is capable of translating addresses and transferring data between the main memory and the local storage while the SPU continues with the program execution.
Simply stated, the SPU is based on a SIMD architecture. The set of operations allowed by its instruction set architecture closely resembles that of the
Figure 2
Block diagram of the PPE. (BRU: branch unit; I-ERAT, D-ERAT: instruction/data effective-to-real address translation.) L1 Figure 4 shows the main functional units in an SPE: an FPU for SP, double-precision (DP), and integer multiplies; a fixed-point unit for arithmetic, logical operations, and word shifts; another fixed-point unit for permutes, shuffles, and quad-word rotates; a control unit for instruction sequencing and branch execution; a local store unit for loads and stores and to supply instructions to the control unit; and a channel/DMA transport that is responsible for controlling input and output through the MFC.
Each functional unit in Figure 4 is assigned to one of the two execution pipelines. The FPU and the fixed-point unit are on the even pipeline while the rest of the functional units are on the odd pipeline. The SPU can issue and complete up to two instructions per cycle, one on each of the execution pipelines. A dual issue occurs when a group of fetched instructions has at least two ready-to-execute instructions, one of which is executed by a unit on the even pipeline and the other by a unit on the odd pipeline.
Instruction fetches are initiated in one of three ways: instruction flush condition, inline prefetch, or software hint. The instruction fetch logic reads 32 instructions at a time into its instruction line buffer from which two instructions at a time are sent to the issue logic. When the operands are ready, the issue logic sends the instructions to the functional units for execution in the same cycle. Pipeline length varies from two to seven cycles. Figure 5 shows an SPU pipeline diagram.
Figure 3
Pipeline diagram of the PPE. MC1  MC3  MC4  MC2  …  MC10  MC11  MC9   IC1  IC3  IC4  IC2  IB1  ID2  ID3  ID1  IS1  IS3  IS2 Features such as a fixed access time to the local store, simple rules to issue instructions, software-inserted branch hints, and a large register file are exposed to the compiler and applications for performance tuning. With only moderate effort to tune codes, we have seen a wide variety of applications approach the theoretical limit of two instructions issued per cycle in the SPU.
Element interconnect bus
The EIB in the Cell/B.E. processor allows for communication among the PPE, the SPEs, the off-chip memory, and the external I/O ( Figure 6 ). The EIB consists of one address bus and four 16-byte-wide data rings, two of which run clockwise and the other two counterclockwise. Each ring can allow up to three concurrent data transfers as long as their paths do not overlap. The EIB operates at half the speed of the processor.
Each requester on the EIB starts with a small number of initial command credits to send out requests on the bus. The number of credits is the size of the command buffer inside the EIB for that particular requester. One command credit is used for each request on the bus. When a slot becomes open in the command buffer as a previous request progresses in the EIB request pipeline, the EIB returns the credit to the requester.
When a unit requires access to a data ring in order to send data to another unit, it makes a single request to the data ring arbiter on the EIB. The arbiter processes requests from all requesters and decides, as optimally as it can, which data ring is granted to which requester and the time at which the data ring is granted. The memory controller is given the highest priority to prevent stalling of the requester of the read data, while all others are treated equally with a round-robin priority. Any ring requester can use any of the four rings to send or receive data. The data arbiter does not grant a data ring to a requester if the transfer would cross more than halfway around the ring on its way to its destination or would interfere with another data transfer already in progress.
Each unit on the EIB can simultaneously send and receive 16 bytes of data every bus cycle. The maximum data bandwidth of the entire EIB is limited by the maximum rate, one 16-byte data transfer per bus cycle, at which addresses are ''snooped'' across all
Figure 5
Pipeline timing diagram of the SPU. ID3  IS1  IF1  IF2  IF3  IF4  IF5  IB1  IB2  ID1  ID2 units connected to the EIB. Since each snooped address request can potentially transfer up to 128 bytes, the theoretical peak data bandwidth on the EIB at 3.2 GHz is 128 bytes 3 1.6 GHz ¼ 204.8 GB/s. The sustained data bandwidth on the EIB will often be lower than the peak bandwidth because of several factors: the locations of the destination and the source relative to each other, the potential for interference between the new transfer and those already in progress, the number of Cell/B.E. chips in the system, whether the data transfers are to or from memory or between local stores in the SPEs, and the efficiency of the data arbiter.
Reduced bus bandwidths can result when all requesters access the same destination (memory or local store) at the same time, when all transfers are in the same direction and cause idling on two of the four data rings, when there are a large number of partial cache line transfers lowering the bus efficiency, or when each data transfer is six hops, inhibiting the units on the way from using the same ring.
We ran a series of experiments on the hardware in our laboratory using a core frequency of 3.2 GHz. In our experiments, all four pairs of SPEs did streaming reads or writes to each other's local stores. For instance, SPE0 reads from SPE2, SPE4 from SPE6, SPE1 from SPE3, and SPE5 from SPE7, and vice versa in all cases. Table 1 shows the results from a few of the runs. We use the notation SPEx $ SPEy to indicate that SPEx reads from the local store of SPEy, and vice versa.
The sustained effective data bandwidth in our experiments varied from 78 GB/s (38% of peak) to 197 GB/s (96% of peak). In the case in which the bandwidth is only 78 GB/s, the communicating SPEs are farthest apart (see Figure 6 ) and only one transfer happens on each of the four rings (this determines the lower bound for the EIB bandwidth). We would expect 102.4 GB/s in this case, but because of the limitation of the data ring arbiter design, we achieve about 75% of the expected bandwidth. In the case in which the bandwidth is 95 GB/s, the communicating SPEs are five hops away from each other and still prevent other transfers from taking place because of path overlap. In the remaining cases, the bandwidth achieved is close to the peak of 204.8 GB/s.
Memory subsystem
The MIC in the Cell/B.E. chip is connected to the external Rambus XDR DRAM through two XDR controller I/O (XIO) channels that operate at a maximum effective frequency of 3.2 GHz (400 MHz, octal data rate). Each XIO channel can have eight banks for a maximum size of 256 MB, for a total memory size of 512 MB.
The MIC has separate independently operating read and write request queues for each XIO channel. For each channel, the MIC arbiter alternates the dispatch between read and write queues after a minimum of every eight dispatches from each queue or until the queue becomes empty, whichever comes first. High-priority read requests are given precedence over normal reads and writes.
Writes of 16 bytes or more, but less than 128 bytes, can be written directly to memory using a masked-write operation; writes less than 16 bytes require a readmodify-write operation. Because of the small number of buffers for read-modify-write operations, the read part of the read-modify-write operation is given a higher priority than normal reads, while the corresponding write part of the operation is given a higher priority than normal writes.
Other performance-enhancing features in the MIC include a fast-path mode, in which an incoming request can bypass an empty request queue and be dispatched immediately to reduce read latency by several cycles, and a mode in which reads can be speculatively dispatched to the DRAMs even before the combined response is received from the EIB.
With both XIO channels operating at 3.2 GHz, the peak raw memory bandwidth is 25.6 GB/s. However, normal memory operations such as refresh and scrubbing typically reduce the bandwidth to about 24.6 GB/s. The peak bandwidth assumes that all of the banks are 
Test configuration
Aggregate EIB bandwidth at 3.2 GHz (GB/s)
kept active all of the time by the incoming request streams, that all requests are of the same type (read or write), and that each request is exactly 128 bytes. If streaming reads and writes are intermingled, the effective bandwidth can be further reduced to about 21 GB/s; the bandwidth loss in this case arises from the overhead of turning around the MIC-to-XIO bidirectional bus. Finally, the Cell/B.E. processor also implements resource allocation to provide controlled access to the critical shared resources such as the memory or the I/O links. This controlled access allows time-critical applications to meet timing targets by preventing contention for the shared resources. Data and commands on the IOIF are transmitted as packets. In addition to the command, response, and data, each packet may carry information such as the data tag, data size, command identifier, and flow control information, as well as other information. Because of these overheads and potentially nonoptimal arrival times of data and commands, the effective bandwidth on the two interfaces may be typically lower, ranging from 50% to 80% of the raw bandwidth. Of course, other factors such as the prevailing data traffic on the EIB, resource allocation, speed of the I/O devices, ordering characteristics of the I/O data traffic, and interrupts can potentially reduce the I/O bandwidth further.
Application examples and their performance characteristics
In this section, we present the results for a number of applications that showcase the performance of the Cell/B.E. chip. The applications cover a wide range: matrix multiplication, LINPACK [4] , MPEG-2 video decoding [5] , triangle transform and lighting, and cryptography algorithms such as Advanced Encryption Standard (AES), Triple Data Encryption Standard (TDES), and Message Digest 5 (MD5) algorithm. Some of the performance results in this section were obtained from our cycle-accurate SPEsim performance model (an internal tool), because it provides considerable insight into pipeline behavior. In other cases, the results were measured on the hardware with no operating system support and overheads. All applications, which were optimized to take advantage of the Cell/B.E. microarchitecture features in using techniques such as the ones described in [6] , were compiled with a prerelease version of the IBM XL C compiler [7] on the Linux ** operating system.
The results from the hardware indicate that the simulator is typically more than 98% accurate for compute-intensive code, while in some cases such as LINPACK, which has significant memory activity, the modeling accuracy can drop to about 88% because of simplifying assumptions in the EIB and memory system model.
Optimization of matrix multiplication
Our matrix multiplication program calculates C ¼ A 3 B, where A, B, and C are square matrices of order N. Each element c ij of C is computed as follows:
A well-known optimization to reduce the required memory bandwidth is to partition the matrices into smaller square matrices of order M , N: ; where A ij , B ij , and C ij are square matrices of order M. Each matrix block C ij is then solved by
The first optimization for the SPE architecture was to take advantage of the four-way SIMD using 32-bit multiply-add operations to perform up to eight SP FP operations per cycle.
The next optimization was to take advantage of DMA engines in the SPE by adopting a double-buffer approach to overlap computations on the previously fetched data blocks with transfers of subsequent data blocks to and from memory. This effectively prevents memory stalls.
Additional optimizations, such as overlapping the execution of blocks, loop exchange, and software pipelining, were applied to balance the usage of the two SPU pipelines and maximize the dual-issue rate. With these tuning efforts, the matrix multiplication program on a single SPU improved its performance by almost 60 times over the original scalar code running on an SPU.
The data presented in Table 2 was obtained from a cycle-accurate SPEsim simulator. All SPUs had performance characteristics similar to those shown in Table 2 . The accuracy of the results was 99.6% from our SPEsim simulator for matrix multiplication with a matrix size of 256 3 256 (one SPU, SPEsim ¼ 25.12 Gflops, hardware ¼ 25.01 Gflops).
Since operations in each data block are independent of those in other blocks, the matrix multiplication algorithm is easily parallelized to all eight SPUs. Figure 7 shows that the matrix multiplication performance increases almost linearly with the number of SPUs, especially with large matrix sizes. Using eight SPUs, the parallel version of matrix multiplication achieves 201 Gflops, very close to the theoretical maximum of 204.8 Gflops. With seven SPUs, we observed that the load balancing could cause nonlinear performance scaling for smaller matrix sizes such as 512 3 512. As matrix size increases, this problem disappears.
Assuming that matrix multiplication can achieve its peak SP FP capability, a Pentium ** 4 processor with SSE3 (streaming SIMD extensions) at 3.2 GHz can achieve 25.6 Gflops, while the Cell/B.E. processor can achieve 201 Gflops, greater by almost a factor of 8.
Optimization of LINPACK
LINPACK solves a dense system of linear equations Ax ¼ b, where A is a matrix of order N, and x and b are single-dimension arrays of size N. Matrix multiplication is a key part of LINPACK. The LINPACK implementation on the Cell/B.E. processor is based on the blockpartitioned algorithm for LU factorization [8] .
LINPACK single-precision FP performance
The initial implementation of SP LINPACK was a scalar version. When run on the cycle-accurate SPEsim simulator with a matrix size of 1,024 3 1,024 using partitioned blocks of 64 3 64, it achieved only 0.26 Gflops with a 3.2-GHz SPE.
With optimizations added to hide the memory latency by overlapping data transfers with computation using double buffers, as well as working to maximize dual issue with optimal instruction scheduling, the final optimized version of the program achieved 16.5 Gflops on a 1K 3 1K matrix, 22.0 Gflops on a 4K 3 4K matrix, and 23.5 Gflops on an 8K 3 8K matrix ( Table 3) . The efficiency, defined as the ratio of achieved performance to peak performance, increases significantly with the matrix size and optimization. Table 4 shows the performance of LINPACK when it is optimized to run on all eight SPEs. Note that the computational efficiency as a percentage of peak performance improved significantly with the larger matrix size. Eight SPUs running LINPACK at 3.2 GHz achieve 155.5 Gflops on a 4K 3 4K matrix. Table 5 compares the performance results of parallelized LINPACK from the SPEsim and the hardware. As the table shows, the SPEsim results for LINPACK are off by up to 12%, except for the 4K 3 4K case, because not all of the complexities of DMA and the memory system were modeled.
LINPACK double-precision FP performance
Although the SPU DP FP performance is not as high as the SP performance, it is still good. Each SPU is capable of executing two DP instructions every seven cycles. With fused multiply-add, an SPU can achieve a peak 1.83 Gflops at 3.2 GHz. With eight SPUs and fully pipelined DP FP support in the VMX of the PPE, the Cell/B.E. processor is capable of a peak 21.03-Gflops DP FP performance, compared with a peak SP FP performance of 230.4 Gflops.
Figure 7
Performance of parallelized matrix multiplication for matrices of different sizes. The initial implementation of DP LINPACK was a scalar version. When run on the SPEsim with a matrix size of 1,024 3 1,024, using partitioned blocks of 64 3 64, it achieved 0.27 Gflops on a 3.2-GHz SPE. With additional tuning, the final version achieved 1.55 Gflops on a 4K 3 4K matrix, or about 85% of peak performance, as shown in Table 6 .
The DP version of LINPACK on the Cell/B.E. processor was also parallelized to run on all eight SPUs. Table 7 summarizes the performance of LINPACK parallelized to eight SPUs, running on SPEsim. Table 8 compares the results measured on hardware with those measured on the SPEsim model. The modeling accuracy for DP LINPACK is greater than 97% in all cases. With a 1,024 3 1,024 matrix, the computational efficiency of parallelized LINPACK is greater than 64% when running on all eight SPUs. The best parallelized LINPACK DP FP result measured on hardware is 11.82 Gflops for a 4,096 3 4,096 matrix, with an efficiency of 81%. The LINPACK performance on IA-32 (Pentium 4) and IA-64 (Itanium ** ) machines [9] is summarized in Table 9 along with the SPU results.
The LINPACK implementation on the Cell/B.E. processor has the highest DP FP performance in the chart, exceeding the most recent IA-32 and IA-64 machines available today.
Optimization of MPEG-2 video decoding
The Cell/B.E. processor is targeted primarily for game applications [1] , some of which demand a high video processing capability; consequently, significant effort has been devoted to optimizing MPEG-2 video decoding. Figure 8 shows the major components of an MPEG-2 video decoder: a variable-length decoder (VLD), an inverse quantizer (IQ), an inverse discrete cosine transform (IDCT), the motion compensation (MC) section, and control logic. The VLD code in an MPEG-2 decoder has many branches and needs significant optimization to run well on the SPU. Our optimizations include algorithmic tuning, lookup-table modification, fast bit manipulation with SPU intrinsics, static branch prediction with the ''builtin_expect'' pragma of the GCC compiler, function inlining, 1 and global register assignment. Most of the optimizations also reduced the instruction count and eliminated or minimized branch penalties. A fast IDCT algorithm [10] was adopted to operate an 8 3 8 two-dimensional IDCT using 512 multiplyadds and 128 add or subtract instructions. This IDCT algorithm was regular and it was easy to keep the precision required by the MPEG-2 standard for 16-bit 3 16-bit multiplication. Most operations in IDCT were four-way SIMD for the inner product, except matrix transpose, which was eight-way SIMD.
The MC element was implemented primarily with eight-way SIMD and some 16-way SIMD. Function inlining and nested ''if else'' to ''switch case'' conversion were adopted to eliminate some branches and to improve code scheduling by enlarging basic block size.
MC required small pixel block transfers (e.g., 16 3 16 pixels) from system memory to the local store in order to construct the predicted blocks. If a video frame were stored in a raster scan manner, MC would require numerous small DMA transfers (e.g., 16 transfers of 16 bytes). However, in the Cell/B.E. processor, the DMA transfers are most efficient when performed on 128-byte naturally aligned boundaries. To increase efficiency, the data structure was rearranged to place each macroblock in a 384-byte contiguous area, consisting of a 16-3 16-pixel luminance block and two 8-3 8-pixel chrominance blocks (in the case of the 4:2:0 format). With this arrangement, a macroblock could be retrieved from the main memory to a local store with fewer 128-byte DMA transfers. Table 10 shows our results from the optimized decoder running on the cycle-accurate SPU simulator. As is evident, a single 3.2-GHz SPU can easily support any kind of real-time MPEG-2 video decoding.
On hardware, our optimized MPEG-2 decoder was able to decode 1,379 common intermediate format (CIF) frames per second (fps), about 10% lower than the simulation result. Adjusted for modeling error, each [11] . With linear scaling to 3.2 GHz, this frame rate increases to 354 fps. Although differences in the video stream data can change these numbers to some extent, given that the Cell/B.E. processor has eight SPUs, the Cell/B.E. processor should outperform a general-purpose processor with SIMD capability by a fair margin.
Optimization of triangle transform light
Transform light is a critical stage in vertex-based graphic pipelines. Because of its structured data, the transform-light implementation on the Cell/B.E. processor can benefit greatly from the SIMD support in the SPU. The optimizations here were focused primarily on loop unrolling, increasing the dual-issue rate, overlapping the DMAs with computation, and avoiding branches.
With many more registers than general-purpose processors, our transform light on the SPU benefited significantly from aggressive loop unrolling.
For comparative analysis, we implemented a besteffort transform light on a single-core PowerPC 970 (PPC970) system running at 2 GHz. The results in Table 11 show that an optimized transform-light compute kernel on one SPU performs approximately 43% better than on a PPC970/VMX system when scaled to 3.2 GHz. Given that the Cell/B.E. processor contains eight SPUs, the Cell/B.E. processor should outperform the PPC970/VMX by a wide margin.
Cryptography performance
Cryptography is a rapidly emerging technology that is increasingly found in many applications. The Cell/B.E. processor, with its SIMD pipelines and a large register file, can execute cryptography functions efficiently. Optimization of the code focused on aggressive loop unrolling and register-based table lookups to take advantage of the large register file of an SPU. Tuning was also done for dual issue, bit permutation, and byte shuffling. Table 12 summarizes the performance results of several key cryptography algorithms and hash functions on a single SPE and on a Pentium 4 processor with SSE.
Results demonstrate that the cryptography performance of a single SPE is typically better (up to 2.3 times at the same frequency) than that of a Pentium 4 processor with SSE.
Cell/B.E. processor performance summary Minor et al. 2 showed that their implementation of the terrain rendering engine (TRE) on the Cell/B.E. processor runs 50 times faster than on a PPC970 with the VMX running at 2 GHz. With the innovative microarchitectural features of the Cell/B.E.processor, we showed that a wide variety of algorithms on the Cell/B.E. processor achieve performance that is equal to or significantly better than a general-purpose processor. For applications that can take advantage of SPU SIMD pipelines and PPU threadlevel parallelism, the results will be similar. Table 13 summarizes the performance advantage of the Cell/B.E. processor (or its one SPU) over that of other processors.
Summary
With eight decoupled SPU SIMD engines, each with dedicated resources including DMA channels, the Cell/B.E. processor has eight times more SIMD capability (for up to 16-way data parallelism) than traditional processors with vector architecture extensions. The results presented in this paper demonstrate that the Cell/B.E. processor can perform particularly well in cases in which a general-purpose processor would normally become compute bound by a single SIMD application. As with other computer systems, optimization efforts devoted to tuning an application will have a direct impact on the performance that can be realized. Applications that fully use SIMD capabilities in the eight SPEs of the Cell/B.E. processor will provide performance unequaled by any other contemporary processor. With its performance density and efficiency, the Cell/B.E. processor can outperform competing leading-edge processors on a variety of workloads by approximately an order of magnitude, in some cases even more, when software is tuned and optimized for this architecture. 
