Abstract-The Sony/Toshiba/IBM (STI) CELL processor introduces pioneering solutions in processor architecture. At the same time, it presents new challenges for the development of numerical algorithms. One is the effective exploitation of the differential between the speeds of single- and double-precision arithmetic; the other is the efficient parallelization across the short-vector Single-Instruction, Multiple-Data (SIMD) cores. The first challenge is addressed by utilizing the well-known technique of iterative refinement for the solution of a dense symmetric positive definite system of linear equations, resulting in a mixed-precision algorithm, which delivers double-precision accuracy while performing the bulk of the work in single precision. The main contribution of this paper lies in addressing the second challenge by successful thread-level parallelization, exploiting fine-grained task granularity and a lightweight decentralized synchronization. The implementation of the computationally intensive sections gets within 90 percent of the peak floating-point performance, while the implementation of the memory-intensive sections reaches within 90 percent of the peak memory bandwidth. On a single CELL processor, the algorithm achieves more than 170 Gflops when solving a symmetric positive definite system of linear equations in single precision and more than 150 Gflops when delivering the result in double-precision accuracy.

MOTIVATION
In numerical computing, there is a fundamental performance advantage in using the single-precision floating-point data format over the double-precision data format, due to the more compact representation, thanks to which twice the number of single-precision data elements can be stored at each stage of the memory hierarchy. Short-vector Single-Instruction, Multiple-Data (SIMD) processing provides yet more potential for performance gains from using single-precision arithmetic over double precision. Since an entire vector is processed in a single operation, the computational throughput can be doubled when the width of the data elements is halved. Unfortunately, the accuracy of the solution is reduced as well.
Most of the processor architectures available today have at some point been augmented with short-vector SIMD extensions. The different architectures exhibit large differences in their capabilities: the vector size is either 64 bits or, more commonly, 128 bits; the register file size ranges from just a few to as many as 128 registers; and while some extensions only support integer types, others also operate on single-precision floating-point numbers, and yet others also process double-precision values.
Today, the Synergistic Processing Element (SPE) of the Sony/Toshiba/IBM (STI) CELL processor [1], [2], [3] can probably be considered the state of the art in short-vector SIMD processing. Possessing 128-bit-wide registers and a fully pipelined, fused multiply-add instruction, it is capable of completing eight single-precision floating-point operations each clock cycle (a four-way SIMD multiply-add), which, combined with a register file of 128 registers, delivers close-to-peak performance on many common workloads. At the same time, built with multimedia and embedded applications in mind, the current incarnation of the CELL architecture does not implement double-precision arithmetic on par with single-precision performance, which makes the processor a very attractive target for exploring mixed-precision approaches.
Another important phenomenon in recent years has been the gradual shift of focus in processor architecture from aggressive exploitation of instruction-level parallelism toward thread-level parallelism, resulting in the introduction of chips with multiple processing units commonly referred to as multicore processors. The new architectures deliver the much desired improvement in performance and, at the same time, challenge the scalability of existing algorithms and force programmers to seek more parallelism by going to much finer levels of problem granularity. In linear algebra, this enforces a departure from the model that relies on parallelism encapsulated at the level of the BLAS and a shift to more flexible methods of scheduling work.
RELATED WORK
Iterative refinement is a well-known method for improving the solution of a linear system of equations of the form Ax = b. Typically, a dense system of linear equations is solved by applying a factorization to the coefficient matrix, followed by a back solve. Due to round-off errors, the solution carries an error related to the condition number of the coefficient matrix. In order to improve the computed solution, an iterative refinement process can be applied, which produces a correction to the computed solution at each iteration. In principle, the algorithm can produce a solution correct to the working precision.
Iterative refinement is a fairly well-understood concept and was analyzed by Wilkinson [4], Moler [5], and Stewart [6]. Higham gives error bounds for both single- and double-precision iterative refinement algorithms, where the entire algorithm is implemented in the same precision (single or double, respectively) [7]. He also gives error bounds for single-precision arithmetic with refinement performed in double-precision arithmetic. An error analysis for the case described in this work, where the factorization is performed in single precision and the refinement in double precision, is given by Langou et al. [8].
The authors of this work have previously presented an initial implementation of the mixed-precision algorithm for the general nonsymmetric case, using LU factorization, on the CELL processor. Although respectable performance numbers were presented, both the factorization and the refinement steps relied on rather classic parallelization approaches, and only a somewhat general discussion of algorithmic and implementation details was given. This work extends the previous presentation by introducing a novel scheme for parallelizing the computational components of the algorithm and by describing in much more detail the implementation of both the computationally intensive and the memory-intensive operations.
Although the thread-level parallelization of the Cholesky factorization presented here is quite unusual, owing to the unique nature of the CELL processor, many of the concepts are not new. Seminal work in the area of parallel dense matrix factorizations was done by Agarwal and Gustavson [9], [10].
ALGORITHM
The standard approach to solving symmetric positive definite systems of linear equations is the Cholesky factorization. The Cholesky factorization of a real symmetric positive definite matrix A has the form A = LL^T, where L is a real lower triangular matrix with positive diagonal elements. The system is solved by first solving Ly = b (forward substitution) and then solving L^T x = y (backward substitution). In order to improve the accuracy of the computed solution, an iterative refinement process is applied, which produces a correction to the computed solution, x.
The mixed-precision iterative refinement algorithm using Cholesky factorization is outlined in Algorithm 1. It can be observed that all operations of O(n^3) complexity are handled in single precision, and all operations performed in double precision are of at most O(n^2) complexity. The coefficient matrix A is converted to single precision for the Cholesky factorization. At the same time, the original matrix in double precision is preserved for the residual calculation.
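In outline, the procedure is as follows (a sketch consistent with the step numbers referenced later in the text; the subscript 32 denotes single-precision storage, and all other quantities are held in double precision):

Algorithm 1. Mixed-precision iterative refinement for Ax = b using Cholesky factorization.
1: A_32, b_32 ← A, b (convert to single precision)
2: L_32, L_32^T ← SPOTRF(A_32) (factorize in single precision)
3: x_32^(1) ← back solve L_32 L_32^T x = b_32
4: x^(1) ← x_32^(1) (convert to double precision)
5: repeat (i = 1, 2, ...)
6:   r^(i) ← b − A x^(i) (residual in double precision)
7:   r_32^(i) ← r^(i)
8:   z_32^(i) ← back solve L_32 L_32^T z = r_32^(i)
9:   x^(i+1) ← x^(i) + z_32^(i) (correction in double precision)
10: until x^(i) is accurate enough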
The algorithm described above and shown in Algorithm 1 is available in the LAPACK 3.1 library and implemented by the routine DSGESV.
IMPLEMENTATION
Essential Hardware Features
An extensive hardware overview would be beyond the scope of this publication. Vast amounts of information are publicly available for both experienced programmers [11] and newcomers to the field [12] , [13] . It is assumed that the reader has some familiarity with the architecture. Here, the features that have the most influence on the implementation presented are mentioned.
The CELL is a multicore chip that includes nine different processing elements. One core, the POWER Processing Element (PPE), represents a standard processor design implementing the PowerPC instruction set. The remaining eight cores, the SPEs, are short-vector SIMD engines with large register files of 128 128-bit vector registers and 256 Kbytes of local memory, referred to as the local store (LS).
Although standard C code can be compiled for execution on the SPEs, the SPEs do not execute scalar code efficiently. For efficient execution, the code has to be vectorized in the SIMD sense, by using C language vector extensions (intrinsics) or by using assembly code. The system's main memory is accessible to the PPE through L1 and L2 caches and to the SPEs through the DMA engines associated with them. The SPEs can only execute code residing in the LS and can only operate on data in the LS. All data has to be transferred in and out of the LS via DMA transfers.
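As a minimal illustration of SIMDization with the C language vector extensions, the following sketch (with hypothetical names, not taken from the factorization code) uses the fused multiply-add intrinsic to process four single-precision elements per operation on data already residing in the LS:

    #include <spu_intrinsics.h>

    /* Sketch: y = alpha*x + y on vectors in the local store. Each vector float
       holds four single-precision elements, so one spu_madd call performs four
       multiply-add operations. */
    void vec_axpy(vector float *y, const vector float *x, float alpha, int nvec)
    {
        vector float va = spu_splats(alpha);    /* replicate alpha into all four slots */
        for (int i = 0; i < nvec; i++)
            y[i] = spu_madd(va, x[i], y[i]);    /* fused multiply-add per vector */
    }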
The theoretical computing power of a single SPE is 25.6 Gflops in single precision and roughly 1.8 Gflops in double precision. The floating-point arithmetic follows the IEEE 754 format, with double-precision operations complying numerically with the standard and single-precision operations providing only rounding toward zero. The theoretical communication speed for a single SPE is 25.6 Gbytes/s. The theoretical peak bandwidth of the main memory is 25.6 Gbytes/s as well (shared by the PPE and all the SPEs).
The size of the register file and the size of the LS dictate the size of the elementary operation subject to scheduling to the SPEs. The ratio of computing power to the memory bandwidth dictates the overall problem decomposition for parallel execution.
Factorization
A few varieties of Cholesky factorization are known, in particular the right-looking variant and the left-looking variant [14] . It has also been pointed out that those variants are borders of a continuous spectrum of possible execution paths [15] .
Generally, the left-looking factorization is preferred for several reasons. During the factorization, most time is spent calculating a matrix-matrix product. In the case of the right-looking factorization, this product involves a triangular matrix. In the case of the left-looking factorization, the product only involves rectangular matrices. It is generally more efficient to implement a matrix-matrix product for rectangular matrices, and it is easier to balance the load in parallel execution. The implementation presented here is derived from the left-looking formulation of the Cholesky factorization, which follows the implementation of the LAPACK routine SPOTRF.
Due to the limited size of the LS, all numerical operations have to be decomposed into elementary operations that fit in the LS. The simplicity of implementing the Cholesky factorization lies in the fact that it can be easily decomposed into tile operations: operations on fixed-size submatrices that take from one to three tiles as input and produce one tile as output. These elementary operations will further be referred to as tile kernels. Fig. 1 illustrates the steps of the left-looking Cholesky factorization and shows how each step is broken down into tile operations; a sequential sketch of the resulting loop nest follows.
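In sequential form, the decomposition can be summarized by the following loop nest (a sketch with hypothetical kernel names corresponding to the operations in Table 1; the parallel schedule actually used is described in the section on parallelization):

    typedef float tile_t[64][64];

    /* Hypothetical tile kernels; see Table 1 for the BLAS/LAPACK equivalents. */
    extern void ssyrk_tile(tile_t A, tile_t T);            /* T = T - A x A^T */
    extern void spotrf_tile(tile_t T);                     /* T = L x L^T     */
    extern void sgemm_tile(tile_t A, tile_t B, tile_t C);  /* C = C - B x A^T */
    extern void strsm_tile(tile_t T, tile_t B);            /* B = B x T^-T    */

    #define TILE(i, j) tiles[(i) * n_tiles + (j)]          /* lower triangle, i >= j */

    void cholesky_tiled(tile_t tiles[], int n_tiles)
    {
        for (int j = 0; j < n_tiles; j++) {
            for (int k = 0; k < j; k++)                    /* update the diagonal tile */
                ssyrk_tile(TILE(j, k), TILE(j, j));
            spotrf_tile(TILE(j, j));                       /* factorize the diagonal tile */
            for (int i = j + 1; i < n_tiles; i++) {
                for (int k = 0; k < j; k++)                /* update a tile below the diagonal */
                    sgemm_tile(TILE(j, k), TILE(i, k), TILE(i, j));
                strsm_tile(TILE(j, j), TILE(i, j));        /* solve the tile below the diagonal */
            }
        }
    }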
Computational Kernels
Implementation of the tile kernels assumes a fixed size of the tiles. Smaller tiles (finer granularity) have a positive effect on scheduling for parallel execution and facilitate better load balance and higher parallel efficiency. Bigger tiles provide better performance in sequential execution on a single SPE.
In the case of the CELL chip, the crossover point is rather simple to find for problems in dense linear algebra. From the standpoint of this work, the most important operation is matrix multiplication in single precision. It turns out that this operation can achieve within a few percent of the peak performance on a single SPE for matrices of size 64 × 64. The fact that the peak performance can be achieved for a tile of such a small size has to be attributed to the large size of the register file and the fast access to the LS, undisturbed by any intermediate levels of memory. Also, such a matrix occupies a 16-Kbyte block of memory, which is the maximum size of a single DMA transfer. Eight such matrices fit in half of the LS, providing enough flexibility for multibuffering and, at the same time, leaving enough room for the code. A discussion of tile-size considerations was also presented before by the authors of this publication [16]. Table 1 lists the Cholesky tile kernels for a tile size of 64 × 64 as BLAS and LAPACK calls.
It has already been pointed out that a few derivations of the Cholesky factorization are known, in particular the right-looking variant and the left-looking variant [14] . Dynamic scheduling of operations is another possibility. However, no matter which static variant is used or whether dynamic execution is used, the same set of tile kernels is needed. The change from one to another will only alter the order in which the tile operations execute.
All the kernels were developed using SIMD C language extensions. Extra effort was invested in optimizing the matrix multiplication (SGEMM) kernel, which performs the calculation C = C − B × A^T, since this operation dominates the execution time. All the tile kernels were written by consistently applying the idea of blocking with a block size of four. A short discussion of each case follows.
The SSYRK kernel applies a symmetric rank-k update to a tile. In other words, it performs the operation T = T − A × A^T, where only the lower triangular part of T is modified. The SSYRK kernel consists of two nested loops, where each inner iteration produces a 4 × 4 block of the output tile, and the body of the inner loop is completely unrolled. The #define and nested #define C language directives are used to create a single basic block, a block of straight-line code. Static offsets are used within the basic block, and pointer arithmetic is used to advance the pointers between iterations.
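The following fragment illustrates the idiom (illustrative only, with hypothetical names; the real kernel expands into a much larger basic block and includes the permutation/reduction step):

    #include <spu_intrinsics.h>

    /* Sketch of the #define-based unrolling idiom: the macros expand into one
       straight-line basic block with static offsets; pointer arithmetic (not
       shown) advances a, b, and c between iterations of the enclosing loop. */
    #define MADD_PAIR(k)                          \
        c0 = spu_madd(a[(k)],     b[(k)], c0);    \
        c1 = spu_madd(a[(k) + 1], b[(k)], c1);

    #define MADD_QUAD(base) MADD_PAIR(base) MADD_PAIR((base) + 2)

    void unrolled_update(const vector float *a, const vector float *b,
                         vector float *c)
    {
        vector float c0 = c[0], c1 = c[1];
        MADD_QUAD(0)    /* expands to four fused multiply-adds, no branches */
        MADD_QUAD(4)
        c[0] = c0; c[1] = c1;
    }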
The construction of the SSYRK tile kernel is presented in Algorithm 2. The unrolled code consists of four distinct segments. A computation segment (line 3) includes only multiply and add operations that calculate the product of two 4 × 64 blocks of tile A, and it produces 16 four-element vectors as output. A permutation/reduction segment (line 4) performs transpositions and reductions on these 16 vectors and delivers the four four-element vectors constituting the 4 × 4 block of the output tile. The two segments mentioned above handle the off-diagonal square blocks (lines 3 and 4), and two more segments handle the triangular diagonal blocks (lines 6 and 7). It is an elegant and compact, but suboptimal, design. The kernel could be further optimized by pipelining iterations (double buffering), similar to the SGEMM kernel described in the following paragraphs.
Algorithm 2. SSYRK tile kernel.
1: for j = 0 to 15 do
2:   for i = 0 to j − 1 do
3:     Compute block [j, i]
4:     Permute/reduce block [j, i]
5:   end for
6:   Compute block [j, j]
7:   Permute/reduce block [j, j]
8: end for
block is a 4 × 4 submatrix of tile T.
The SGEMM kernel performs the operation C = C − B × A^T, which is very similar to the operation performed by the SSYRK kernel. One difference is that it updates tile C with the product of two tiles, B and A^T, rather than tile A and its own transpose. The second difference is that the output is the entire tile and not just its lower triangular portion. Also, since the SGEMM kernel is performance critical, it is subject to more optimizations than the SSYRK kernel.
The main idea behind the additional optimization of the SGEMM kernel comes from the fact that the SPE is a dual-issue architecture, where arithmetic operations can execute in parallel with permutations of vector elements. Therefore, a pipelined implementation is possible, where the operations of the permutation/reduction segment from iteration k can be mixed with the operations of the computation segment from iteration k + 1. The two nested loops used for SSYRK are replaced with a single loop, where the 256 4 × 4 blocks of the output tile are produced in linear row-major order, which results in Algorithm 3.

The STRSM kernel computes a triangular solve with multiple right-hand sides, B = B × T^−T. The computation is conceptually easiest to SIMDize if the same step is applied at the same time to all right-hand sides. This can easily be achieved if the memory layout of tile B is such that each four-element vector contains elements of the same index from different right-hand sides. Since this is not the case here, each 4 × 4 block of the tile is transposed, in place, before and after the main loop implementing the solve. The operation introduces a minor overhead but allows for a very efficient implementation of the solve, one that achieves a good ratio of the peak with small and simple code.
Algorithm 4 presents the structure of the code implementing the triangular solve, which is an application of the lazy variant of the algorithm. The choice of the lazy algorithm over the aggressive algorithm is arbitrary. The code includes two segments of fully unrolled code, both of which perform the in-place transposition of the 4 × 4 blocks of tile B (before and after the main loop).

Algorithm 4. STRSM tile kernel.
1: Transpose the 4 × 4 blocks of tile B
2: for j = 0 to 15 do
3:   Apply blocks 0, ..., j − 1 to block j
4:   Solve block j
5: end for
6: Transpose the 4 × 4 blocks of tile B
block is a 64 × 4 submatrix of tile B.
The SPOTRF kernel computes the Cholesky factorization, T = L × L^T, of a 64 × 64 tile. This is the lazy variant of the algorithm, more commonly referred to as the left-looking factorization, where updates to the trailing submatrix do not immediately follow the panel factorization. Instead, updates are applied to a panel right before the factorization of that panel.
It could be expected that this kernel would be implemented using Level-2 BLAS operations, as this is the usual way of implementing panel factorizations. Such a solution would, however, lead to code that is difficult to SIMDize and would yield very poor performance. Instead, the implementation of this routine is simply an application of the idea of blocking, with a block size equal to the SIMD vector size of four. Algorithm 5 presents the structure of the SPOTRF tile kernel; in Algorithm 5, block is a 4 × 4 submatrix of tile T.

Table 2 compares the tile kernels in terms of source and object code size and performance. Although performance is the most important metric, code size is not without meaning, due to the limited size of the LS. Despite the fact that code can be replaced in the LS at runtime, it is desirable that the entire code implementing the single-precision factorization fit in the LS at the same time. Code motion during the factorization would both complicate the code and adversely affect performance.
Although the matrix multiplication achieves quite good performance (23 Gflops, which is 90 percent of the peak), there is no doubt that better performance could be achieved by using assembly code instead of C language SIMD extensions. Performance in excess of 25 Gflops has been reported for similar, although not identical, SGEMM kernels [17]. It is remarkable that this performance can be achieved for operations of such small granularity, which has to be attributed to the unique features of the CELL architecture, especially the register file and the memory organization.
It is worth noting that, although significant effort was required to optimize the SGEMM kernel (and yet more would be required to optimize it further), the other kernels required a rather small programming effort in a higher level language to deliver satisfactory performance (execution time shorter than that of the SGEMM kernel). This means that the Pareto principle, also known as the 80-20 rule (http://en.wikipedia.org/wiki/Pareto_principle), applies very well in this case: only a small portion of the code requires strenuous optimization for the application to deliver close-to-peak performance.
Parallelization
The presentation of the parallelization of the Cholesky factorization needs to be preceded by a discussion of memory bandwidth considerations.
The SGEMM kernel can potentially execute at a rate very close to 25.6 Gflops on a single SPE. In such a case, it performs the 2 × 64^3 = 524,288 operations of a tile update in 20.48 μs. The operation requires the transmission of three tiles from the main memory to the LS (tiles A, B, and C) and a transmission of one tile from the LS to the main memory (the updated tile C), consuming a bandwidth equal to

4 × 16 Kbytes / 20.48 μs = 3.2 Gbytes/s.

This means that eight SPEs performing the SGEMM operation at the same time would require a bandwidth of

8 × 3.2 Gbytes/s = 25.6 Gbytes/s,

which is the theoretical peak bandwidth of the main memory. It has been shown that arithmetic can execute almost at the theoretical peak rate on the SPE. At the same time, it would not be realistic to expect the theoretical peak bandwidth from the memory system. By the same token, data reuse has to be introduced into the algorithm to decrease the load on the main memory. A straightforward solution is to introduce 1D processing, where each SPE processes one row of tiles of the coefficient matrix at a time.
Please see Fig. 1 for the following discussion. In the SSYRK part, one SPE applies a row of tiles A to the diagonal tile T , followed by the SPOTRF operation (factorization) of the tile T . Tile T only needs to be read in at the beginning of this step and written back at the end. The only transfers taking place in between are reads of tiles A. Similarly, in the SGEMM part, one SPE applies a row of tiles A and a row of tiles B to tile C, followed by the STRSM operation (triangular solve) on tile C using the diagonal triangular tile T . Tile C only needs to be read in at the beginning of this step and written back at the end. Tile T only needs to be read in right before the triangular solve. The only transfers taking place in between are reads of tiles A and B. Such work partitioning approximately halves the load on the memory system. It may also be noted that 1D processing is a natural match for the left-looking factorization. In the right-looking factorization, the update to the trailing submatrix can easily be partitioned in two dimensions. However, in the case of the left-looking factorization, 2D partitioning would not be feasible due to the write dependency on the panel blocks (tiles T and C).
One-dimensional partitioning introduces a load-balancing problem. With work being distributed by block rows, in each step of the factorization, a number of processors are idle, equal to the number of block rows factorized in that step modulo the number of processors. Fig. 3 shows three consecutive steps of a factorization and the processors that are occupied and idle in each. Such behavior places a harsh upper bound on achievable performance.
It can be observed, however, that at each step of the factorization, a substantial amount of work from the upcoming steps can be scheduled to the otherwise idle processors. The only operations that cannot be scheduled at a given point in time are those that involve panels that have not been factorized yet. This situation is illustrated in Fig. 4. Of course, this kind of processing requires dependency tracking in two dimensions, but since all operations proceed at the granularity of tiles, this does not pose a problem.
In the implementation presented here, all SPEs follow a static schedule, presented in Fig. 4, with cyclic distribution of work from the steps of the factorization. A static schedule works well in this case because the performance of the SPEs is very deterministic (unaffected by nondeterministic phenomena like cache misses). This way, the potential bottleneck of a centralized scheduling mechanism is avoided.

Fig. 2 presents the execution chart (Gantt chart) of the factorization of a 1,024 × 1,024 matrix. The colors correspond to those in Fig. 1. The two shades of green distinguish the SGEMM operation in odd and even steps of the factorization. The yellow color marks the barrier operation, which corresponds to the load imbalance of the algorithm.
It can be observed that load imbalance is minimal (the yellow region), dependency stalls are minimal (the white gaps), and communication and computation overlapping is very good (the colored blocks represent purely computational blocks).
Synchronization
With the SPEs following a static schedule, synchronization is required such that an SPE does not proceed if data dependencies for an operation are not satisfied.
Several dependencies have to be tracked. The SSYRK and SGEMM operations cannot use as input tiles A and B that have not been factorized yet; the off-diagonal tiles A and B are factorized when the STRSM operation has completed on them. In turn, the STRSM operation cannot use as input a diagonal tile T that has not been factorized yet; a diagonal tile T is factorized when the SPOTRF operation has completed on it. Dependency tracking is implemented by means of a replicated progress table. The progress table is a 2D triangular array in which each entry represents a tile of the coefficient matrix and specifies whether the tile has been factorized, that is, whether an SPOTRF operation has completed for a diagonal tile or an STRSM operation for an off-diagonal tile.
By replicating the progress table, the potential bottleneck of a centralized progress-tracking mechanism is avoided. Each SPE can check dependencies by testing an entry in its local copy of the progress table. The SPE completing the factorization of a tile updates the progress tables of all other SPEs, which is done by a DMA transfer and introduces no overhead, due to the nonblocking nature of these transfers. Since the smallest amount of data subject to a DMA transfer is 1 byte, the progress table entries are 1 byte in size. These transfers consume an insignificant amount of the bandwidth of the Element Interconnect Bus (EIB), and their latency is irrelevant to the algorithm.
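A minimal sketch of the mechanism follows (hypothetical names; the actual table layout and DMA bookkeeping are more involved):

    #include <spu_mfcio.h>

    #define N_TILES 64       /* 4,096 / 64: maximum number of tiles per dimension */
    #define SYNC_TAG 31      /* hypothetical DMA tag reserved for table updates */

    /* Local copy of the replicated progress table; entries are 1 byte and are
       written asynchronously by other SPEs via DMA. */
    static volatile char progress[N_TILES][N_TILES];

    /* Busy-wait until tile (i, j) has been factorized. */
    static void wait_for_tile(int i, int j)
    {
        while (!progress[i][j])
            ;
    }

    /* Announce completion of tile (i, j): DMA one byte into every other SPE's
       copy of the table. remote_table[s] holds the effective address of the
       progress table in SPE s's local store. */
    static void publish_tile(int i, int j, uint64_t remote_table[], int nspes, int self)
    {
        static char done = 1;
        progress[i][j] = 1;                   /* update the local copy directly */
        for (int s = 0; s < nspes; s++)
            if (s != self)
                mfc_put(&done, remote_table[s] + (i * N_TILES + j), 1, SYNC_TAG, 0, 0);
    }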
Communication
The most important feature of communication is double buffering of data (overlapping computation and communication). With eight tile buffers available, each operation involved in the factorization can be double-buffered independently.
Thanks to this fact, double buffering is implemented not only between operations of the same type but also between different operations. In other words, data is always prefetched for the upcoming operation, no matter what operation it is. In the absence of dependency stalls, the SPEs never wait for data, which results in big portions of the execution chart without any gaps between computational blocks (Fig. 5) .
Tiles are never exchanged internally between LSs but always read from the main memory and written to the main memory. An attempt to do otherwise could tie up buffers in the LS and prevent the asynchronous operation of SPEs. At the same time, with the work partitioning implemented here, the memory system provides enough bandwidth to fully overlap communication and computation.
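The scheme can be sketched as follows (hypothetical names; in the real code, each operation type manages its own buffers, and write-backs are overlapped as well):

    #include <spu_mfcio.h>

    #define TILE_BYTES (64 * 64 * sizeof(float))   /* 16 Kbytes: one DMA transfer */

    extern void compute_on_tile(vector float *tile);   /* hypothetical kernel call */

    static vector float buf[2][TILE_BYTES / sizeof(vector float)];

    static void fetch_tile(int b, uint64_t ea)
    {
        mfc_get(buf[b], ea, TILE_BYTES, b, 0, 0);      /* DMA tag = buffer index */
    }

    static void wait_tile(int b)
    {
        mfc_write_tag_mask(1 << b);
        mfc_read_tag_status_all();                     /* block until this DMA completes */
    }

    /* Process n tiles, always prefetching the next one while computing. */
    void process_tiles(const uint64_t ea[], int n)
    {
        fetch_tile(0, ea[0]);
        for (int k = 0; k < n; k++) {
            if (k + 1 < n)
                fetch_tile((k + 1) & 1, ea[k + 1]);    /* prefetch into the other buffer */
            wait_tile(k & 1);
            compute_on_tile(buf[k & 1]);
        }
    }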
Reads of tiles involve dependency checks. When it comes to the prefetching of a tile, the dependency is tested, and a DMA transfer is initiated if the dependency is satisfied. The DMA completion is tested right before the processing of the tile. If the dependency is not satisfied in time for the prefetch, the prefetch is abandoned in order not to stall the execution. Instead, right before the processing of the tile, the SPE busy-waits for the dependency and then transfers the tile in a blocking way (initiates the transfer and immediately waits for its completion).

Performance

Fig. 6 shows the performance of the single-precision Cholesky factorization, calculated as the ratio of the number of floating-point operations, taken as N^3/3, where N is the size of the input matrix, to the execution time. Table 3 gives numerical performance values for selected matrix sizes in gigaflops and also as ratios relative to the peak of the processor of 204.8 Gflops and the peak of the SGEMM kernel of 8 × 23.01 = 184.08 Gflops.
The factorization achieves 95 percent of the peak of the SGEMM kernel, which means that overheads of data communication, synchronization, and load imbalance are minimal, and at this point, the only inefficiency comes from the suboptimal performance of the SGEMM kernel. Hopefully, in the future, the kernel will be fully optimized, perhaps using hand-coded assembly.
Refinement
The two most expensive operations of the refinement are the back solve (Algorithm 1, steps 3 and 8) and residual calculation (Algorithm 1, step 6).
The back solve consists of two triangular solves involving the entire coefficient matrix and a single right-hand side (BLAS STRSV operation). The residual calculation is a double-precision matrix-vector product using a symmetric matrix (BLAS DSYMV operation).
Both operations are BLAS Level-2 operations and, on most processors, would be strictly memory bound. The STRSV operation actually is a perfect example of a strictly memory-bound operation on the CELL processor. However, the DSYMV operation is on the borderline of being memory bound versus computationally bound due to the very high speed of the memory system versus the relatively low performance of the double-precision arithmetic.
Triangular Solve
The triangular solve is a perfect example of a memory-bound workload, where the memory access rate sets the upper limit on achievable performance. The STRSV operation performs approximately two floating-point operations per data element of 4 bytes, which means that the peak memory bandwidth of 25.6 Gbytes/s allows for at most

25.6 Gbytes/s × (2 ops/float) / (4 bytes/float) = 12.8 Gflop/s,
which is only 1/16, or 6.25 percent, of the single-precision floating-point peak of the processor. Owing to this fact, the goal when implementing memory-bound operations is to get close to the peak memory bandwidth, unlike for computationally bound operations, where the goal is to get close to the floating-point peak. This task should be readily achievable, given that a single SPE possesses as much bandwidth as the entire main memory.
Practice shows, however, that a single SPE is not capable of generating enough traffic to fully exploit the bandwidth, and a few SPEs solving the problem in parallel should be used. Efficient parallel implementation of the STRSV operation has to pursue two goals: continuous generation of traffic in order to saturate the memory system and aggressive pursuit of the algorithmic critical path in order to avoid dependency stalls. A related question is the desired level of parallelism-the optimal number of processing elements used. Since the triangular solve is rich in dependencies, increasing the number of SPEs increases the number of execution stalls caused by interprocessor dependencies. Obviously, there is a crossover point, a sweet spot, for the number of SPEs used for the operation.
Like the other routines in the code, the STRSV operation processes the coefficient matrix by 64 × 64 tiles. A triangular solve is performed on the diagonal tiles, and a matrix-vector product (the SGEMV equivalent) is performed on the off-diagonal tiles. Processing of the diagonal tiles constitutes the critical path in the algorithm. One SPE is solely devoted to the processing of the diagonal tiles, while the goal of the others is to saturate the memory system with the processing of the off-diagonal tiles. The number of SPEs used to process the off-diagonal tiles is a function of a few parameters, one of which is the efficiency of the computational kernels used. In this case, four SPEs turned out to deliver the best results, with one SPE pursuing the critical path and three others fulfilling the task of memory saturation. Fig. 7 presents the distribution of work in the triangular solve routines.
The solve is done in place. The unknown/solution vector is read in its entirety by each SPE into its LS at the beginning and returned to the main memory at the end. As the computation proceeds, updated pieces of the vector are exchanged internally by means of direct LS-to-LS communication. At the end, SPE 0 possesses the full solution vector and writes it back to the main memory. Synchronization is implemented analogously to the synchronization in the factorization and is based on the replicated triangular progress table (the same data structure is reused).

Fig. 8 shows the performance, in terms of gigabytes per second, of the two triangular solve routines required in the solution/refinement step of the algorithm. The two routines perform slightly differently due to the different performance of their computational kernels. The figure shows clearly that the goal of saturating the memory system is achieved quite well. A performance as high as 23.77 Gbytes/s is obtained, which is 93 percent of the peak.
Matrix-Vector Product
For most hardware platforms, the matrix-vector product would be a memory-bound operation, just like the triangular solve. On the CELL processor, however, due to the relative slowness of the double-precision arithmetic, the operation is on the border between memory bound and computationally bound. Even with the use of all eight SPEs, saturating the memory is harder than in the case of the STRSV routine.
The DSYMV routine also operates on tiles. Here, however, the double-precision representation of the coefficient matrix is used, with a tile size of 32 × 32, such that an entire tile can still be transferred with a single DMA request. The input vector is read in only once, at the beginning, in its entirety, and the output vector is written back after the computation is completed. Since the input matrix is symmetric, only the lower portion is accessed, and implicit transposition is used to reflect the effect of the upper portion: each tile is read in only once but applied to the output vector twice (with and without transposition).
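The per-SPE loop can be sketched as follows (hypothetical names; the DMA transfers and the final parallel reduction are omitted):

    typedef struct { int i, j; } tile_idx;

    /* Hypothetical kernels: y += A*x and y += A^T*x on one 32 x 32 tile. */
    extern void dgemv_tile_n(const double *A, const double *x, double *y);
    extern void dgemv_tile_t(const double *A, const double *x, double *y);

    /* Apply the tiles assigned to this SPE. Tile (i, j), with i >= j, belongs
       to the lower triangle; y_part is this SPE's private copy of the output
       vector, combined with the other copies in a parallel reduction afterward. */
    void dsymv_local(const tile_idx *tiles, const double *tile_data[],
                     int n_assigned, const double *x, double *y_part)
    {
        for (int t = 0; t < n_assigned; t++) {
            int i = tiles[t].i, j = tiles[t].j;
            dgemv_tile_n(tile_data[t], &x[32 * j], &y_part[32 * i]);     /* y_i += A_ij x_j */
            if (i != j)   /* read once, apply twice: implicit transposed upper part */
                dgemv_tile_t(tile_data[t], &x[32 * i], &y_part[32 * j]); /* y_j += A_ij^T x_i */
        }
    }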
Since load balance is a very important aspect of the implementation, work is split among the SPEs very evenly by applying the partitioning shown in Fig. 9. Such work distribution causes multiple write dependencies on the output vector, and in order to let each SPE proceed without stalls, the output vector is replicated on each SPE, and the multiplication is followed by a reduction step. The reduction is also performed in parallel by all SPEs and introduces a very small overhead, compensated by the benefits of very good load balance.

Fig. 10 shows the performance, in terms of gigabytes per second, of the double-precision symmetric matrix-vector product routine. A performance of 20.93 Gbytes/s is achieved, which is 82 percent of the peak. Although the DSYMV routine represents a Level-2 BLAS operation and is parallelized among all the SPEs, it is still computationally bound. Perhaps its computational components could be further optimized. Nevertheless, at this point, the delivered level of performance is considered satisfactory.
LIMITATIONS
The implementation presented here should be considered a proof-of-concept prototype with the purpose of establishing an upper bound on the performance achievable for mixed-precision dense linear algebra algorithms on the CELL processor. As such, it has a number of limitations. Only systems whose size is a multiple of 64 are accepted, which means that the cost of solving a system of size 65 × 65 equals the cost of solving a system of size 128 × 128. Also, only systems with a single right-hand side are supported.
There are no tests for overflow during the conversion from double to single precision. There is no test for the positive definite property during the single-precision factorization step, so it is up to the user to guarantee this property. The maximum size of the coefficient matrix is set to 4,096, which makes it possible to fit the progress table in each LS. This also makes it possible to fit the entire unknown/solution vector in each LS, which facilitates internal LS-to-LS communication and is very beneficial for performance. The current implementation is wasteful in its use of the main memory. The entire coefficient matrix is stored explicitly, without taking advantage of its symmetry, in both single-precision and double-precision representation, an issue that can be resolved by using specialized storage formats [18].

RESULTS AND DISCUSSION

Fig. 11 compares the performance of the single-precision factorization (SPOTRF), the solution of the system in single precision (SPOSV), and the solution of the system in double precision using factorization in single precision and iterative refinement to double precision (DSPOSV). These results were obtained on an IBM CELL blade using one of the two available CELL processors. Huge memory pages (16 Mbytes) were used for improved performance [16]. The performance is calculated as the ratio of the number of floating-point operations, set in all cases to N^3/3, to the execution time. In all cases, well-conditioned input matrices were used, resulting in two steps of refinement delivering accuracy equal to or higher than that delivered by the purely double-precision algorithm.

Fig. 11. Performance of the mixed-precision algorithm versus the single-precision algorithm on an IBM CELL blade system (using one CELL processor).
At the maximum size of 4,096, the factorization achieves 175 Gflops, and the system solution runs at the relative speed of 171 Gflops. At the same time, the solution in double precision using the refinement technique delivers a relative performance of 156 Gflops, an overhead of less than 9 percent compared to the solution of the system in single precision. It can also be pointed out that, by using the mixed-precision algorithm, double-precision results are delivered at a speed more than 10 times greater than the double-precision peak of the processor.

Fig. 12 shows results obtained on the Sony PlayStation 3, using the six available SPEs and the 256 Mbytes of memory (of which only approximately 200 Mbytes are available to the application), allocated using huge pages (16 Mbytes). For the maximum problem size of 2,048, a performance of 127 Gflops was achieved for the factorization, 122 Gflops for the single-precision solution, and 104 Gflops for the double-precision solution. In this case, the double-precision solution comes at the cost of roughly 15 percent overhead compared to the single-precision solution.
CONCLUSIONS
The CELL Broadband Engine has a very high potential for dense linear algebra workloads, offering a very high peak floating-point performance and a capability to deliver performance close to the peak even for quite small problems. The same applies to the memory system of the CELL processor, which allows for data transfer rates very close to the peak bandwidth for memory-intensive workloads.
Although the double-precision performance of the CELL processor is much lower than its single-precision performance, mixed-precision algorithms permit exploiting the single-precision speed while achieving full double-precision accuracy.
FUTURE PLANS
One of the main considerations for the future is application of the pipelined processing techniques to factorizations where the panel does not easily split into independent operations, like the factorizations where pivoting is used.
Another important question is that of replacing the static scheduling of operations with dynamic scheduling by an automated system, and the impact of such mechanisms on programming ease and productivity.
CODE
The code is publicly available at http://icl.cs.utk.edu/iter-ref/ → CELL BE Code.
