Computing the actions of Wilson-Dirac operator contributes most of the CPU time for the grand challenge problem of simulating Lattice Quantum Chromodynamics (Lattice QCD). This routine exhibits many challenges in implementation on most computational environments because of the multiple patterns of accessing the same data, making it difficult to align the data efficiently at compile time. Additionally, the low computation to memory access ratio makes this computation bounded by the memory bandwidth and the memory latency.
INTRODUCTION
Efficient implementation for computing the actions of WilsonDirac operator is of critical importance for the simulation of Lattice Quantum Chromodynamics (Lattice QCD) [16, 4, 6] . Simulating Lattice QCD aims at understanding the strong interactions that bind quarks and gluons together to form hadrons. In Lattice QCD, a four-dimensional space-time continuum is simulated, where the quantum field (composed of quarks) resides at the lattice sites and the gauge field (composed of gluons) resides at the links between these sites. Lattice spacing should be small to obtain reliable results which requires enormous amount of computations. Figure 1 shows the discretization of the four-dimensional space-time continuum of the Lattice QCD.
The use of accelerator for scientific computing has always been targeted by many researchers. Among recently attractive technologies are graphic processing units (GPUs) [12, 2, 3, 9, 13] and the Cell Broadband Engine [17, 1, 4, 10] . The Lattice QCD community, as well as other high performance computing communities, started exploring the possibility of using these accelerators to build cost effective supercomputers for scientific computations.
Using GPUs, for instance, has been investigated [7, 8] for Lattice QCD, especially with the advent of general purpose programming environment such as Cuda [11] for NVidia graphic cards. The main challenge in these environments is the over-protection that most manufacturers adopt to hide their proprietary hardware designs.
Traditionally, the implementation of this routine on most architectures exploits a small fraction of the floating-point computational power because the computation is bandwidth limited. For instance, an SSE2 implementation on Intel Xeon is in the range of 0.8 GFlops per core of double precision computations [8] . The performance on BlueGene/P is measured at 0.6 to 1.1 GFlops per core [6] depending on the lattice size.
Using Nvidia G80 GPUs to implement this routine delivers 6.2 GFlops [8] of single precision computations. One of the main limitations on the use of this architecture is the limited bandwidth between the main memory and the GPU.
The use of the Cell broadband engine is also under consideration by many Lattice QCD groups. An analytical model to predict the performance limits of simulating Lattice QCD is developed [4] . This model predicted that the performance will not exceed 34% of the theoretical peak performance. Some simplified kernels were also tested on the Cell BE [10] showing a peak performance estimate of 20 GFlops of single precision computation. These studies confirmed the fact that the computation of Lattice QCD is bandwidth limited and tried to predict the performance of a real implementation.
In this study, we introduce an implementation of the main kernel routine for simulating Lattice QCD. In this implementation, we tried to provide answers to two main questions; the first question is how to SIMDize the computation in an efficient way; the second question is how to distribute the lattice data and how to handle memory efficiently.
For efficient SIMDization, we introduce runtime data fusion, a technique that aligns data at runtime that cannot be aligned optimally at compile time. Furthermore, while allocating lattice data on the main memory, we introduce analysis for data frames to create optimized DMA requests that removes redundancy of data transfers as well as improves contiguity of memory accesses. Our implementation achieves 31.2 GFlops in single precision computation and 8.8 GFlops in double precision computation.
The rest of this paper is organized as follows: Section 2 introduces the Cell broadband engine architecture and the used software development environment. Section 3 introduces the Wilson-Dirac computation kernel. The SIMDization problem is tackled in Section 4. Section 5 details the proposed memory layout and the analysis leading to optimizing the memory transfers. We comment on the utilization of the Cell BE on Section 6. Section 7 concludes this paper.
CELL BROADBAND ENGINE AND ITS SOFTWARE DEVELOPMENT ENVIRONMENT
In this study, we target developing an efficient implementation of the main kernel routine of simulating Lattice QCD on the Cell Broadband Engine (Cell BE). We used IBM Cell BE SDK 3.0 [5] . We explored our implementation on the Cell BE as well as the future generation Cell with enhanced double precision (EDP).
1
We used the simulator provided in the SDK to analyze the performance of our implementation and we verified the performance, except for the Cell EDP, on an IBM BladeCenter QS20 system with dual-Cell BE processors (running at 3.2 GHz). Figure 2 outlines the basic components of the Cell BE processor. The Cell BE chip is composed of multiple cores; a PowerPC compatible master processor, with dual SMT (PPE) and eight synergistic processing elements (SPEs).
The execution unit on the PPE can handle control-flow intensive codes while the execution unit on the SPE is optimized to handle SIMD computations. Each SPE has a 128 register file; each register is 16-byte wide. The SPE has a small (256 KB) special memory called local store.
The main data is usually stored in the external memory and data are transferred back and forth with the main memory through special DMA APIs. Each SPE has two pipelines, one is mainly specialized in doing integer and floating point operations (even pipeline) and the other is mainly specialized in doing shuffling, branching and load/store operations (odd pipeline).
WILSON-DIRAC OPERATOR
In this study, we ported the computation of the actions of Wilson-Dirac operator on a spinor field (representing quarks) based on the code of the ETMC collaboration, see for instance [15, 14] .
Computing the actions of Wilson-Dirac operator is the most time consuming operation in simulating Lattice QCD. Equation 1 details the computation of the actions of WilsonDirac operator. This computation involves a sum over spinors (ψi) multiplied by gauge links (Ui,µ) through the spin projector (I ± γµ). unity matrix, and κµ represents the hopping term that controls the convergence to a solution. U † represents the conjugate transpose of a complex matrix U .
The representation of each gauge link is a special unitary SU(3) matrix (3 × 3 complex variables). Each spinor is represented by four SU(3) vectors composed of three complex variables. The routine implementing this computation is called Hopping Matrix.
Parallelization of the Hopping Matrix routine involves dividing the lattice into two subfields odd and even, as shown in Figure 1 . Each spinor of the odd subfield is surrounded by spinors of the even subfield and vice versa. The computation sweeps on spinors from one subfield making the other subfield temporarily constant, thus breaking data dependency. The odd and even subfields are stored in separate grids to improve spatial contiguity of memory accesses. The computation involves gathering the data of the surrounding spinors and gauge field matrices to compute an output spinor. The computation order follows the storage order of the output spinors. This ordering of memory accesses is one of the most bandwidth friendly orderings that will prove critical to performance in Section 5.
Even though Equation 1 shows regular computation across all sites of the lattice, the computation usually faces the chal- lenge of the low ratio of floating-point operations to memory references. This makes this computation memory bandwidth and latency bounded.
In Section 4, we discuss the problem of efficiently SIMDizing this code, while the data alignment and communication are investigated in Section 5.
SIMDIZING WILSON-DIRAC COMPUTATIONS ON THE CELL BROADBAND ENGINE
The main problem that prevents SIMDizing the computation of Wilson-Dirac operator efficiently is the multiple patterns of accessing the same data, due to the spin projector in Equation 1, making no single representation optimal at runtime. Each gauge link SU(3) matrix is accessed twice (positive and negative directions). The computation may involve the original matrix or the conjugate transpose of the matrix. A gauge link matrix is usually stored in only one representation. The problem is exacerbated for spinors because each spinor is accessed in eight different contexts depending on the space direction. Each access involves different spinor vectors and operations alternating between vector addition, subtraction, conjugate addition, and conjugate subtraction.
Aligning data such that all these computations are performed optimally at the same time is not possible. Different earlier approaches for SIMDizing this code align the data on one layout and then use shuffle operations to change alignment of the data at runtime to perform the needed computations, for instance the SSE2 implementation of the HMC method by Urbach et al. [15] .
Another problem is that data are represented by 3 × 3 complex matrices and four 3 × 1 complex vectors, which do not map perfectly to power of 2 data alignment. For the Cell BE processor data should be aligned to the 16-byte boundaries to be efficiently accessed. DMAs are also better aligned to 128-byte boundaries.
Efficient SIMDization of the code requires alignment of the data in a way that reduces the dependency between instructions, reduces the number of shuffle instructions, and allows efficient instructions such as multiply-add to be executed.
In this work, we introduce runtime data fusion as a solution for the above problems of data alignment. We show that the performance can be greatly improved using this technique.
Runtime data fusion
Two conventional representations of complex structures are commonly used; the first combines the real and the imaginary parts into one structure; the second separates the real and the imaginary parts into two separate arrays. Figure 3 shows these data layouts for storing a 3 × 1 complex vector and 3 × 3 complex matrix for double precision variables aligned to 16-byte words.
In Table 1 , we list the number of instructions that are needed to do a matrix-vector multiplication. We consider the average instruction count for doing vector×matrix multiply and vector×transposed-conjugate matrix multiply. For the first representation, the real and the imaginary parts face different treatments, thus requiring shuffles. The second representation separates the real and the imaginary parts into two separate arrays, which involves additional shuffles for transposing the matrix.
The computational requirements based on these alignments favor separating the real and the imaginary parts. The number of floating points is reduced because of the possibility to use multiply-add and multiply-subtract instructions. These instructions cannot be used with the first layout because the real and the imaginary parts share the same 16-byte word.
Even with the larger number of instructions, the first layout is favored by the perfect alignment within the boundary of a memory word. The second layout requires either to use padding (25% of the total space for spinors and 10% for the gauge field links) or the data will not be perfectly aligned. Increasing the size of the data because of alignment can severely reduce the performance of this application because it is bandwidth limited as will be detailed in Section 5. Not aligning data to 16-byte boundaries will severely penalize loading and storing data and will nullify the computation benefit achieved by reducing the instruction count. We adopted the first alignment of complex data as the base for comparison, especially that the original implementation (optimized for SIMDization on Intel SSE2) adopted it throughout the whole code for simulating Lattice QCD. Considering the fused version in Figure 3 , the number of floating-point instructions can be reduced significantly and no shuffling is needed except at the fusion/scatter stages. The fusion introduced in Figure 3 .c shows the two-way matrix fusion for complex variables of double precision. The fused matrix removes the need for shuffling in case of conjugate access to the array elements because the real and the imaginary parts are aligned to separate words. Transposing or conjugating the matrix does not involve additional shuffles.
This fused alignment is unfortunately not possible at compile time, especially for spinors because it requires having a unique order of accessing spinor at compile time. In other words, given a spinor i we need to determine a unique spinor that will always precede it in the computation and a unique spinor that will always follow it. The surrounding spinors in every access context can be different and it will be a very large space overhead to keep multiple coherent copies.
Because of the performance benefit associated with data fusion and difficulty of doing it statically, we consider the following proposal for runtime data fusion:
1. Data are fused at runtime for a number of structures depending on the number of variables per memory word, which usually involves some startup shuffle operations.
2. An optimized kernel of computation is written assuming fused data. Fused data are kept alive in registers as long as they are needed.
3. The final results for the optimized kernel (the output spinors) are then scattered back before storing them to the memory.
The steps involved in code transformations to support runtime data fusion are shown in Figure 4 . Our technique involves fusing unrolled code to align data that can not be aligned well statically. The fusion process involves grouping data that will be accessed with the same pattern of access on SPU word boundary (i.e., aligning them to a 16-byte boundary). For single precision computations, 4-byte floating points, spinors are computed in groups of 4 output spinors. Consequently, the input gauge links and the input spinors are combined into groups of four (4-way fusion). For double precision, optimal alignment requires fusing the computation for two output spinors (2-way fusion). Runtime fusion adopts the same data structure used for the base; no padding is needed. The fused data only exist during computation, living in registers.
The main feature of the Cell SPE that allows this technique is the large register file. Merging data structures in the beginning of computation incurs minimal overhead if all fused data are kept in registers as long as they are needed for computations. Runtime fusion can prove difficult for other processors with SIMD instruction set, that have small register count, for instance Intel SSE. We need to keep not only input fused data but also intermediate results in registers, especially that most data are not accessed frequently. During the computation of a group of two spinors in double precision, almost 6 KB of memory are accessed while the register file can hold only 2 KB. Knowing that some additional registers are needed to hold shuffle patterns, intermediate results, and other bookkeeping operations, careful register lifeness analysis of registers is needed to minimize the possibility of spilling the register file to the memory. We did this analysis on a basic block size ranging between 2-2.4 Kilo instructions. In our implementation, we managed to use almost 110 of the SPE 128-bit registers without the need to spilling fused data or intermediate results to the memory. Figure 5 shows the dynamic instruction decomposition for four implementations of the base complex alignment and the fused versions. We have two computation templates in terms of the shuffles needed, one for the double precision and the 2-way single precision, and the other kernel is for the 2-way double precision and the 4-way single precision. Perfect fusion kernel, 4-way single and 2-way double, provides better chance for reducing the shuffles of data. The shuffling is reduced to less than 37% of the original shuffle operations for the single precision computations when we use the 4-way fusion and less than 22% for the double precision computations. Not using fusion for single precision, not presented, will incur additional overheads during transposing matrixes and would provide a much worse performance. Figure 5 shows also reductions in floating point operations, in addition to the reduction in shuffle instructions, because runtime fusion of data allows using multiply-add or multiply-subtract more frequently as clarified earlier in Table 1 .
The impact on the average execution cycles per spinor computation is given in Table 2 . For these performance cycles, we aligned all data in the local store of the Cell SPE. The execution cycles are reduced significantly for the versions with fusion compared with the versions with less or no fusion. For single precision the reduction is 35% for the 4-way fusion compared with the 2-way fusion. For double precision, the reduction is 40% on the current generation Cell BE and is predicted to be 53% on the future generation Cell EDP.
In Table 2 , we define the efficiency of the memory instructions as the percentage of the minimum load/store instructions, needed by the computations to the actual load/store instructions executed during computations. As shown in the table, the memory instructions efficiency is very high in our implementation, ranging from 82.7% to 99%. Careful register lifeness analysis is needed to achieve this but it is also facilitated by the large register file available on the Cell BE that makes it possible to hold intermediate computation of the spinor. The efficiency is generally above 97% except for single precision with the 4-way fusion which has additional memory operations associated with the fuse and scatter phases. Still the overall performance with the 4-way fusion is much better than the 2-way fusion. Table 2 also shows the performance in GFlops for these implementations and the bandwidths needed to access the memory. Apparently, if the data are requested from the main memory system outside the Cell BE chip, then the memory subsystem will not be able to afford these bandwidths.
2
Only double precision on the Cell BE requires moderate bandwidth because the execution of double precision computation is severely penalized during the issue stage of instruction execution. This table shows that the performance, in general, will be bounded by bandwidth; which justifies our choice of avoiding implementations based on data alignment that requires large paddings.
Efficient handling of memory defines the limits of the achievable performance on the Cell architectures for Lattice QCD. This topic will be explored in details in the next section. Figure 6 shows the decomposition of the execution cycles at runtime. For double precision on Cell BE, the performance is dominated by the issue stall. The performance improvement for the 2-way fusion is attributed to the reduc- tion in the number of issued FP due to using multiply-add and multiply-subtract as one instruction instead of two separate instructions. For single precision on the Cell BE and double precision on the Cell EDP, the shuffle operations are reduced (fewer cycles in the critical path of the execution time is needed by the odd pipeline and fewer stall cycles on shuffles). Additionally, some nops are removed from the critical path of execution on the Cell EDP with double precision computation.
Using runtime data fusion, the un-overlapped odd pipeline cycles (implying stall of the even pipeline, responsible for FP operations) is reduced by 20%, 63%, and 66% for single precision on the Cell BE, double precision on the Cell BE, and double precision on the Cell EDP, respectively.
LATTICE QCD MEMORY ALIGNMENT
Traditionally, the computation of Wilson-Dirac operator is parallelized by aligning part of the lattice in a computing element's memory and the results of computation from this computing element are communicated to the neighboring computing elements. Assigning part of the lattice per local store has the advantage of reducing the DMA requests with the memory. Applying the same model on the Cell BE is challenged by the following:
• The local store associated with the SPE is very small in size. The number of output spinors that can be computed in the local store is no more than 64 double precision spinors or 128 single precision spinors. This leads to handling trivially small sublattice size.
• The computation-to-communication ratio of a small lattice is severely small. The results on each local store will need to be communicated not only to the other SPEs sharing the chip but also to the SPEs on other Cell chips. The need for simulating lattice of millions of spinors leads to the need for thousands of Cell processors making the slow communication a dominating factor for performance.
• The synchronization between the SPEs will be very frequent. In Table 2 , we show the number of cycles needed to finish one frame of computation for 64 single precision spinors or one frame of 32 double precision spinors. Computing a frame requires cycles in the order of tens of thousands. At 3.2 GHz, these cycles leaves trivially small amount of inter-communication time and cannot scale well on a parallel machine. In our experiments, we noticed that the variation in execution time due to the wait for DMA is very large which makes any frequent synchronization on this architecture ineffective. Synchronization causes contention on resources because all SPEs will be acquiring memory, doing computation or waiting at a synchronization point.
The conventional approach to program the Cell BE is to store the data on the main memory, outside the Cell chip, then to bring frames of data for processing. Each SPE takes responsibility of doing the computation for part of the dataset. To compute one spinor, the data communicated with the external memory is 1504 bytes (assuming 16-byte alignments) for single precision computations and 2880 bytes for double precision. This leads to bytes/FP ratio equals to 0.935 and 1.79 for single precision and double precision computations, respectively. Considering solely the bandwidth restriction of the Cell memory system at 25.6 GB/s, the upper limit on performance will be 27.4 GFlops for single precision computation and 14.3 GFlops for double precision. Putting data in the main memory solves the limited storage size of the SPE, but direct application of this approach will assign SPE computing threads the job of requesting data to operate upon. The data need to be aligned in contiguous memory regions to facilitate fetching them.
As explained earlier, each input spinor generally appears in the computation of eight output spinors. The context of access (relative access with other spinors) is not the same for each of these spinor access scenarios. Because of coherence we cannot have multiple copies easily for spinors; instead we need to compute indices of spinor locations before fetching them. An implementation of this approach on the Cell SPE will face the following obstacles:
• It is extremely inefficient to do address calculation in the SPE for spinors because of the need for control flow and integer operations for many spinor sites. These computations cannot be easily SIMDized.
• It is not possible to compute indices of spinors and store them in the local store because of the limited size of the local store. Alternatively, storing them on the main memory will add an additional step of fetching indices from the main memory in the critical path of execution, in addition to stressing the memory scarce bandwidth.
• Computing the location of each spinor individually then requesting a group of them will be associated with DMA fragmentation which can severely impact the performance and reduces the effective bandwidth observed during execution. Because we know that storing spinors in the main memory is the solution suitable traditionally for the Cell BE and since we know that there is redundancy on accessing the data (especially for spinors), we analyzed the data available within frames targeting the following objectives:
• Reducing the stress on the scarce bandwidth.
• Improving contiguity of DMA requests (having fewer DMAs).
The next section shows that frame analysis can lead to achieve these goals at least in part.
Contiguity analysis of the data space
The data accessed during spinors computation belongs mostly to the gauge field and the spinor field. Accessing each of these data structures has different attributes, as follows:
• For the gauge field 1. Each gauge link is surrounded by two spinors.
2. The gauge field is not updated during the computation of the Wilson-Dirac operator. Consequently, each gauge link can have multiple copies in separate locations because no coherence mechanism is needed between these copies.
3. The gauge links can be reordered arbitrary to improve the performance. Because two spinors are surrounding each link then preserving contiguity within one spinors subfield will require at most two copies for each gauge link.
4. The spinors surrounding a gauge link always belong to two separate spinor subfields (one in the odd subfield and the other in the even subfield). One of these spinors is updated while the other is considered constant (belonging to the constant subfield). During a sweep (even-to-odd or oddto-even) each gauge field link appears only once.
Based on the above, the best way to access the gauge field is to replicate the gauge field matrices based on the spatial locality for each sweep of computation. This alignment can be done at compile time. Two gauge fields can be created, one is ordered for contiguity during the odd-even sweep of computation and a redundant copy is ordered for the evenodd sweep. In one sweep, no redundancy of data exists and data are organized to guarantee contiguity of access.
• For the spinor field Multiple copies of the same spinor will require coherent update for all copies which will add load to the potentially overloaded memory system. For spinors, we considered analyzing the frames of spinors accessed during computation for contiguity and redundancy. Figure 7 shows part of the indices accessed in one frame of spinors. Computing an output spinor requires accessing one row of the input spinors. Looking at the indices of each row (needed to compute one spinor) we find no spatial locality. Looking across neighboring output spinors, we observe some contiguity for the input spinors streamed from the same directions of the space, for instance the columns for +t, -t, +x, and -x.
We can issue DMA requests from the memory based on their spatial locality in memory to improve the contiguity of access from the external memory. Knowing that the local store is not arranged like cache, the non-contiguous access during execution of spinors in the local store is fortunately not penalized.
Contiguity of access favors issuing DMAs based on column major ordering for spinors from the main memory while accessing spinors in a row-major fashion from the SPE local store during computations.
The second observation is that some of the spinors within the frame are repeated. To save bandwidth, we can bring the non-redundant part of the frame and then use the very fast local store to local store transfers. This can reduce the stress on the memory system thus reducing the effect of the limited bandwidth on performance. We similarly search for redundant data that can be moved between consecutive frames. Effectively, two copy lists are created; the first is used for copying redundant spinors within the same frame; and the second is used to copy spinors between consecutive frames. Because the values of spinors changes during computation, while their locations remain fixed, the analysis to create optimized DMA lists and to create copy list is needed only once at the start of the program execution. This job is done more efficiently on PPE because the computation is control flow intensive. The PPE can also use the addresses of the spinors within the SPE local store so that absolute indexing is given back to the SPE to improve its runtime performance.
Each SPE receives the addresses where its optimized DMAs are located in the start of creating its thread. In our implementation, we allocate a buffer for information about DMAs for 64 frames in the local store (occupying about 32 KB). Each frame represents the data necessary for computing 64 single precision spinors or 32 double precision spinors. In total, a DMA list carries information about 2048 to 4096 output spinors. The eight SPE can hold the optimized DMA information for a lattice size 16 3 × 16 of single precision (or half of that for double precision).
To handle larger lattice sizes the optimized DMA lists, created in the beginning of the program, can be fetched in pieces. The overhead of this process is minimal since it occurs once every 64 frames of computation and the memory request involves a small amount of data. The layout of the data in the local store including the optimized DMA list is shown in Figure 8 .
Benefiting from these optimized DMAs depends on whether the bandwidth is the performance bottleneck or not. Figure 9 shows the use of double buffering to overlap computation with communication. For single precision computation on the Cell BE, the DMA performance is in the critical path of execution. For double precision on the Cell BE, the computations hide the latency for the DMA completely. For the Cell EDP, simulations show that the wait for DMA will be put back in the critical path of execution for double precision computations. The optimized DMA, shown on the right of Figure 9 , reduces the stress on the DMA and introduces additional work to the computing thread. The overall execution time can be lowered if DMA is bounding the performance. The computation engine with the optimized DMA and the inter-frame analysis is shown in Algorithm 1. The computation uses double buffering with the optimized DMA to overlap computation and communication whenever possible. It also interleaves sending the results to the memory with the computation whenever possible.
Performance with DMA
In this section, we present the performance considering DMA, based on the 4-way fusion for single precision and the 2-way fusion for double precision. We also report the performances estimated by the simulator for the future generation Cell EDP with enhanced double precision engine.
We show the performance with three implementations of the DMA
• Base DMA with no optimization: The row major access for spinors is adopted thus accessing fragmented data.
• DMA with contiguity: The spinor data in the local store are aligned in column major format and the DMAs are requested form memory for all data in contiguous fashion whenever possible.
• Optimized DMA: Only non redundant data are requested from the memory in addition to the earlier contiguous access to memory. The data are complemented either from the data brought with the current frame or from the data available in the previous frame. Figure 10 shows the execution time breakdown for the three DMA schemes on the Cell BE and the Cell EDP. Enqueuing unoptimized DMA requests contributes large percentage of the execution time ranging from 50% to 90%. The main reason for this large delay is that only 16 DMA requests are allowed at a time. Exceeding this limit causes the SPE execution to stall waiting for the completion of DMAs and thus reducing the chance of overlapping computation with communication. Another problem with fragmented DMAs is that it is not always aligned to 128-byte boundaries of the address space. The fragmentation with single precision computation is more severe compared with the fragmentation with double precision. Aligning data to 128-byte boundary is not advantageous because it will reduce the number of spinors that can be computed per frame to half.
Contiguous DMA requests are created by observing the contiguity on spinors accessed in the same frame of spinors. Enqueuing time becomes negligible and the wait time for DMAs shows up as the limiting factor for single precision on the Cell BE with about 62% of the execution time and is predicted for double precision on the Cell EDP to be 66% of the execution time. The double precision performance on the Cell BE is not affected by the latency of the DMA.
Optimized DMA reduces the stall for DMA by 30-40%. Buffer repairs (complements) are less than 5% of the total execution time for single precision on the Cell BE and double precision on the Cell EDP. For double precision on the Cell BE, the buffer repair takes 2.5% of the execution time. Table 3 summarizes the performance in GFlops and the bandwidth needed in GB/s. Compared with contiguous DMA, optimized DMA achieves 37% performance improvement for single precision. Future Cell EDP is expected to observe 39% performance improvement for optimizing DMA on double precision computation compared with using contiguous DMA access alone. The improvement in performance is higher than the saving in bandwidth because other savings are associated with reducing the requested data such as reducing the queuing delay, reducing fragmentation repairs, and reducing the memory controller occupancy.
For double precision on the Cell BE, the best performance is achieved with contiguous DMA without removing redundancy because copying spinors is added to the critical path of execution. With redundancy removal, the reduction in performance is about 2% while almost 26% of the bandwidth is saved which potentially can save power consumption.
Another observation is that the saving in bandwidth is different for single precision (24%) compared with double precision (26%). The bandwidth saving is dependent on the fragmentation of data and the amount of redundancy within a frame of spinors. The fragmentation for double precision frame is lower compared with single precision frame because of the larger size of the data structures representing the spinors and the gauge field links. On the other hand, the redundancy within a frame is dependent on the frame size where for single precision it is 64 while it is 32 for double precision. The bigger the frame size the better the chance to find redundancy within a frame. Figure 11 shows the increase in the amount of redundancy within a frame with the increase of frame size. Increasing the lattice size also requires increasing frame size to find the same amount of redundancy.
The simulations we conducted for the Cell EDP conservatively assumed that the local store size (and thus the frame size) will be the same as the current generation Cell BE. If the local store size increases, then the effectiveness of optimizing DMA, by removing redundancy, is expected to increase.
SPES UTILIZATION
The SPEs are fully utilized for double precision computation on the current generation Cell BE. For single precision computation on the Cell BE and double precision on the Cell EDP, SPEs stall for large percentage of the execution cycles. The main reason is that the cycles that the memory controller will be fully occupied bringing one frame of data, based on 25.6 GB/s, is not less than 11.5 KCycles. Having eight SPEs per Cell BE, each SPE will have 92.2 KCycles to do computation. These cycles are about 3 times the cycles needed for computing a single precision frame on the Cell BE. A higher ratio is predicted for double precision on the Cell EDP, if the memory bandwidth is not improved. This shows why the SPEs will be underutilized in these cases. Figure 12 shows the performance achieved by partial use of the Cell BE computational resources. The experiments were done on IBM Blade Center QS20 system varying the used SPE count from 1 to 8. For single precision calculation, it is apparent that as few as four SPEs can achieve most of the performance the Cell processor can afford, considering the 4-way fusion. The performance slows down by 5% if we increase the SPEs from four to eight because we use more contending SPEs on the limited bandwidth. For single precision computation, we achieve the same maximum throughput (defined by the bandwidth) for the 2-way fusion and the 4-way fusion. For single precision, 2-way fusion, we achieve the maximum throughput with 6 SPEs compared with 4 SPEs for the 4-way fusion. For double precision the 2-way fusion steadily shows better performance compared with no fusion.
Although the flattening throughput is disappointing, it allows an additional degree of freedom for designing a multicell system to simulate Lattice QCD. To handle the imbalance of the computational resources with the bandwidth for Lattice QCD on the Cell BE, the following alternatives can be explored:
• The memory of some of the SPEs can be used as an extended memory for the other SPEs, for instance, to hold optimized DMA information. The inter-SPE bandwidth is much higher than the bandwidth with the external memory. The Cell processor will then be able to hold information about a very large lattice size (for instance 32 3 × 64).
• In addition to the computing SPEs, other SPEs can manage their local storage as additional buffers for frames. Creating extended frames increases the chance of detecting redundancy within each frame and thus reducing the pressure on the memory bandwidth.
• Few SPEs can be used and the others are turned off to save power, or computing threads can migrate between SPEs to reduce hotspots on the Cell BE chip. The runtime fusion can be viewed as power-saving optimization.
Single on Cell 
CONCLUSIONS
Porting the computation of the actions of Wilson-Dirac Operators on Cell broadband engine requires two special design processes; the first process is SIMDizing the code to achieve good performance on this architecture; the second process involves optimizing DMA requests.
For Wilson-Dirac operator, no single compile-time data layout can be optimal for SIMDizing this code. In this work, we presented the "runtime data fusion" model to overcome the difficulty of static fusion and to reduce the need for the shuffle operations. We show that we can achieve 79.6 GFlops for single precision computation and 8.8 GFlops for double precision (50 GFlops for double precision is expected on the future Cell EDP).
Considering DMA without analysis makes the observable performance no more than 8.4 GFlops for single precision and 4.3 GFlops for double precision (6 GFlops on the Cell EDP).
We show that analysis of frames of data can help in saving 26% percent of the bandwidth in addition to improving contiguity of access. The PPE undertakes the job of analyzing data frames for contiguity and redundancy removal once in the beginning of the program and then transfer to the SPEs optimized DMA requests and buffer repairs. With optimized DMA, we observed 31.2 GFlops for single precision, 8.75 GFlops for double precision, and we predict 16.6 GFlops of double precision performance on the Cell EDP based on the performance simulator provided by IBM.
