We report our implementation experience of a lattice gauge theory code on the Cell Broadband Engine, which is a new heterogeneous multi-core processor. As a typical operation, we take a SU(3) matrix multiplication which is one of the most important parts of lattice gauge theories. Employing full advantage of the Cell/B.E. including SIMD operations and many registers, which enable the full use of the arithmetic units through the loop-unrolling, we obtain about 200 GFLOPS with 16 SPE, which corresponds around 80 % of the theoretical peak. To our knowledge, this is the fastest value of this operation obtained on the Cell/B.E. so far. However, when we measure the whole time including the data supply, the speed drops down to about 13 GFLOPS.We found that the bandwidth of the data transfer between the main memory and EIB, 25 GB/s, is a bottleneck. In other words, it is possible to run the arithmetic units on the Cell/B.E. with 200 GFLOPS speed, but the current socket structure of Cell/B.E. prevents it. We discuss several techniques to improve the problem partially by reducing the transferred data.
Introduction
Multi-core processors will be a standard computational engine in the high performance computing field in near future, because of their low price and energy-efficiency. Yet, there have been a few investigations how to extract their computational power [1] , [2] , [3] , [4] , [5] , [6] , [7] , [8] , [9] , [10] and [11] .
In this report, we take a typical operation of lattice gauge theories and study how we can write its high performance code on a multi-core processor, Cell Broadband Engine (Cell/B.E.). The lattice gauge theory was formulated by Wilson [12] , and has become very powerful tool to study the non-perturbative aspects of the theory together with numerical simulations. Lattice QCD (Quantum Chromodynamics) is a version of the lattice gauge theories, and is a basic framework for the hadron physics and nuclear physics. In these thirty years, the computational power has been increasing very rapidly, and together with this progress, the lattice QCD has become a more reliable tool. In the era of multi-core processors, the lattice QCD is expected to have strong predictive power especially for hadron physics at finite temperature and density, which includes the astrophysics and high energy heavy ion experiments.
We consider the complex 3 × 3 matrix (SU(3) matrix) multiplication, 
This is most fundamental operation in the lattice QCD. Most technique which we will investigate in this paper can be applied also to the following operation, 
This 3 × 3 complex matrix times vector calculation appears in the large sparse matrix solver, that is the most time consuming part in the lattice QCD. We consider a case in which the above matrix×matrix operations are executed 573, 440 = 112 × 5120 times in the single precision. This corresponds to a lattice size of
The Cell/B.E. is a novel multi-core architecture computational processor. Originally it was developed for a game machine, PS3, and later used for a super-computer, Load-Runner, in Los Alamos National Laboratory, US. IBM also provides a blade server, QS20 for scientific calculations. It was upgraded to QS22. Boards with Cell/B.E. are also available. In addition, QS20 or 22 has two Cell/B.E. in a Board(see Fig.1 ). In this paper, we report an implementation of a code for SU(3) matrix multiplication on QS20.
The Cell/B.E. has eight operation system processor cores called Synergistic Processor Element (SPE) and one system order processor core called PowerPC Processor Element (PPE). It is a multi-core CPU that defines a family of heterogeneous processors. The PPE is PowerPC architecture compliant processor. It is intended to be used as management processor, on which the operating system, memory-management and SPE run-time management application are executed. On the other hand, the SPE is specialized to calculation and has the new architecture with the ability of the 128bit Single Instruction Multiple Data (SIMD) operation that instructs only data within the Local Store (LS) 256kB. Each processor is connected by a high-speed bus called Element Interconnect Bus (EIB). In addition, the EIB is connected to main memory and I/O device, and each processor core performs data access via the EIB.
It is necessary to forward a program and data to LS from main memory so that program execution in the SPE. The SPE uses Direct Memory Access (DMA) transfer for data transmission.
The DMA is used to forward data directly between main memory and LS (or, main memory and I/O device). The DMA data transfer instructions are performed concurrently with the SPE program execution because each SPE has Memory Flow Controller (MFC). See Fig.1 .
Theoretical peak speed of our calculation, Eq.(1), on Cell/B.E. is 268.769518291 GFLOPS, (Hardware peak speed is 460GFLOPS.)
Data structure for communication between PPE and SPE through DMA
We pack the matrices of 16KByte in a structure shown in Listing 1. In order to fit the algorithm done on SPE, we separate the real and imaginary parts of a complex matrix, and pack 112 matrices. (1) is given in Algorithm 1. The speed of this scalar operation is shown in Fig.2 as a function of the number of SPE. for n = 0 to NV-1 do for i = 0 to 2 do for j = 0 to 2 do
end for end for end for 3. Optimization
SIMD Calculation
The SIMDization is the operation technique that can process plural data by one instruction(see Fig.3 ). On the Cell/B.E., the SIMD numerical operations handle 128 bits data, i.e., four real 32 bit data, two double 64 bit data, simultaneously. We combine four sequential data A (n) i, j with n = k, k + 1, k + 2, k + 3 as a vector A (n) i, j . Thus the loop of n in Algorithm 1 is changed to 2. We show a calculation result of Scalar Operation and SIMD Operation in Figure 4 . When we use only PPE, it takes 365.080039(msec) , while it is 40.55844 (msec) when we use SIMD(1SPE). When we use 16SPE, it reaches to 10.13485(msec).
Loop Unrolling and register optimal usage
The SPE has a large number of registers. Therefore, loop unrolling results in an efficient code. Present compiler is not strong enough to use this feature.
We must optimize code by hand. We develop a loop by manual operation, and it widens the code range so that a compiler can optimize many registers.
In addition, the dependency due to the register competition decreases, and one can conceal a stall. Furthermore the total road/store number in the loop decreases, and can conceal the access latency to LS. These are important advantages. We should avoid the resister competition. For this purpose we must keep it in mind that the load from LS to register needs six cycles. Then, it produce a high performance code in which no variable is called which is just used, or to rearrange the instruction of operations. We show a calculation result of Loop Unrolling in Figure 5 . for j = 0 to 2 do 3:
for n = 0 to NV/4-1 step 4 do 4:
end for 8: end for 9: end for
Real bottleneck
So far, we obtained very satisfactory result, i.e., we can use 80% of the SPU hardware power. Needless to say, we use whole Cell system to perform a calculation, i.e., we should provide data to SPU through Main Memory.
We measure the time including the data transfer process, and find that the speed drops down to 12 GFLOPS. This means that we failed to provide number of data which SPU process. In order to improve the situation by hiding the time for communication, we construct a double buffering programming. The essential idea of this method is to overlap the data transfer and floating calculations.
Algorithm 3 Single Buffering
repeat SPE gets Buf1 that includes A and B from MM. SPE calculates C = A × B SPE puts Buf2 that includes C to MM until no data is left in MM.
Algorithm 4 Double Buffering
parallel do SPE gets Buf1-black : SPE calculates data in Buf1-red end parallel do repeat parallel do SPE puts Buf2-red : SPE calculates data in Buf1-black SPE puts get Buf1-red : SPE calculates data in Buf1-black SPE puts Buf2-black : SPE calculates data in Buf1-red SPE puts get Buf1-black : SPE calculates data in Buf1-red end parallel do until no data is left in MM.
During buffer transfer between MM(Main Memory) and SPE, SPE's calculation units are idle. Then we can prepare two buffers, say Buf-red and Buf-black and when one buffer is sent, the SPE can deal data in other buffer(see Algorithm 3 and 4). However, we find that the speed in this case is 13 GFLOPS.
One can go further, i.e., one may construct triple-buffering. But, we see that already for the double buffering case, we gain essentially little. Indeed, the EIB and the bandwidth between the EIB and each SPE is enough wide. However, all data are supplied from the main memory, and 16 SPE's crunch huge amount of data. In our highly efficient code, the SPE's process matrices A and B with a rate of 145 GBYTE per second. Moreover, we have to return the matrix C. In total, we need a data transfer band width of about 220 GB/s. However, the band width to the main memory through the MIC is only 25.6 GB/s, one order of magnitude less. Therefore even we use many buffering method, there is no way to improve the performance. Now the essential point is to reduce data which are required for SPE calculations. For SU(3) matrix, not all elements of 3 × 3 matrices A,B and C are independent. The third row of each matrix can be constructed from the first and second rows.
Using this " reconstruction" technique, we can reduce data, and the speed is now 22 GFLOPS. At PPE, we simply drop the third column of the matrices A and B. We reconstruct A and B as 3 × 3 matrix at SPE, the computational time of the reconstruction is negligible. We return the matrix C to the PPE of 3 × 3 form.
The real degrees of freedom of a SU(3) matrix is eight, i.e., U = e iA with A =
2 A c . However, in this case, it takes time to calculate A in the PPE. Therefore, it is reasonable to use the above reduction, (4).
Concluding remarks
In this paper, we reported our high-performance code redeveloping experience for SU(3) matrix multiplication on Cell/B.E.. This calculation is a central part of lattice gauge theories. After tuning the code, the performance with 16 SPU's is 200 GFLOPS, in single precision. This tuning technique can be applied also for SU(3) matrix times a vector, and we can expect a good result also for the fermion matrix times vector calculation, which appears in a solver for
where D is a huge sparse matrix and contains SU(3) matrices. Taking the data transfer into account we reach a speed of 22 GFLOPS, i.e. 9% of the theoretical peak speed. This is sustained value, and from the cost/performance consideration this is still not bad value. Cell/B.E. can be a handy supercomputing resource even at a small laboratory.
It should be noticed that in real simulations most SU(3) matrix multiplications appear as
When n is large, the ratio of the communication to the calculation decreases. Therefore the band width problem becomes weaker. We stress that the arithmetic calculation speed on the Cell/B.E. is excellent, and the real bottle neck is the band width to the main memory through the MIC, that is only 25.6 GB/s, whereas the bandwidth between the EIB and each SPE is wide enough. Therefore, it is highly desirable that in the next generation of Cell/B.E., this band-width should be widened.
