Lattice Gauge Theory computations
The numerical investigation of quantum chromodynamics (QCD) on a four-dimensional space-time grid is one of the grand challenges in high-performance scientific computing [1, 2] . QCD is considered to be the fundamental theory of strongly interacting particles. After 20 years of research, the strong coupling regime of QCD [3] still has not been solved in a non-perturbative analytical approach, and it is by now widely believed that the numerical treatment of the theory on the lattice using very fast parallel supercomputers is the only viable scheme to extract quantitative physical results [4] . The results from lattice gauge theory (LGT) simulations are urgently needed as theoretical input for current and future accelerator experiments that attempt to observe new physics beyond the Standard Model of elementary particle physics [5] .
LGT computes functional integrals using Monte Carlo methods known from statistical physics [6] . A representative ensemble of field configurations is generated by a Markov process (simulation phase). These configurations are subsequently analysed by composing (and averaging over) hadronic correlators from quark Green's functions (analysis phase). The correlators then serve to extract physical observables like hadron masses and decay constants.
The numerical problem and computational effort
The enormous amount of floating-point operations which have to be calculated in the stochastic simulation in order to achieve statistically significant physical results has Talk presented by N. Attig led to a concentration of activities in this field of research. Several collaborations in Japan (CP-PACS), the U.S. (MILC), United Kingdom (UKQCD), and Italy (APE) are investigating QCD systematically either on special-purpose or commercial high-end parallel supercomputer hardware performing with several hundreds of Gflops.
The large-scale projects SESAM 1 [7] and TχL 2 [8] , a common effort of German and Italian high-energy physicists to study full QCD with two flavours of dynamical Wilson fermions, take place on APE100/Quadrics systems in Zeuthen and Rome and on CRAY T3E systems installed at the Research Centre Jülich [9] . In the following, the physical model of this application which is common many QCD investigations and its numerical implementation are discussed in some detail.
In the simulation phase a representative ensemble of gauge field configurations on a space-time lattice is generated using a Hybrid Monte Carlo algorithm (HMC) on the parallel supercomputer APE100/Quadrics. The SESAM project has simulated on a 16
3 × 32 lattice while the TχL production is still ongoing on a 24 3 × 40 lattice. The APE100 systems are equipped with 256 and 512 nodes respectively. Each node has a theoretical peak performance of 50 Mflops, the HMC algorithm reaches a sustained performance of 50% on a 512-node machine. The total computer time used so far adds up to 250 Teraflop hours.
The analysis of these gauge field configurations is performed on the CRAY T3E systems at the Research Centre Jülich. The most time-consuming part of this analysis phase is the frequent computation of the QCD Green's functions as solutions of huge sparse systems of linear equations [10] on each gauge field configuration
where φ is some input vector, M is the Dirac fermion matrix and ψ the required solution. The solution (quark propagator) is then used to investigate the properties of hadrons by constructing the appropriate hadronic correlators. For the choice of lattice fermions considered here, namely the so-called Wilson fermions, (1) has the explicit form
where U µ (x) c,c is an SU(3) matrix for the gauge field (gluon), computed in the simulation phase, ψ(x) c α is a 4 × 3 complex matrix for the particle (quark) and m is a 4 × 4 matrix containing the spin components. The index x runs over all space-time lattice points (4 dimensions). In particular, M is a non-hermitean complex sparse matrix with the following structure: -8 non-diagonal 12 × 12 matrices (nearest neighbours in space-time).
-Each 12 × 12 matrix is a tensor product of an SU (3) matrix U times the Dirac 4 × 4 matrix m.
In order to solve this system of linear equations we apply Krylov subspace solvers 3 , complemented by preconditioning techniques.
Inversion algorithms
The class of Krylov subspace iterative methods for solving (1) is characterised by the generic form 1. choose an initial guess ψ 0 and set
Here q (m−1) (M ) is a polynomial of degree < m. Examples are the conjugate gradient or the minimal residual algorithms. At present, the state-of-the-art method for the fermion matrix is the stabilised biconjugate gradient (BiCGstab) algorithm [11] .
Further improvement of the inversion algorithm can be obtained by using preconditioning techniques, which should reduce the number of iterations and the computing time necessary to achieve a given accuracy. To precondition (1) we take two nonsingular matrices V 1 and V 2 which act as left and right preconditioners respectively, i.e. we replace (1) by
The matrix V = V 1 V 2 is called the preconditioner. The efficiency of the preconditioning method depends on how good an approximation V is of M , as well as the computational overhead entailed in the method: solving systems with V 1 and V 2 should be cheap. Currently, the most efficient way is to use symmetric successive over-relaxation (SSOR) preconditioning (for details see [12, 13] ):
with M = I − L − U , I the diagonal part, L the strictly lower triangular part and U the strictly upper triangular part. The convergence rate still depends on the ordering of the sites in the lattice. In case of even-odd ordering, where all even sites are labeled before the odd ones on each processor like a checkerboard (see Fig. 1 ), the preconditioned matrix V −1
can be computed explicitly. For larger sub-systems we use parallel SSOR preconditioning. In particular, the sub-blocks have the same size as the partitions of the lattice assigned to a processor. Thus, in parallel SSOR, the parallelism is adapted to the parallel system. The use of the Eisenstat trick is essential:
In this way the SSOR preconditioning is about as expensive as M ψ, because (I −L)x = y (forward solve) and (I − U )x = y (backward solve) are easy to solve. The largest vector computed in the SESAM and TχL projects is of a size of 12 Mwords. This number is given by the volume factor (e.g. 24 3 × 40), the four Dirac components, three colour components, and a factor 2 due to complex elements. As a consequence, we are faced in the analysis phase, where about ten thousand configurations have to be used for calculating the propagators, with a requirement of about 10 Teraflop hours computing time on the order of 1 Terabyte of data. It is therefore evident that optimization of the QCD analysis codes on parallel supercomputers is mandatory in order to carry out ambitious computations of this kind efficiently.
CRAY T3E architecture
The CRAY T3E, which is the second generation of Cray Research MPP systems, is the ideal machine for this kind of application. It is a fully scalable MIMD system with distributed memory and global address space. The application processing elements (application PE) are connected by a network which has the topology of a three-dimensional torus. The system is self-hosting and scalable from 8 to 2048 PEs. For every 16 application PE an operating system node (OS PE) is required.
Each processing node is equipped with a DEC Alpha EV5 microprocessor. In the CRAY T3E-600 the processor clock rate is 300 MHz, while in the T3E-900 the clock rate is 450 MHz. The instruction rate is up to 4 per clock cycle (2 floating-point, 2 integer/logic) which results in a peak performance of 600 Mflops for the T3E-600 and 900 Mflops for the T3E-900, respectively. The microprocessor uses IEEE 64-bit arithmetic. Furthermore, each microprocessor has 3 on-chip caches: one 8 KByte data cache (Dcache), segmented into 256 cache lines with 4 64-bit words each, one 8 KByte instruction cache (Icache) with the same segmentation and one 96 KByte secondary cache (Scache), separated in 3 associative sets of 512 cache lines each, serving either Dcache and Icache. The load latencies are 2 clock periods (CP) from Dcache and 8 to 10 CP from Scache, the load bandwidths are 2 loads/CP from Dcache or Scache, or 1 store/CP. Each PE is equipped with a local memory (DRAM) of 128, 256 or 512 MByte. To maximize local memory bandwidth, 6 data stream buffers are available. They do a prefetch from DRAM to cache for vector-like data references leading to a better performance. Additionally, a set of 512 off-chip memory mapped external (E) registers can be used, which directly load/store into/from the cache registers from/to the global memory.
For QCD simulations it is sometimes sufficient to perform the calculations in halfprecision. On the T3E operations can only be performed in 64-bit precision due to the 64-bit arithmetic of the processor. Nevertheless, it is possible to use half-precision variables (32 bit). Before being processed, they are extented to 64 bit, then processed and finally reduced again to 32 bit. So it is not possible on the T3E to increase the number of operations per clock period but one can profit from a better cache usage.
The calculations are performed at the Research Centre Jülich on two CRAY T3E systems: on a 512-node T3E-600 and and on a 256-node T3E-900. Table 1 summarizes the main characteristics of these machines installed at Jülich.
Implementing the inversion algorithm on the CRAY T3E
The strategy for computing the quark propagators is as follows. First, a four-dimensional processor grid with n proc processors is defined, where p x · p y · p z · p t = n proc . Then, the system is partitioned by dividing the four-dimensional lattice of size The speed of the algorithm and the convergence depends on the layout and the ordering of the fields on each processor. The cache is used best by running the spacetime lattice index last. Two efficient orderings of the lattice sites are the even-odd or checkerboard ordering and the local lexicographic ordering. The first one is easy to vectorize or parallelize, because the even (odd) points are independent from each other (Fig. 1) , while the second one is the most efficient way with regard to the number of iterations [12] . In the forward (backward) solves the neighbours before (after) the current site have to be considered, thus one needs to communicate the results of the points on the boundary as soon as they are computed. In order to reduce the overhead for the communication we introduce a further blocking within local lexicographic (BLL) ordering on each processor (Fig. 1) , which accelerates the Mflop rate by a factor of 3.
Cutting the local sub-lattice once in order to decouple the local data elements is sufficient to achieve efficient pipelined communication. Hence, the blocking is called one-dimensional. We remark that it would not be useful to perform two-dimensional blocking (leading to four distinct local partitions) as the efficiency of local lexicographic SSOR depends on the local lattice size and decreases for smaller local lattices.
In Fig. 1 the ordering of the sites corresponds to the alphabetic ordering a − q. The arrows illustrate the points, which have to be taken into account for the update of one of the sites in a two-dimensional analogue.
The program starts with an input phase. The gauge fields ({U }, 159 MByte on the 24 3 × 40 lattice) and sometimes the starting vectors (φ, ψ o , 637 MByte each on the 24 3 × 40 lattice) are read in (in half-precision, because they were generated on the APE100 in 32-bit precision). The I/O is implemented in parallel by letting the p t processors read n t files and send the data to those processors belonging to the same time-slices. Making use of user-triggered disk-striping by distributing the files over different partitions, a performance of up to 120 MBytes/s can be reached. When the data are read in and distributed, they finally have to be converted to 64 bit. Now the inversion algorithm can start.
One iteration step for solving the linear equation M ψ = φ involves four basic operations:
1. communication of the boundary values, 2. vector inner products, 3. AXPY vector operations (y = y + α * x), 4. matrix-vector multiplications.
For the communication, the Cray-specific shared memory routines (shmemput and shmemget) are used which transfer the data directly between the local and remote memories. For standard message-passing routines like MPI about 10% of the total computer time is spent on communication. With shared memory routines, communication is a factor of 2 to 3 faster for our code.
In case of the inner products and AXPY operations, Fortran 90, BLAS, and assembler routines can be compared. We found that for our application there is nearly no difference between the different routines, but on the Fortran level, it is worthwhile taking care that all cache entries are used as often as possible and padding the arrays to avoid cache conflicts. E.g. the operation y = x + β * y followed by z = x + α * y runs with 86 Mflops (on the T3E-900) while the performance goes up to 175 Mflops if both operations are done in one step z = x + α * (z + β * y).
Around 80% of the computer time is spent for the forward/backward solve or, in case of even-odd ordering, for the matrix-vector multiplication, which from the programming point of view is essentially the same. Therefore, we spent most of our effort on optimizing this operation. Looking at the fermion matrix in more detail, the number of SU (3)-multiplications can be reduced by a factor of 2 by noting that the 4 × 4 matrix m α,α can be split into the matrices p of size 4 × 2 and q of size 2 × 4. This is known as the Wilson trick. The matrix M of (2) takes the form
For further discussion of the parallel implementation, let us consider the last term
A standard way to compute (7) in four steps is:
1. -Communicate the boundary of χ (done globally for the whole vector χ).
Using even-odd ordering there is another possible way to proceed in five steps:
At first glance one would expect case 1 to be better suited to exploit cache reuse, while case 2 should be able to exploit the stream buffers more effectively. This is confirmed by the single-processor results listed in Tab. 2, where one clearly sees that the multiplication U χ has a higher Mflop rate in case 2, while for the final multiplication case 1 has the higher rate since it has better cache usage. From this performance comparison it can be concluded that at least on a Fortran programming level, case 1 is the better choice. Also, as expected, nearly all floating-point operations are generated by the complex 3 × 3 matrix-vector multiplication with U µ (x). On the other hand, the multiplications with p and q are mainly load and store, because m has a simple form, e.g. for µ=2:
These multiplications are hard to optimize because the performance is limited by the bandwidth between memory and cache.
Comparing the results on the T3E-600 with the ones obtained on the T3E-900 one clearly sees that the application speeds up very well for those parts which are not memory-bounded, e.g. U 
Single-processor optimization
The performance obtained up to now implies automatic (Fortran 90 compiler options for intrinsic vectorisation and unrolling) and manual optimizations (index permutations, loop interchange to improve cache usage, common block padding to avoid Scache conflicts and keeping the number of active data streams to 6 or less wherever possible). To improve the performance further, we decided to use assembler programming for the kernel routines. As a first step, the multiplication with U was rewritten in assembler. The performance results are listed in Tab. 3. A significant improvement can be observed in cases 1 and 2, indicating that the hand-written assembler routine makes much better use of the registers. (7) 67 (81) 45 (72) 89 (116) 52 (98) Further improvements are expected by extending the assembler programming to more steps needed for the calculation of (7), especially for case 2 where the inner loop over the sites x implies a very efficient use of the streams buffers. It might be possible that a hand-written assembler program can benefit much more from this effect than the corresponding Fortran routine. Our attempt is to speed up case 2 as much as possible in order to finally obtain a BiCGstab algorithm with even-odd odering which converges faster in real time than the corresponding one with BLL ordering. For case 1 we do not expect any further major acceleration because only site after site can be calculated.
One possibility is to combine the last three steps of case 2 in one assembler program. In this case, we measured a performance of 108 Mflops (T3E-600), and 120 Mflops (T3E-900) respectively. However, the assembler code now contains the gather instruction of the nearest neighbour sites as well as the multiplication with the matrix q, which have both very low Mflop rates. Combining these steps with the matrix-vector multiplication leads to more effective use of cache resident data and eliminates some of the intermediate steps (e.g. explicitly calculating χ ). This increases the performance from 52 (98) Mflops (T3E-900) to 68 (120) Mflops in the complete computation of (7).
Furthermore, it is possible to combine all steps of case 2 in one assembler routine, except the communication, which is done globally as in case 1. In this case the single-processor performance of the assembler routine reached 85 (140) Mflops on the T3E-600 and 92 (190) Mflops on the T3E-900. This disapointing performance is caused by the extra load and store instructions that have been added to the assembler routine. To understand why this happens one has to consider the architecture of the EV5 RISC chip: even though four instructions are performed per clock period, some instructions cannot be issued simultaneously in the same clock period (e.g. loads and stores). In addition, only one floating-point multiply or add may be issued in the same clock period. These factors make it very difficult to balance the requirements of the processor with the practical nature of the problem, when the assembler code becomes saturated with load and store instructions, while the number of floating-point operations only increases marginally. On the other hand, a performance improvement of around 20% for the complete computation of (7) can be realized. Our currently best performance values for the BiCGstab algorithms are summarized in Tab. 4. These data clearly show that the BiCGstab algorithm with even-odd ordering in the implementation of case 2 is only slightly faster than the one with BLL ordering in the implementation of case 1. As mentioned in Chap. 5, the number of iterations with BLL ordering is roughly one half compared to even-odd ordering. Therefore, in real time BLL ordering is still about 80% faster than even-odd ordering.
Summary
We have presented a discussion of optimized codes for lattice QCD on CRAY T3E systems. To improve the state-of-the-art BiCGstab inversion algorithm, we implemented SSOR blocked lexicographic preconditioning which can be efficiently implemented on parallel machines. Our optimization efforts were concentrated on the matrix-vector multiplications since most of the processing time is spent in these routines. It was demonstrated that assembler programming of kernel routines improves the overall performance substantially. In assembler programming, we found that one of the main difficulties in optimizing codes on the CRAY T3E processing nodes is the efficient implemention of instruction scheduling with respect to the complex memory system (on-chip caches, stream buffers, E-registers), and the processor's capability to perform four instructions per clock period.
The final performance of 20% of the peak speed on the T3E-600 with half-precision data is consistent with the well known limitations of direct mapped cache architectures. Also on other popular machines used by the QCD community like the APE100/Quadrics or CP-PACS, assembler programming and register optimization is mandatory in order to achieve relative performances between 50 and 60% of their peak speed. However, these systems profit from hardware optimizations relevant for efficient QCD matrix-vector multiplications.
