It is shown that the 21264 Alpha processor can reach about 20% sustained efficiency for the inversion of the Wilson-Dirac operator. Since fast Ethernet is not sufficient to balance computation and communication on reasonable lattice and system sizes, an interconnect based on Myrinet is discussed. We find a price/performance ratio comparable with that of state-of-the-art SIMD systems for lattice QCD.
Introduction
The urgent need for cheap sustained compute power for lattice QCD (LQCD) provides a strong motivation to explore the potential of PC or workstation clusters. Only recently have PCs and workstations become both fast and cheap enough to make clustering them in commodity networks economical in terms of local performance, scalability and total system size. Moreover, operating clusters efficiently requires open source operating systems such as Linux. The apparent success of Beowulf clusters and the tremendous peak compute power of Alpha processors, as realized in the Avalon cluster [1], immediately attracted the attention of the lattice community.
We are going to investigate two different cluster approaches, both based on Compaq Alpha processors:
One system (NICSE-TS) that we have designed and benchmarked, using state-of-the-art iterative solver codes, is a four-node cluster of 533 MHz 21164 EV56 Alpha processors, installed as a test system at the John von Neumann-Institut für Computing in Jülich, Germany, and operated under Linux. Since QCD involves only nearest-neighbor interactions, a mesh-based connectivity appeared to be the natural parallel architecture for handling the ensuing interprocessor communication between the nodes.
Our second test cluster (ALiCE-TS) has been designed in light of the experience gained with NICSE-TS. Besides the shift to 21264 EV6 Alpha processors, we use Myrinet, a Gbit network. This promises interprocessor connectivity fast enough to compute LQCD on Alpha clusters. As Myrinet provides a multistage crossbar, we have given up the former mesh approach. This test system again consists of four workstations. We will show that ALiCE-TS is superior to the "cheap" NICSE-TS in terms of price/performance ratio by nearly a factor of two. The paper is organized as follows: in Section 2 we give the specifications of the two clusters tested, Section 3 describes our benchmark codes and contains some results, and in Section 4 we give price/performance ratios.
* Talk presented by N. Eicker.
The Testbeds
Each benchmark system consists of four single-processor nodes; the two systems use different generations of Alpha processors. The connectivity is fast Ethernet and Myrinet, respectively.
NICSE-TS
NICSE-TS is a four-node system with fast Ethernet connectivity. The system is located at NIC, FZ-Jülich. The nodes are very similar to the Avalon-nodes, i.e. they contain:
• 533 MHz 21164A Alpha microprocessors, 2 MB 3rd-level cache, Samsung Alpha-PC 164UX motherboards
• ECC SDRAM DIMMs (256 MB per node)
• D-Link DFE 500 TX Ethernet cards
The main difference from Avalon is the network setup. Where Avalon has an all-to-all network using switches, NICSE-TS uses a 2-D torus. Thus we need four Ethernet cards per node where Avalon employs only one. On the other hand, we do not need any switch. We expect the network performance to scale to a large number of nodes for nearest-neighbor communication. All-to-all communication can be achieved through the routing capabilities of the Linux kernel.
ALiCE-TS
ALiCE-TS is a four-node cluster with switched Myrinet connectivity. This system is hosted at Wuppertal University. It includes:
• 466 MHz 21264 Alpha microprocessors, 2 MB 2nd-level cache, Compaq DS10 motherboards
• ECC SDRAM DIMMs (128 MB per node)
• 64-bit 33MHz Myrinet-SAN/PCI interface
• MPI based on Myrinet GM library
ALiCE-TS has been purchased as a prototype system for the design of the 128-node Wuppertal Alpha-Linux-Cluster Engine (ALiCE).
QCD Benchmarks and Results
The key computational problem of LQCD is the inversion of the Dirac matrix, which has to be repeated very often. It has been shown in [2] that such systems are most efficiently solved by Krylov subspace methods like BiCGStab. The state of the art is the application of parallel local-lexicographic preconditioning within BiCGStab [3].
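For orientation, the unpreconditioned BiCGStab iteration can be sketched as follows. This is a minimal illustration in plain Python on a small dense matrix, not the benchmark code, which is written in C and acts on the sparse Dirac matrix. The helper names (`matvec`, `dot`, `norm`, `bicgstab`) and the test matrix are assumptions made for this example only.

```python
import math

def matvec(A, x):
    """Dense matrix-vector product; A is a list of rows (illustrative only)."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def dot(u, v):
    """Inner product <u, v>, conjugating the first argument."""
    return sum(ui.conjugate() * vi for ui, vi in zip(u, v))

def norm(u):
    """Euclidean norm; also valid for complex entries."""
    return math.sqrt(sum(abs(ui) ** 2 for ui in u))

def bicgstab(A, b, tol=1e-10, max_iter=100):
    """Solve A x = b with BiCGStab, starting from x = 0.

    Returns (x, iterations). Each iteration needs two matrix-vector
    products plus a handful of vector updates and inner products.
    """
    n = len(b)
    x = [0.0] * n
    if norm(b) == 0.0:
        return x, 0
    r = list(b)            # residual b - A x for the initial guess x = 0
    r_hat = list(r)        # fixed shadow residual
    rho = alpha = omega = 1.0
    v = [0.0] * n
    p = [0.0] * n
    target = tol * norm(b)
    for k in range(1, max_iter + 1):
        rho_new = dot(r_hat, r)
        beta = (rho_new / rho) * (alpha / omega)
        p = [ri + beta * (pi - omega * vi) for ri, pi, vi in zip(r, p, v)]
        v = matvec(A, p)
        alpha = rho_new / dot(r_hat, v)
        s = [ri - alpha * vi for ri, vi in zip(r, v)]
        if norm(s) <= target:          # converged after the first half step
            x = [xi + alpha * pi for xi, pi in zip(x, p)]
            return x, k
        t = matvec(A, s)
        omega = dot(t, s) / dot(t, t)
        x = [xi + alpha * pi + omega * si for xi, pi, si in zip(x, p, s)]
        r = [si - omega * ti for si, ti in zip(s, t)]
        if norm(r) <= target:
            return x, k
        rho = rho_new
    raise RuntimeError("BiCGStab did not converge within max_iter")
```

The vector updates and inner products in the loop are the BLAS-1 type operations that accompany the matrix-vector products in each iteration.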
The results of this paper's benchmarks are based on two codes:
BiCGStab is a sparse matrix Krylov solver with regular memory access, where computation and communication proceed in an alternating fashion. In this case, DMA capabilities of the communication cards are not exploited.
SSOR is the same solver but with local-lexicographic SSOR preconditioning. The SSOR process leads to rather irregular memory access and extensive integer computations. This code is very sensitive to the memory-to-cache bandwidth. Since communication overlaps with computation, DMA can be exploited.
Both codes are written in C and compiled with the GNU egcs-1.1.2 C compiler. Timing was done with MPI_Wtime. Each code exists in two versions:
1. To test single node performance, the code runs without communication operations, otherwise carrying out exactly the same operations as the following parallel version.
2. On the 4-node test machines, the physical system is laid out in a 2-D fashion; consequently, communication is carried out along two dimensions, namely the z- and t-directions. Assuming $N_{\text{proc}} = N_z \times N_t$ processors, the global lattice is divided into $N_t$ slices in the t-direction, where every slice consists of $N_z$ slices in the z-direction.
In the sequel, we employ a local lattice of size $16^2 \times 4 \times 8$ on $2 \times 2$ processors, such that we emulate a realistic $16^3 \times 32$ system on $4 \times 4$ processors.
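The two-dimensional layout can be made concrete with a short sketch. It computes the local lattice for a given processor grid, the z- and t-neighbours of a node, and the number of local sites on the faces that take part in communication. The function names, the rank ordering (z running fastest) and the periodic wrap-around are assumptions made for illustration; the source does not specify how ranks are numbered.

```python
def local_lattice(global_dims, proc_grid):
    """Split an (x, y, z, t) lattice over an (N_z, N_t) processor grid.

    Only the z and t extents are divided; x and y stay local.
    """
    nx, ny, nz, nt = global_dims
    n_z, n_t = proc_grid
    if nz % n_z or nt % n_t:
        raise ValueError("processor grid must divide the z and t extents")
    return (nx, ny, nz // n_z, nt // n_t)

def rank_to_coords(rank, proc_grid):
    """Map a rank to (z, t) grid coordinates; z runs fastest (assumed)."""
    n_z, _ = proc_grid
    return rank % n_z, rank // n_z

def neighbours(rank, proc_grid):
    """Ranks of the four nearest neighbours on a periodic 2-D grid."""
    n_z, n_t = proc_grid
    z, t = rank_to_coords(rank, proc_grid)

    def to_rank(zc, tc):
        return (tc % n_t) * n_z + (zc % n_z)

    return {
        "z+": to_rank(z + 1, t),
        "z-": to_rank(z - 1, t),
        "t+": to_rank(z, t + 1),
        "t-": to_rank(z, t - 1),
    }

def boundary_sites(local_dims):
    """Number of local sites on the two faces in each communicated direction."""
    lx, ly, lz, lt = local_dims
    return {"z": 2 * lx * ly * lt, "t": 2 * lx * ly * lz}
```

For example, `local_lattice((16, 16, 16, 32), (4, 4))` and `local_lattice((16, 16, 8, 16), (2, 2))` both yield the local lattice `(16, 16, 4, 8)` used in the benchmarks, so the four-node testbed carries the same per-node load as the emulated sixteen-node system.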
Single-node results
The basic operation in the iterative solution of the Dirac matrix is the product of an SU(3) matrix with two color vectors. The average number of flops per matrix-vector operation is $N_{\text{flop}} = 171$. The number of complex words to be fetched from memory in order to carry out this operation is $N_{\text{cwords}} = 9 + 2 \times 12$, leading to $N_{\text{bytes}} = 528$ bytes in double precision arithmetic. Therefore we expect the maximal performance that can be reached on a single node to be limited by
$$P_{\max} = \frac{N_{\text{flop}}}{N_{\text{bytes}}} \times B_{\text{mem}}$$
in a steady state of computation and data flow, given a maximal memory bandwidth $B_{\text{mem}}$ of 300 and 1300 MB/sec, respectively. Note that our problem size is chosen to be larger than the available caches. The real performance will be smaller due to BLAS-1 and BLAS-2 operations within BiCGStab.
Table 1: One processor benchmark. Numbers in MFlops.
Table 1 shows that, on the UX board, the performance of BiCGStab comes close to the limiting value given, while the DS10 performance deviates by more than a factor of 2 from the estimate. The local lattice size is presumably too small to saturate the bandwidth of the DS10 (see footnote 2). However, as a main result, we find that the performance improves by around a factor of two in going from the 533 MHz Alpha 21164 to the 466 MHz Alpha 21264 chip, using identical codes. Furthermore, the SSOR preconditioner with its irregular memory access is, as expected, less effective than the simple BiCGStab.
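The bandwidth-limited estimate discussed above can be evaluated numerically. The sketch below assumes that the limit is the ratio of flops to bytes moved, multiplied by the memory bandwidth, that one double-precision complex word occupies 16 bytes, and that 1 MB/sec means 10^6 bytes per second. The function name is hypothetical.

```python
N_FLOP = 171               # flops per matrix-vector operation (from the text)
N_CWORDS = 9 + 2 * 12      # complex words fetched per operation (from the text)
BYTES_PER_CWORD = 16       # double-precision complex word (assumption: 2 x 8 bytes)
N_BYTES = N_CWORDS * BYTES_PER_CWORD   # 528 bytes

def bandwidth_bound_mflops(bandwidth_mb_per_s):
    """Upper bound on sustained MFlops when memory traffic is the bottleneck.

    Assumes 1 MB/sec = 1e6 bytes/sec, so MB/sec times flops/byte gives MFlops.
    """
    return N_FLOP / N_BYTES * bandwidth_mb_per_s

if __name__ == "__main__":
    for label, bw in [("UX, 300 MB/sec", 300.0),
                      ("DS10, 1300 MB/sec (theoretical)", 1300.0),
                      ("DS10, 580 MB/sec (measured, footnote 2)", 580.0)]:
        print(f"{label}: {bandwidth_bound_mflops(bw):.0f} MFlops")
```

Under these assumptions the bound is about 97 MFlops for 300 MB/sec and about 421 MFlops for 1300 MB/sec. Using the measured bandwidth of 580 MB/sec quoted in footnote 2 lowers the second figure to about 188 MFlops, roughly the factor of two mentioned in the text.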
Four-node results
The impact of interprocessor communication for both connectivities is determined on the four-node testbed systems.
Table 2: Four processor benchmark. Numbers in MFlops.
As shown in Table 2, the results for the fast Ethernet mesh (denoted by UX) are disappointing. The performance of both codes, SSOR and BiCGStab, is reduced by more than a factor of two compared to the single-node result. The main degradation is due to the massive protocol overhead, which forces the processor to spend its time on protocol handling instead of computation. User-level networking interfaces promise to circumvent this problem in the near future, but are currently not available for our configuration. It is satisfying to see, by comparing Tables 1 and 2, that the Alpha 21264-Myrinet system (denoted by DS10) with the Myrinet GM library has a communication loss in the range of only 10 to 20%. We expect a further considerable improvement of these results from software with a reduced protocol stack, like SCore [5] or ParaStation [6].
Footnote 2: The STREAM benchmark [4] gives a real bandwidth of 580 MB/sec instead of the theoretical value of 1300. This difference explains the factor of two.
Conclusion
Comparing price/performance ratios, we arrive at the following estimates: An Alpha 21164 system, connected in a fast Ethernet mesh, would, as an optimistic estimate, lead to a 4 GFlops device (sustained) for 128 processors with a price of about 80 k$ per GFlops.
A 128-processor DS10 Alpha-Linux-Cluster connected by Myrinet, however, promises to reduce costs to 40-50 k$ per GFlops (estimated from list prices as of July 1999) and is therefore in the range of state-of-the-art dedicated QCD machines.
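The arithmetic behind these estimates can be spelled out in a few lines. The sketch below uses only the figures quoted above; the function names are hypothetical and the derived totals are simple consequences of those figures, not additional measurements.

```python
def total_cost_kusd(sustained_gflops, kusd_per_gflops):
    """System price in k$ implied by a sustained rate and a price per GFlops."""
    return sustained_gflops * kusd_per_gflops

def per_node_mflops(sustained_gflops, n_nodes):
    """Average sustained MFlops per node."""
    return sustained_gflops * 1000.0 / n_nodes

def cost_advantage(kusd_per_gflops_a, kusd_per_gflops_b):
    """Factor by which system B is cheaper per GFlops than system A."""
    return kusd_per_gflops_a / kusd_per_gflops_b

if __name__ == "__main__":
    # Alpha 21164 fast Ethernet mesh: 4 GFlops sustained on 128 processors.
    print(total_cost_kusd(4.0, 80.0), "k$ total")
    print(per_node_mflops(4.0, 128), "MFlops per node")
    # DS10/Myrinet at 40 to 50 k$ per GFlops versus 80 k$ per GFlops.
    print(cost_advantage(80.0, 50.0), "to", cost_advantage(80.0, 40.0))
```

With the quoted numbers, the fast Ethernet system corresponds to about 320 k$ in total and roughly 31 MFlops sustained per node, and the Myrinet system is cheaper per GFlops by a factor of 1.6 to 2.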
