Direct numerical simulation of the Navier-Stokes equations (DNS) is an important technique for the future of computational fluid dynamics (CFD) in engineering applications. However, DNS requires massive computing resources. This paper presents a new approach for implementing high-cost DNS CFD using low-cost cluster hardware.
Introduction
Computational fluid dynamics (CFD) always has been a discipline in search of greater computer power. As coding techniques and processor speeds have improved, user demands for increased accuracy have met and exceeded each increase in computational ability. This problem has been magnified as the phenomena studied with CFD simulations have expanded from traditional applications used by aerospace engineers and meteorologists to more diverse problems. For instance, the CFD group at the University of Kentucky has worked on noise reduction in inkjet printers [18] and on optimizing the thermal performance of a fan-sink cooling system for high power electronics [17] .
High computational cost CFD problems typically are solved at national supercomputer centers or similar facilities. However, access to these facilities is limited, with most organizations outside of government and academia having no access at all. Alternately, large corporations and universities can purchase shared-memory supercomputers such as those built by SGI and HP; however, these machines are expensive, with unclear upgrade paths and relatively high maintenance costs. Smaller companies and research institutions simply cannot afford these systems. Thus, the very high up-front cost of CFD analysis has placed the technology beyond the reach of many of the researchers and engineers whose applications could benefit the most from the new abilities of CFD codes.
This paper presents an alternative approach to the computational challenges of advanced CFD problems: use inexpensive, high-performance, clusters of PCs -"Beowulfs." [1] Over the past few years, many very positive claims have been made in favor of PC clusters, and the hype usually gives better performance than the actual systems do. However, by carefully engineering the cluster design, using (and building) tools to improve application performance, and restructuring the application code for the cluster, it is possible to realize most of the benefits that are so often claimed for clusters.
Since February 1994, when we built the first parallel-processing Linux PC cluster, we have been very aggressively pursuing any hardware and software system technologies that can improve the performance or give new capabilities to cluster supercomputers. DNSTool requires a good network with high bisection bandwidth; we developed a new class of network architectures, and design tools for them, that make it possible to build an appropriate network at minimal cost. DNSTool also requires lots of memory bandwidth and fast floating point math; we provide these by using uniprocessor Athlon PCs and tools to accelerate floating point performance using 3DNow!. Our cluster, KLAT2 (Kentucky Linux Athlon Testbed 2), was not designed to only run DNSTool, but it was designed to run applications like DNSTool exceptionally well. KLAT2 (figure 1) is a cluster of 64 (plus 2 "hot spare") 700MHz AMD Athlon PCs. Each PC contains 128MB of main memory and four 100Mb/s Fast Ethernet interfaces. Nine (and one spare) 32-way Ethernet switches are used in an unusual network architecture to interconnect the machines with low latency and high bandwidth.
Of course, it was necessary to restructure DNSTool to run more efficiently on KLAT2. Many of the changes were generic and would improve the code performance on any cache-based machine. Some of the changes reduced the total FLOPs needed for the computations, reducing total time somewhat by replacing floating point recomputations with more awkward reference patterns -not really what we should have done if achieving the "peak MFLOPS rate" was our goal. Still other changes were highly specific to the features of KLAT2, such as use of Athlon 3DNow! support. The changes were not easy, but applying them to other CFD codes will now be less difficult.
The combination of DNSTool and KLAT2 yields exceptionally good price/performance: $2.75 per MFLOPS double-precision 80/64-bit operations and $1.86 per MFLOPS using single-precision 3DNow!. Maintenance is made relatively painless by including spares in the initial system (and we have included these "unnecessary" spare parts in the system cost). Upgrading the system can be done incrementally by our tools to design an improved network, replacing the processors with a later (faster) version, etc. Further, although we have not experimented with clusters larger than KLAT2, given larger CFD problems, there is nothing about the design of KLAT2 or DNSTool that would prevent efficiently scaling the system to many more processors.
Thus, the design tools and implementation of KLAT2 and DNSTool represent a significant step toward moving complex CFD simulations from the supercomputer center to the center of your lab.
The next section of this paper describes the equations that govern DNSTool's CFD simulation. Section 3 describes KLAT2 and, more importantly, the tools that we have created to optimize the system design and to help code use the system more efficiently. In section 4, we detail how DNSTool was restructured and how we used our tools to tune the code for KLAT2. Section 5 presents both the CFD output and computational performance obtained from applying DNSTool to a real engineering problem: flow over a single turbine blade. The complete parts list and itemized cost of KLAT2 also is given in section 5. The final section of the paper summarizes the contributions and suggests directions for future research.
The CFD Algorithm
In theory, almost all fluid dynamics problems can be solved by the direct application of the NavierStokes equations to a sufficiently fine grid. Direct numerical simulation of the Navier-Stokes equations (DNS) provides a useful tool for understanding the complex physics in engineering processes. Even though with current computer technology, DNS can only be applied to simplified problems, DNS is likely to become the standard practice for engineering simulations in the 21st century. The challenges in applying DNS to practical engineering problems involve: (1) more accurate numerical solutions, (2) high demands on computer memory, and (3) greater computational speed. Traditionally, the focus of CFD research was on improving the numerical accuracy of the solution through more complicated modeling of the physical phenomena. Continual advances in computer architecture and technology have opened the door for focusing on the latter two areas. Meeting the latter two challenges will allow the application of DNS directly to practical engineering problems.
The Governing Equations for DNS
Direct numerical simulation is based on solving the full, time-dependent Navier-Stokes Equations for an ideal gas which express the conservation of mass, momentum, and energy for a compressible Newtonian fluid. The equations written in curvilinear coordinates are 
where
The stress tensor and the heat flux vector
where and ¹ are molecular viscosity and thermal conductivity, respectively. The pressure is related to the density and temperature, º , according to the equation of state
where ½ is the ratio of the specific heats. The coefficient of viscosity,
, and thermal conductivity,
¹
, are related by the constant Prandl number
Numerical Algorithms
There are wide variety of CFD techniques that can solve these governing equations. Of the several CFD codes employed or under development at the University of Kentucky, we have chosen to use DNSTool [9] , a code package originating at the University of Technology Munich and undergoing further development at the University of Kentucky. DNSTool is specifically designed to solve three-dimensional flow problems using the direct numerical simulation model for turbulence. It has primarily been used to simulate hypersonic flow around blunt objects [10] , but is readily applicable to a wide range of engineering CFD problems. The choice of this particular code is based on it being MPI-ready, having been tested on numerous multiprocessor platforms, and on its clean program organization, making it readily accessible to code-performance enhancements. The base DNSTool code is over 20000 lines of C, broken into more than 100 separate files. Briefly, DNSTool solves the governing equations as follows. The flux vector, representing the advective and dissipation flux terms of the Navier-Stokes equations, are solved with the AUSM flux-splitting scheme [14] É
where the geometrical quantities to account for the curvilinear coordinates are contained in 
where the superscripts
Likewise, the pressure is evaluated as per Liou [13] 
The dissipation term evaluation is also M-dependent, although the construct is somewhat different: e f h g p i q r 1 9 ' s C u t v 6 e x w y
with the VL term found from
and the AUSM dissipation term from
The values of t are a function of the pressure distribution, while the value of r g is a function of the
The left and right extrapolated values of the conservative variables are determined by a thirdorder MUSCL-type extrapolation, in which any given quantity z y or z is defined by
The diffusive flux gradients are evaluated using a variation of the circulation method
or in terms of the finite volume grid
These spatial calculations are performed in a series of three sweeps in the i-, j-, and k-directions. This trio of sweeps are made each sub-iteration of the multi-step time integration, leading to as many as 12 sweeps in the case of the 4th-order Runge-Kutta method. Both the 2nd-order (for unsteady solutions) and 4th-order (for steady solutions) Runge-Kutta methods were used by DNSTool/KLAT2 machine.
Parallelization
The parallel structure of DNSTool is based on splitting the computational grid into sub-blocks, which are then distributed to each processor (figure 2). For a complex geometry, this is a nontrivial exercise where an uneven load-balance across different processors could significantly reduce the computational efficiency of the overall code. The partitioning of the grid into sub-blocks is performed independently of the grid generation, using virtual partitions to define the domains corresponding to each processor. The current algorithm governing the partition process is based on spectral recursive bisection in conjunction with a small database of the computational speeds associated with each node. The splitting is performed as a preprocessing task, generating a file that maps the grid sub-blocks onto the processors. This file is the only difference between performing a serial computation and a parallel computation with DNSTool.
Communications between the grid sub-blocks occurs when the sub-blocks exchange data about the flow variables at the boundaries. As show in figure 2 , the flow variables on the edge of one grid block are communicated to the dummy points of the neighboring grid block, and vice versa. DNSTool requires such a communication step after each update, or subiteration, of the flow variables. The low-level implementation of the communication between the sub-blocks uses a MPI-based communications system. The communication model is a mailbox algorithm where the data is sent in a non-blocking communication mode as early as possible.
The overall parallel framework of DNSTool is managed by MBLIB, a code library that efficiently controls the domain decomposition and associated communication patterns. On top of this library the efficient compressible Navier-Stokes solver described previously is implemented. Thanks to MBLIB, the Navier-Stokes solver is isolated from the parallel decomposition-the solver code is the same whether the overall computation is performed on a single domain or any number of subdomains. 
Engineering a Cluster Supercomputer
Since February 1994, when we built the first parallel-processing Linux PC cluster [4] , we have been very aggressively pursuing any hardware and software system technologies that can improve the performance or give new capabilities to cluster supercomputers. As the latest in a long sequence of clusters that we have built, KLAT2 continues our tradition of using innovative new technologies. However, unlike all our previous systems, KLAT2's base configuration uses no custom hardware -all of the performance improvements are accomplished by changing the way in which standard hardware components are configured and/or restructuring the software to take better advantage of the hardware. Without using custom hardware, there are three key approaches that we can use to improve the performance of a CFD code on KLAT2:
1. Optimization and general restructuring of the CFD code to improve performance. Many of these optimizations are not really specific to KLAT2, but improve performance on most computers. We have performed many such optimizations, most of which involve restructuring to change the memory reference behavior and to reorder the calculations so that more redundant computations can be eliminated. Our goal has been to minimize execution time despite the fact that some of the optimizations also reduce the achieved FLOPs rate.
2. Tailoring of the network hardware structure and communication patterns to optimize network performance. The new techniques we used to design KLAT2's network allow us to optimize network performance in general or even to specialize KLAT2's network for this CFD code. All runs reported here were performed with a generalized network rather than one specialized for this code.
3. Use of the Athlon's 3DNow! vector floating point instructions. Unfortunately, these instructions only support 32-bit precision for floating point arithmetic and also have various performance issues that make them difficult to use. However, we have developed compiler technology and other support that makes use of 3DNow! feasible. This is one of the key reasons that KLAT2 uses AMD Athlon processors.
Achieving high performance from this CFD code requires the first two approaches. The third approach, use of 3DNow!, we viewed with skepticism until we were able to verify that 3DNow! single-precision floating-point arithmetic has sufficient precision to ensure that the CFD results are valid.
The following subsections discuss KLAT2's special architectural characteristics, its network and support for 3DNow!, that we have used to improve the performance of this CFD code. The specific optimizations and general restructuring of the CFD are discussed in section 4.
Optimizing Network Performance
The cost and baseline performance of AMD Athlon processors is outstanding, and KLAT2's basic performance relies heavily upon that fact, but processors alone are not a supercomputer -a high-performance interconnection network is needed. Further, KLAT2 uses uniprocessor nodes instead of multiprocessor (shared-memory SMP) nodes for two key reasons: (1) using single-processor nodes eliminates inter-processor memory access conflicts, making full memory bandwidth available to each processor and (2) SMP Athlon PCs were not widely available at the time KLAT2 was built. This increases the importance of network performance and also increases the number of nodes, making the network larger and more expensive. Our solution to these network design problems is a new type of network: a "Flat Neighborhood Network" (FNN) [7] .
This network architecture came from the realization that the switching fabric need not be the full width of the cluster in order to achieve peak performance. Using multiple NICs (Network Interface Cards) per PC, single-switch latency can be achieved by having each PC share at least one switch with each other PC -all PCs do not have to share the same switch. A switch defines a local network neighborhood, or subnet. If a PC has several NICs, it can belong to several neighborhoods. For two PCs to communicate directly, they simply use NICs that are in the same neighborhood. If two PCs have more than one neighborhood in common, they have additional bandwidth available with the minimum latency.
Before discussing how we design and use FNNs, it is useful to consider a small example. What is the best interconnection network design that can be built using four-way switches for an eight-PC cluster? A very popular answer is to use a fat tree [12] , because fat trees easily provide the full bisection bandwidth. Unfortunately, most inexpensive switches cannot handle routing for a fat tree topology. Assuming that appropriate switches are available, the eight-PC network would look like figure 3.
For this fat tree, the bandwidth available between any pair of PCs is precisely that of one link; thus, we say that the pairwise bandwidth is 1.0 link bandwidth units. The bisection bandwidth of a network is determined by dividing the machine in half in the worst way possible and measuring the maximum bandwidth between the halves. Because the network is symmetric, it can be cut arbitrarily in half; the bisection bandwidth is maximal when all the processors in each half are sending in some pattern to the processors in the other half. Thus, assuming that all links are bidirectional, the bisection bandwidth is 8*1.0 or 8.0.
Pairwise latency also is an important figure of merit. For a cluster whose nodes are physically near each other, we can ignore the wire latency and simply count the average number of switches a message must pass through. Although some paths have only a single switch latency, e.g. between A and B, most paths pass through three switches. More precisely, from a given node, only 1 of each of the 7 other nodes can be reached with a single-switch latency. Thus, 1/7 of all pairs will have 1.0 switch latency and 6/7 will have 3.0 switch latency; the resulting average is (1.0 + 3.0*6)/7, or 2.7 switch latency units.
Instead of using a fat tree, suppose that we use a FNN to connect these same eight PCs with four-way switches. Unlike the fat tree configuration, the FNN does not connect switches to switches, so cheap, dumb, switches can be used. However, more NICs are needed. At least for 100Mb/s Ethernet, the cost savings in using dumber switches more than compensates for the larger number of NICs. In this case, each PC must have 3 NICs connected in a configuration similar to that shown by the switch numbers and colors in figure 4 .
Unlike the fat tree, the FNN pairwise bandwidth is not the same for all pairs. the FNN shown has some symmetry, FNN connection patterns in general do not have any basic symmetry that could be used to simplify the computation of pairwise bandwidth. However, no PC has two NICs connected to the same switch, so the number of ways in which a pair of connections through an S-port switch can be selected is S*(S-1)/2. Similarly, if there are P PCs, the number of pairs of PCs is P*(P-1)/2. If we sum the number of connections possible through all switches and divide that sum by the number of PC pairs, we have a tight upper bound on the average number of links between a PC pair. Because both the numerator and denominator of this fraction are divided by 2, the formula can be simplified by multiplying all terms by 2. In other words, the average pairwise bandwidth for the above FNN is ((4*3)*6)/(8*7), or about 1.28571428. Not only does the average pairwise bandwidth of the FNN beat that of the fat tree, but the bisection bandwidth also is greater. Bisection bandwidth of a FNN is very difficult to compute because the definition of bisection bandwidth does not specify which communication pattern to use; for FNNs, the choice of pattern can dramatically alter the value achieved. Clearly, the best-case bisection bandwidth is the number of links times the number of processors; 8*3.0 or 24.0 in our case. If we assume that only pairwise permutation communication patterns are used, a very conservative bound can be computed as the number of processors times the average pairwise bandwidth; 8*1.28571428 or 10.28571428. However, pairwise permutation patterns do not generally yield the maximum bisection bandwidth for a given cut because they ignore bandwidth available using multiple NICs within each PC to send to distinct destinations at the same time. In any case, bisection bandwidth is significantly better than the fat tree's 8.0.
Even more impressive is the FNN design's pairwise latency: 1.0 as compared with 2.7 for the fat tree. No switch is connected to another, so only a single switch latency is imposed on any communication.
However, the biggest surprise is in the scaling. Suppose that we replace the six 4-way switches and eight PCs with six 32-way switches and 64 PCs? Simply scaling the FNN wiring pattern yields pairwise bandwidth of ((32*31)*6)/(64*63) or 1.47619047, significantly better than the 8 PC value of 1.28571428. FNN bisection bandwidth increases relative to fat tree performance by the same effect. Although average fat tree latency decreases from 2.7 to 2.5 with this scaling, it still cannot match the FNN's unchanging 1.0.
It also is possible to incrementally scale the FNN design in another dimension -by adding more NICs to each PC. Until the PCs run out of free slots for NICs, bandwidth can be increased with linear cost by simply adding more NICs and switches with an appropriate FNN wiring pattern. This is a far more flexible and cheaper process than adding bandwidth to a fat tree.
If FNNs are so great, why has it not been done before? There are four important reasons:
1. It only works well if fairly wide wire-speed switches are available; only recently have such switches become inexpensive.
2. Routing is not trivial; as a minimum, each machine must have its own unique routing table.
Optimal routing using multiple NICs as a higher-bandwidth channel is conceptually like channel bonding [2] , but requires a much more sophisticated implementation because this bonding is destination-sensitive (i.e., NICs may be used together when sending to one PC, but grouped differently when sending to another PC).
We solved the last three problems by creating a genetic search algorithm (GA) [11] that can design an optimized network, print color-coded wiring labels, and construct the necessary routing tables.
Although it was somewhat difficult to create the GA, and the execution time was sufficiently large that we actually run it on a cluster, the GA program is capable of optimizing the network design for any communication patterns or other characteristics specified -an important new capability beyond that of traditional networks. KLAT2's genetically designed FNN is shown in figure 5 . This version of KLAT2's FNN was partially optimized for another code (ScaLAPACK [3] ) that uses communication patterns that are completely different from those of the CFD code discussed here -it was not optimized for this CFD. However, KLAT2's FNN uses its nine 31-way switches to yield average pairwise bandwidth of (((31*30)*8)+(8*7))/(64*63) or 1.859 bidirectional 100Mb/s Ethernet links/pair (371.8Mb/s per pair). Multiplying that by 32 PC pairs communicating across the bisection yields a very conservative bisection bandwidth of 11.9Gb/s; the uplink switch (which holds the two hot spares and the links to the outside world) adds an additional 1.8Gb/s of bisection bandwidth, for a total of 13.7Gb/s. The upper bound bandwidth on KLAT2's FNN without the Redesigning and physically rewiring KLAT2's FNN to be optimal for the CFD code is a relatively simple and quick process; we let the GA run for about two hours on a single 1GHz Athlon to design the network and the physical rewiring took two people a total of about 2.5 hours. Figure  6 shows KLAT2's color-coded wiring tags for the network partially optimized for ScaLAPACK (left) and the new network pattern optimized for DNSTool (right). The immediate result of this optimization was an additional speedup of about 1%; much more substantial performance improvements are possible by adjusting DNSTool's order of communications within a timestep and using FNN advanced routing [6] . Certainly, this rewiring and tuning of the communication code is appropriate if KLAT2 (or any other cluster) is dedicated to running this code. However, even with the extra pressure of using uniprocessor nodes, the basic FNN properties proved sufficient to efficiently handle the demands of this communication-intensive CFD code. Thus, throughout this paper, we have quoted the conservative "generic" FNN performance numbers for DNSTool.
At this writing, a version of the FNN design GA has been made freely available at http://aggregate.org/FNN/ via an interactive WWW form. In order to make the GA run fast enough for interactive use, the GA was simplified to use only a generic communication cost function. Once we have completed making the user interface more friendly, we plan to distribute the full GA as public domain source code.
SIMD Within A Register 3DNow! Optimizations
Over the past five years, the on-chip space available for microprocessor logic has reached the point where adding SIMD (Single Instruction Stream, Multiple Data Stream) or vector parallelism can be very cost effective. However, traditional SIMD and vector implementations rely on high-performance memory interfaces that simply are not compatible with current microprocessor datapaths. Thus, a new flavor of parallelism has been developed to mate SIMD semantics with conventional uniprocessor datapaths: something that we generically call SIMD Within A Register (SWAR) [5] . SWAR differs from SIMD in that SWAR parallelism partitions registers and datapaths into multiple fields that are processed simultaneously. Thus, SWAR requires much more constrained data layout, does not directly support disabling of processing elements, and has parallelism width that is not constant, but a function of the data size (e.g., twice as much parallelism with 16-bit data as with 32-bit data).
The concept of SWAR long predates the microprocessors with hardware supporting this type of execution. For example, population count (important because the population count of the XOR of two values is the Hamming distance) has long been implemented using a SWAR algorithm using bitmasks, shifts, and adds. However, there were no software tools developed for SWAR programming. The first SWAR hardware support, in processors like the Hummingbird PA-RISC and the MMX Pentium and K6, was very tightly focused on speeding a few hand-coded algorithms -especially MPEG decode for playing DVDs -so high-level programming models were not a concern. However, we had extensive experience in building optimizing compilers for SIMD machines, so we saw an immediate need to develop a better model and programming tools.
Thus, about 6 months before Intel began shipping the first Pentium MMX processors, we built our first high-level language and optimizing compiler technology for SWAR. Our highlevel language, SWARC, is a vector C dialect supporting first-class arrays with arbitrary precision specified for each variable using the same syntax C uses for specifying precision of bit fields within a struct: int:2 i; declares i as an integer having not less than two-bit precision. To facilitate rewriting only a small portion of a C code in SWARC, the language is designed as a true C superset and is implemented by a module compiler, Scc. The compiler, which is still evolving, has been freely available from our SWAR WWW site, http://shay.ecn.purdue.edu/swar/, for over two years.
So that Scc's SWAR output code integrates cleanly with C code, Scc actually generates C code and invokes GCC on that. Because GCC does not know about any of the new SWAR instructions (i.e., MMX, 3DNow!, etc.), Scc generates GCC-compatible inline assembly macros for these instructions. These macros are actually quite high-level in that they allow direct access to the variables of the ordinary C code with which they are compiled, so they easily can be edited by hand to further tune performance. In fact, our macros for MMX, 3DNow!, and SSE are all freely available and often are used for coding from scratch. For example, to multiply corresponding elements of 2-element float arrays c=a*b, one could use the following 3DNow! macros: Despite the higher-level abstraction, with appropriate care in specifying the access of variables, each of the macros results in precisely one instruction. Of course, the code also will be faster if the float arrays start on 64-bit aligned addresses, the tight dependence structure of the sequence is interleaved with or near other code that the Athlon can use to fill pipeline slots, etc.
Scc uses quite a few very aggressive optimization techniques to ensure that the output code is fast, including a sophisticated pipeline timing model and use of a modified exhaustive search combined code scheduler, register allocator, and addressing mode selector. However, often the Sccgenerated code can be improved somewhat by hand editing -for example, by inserting prefetch instructions or applying information about data values (e.g., integer compares can be used for float comparisons if the signs of the values are known). In some relatively rare cases, it is just as easy to write the parallel code directly using the macros as it would be using Scc. To aid in those cases, we also have developed a code rescheduling tool that can automatically reschedule sequences of the macros to improve pipelining.
The choice of Athlons for KLAT2 was largely inspired by our long experience with SWAR in general and with MMX, 3DNow!, and SSE in particular. Although Pentium III SSE theoretically offers comparable performance to that of 3DNow! on an Athlon, there are a variety of minor differences that make our compiler technology and benchmarking strongly favor 3DNow! on the Athlon. Aside from our compiler technology being far better tuned for 3DNow! than SSE, the AMD processors make pipeline bubbles less likely and less harmful. One reason is the difference between a single 128-bit SSE pipeline and two 64-bit 3DNow! pipelines; another is the more aggressive rescheduling done by the K6-2 and especially by the Athlon.
Although we view ScaLAPACK more as a library or benchmark than as a full code, we have created a 3DNow!-aware version that complies with the rules for the LINPACK benchmark as specified at http://www.top500.org/, and that code provides a good basis for performance comparison. Using 32-bit single-precision 3DNow!, KLAT2's 64 700MHz Athlons achieve a very impressive 64.459 GFLOPS on ScaLAPACK for N=40,960. (N1/2 was 13,824; see http://aggregate.org/KLAT2/ScaLAPACK/ for other details.) That translates to just over 1 GFLOPS per processor or 1.44 FLOPs/clock cycle, including all the communication overhead of ScaLAPACK using our FNN-aware version of LAM MPI. It also is less than $0.64 per MFLOPS.
The way we made ScaLAPACK 3DNow!-aware was very simple [8] . ScaLAPACK uses BLAS and BLACS, BLAS uses ATLAS [16] , and BLACS uses MPI. Most of the runtime is actually inside a single ATLAS-created routine, SGEMM. ATLAS is a remarkable tool that actually constructs many different variations on that routine and automatically benchmarks them to select the best coding. Using our tools, it took us less than three days to modify the ATLAS-optimized SGEMM to make good use of 3DNow!. The DGEMM/SGEMM performance using a 900x900 element matrix on a single 700MHz Athlon system is shown in table 1.
In fact, if we could get the same performance for ScaLAPACK that we get for single-node Although cache, TLB, and other effects prevent us from achieving the theoretical 3DNow! peak, the measured performance of the Athlon 3DNow! is substantially better than one can easily achieve with SSE. Even the IA32 legacy floating point performance is quite impressive given IA32's use of a stack model (rather than general registers) and the Athlon's GHz-enabling deep pipelines. The IA32 floating point register stack generally results in serial dependence chains that the processor must aggressively reschedule in order to fill pipeline slots well enough to achieve even one FLOP/clock.
Of course, using 3DNow! to speed-up a full-featured CFD code is much more difficult than a single subroutine of ScaLAPACK because performance depends on many more routines and the nature of the algorithm is substantially less cache-friendly. With our tools, three days of tuning was sufficient for SGEMM; it has taken us weeks to restructure the various CFD routines for 3DNow!.
Implementation and Optimization
We employed a variety of coding techniques to improve the computational efficiency of the solver. To guide our tuning, we performed computations on a 60 x 30 x 30 grid for 10 timesteps on a single processor. Using the gprof tool, we were able to exam the amount of CPU time spent on each set of routines. An example of the gprof output for the untuned DNSTool is presented in tables 2 and 3. These gprof usage profiles reveal that 90% of the CPU time is spent in the high-level routines that solve the viscous flux and perform the AUSM flux-splitting computations. Further examination revealed that among the lower-level routines that made up viscous_flux and ausm_plus, the primary subroutines were the fill_uvwT_stencil, gradient, and ausm_plus_flux routines. These three routines became the focus of our optimization efforts.
As an initial step, all of these routines were aggressively 'cleaned', focusing on removing all redundant calculations that reproduced work performed elsewhere. Repeated calculation of constant terms were removed, and wherever feasible, calculations were shifted outside the innermost loops. These steps created both a more streamlined and a more readable code, improving the usability of DNSTool.
A second key optimization came from restructuring the basic data storage of the variables to better match a cache-based memory system. The solver is employed in a series of sweeps, in the i-, j-, and k-directions, requiring the remapping of the 3-D data arrays into 1-D arrays corresponding to a single direction. The original data layout consisted of a separate 3-D array for each of the flow and geometric variables. When remapping from these 3-D arrays, the full set of the variables at one 3-D index is required to generate the values at one index point in the 1-D arrays. On a cache-based memory system, this would require fetching a distinct cache-line for each variable read. In the i-direction, the rest of these fetched cache-lines contain data elements that will be used in subsequent iterations (i+1, i+2, etc.). Unfortunately, in the other two directions (j and k), the rest of the data in the fetched cache-lines will not be used, wasting most of the memory bandwidth.
The new data layout is a 3-D array of structures, with each structure holding the relevant variables for one 3-D index point. Thus, when remapping from the 3-D array of structs, all the variables needed to calculate the values for one index point in the 1-D array are adjacent in memory. This results in only one or two cache-lines being fetched from main memory for the entire set of flow and geometry variables per 3-D index, dramatically reducing the required memory bandwidth. Additionally, when using single-precision floats, the fields within the structure were ordered so that pairs of adjacent fields could be loaded into 3DNow! registers with individual load instructions, see figure 7 . The pairs were selected so that the calculations on them also could be performed in parallel. As shown in the figure, this new data vector has the form
. Combined with some effort to align the data storage with the 64-byte cache-lines of the Athlon, this more efficient use of the memory and cache effected across-the-board improvements whether using double-precision or single-precision.
The most costly routine in DNSTool was the fill_uvwT_stencil, whose purpose was to extract initial memory layout 
, and compute the intermediate values needed to for the flow gradient calculation on each face as per equation 33. This routine is used repeatedly, with slightly different forms depending on the direction of the computational sweep. We aggressively rewrote the computations in terms of 3DNow!, taking particular advantage of the fast reciprocal for the Ñ -division and our new data structure. Efforts to implement pre-fetching of the variables only yielded minimal improvementsthe Athlon scheduler is apparently good enough that only extremely aggressive pre-fetching will improve on its performance. Overall, these combined enhancements resulted in a sevenfold reduction in the single processor CPU time in the long-stride directions (j and k), and a fivefold reduction in the i-direction.
Improvement to the subroutine gradient was achieved in much the same manner as the fill_uvwT_stencil routines, through the combination of the improved data structure and converting the calculation of the finite-volume circulation theorem to 3DNow! macros. Equation 33 illustrates why the volume was included in the flow variable vector of the new data structure, as the volume is a component in many of the flow variable computations.
The final routine that was aggressively optimized with 3DNow! macros was ausm_plus_flux. In this routine, two important features were the fast reciprocal square root function (particularly useful for calculating the speed of sound, Û º Ü Þ Ý ß á à ã â ), and replacing if-then structures with bitmasking techniques (for instance in the choice of ä 5 å and ä ae definition in equation 24) . It should be noted that the 3DNow! coding done in DNSTool was strictly in the form of Cmacros; no assembler code was explicitly included in the code. Neither were the Athlon extensions to 3DNow! applied to the code; the 3DNow! code we used is compatible with the K6-2 as well as Athlon processors. Although the SWARC compiler was useful as a reference starting point for 3DNow! coding of DNSTool routines, as it turned out, we did not directly use any code generated by it. We did use our code rescheduler to improve some of the 3DNow! code in DNSTool. Throughout the 3DNow code, care was taken to optimize register usage; 3DNow! has only 8 register names, but careful use of these 8 allows the Athlon's register renaming logic to make better use of its 88 physical registers. The new data structure also improves register use by reducing the pressure on the IA32 registers that are used for addressing. The vector-array allows a single base address to be kept in an IA32 register and indexed by short constants instead of forcing the IA32 registers to juggle six or more apparently independent pointers, one for each individual variable array.
The results of this effort, including the speed-up due to switching from double-precision to single-precision floating-point operations and including 3DNow! macros, are show in tables 4 and 5. As can be seen, the fill_uvwT_stencil routines have dropped precipitously to the 9, 10, and 11th most CPU-costly routines. The gradient subroutine is now the most costly, but its CPU time has fallen from 5.24 seconds to 1.73 seconds. Numerous routines in which no 3DNow! optimizations were used saw speed-ups due to the aggressive removal of redundant calculations, the new data structure, and the improved cache-memory alignment. The overall computational time fell by a factor of 3, from 50.4 to 17.7 seconds.
For additional comparison, we have included in tables 6 and 7 the double-precision results, which eliminate the advantages of moving from 64-bit to 32-bit variables, including making the 3DNow! macros useless. As can be seen, there is considerable improvement in the code simply from 'cleaning', cache-memory alignment, and the new data structure. 
Results
The primary objective of this paper is to demonstrate the computational improvements yielded from the implementation of the various techniques detailed in the previous sections. The simulation results obtained so far are not sufficient for drawing conclusions about the flow physics in the lowpressure turbine; however, CFD results presented do indicate the complexity and realism of the engineering problem involved in this project and potential for performing CFD simulations on KLAT2-like clusters. The cost of KLAT2 is then itemized. Combining the runtimes obtained with the cost of building KLAT2 demonstrates the cost effectiveness of our approach.
Simulation Results
The particular simulation we have focused on for this paper is the direct numerical simulation of the flow over a single turbine blade. This research is connected to NASA's Low-Pressure Turbine Physics program, which aims to improve the efficiency of turbomachinery in aircraft engines. Our specific concentration is understanding the turbulent cascade, i.e. the process by which the flow over the blade switches from laminar to turbulent flow conditions. This transition is not wellunderstood, but its properties are often critical to the performance of many aerospace systems, including turbomachinery. Fundamental understanding of the cascade through numerical simulation requires the use of computationally-intensive models such as LES and DNS, since the more empirically based, lessdemanding approaches such as RANS cannot properly model laminar-turbulent transition. For the full turbine cascade, complex geometric effects must be taken into account, requiring dense grids in certain critical regions. For these reasons, the turbine blade flow is an ideal candidate for testing on a cluster using DNSTool.
The selected grid configuration for the CFD computation has been used in both twodimensional and three-dimensional research at the University of Kentucky on the turbine cascade [15] . As can be seen in figure 8 , the grid is curvilinear, representing an inflow region from the nozzle into the turbine, through the turbine blades, and then an outflow region. The overall grid is 400 x 200 x 200, or 16 million grid points. The flow over the blade is considered periodic in both the j-and k-directions, with the exception of the j-direction between the two solid blade boundaries. The flow is subsonic throughout, with a transitional region occurring somewhere on the blade surface depending on the precise flow parameters.
For the purposes of timing, we ran steady-state simulations of the flow. The results of these simulations are presented in figures 9 and 10 for an inflow Mach number of 0.1. This solution combined with homogeneous turbulence is an initial condition for unsteady turbulent simulations. The results for the unsteady simulation are as of yet preliminary; even with the superior performance of KLAT2, a single unsteady simulation would require eight weeks of dedicated usage. A technical point is that the grid used in these sample calculations is probably not sufficiently fine for a strict DNS; rather, the presented results are more properly QDNS. Evidence suggests that our grids are within a factor of 5 of a strict DNS computation-for comparison, the Flow Physics and Computation Division of Stanford University claims a DNS computation for a similar problem domain using 86 million grid points (http://www-fpc.stanford.edu/).
The Cost of KLAT2
KLAT2's cost is somewhat difficult to specify precisely because the most expensive components, the Athlon processors, were donated by their manufacturer, AMD (Advanced Micro Devices). In the interest of fairness, we have quoted the retail price for these processors as found on Multiwave's WWW site on May 3, 2000. Similarly, although most applications (including those discussed in this paper) use only 64 nodes, KLAT2 also has 2 "hot spare" nodes and an additional switch layer that are used for fault tolerance and system-level I/O; because we consider these components to be an integral part of KLAT2's design, we have included their cost. We also included 16 spare NICs and several spare surge protectors. Due to University of Kentucky purchasing guidelines and part stocking issues, purchases from the same vendor were sometimes split in odd ways and there were various inconsistencies about how shipping was charged; although the vendor totals are correct, we have had to approximate the component cost breakdown in these cases.
In summary, KLAT2's total value is about $41,200, with the primary costs being roughly $13,200 in processors, $8,100 in the network, $6,900 in motherboards, and $6,200 in memory. We believe the cost breakdown in figure 11 to be a conservative upper bound on the full system cost.
Some people have taken issue with our accounting of the assembly cost of KLAT2 because of the large number of student volunteers. However, people often forget that, as a university, it is our job to inspire and train students. In fact, we could have purchased the PCs as assembled systems without significant extra cost, but buying components allowed us to build precisely the configuration we wanted and, more importantly, was pedagogically the right thing to do. Few projects inspire students like being able to contribute to the construction of a new type of supercomputer, so we encouraged as many people as possible to participate. The construction of the PCs was easily accomplished within a single day (April 11, 2000) with no more than six students working at any time and most of the students having their first experience in PC construction; wiring the cluster took two people only a few hours (thanks to the color-coded cables). The cost of the soda and pizzas may seem a flippant accounting, but are actually a good match for what it would have cost for a well-experienced research assistant to assemble everything without additional help or distractions. 
Computational Results
The current GFLOPS and $/MFLOPS are given in tables 8 and 9, while figure 12 shows the distribution of the wall-clock times over 3500 timesteps. The relatively rare very slow timesteps occur due to an unfortunate resonance between minor load imbalances and TCP/IP network traffic; it should be possible to eliminate these glitches, but their contribution to total runtime was not sufficient for us to make that a priority. The presented computations are for a DNSTool run on a 400 x 200 x 200 turbine blade grid (16 million grid points). This grid point count excludes points outside the physical flow region needed for boundary conditions or block overlap. The presented results are the average for 1000 timesteps for a long time integration. The walltime is maximum over the whole cluster for each iteration. All of the computations are double precision on the 64-processor KLAT2. Results are presented for both the isolated numerical computation and the overall computation including I/O.
The GFLOP count is measured by first doing the identical computation on a SGI Origin 2000 and using SGI performance tools; specifically, using mpirun -np 64 ssrun -ideal to run the code, followed by applying prof -archinfo ideal.* to the output, which yields the total FLOP count for the computation. Dividing this number by the walltime expended in the KLAT2 simulation yields the given GFLOP/s results.
We do not think that we have yet achieved optimal performance for this cluster-CFD code combination. In addition to the potential improvements discussed in section 3, we have not done much work on the optimal grid configurations for this cluster, either in terms of the construction and load distribution of the subgrid blocks or the overall best grid density. The I/O also remains a relatively slow part of the code, playing a large role in the difference between the overall and peak performance. Even without these potential improvements, the overall performance of DNSTool on KLAT2 is outstanding.
Conclusion
In this paper, we have described the techniques and tools that we created and used to optimize the performance of the combination of DNSTool and KLAT2. This tuning of system performance involved both recoding of the application and careful engineering of the cluster design. Beyond restructuring of the code to improve cache behavior, we used various tools to incorporate 3DNow! into the single-precision version, and even used a GA to design the cluster network.
Using similar techniques for KLAT2 to execute ScaLAPACK yielded better than $0.64 per MFLOPS single precision, but ScaLAPACK is a relatively easy code to speed up. Although ScaLAPACK requires substantial network bandwidth, only a single computational routine (DGEMM or SGEMM) needed to be optimized to obtain that record performance. In contrast, achieving $2.75 per MFLOPS double precision and $1.86 per MFLOPS single precision for DNSTool requires many more optimizations to many more routines and generally taxes the machine design much more severely.
The high computational complexity of the DNS approach to CFD provides important and unique abilities: DNS yields high-quality results for problems that faster, more approximate, CFD techniques cannot yet handle. Thus, the low cost of running DNS on machines like KLAT2 can make a qualitative difference in the range of CFD problems that can be approached, either to directly solve them or to design computationally cheaper models that can suffice for practical engineering applications.
In much the same way, the techniques and tools that we developed and used to engineer KLAT2, and to tune the performance of DNSTool for KLAT2, also represent qualitative advances over previous approaches. For example, 3DNow! was originally intended for 3D graphics in video games, but our tools make it significantly more accessible for scientific and engineering codes. Even if it is not optimized for the particular communication patterns that will be used, the GA-designed FNN provides low latency and high bisection bandwidth; the ability to tune for a specific application's communication patterns, generating deliberately asymmetric designs when appropriate, is a unique additional benefit.
Most importantly, the optimized DNSTool and the tools to design and similarly optimize your own clusters with specific applications either are or will be freely available from http://aggregate.org/. Our primary direction for future work is to take these tools one giant step further, creating easily-replicated personalized turnkey superclusters (PeTS) that will provide scientists and engineers with supercomputer performance for specific applications without requiring them to become experts in computing.
Acknowledgements

