The purpose of this paper is to summarize recent performance results from an important ASCI-related application and to speculate on how trends within the computer industry and in computer architecture relate to these results.
on the order of a million cells, a canonical goal of ASCI is to do 3-D, billion-cell problems. Adaptive mesh refinement (AMR) and unstructured grids will become increasingly important, implying both irregular memory access and non-uniform inter-processor communication patterns. Such methods, coupled with current discretization schemes, also imply relatively low ratios of floating-point operations to memory operations, typically less than one FLOP per memory reference, on average.
ASCI Performance Modeling Research. We are addressing performance issues by implementing a comprehensive program to characterize ASCI algorithms and application codes. The purpose of this work is to gauge the performance progress of ASCI codes, develop scalable strategies for code development, reveal scalability issues in hardware and software, allow for technical planning for future ASCI architectures in the context of a five-year estimate of commercially-available technologies, and engage industry and university partners in application-driven ASCI performance problems. An additional goal is the development and dissemination of a suite of "Compact Applications" in which critical ASCI performance problems are embodied in open, compact codes.
Scalability analysis must address two key questions: (1) What single-processor computational efficiency can we expect on future microprocessors; and (2) What parallel efficiency can we expect on future systems? Insight into both of these is gained through development of models that incorporate key characteristics of both the applications and the architectures.
ASCI Applications. Although many ASCI codes will simulate the same physics and chemistry effects, current code development projects at Los Alamos are partitioned according to mesh strategy, discretization scheme, and programming style. For example, the CRESTONE project is oriented towards Eulerian hydrodynamics using structured Cartesian grids with cell-by-cell AMR, written in vectorizable Fortran 77. In contrast, the BLANCA project focuses on structured, Arbitrary Lagrangian-Eulerian (ALE) techniques written with an object-oriented C++ framework that separates the physics and mesh manipulation from the parallel programming implementation. Other projects use arbitrarily connected, unstructured meshes and Fortran 90.
Particle transport via both Monte Carlo and deterministic methods is an important component of the ASCI workload, accounting for upwards of 50-80% of simulation time on current DOE systems. As such, a great deal of research has been devoted to improving the performance of codes that carry out this kind of simulation [3]. In recent years there have been
Modeling Scalability. We recently developed a performance model for algorithms consisting of multiple wavefronts partitioned and pipelined on multidimensional processor grids [9]. We applied this model to Cartesian-coordinate, deterministic particle transport, as abstracted in the ASCI Compact Application "SWEEP3D" [10]. The algorithm in SWEEP3D is inherently recursive; in a 2-D MIMD domain decomposition using a message-passing model, it generates wavefront-like "sweeps" through the processor grid. Parallel efficiency is improved by logically stacking additional work from the third dimension and other (nonspatial) discretized variables, although at the expense of additional communication. Overlap of communication and computation occurs at some (but not all) steps in the simulation, and our model captures both this overlap and all aspects of the parallel efficiency-communications tradeoff.
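The tradeoff between pipeline efficiency and communication can be illustrated with a deliberately simplified sketch. This is not the model of [9]: it assumes a single sweep from one corner of a `px` by `py` processor grid, equal-sized work blocks, two boundary messages per block, and no overlap of communication with computation. All function names and parameter values are hypothetical.

```python
def wavefront_steps(px: int, py: int, blocks: int) -> int:
    """Pipeline steps for one sweep across a px-by-py processor grid.

    The far-corner processor waits (px - 1) + (py - 1) steps for the
    wavefront to arrive, then works through its own `blocks` units.
    """
    return (px - 1) + (py - 1) + blocks


def pipeline_efficiency(px: int, py: int, blocks: int) -> float:
    """Fraction of pipeline steps in which a processor does useful work."""
    return blocks / wavefront_steps(px, py, blocks)


def sweep_time(px: int, py: int, blocks: int,
               work_per_proc: float, t_msg: float) -> float:
    """Toy time for one sweep, with no communication/computation overlap.

    Assumption: each step computes one block (work_per_proc / blocks)
    and then sends two boundary messages, one per downstream neighbor.
    """
    t_block = work_per_proc / blocks
    return wavefront_steps(px, py, blocks) * (t_block + 2.0 * t_msg)


if __name__ == "__main__":
    # Hypothetical 100 x 200 grid (20,000 processors); arbitrary time units.
    for blocks in (1, 10, 100, 1000):
        eff = pipeline_efficiency(100, 200, blocks)
        t = sweep_time(100, 200, blocks, work_per_proc=1.0, t_msg=1e-3)
        print(f"blocks={blocks:5d}  efficiency={eff:.3f}  time={t:.3f}")
```

Under these assumptions, stacking more blocks per processor shortens the relative cost of filling the pipeline, but every extra block adds messages, so the total time eventually stops improving.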
An important use of our model was to understand particle transport scalability as a function of per-processor sustained speed and of MPI latency and bandwidth on a future-generation system: a hypothetical, mesh-topology (i.e., non-clustered) 100-TFLOPS-peak machine with 20,000 processors that might be in existence around 2004. We considered both conservative and optimistic changes in CPU and network technology. Interestingly, the model showed that on a one-billion-cell problem, this application is compute bound; i.e., interprocessor communication is not the primary bottleneck (although communication does become important for smaller problem sizes).
Modeling Single-CPU Memory Performance. It was interesting for us to learn that single-processor performance is the dominant factor for SWEEP3D, because we had also been studying what factors limit single-processor performance in a variety of applications. Many codes of which we are aware achieve only 5-10% of peak performance on typical RISC microprocessors, in terms of either MFLOPS or clocks per instruction (CPI) [11-13].
Many recent studies have attempted to identify the microprocessor architectural features that lead to diminished performance relative to peak [12, 14]. Memory performance consistently stands out as a critical bottleneck in all these studies. Our own studies, in which we use a simplified empirical parameterization along with data from hardware event counters to obtain memory stall time [15], showed that on single processors of the MIPS R10000, memory stall time for SWEEP3D accounts for about 45% of total CPI. This means that if one could optimize the existing code so as to eliminate all memory stall time, the improvement would be limited to about a factor of two in execution time, which would correspond to about 80 MFLOPS per processor, or about 20% of peak [11].
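The arithmetic behind this bound can be checked with a short sketch. It assumes that the stall share of CPI equals the stall share of execution time (fixed instruction count and clock rate). The peak rate and current rate it prints are back-of-the-envelope values implied by the figures quoted above, not measurements; the names are hypothetical.

```python
def speedup_if_stalls_removed(stall_fraction: float) -> float:
    """Upper bound on speedup when a given fraction of time is removed."""
    return 1.0 / (1.0 - stall_fraction)


# Figures quoted in the text.
STALL_FRACTION = 0.45               # memory stall share of total CPI
STALL_FREE_MFLOPS = 80.0            # rate if all stalls were eliminated
STALL_FREE_FRACTION_OF_PEAK = 0.20  # that rate as a fraction of peak

speedup = speedup_if_stalls_removed(STALL_FRACTION)

# Derived values (assumptions of this sketch, not reported results).
implied_peak_mflops = STALL_FREE_MFLOPS / STALL_FREE_FRACTION_OF_PEAK
implied_current_mflops = STALL_FREE_MFLOPS / speedup

if __name__ == "__main__":
    print(f"speedup bound:          {speedup:.2f}x")
    print(f"implied peak rate:      {implied_peak_mflops:.0f} MFLOPS")
    print(f"implied current rate:   {implied_current_mflops:.0f} MFLOPS")
```

The bound of 1 / (1 - 0.45) is roughly 1.8, which is what the text rounds to "about a factor of two".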
Applying this result to the scalability study of SWEEP3D (above) leads to the conclusion that machines constructed of microprocessors reasonably expected to be in existence within the next few years may be unable to satisfy an ASCI performance goal. For example, our wavefront scalability model predicts that in order to run a billion-cell SWEEP3D problem in 60 hours, we would require 2,500 MFLOPS sustained per-node performance. This implies either considerably larger than 10% sustained performance relative to peak or what is probably an impossibly high peak rate.
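A rough check of this claim, using the hypothetical 100-TFLOPS-peak, 20,000-processor machine described earlier, is sketched below. It assumes one processor per node, so that per-node and per-processor rates coincide; that assumption and the variable names are not from the original analysis.

```python
# Figures quoted in the text.
MACHINE_PEAK_MFLOPS = 100.0e6     # 100 TFLOPS expressed in MFLOPS
PROCESSORS = 20_000
REQUIRED_SUSTAINED_MFLOPS = 2_500.0
TYPICAL_FRACTION_OF_PEAK = 0.10   # upper end of today's 5-10% range

# Assumption: one processor per node.
peak_per_processor = MACHINE_PEAK_MFLOPS / PROCESSORS

# Fraction of peak needed on the hypothetical machine.
required_fraction = REQUIRED_SUSTAINED_MFLOPS / peak_per_processor

# Peak rate needed if codes still sustain only 10% of peak.
required_peak_at_typical = REQUIRED_SUSTAINED_MFLOPS / TYPICAL_FRACTION_OF_PEAK

if __name__ == "__main__":
    print(f"peak per processor:        {peak_per_processor:.0f} MFLOPS")
    print(f"required fraction of peak: {required_fraction:.0%}")
    print(f"peak needed at 10%:        {required_peak_at_typical:.0f} MFLOPS")
```

Under these assumptions the target requires either sustaining about half of peak on a 5-GFLOPS processor or building a processor with a peak of about 25 GFLOPS, which are the two alternatives the paragraph above names.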
Furthermore, known trends in CPU speed vis-a-vis memory speed suggest that in the future, memory performance may play an even larger role than it does now, further limiting the achieved performance relative to peak [16].
More recent work in our group is oriented towards understanding processor performance in the absence of memory effects. This work shows that processor inefficiency in SWEEP3D is also probably due to a mismatch between the instruction mix in SWEEP3D and the microarchitectural characteristics of the MIPS R10000, such as its allocation of functional units [17].
COTS Technology. A key ASCI strategy is to use commodity off-the-shelf (COTS) technologies to compose larger systems in an attempt to reduce costs and improve price/performance ratios over traditional supercomputers. On the one hand, a potential problem with this is that scientific computation comprises only a small portion of desktop and server workloads and is therefore not considered to be an important driver for RISC microprocessor architecture. On the other hand, the needs of scientific and commercial workloads are not entirely orthogonal, since recent performance studies have shown that memory performance of commercial workloads is relatively worse than that of scientific workloads [18-22].
Two factors have led several prominent researchers to question whether superscalar processors will prevail as the microprocessors with the greatest commercial impact [23-25]. The first is the processing rate inefficiency often observed in today's microprocessors, coupled with the likely additional pipeline-stall, instruction-fetch, and cache-hit-rate effects brought about by increasing memory latency. The second is the extent to which processor size and power requirements will limit the applicability of superscalar processors to multimedia-based workloads that are emerging as the dominant application regime of the future. Many believe that these new workloads, which will result from a huge consumer market need for video, sound, speech, graphics, telephony, and network processing, will cause drastic change in the architecture of commodity systems [23, 25, 26]. Thus, an important question is what impact this architectural shift will have on ASCI. In particular, we wonder about the extent to which new features implemented to support media applications might still be able to support (or actually enhance) numerical simulation. Foremost among these features is SIMD processing. Several recent studies [27, 28] have shown that many important media processing kernels are highly vectorizable. In an effort to better support this kind of workload, all major microprocessor manufacturers have introduced short, vector-like extensions to their instruction sets [29]. These extensions have limited capability and usually operate only on narrow data types common to media applications. However, recent studies have demonstrated quantitatively that a more traditional, long-vector architecture is considerably faster on some media applications than the short vector extensions [30]. An important conclusion reached in these studies is that microprocessors consisting of a superscalar core tightly coupled to a CMOS-based, multipipeline vector unit can provide a scalable, cost-effective solution for desktop computing [31]. Such an architecture might well help preserve any investment in superscalar-optimized code [32] and at the same time afford significant benefit to those applications that remain vectorizable.
Efforts to understand more fully what alignment may exist between media-based and numerical workloads are underway in our group. Subsequent publications will compare these workloads both at the algorithmic level and in terms of instruction-level parallelism.
References

