and Los Alamos National Laboratory, where it's now located. Aside from its performance, Roadrunner has two distinguishing characteristics: a very good power/performance ratio and a "hybrid" computer architecture that mixes several types of processors. By November 2008, the traditionally architected Jaguar computer at Oak Ridge National Laboratory was tied with Roadrunner in the performance race, but it requires almost 2.8 times the electric power of Roadrunner. This difference translates into millions of dollars per year in operating costs.
Processors Are Changing and for Good Reason
As in any business, profit is the driving force in processor design. Without an improved user experience to generate sales, companies have no economic incentive to bring new designs to market. Traditionally, this improvement in user experience came through performance advances in general-purpose processors (GPPs). However, this path has bifurcated: the rapid proliferation of embedded processors has created demand for specialized low-power designs, and a variety of challenges to traditional means of improving performance is causing designers to rethink GPPs. Here, we illustrate some of these challenges to provide a rationale for the changes coming to hardware and software, which John Manferdelli discusses in his article, "The Many-Core Inflection Point for Mass Market Computer Systems." 2
One traditional approach to improving processor performance is simply to increase the processors' clock frequencies. However, because power consumption is proportional to clock frequency, the heat density per fixed area of the processor chip increased to the point where simple cooling methods were inadequate. Designers are now lowering processors' clock frequencies, negating the "free" performance improvements that applications were getting from successive generations of faster single processors.
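The proportionality between power and clock frequency can be made concrete with the standard first-order model of dynamic power in CMOS logic. This formula is a textbook approximation added here for illustration, not something taken from the article; the symbols (activity factor, switched capacitance, supply voltage, die area) are the usual ones and are assumptions of this sketch.

```latex
% First-order dynamic power of a CMOS chip (textbook approximation):
%   \alpha : fraction of gates switching per cycle (activity factor)
%   C      : total switched capacitance
%   V      : supply voltage
%   f      : clock frequency
P_{\mathrm{dyn}} \approx \alpha \, C \, V^{2} f
% Heat density over a die of area A:
\frac{P_{\mathrm{dyn}}}{A} \approx \frac{\alpha \, C \, V^{2} f}{A}
```

At a fixed voltage and die area, raising the clock frequency raises heat density in direct proportion. In practice, higher frequencies have often required higher supply voltages as well, which makes the growth steeper than linear.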
Another technique for transparently improving performance is instruction-level parallelism, common in GPPs since the late 1990s. These superscalar processors simultaneously execute multiple instructions on redundant functional units and must automatically detect and avoid data dependencies between sequential instructions. Superscalar processors also employ pipelined, speculative, and out-of-order execution that lead to a combinatorial number of gates related to dependency checking, branch prediction, and instruction scheduling. 3 These techniques create processor designs that are difficult to design and verify; yet, the nature of the instruction stream that executes ultimately limits the promised performance improvements. Designers are moving back to simpler instruction-scheduling models and are using the newly freed space for different purposes. 4
The final architectural challenge is the growing discrepancy between the time required to execute an instruction and the time required to retrieve data from memory. For most current processors, we can expect a single instruction to take 10 or fewer clock cycles, whereas fetching data from main memory might take several hundred cycles. This problem is compounded when the processor executes multiple instructions simultaneously. Traditionally, programmers have worked around this difference by using larger hierarchies of fast cache memory, which act as a bridge between the processor and the main system memory. However, this cache memory is expensive and complex, and the deeper hierarchies that appear in contemporary processors are increasingly difficult to verify for correctness. Designers are beginning to introduce new memory subsystems to processors, including crossbar switches and programmer-controlled local storage.
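A rough model shows why the gap between instruction time and memory time matters so much. The sketch below uses the common average-memory-access-time estimate (hit time plus miss rate times miss penalty) for a single cache level. The function name and every cycle count are illustrative assumptions chosen to match the "several hundred cycles" scale described above; they are not measurements of any particular processor.

```python
def average_access_cycles(hit_cycles, miss_penalty_cycles, hit_rate):
    """Expected cycles per memory access for one cache level.

    hit_cycles:          cycles to serve an access from the cache
    miss_penalty_cycles: extra cycles to fetch from main memory on a miss
    hit_rate:            fraction of accesses served by the cache (0..1)
    """
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    miss_rate = 1.0 - hit_rate
    return hit_cycles + miss_rate * miss_penalty_cycles


# Assumed, illustrative values: a 2-cycle cache hit and a 300-cycle trip
# to main memory.
HIT_CYCLES = 2
MISS_PENALTY_CYCLES = 300

if __name__ == "__main__":
    for rate in (0.99, 0.95, 0.80):
        cycles = average_access_cycles(HIT_CYCLES, MISS_PENALTY_CYCLES, rate)
        print(f"hit rate {rate:.2f}: about {cycles:.0f} cycles per access")
```

Under these assumed numbers, even a small drop in hit rate multiplies the average cost of a memory access, which is why data locality receives so much attention in the rest of this article.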
Looking beyond processor design, another broad category of challenges deals with the increase in processor power consumption as the fabrication process size decreases. Companies currently use a 45-nm manufacturing process in which SRAM cell area is measured in fractions of square micrometers. 5 Many complicated physical effects exist at this scale (see, for example, Yannis Tsividis's book, Operation and Modeling of the MOS Transistor 6), but the net effect is that transistors leak significant amounts of power. Advances in material sciences should help stem this problem, but the shrinking feature size implies that we'll soon have more transistors on a chip than we can afford to power simultaneously. It's expected that fine-grained power management features will appear in processors (and might even be exposed to programmers), leading to heterogeneous performance across even homogeneous processors.
In the face of these significant challenges, processor designers must still meet the economic driver of providing a better user experience. Rather than pursue even more complex single processors, they place multiple copies of a processor onto one chip to form a multicore processor. Each of these cores tends to be slower and less complex than the single processors of just five years ago, but they provide performance increases in aggregate. However, this trend shifts more work to the application programmer. Not only do programmers have to make up the loss in single-core performance through better optimization, but they must also explore ways of parallelizing their applications to take advantage of more cores. Although this is nothing new for the scientific computing community, it's a fundamental shift in the broader software industry.
In addition to moving to multicore designs, companies have introduced chips that contain a mix of general- and special-purpose cores. These heterogeneous multicore chips represent the most significant challenge (and opportunity!) for software developers. In contemporary heterogeneous multicore chips, special-purpose cores tend to be short-vector processors, which are especially useful in computer graphics applications. Whereas GPPs have had some form of vector operations available for some time (SSE, AltiVec, and so on), offloading this workload to a standalone processor allows for greatly increased parallelism. Scientific programmers certainly use vector processors, but expect to see much more specialized processors appearing in the future, such as engines for cryptography, compression, video decoding, and so on. These will pose special problems for the high-performance computing (HPC) community, both in terms of how (or if) to use them and in terms of managing the power they draw when not in use.
Hardware Changes that Disrupt Software Development
The changes occurring in computer architectures are creating a ripple effect in the software development arena, even for traditionally serial applications. The availability of specialized processors forces developers to decompose their programs across functional units. Deep memory hierarchies, especially coupled with disjoint address spaces, require special attention to data motion costs. Short-vector processors constrain data structure design, so developers must look to parallelism, often in terms of multithreading, for performance gains.
Even in the HPC community, where programming has always involved some level of adaptation to distinctive hardware, programs must evolve to new levels of complexity. We can no longer imagine that all processors have equal access to resources such as memory, network, or I/O: developers must schedule tasks on the processors with the best balance of functionality and resource access. Power management considerations can become explicit in programs, such as putting an idle functional unit into a reduced power state. Moreover, as system sizes continue to increase, reliability and resilience become significant issues. Can we detect and recover from hard, soft, and even silent data corruption? How do we restart calculations on systems with a mean time between interrupts measured in hours?
Compounding these technical challenges is the dire lack of parallelism experience among the general software developer community. The state of the tools available to developers makes this problem worse: threading libraries and primitives added to fundamentally serial languages are challenging to use. Hardware vendors have recognized these problems and are working on a variety of solutions. In the hardware itself, transactional memory could remove some of the challenges of thread programming, and innovations such as scout threads might provide more transparent performance increases.
Hardware vendors are also creating software solutions to address these problems, from new compiler technologies to libraries and language extensions. But for the most part, these tend to be proprietary, useful on only one vendor's hardware. One exception to this is the Khronos Group's OpenCL standard, 7 which is an API for programming attached accelerator processors, such as graphics processing units used for general-purpose computation (GPGPUs). Some companies have pursued long-term strategies such as establishing research labs at universities to directly tackle some of today's challenging problems and creating a stream of talented and experienced graduates. Vendors have also introduced forms of declarative programming into their tools to help programmers focus on what should happen, rather than how the computer should execute the task.
Although the intensity of these activities is encouraging and will certainly bring advances, they're not a sufficient solution for HPC. Most of this work focuses on programming a single chip or a desktop: large clusters of these complicated nodes don't command enough market share to warrant the investment. Another, somewhat subtle challenge for the HPC market is a forced change in programming languages. Hardware vendors have focused most of their investment for new software tools on C/C++ compilers. Although some of these developments will trickle down into Fortran tools, it's unlikely that Fortran will be well-suited to take advantage of the new hardware.
A Comprehensive Approach
Perhaps the greatest challenge that software developers face at this time can be simply termed as diversity. Hardware designers have provided a dizzying array of options, and each one could encourage several different programming approaches. Eventually, this period of rapid innovation will settle to a smaller set of stable technologies, but we don't have the luxury of waiting until that happens. Fortunately, the HPC community already has a tool that's flexible enough to confront most of our programming challenges in Roadrunner (www.lanl.gov/roadrunner).
At the top level, LANL configured Roadrunner as a relatively traditional cluster of clusters that uses InfiniBand for the interconnect and supports HPC-standard Message Passing Interface (MPI) communications. At the node level, however, Roadrunner becomes unique. Figure 1 provides a conceptual node schematic. The node's root (at least with respect to the network) is a blade server with two dual-core AMD Opteron processors. Attached to that are two accelerator blade servers based on the IBM PowerXCell 8i processor. This processor conforms to the Cell Broadband Engine architecture specification that Sony, Toshiba, and IBM created 8 and is itself a heterogeneous multicore chip. The PowerXCell 8i has a general-purpose core (the PowerPC processor element [PPE]) and eight short-vector engines (the synergistic processing elements [SPEs]). The PPE has a traditional two-level cache, whereas the SPEs use a programmer-managed local store: 256 Kbytes for program text and data. The programmer explicitly moves data from main memory to the local store using asynchronous communication calls, allowing truly overlapped communication and computation. Each SPE contains 128-bit registers and a statically scheduled, in-order, dual-issue instruction pipeline and can achieve 12.8 Gflops/s in double precision. This gives each Roadrunner node a more than 400 Gflops/s peak.
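The node's peak figure can be checked with a few lines of arithmetic. The sketch below uses the per-SPE rate and SPE count given above. The number of PowerXCell 8i processors per node (four) is not stated in this paragraph; it is assumed here from the article's statement that Roadrunner has equal numbers of PowerXCell processors and Opteron cores. The Opterons' own contribution is left out because the text doesn't give their peak rate. All variable names are illustrative.

```python
# Figures taken from the article text.
SPE_PEAK_GFLOPS = 12.8        # double-precision peak per SPE
SPES_PER_POWERXCELL = 8       # short-vector engines per PowerXCell 8i
PPES_PER_POWERXCELL = 1       # general-purpose core per PowerXCell 8i
OPTERON_CORES_PER_NODE = 2 * 2  # two dual-core Opteron processors

# Assumption: one PowerXCell 8i per Opteron core ("equal numbers").
POWERXCELLS_PER_NODE = OPTERON_CORES_PER_NODE

# Peak contributed by the SPEs alone (Opteron and PPE peak not included).
powerxcell_peak_gflops = SPES_PER_POWERXCELL * SPE_PEAK_GFLOPS
node_spe_peak_gflops = POWERXCELLS_PER_NODE * powerxcell_peak_gflops

# Total programmable processing elements on one node.
processing_elements = OPTERON_CORES_PER_NODE + POWERXCELLS_PER_NODE * (
    PPES_PER_POWERXCELL + SPES_PER_POWERXCELL
)

if __name__ == "__main__":
    print(f"SPE peak per PowerXCell 8i: {powerxcell_peak_gflops:.1f} Gflops/s")
    print(f"SPE peak per node:          {node_spe_peak_gflops:.1f} Gflops/s")
    print(f"Processing elements/node:   {processing_elements}")
```

Under these assumptions the SPEs alone supply roughly 410 Gflops/s per node, consistent with the "more than 400 Gflops/s" figure, and the element count comes to the 40 processors per node that the article mentions later.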
Roadrunner's design lets developers gracefully transition their existing applications to the new architecture: MPI applications can run unchanged on the Opteron cluster of clusters. Although this makes a nice starting point for developers, such applications can access only a small fraction (3.5 percent) of the machine's peak performance. Accelerating these applications requires identifying portions of the code to move to the PowerXCell processors. As provisioned, Roadrunner has equal numbers of PowerXCell processors and Opteron cores and the same amount of memory available to each. This admits the conceptually simple pairing of one Opteron core with one PowerXCell processor. Developers can incrementally accelerate their applications by moving more functions from the Opteron to the PowerXCell.
Moving a function to the SPE can seem a daunting task at first. The small local store and the need to explicitly move data between it and the main memory are constraints that haven't been relevant for some time in general-purpose programming. Confronting the data structure and alignment implications of the vector-only SPE instruction set can shake the confidence of even the most skilled scalar instruction programmers. The key observation to overcoming these challenges is simply that the SPE makes many operations explicit for programmers. It's not that a GPP isn't doing these same tasks; it just has more hardware with which to do them. SPE programmers must be cognizant of data locality and will see data motion in explicit instructions. But awareness of these issues is exactly what programmers require to achieve high performance on cache-based GPPs! We invariably obtain performance increases on our GPPs when we apply the optimization lessons learned from porting code to the SPE. This is a substantial benefit that will outlive any particular architecture.
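The explicit data movement described above is often organized as double buffering: while one chunk is being processed in the local store, the transfer of the next chunk is already in flight. The following is a minimal sketch of that control flow in plain Python, not Cell SDK code. The worker thread stands in for an asynchronous transfer engine, and the names (fetch, compute, LOCAL_STORE_ITEMS) and the tiny buffer size are hypothetical choices for illustration. It shows the structure of the technique rather than its performance.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical capacity of one "local store" buffer, in items.
LOCAL_STORE_ITEMS = 4


def fetch(main_memory, start, count):
    """Stand-in for an asynchronous transfer: copy one chunk out of
    'main memory' into a fresh local buffer."""
    return list(main_memory[start:start + count])


def compute(chunk):
    """Work done entirely on data already in the local buffer."""
    return sum(x * x for x in chunk)


def sum_of_squares_double_buffered(main_memory, chunk_items=LOCAL_STORE_ITEMS):
    """Sum of squares, fetching chunk i+1 while computing on chunk i."""
    total = 0
    if not main_memory:
        return total
    with ThreadPoolExecutor(max_workers=1) as transfer_engine:
        # Start the first transfer before entering the loop.
        pending = transfer_engine.submit(fetch, main_memory, 0, chunk_items)
        for start in range(0, len(main_memory), chunk_items):
            current = pending.result()  # wait for the chunk needed now
            next_start = start + chunk_items
            if next_start < len(main_memory):
                # Kick off the next transfer before starting to compute.
                pending = transfer_engine.submit(
                    fetch, main_memory, next_start, chunk_items
                )
            total += compute(current)
    return total


if __name__ == "__main__":
    print(sum_of_squares_double_buffered(list(range(10))))
```

The same pattern applies on a cache-based GPP in a less visible form: processing data in blocks that fit in cache and touching the next block early are the implicit counterparts of the explicit transfers shown here.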
Although accelerating applications through function offload provides an expedient path to performance, developers realize the machine's research potential when they start with a fresh look at application design. Rather than look at the SPEs as Opteron accelerators, we can reverse the model and think of the Opterons as communication managers for the PowerXCell processors. Although it's easiest to think of every SPE running the same instructions on different portions of data, they're independently programmable and can communicate with each other directly. This admits a variety of streaming and process ganging models. More opportunities arise when we discard the Opteron-PowerXCell pairing and find ways to distribute the work of one process across all 40 processors on the node. By treating the node as a many-core processor, developers can better understand the mismatch between high on-processor performance and the slow access to off-processor resources that we'll see in many-core designs.
Developers have exploited all of these design techniques when creating high-performance applications for Roadrunner. As a result of a peer-reviewed competition, LANL awarded time on Roadrunner to 10 research teams for a variety of open-science applications. These applications ran the gamut of scientific fields (and scales!), from simulations of cellulosomes and viral phylogenetics to supernovae light curves and the large-scale structure of the universe. In addition to the direct scientific contributions that these teams have made, we're studying the application development process in each team to better inform the next generation of application and tool developers.
Bringing a new large-scale computational resource online takes considerable time and effort. This fact alone buffers the HPC community from the most rapid technological changes, as technology choices are often made well in advance of system delivery. Although this provides some continuity for the HPC community, we shouldn't be complacent about the changes occurring in the broader market. Computer technology is changing, and although we can't predict or prescribe which technology path will eventually dominate, we can be assured that we'll be programming differently in the future. If nothing else, the lines of code dedicated to explicit control of memory, functional unit power, resilience, and data communication will soon outnumber the lines of code doing the real computation.
To make this transition as smooth as possible, we must start by preparing ourselves. As you design an application, think about the implications of running on heterogeneous or hybrid architectures. Try to maximize your use of the short-vector and multithreaded processors that we have today, without assuming that a compiler will take care of this for you. Think about the costs of moving data around the system, whether that's between local memory and a processor or between nodes in a parallel system.
The next step is to experiment and add to the community's body of knowledge. Write applications for alternative processors, such as FPGAs or Cell processors. Learn to program a small, accelerated system, such as a workstation with a GPGPU or a cluster of Sony PlayStation consoles. Experiment with functional programming languages, such as Clojure or Haskell, to see how this class of languages can provide a powerful abstraction from the hardware. These sorts of investments prepare us for the future, but they can also pay dividends in terms of better utilization of today's technology.
Finally, we must engage the broader community to ensure that standards and tools support HPC needs. People are just beginning to consider how to develop tools that can alleviate some of the new burdens placed on programmers; now's the time to share the experience we've gained from years of programming parallel systems. As the industry struggles with the rapid rise of parallel computing, we can act as mentors, helping avoid the mistakes that we've made, while looking for the insights that will come from a fresh perspective on our long-standing problems.
