We seek to understand which supercomputer architecture will be best for supercomputers at the Petaflops scale and beyond. The process we use is to predict the cost and performance of several leading architectures at various years in the future. The basis for predicting the future is an expanded version of Moore's Law called the International Technology Roadmap for Semiconductors (ITRS). We abstract leading supercomputer architectures into chips connected by wires, where the chips and wires have electrical parameters predicted by the ITRS. We then compute the cost of a supercomputer system and the run time on a key problem of interest to the DOE (radiation transport). These calculations are parameterized by the time into the future and the technology expected to be available at that point.
We find the new advanced architectures have substantial performance advantages, but conventional designs are likely to be less expensive (due to economies of scale). We do not find a universal "winner"; instead, the right architectural choice is likely to involve non-technical factors such as the availability of capital and how long people are willing to wait for results.
Tables
Table 1. Hardware (ITRS) Parameters
Table 2. Relative performance for a single sweep
Table 3. Assumed Chip Prices
Introduction
We seek to increase the throughput of supercomputers from today's 10 Teraflops to 1 Petaflops or more, an increase of 100x or more. To understand the challenges, let's begin with Little's Law from queuing theory:
\[ \text{throughput} = \frac{\text{concurrency}}{\text{latency}} \]
In this expression, concurrency is the number of activities that take place at once, latency is the time per activity, and throughput is the number of activities that are completed per unit time.
To increase throughput by 100x or more would require increasing concurrency and decreasing latency by a combined factor of 100 or more.
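Little's Law relates the three quantities as concurrency = throughput × latency. The sketch below rearranges it for throughput and shows how a 10x gain in concurrency combined with a 10x cut in latency yields the 100x target. The baseline numbers are hypothetical and chosen only for illustration.

```python
def throughput(concurrency: float, latency_s: float) -> float:
    """Little's Law rearranged: activities completed per second."""
    return concurrency / latency_s


# Hypothetical baseline, not taken from the paper.
base = throughput(concurrency=1_000, latency_s=1e-6)

# 10x more concurrency combined with 10x lower latency.
scaled = throughput(concurrency=10_000, latency_s=1e-7)

speedup = scaled / base
print(f"combined speedup: {speedup:.0f}x")
```

Any other split of the factor of 100 between the two terms gives the same combined result.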
In past generations, supercomputer throughput increased largely by speeding the clock rate of the microprocessors. By conventional reasoning, an increase in clock rate shortens the time for every part of a calculation and gives the effect of decreasing latency.
Unfortunately, this trick won't work again. Microprocessor clock rates have sped up so much that the performance of the memory subsystem has become the rate-limiting attribute.
This leaves two solutions, both of which are explored in this paper:
1. Reducing latency further by an architectural change to the memory subsystem.
2. Increasing concurrency by increasing the number of processors.
However, the "economy of scale" principle runs in opposition to any innovative solution. To be specific, innovative designs that require custom chips incur substantially higher costs.
To judge what approach is best for future generation supercomputers, we need to estimate the overall effectiveness of ASCI applications on conventional plus alternative architectures.
We will explore a benchmark radiation transport problem with a \(1000^3\) array of cells, two particle types, 1000 energy levels, and 5000 angles [ref. 1] and ask how future PIM-based and Red Storm-like MPPs would perform for this problem.
We answer the question using the following methods:
1. We use the ITRS tables to project the performance of chips over the years.
2. We compare three computer designs: Red Storm-like, pure PIM, and PIM+DRAM.
3. We estimate the cost of the hardware, the time to perform one sweep, and the cost of that sweep (the cost of the hardware per one second of the machine lifetime times the number of seconds in the sweep).
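The cost-of-sweep definition in step 3 can be sketched as follows. The three-year lifetime is the value the paper assumes later; the hardware cost and sweep time in the example are hypothetical placeholders, and the 365-day year is a simplifying assumption.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # assumes a 365-day year


def cost_per_sweep(hardware_cost: float, sweep_seconds: float,
                   lifetime_years: float = 3.0) -> float:
    """Hardware cost per second of machine lifetime times the sweep duration."""
    cost_per_second = hardware_cost / (lifetime_years * SECONDS_PER_YEAR)
    return cost_per_second * sweep_seconds


# Hypothetical example: a $100 million machine and a 60-second sweep.
example = cost_per_sweep(hardware_cost=100e6, sweep_seconds=60.0)
print(f"cost of one sweep: ${example:,.2f}")
```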
Hardware
The relevant ITRS information is summarized in Table 1. \(R\) is the serial data rate available per signal pad (pin), and \(N_{pads}\) is the number of signal pads available on the chip. \(B_{chip} = R \cdot N_{pads} / 2\) is the overall bandwidth per chip; the division by 2 converts the number of pads into the number of differential pairs. The memory capacity per chip, Gb/chip, is computed from the DRAM density in Gb/cm\(^2\) and the chip size. Cells/chip is based on the bits required for the radiation transport problem (discussed later). The chip's clock rate allows us to compute how many instructions we can expect to execute per second, at one per clock for each PIM RISC core and two per clock for superscalar MPUs.
[Table 1. Hardware (ITRS) Parameters. Source columns: ITRS 2002 update Tables 23a & 23b, 3a & 3b, 1e & 1f, 1i & 1j, and 4c & 4d.]
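As a minimal sketch of the two derived quantities described above, the functions below compute the per-chip bandwidth and the instruction rate. The numeric inputs (pad rate, pad count, clock, core count) are hypothetical placeholders, not ITRS values.

```python
def chip_bandwidth(rate_bps_per_pad: float, n_pads: int) -> float:
    """B_chip = R * N_pads / 2; two pads form one differential pair."""
    return rate_bps_per_pad * n_pads / 2


def instructions_per_second(clock_hz: float, cores: int, per_clock: int) -> float:
    """Peak instruction rate: clock rate times cores times instructions per clock."""
    return clock_hz * cores * per_clock


# Hypothetical inputs for illustration only.
R = 10e9          # bits per second per pad
N_PADS = 1_000    # signal pads per chip
CLOCK_HZ = 2e9    # chip clock rate

b_chip = chip_bandwidth(R, N_PADS)
pim_ips = instructions_per_second(CLOCK_HZ, cores=8, per_clock=1)  # RISC cores
mpu_ips = instructions_per_second(CLOCK_HZ, cores=1, per_clock=2)  # superscalar
print(b_chip, pim_ips, mpu_ips)
```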
Architecture
We are taking an abstract approach to computer architecture with the objective of determining an upper bound on the possible performance of various architectures. More specifically, we define each architecture by its principal chips and the interconnections between them. We assume a future engineer will fill the principal chips with logic and memory such that performance on this algorithm will be optimized. This is a tall order, particularly for the PIM architecture, as its proponents often have well-developed ideas about PIM internal logic, with different people having different and non-overlapping ideas.
We are assuming that each design is connected in a 3-D mesh. The PIM node design is shown in Figure 1 . Each node consists of a single PIM chip and each connection uses one-sixth of the signal pads. The RS (Red Storm-like) system's node is shown in Figure 2 . The router chip is connected to six neighbors and a single MPU chip. Each connection is assigned 1/7 of the router's pads. The MPU assigns 1/7 of its pads to the connection with the router and the other 6/7 to the memory bus. The memories, assumed to have the same number of pads as the router and MPU, leave 1/7 of their signal pads unused. The PIM+DRAM design is shown in Figure 3 . The PIM is connected to six neighbors and to a memory bus. Here we are assuming that 1/7 of the pads are devoted to each use, so its communication rate is the same as the Red Storm design, but the memory bandwidth is much lower. 
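The pad allocations described above can be tabulated as fractions of a chip's signal pads. This is a sketch under the paper's stated assumption that all chips have the same number of pads; the dictionary layout and key names are illustrative choices, not part of the original model.

```python
from fractions import Fraction

# Share of a chip's signal pads given to each neighbor link and to the memory bus.
# For RS, the neighbor links are on the router and the memory bus is on the MPU.
PAD_SHARE = {
    "PIM":      {"neighbor_link": Fraction(1, 6), "memory_bus": Fraction(0)},
    "RS":       {"neighbor_link": Fraction(1, 7), "memory_bus": Fraction(6, 7)},
    "PIM+DRAM": {"neighbor_link": Fraction(1, 7), "memory_bus": Fraction(1, 7)},
}


def bandwidth(design: str, use: str, b_chip: float) -> float:
    """Bandwidth available to one use, given the full-chip bandwidth b_chip."""
    return float(PAD_SHARE[design][use] * Fraction(b_chip))


# Memory-bandwidth advantage of the RS node over the PIM+DRAM node.
memory_ratio = PAD_SHARE["RS"]["memory_bus"] / PAD_SHARE["PIM+DRAM"]["memory_bus"]
print(f"RS memory bandwidth is {memory_ratio}x that of PIM+DRAM")
```

The ratio makes concrete the remark that PIM+DRAM matches the Red Storm design on communication rate but has much lower memory bandwidth.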
The Radiation Transport Problem
A radiation transport problem with a \(1000^3\) array of cells, two particle types, 1000 energy levels, and 5000 angles is of a ferocious size, even if we begin by [...]. Thus, \(S_{cell} = 2 \cdot 8 \cdot N_a \cdot N_s \cdot N_e \cdot S_{flt} = 160\) MB is the size of a cell, where \(N_a\) is the number of angles per octant, \(N_s\) the number of particle types, \(N_e\) the number of energy levels, and \(S_{flt}\) the size of a float in bytes. The factor of 2 allows two floats for old and new values. The 8 converts the angles per octant back into total angles. The overall space requirement for the \(10^9\) cells is \(1.6 \times 10^{17}\) bytes. With 1 GB memory DIMMs at $5 apiece, the machine would cost $800 million for memory chips alone.
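The arithmetic behind these figures can be checked with a short script. The values of 625 angles per octant (5000 angles over 8 octants) and 8 bytes per float are inferred here from the stated totals of 1,250,000 values per sweep and 160 MB per cell; they are assumptions, not definitions quoted from the paper.

```python
N_A = 5000 // 8        # angles per octant (inferred)
N_S = 2                # particle types
N_E = 1000             # energy levels
S_FLT = 8              # bytes per float (inferred from the 160 MB cell)
N_CELLS = 1000 ** 3    # cells in the 1000^3 array
DIMM_BYTES = 10 ** 9   # 1 GB DIMM
DIMM_PRICE = 5         # dollars per DIMM

# Factor 2: old and new values. Factor 8: octants back to total angles.
s_cell = 2 * 8 * N_A * N_S * N_E * S_FLT
total_bytes = s_cell * N_CELLS
dimms = total_bytes // DIMM_BYTES
memory_cost = dimms * DIMM_PRICE

print(f"cell size:   {s_cell:,} bytes")
print(f"total:       {total_bytes:.1e} bytes")
print(f"memory cost: ${memory_cost:,}")
```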
We will restrict ourselves to PIM systems that can contain at least one cell entirely within the on-chip memory. If a cell will not fit, the PIM will begin to resemble an MPU with caching. By this rule, the problem will not run on the Blue Gene/Cyclops (BG/C) currently being developed by IBM: the cells are more than 26 times too large to fit in a BG/C PIM.
We will also abandon here the idea that we can run the standard code SWEEP3D [ref. 3] with its two-dimensional partition of space. In the reference SWEEP3D model, the memory per node has to contain some number of columns. Each column contains \(D\) cells of 160 MB each, or \(1.6 \times 10^{11}\) bytes for \(D = 1000\). With one column per node, that is 160 1 GB DIMMs per node and 1,000,000 nodes in the machine. With future semiconductor technology, we can reduce the number of DIMMs per node, but \(1.6 \times 10^{11}\) bytes of DRAM per node is still steep, and it leaves the nodes memory-heavy and processor-starved.
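A short calculation makes the per-node burden of the two-dimensional partition concrete. It assumes a column depth of 1000 cells (one edge of the 1000^3 array), which is what the stated byte count implies.

```python
S_CELL_BYTES = 160 * 10**6   # 160 MB per cell
GRID_SIDE = 1000             # cells along each edge of the array
DIMM_BYTES = 10**9           # 1 GB DIMM


def column_requirements(columns_per_node: int = 1):
    """Memory, DIMM count, and node count for a 2-D (column) partition."""
    bytes_per_node = columns_per_node * GRID_SIDE * S_CELL_BYTES
    dimms_per_node = bytes_per_node // DIMM_BYTES
    nodes = GRID_SIDE * GRID_SIDE // columns_per_node
    return bytes_per_node, dimms_per_node, nodes


bytes_per_node, dimms_per_node, nodes = column_requirements()
print(bytes_per_node, dimms_per_node, nodes)
```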
It does not appear viable to solve the problem with the current generation of semiconductors. What are the prospects for running the problem on a future system?
Proposed Solution Method
We will follow a simple estimation technique and consider only the data size and data movement required, and number of instructions to be executed. We will do the following:
• Redesign the algorithm to use 3-D partitioning, rather than SWEEP3D's 2-D.
• Calculate the number of cells that can be stored at a node, not bothering to convert to integers, but allowing fractional cells per node in the calculations.
• Assume one or more perfectly cubic blocks of cells are allocated to each node.
• Ignore the idle time between sweeps and just calculate the time required for a node to participate in a single sweep: the sweeps are long enough that we are within 1/1000 of the correct figure.
• Calculate the FLOPS or IPS (instructions per second) rate provided and the time required to execute the sweep.
• Calculate the time to execute one sweep.
• Estimate the cost of the chips in a machine and the cost per second of machine time, assuming a three-year lifetime.
• Estimate the cost of a sweep.
For all the machines we need to consider the cost of passing sweeps of data in and out of the nodes and the cost of processing each value in the sweep. For those designs with external DRAM, we need to calculate the cost of loading and storing cells from the DRAM.
Let
• \(s_{flt} = 8 \cdot S_{flt}\) be the size of a float in bits. (Substitute some other number of bits per byte if you wish, for example to account for parity.)
• \(s_{cell} = 2 \cdot 8 \cdot N_a \cdot N_s \cdot N_e \cdot s_{flt} = 1.28 \times 10^9\) be the size of a cell in bits. The 2 allows two floats for old and new values. The 8 converts the angles per octant to total angles.
• \(s_{msg} = s_{flt}\) be the size in bits of a value sent from one cell to another in a sweep.
Recall that \(N_{sweep} = N_a \cdot N_s \cdot N_e = 1{,}250{,}000\) is the number of floats sent from one cell to a single neighbor in a sweep. \(T_{sweep} = N_{sweep} \cdot s_{msg} / B_{chip}\) is the time required to move the entire number of bits in a sweep from one cell to another cell off-chip using all the pads. It needs to be divided by the fraction of the pads being used, but since that will be different in the different systems, we will leave it out of this formula. Since the nodes will contain blocks of cells, \(T_{sweep}\) must also be multiplied by the number of cells exposed along the side of the block and by the number of sides sharing a single link (six in the case of the router-MPU link on the RS, one for a neighbor link in the PIMs).
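The per-link transfer time can be sketched as below. The formula T_sweep = N_sweep × s_msg / B_chip is reconstructed from the prose description (bits in a sweep divided by full-chip bandwidth) and should be read as an assumption, as should the 64-bit message size and the placeholder bandwidth.

```python
N_SWEEP = 1_250_000   # floats sent from one cell to one neighbor per sweep
S_MSG_BITS = 64       # bits per value (assumes 8-byte floats, 8 bits per byte)


def t_sweep(b_chip_bps: float) -> float:
    """Time to move one cell's sweep data off-chip using all pads."""
    return N_SWEEP * S_MSG_BITS / b_chip_bps


def link_time(b_chip_bps: float, pad_fraction: float,
              exposed_cells: float = 1.0, sides_sharing: int = 1) -> float:
    """Scale T_sweep by pad share, exposed cells per side, and sides per link."""
    return t_sweep(b_chip_bps) / pad_fraction * exposed_cells * sides_sharing


B_CHIP = 5e12  # hypothetical full-chip bandwidth in bits per second
pim_link = link_time(B_CHIP, pad_fraction=1 / 6)                   # PIM neighbor link
rs_link = link_time(B_CHIP, pad_fraction=1 / 7, sides_sharing=6)   # RS router-MPU link
print(pim_link, rs_link)
```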
Formulae for the execution time and chips required are shown in Table 2 . They compare the amount of time required to send and/or receive data from the neighbors, the amount of time to swap cells from and to DRAM memory, the instruction execution time, and the number of chips required. 
[Table 2. Relative performance for a single sweep. Surviving labels: MPU Chips Required, Router Chips Required.]
For the pure PIM machine, each node can contain only as many cells, \(C_c\), as can fit on a single chip. \(C_c^{2/3}\) is the number of cells exposed along one side of the cube to communicate with off-PIM neighbors. The communication time is \(6 \cdot C_c^{2/3} \cdot T_{sweep}\), since the PIM can communicate with all neighbors simultaneously and 1/6 of the pads are used for each.
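A minimal sketch of the pure-PIM communication time follows. It allows fractional cells per chip, as the estimation method does, and takes T_sweep as an input; the example values are hypothetical.

```python
def pim_comm_time(cells_per_chip: float, t_sweep_s: float) -> float:
    """Communication time 6 * C_c^(2/3) * T_sweep for a pure PIM node.

    C_c^(2/3) is the number of cells on one face of a cubic block of C_c cells.
    The factor 6 reflects each neighbor link using 1/6 of the pads.
    """
    exposed_per_face = cells_per_chip ** (2 / 3)
    return 6 * exposed_per_face * t_sweep_s


# Hypothetical example: 8 cells per chip (a 2x2x2 block) and T_sweep = 16 microseconds.
print(pim_comm_time(cells_per_chip=8, t_sweep_s=16e-6))
```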
One would imagine that RS and PIM+DRAM machines would contain single blocks of size \(m \cdot C_c\). Unfortunately, that would require all the cells from the \(m\) memory chips to be loaded and stored for each set of messages received from a neighboring node. We need to use a form of striping, where the overall space of cells is partitioned into \(m\) blocks, each of which is partitioned among the nodes. With \(m = 64\) DRAMs per node, we would partition the \(1000^3\) cells into \(4^3\) blocks of \(250^3\) cells. These blocks are partitioned among the nodes, giving \(C_c\) cells to each. The blocks are processed one at a time in overall sweep order, performing their parts of the sweep. Assuming \(C_c\) cells fit in cache, the RS and PIM+DRAM can load and store each cell only once per sweep, albeit much faster in the RS with its higher memory bandwidth.
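The striping arithmetic can be sketched as a small helper that splits the grid into m cubic blocks. The function name and the restriction to values of m that are perfect cubes dividing the grid evenly are assumptions made for this illustration.

```python
def stripe_blocks(grid_side: int, m: int) -> tuple[int, int]:
    """Split a grid_side^3 grid into m cubic blocks.

    Returns (blocks per side, cells per block side).
    """
    per_side = round(m ** (1 / 3))
    if per_side ** 3 != m:
        raise ValueError("m must be a perfect cube")
    if grid_side % per_side != 0:
        raise ValueError("grid side must divide evenly into blocks")
    return per_side, grid_side // per_side


blocks_per_side, block_side = stripe_blocks(grid_side=1000, m=64)
print(f"{blocks_per_side}^3 blocks of {block_side}^3 cells")
```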
The communication time \(m \cdot 6 \cdot 7 \cdot C_c^{2/3} \cdot T_{sweep}\) for RS includes the factor 7 to account for the speed of the router links and the factor 6 to count the number of neighbors sharing the single router/MPU link.
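The RS expression can be evaluated the same way and compared with the pure-PIM expression 6 · C_c^(2/3) · T_sweep. The comparison at equal C_c and T_sweep is only an illustration of the formulas, not a result from the paper.

```python
def rs_comm_time(cells_per_chip: float, t_sweep_s: float, m: int = 64) -> float:
    """Communication time m * 6 * 7 * C_c^(2/3) * T_sweep for an RS node.

    7: each router link uses 1/7 of the pads.
    6: six neighbors share the single router/MPU link.
    m: one striped block per DRAM chip on the node.
    """
    return m * 6 * 7 * cells_per_chip ** (2 / 3) * t_sweep_s


def pim_comm_time(cells_per_chip: float, t_sweep_s: float) -> float:
    """Pure-PIM communication time, for comparison."""
    return 6 * cells_per_chip ** (2 / 3) * t_sweep_s


# Ratio of the two expressions at equal C_c and T_sweep reduces to 7 * m.
ratio = rs_comm_time(8, 1.0) / pim_comm_time(8, 1.0)
print(ratio)
```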
The factor \(U\) in the formulae for instruction execution time is the average number of instructions executed in updating a cell for each particle/energy/angle element of the sweep streams. \(P_{PIM}\) is the number of RISC processors per PIM. Figure 4 shows the values we assume for \(P_{PIM}\) over the range of years; it is based on 2.5% of the chip space being devoted to processor cores and 3,000,000 transistors per processor. \(I_{pc}\) is the number of instructions executed per cycle by a superscalar (MPU) processor (we assume it is two), and \(C_{ps}\) is the number of cycles executed per second. We let \(m\), the number of DRAM chips per node, be 64 for both the RS and the PIM+DRAM systems.
We are assuming the chip prices are those given in Table 3. These will allow us to compute the price of the minimal system necessary to solve the problem. We assume the lifetime of a machine is three years. The price per second of the machine times the number of seconds required to perform one sweep gives us the cost of the sweep.
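A rough sketch of how the processor count per PIM and the resulting peak instruction rates might be derived is given below. It assumes the 2.5% of chip space translates directly into 2.5% of the chip's transistor budget; the transistor count and clock rate are hypothetical placeholders rather than ITRS values.

```python
def processors_per_pim(chip_transistors: int,
                       core_fraction: float = 0.025,
                       transistors_per_core: int = 3_000_000) -> int:
    """Whole RISC cores that fit in the core share of the transistor budget."""
    return int(chip_transistors * core_fraction // transistors_per_core)


def peak_ips(cycles_per_second: float, processors: int, per_cycle: int) -> float:
    """Peak instructions per second for one chip."""
    return cycles_per_second * processors * per_cycle


# Hypothetical chip: 1.3 billion transistors at a 2 GHz clock.
p_pim = processors_per_pim(1_300_000_000)
pim_rate = peak_ips(2e9, processors=p_pim, per_cycle=1)   # one per clock per core
mpu_rate = peak_ips(2e9, processors=1, per_cycle=2)       # I_pc = 2
print(p_pim, pim_rate, mpu_rate)
```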
[Figure 4. Assumed Processors/Chip.]

Results
The costs of the minimal systems needed to solve the problem are shown in Figure 5, and the sweep times are shown in Figure 6. These give us the costs per sweep shown in Figure 7.
[Figure 5. Cost of minimal system.]
The cost of a minimal PIM-only system to solve the radiation-transport problem is more than an order of magnitude larger than that of a PIM+DRAM or Red Storm-like system, but this is only to be expected from the differences in memory prices. The number of PIMs in a PIM system is equal to the number of DRAMs in an RS system. With 64 DRAMs per MPU and router, the overall chip cost per DRAM on an RS system is 5 + ((150 + 300) / 64) = 12.03 dollars, so the PIM system price is about 25 times that of the RS. The sweep time on the PIM system is an order of magnitude lower than on the RS, which makes the cost per sweep about the same, although the years 2013 and 2016 currently appear to be a win for PIMs.
Figure 8 and Figure 9 illustrate that the equivalence of RS and PIM systems is very much a result of our assumptions about the costs of chips. If PIMs were to become a commodity, the price would decline. At $30 per PIM, the order-of-magnitude decline in cost produces an order-of-magnitude improvement in cost per sweep. At $10 per PIM, the system prices become nearly equivalent.
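The price comparison can be reproduced with a few lines. The $5 DRAM price and the combined $150 + $300 for the MPU and router come from the expression above; the $300 PIM price is inferred from the stated "about 25 times" ratio and is an assumption, since the chip price table is not reproduced here.

```python
DRAM_PRICE = 5                  # dollars per DRAM chip
MPU_PLUS_ROUTER = 150 + 300     # dollars for one MPU plus one router
DRAMS_PER_NODE = 64


def rs_cost_per_dram(m: int = DRAMS_PER_NODE) -> float:
    """Chip cost per DRAM on an RS node, amortizing the MPU and router."""
    return DRAM_PRICE + MPU_PLUS_ROUTER / m


def pim_to_rs_price_ratio(pim_price: float, m: int = DRAMS_PER_NODE) -> float:
    """One PIM replaces one DRAM, so the system price ratio is a per-chip ratio."""
    return pim_price / rs_cost_per_dram(m)


for pim_price in (300, 30, 10):   # 300 is inferred; 30 and 10 are the commodity cases
    print(pim_price, round(pim_to_rs_price_ratio(pim_price), 2))
```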
[Figure: System costs.]
Conclusions
We have predicted system costs, speed, and cost per sweep of PIM, Red Storm-like, and PIM+DRAM systems over the next decade when applied to a large radiation transport problem. One argument to dismiss PIM-based systems out of hand is that their internal memories are too small and their bandwidth to external DRAM is too low. It is true that the small internal PIM memory would force recoding the SWEEP3D family of algorithms, but the large problem size would force an equivalent recoding for RS-like systems. The sweep time and cost per sweep for a PIM+DRAM system are a bit worse than for the RS, which supports the argument that low bandwidth to off-chip DRAM will be the bane of PIM+DRAM systems.
A problem of the size studied here will certainly stress any hardware procurement budget in the near term. Assuming PIM chips cost about the same as routers, PIM systems can be expected to cost an order of magnitude more than Red Storm-like systems, but since they are an order of magnitude faster, the cost per sweep will be about the same (ignoring the cost of waiting for the answer). Significant declines in PIM prices would bring a PIM system's cost closer to an RS-like system's and give the PIM systems a significantly lower cost per sweep.
