Windows started the Wintel era, in which Microsoft Windows running on Intel x86 microprocessors dominated the computer industry and changed the world. Retaining the x86 instruction set across many generations let users buy new and more capable microprocessors without having to buy software to work with new architectures.
Today's more complex applications and consumers' shift to mobile devices have increased the critical need for energy efficiency and for smoothing the software-upgrade process. This has deeper implications for instruction sets and architectures than many people realize.
Microprocessor vendors are now able to change architectures even at the instruction-set level if they first remotely upgrade users' software to make it compatible with both the current and new architectures. A computer's architecture has served as a communication interface between architects and programmers. Architects minimized changes to avoid forcing programmers to perform costly rewriting. However, freezing the instruction set reduces the architect's flexibility when trying to increase energy efficiency. It would be more effective for architects to develop more energy-efficient architectures with less regard to instruction-set changes and let the marketplace decide whether the energy savings outweigh the cost of rewriting software.
This illustrates the need to change the process of optimizing architectures for existing software to one that can find new architectures. Each benchmark or test application originates with a widely used program, so each test embodies the best algorithms for the current architecture. Some simulations go a step further and use instruction traces collected from existing code's inner loops, meaning the results are based on the instruction set of the computer that created the trace. However, carrying artifacts of the current architecture into simulations inhibits change.
COMPUTER ARCHITECTURE SIMULATIONS
To illustrate the problem, microprocessor benchmark sites such as CPUBoss (www.cpuboss.com) show that the software benchmark performance of a modern laptop's microprocessor has barely improved over that of one that is five years old. Computer performance has actually increased substantially during this time, but the performance gain is mostly for new code running on new processors. Because benchmarks are usually run only when a processor is first introduced, results for new code are not listed for older processors.
Computer architectures and algorithms include discrete structures that aren't amenable to continuous parameter optimization. A computer's architecture and each algorithm that runs on it can be depicted as a graph with computations in boxes and data movement shown by lines. Typical hardware boxes contain processors, caches, and memory, as well as lines that represent buses or other interconnects. Algorithm graphs are more specific, with boxes containing mathematical calculations, sorting operations, or storage operations for specific data structures. Execution efficiency depends on the way the algorithm's graph maps to the hardware graph.
First, simulations should run a broad range of algorithms for a particular problem without the testers downselecting the algorithms to fit specific architectures. For example, a test set could include multiple algorithms for sparse matrix multiplication instead of a specific algorithm or instruction trace. Applying the feedback process to the best algorithm for a dataflow or processor-in-memory architecture should optimize those architectures instead of converting them back to the von Neumann style.
Second, changing the feedback position between Figures 1a and 1b lets the process find the best architecture for a specific problem instead of finding the architectural compromise
EXAMPLE: SPARSE MATRIX MULTIPLY
Realizing the benefits of the process illustrated in Figure 1b requires a technology advance best suited to a new architecture. Otherwise, the process will unhelpfully rediscover an existing architecture. A suitable advance is 3D stacked memory, whose best-known examples are hybrid memory cube (HMC) and high-bandwidth memory (HBM). Academic visionaries see stacked memory as an intermediate step toward systems with fully integrated logic and memory. 2 We'll use HBM's second generation, HBM2, in subsequent examples. First, we'll illustrate the performance potential of stacked memory on a sparse matrix multiply problem. Sparse matrix multiply is the most important step in some important scientific codes, such as multigrid solutions to partial differential equations. The task is to compute $C = AB$, with $A$, $B$, and $C$ being sparse matrices.
In matrix notation, a matrix $M$ comprises elements $m_{rc}$, where $r$ and $c$ are the row and column indices, respectively. Each element $c_{ij}$ of the matrix product is a sum of products, sometimes called the dot or inner product, of a row of $A$ multiplied by a column of $B$: $c_{ij} = \sum_k a_{ik} b_{kj}$. This dense matrix multiply executes efficiently as vector operations.
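The inner-product definition above can be sketched directly in code. This is a minimal illustration only, assuming dense matrices stored as Python lists of rows; the function name `dense_matmul` and the sample matrices are hypothetical and not taken from the study.

```python
def dense_matmul(a, b):
    """Return C = AB for dense matrices stored as lists of rows."""
    n_rows, n_inner, n_cols = len(a), len(b), len(b[0])
    c = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        for j in range(n_cols):
            # c_ij is the inner product of row i of A with column j of B.
            c[i][j] = sum(a[i][k] * b[k][j] for k in range(n_inner))
    return c


# Hypothetical 2x2 example.
A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = dense_matmul(A, B)
print(C)  # [[19.0, 22.0], [43.0, 50.0]]
```

Every $c_{ij}$ is finished in one pass over contiguous data, which is why the dense case suits vector hardware.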
If the matrix is sparse, perhaps 99.999 percent of the products $a_{ik} b_{kj}$ are zero, due to one or both variables being absent and assumed to be zero. As a result, the terms used to compute a specific $c_{ij}$ occur at different times interleaved with the calculation of other $c_{ij}$ instances, which requires looking up the partially summed $c_{ij}$ instances in memory before adding to them. While this may sound easy, it leads to billions of essentially random memory updates out of a gigabyte-size pool of memory, as Figure 2a illustrates. This makes sparse matrix multiplication on a von Neumann computer inefficient due to resource-intensive random-access memory activity.
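The scattered-update pattern described above can be seen in a small sketch. It assumes sparse matrices stored as dictionaries keyed by `(row, col)`, which is a simplification chosen for readability rather than the storage format used in the benchmarks; `sparse_matmul_accumulate` and the sample data are hypothetical.

```python
from collections import defaultdict


def sparse_matmul_accumulate(a, b):
    """C = AB for sparse matrices stored as {(row, col): value} dicts.

    Returns the product and the number of accumulate operations performed.
    """
    # Group B's nonzeros by row so each a_ik can find its matching b_kj.
    b_rows = defaultdict(list)
    for (k, j), b_kj in b.items():
        b_rows[k].append((j, b_kj))

    c = defaultdict(float)
    updates = 0
    for (i, k), a_ik in a.items():
        for j, b_kj in b_rows.get(k, ()):
            # Look up the partial sum for c_ij and add to it. Successive
            # (i, j) targets land at unrelated places in the accumulator.
            c[(i, j)] += a_ik * b_kj
            updates += 1
    return dict(c), updates


# Hypothetical 3x3 operands with three nonzeros each.
A = {(0, 0): 2.0, (0, 2): 1.0, (1, 1): 3.0}
B = {(0, 1): 4.0, (1, 0): 5.0, (2, 1): 6.0}
C, n_updates = sparse_matmul_accumulate(A, B)
print(C, n_updates)  # {(0, 1): 14.0, (1, 0): 15.0} 3
```

At this scale the accumulator fits in cache; with a gigabyte-size result, each `c[(i, j)] += ...` becomes a read-modify-write to an essentially random memory location.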
A Sandia National Laboratories study benchmarked a series of sparse matrix multiply algorithms on various types of hardware, including a system based on Intel Knights Landing processors. This system had double data rate synchronous DRAM (DDR SDRAM) channels and 3D HBM stacked memory, as illustrated in Figure 2b. HBM's wider data buses, resulting from the use of high-density vias running through the silicon rather than PC board traces, increase the processor-memory bandwidth about 10 times, from DDR SDRAM's 25 GBps to 250 GBps. Both systems used the same discrete structure illustrated in Figure 2c.
A NEW AND DIFFERENT APPROACH
While 3D memory increases bandwidth and eases the memory bottleneck, fully integrating logic and memory like the structure shown in Figure 2d eliminates it entirely. 2 DRAM refresh requires each bit to be read and rewritten every 64 ms. DRAM is organized as banks of 8,192 rows, requiring about 0.25 ms for a full memory refresh. The DRAM's internal data rate of reading and rewriting HBM2's maximum 8-Gbyte memory stack in 0.25 ms is equivalent to 32,000 GBps, which is more than 1,200 times faster than DDR SDRAM and more than 120 times faster than the stacked memory shown in Figure 2b.
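The bandwidth arithmetic in this paragraph can be checked in a few lines. The inputs are the figures quoted in the text; the variable names are illustrative, and times are kept in microseconds so the division is exact.

```python
# Figures quoted in the text.
stack_capacity_gbytes = 8      # HBM2 maximum stack size
full_refresh_us = 250          # 0.25 ms for a full memory refresh
ddr_gbps = 25                  # DDR SDRAM processor-memory bandwidth
hbm_gbps = 250                 # HBM processor-memory bandwidth

# Reading and rewriting the whole stack once per full refresh.
internal_gbps = stack_capacity_gbytes * 1_000_000 / full_refresh_us

print(internal_gbps)             # 32000.0
print(internal_gbps / ddr_gbps)  # 1280.0
print(internal_gbps / hbm_gbps)  # 128.0
```

The exact ratios are 1,280 and 128, so the multipliers given in the text appear to be these values rounded down.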
The recent Sandia Labs study benchmarked the Expand-Sort-Compress (ESC) algorithm, 3 whose block diagram is shown in Figure 2e. In lieu of executing the statement $c_{ij} = c_{ij} + a_{ik} b_{kj}$ all at once, the ESC algorithm streams records $\{i, j, a_{ik} b_{kj}\}$ produced after the multiplies are performed into a memory array. The array is then sorted using $i, j$ as the key, arranging records so that the ones with the same index are next to each other. The additions are then performed in an efficient array scan. This alternative algorithmic approach uses more bandwidth yet is more efficient because it has a regular access pattern. The approach is not very effective on the standard von Neumann architecture shown in Figure 2a because of its memory bottleneck, but is more effective with the increased bandwidth available from the 3D memory in Figure 2b. Current DRAM architectures can't alter data during refresh and as a result cannot perform sorting or addition for these algorithms. However, the fully integrated architecture of Figure 2d would support the algorithms and yield higher performance.
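The three ESC phases described above can be sketched as follows. This is a minimal single-threaded illustration of the expand, sort, and compress steps, not the implementation that was benchmarked; it assumes sparse matrices stored as `{(row, col): value}` dictionaries, and `esc_matmul` and the sample data are hypothetical.

```python
from itertools import groupby
from operator import itemgetter


def esc_matmul(a, b):
    """C = AB via Expand-Sort-Compress on {(row, col): value} dicts."""
    # Group B's nonzeros by row so each a_ik can find its matching b_kj.
    b_rows = {}
    for (k, j), b_kj in b.items():
        b_rows.setdefault(k, []).append((j, b_kj))

    # Expand: stream one record (i, j, a_ik * b_kj) per nonzero product
    # into an array, without touching any partial sums.
    records = [
        (i, j, a_ik * b_kj)
        for (i, k), a_ik in a.items()
        for j, b_kj in b_rows.get(k, ())
    ]

    # Sort: order by the (i, j) key so equal indices become adjacent.
    records.sort(key=itemgetter(0, 1))

    # Compress: one sequential scan adds up each run of equal keys.
    c = {}
    for (i, j), run in groupby(records, key=itemgetter(0, 1)):
        c[(i, j)] = sum(value for _, _, value in run)
    return c


# Hypothetical 3x3 operands with three nonzeros each.
A = {(0, 0): 2.0, (0, 2): 1.0, (1, 1): 3.0}
B = {(0, 1): 4.0, (1, 0): 5.0, (2, 1): 6.0}
print(esc_matmul(A, B))  # {(0, 1): 14.0, (1, 0): 15.0}
```

The record array is written and read more than once, which is the extra bandwidth the text mentions, but every phase after the expansion walks memory in order rather than updating scattered partial sums.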
COMPARING APPROACHES
The type of continuous variable optimization in Figure 1a can iteratively adjust clock rate or cache size to achieve an architecture that is well balanced across an entire application suite. However, optimizing continuous variables can't separate the addition and multiply operations found in a single box in Figure 2c.

Continued computing advances depend on finding energy-efficient alternatives to the von Neumann architecture. Current methods and simulation tools miss the target because they retain artifacts of the von Neumann architecture. However, I suggest the method can and should be adapted to test new architecture-algorithm combinations. This would create a hybrid human-computer system in which humans create architectures and algorithms using brainpower, and then computer simulation assesses the quality of each combination.