EDITOR ERIK P. DEBENEDICTIS
Maintaining the microprocessor's scale-up path beyond the early 2000s would have required clock rate increases beyond the physical limits of transistors. The idea at the time was to shift to a multicore microprocessor architecture running at an almost unchanging clock rate of 2-4 GHz. Unfortunately, the resulting shared-memory parallel architecture was too hard to program, and the potential performance gain was not enough to motivate research to simplify it. Without performance improvements at the processor level, new features could not be added to applications, slowing economic progress.
Industry and government responded with physical science research, and while ways to extend Moore's law through smaller transistors, 3D chips, and new memory devices were found, faster transistors were not. 1 During the same period, microprocessor manufacturers diversified their product lines beyond multicore, essentially finding new approaches to speeding up programs without faster transistors. This process began with making graphical processing units (GPUs) and field 
ACCELERATING THE MICROPROCESSOR
Most of a microprocessor's data movement and logic is overhead, interpreting the instruction set and moving data between memory elements and the functional units that carry out the actual steps of an algorithm. A custom accelerator chip dispenses with this overhead by moving information between functional units using wires. Wires cannot be changed after manufacture, leading to a tradeoff between performance and programmability. An FPGA is an array of functional units with wire segments that can be reconnected into any wiring pattern in a matter of seconds. While an FPGA dispenses with the microprocessor's overhead, the wiring requires constant signal amplification and routing that decreases its speed and energy efficiency compared to a custom chip. Instead of selling computers with one-size-fits-all instruction sets, such as x86, companies now offer a collection of architectural options. For example, many microprocessor chips include a GPU, which has been generalized so it can assist the microprocessor in executing the user's application as well as performing fast graphics. The GPU does not replace the microprocessor, but gives the programmer an architectural choice that can be exploited to optimize speed and energy efficiency.
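The interpretation overhead described at the start of this section can be made concrete with a toy model. The sketch below is an illustration only: the three-instruction register machine, its step categories, and all names are hypothetical, and the tallies say nothing about the cost ratios of a real microprocessor. It simply shows that an interpreted program spends steps on fetching, decoding, and moving operands around each useful arithmetic step, while the "hardwired" version connects the adder directly to the multiplier.

```python
from collections import Counter

# Hypothetical program computing (a + b) * c on a toy register machine.
PROGRAM = [
    ("add", "r3", "r0", "r1"),
    ("mul", "r4", "r3", "r2"),
    ("halt",),
]


def run_interpreted(program, regs):
    """Interpret the program, tallying overhead steps and useful ALU steps."""
    steps = Counter()
    pc = 0
    while True:
        instr = program[pc]
        steps["fetch"] += 1            # read the instruction
        op = instr[0]
        steps["decode"] += 1           # work out what it means
        if op == "halt":
            break
        _, dst, src1, src2 = instr
        x, y = regs[src1], regs[src2]
        steps["register_read"] += 2    # move operands to the functional unit
        result = x + y if op == "add" else x * y
        steps["alu"] += 1              # the only step the algorithm asked for
        regs[dst] = result
        steps["register_write"] += 1   # move the result back
        pc += 1
    return regs, steps


def run_hardwired(a, b, c):
    """Same computation with fixed 'wiring': adder output feeds the multiplier."""
    return (a + b) * c


regs, steps = run_interpreted(
    PROGRAM, {"r0": 2, "r1": 3, "r2": 4, "r3": 0, "r4": 0}
)
print(regs["r4"], run_hardwired(2, 3, 4))
print(dict(steps))
```

In this toy tally only 2 of the 14 counted steps are arithmetic; the rest are the kind of bookkeeping that a custom accelerator replaces with wires.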
FPGAs are a tour de force of architectural diversity. All FPGAs contain an array of gates, but some of the gates can be removed and a microprocessor substituted into the equivalent space. The same is true for arithmetic functions, memories, I/O drivers, phase-locked-loops, specialized logic for neural networks, and so forth.
These new products form a malleable architecture that changes over an application's life cycle. In the old days, a processor's architecture was defined first and then code was written for the architecture. With today's product mix, code starts being written for whatever architecture is available. Then the code is analyzed to refine the architecture, which may be tested and used on an FPGA but eventually fabricated as a custom accelerator chip.
Speed and energy-efficiency improvements for the architectures just described are compatible with semiconductor roadmaps 1 and often reach 1000×, 4 so Moore's law continues for hardware at least.
SOFTWARE FOR THE NEW ARCHITECTURES
Programmers are almost universally trained to write software in a particular style, namely expressions, loops, and subroutines. This type of programming co-evolved with the von Neumann architecture and the microprocessor but is no longer limited to them. Programming is taught as though all computers were microprocessors, but some software will also run on parallel computers, such as supercomputers, using the single program, multiple data (SPMD) interpretation, or can be used to drive synthesis of specialized logic designs for FPGAs or custom chips.
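As a rough illustration of the SPMD interpretation, the sketch below runs one ordinary expression-loop-subroutine function on several "ranks", each of which selects its own slice of the data. The names `kernel` and `run_spmd` are hypothetical, and Python threads stand in for the separate processors of a real parallel machine, so this shows the programming pattern rather than any speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def kernel(rank, size, data):
    """The single program: every rank runs this same code on its own slice."""
    chunk = data[rank::size]            # strided partition chosen by rank
    return sum(x * x for x in chunk)    # ordinary loop over local data


def run_spmd(data, size):
    """Launch `size` copies of the kernel and combine their partial results."""
    with ThreadPoolExecutor(max_workers=size) as pool:
        partials = list(pool.map(lambda r: kernel(r, size, data), range(size)))
    return sum(partials)                # reduction step


print(run_spmd(list(range(10)), 4))     # sum of squares of 0..9
```

The point of the pattern is that `kernel` reads like sequential code; only the rank argument distinguishes one copy from another.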
Interestingly, current FPGAs and custom chips can be designed in either a hardware style or the expression-loop-subroutine software style. The hardware style in Figure 1 shows boxes representing a microprocessor and a memory, just like a hardware schematic shows integrated circuits on a circuit board. The alternative in Figure 2 is a single box containing expressions, loops, and subroutines that describe both structure and function. The two forms are almost equal in expressive power, but hardware designers will be familiar with the former and programmers the latter.
COMPUTER
WWW.COMPUTER.ORG/COMPUTER
REBOOTING COMPUTING
that programmers might miss unless it is called to their attention. The variable MEM in the VHDL code ultimately represents a physical memory, but the language allows variables to be referenced in many places in the code. Logic synthesis software will analyze the code for usage patterns of variables, "inferring" that the software array MEM is used in a way consistent with implementation as a physical memory. A physical memory can only be read or written once per clock cycle, creating the requirement that the VHDL code have programmatic structure that precludes more than one reference to array MEM per clock cycle. Ignoring the red code in Figure 2 for now, the IF statement selects whether MEM is read or written. Including the red code creates both a write and a read on the same branch of the IF, so synthesis would fail with an error.
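The one-access-per-clock-cycle rule can be modeled in a few lines. The sketch below is a runtime model in Python, not VHDL and not the static analysis a synthesis tool performs; the class `SinglePortMemory`, the function `cycle`, and the `extra_read` flag (standing in for the red code) are all hypothetical names chosen for illustration.

```python
class PortConflict(Exception):
    """Raised when a memory is accessed more than once in a clock cycle."""


class SinglePortMemory:
    """Toy model of a physical memory with one access port."""

    def __init__(self, size):
        self.cells = [0] * size
        self.accessed = False

    def tick(self):
        """Start a new clock cycle, freeing the port."""
        self.accessed = False

    def _claim_port(self):
        if self.accessed:
            raise PortConflict("second access to MEM in the same clock cycle")
        self.accessed = True

    def read(self, addr):
        self._claim_port()
        return self.cells[addr]

    def write(self, addr, value):
        self._claim_port()
        self.cells[addr] = value


def cycle(mem, write_enable, addr, data_in, extra_read=False):
    """One clock cycle: the IF selects whether MEM is written or read."""
    mem.tick()
    if write_enable:
        mem.write(addr, data_in)
        if extra_read:              # analogous to the red code: write plus read
            return mem.read(addr)   # on the same branch, so the port is reused
        return None
    return mem.read(addr)


mem = SinglePortMemory(4)
cycle(mem, True, 1, 7)              # write-only cycle: fine
print(cycle(mem, False, 1, 0))      # read-only cycle: fine
try:
    cycle(mem, True, 1, 9, extra_read=True)
except PortConflict as err:
    print("rejected:", err)
```

Ordinary software would accept the extra read without complaint, which is why the constraint is easy for a programmer to miss.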
To create FPGAs and custom chips, people acting as both programmer and architect will have one part of their brain track the computation while another part of their brain tries to guide the synthesizer toward a satisfactory architecture.
The interconnection between processor and memory in Figure 1 is fondly called the von Neumann bottleneck. It creates the microprocessor's performance limitation, which can only be alleviated by raising the clock rate, the issue that ignited the crisis in the early 2000s. Figure 2 alleviates the bottleneck by preventing its formation in the first place.
AN ACCELERATOR OR A MALLEABLE ARCHITECTURE?
Products are available today with many different architectures, but I'd like to describe them collectively as a single new architecture that changes form across its life cycle.
Expression-loop-subroutine programming won, becoming more popular than dataflow programming, logic diagrams, and other approaches. However, we now see that software written in this style can be used for fundamentally different computing structures. Instead of considering the von Neumann architecture sacrosanct, I suggest we try to support expression-loop-subroutine programs for as many architectures as we can.
The new approach would start with a programmer-designer expressing what the computer should do as source code, illustrated in Figure 3. The code will run on a standard microprocessor without change to demonstrate its function and allow debugging, but it might be slower and less energy efficient than desired.
To improve speed and energy efficiency, the code will be fed into an enhanced synthesis tool of the type now used for FPGAs and custom chips, where the future processor, illustrated in Figure 3, might include microprocessors, GPUs, accelerators, reconfigurable logic, and so on. While today's synthesis tools synthesize all the VHDL they're given as input, I'm suggesting an enhanced tool that partitions the code, synthesizing a portion as an accelerator but leaving the rest as software to run on a microprocessor. This subdivision concept is illustrated by the colored program text in the source code being allocated to structure in the processor of the same color.
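One way to picture the partitioning step is as a budgeted selection problem. The sketch below is a minimal illustration under assumptions of my own, not a description of any existing synthesis tool: each code region is assumed to come with a measured software time, an estimated accelerated time, and a cost in reconfigurable resources, and a greedy rule picks the regions with the best time saved per unit of resource. The `Region` fields, the `partition` function, and the example numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    name: str
    cpu_time: float     # assumed profile: time when run as software
    accel_time: float   # assumed estimate: time if synthesized as hardware
    area: int           # assumed cost in reconfigurable resources (must be > 0)


def partition(regions, area_budget):
    """Greedily assign regions to hardware by time saved per unit of area."""
    ranked = sorted(
        regions,
        key=lambda r: (r.cpu_time - r.accel_time) / r.area,
        reverse=True,
    )
    hardware, software, used = [], [], 0
    for region in ranked:
        saves_time = region.cpu_time > region.accel_time
        if saves_time and used + region.area <= area_budget:
            hardware.append(region.name)    # synthesize as an accelerator
            used += region.area
        else:
            software.append(region.name)    # leave on the microprocessor
    return hardware, software


regions = [
    Region("inner_loop", cpu_time=90.0, accel_time=3.0, area=40),
    Region("parse_args", cpu_time=2.0, accel_time=1.5, area=30),
    Region("filter", cpu_time=30.0, accel_time=5.0, area=50),
]
hardware, software = partition(regions, area_budget=100)
print("hardware:", hardware)
print("software:", software)
```

A greedy ratio rule is only a heuristic; a real tool would also have to account for the cost of moving data between the accelerator and the microprocessor, which this sketch ignores.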
For even more efficiency, the synthesized result could be sent to a chip design house. This expensive option will be much more attractive than in past decades because the design will have been tested using reconfigurable, FPGA-type logic.
What I've outlined is a variant of the well-known vision of computers with special-purpose accelerators. The architecture I'm suggesting does not need to include a specific accelerator, but is a configurable set of resources that can be turned into an accelerator on the fly based on source code. So it's a computer that runs any code you give it through multiple levels of optimization, making it a malleable architecture without special-purpose anything.
Programmers used to focus on functional correctness, because effort toward making code efficient for hardware execution competed with the exponential improvement of hardware due to Moore's law. While hardware seems destined to continue improving, the improvement path will no longer be transparent to programmers.
However, additional work will be needed to refine the programming of these new configurable accelerated architectures.
Compilers for creating software are about as powerful as their hardware synthesis counterparts, 3 but we need one system, not two. The tools in both categories address common issues such as efficient team development, so let's set aside the common issues. Programming languages like C++ focus on issues irrelevant to tools in the hardware category, such as minimizing errors that enable malware. Conversely, hardware tools try to help the designer avoid the red code in Figure 2, which is perfectly good as software but can't be synthesized as hardware. What we need is a single language for systems that are software at the high level but seamlessly transition to hardware at lower levels. We'll know we've succeeded when somebody writes the equivalent of Kernighan and Ritchie's 228-page tutorial 2 on C for the new architecture.
Figure 2 (caption fragment): The second memory usage in red within a clock cycle is acceptable in simulation but cannot be synthesized because it is not consistent with the operation of a physical memory.
I suggest we consider FPGAs as an alternative to quad-core microprocessors in laptops and other consumer products. One core can run the user's application while a second runs the OS. The third and fourth cores are seldom used but are an important marketing feature. However, manufacturers could build laptops around FPGAs with an embedded ARM or IBM Power microprocessor, making the FPGA resource available for exciting new applications such as machine learning. This might be a better marketing feature than two seldom-used cores, and it would give software developers a technical incentive to address important software challenges.
I also see hardware with untapped potential at the applications level, such as hardware in "beta test" or even available for purchase that has the architectural diversity of the vision I've described in this column. Looking beyond software tools, what's needed is for early adopters to develop and test key applications using the new hardware. The reward for doing so will be a head start on use of a new architecture that is likely to scale up for a long time. 
ACKNOWLEDGMENTS

