Abstract-The power budget is expected to limit the portion of the chip that we can power ON at the upcoming technology nodes. This problem, known as the utilization wall or dark silicon, is becoming increasingly serious. With the introduction of 3-D integrated circuits (ICs), it is likely to become more severe. Thus, how to take advantage of the extra transistors, made available by Moore's law and the onset of 3-D ICs, within the power budget poses a significant challenge to system designers. To address this challenge, we propose a 3-D hybrid architecture consisting of a CPU layer with multiple cores, a field-programmable gate array (FPGA) layer, and a DRAM layer. The architecture is designed for low power without sacrificing performance. The FPGA layer is capable of supporting a large number of accelerators. It is placed adjacent to the CPU layer, with a communication mechanism that allows it to access CPU data caches directly. This enables fast switches between these two layers. This architecture reduces the power and energy significantly, at better or similar performance. This then alleviates the dark silicon problem by letting us power ON more components to achieve higher performance. We evaluate the proposed architecture through a new framework we have developed. Relative to the out-of-order CPU, the accelerators on the FPGA layer can reduce function-level power by 6.9× and energy-delay product (EDP) by 7.2×, and application-level power by 1.9× and EDP by 2.2×, while delivering similar performance. For the entire system, this translates to a 47.5% power reduction relative to a baseline system that consists of a CPU layer and a DRAM layer. This also translates to a 72.9% power reduction relative to an alternative system that consists of a CPU layer, an L3 cache layer, and a DRAM layer.
I. INTRODUCTION

POWER consumption is among the top concerns of today's chip designers. Although Moore's law continues to enable increased transistor count, the power budget constrains the portion of the chip we can power ON. This problem is known as the utilization wall or dark silicon [1], [2]. It is expected to be a major showstopper for high-performance designs at the upcoming technology nodes. Another recent trend toward 3-D integrated circuits (ICs) aggravates this problem further. In the 3-D ICs, multiple dies are stacked together in one package, making more transistors available than predicted by Moore's law. This means designers have more transistors than they can afford to power ON. Thus, they face the challenge of how the extra transistors can be translated to better performance under the given power budget.
Adding specialized accelerators to aid general-purpose CPUs has been proven to be an effective method for reducing power consumption. The power saved by these accelerators can be used to power extra computational components to deliver better performance. These accelerators include coprocessors with special instructions, general-purpose graphics processing units (GPUs), field-programmable gate arrays (FPGAs), and application-specific ICs (ASICs). As opposed to CPUs, they have specialized computation and control logic, thereby reducing the number of active transistors and, hence, power consumption. Moreover, the specialization may also offer a performance advantage over traditional CPUs. This can be traded for further power reduction by operating accelerators at a lower frequency.
Manuscript received July 7, 2015; revised September 17, 2015; accepted September 23, 2015. Date of publication October 19, 2015; date of current version April 19, 2016. This work was supported by the National Science Foundation under Grant CCF-1216457.
The authors are with the Department of Electrical Engineering, Princeton University, Princeton, NJ 08544 USA (e-mail: xianminc@princeton.edu; jha@princeton.edu).
Digital Object Identifier 10.1109/TVLSI.2015.2483525
Among different types of accelerators, ASICs are the most power-efficient but offer almost no functional flexibility. By contrast, FPGAs provide higher flexibility and reasonable power efficiency, making them perfect candidates for low-power computation. Previous studies have shown the advantage of FPGAs over CPUs and GPUs in this respect. For example, Kestur et al. [3] show that using FPGAs for basic linear algebra subroutines can achieve 2.7× to 293× better energy efficiency than using GPUs and CPUs. Chen and Singh [4] show that using FPGAs, based on high-level synthesis (HLS), for a document filtering algorithm can achieve 5.5× and 5.25× better energy efficiency than using GPUs and CPUs, respectively. Other works also report similar energy efficiency improvements of FPGAs over GPUs and/or CPUs [5], [6]. Although these works establish the advantage of FPGAs in reducing energy consumption, the accelerators used in them are highly customized for certain applications. In addition, these FPGA accelerators are only suitable for large computational jobs, since the FPGAs are located off-chip and switching computation to them is costly.
In this paper, we propose a 3-D hybrid architecture for low-power computation to forestall the dark silicon problem while enabling the highest performance possible without hitting the power wall. The architecture is easy to use and meant for general-purpose computations. It consists of a CPU layer, an FPGA layer, and a DRAM layer. It takes advantage of through-silicon via (TSV)-based 3-D IC technology to provide vast FPGA resources close to CPUs. We use a mechanism to allow FPGA accelerators to access CPU data caches directly. This enables fast switches between a CPU and an FPGA accelerator. The FPGA layer is quite flexible in terms of how it is used. It could be configured as a large number of small accelerators, a small number of large accelerators, or something in between. Moreover, the use of a separate FPGA layer allows it to be fabricated separately using the state-of-the-art FPGA design. Since the architecture is geared toward general-purpose computation, we use a state-of-the-art HLS tool to generate the accelerators directly. This eliminates the costly manual design process. Note that the proposed architecture is generic in the sense that it can be extended to include multiple FPGA layers and multiple DRAM layers. We have developed an evaluation framework to simulate systems with such an architecture. Our experimental results show that, compared with the out-of-order CPU, the accelerators on the FPGA layer can reduce function-level power and EDP by 6.9× and 7.2×, respectively, and application-level power and EDP by 1.9× and 2.2×, respectively, while delivering similar performance. Compared with a baseline system consisting of a CPU layer and a DRAM layer, the proposed architecture can achieve 47.5% power reduction for the entire system. Compared with an alternative system consisting of a CPU layer, an L3 cache layer, and a DRAM layer, the proposed architecture can achieve 72.9% power reduction for the entire system.
We summarize the contributions of this paper as follows.
1) We propose a new 3-D architecture with the CPU, FPGA, and DRAM layers for low-power computation.
2) We demonstrate a communication mechanism that allows the FPGA layer to access CPU data caches directly, crossing clock domains.
3) We develop a framework to evaluate the proposed architecture.
4) We evaluate performance, power, and thermal impacts of the architecture, both at the function and application levels.
The rest of this paper is organized as follows. Section II reviews related work on reconfigurable computing and 3-D ICs. Section III presents our proposed architecture. Section IV describes the framework we have developed to evaluate the architecture. Section V presents the experimental results. Finally, the conclusion is drawn in Section VI.
II. RELATED WORK
This section reviews related work on reconfigurable computing and 3-D ICs to place our work in the proper context.
A. Reconfigurable Computing
Reconfigurable computing uses reconfigurable fabrics to perform computations. A reconfigurable computing system can be built using two methods: 1) using customized reconfigurable fabrics and 2) using commodity FPGAs. Several designs, such as Garp, NAPA1000, and Chimaera, were constructed using the first method [7] - [11] . Customized reconfigurable fabrics can usually run at higher frequencies than commodity FPGAs due to the customization. They can also be embedded very close to the CPU to enable fast communication. However, one drawback of this method is that only a small amount of reconfigurable fabric can be placed near the CPU due to die area limitations. This limits the complexity of accelerators implemented on the reconfigurable fabric. Another drawback is that this method cannot take advantage of the latest developments in the FPGA industry. In contrast, systems built using the second method use the state-of-the-art commodity FPGAs [12] - [16] . The FPGA is off-chip and connects to the rest of the system through a bus or other mechanisms. The off-chip FPGA offers a considerable amount of reconfigurable resources. However, its drawback is that the communication between the FPGA and the CPU is costly in terms of both time and energy. We will show that our proposed architecture can overcome these drawbacks using 3-D stacking. This not only enables fast communication but also provides enough FPGA resources and takes advantage of the state-of-the-art FPGA designs.
Designing FPGA-based accelerators requires hardware expertise, which is a barrier to wider adoption of reconfigurable computing systems. Fortunately, recent developments in HLS have significantly lowered this barrier. HLS tools directly convert functions written in high-level languages to hardware accelerators [17]-[22]. Some researchers have also studied direct compilation of compute unified device architecture (CUDA) or OpenCL (designed for heterogeneous parallel computing) to hardware [23], [24]. This takes advantage of the parallelism exposed by both languages and is suitable for systems with off-chip FPGAs. In this paper, we use a state-of-the-art HLS tool called Vivado HLS (known as AutoESL before Xilinx's acquisition) [25]. A third-party report indicates that it can achieve a quality level similar to that of manual designs by hardware experts, for several applications [26].
B. 3-D ICs
The 3-D ICs have multiple layers of dies, which are connected by the TSVs, stacked in a single package. This makes the so-called More-than-Moore scaling possible. In the 3-D ICs, different layers can be manufactured and tested separately. This not only increases the yield but also allows different layers to be optimized and manufactured in different technologies. Researchers have studied various approaches for using the 3-D IC architectures to improve performance and/or reduce power consumption. One popular approach is to put caches or memories in a layer separate from the CPU layer [27]-[32]. Although this approach is easy to implement, it is confronted with diminishing returns after memory/cache sizes exceed certain thresholds. Another approach for using the 3-D ICs is to split computational components, such as CPU cores, into multiple layers. This can reduce the wire length compared with the original 2-D designs, thus leading to performance and power benefits. However, this approach has two drawbacks: 1) it requires a very large number of TSVs, which are costly and 2) since computational components generally consume more power than caches or memories, stacking multiple layers of computational components may create thermal problems [33]. Compared with the above two approaches, adding an FPGA as a separate layer has two unique benefits: 1) the FPGA layer adds computational power and 2) since an FPGA layer consumes much less power than a CPU layer, adding it is much less likely to lead to thermal problems. Because of these benefits, a recent effort, the FlexTiles platform, makes a move in this direction [34]. However, it uses a special embedded FPGA, which does not take advantage of recent FPGA developments. In this platform, the FPGA accelerators are connected to CPUs through a network-on-chip, incurring a large communication overhead. It also uses a dataflow programming model, which limits its use.
As opposed to the FlexTiles platform, our proposed architecture takes advantage of the state-of-the-art FPGA and HLS technologies. It is easy to use and suitable for general-purpose computation. We also conduct experiments to evaluate its power and thermal impact, which the FlexTiles work lacks.
III. HYBRID ARCHITECTURE
This section presents our hybrid 3-D architecture. We first introduce the system architecture and then present its implementation details, e.g., FPGA acceleration, instruction set architecture (ISA) extension, CPU-accelerator communication, and memory access optimization. Finally, we discuss manufacture, test, cost, and reliability of the proposed architecture.
A. System Architecture
As shown in Fig. 1, our proposed hybrid 3-D architecture consists of three layers: 1) CPU; 2) FPGA; and 3) DRAM. An extended configuration may consist of multiple FPGA and DRAM layers. For simplicity, we will focus only on the basic configuration in this paper. The cores located on the CPU layer run at a high frequency (1 GHz or higher) and generate the most heat. Therefore, the CPU layer is located closest to the heat sink to expedite heat dissipation. Adjacent to the CPU layer is the FPGA layer. It contains accelerators for low-power computation. These accelerators run at a much lower frequency (several hundred megahertz) than the CPU cores. As a result, they generate much less heat and are placed farther from the heat sink. Above the FPGA layer is the DRAM layer. It contains the system's main memory. It usually generates the least heat and is, hence, placed farthest from the heat sink.
Adjacency of the CPU and FPGA layers enables fast communication between the CPU cores and the accelerators via TSVs that connect the two layers. Each CPU core can have, associated with it, one or multiple accelerators, which are implemented above the core on the FPGA layer, for the given application. An accelerator accesses the main memory via the data cache of the CPU core it belongs to. Such a design has several advantages: 1) accelerators can use existing infrastructure, including the CPU memory hierarchy and memory controllers, which saves hardware resource; 2) caches on the CPU layer can expedite memory accesses and increase accelerator performance; 3) sharing caches provides a uniform view of the memory space among CPU cores and accelerators, simplifying programming; and 4) sharing caches eliminates the need for copying data, enabling fast switches between a CPU core and an accelerator.
B. FPGA Accelerator
The accelerators implemented on the FPGA layer are generated directly from C/C++ code using an HLS tool. The tool we chose is Vivado HLS, a state-of-the-art tool from Xilinx [25] . During HLS, the tool maps the input arguments and return values of C/C++ functions to input and output signals of the generated accelerator. If an input argument is a value, it is mapped to an input signal that passes the value directly. If an input argument is a pointer, it is mapped to a memory access interface, which consists of the address, data, and handshaking signals. A return value, if it exists, is mapped to an output signal, passing the value directly. Besides these signals, Vivado HLS also adds necessary control signals to the accelerator, including clock, reset, and handshaking signals.
Inside the accelerator, sequential C/C++ code is mapped to a hardware structure based on the data and control flows. Fig. 2 (a) and (b) shows an HLS example for an N × N matrix multiplication. Fig. 2(a) shows the original function in C, which multiplies matrices A and B and stores the result as matrix C. All matrices are stored in row-major order. In the function, the two outer loops iterate through each element of matrix C, and the innermost loop iterates through the corresponding row of matrix A and column of matrix B to calculate the matrix C element. Fig. 2(b) shows the generated accelerator structure. Inputs A, B, and C are pointers and mapped to three memory access interfaces. Input argument N is mapped directly to a value-passing signal. In the generated accelerator, the data path contains four multipliers to multiply four pairs of inputs and three adders to sum up the products. It is capable of handling four iterations of the original function's inner loop. The data path is controlled by the generated finite-state machine.
During accelerator generation, we use three types of optimizations supported by Vivado HLS to increase accelerator performance: 1) pipelining; 2) loop unrolling; and 3) dataflow. The pipelining optimization splits the data path into stages by inserting pipeline registers, allowing it to tackle one batch of inputs per cycle to increase throughput. The loop unrolling optimization unrolls a loop to allow multiple iterations of the loop body to be finished in parallel. This increases the performance at the cost of extra hardware resources. For instance, in the N × N matrix multiplication example, the innermost loop of the original function is unrolled four times. The dataflow optimization enables task-level (functions and loops) pipelining by inserting buffers, such as first-in first-out (FIFO) or ping-pong buffers. All these optimizations can be performed via directives, which do not require any source code change [35] .
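As a rough software model of the matrix multiplication example, the sketch below multiplies two N × N row-major matrices and processes the innermost loop four iterations at a time, mirroring the four multipliers and three adders of the generated data path. It is written in Python rather than the C used as HLS input, the function name `matmul_unrolled4` is hypothetical, and N is assumed to be a multiple of four. It illustrates the behavior only and is not the source of Fig. 2(a) or the output of Vivado HLS.

```python
def matmul_unrolled4(a, b, n):
    """Multiply two n x n matrices stored as flat row-major lists.

    Assumption: n is a multiple of 4, so the innermost loop can be
    processed in groups of four iterations, as in the unrolled data path.
    """
    if n % 4 != 0:
        raise ValueError("this sketch assumes n is a multiple of 4")
    c = [0] * (n * n)
    for i in range(n):              # row index of C
        for j in range(n):          # column index of C
            acc = 0
            for k in range(0, n, 4):
                # Four multipliers: one product per unrolled iteration.
                p0 = a[i * n + k] * b[k * n + j]
                p1 = a[i * n + k + 1] * b[(k + 1) * n + j]
                p2 = a[i * n + k + 2] * b[(k + 2) * n + j]
                p3 = a[i * n + k + 3] * b[(k + 3) * n + j]
                # Three adders form a small tree over the four products;
                # accumulating across groups is a separate step in this model.
                acc += (p0 + p1) + (p2 + p3)
            c[i * n + j] = acc
    return c
```

In an HLS flow, the same effect would normally be requested through an unroll directive on the innermost loop rather than by rewriting the loop body by hand.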
The generated accelerator usually works at a lower frequency than the CPU core. However, it may still perform computations faster than the CPU core due to the following reasons: 1) it can have parallel data paths; 2) it can use customized pipelines to achieve higher throughput; 3) it can merge multiple operations into a single operation in its data path; and 4) it can have customized internal communication mechanisms, which are not constrained by the memory bottleneck.
To connect an accelerator to a CPU core working in another clock domain, we add a wrapper to the generated accelerator for interclock-domain communication, as shown in Fig. 3 . The wrapper has two parts: 1) configuration/status registers and 2) interface protocol conversion. The configuration/status registers store the inputs and internal states. The interface protocol conversion part converts the generated accelerator interface to an interface consisting of FIFOs to provide a simple interface to connect to the rest of the system. Each pair of an inbound FIFO and an outbound FIFO is grouped into a channel. Based on the functionality, a channel belongs to one of two possible categories: 1) control channel and 2) data channel. A control channel is responsible for the communication of control information, including the information to set up the accelerator and the execution result. A data channel is responsible for memory access when the accelerator is running. The outbound FIFO of each data channel conveys the memory access request, and the inbound FIFO conveys the response. Each accelerator has only one control channel 
C. CPU ISA Extension
Using instructions to directly configure and call accelerators also allows fast switches between a CPU and an accelerator. Therefore, we added two new instructions to the CPU ISA to access the accelerators: 1) WRITE_ACC and 2) RUN_ACC. WRITE_ACC writes configuration registers in the accelerator. It has two operands: 1) the value to be written and 2) the register index. RUN_ACC starts the accelerator. It also has two operands: the accelerator index and another that indicates where the execution result (the return value of the original function) will be stored. Table I summarizes the two instructions. In the ALPHA instruction format, each instruction can have at most three operands, but only two are used in our design. As we will see, the ISA extension is simple to implement, yet supports complex accelerators.
A typical call to an accelerator has two steps: 1) the CPU uses the WRITE_ACC instruction to write accelerator registers and 2) it uses the RUN_ACC instruction to start the accelerator and receive the execution result. During instruction execution, an accelerator may access memory through the CPU data cache. Meanwhile, the CPU waits until the accelerator finishes or a translation lookaside buffer (TLB) miss occurs (discussed later). A more sophisticated implementation may support the simultaneous running of the CPU and accelerator. However, we did not choose this implementation, since our main goal is to reduce power.
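The two-step call sequence can be mimicked with a small behavioral model. The sketch below is a toy under stated assumptions: the class `ToyCore`, its register counts, and the flat configuration register file are hypothetical, and each accelerator is represented by an ordinary Python callable instead of hardware. It only shows the order of operations (write registers, then run and block until the result is available).

```python
class ToyCore:
    """Toy model of one CPU core and its accelerators (names hypothetical)."""

    def __init__(self, accelerators, num_config_regs=8, num_cpu_regs=32):
        # accelerators: list of (callable, number_of_arguments) pairs that
        # stand in for the hardware generated by HLS.
        self.accelerators = accelerators
        self.config_regs = [0] * num_config_regs   # accelerator registers
        self.cpu_regs = [0] * num_cpu_regs         # CPU register file

    def write_acc(self, value, reg_index):
        """WRITE_ACC: write one accelerator configuration register."""
        self.config_regs[reg_index] = value

    def run_acc(self, acc_index, dest_reg):
        """RUN_ACC: start the accelerator and block until it is done.

        The return value of the accelerated function is stored in the
        CPU register selected by dest_reg.
        """
        func, num_args = self.accelerators[acc_index]
        result = func(*self.config_regs[:num_args])
        self.cpu_regs[dest_reg] = result
        return result


# Usage: configure two arguments, then run accelerator 0.
core = ToyCore([(lambda x, y: x + y, 2)])
core.write_acc(3, 0)
core.write_acc(4, 1)
core.run_acc(0, 5)   # result is placed in CPU register 5
```

The blocking `run_acc` call reflects the design choice described above, in which the CPU waits while the accelerator runs.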
D. CPU-Accelerator Communication
An FPGA accelerator can access the CPU data cache directly. To enable this sharing, a switch is added to enable multiplexing, as shown in Fig. 4. At any given time, either the CPU core or the accelerator has access to the data cache. Since the switch only adds a 2-to-1 multiplexer between the CPU kernel and the data TLB (DTLB)/cache, it only incurs a minor increase in path delay, which we analyze further in Section V. The CPU and accelerator are connected through a pair of FIFOs. These FIFOs are used as buffers to alleviate the speed mismatch, because the CPU core operates at a higher frequency and the accelerator at a lower one. On the CPU side, we use an adapter, called CPU-side adapter, to connect the FIFOs with the data cache switch. On the FPGA side, another adapter, called accelerator-side adapter, is used to connect the FIFOs to accelerator channels. Each accelerator may have more than one channel. Therefore, the accelerator-side adapter embeds both multiplexing and demultiplexing functions inside. Because the function of the accelerator-side adapter is fixed, it can be prefabricated directly into the FPGA, as in the case of the 3-D network interface [36]. This enables the accelerator-side adapter to work at the same frequency as that of the CPU core. As a result, the CPU data cache and both adapters work at a higher frequency than the accelerator, thus providing enough bandwidth for the accelerator. This is similar to the idea of multipumping [37], where the cache operates at a higher frequency than the FPGA accelerator, allowing multiple memory accesses per cycle by the accelerator. Note that our communication mechanism does not require any extra resources for caches and memory controllers. As a result, it saves resources compared with the multipumping design.
Six types of packets are used for CPU-accelerator communication, as shown in Fig. 5. Three go from the CPU to the accelerator and the other three in the opposite direction. In each packet, the first two bits indicate the packet type. The remaining bits provide other information, such as the address. Details of each type of packet are given as follows.
3) Memory Response: The value field in the packet is the memory read result.
4) Done (Indicating Accelerator Execution is Finished): The value and size fields in the packet indicate the value and data size of the return value (if it exists), respectively. Binary codes 00, 01, 10, and 11 indicate 1-, 2-, 4-, and 8-byte data size, respectively.
5) Memory Read (Reading Memory): The address and size fields correspond to the address and data size of the read operation, respectively. Data size encoding is the same as above.
6) Memory Write (Writing Memory): The address, value, and size fields of the packet correspond to the address, value, and data size, respectively. Data size encoding is the same as above.
Fig. 6(a) shows an example of CPU-accelerator communication using the packets introduced above. First, the CPU executes a WRITE_ACC instruction, which sends a Write packet (packet a) to the accelerator. Then, the CPU executes a RUN_ACC instruction, sending a Start packet (packet b) to the accelerator. During execution, the accelerator accesses memory twice: once for a memory read and once for a memory write. Memory read leads to a Memory read packet (packet c) from the accelerator to the CPU and a Memory response packet (packet d) in the opposite direction as a response. Memory write leads to only a Memory write packet (packet e) from the accelerator to the CPU. When the accelerator has finished, it sends a Done packet (packet f) to the CPU.
In our design, the accelerator accesses the virtual memory space directly. This means that a TLB miss may occur during memory accesses. Fig. 6(b) shows an example of such a scenario where the Memory read packet (packet i) triggers a TLB miss exception. In this situation, the CPU immediately suspends the RUN_ACC instruction and starts the TLB miss exception handling procedure. At the same time, the CPU-side adapter stores the packet that triggers the exception and stops receiving any new packet. When the CPU is handling an exception, the accelerator keeps running until it can no longer continue.
Once the CPU has finished exception handling, it re-executes the RUN_ACC instruction, which resends a Start packet (packet j) to the accelerator. When the CPU-side adapter receives it, it does not forward the packet to the accelerator. Instead, it deletes the packet and sends the stored packet (renamed as packet k) to the CPU to reread the address that caused the TLB miss. At this time, the memory read succeeds and the accelerator continues its execution. Since packets j and k only exist between the CPU and the CPU-side adapter, we have marked them with a dotted line in Fig. 6(b).
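To make the data size encoding and the TLB-miss replay concrete, the following sketch models both in software. It is a minimal illustration under stated assumptions: packets are represented as plain tuples, the class `CpuSideAdapter` and its method names are hypothetical, and no bit-level packet layout is implied beyond the 2-bit size codes given above.

```python
# 2-bit data size codes: 00, 01, 10, 11 -> 1, 2, 4, 8 bytes.
SIZE_CODES = {1: 0b00, 2: 0b01, 4: 0b10, 8: 0b11}
SIZE_BYTES = {code: size for size, code in SIZE_CODES.items()}


def encode_size(num_bytes):
    """Return the 2-bit code for a 1-, 2-, 4-, or 8-byte access."""
    return SIZE_CODES[num_bytes]


def decode_size(code):
    """Return the access size in bytes for a 2-bit size code."""
    return SIZE_BYTES[code]


class CpuSideAdapter:
    """Toy model of the replay behavior after a TLB miss (hypothetical API)."""

    def __init__(self):
        self.stored = None   # packet that triggered a TLB miss, if any

    def on_tlb_miss(self, packet):
        # Keep the offending packet so that it can be replayed later.
        self.stored = packet

    def on_start(self, start_packet):
        """Handle a Start packet from the CPU; return (destination, packet)."""
        if self.stored is None:
            # Normal case: forward the Start packet to the accelerator.
            return ("accelerator", start_packet)
        # Replay case: drop the Start packet and resend the stored request
        # to the CPU so that the memory access is retried.
        replay, self.stored = self.stored, None
        return ("cpu", replay)
```

In this model, the first Start packet after a miss never reaches the accelerator, which matches the description of packets j and k existing only between the CPU and the CPU-side adapter.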
E. Memory Access Optimization
Besides the optimizations performed with directives mentioned in Section III-B, we also perform optimizations specifically targeted at speeding up memory access. During HLS, data channels (memory interfaces) are generated for pointer-passed arguments in the original function. Since the accelerator operates at a lower frequency than the CPU core and each data channel can only access one batch of data per cycle, memory access may become a performance bottleneck. Note that the root of this problem is the low-frequency nature of FPGA accelerators. In our architecture, the accelerator accesses the CPU data cache directly, which operates at a higher frequency and provides enough bandwidth. To tap into the bandwidth, we need to duplicate the data channels to access the cache data in parallel. Fig. 7(a) shows an example of an accelerator with duplicated channels for pointer argument A. As mentioned earlier, the CPU data cache has enough bandwidth to feed the duplicated channels. If this bandwidth is still not sufficient, we can create multiport scratchpad memories inside the FPGA. They load needed data from the CPU data cache at the beginning of accelerator execution, provide these data to the accelerator and store results during execution, and store the execution results back to the CPU data cache at the end. They have multiple ports to deliver sufficient bandwidth during accelerator execution. Fig. 7(b) shows an example of a scratchpad memory added for the data channel for A. Choi et al. [37] and Putnam et al. [38] show that inserting caches or scratchpad memories is possible with the help of HLS tools. In our implementation, since we do not have access to the source code of Vivado HLS, instead of modifying the HLS tool, we modify the source code of the accelerated functions to direct the HLS tool to synthesize the desired scratchpad memories. We use this to mimic direct HLS support.
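The bandwidth argument and the scratchpad life cycle can be sketched as follows. This is an illustration only: the function names are hypothetical, the cache is assumed to serve one access per cache clock cycle, and the scratchpad is modeled as a Python list slice rather than a multiport memory. The 2-GHz and 300-MHz figures are the clock frequencies used later in the experimental setup.

```python
def max_fed_channels(cache_hz, accelerator_hz, accesses_per_cache_cycle=1):
    """Upper bound on duplicated channels that can each issue one access
    per accelerator cycle without exceeding the cache bandwidth.

    Assumption: the cache serves accesses_per_cache_cycle requests per
    cache clock cycle.
    """
    return (cache_hz * accesses_per_cache_cycle) // accelerator_hz


def run_with_scratchpad(cache, base, length, kernel):
    """Model the scratchpad life cycle: load, compute locally, write back."""
    scratch = cache[base:base + length]    # load at the start of execution
    kernel(scratch)                        # accelerator works on the copy
    cache[base:base + length] = scratch    # write results back at the end


# With a 2-GHz cache and a 300-MHz accelerator, up to six channels fit
# under the one-access-per-cache-cycle assumption.
channels = max_fed_channels(2_000_000_000, 300_000_000)
```

When more parallel accesses are needed than this bound allows, the scratchpad path trades a bulk load and write-back for local multiport access during execution.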
F. Manufacture, Test, Cost, and Reliability
In this section, we provide a qualitative analysis of manufacture, test, cost, and reliability of our proposed architecture.
First, to manufacture a chip based on our proposed architecture, the die for each layer can be manufactured independently and then bonded together using face-to-back bonding. The number of TSVs needed for CPU-accelerator communication is determined by the data width of the FIFOs. Based on the packet sizes targeted in Section III-D, the two FIFOs used by each CPU core are 71- and 132-bit wide, respectively. For a typical design with eight cores, this translates to 1624 [(71 + 132) × 8] TSVs. Together with 608 TSVs (512 bits for data, 64 bits for address, and 32 bits for control) needed by the memory controller, the total number of TSVs needed is 2232. A TSV redundancy scheme may require a slightly higher number to cope with TSV failures in manufacture. However, since the increase is minor and depends on the manufacturing process (e.g., the redundancy scheme presented in [39] achieves a 95% recovery rate by adding one redundant TSV per 25-TSV block), we have not modeled TSV redundancy in this paper.
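The TSV count above follows from simple arithmetic, reproduced in the sketch below. The function names are hypothetical. The redundancy estimate is an assumption added for illustration: it applies the one-redundant-TSV-per-25-TSV-block ratio quoted from [39] to the whole TSV count, which the paper itself does not model.

```python
import math


def tsv_budget(num_cores=8, fifo_widths=(71, 132),
               mem_data=512, mem_addr=64, mem_ctrl=32):
    """Return (communication TSVs, memory TSVs, total TSVs)."""
    comm = sum(fifo_widths) * num_cores       # two FIFOs per CPU core
    mem = mem_data + mem_addr + mem_ctrl      # memory controller signals
    return comm, mem, comm + mem


def redundant_tsvs(total, block_size=25):
    """Extra TSVs if one spare is added per block (illustrative assumption)."""
    return math.ceil(total / block_size)


comm, mem, total = tsv_budget()   # 1624, 608, and 2232 TSVs
spares = redundant_tsvs(total)    # 90 spares under the per-block assumption
```

Under that assumption, the spares add roughly 4% to the TSV count, which is consistent with the statement that the increase is minor.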
Second, 3-D ICs pose significant test challenges [40] . To achieve a better yield, the die for each layer needs to be tested before bonding. In our proposed architecture, since each layer contains complete functionality, prebond testing is easier compared with the case when fine-grain partitioning of functionality is employed across multiple die layers. In particular, we can add an independent test controller to each layer for this purpose, and each module on each layer can be treated as an isolated test island and tested separately [41] .
Third, the cost of a 3-D IC consists of wafer cost, bonding cost, packaging cost (depends on both pin count and die size), and cooling cost [42] . Our proposed architecture incurs little overhead for the first three costs, given the relatively small number of TSVs it uses. Since it reduces power consumption and alleviates hotspots on chip, as shown in Section V-F, it also reduces the cooling cost.
Finally, in a 3-D IC, the difference in thermal expansion coefficients between the TSV copper and the surrounding silicon causes thermal-induced stresses that lead to mechanical reliability issues. We can use the techniques and tools proposed in [43] to deal with such issues.
IV. EVALUATION FRAMEWORK
We designed a customized framework to evaluate our proposed architecture. Fig. 8 shows the structure of the framework. It consists of four parts: 1) HLS; 2) performance simulation; 3) power simulation; and 4) thermal simulation. The HLS part contains Vivado HLS, which converts a C/C++ function into accelerator hardware at the register-transfer level. The function to be converted is chosen based on software profiling results. The generated accelerator has multiple versions with the same underlying hardware structure but described in different languages: Verilog hardware description language (HDL), VHDL, and SystemC. Then, a script parses the accelerator interface and changes the benchmark source code to use the generated accelerator instead of the original function. The new source code is cross-compiled into an executable binary for performance simulation.
The performance simulator we use is gem5 [44], an open-source full-system simulator. Its open-source nature allows us to add accelerator simulation support. In particular, we integrate the SystemC version of accelerator code, as well as wrappers generated by our scripts, into gem5 code and simulate them together. We choose to use the SystemC version, since it can be compiled with gem5 source code (written mainly in C++ and Python). Since the gem5 simulator and the SystemC simulation engine have two separate event queues, we employ the method described in [45] to synchronize the two event queues and enable joint simulation.
After the integration, gem5 yields performance results, including component-level utilization statistics and waveforms for the accelerator (output of the SystemC simulation engine). For most system components except accelerators, TSVs, adapters, and DRAM, we use McPAT to estimate power [46] . McPAT takes the component-level utilization statistics, together with the system configuration, to estimate power for each component. As for the accelerator, we first synthesize, place, and route the accelerator's Verilog HDL code to get its low-level implementation details. Then, we use them with the dumped waveform to estimate the accelerator power. This is done using the Xilinx Vivado tool [25] . For TSVs and adapters, we build power macromodels by adding up the power consumptions of their internal components, similar to the way McPAT is built. We use the macromodels, along with their utilization statistics, to estimate their power. For DRAM, we use Micron's DRAM power calculator to estimate power (not shown in the figure) [47] . After we have all the power data, we use HotSpot (with 3-D IC support extension from [31] ) to perform thermal simulation [48] . We have automated the entire process, shown in Fig. 8 , to expedite experimentation. 
V. EXPERIMENTAL RESULTS
This section presents the experimental results. We first discuss the experimental setup. Then, we show the resource utilization rate, function-level/application-level performance, and power of FPGA accelerators. Then, we provide results for performance and power optimization. Finally, we evaluate the system-level power and thermal impact of the proposed architecture.
A. Experimental Setup
To evaluate the proposed architecture, we use an implementation with an eight-core CPU layer, an FPGA layer, and a DRAM layer. We refer to it as CPU + Acc henceforth. To understand how accelerators in the proposed architecture impact the performance and power, we compare CPU + Acc with a baseline system (referred to as CPU-only) consisting of a CPU layer and a DRAM layer, but no FPGA layer. Since CPU-only has one less layer than CPU + Acc, we also compare CPU + Acc with another system that has the same number of layers. The system, referred to as CPU + L3, consists of a CPU layer, an L3 cache layer, and a DRAM layer. Table II shows the configurations of the three systems. They are all assumed to be implemented in the 28-nm CMOS technology except the DRAM layers, which are implemented in the 32-nm CMOS technology. We further assume that memory cells of all caches are implemented with low-standby-power transistors to reduce leakage power, while the rest of the system is implemented with high-performance transistors to improve performance. Such a feature is supported by the cache-modeling tool, CACTI [50] , which is used inside McPAT.
We assume 1/2/6-µm dimensions for TSV diameter/pitch/depth, which are the minimum ITRS diameter/pitch/depth guidelines for years 2012-2014 (they are also within the predicted range for years 2015-2018) [51] . We calculate the TSV capacitance using the method given in [52] for power estimation. For the CPU layer of CPU + Acc, the estimated die size is 84.07 mm², of which TSVs consume 0.02%. In CPU-only and CPU + L3, the CPU layers have slightly smaller (0.62%) die sizes than that of CPU + Acc, because they do not have the CPU-side adapters. For the FPGA layer of CPU + Acc, we use Xilinx Kintex FPGA (xc7k160tfbg484-3; implemented in the 28-nm technology). Its die size is [53] . We scale it to the same size as the L3 cache layer (87.76 mm²) of CPU + L3 for a fair comparison. For the DRAM layers of all three systems, since the die size information of commercial DRAMs is not easily accessible, we assume 6F² area per memory cell and 56.08% area efficiency to estimate the die size [54] , where F is the feature size. Following this method, we estimate the die size of a 1-GB DRAM to be 94.11 mm². In all three systems, the CPU cores run at 2 GHz. In CPU + Acc, the FPGA accelerators run at 300 MHz, which is the best we can get on the FPGA. The switch between the CPU kernel and the DTLB/cache adds 16.5-ps delay to the path delay (based on McPAT), which amounts to 3.3% of the CPU clock period. We assume that a margin exists in CPU + Acc to tolerate this delay increase and allows the CPU cores to run at 2 GHz. Another method to estimate the delay overhead is to estimate both the clock period and the switch delay with fanout-of-4 inverter delay, as in [55] . This method also yields similar results.
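The DRAM die-size estimate and the switch-delay overhead can be reproduced with a short calculation. The sketch below assumes one memory cell per bit, 1 GB = 2^30 bytes, and F = 32 nm (the DRAM technology node stated above); the function names are illustrative.

```python
def dram_die_area_mm2(capacity_bytes, feature_nm,
                      cell_factor=6.0, area_efficiency=0.5608):
    """Die area from cell count, cell_factor * F^2 per cell, and array efficiency."""
    bits = capacity_bytes * 8                       # assumption: one cell per bit
    cell_area_nm2 = cell_factor * feature_nm ** 2   # 6F^2 per cell
    array_area_mm2 = bits * cell_area_nm2 * 1e-12   # 1 mm^2 = 1e12 nm^2
    return array_area_mm2 / area_efficiency


def delay_overhead_fraction(extra_delay_ps, clock_ghz):
    """Extra path delay expressed as a fraction of the clock period."""
    period_ps = 1000.0 / clock_ghz
    return extra_delay_ps / period_ps


area = dram_die_area_mm2(2 ** 30, 32)           # 1-GB DRAM, F = 32 nm
overhead = delay_overhead_fraction(16.5, 2.0)   # 16.5 ps at 2 GHz
print(f"{area:.2f} mm^2, {overhead:.1%}")
```

Under these assumptions the calculation gives about 94.11 mm² for the DRAM die and 3.3% for the delay overhead, which matches the values quoted in the text.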
We use the benchmarks listed in Table III to test CPU-only, CPU + L3, and CPU + Acc. The first two benchmarks are microbenchmarks: 1) integer and 2) floating-point 512×512 matrix multiplications. Both split 512×512 matrices into 16×16 submatrices to increase cache performance. The next five benchmarks (gsm to bzip2) are from the CHStone suite plus bzip2 [56] . Since several of these benchmarks are very short, their main functions are repeated multiple times using loops to extend their duration in our experiments. The remaining benchmarks, blackscholes and x264, are from the PARSEC benchmark suite [57] , [58] . Table III includes
B. Resource Utilization
Table IV shows the percentage of resources used by the generated accelerators. In the name column, the -1 and -2 suffixes indicate the first and second functions when two functions are accelerated in the same benchmark. These suffixes are also used in subsequent figures in this section. Four types of resources are listed in the table: lookup table (LUT), register, block RAM (BRAM), and DSP blocks. Most accelerators use more than 10% of the LUTs except x264. For x264, the accelerated functions, x264_pixel_sad_x4_16x16 and pixel_satd_wxh, are small functions and thus consume only 2.3% and 8.2% of the LUTs, respectively.
C. CPU Versus Accelerator: Function-Level Power, Energy, Performance, and EDP
In this section, we compare the power, energy, performance, and EDP of the out-of-order CPU core and the FPGA accelerators when running the same functions. The first two bars, marked as CPU and Acc, in Fig. 9(a)-(d) compare the power, energy, execution time, and EDP of the CPU (in CPU-only) and the FPGA accelerators (in CPU + Acc). All results are normalized to those of the CPU.² When estimating CPU power and energy, we include data cache/TLB and instruction cache/TLB inside the core. When estimating accelerator power and energy, we only include the data cache (D$) and DTLB, since accelerators only access them. From Fig. 9(a) , we can see that the accelerators consume significantly less power than the CPU. This is because the accelerators contain dedicated computational structures, whereas the out-of-order CPU contains components that do not contribute directly to computation, such as components dedicated to instruction fetching and decoding, branch prediction, and register renaming. When accelerators are used, the data cache and DTLB consume a very small amount of power, because the leakage power of memory cells inside the data cache is reduced using low-standby-power transistors. To better understand the impact of using accelerators, we also show the absolute power values in Fig. 10 . On average, the accelerators consume only 22.8% of the power consumed by the CPU. The power reduction translates to energy reduction in Fig. 9(b) , where the accelerators can be seen to consume only 20.2% of the energy consumed by the CPU on average. As mentioned earlier, although a CPU core runs at a much higher frequency than an accelerator, an accelerator may still deliver better performance due to its dedicated structure. This is evident from Fig. 9(c) , where, compared with the CPU, accelerators deliver similar or even better performance for all benchmarks except fmmul, aes, and bzip2.
On average, the execution time of the accelerators is 82.9% of that of the CPU. When combining energy and execution time into one metric, EDP, we see that the accelerators outperform the CPU on all benchmarks. The average EDP of the accelerators is 21.5% of that of the CPU. Note that the accelerator power calculation includes the leakage power of the entire FPGA, which makes the comparison unfair to the accelerators. When multiple accelerators exist on the FPGA, this leakage power gets amortized. Thus, for each accelerator, we should only count the portion of leakage power that corresponds to the portion of resources (LUTs) it uses. In Fig. 9(a)-(d) , the third bar, labeled "+Partial L," shows the result after this adjustment. On average, the power, energy, and EDP drop to 14.5%, 13.2%, and 13.9% of the CPU, translating to 6.9×, 7.6×, and 7.2× reductions, respectively. The execution time remains the same after the adjustment.
² The operating temperature is assumed to be 65°C for power estimation.
To evaluate the benefit of accessing data directly from the CPU data cache, we measure the data cache miss rate of the CPU and accelerators. The result is shown in Fig. 11(a) . For most benchmarks, the accelerators have higher miss rates than the CPU. This is because the CPU runs software functions, which have intermediate variables. These variables usually end up in the data cache and drive the average miss rate down. By contrast, accelerators only access the cache for pointer-passed arguments. These are inputs from outside and less likely to be in the cache. This makes the cache miss rates higher. To show this difference, Fig. 11(b) compares the number of data cache accesses by the CPU and accelerators. On average, the number of data cache accesses by accelerators is only 23.6% of that of the CPU. Even so, all accelerator miss rates are below 9%, and the average is 3.6%. This means that the CPU data cache provides very good caching for accelerators.
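The leakage adjustment behind the "+Partial L" bars and the conversion from normalized values to reduction factors can be sketched as follows. The function names and the wattage values in the example are hypothetical; only the normalized averages (14.5%, 13.2%, and 13.9%) come from the results above.

```python
def amortized_power_w(dynamic_w, fpga_leakage_w, lut_fraction):
    """Charge an accelerator only the share of FPGA leakage matching its LUT share."""
    if not 0.0 <= lut_fraction <= 1.0:
        raise ValueError("lut_fraction must be in [0, 1]")
    return dynamic_w + fpga_leakage_w * lut_fraction


def reduction_factor(normalized_value):
    """Convert 'X% of the CPU' (as a fraction) into an 'N-times reduction'."""
    return 1.0 / normalized_value


# Placeholder numbers: 0.5 W dynamic, 2.0 W whole-FPGA leakage, 25% of the LUTs.
full = amortized_power_w(0.5, 2.0, 1.0)      # whole-FPGA leakage charged
partial = amortized_power_w(0.5, 2.0, 0.25)  # leakage scaled by LUT share

# Normalized function-level averages reported after the adjustment.
reported = {"power": 0.145, "energy": 0.132, "edp": 0.139}
factors = {name: round(reduction_factor(v), 1) for name, v in reported.items()}
```

Applied to the reported averages, the conversion yields 6.9×, 7.6×, and 7.2× for power, energy, and EDP, consistent with the reductions stated above.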
D. CPU Versus Accelerator: Application-Level Power, Energy, Performance, and EDP
In this section, we evaluate the power, energy, performance, and EDP of the out-of-order CPU core and the FPGA accelerators when running entire benchmark applications on the proposed architecture. When using accelerators, the CPU is responsible for the unaccelerated parts of each benchmark. For this reason, we use CPU + Acc to label results, the same as the system name. Similarly, we use CPU-only and CPU + L3 to label the results when running benchmarks on CPUs of CPU-only and CPU + L3, respectively. Fig. 12(a)-(d) shows the power, energy, execution time, and EDP, respectively, when running each benchmark. On average, the power, energy, execution time, and EDP of CPU + Acc are 59.7%, 53.5%, 93.8%, and 53.5% of CPU-only, respectively. As in the case of function-level results, we also include results when the FPGA leakage power is scaled according to resources used and show them as the bar labeled +Partial L. in these figures. On average, the power, energy, and EDP of CPU + Acc are now 52.3%, 46.3%, and 46.3% of CPU-only, translating to 1.9×, 2.2×, and 2.2× reductions, respectively. As expected, the power reduction at the application level is less than at the function level. This is because the accelerators only account for part of the execution of each benchmark, and the CPU is still responsible for the rest. To deal with this issue, we could use multiple accelerators to cover a larger part of the execution of each benchmark. This has already been addressed in our results by accelerating multiple functions for certain benchmarks.
From Fig. 12(a)-(d) , we found that CPU + L3 performs very similarly to CPU-only. The average execution time of CPU + L3 is 99.8% of that of CPU-only. This is because the L2 cache of CPU + L3 is big enough to accommodate the working data sets, leaving little advantage to be gained from adding an L3 cache. It confirms the fact that adding caches has diminishing returns after cache sizes exceed certain thresholds. As a result, the comparison results between CPU + L3 and CPU + Acc are very similar to those between CPU-only and CPU + Acc. On average, the power, energy, execution time, and EDP of CPU + Acc (with +Partial L.) are 52.2%, 46.3%, 94%, and 46.3% of CPU + L3, respectively.
E. Performance and Power Optimization
As performance results indicate, CPU + Acc runs slower than CPU-only for some benchmarks. This may not be allowed when execution time is a top concern. To solve this problem, we exploit an automatic selection mechanism to direct the benchmark to run on the CPU if the accelerator is slower. This mechanism ensures that the execution time is better than or the same as that of CPU-only. Note that once an accelerator is generated, whether it is faster or slower than the CPU can be determined by simulations beforehand. For situations where execution speed is data-dependent, a dynamic selection method can be used, which chooses the CPU or accelerator dynamically, epoch by epoch, using the history data. In our experiments, we assume that the time budget for a benchmark is the same as its execution time on CPU-only. We further assume that the execution time using the CPU or accelerators can be predetermined by simulations.
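The selection mechanism can be sketched in two forms: a static choice based on execution times predetermined by simulation, and a dynamic, epoch-by-epoch choice based on history. The text does not specify how the history is used, so the exponentially weighted average below is an assumed policy; all names and timing values are hypothetical.

```python
def static_select(cpu_time, acc_time):
    """Pick the accelerator only if its pre-simulated time is no worse than the CPU's."""
    return "acc" if acc_time <= cpu_time else "cpu"


class DynamicSelector:
    """Epoch-by-epoch choice from an exponentially weighted history (assumed policy)."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha                      # weight of the most recent epoch
        self.avg = {"cpu": None, "acc": None}   # running time estimate per target

    def record(self, target, epoch_time):
        """Fold the measured time of the last epoch into the target's history."""
        prev = self.avg[target]
        if prev is None:
            self.avg[target] = epoch_time
        else:
            self.avg[target] = self.alpha * epoch_time + (1 - self.alpha) * prev

    def choose(self):
        """Return the target for the next epoch."""
        # Try each target once before comparing histories.
        for target in ("acc", "cpu"):
            if self.avg[target] is None:
                return target
        return "acc" if self.avg["acc"] <= self.avg["cpu"] else "cpu"


selector = DynamicSelector()
first = selector.choose()      # no history yet, so the accelerator is tried first
selector.record("acc", 1.2)
second = selector.choose()     # CPU has no history yet, so it is tried next
selector.record("cpu", 1.0)
third = selector.choose()      # CPU history (1.0) beats accelerator history (1.2)
```

The static form guarantees an execution time no worse than CPU-only whenever the predetermined times are accurate. The dynamic form trades that guarantee for adaptability to data-dependent behavior.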
With the time budget assumed above, for benchmarks on which CPU + Acc runs faster, we also apply dynamic voltage and frequency scaling (DVFS) to the CPU core to further reduce its power. To show that this is applicable, Fig. 13 shows the benchmark execution time versus the CPU frequency when using CPU + Acc. The values are normalized to CPU-only. For six out of the nine benchmarks, even when we lower the CPU frequency, CPU + Acc can still deliver better performance than CPU-only. This means we can use DVFS to achieve more power reduction for these benchmarks. To incorporate DVFS into power estimation, we scale McPAT results based on voltage and frequency obtained using the method given in [59] . We assume that the frequency changes from 2 to 1.5 GHz in 0.1-GHz steps, and Vdd changes linearly with frequency in this range. The bar labeled +Auto. Sel. in Fig. 12(a)-(d) shows the results of using automatic selection and DVFS. In Fig. 12(a) , we can see that the average power increases to 58.9% of CPU-only, from 52.3% before. This is because the CPU is chosen to run some benchmarks due to the execution time constraint. Similarly, the average energy per benchmark in Fig. 12(b) increases to 55.5% of CPU-only, from 46.3% before. In Fig. 12(c) , all the execution times can be seen to be equal to or shorter than those of CPU-only. This ensures no adverse impact on execution time. On average, the execution time is reduced to 86.7% of CPU-only, from 93.8% before. From Fig. 12(d) , we see that the average EDP increases to 53.9% of CPU-only, from 46.3% before.
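The DVFS step can be illustrated with a small sketch that picks the lowest CPU frequency still meeting the time budget and estimates the resulting dynamic power. It uses the common approximation that dynamic power scales with Vdd² × f, which is not necessarily the scaling method of [59] , and it ignores leakage. The endpoint voltages (1.0 V at 2 GHz and 0.8 V at 1.5 GHz) and the per-frequency execution times are placeholders, not measured values.

```python
FREQS_GHZ = [2.0, 1.9, 1.8, 1.7, 1.6, 1.5]  # 0.1-GHz steps from 2 to 1.5 GHz


def vdd_at(freq_ghz, v_hi=1.0, v_lo=0.8, f_hi=2.0, f_lo=1.5):
    """Linear Vdd-versus-frequency map; the endpoint voltages are placeholders."""
    return v_lo + (v_hi - v_lo) * (freq_ghz - f_lo) / (f_hi - f_lo)


def dynamic_power_scale(freq_ghz, f_nom=2.0):
    """Dynamic power relative to nominal, assuming P is proportional to Vdd^2 * f."""
    return (vdd_at(freq_ghz) / vdd_at(f_nom)) ** 2 * (freq_ghz / f_nom)


def lowest_feasible_freq(norm_time_by_freq, budget=1.0):
    """Lowest frequency whose normalized execution time stays within the budget."""
    feasible = [f for f in FREQS_GHZ if norm_time_by_freq[f] <= budget]
    return min(feasible) if feasible else None


# Hypothetical execution times of one benchmark, normalized to CPU-only.
times = {2.0: 0.80, 1.9: 0.84, 1.8: 0.88, 1.7: 0.93, 1.6: 0.98, 1.5: 1.04}
chosen = lowest_feasible_freq(times)        # 1.6 GHz for these placeholder times
scale_at_min = dynamic_power_scale(1.5)     # (0.8 / 1.0)^2 * 0.75 = 0.48
```

With these placeholder values, the benchmark could run at 1.6 GHz without exceeding its time budget, and running at the lowest step would cut CPU dynamic power to roughly 48% of nominal.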
F. System Power and Thermal Impact Evaluation
In this section, we evaluate the system power and thermal impact of the proposed architecture. The system power consists of the power of all the components in the system, including all caches and main memory. To fully take advantage of the FPGA resources, we evaluate a scenario where multiple homogeneous accelerators exist on the FPGA layer of CPU + Acc. We use multithread settings for the benchmarks with multithread support (blackscholes and x264), and duplicate the single-thread version for those without this support. A CPU core and an accelerator (two accelerators if two functions are accelerated) are allocated to run each thread. Note that the number of threads is limited by the FPGA resource utilization shown in Table IV and the number of CPU cores. We keep increasing the number of threads until we exhaust at least one type of FPGA resource or reach the maximum number of CPU cores. The final thread/accelerator numbers are given in the last column of Table IV. Fig. 14(a) shows the comparison of system power of CPU-only, CPU + L3, and CPU + Acc. We found that CPU + Acc consumes much less power than CPU-only. On average, CPU + Acc reduces power consumption by 47.5% relative to CPU-only. By contrast, CPU + L3 consumes much more power than CPU-only and CPU + Acc. This is because the L3 cache of CPU + L3 consumes a considerable amount of power, even though we choose to use low-standby-power transistors for memory cells in the L3 cache. On average, CPU + Acc reduces power consumption by 72.9% relative to CPU + L3.
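The thread-count rule described above can be sketched as a small calculation. The sketch interprets the rule as the largest thread count for which every FPGA resource type still fits and the core count is not exceeded; this interpretation, the function name, and the first set of utilization percentages are assumptions for illustration.

```python
def max_threads(usage_pct_per_thread, num_cores=8, capacity_pct=100.0):
    """Largest thread count that fits every FPGA resource type and the core count.

    usage_pct_per_thread maps a resource type to the percentage of that resource
    consumed by the accelerator(s) serving one thread.
    """
    limits = [int(capacity_pct // pct)
              for pct in usage_pct_per_thread.values() if pct > 0]
    return min(limits + [num_cores])


# Hypothetical per-thread utilization: the DSP blocks run out first.
dsp_bound = max_threads({"LUT": 23.0, "register": 9.5, "BRAM": 4.0, "DSP": 30.0})

# Using only the LUT shares quoted for the two x264 functions (2.3% and 8.2%)
# and ignoring the other resource types, the eight CPU cores become the limit.
core_bound = max_threads({"LUT": 2.3 + 8.2})
```

The first case yields three threads, because a fourth would need 120% of the DSP blocks. The second yields eight, because the LUTs alone would allow nine threads but only eight cores are available.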
The differences in power consumption lead to differences in maximum temperatures, or hotspots, on the chip. Fig. 14(b) shows comparisons of maximum temperatures. We can see that the maximum temperature of CPU + Acc is much lower than those of both CPU-only and CPU + L3. On average, the maximum temperature of CPU + Acc is 49.7°C, 6.4°C lower than that of CPU-only (56.1°C), and 9.5°C lower than that of CPU + L3 (59.2°C). This indicates that CPU + Acc does not aggravate the thermal problem, which is a major problem in 3-D ICs. On the contrary, it helps to alleviate the thermal problem.
VI. CONCLUSION
In this paper, we proposed a 3-D CPU-FPGA-DRAM hybrid architecture for low-power computation. The architecture features an FPGA layer between the CPU and the DRAM layers. The FPGA layer implements accelerators that work with CPU cores to achieve low-power computation. This helps address the dark silicon problem. We also take advantage of HLS to make the proposed architecture easy to use and suitable for general-purpose computation. To evaluate the architecture, we created an evaluation framework and conducted various experiments. At the function level, experimental results show that the accelerators on the FPGA layer can reduce power and EDP by 6.9× and 7.2×, respectively, compared with an out-of-order CPU while delivering similar performance. At the application level, the accelerators on the FPGA layer can reduce power and EDP by 1.9× and 2.2×, respectively. For the entire system, this translates to 47.5% power reduction compared with the baseline system without the FPGA layer and 72.9% power reduction compared with an alternative system which has the FPGA layer replaced with an L3 cache layer. These results demonstrate that the proposed architecture is effective in achieving the power reduction goal.
