Abstract-This paper describes an architecture and FPGA synthesis toolchain for building specialized, energy-saving coprocessors called Irregular Code Energy Reducers (ICERs) for a wide range of unmodified C programs. FPGAs are increasingly used to build large-scale systems, and many large software systems contain relatively little code that is amenable to automatic, semi-automatic, or even manual parallelization. Whereas accelerator approaches have traditionally achieved energy benefits as a side effect from increasing performance via parallel execution, ICERs aim to achieve energy gains even on code with little exploitable parallelism.
I. INTRODUCTION
As reconfigurable fabrics scale in capacity and capability, the systems they implement will similarly expand in scope and complexity. Designers will increasingly use large bodies of existing code in a high-level language like C to specify the behavior of these systems. By moving from a pure software implementation to a hybrid architecture, designers hope to improve performance and, increasingly, reduce energy consumption.
Although traditional high-level synthesis (HLS) tools make it easier to create coprocessors that increase performance by several orders of magnitude, these approaches have their limits. HLS tools must infer parallel execution from serial code, so they face the same challenges as parallelizing compilers: Analyzing pointers in free-form code is difficult, memory parallelism is often scarce, and it is often difficult to extract and formulate efficient parallel schedules for critical loops. Frequently, parallelization is only possible after significant human refactoring of the underlying algorithms (e.g., [1], [8], [9]).
Because current HLS tools focus on performance above all else, these tools can only save energy on code they can accelerate. However, power and energy concerns are becoming increasingly dominant constraints for all executed code, and HLS techniques should be able to significantly reduce energy consumption even when they cannot improve performance. This paper describes a new approach to HLS that focuses on building custom coprocessors that increase energy efficiency for unmodified C code, regardless of whether acceleration is possible. We call these coprocessors Irregular Code Energy Reducers, or ICERs. The ICER toolchain does not rely on parallelization techniques to build efficient hardware. As a result, a design can incorporate ICERs for any code in which it spends a large fraction of execution time, regardless of whether the code contains extractable parallelism. We envision ICERs working along with conventional accelerators to maximize both performance and efficiency for complex FPGA-based systems.
We evaluate ICERs using a collection of large, hard-to-parallelize, irregular programs such as a graph flow solver, search, and a B-tree implementation. Our results show that, relative to a baseline system with soft processor cores, ICERs can increase energy efficiency of individual functions by up to 9.5×. For whole applications, ICERs reduce energy consumption by 2.8×. ICER performance is almost identical to soft core performance, on average.
II. ARCHITECTURE OVERVIEW
The ICER toolchain automatically converts application source code into a hardware-software partitioned system consisting of one or more ICERs integrated with a soft core. It profiles input applications and selects regions of code for conversion into hardware based on dynamic execution coverage. Unlike conventional C-to-FPGA design flows, our toolchain's primary goal is energy efficiency rather than performance. This shift in focus allows us to support a wider range of programming constructs than conventional accelerator design flows, such as arbitrary pointers and recursion.
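The coverage-based selection step can be illustrated with a minimal sketch. The record layout, the function name `select_regions`, and the percentage threshold are assumptions made for illustration; the paper does not state the actual selection criterion beyond dynamic execution coverage.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical profile record: names and fields are illustrative only. */
typedef struct {
    const char *name;      /* function or code region */
    unsigned long cycles;  /* dynamic cycles attributed to the region */
} region_profile;

/* Mark every region whose share of total cycles is at least min_percent.
 * Writes 1 or 0 into selected[i] and returns the number of selected regions. */
size_t select_regions(const region_profile *p, size_t n,
                      unsigned min_percent, int *selected)
{
    unsigned long total = 0;
    size_t count = 0;

    assert((p != NULL && selected != NULL) || n == 0);
    for (size_t i = 0; i < n; i++)
        total += p[i].cycles;

    for (size_t i = 0; i < n; i++) {
        /* cycles / total >= min_percent / 100, kept in integer arithmetic */
        selected[i] = total != 0 &&
                      p[i].cycles * 100UL >= (unsigned long)min_percent * total;
        count += (size_t)selected[i];
    }
    return count;
}
```

With a 20% threshold, for example, a profile of 700, 250, and 50 cycles would select the first two regions and leave the third on the soft core.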
In this section, we first describe a system architecture integrating ICERs with a soft core processor and its memory hierarchy. Then, we give an overview of the automatic selection and generation of ICERs.
A. System Architecture
Figure 1 shows a block diagram of an ICER-enabled system. The CPU controls the ICERs and executes code that no ICER covers. The CPU and ICERs share the L1 data cache. Below, we describe the soft core and ICER components, their interfaces, and the execution and memory model for the system.
to a smaller fabric. When the profiler selects a function to convert into hardware, the compiler will insert stubs that enable the runtime to select between using the ICER or executing the code on the soft core.
Execution model
When an ICER finishes execution or raises an exception, control transfers back to the soft core. The soft core extracts the cause from the ICER exception register and executes an appropriate software handler. The soft core has access to all internal state in the ICER via a secondary interface and can re-initialize the ICER to resume execution starting in an arbitrary control state. As a result, the toolchain supports code that contains non-inlineable function calls, such as dynamically linked library and system calls.
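The following sketch models this hand-off in software. Everything in it is a stand-in: the `icer_model` structure, the cause codes, and the toy behavior of `icer_run` are assumptions chosen to show the control pattern (run, inspect the exception register, handle in software, resume in a chosen control state), not the actual ICER interface.

```c
#include <assert.h>

/* Hypothetical exception causes; the real encoding is not given in the paper. */
enum icer_cause { ICER_DONE = 0, ICER_CALL = 1 };

/* Software stand-in for the ICER state visible to the soft core. */
typedef struct {
    int control_state;  /* control block in which execution (re)starts */
    int cause;          /* models the ICER exception register */
    long acc;           /* one live value the soft core can read and write */
} icer_model;

/* Toy "hardware": adds 1 in states 0, 1, and 3, and raises ICER_CALL in
 * state 2 to model a call that could not be inlined. */
static void icer_run(icer_model *m)
{
    while (m->control_state < 4) {
        if (m->control_state == 2) {
            m->cause = ICER_CALL;
            return;
        }
        m->acc += 1;
        m->control_state++;
    }
    m->cause = ICER_DONE;
}

/* Hypothetical software handler standing in for a library or system call. */
static long external_call(long x) { return x * 10; }

/* Stub on the soft core: start the ICER, service exceptions, resume. */
long run_with_icer(long start)
{
    icer_model m = { 0, ICER_DONE, start };
    for (;;) {
        icer_run(&m);                  /* soft core sleeps until control returns */
        if (m.cause == ICER_DONE)
            return m.acc;
        assert(m.cause == ICER_CALL);
        m.acc = external_call(m.acc);  /* execute the call in software */
        m.control_state = 3;           /* resume in the state after the call */
    }
}
```

The key property the sketch tries to capture is that the soft core may restart the ICER in an arbitrary control state after patching its internal values, which is what lets non-inlineable calls remain in software.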
Memory
As Figure 1 shows, the ICERs and the soft core share the coherent L1 data cache and they use the same address space. The ICERs ensure compatibility by splitting basic blocks into control blocks containing at most one memory operation and activating only one control block at a time, guaranteeing that memory operations execute in the correct order. If a memory operation takes multiple cycles, execution of the current block stalls until the memory request completes.
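As an illustration of the splitting rule, the sketch below assigns the operations of one basic block to control blocks so that no control block holds more than one memory operation. The greedy policy (start a new control block just before the second memory operation) and all names are assumptions; the paper states only the at-most-one-memory-operation constraint.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { OP_ALU, OP_LOAD, OP_STORE } op_kind;

/* Assign each operation of a basic block, in program order, to a control
 * block index such that every control block contains at most one memory
 * operation. Returns the number of control blocks produced. */
size_t split_block(const op_kind *ops, size_t n, size_t *block_id)
{
    size_t current = 0;
    int has_mem = 0;  /* does the current control block already hold a memory op? */

    assert((ops != NULL && block_id != NULL) || n == 0);
    for (size_t i = 0; i < n; i++) {
        int is_mem = (ops[i] != OP_ALU);
        if (is_mem && has_mem) {  /* a second memory op forces a new control block */
            current++;
            has_mem = 0;
        }
        if (is_mem)
            has_mem = 1;
        block_id[i] = current;
    }
    return n ? current + 1 : 0;
}
```

Because the control blocks are numbered in program order and only one is active at a time, the memory operations reach the cache in their original order.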
Soft core
We use an energy-efficient, pipelined MIPS processor as our soft core processor. The MIPS processor core derives from the MIT Raw [17] processor and has an eight-stage, in-order, single-issue pipeline. The core includes an L1 data cache, instruction memory, and the Raw network-on-chip router for one Raw tile. Microbenchmarks show that the Raw processor soft core operates within ∼20% of the dynamic power of a similarly configured MicroBlaze, with comparable or better instruction throughput.
B. ICER Interface
Fast invocation makes it profitable to build ICERs even for small functions. The toolchain inserts wrappers around each selected region. These invoke the ICERs, passing any global variables by reference as additional input arguments. The coherent memory interface makes marshalling costs similar to function invocations. To transfer control to an ICER, the soft core passes up to eight arguments over the secondary network, starts the ICER, and goes to sleep in a clock-gated state.

Table I
Benchmark     Description                       Suite      LOC
bzip2 [16]    Data compression algorithm        SPEC 2000  7625
cjpeg [6]     JPEG image compression            EEMBC      13272
mcf [16]      Single-depot vehicle scheduling   SPEC 2000  2478
radix [21]    Sorting algorithm                 SPLASH-2   895
viterbi [5]   Viterbi decoder                   EEMBC      11154
b-tree [3]    Range traversal on a b-tree       IBS        222
C. ICER Generation
The ICER toolchain makes use of the OpenIMPACT (1.0rc4) [13] and LLVM (2.4) [10] compiler infrastructures to select and transform code regions into ICERs. It accepts all C programs that the above tools accept, including programs with arbitrary pointer references, gotos, switch statements, and loops with complex conditions. The ICER toolchain uses inlining to remove function call overhead where possible.
By design, the ICER datapath and control closely resemble the data and control flow graphs of the original C code as expressed in OpenIMPACT's Lcode intermediate representation, although our toolchain splits basic blocks containing multiple memory operations. This allows for simple semantics when transferring control between an ICER and the soft core during the ICER's execution. Every static operator in the intermediate representation becomes a dedicated functional unit, and every basic block live-out value becomes a register. Memory operations within an ICER share a single time-multiplexed cache interface.
The toolchain also constructs a control unit alongside the datapath that follows the control flow of the software computation. This control unit activates one basic block per cycle, tracking the transitions between basic blocks via branch outcomes. For multi-cycle and variable-latency operations, the control unit remains in the current active basic block until the operation completes.
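The behavior of such a control unit can be approximated by a small software model. The `ctrl_block` layout, the use of -1 as an exit marker, and the per-block latency field are assumptions for illustration; the generated hardware is a state machine, not a C loop.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical description of one control block. */
typedef struct {
    unsigned latency;      /* cycles the block stays active (at least 1) */
    int taken_next;        /* successor if the branch is taken, -1 to exit */
    int fallthrough_next;  /* successor otherwise, -1 to exit */
} ctrl_block;

/* Walk the control flow with exactly one block active at a time.
 * branch_taken[k] is the branch outcome of the k-th executed block.
 * Returns the total number of cycles spent. */
unsigned long simulate_control(const ctrl_block *b, int entry,
                               const int *branch_taken, size_t n_outcomes)
{
    unsigned long cycles = 0;
    int cur = entry;
    size_t k = 0;

    while (cur >= 0 && k < n_outcomes) {
        assert(b[cur].latency >= 1);
        /* Multi-cycle operations keep the current block active. */
        cycles += b[cur].latency;
        cur = branch_taken[k++] ? b[cur].taken_next : b[cur].fallthrough_next;
    }
    return cycles;
}
```

A block with a latency of one models the common single-cycle case, while a larger latency models a stall on a multi-cycle or variable-latency operation such as a cache miss.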
III. RESULTS
In this section, we discuss the benchmarks we use to evaluate ICERs, describe our experimental methodology, and analyze the impact of using ICERs on performance and efficiency for both the targeted function and whole application.
A. Benchmarks
Table I describes the six irregular applications we use. They come from SPEC 2000, EEMBC, SPLASH-2, and IBS (an Irregular Benchmark Suite). The benchmarks perform irregular, non-parallelizable computation, including data compression, sorting, and data-dependent graph traversals. The average size of an input benchmark program is 5,941 lines of code (excluding headers).
We used the ICER toolchain to automatically generate nine ICERs. Table I shows statistics for the resulting hardware. The coverage column measures the percentage of execution time spent in the code regions converted into ICERs. Clock frequencies for the ICERs range from ∼80 to 149 MHz, matching or exceeding the soft core synthesis frequency of 80 MHz. FPGA resource usage (slice registers, LUTs, DSP48Es, etc.) varies from roughly one-third to two times that of a single soft core.
B. Methodology
We synthesized ICERs using the Xilinx toolflow for the Virtex 5 family of FPGAs. The specific device targeted was an xc5vlx110t-ff1136-3. We synthesized the soft core using Synopsys Synplify followed by mapping, placement, routing, and optimizations using Xilinx tools.
We use a cycle-accurate simulation infrastructure based on btl [17]. When the toolchain generates ICERs it also generates models of the new hardware for the cycle-accurate system simulator.
To measure ICER power usage, our simulator traces all ICER inputs and outputs for sample periods of 10,000 cycles. From each trace, we create a testbench to drive a post-place and route model using the Synopsys VCS (C-2009.06) logic simulator. This generates VCD activity files, which we use as inputs to Xilinx XPower. A similar process generates power numbers for the soft core using samples from equivalent portions of software execution.
C. ICER Performance and Efficiency
Figure 2 shows the energy-delay product (EDP) improvement, speedup, and energy of ICER-enabled systems and ICERs, normalized to the MIPS soft core. ICERs use up to 9.5× less energy (5.3× on average) for the regions of code that they target. They do this while maintaining comparable or better levels of performance, resulting in an average EDP improvement of 5.1×.
Performance trends for entire applications are very similar to those for the targeted functions. Except for b-tree, ICER-based system performance is comparable to or better than soft core performance. For both b-tree and mcf, a lack of memory pipelining in ICERs limits performance. At the application level, energy gains are highly correlated with application coverage, because of the soft core's high clock tree energy and greater BRAM energy. Across all benchmarks, ICERs use 2.27× less energy, improving EDP by 2.32×. Code regions with poor memory performance show the largest energy improvements, with mcf achieving a 9.5× improvement for covered code.
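The relationship between these metrics can be made concrete with a short sketch. Since EDP is the product of energy and delay, the EDP improvement equals the energy reduction multiplied by the speedup. The coverage model in `app_energy_reduction` is an Amdahl-style simplification introduced here as an assumption (it treats coverage as the fraction of baseline energy spent in covered code and ignores invocation overhead); it is not the paper's measurement method.

```c
#include <assert.h>

/* EDP = energy * delay, so improvements multiply. */
double edp_improvement(double energy_reduction, double speedup)
{
    return energy_reduction * speedup;
}

/* Speedup implied by a reported EDP improvement and energy reduction. */
double implied_speedup(double edp_gain, double energy_reduction)
{
    assert(energy_reduction > 0.0);
    return edp_gain / energy_reduction;
}

/* Simplified application-level energy reduction: a fraction `coverage`
 * of baseline energy shrinks by `region_reduction`; the rest is unchanged. */
double app_energy_reduction(double coverage, double region_reduction)
{
    assert(coverage >= 0.0 && coverage <= 1.0 && region_reduction > 0.0);
    return 1.0 / ((1.0 - coverage) + coverage / region_reduction);
}
```

Under this reading, the reported averages for covered code (5.3× energy, 5.1× EDP) imply an average speedup slightly below 1, consistent with performance that is close to that of the soft core. The simplified model also shows why application-level gains track coverage: with no coverage the reduction is 1×, and it approaches the per-region reduction only as coverage approaches 100%.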
Figure 2 (bottom) shows the breakdown of component energy (block RAMs, DSP, wires, logic, and clock) across the workload for the soft core MIPS processor, combined system, and ICERs in isolation. In every case, the largest component is clock tree energy. ICERs greatly reduce clock energy and all but eliminate block RAM energy for the code that they target. Even at the application level, ICERs reduce clock energy by half. ICERs provide great savings here, but the clock still accounts for a large fraction, highlighting the importance of clock gating the soft core and inactive ICERs. It also showcases how the ICER execution model is an excellent fit for FPGAs: Since only one basic block in an ICER is active at a time, the synthesis tools can clock-gate ICERs more aggressively to take advantage of their very low duty cycles.

Figure 2. ICER EDP improvement, speedup, and energy breakdown. ICERs significantly improve energy-delay product (top, higher is better), maintaining performance (middle, higher is better), while greatly reducing energy (bottom, lower is better) compared to a soft core MIPS processor. Bars labeled 'App' report values for the whole benchmark, and bars labeled 'ICER' correspond to code covered by the ICER. Energy and EDP improvements are closely correlated with application coverage.
IV. RELATED WORK
This section compares ICERs with previous efforts in high-level synthesis, custom coprocessor design, and other FPGA-based accelerator platforms.
High-level synthesis
High-level synthesis research has been going on for several decades leading to a variety of commercial tools, as detailed in a recent book [4]. The primary goal of C-to-silicon synthesis frameworks such as AutoESL [23], Impulse C [7], Synopsys Synphony/PICO [15], CHiMPS [14], and Altera C2H [11] is to reduce the effort that creating accelerators requires, by building them directly from a high-level language. To accelerate execution, these tools must either infer parallel execution from serial code or force the programmer to rewrite their code in a more explicitly parallel language or dialect [18]. Because of this, they face the same challenges as parallelizing compilers. In addition, acceleration typically requires a parallel memory system that is difficult to integrate with existing serial soft cores. Because of the difficulty of these challenges, existing tools tend to compromise on automation and backward compatibility. In contrast, ICERs focus on energy first and performance second. This allows the approach to be completely automated, achieve high execution coverage, retain backward compatibility, and save energy on arbitrary code.
Reconfigurable substrates
Several related efforts examine the benefits of coupling non-commodity reconfigurable fabrics with a processor core for program acceleration. GARP [2] and Chimaera [22] were early works that proposed automated approaches for offloading execution to reconfigurable fabrics integrated with a hard core. Tartan [12] examined the implications of mapping entire programs onto a hierarchical coarse-grained asynchronous fabric. Warp [19] performs dynamic translation of binaries to a specialized FPGA substrate optimized for on-the-fly synthesis, but employs an additional soft core to run the high-performance synthesis infrastructure. Conservation cores [20] have recently been proposed to create energy-efficient ASICs for irregular applications, but have limited reconfigurability.
V. CONCLUSION
We have presented ICERs, customized logic circuits that reduce the dynamic power of FPGA system components traditionally run on soft cores. Tight coupling with a soft processor, including sharing of the L1 data cache and support for arbitrary control transitions between the soft core and ICER, allows ICERs to be drop-in replacements for the code they implement. This greatly eases system-level design and testing complexity, and allows for full automation of both ICER construction and system integration with no programmer intervention. ICERs retain the performance of the soft cores they replace, but reduce compute energy by 5.3× and improve EDP by 5.1×.
