With shorter time-to-market and the rising cost of SoC development, the demand for post-silicon programmability has been increasing. Recently, programmable accelerators have attracted attention as an enabling solution for post-silicon engineering change. However, programmable accelerators are 5∼10X less energy-efficient than fixed-function accelerators, mainly due to their extensive use of memories. This paper proposes a highly energy-efficient accelerator which enables post-silicon engineering change through a control patching mechanism. We then propose a patch compilation method which takes a pair of an original design and a modified design. Experimental results demonstrate that the proposed accelerators offer energy efficiency competitive with fixed-function accelerators and can achieve about 5X higher efficiency than existing programmable accelerators.
INTRODUCTION
High-level synthesis has become a key technology in SoC development for achieving a short turn-around time and a low design cost. Fixed-function accelerators synthesized by such a technology offer 100∼1000X higher energy efficiency than general-purpose processors, and hence are used in many embedded application domains to meet both high performance and high energy efficiency requirements. For example [2], an ASIC implementation of an OFDM receiver, one of the central technologies in next-generation mobile phones, can achieve an efficiency of 200GOPS/W (5pJ/op) in a 90nm technology. On the other hand, efficient embedded processors achieve 4GOPS/W (250pJ/op, 50X more energy than the ASIC) and mobile general-purpose processors achieve 0.04GOPS/W (25nJ/op, 5,000X more energy than the ASIC). Thus, fixed-function accelerators are becoming increasingly important. According to the ITRS 2009 Update [11], an SoC will have more than 1,000 accelerators in the next decade (Figure 1).
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CODES+ISSS '11, October 9-14, 2011
Due to the extremely high non-recurring-engineering costs of ASIC development, the engineering change (EC) methodology has been utilized to make design changes at the very end of the design process. Engineering changes typically take place due to bug fixes and design specification changes. Figure 2 shows an example of an engineering change [4] from FAAD2 [3], an open-source Advanced Audio Coding (AAC) decoder. From Version 1.33 to Version 1.34, a bug is fixed by changing the signs of expressions.
Even though an engineering change affects only a limited portion of a design, any post-silicon EC requires a significant amount of design and fabrication effort. Consequently, the demand for post-silicon programmability has been increasing. Recently, programmable accelerators have attracted attention as an enabling solution for post-silicon engineering change. However, programmable accelerators suffer from low energy efficiency, mainly due to the extensive use of memories in their controllers and centralized register files.
In this paper, we propose a novel patchable accelerator which can achieve both high performance and high energy efficiency. Since an EC affects only a limited portion of a design, we employ a patching mechanism instead of a horizontal microcoded controller. Figure 3 shows the design flow of the proposed patchable accelerator. Given a high-level description of an original design, a patchable accelerator consisting of a custom datapath, a hardwired FSM and a patch FSM is synthesized. When an engineering change takes place due to bug fixes or specification changes after fabrication of the chip, patch data is compiled from the modified design description. By loading the data onto the patch memory in the patchable accelerator, the accelerator behaves as described in the modified design description. Thus, a respin (i.e., a re-fabrication of the chip) can be avoided, and hence time-to-market and development cost can be dramatically reduced. Since the proposed accelerator is an enhancement of a fixed-function accelerator, the proposed approach can be applied to any fixed-function accelerator generated by typical high-level synthesis tools. The tradeoff between efficiency and programmability can be controlled by the amount of patch memory. We also propose a patch compilation technique which takes a pair of an original design and a modified design. Experimental results demonstrate that the proposed accelerator achieves higher energy efficiency than existing programmable accelerators. The main contributions of this paper are as follows.
• A novel energy-efficient patchable accelerator enabling post-silicon ECs (Section 3).
• A practical patch compilation method (Section 4).
• A comparison of energy efficiency between the proposed accelerator and the existing accelerators (Section 5). 
RELATED WORK
In this section, we review the prior work in the following two categories:
Programmable Accelerators
There have been several attempts to introduce programmability to fixed-function accelerators. The No-Instruction-Set Computer (NISC) [9] consists of a programmable controller and a custom datapath. The controller is a horizontal microcoded controller consisting of a control memory and a state register. The functionality of the accelerator can be modified by changing the content of the control memory. Since highly-customized datapaths have limited connections between FUs and registers, centralized register files are introduced to increase flexibility. The Programmable Loop Accelerator (PLA) [5] offers further flexibility by introducing several techniques such as a MOV operation, a global bus, and port swapping. Although these programmable accelerators are shown to be more efficient than embedded processors, they are still 5∼10X less efficient than fixed-function accelerators [4]. This is mainly due to their extensive use of memory in the controller and the centralized register files, as shown in our experimental results.
Energy-Efficient Processors
As modern processors continue to integrate more processors on a chip, the utilization wall problem [13] has emerged as a critical issue which limits the fraction of the processors that can be used at the same time. To overcome this problem, energy-efficient processors such as ELM [2] and AnySP [14] have been proposed. Also, the HiveLogic Platform from Silicon Hive [12] provides a highly-customized VLIW architecture with distributed register files. While these processors can cover a broad spectrum of applications, they are still less efficient than a fixed-function accelerator for a specific application.
More recently, conservation cores [13] have been proposed to reduce the energy further by incorporating specialized accelerators tightly coupled with each processor. Also, Hameed et al. provided a thorough study of the sources of inefficiency in processors and concluded that processors can achieve ASIC-like efficiency by bringing in application-specific customized accelerators [6]. These studies clearly demonstrate that application-specific customization is the key to achieving high energy efficiency.
ACCELERATOR ARCHITECTURE

Fixed-Function Accelerator
In this section, we first describe a fixed-function accelerator, since our accelerator is an enhancement of it. Figure 4 (a) shows an architecture template of a fixed-function accelerator. It consists of functional units (FUs), interconnects between FUs, and a hardwired controller. Each FU performs a pre-defined set of operations according to its type. Typical FU types are an adder (ADD), a subtractor (SUB), a multiplier (MUL), a comparator (CMP), and a shifter (SHFT). A local store (LS) is a RAM which holds the values of arrays and global variables. An LS is also used to communicate with external hardware components. An LS has two types of ports: one for the address and the other for the data. A register (R1, R2, . . . ) holds the value of a variable. We model every write or read port of an LS and a register as a distinct FU. Thus, any access to a memory unit can be scheduled and bound in the same way as other functional units. These FUs are connected by a sparse point-to-point interconnect consisting of multiplexers (MUXes) and wires. Each FU input is connected to either the output of a functional unit or the output of a multiplexer. Similarly, each FU output is connected to some inputs of FUs and multiplexers through a wire. The inputs of a multiplexer are connected to outputs of functional units. Using multiplexers, each FU input can select its input signal. The hardwired controller is a hardwired logic implementation of a finite state machine (FSM) which generates the control signals for FUs and multiplexers. The controller has an input which determines the state transition, which realizes an if-then-else control mechanism.
Programmable Accelerator
An architecture template of a programmable accelerator is shown in Figure 4 (b). Compared to a fixed-function accelerator, a hardwired controller is replaced with a programmable controller which is a horizontal microcoded controller consisting of an instruction memory and a program counter which holds the current state. Each address in the memory corresponds to a state and the corresponding data includes the control signals for the state as well as the set of next states. Also, a register file and a constant file (i.e. constant generator) are introduced to increase the flexibility. In this way, a programmable accelerator extensively uses memories to offer high programmability and flexibility.
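The controller organization described above can be sketched in a few lines of Python. This is only an illustrative model under stated assumptions: the names `MicroInstr` and `MicrocodedController`, the 4-bit control words, and the two-way branch encoding are hypothetical and are not taken from NISC, PLA, or the accelerators evaluated in this paper.

```python
from typing import NamedTuple

class MicroInstr(NamedTuple):
    control: int        # horizontal control word, one bit per control signal (hypothetical encoding)
    next_if_false: int  # next state when the condition input is 0
    next_if_true: int   # next state when the condition input is 1

class MicrocodedController:
    """Instruction memory indexed by a program counter that holds the current state."""

    def __init__(self, memory):
        self.memory = list(memory)
        self.pc = 0

    def step(self, cond=False):
        # Read the control word of the current state, then follow one of the stored next states.
        instr = self.memory[self.pc]
        self.pc = instr.next_if_true if cond else instr.next_if_false
        return instr.control

# Hypothetical three-state program; state 1 branches back to state 0 when cond is true.
program = [
    MicroInstr(0b0001, 1, 1),
    MicroInstr(0b0110, 2, 0),
    MicroInstr(0b1000, 0, 0),
]
ctrl = MicrocodedController(program)
trace = [ctrl.step(cond) for cond in (False, True, False, False, False)]
```

In this model, changing the accelerator's behavior amounts to rewriting entries of `memory`, which is why every state, modified or not, pays for a memory access.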
Proposed Patchable Accelerator
Although programmable accelerators offer high programmability, full programmability is unnecessary, particularly for engineering changes. The basic idea of our accelerator architecture is to offer the minimum programmability needed for engineering changes by incorporating a small amount of memory into a fixed-function accelerator. The proposed accelerator can achieve high energy efficiency by using a hardwired controller for unmodified states. In addition, the degree of programmability can be controlled by the amount of extra memory.
As mentioned earlier, our proposed accelerator is an enhancement of a fixed-function accelerator. An architecture template of a patchable accelerator is shown in Figure 5. To enable post-silicon engineering changes, we add two components: a patch logic and a register file. The patch logic modifies the control signals of some FSM states. The details are explained in the next section. A multiport register file (RF) is used when the datapath registers are not available. The patch logic and the register file are connected to the fixed-function accelerator through global buses, which is a similar idea to PLA [4]. There are two global buses: a control bus for patching control signals and a data bus for the communication between FUs and the register file. As shown in Figure 5, the data bus adds only one input to each multiplexer, so the performance overhead is low. Furthermore, FUs may be enhanced to increase the flexibility. For example, one of the adders in Figure 5 is replaced with an adder/subtractor. If the original datapath lacks essential types of FUs, they may be added and connected through the data bus.
We would like to note that loop accelerators such as PICO-NPA [10] and PLAs [5] have slightly different architectures to execute software-pipelined loops. Though this paper focuses on a simplified architecture in Figure 5 for ease of explanation, an extension to such loop accelerators should be straightforward.
Patch Logic
The patch logic can modify the control signals of several states. For unpatched states, the control signals are generated by the hardwired controller. As shown in Figure 6, the patch logic consists of a state patching stage and a control patching stage. Suppose that the hardwired controller implements the states $\{s_1, \ldots, s_n\}$ and the control signal memory contains the control signals for the states $\{s_{n+1}, \ldots, s_{n+m}\}$. The state patching stage converts a subset of the hardwired controller states to the patch memory states $\{s_{n+1}, \ldots, s_{n+m}\}$. If the converted state is a patch memory state, the control signals are generated from the patch memory.
Using the example in Figure 7, we explain the patching mechanism. Suppose that the datapath has two ALUs and one multiplier. The initial dataflow graph (DFG) is shown in Figure 7 (a), and its scheduling result in Figure 7 (b). The hardwired controller implements s1, s2 and s3, and the state transition is s1→s2→s3→s1→ · · · . The dataflow graph after the EC is shown in Figure 7 (c), where a subtraction is changed into a multiplication. Since this change corresponds to s2, the state needs to be re-scheduled. Thus, a new state s4 is introduced in the patch logic as shown in Figure 7 (d). The scheduling result is stored in the control signal memory, and s2 is converted to s4 as shown in Figure 7 (e). Since the input of the operation n5 has been changed, the state s3 is also re-scheduled. After patching, the state transition is s1→s4→s5→s1→ · · · . To prevent performance degradation due to the patching mechanism, the controller is pipelined by introducing control signal registers at the output, as shown in Figure 6. Such control-pipelined execution can be achieved by scheduling every branch instruction one control step ahead.
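The two patching stages can be illustrated with a small behavioral model of the Figure 7 example. This is a sketch under assumptions, not the circuit of Figure 6: the class name `PatchableController`, the string-valued control words, and the choice to let the hardwired FSM keep sequencing s1→s2→s3 while only the emitted state and control word are substituted are all hypothetical. Branches and the control signal pipeline registers are not modeled.

```python
# Hardwired FSM of the original design (states and transitions from the Figure 7 example).
HARDWIRED_NEXT = {"s1": "s2", "s2": "s3", "s3": "s1"}
# Placeholder control words; real control words would drive FUs and multiplexers.
HARDWIRED_CONTROL = {"s1": "ctl_s1", "s2": "ctl_s2", "s3": "ctl_s3"}

class PatchableController:
    def __init__(self, state_patch=None, patch_memory=None):
        self.state_patch = dict(state_patch or {})    # state patching stage: hardwired state -> patch state
        self.patch_memory = dict(patch_memory or {})  # control signal memory: patch state -> control word
        self.state = "s1"

    def step(self):
        # State patching stage: convert the hardwired state if a patch exists for it.
        effective = self.state_patch.get(self.state, self.state)
        # Control patching stage: patched states read the patch memory,
        # unpatched states keep using the hardwired control signals.
        if effective in self.patch_memory:
            control = self.patch_memory[effective]
        else:
            control = HARDWIRED_CONTROL[effective]
        self.state = HARDWIRED_NEXT[self.state]
        return effective, control

unpatched = PatchableController()
unpatched_trace = [unpatched.step() for _ in range(4)]

patched = PatchableController(
    state_patch={"s2": "s4", "s3": "s5"},
    patch_memory={"s4": "ctl_s4", "s5": "ctl_s5"},
)
patched_trace = [patched.step() for _ in range(4)]
```

With empty patch tables the model behaves exactly like the hardwired controller, which mirrors the claim that unpatched states incur no patch memory access.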
PATCH COMPILATION METHOD
Overall Flow
The design flow of patchable accelerators is shown in Figure 8. Given an original design description of an application, a fixed-function accelerator is generated by high-level synthesis. Then, a patchable accelerator is generated by enhancing the fixed-function accelerator as explained in Section 3.3. When an engineering change takes place due to bug fixes or specification changes after fabrication of the chip, the original design description is modified according to the engineering change. Then, a difference CDFG, which will be explained in the next section, is computed by analyzing the textual difference between the original design description and the modified design description. The patch compiler takes the difference CDFG and the accelerator architecture as inputs and compiles the patch memory data. By loading the data onto the patch memory in the patchable accelerator, the accelerator behaves as described in the modified design description. The remainder of this section explains the patch compilation method. Basically, the patch compilation is performed by incrementally scheduling and binding each modified operation. We first formulate the incremental scheduling and binding problem and then explain how to solve the problem in detail.
Problem Formulation
Given a high-level description of an application, a control data flow graph (CDFG) is constructed by analyzing the description. It is assumed that the underlying expressions are in static single assignment (SSA) form. A CDFG consists of a control flow graph (CFG) $G_C = (V_C, E_C)$ and a data flow graph (DFG) $G_D = (V_D, E_D)$. A CFG consists of control nodes $V_C$ and control edges $E_C$, where each control node corresponds to a basic block and each control edge represents a control flow between two control nodes. A basic block includes one or more operation nodes and does not include any conditional execution. A DFG consists of operation nodes $V_D$ and data edges $E_D$, where each operation node corresponds to an operation and each data edge represents a data dependency between operations.
Design descriptions before and after EC are represented as a Difference-CDFG (Δ-CDFG), which is a single CDFG structure combining the two CDFGs before and after EC. In a Δ-CDFG, the set of operation nodes $V_D$ is a union of four disjoint sets, $V_D = V_F \cup V_N \cup V_R \cup V_M$: a set of unmodified operations $V_F$, a set of added operations $V_N$, a set of removed operations $V_R$, and a set of modified operations $V_M$. A modified operation is an operation such that any of its inputs is a newly-added operation. Hence, a modified operation requires neither re-scheduling nor re-binding, but the corresponding control signals need to be modified. Now, the set of operations before EC is $V_F \cup V_R \cup V_M$ and the set of operations after EC is $V_F \cup V_N \cup V_M$. For example, the operations in Figure 7 are partitioned as follows: $V_F$ = {n1, n2, n4}, $V_N$ = {n6}, $V_R$ = {n3} and $V_M$ = {n5}. The CFG in a Δ-CDFG is equivalent to the CFG after EC. The operations which do not exist after EC, $V_R$, do not have any corresponding basic block. Similarly, the data edges $E_D$ in a Δ-CDFG are equivalent to the data edges of the DFG after EC. That is, the removed operations $V_R$ have neither incoming edges nor outgoing edges. Those edges are unnecessary because the removed operations will be removed during the patch compilation.
[Figure 9: Schedule-And-Bind($G_C$, $G_D$, D, S, B, R) algorithm listing]
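The partition of operation nodes defined above can be computed directly from the two DFGs. The following Python sketch is an illustration only: the function name `partition_operations` and the dictionary representation are hypothetical, and the data edges in the example are assumed, since the text gives the node partition of Figure 7 but not its edges.

```python
def partition_operations(before, after):
    """Split operations into (unmodified, added, removed, modified).

    `before` and `after` map each operation name to the set of operations
    feeding its inputs (the data edges entering the node).
    """
    added = set(after) - set(before)      # V_N: exist only after EC
    removed = set(before) - set(after)    # V_R: exist only before EC
    common = set(before) & set(after)
    # V_M: surviving operations with at least one input driven by an added operation.
    modified = {n for n in common if after[n] & added}
    unmodified = common - modified        # V_F
    return unmodified, added, removed, modified

# Assumed edges: n3 (subtraction) is replaced by n6 (multiplication), which feeds n5.
before = {"n1": set(), "n2": set(), "n3": {"n1", "n2"}, "n4": set(), "n5": {"n3", "n4"}}
after = {"n1": set(), "n2": set(), "n6": {"n1", "n2"}, "n4": set(), "n5": {"n6", "n4"}}

v_f, v_n, v_r, v_m = partition_operations(before, after)
```

Under these assumed edges the result reproduces the partition stated for Figure 7. Note that a change of operation type is represented here as one removed node plus one added node, as in that example.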
Incremental Scheduling and Binding
Our algorithm performs scheduling, FU binding and register binding concurrently. For each operation node n ∈ $V_D$, the scheduler finds the state in which n is executed, the FU binder finds the FU which executes the operation of n, and the register binder finds the register which stores the result of n. Although the algorithm as presented does not deal with operation chaining for ease of explanation, operation chaining can be supported in a straightforward manner, and our implementation of the patch compiler does perform it. Since the patch memory and the registers in the register file are limited resources, it is preferable to find a schedule and a binding which minimize the usage of these resources. To achieve this goal, the proposed scheduling algorithm shown in Figure 9 is based on Swing Modulo Scheduling [8]. Swing modulo scheduling finds a schedule in which the critical path is prioritized first and the variable lifetime is minimized second. Therefore, the numbers of patched states and used registers in the register file can be minimized. Note that the present algorithm does not perform modulo scheduling, i.e., software pipelining. However, it can be extended to do so in a straightforward manner.
First, the removed operations are all unscheduled and unbound (Lines 1-2) so that their scheduling slots become available. Also, the modified operations are scheduled and bound to newly-created states in the patch memory (Lines 3-4). For each basic block B, SMS-Sort() determines the scheduling order of the operation nodes in B using the swing modulo scheduling algorithm [8] (Lines 5-6). For each operation node n in the sorted order, Available-Slots() finds a set of states S in which n can be scheduled (Lines 7-8). Scan-Direction() determines the direction in which the states in S are scanned. For each state in that direction, the binding is performed (Lines 10-13).
If no binding is found, a new state is created (New-State()) in the patch memory and binding is performed again (Lines 14-16). Finally, the patch memory data is generated (Generate-Patch-Data()). For each state, a control word is generated according to the FU and register binding. Figure 10 shows the FU and register binding algorithm. Given a scheduled operation node n, Available-FUs() finds a set of FUs which can be bound to n. Then, Sort-FUs() sorts the FUs in ascending order of their binding costs. The binding cost of an operation node n to an FU f is the number of registers required in the register file when n is bound to f. For each FU f in the sorted order, n is bound to f (Lines 3-4). For each input or output m of n, we check whether all the dependents are already scheduled. If so, a register is bound to store the value. If there is no individual register available, a register in the register file is bound. If some dependents are not scheduled yet, the register binding is inserted into a pending queue and is performed again once all the dependents are scheduled. If the register binding is successful for all inputs and outputs, the procedure returns to Schedule-And-Bind(). Otherwise, binding to the other FUs is attempted.
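The core loop of the procedure described above (search the available slots, try the FUs in order, and fall back to a new state) can be sketched as follows. This is a heavily simplified illustration and not the algorithm of Figures 9 and 10: the function `place_operation` is hypothetical, a simple earliest-first scan replaces the swing modulo ordering and Scan-Direction(), FUs are tried in name order instead of by register binding cost, register binding is omitted, and a state is simply marked as patched whenever its control word changes. The original schedule and FU binding in the example are assumed.

```python
def place_operation(op, op_type, preds, succs, schedule, binding, fus, n_states, patched):
    """Schedule and bind one added operation; return the (possibly larger) state count.

    schedule: op -> state index, binding: op -> FU name (both updated in place)
    fus:      FU name -> set of operation types that FU can execute
    patched:  set of state indices whose control word must come from the patch memory
    """
    earliest = max((schedule[p] + 1 for p in preds), default=0)
    latest = min((schedule[s] - 1 for s in succs), default=n_states - 1)
    occupied = {(schedule[o], binding[o]) for o in schedule}
    for state in range(earliest, latest + 1):
        for fu in sorted(fus):  # stand-in for Sort-FUs(): name order, not binding cost
            if op_type in fus[fu] and (state, fu) not in occupied:
                schedule[op], binding[op] = state, fu
                patched.add(state)
                return n_states
    # No free slot: open a new state at `earliest` and push later states back by one.
    for o in schedule:
        if schedule[o] >= earliest:
            schedule[o] += 1
    shifted = {s + 1 if s >= earliest else s for s in patched}
    patched.clear()
    patched.update(shifted)
    fu = next(f for f in sorted(fus) if op_type in fus[f])
    schedule[op], binding[op] = earliest, fu
    patched.add(earliest)
    return n_states + 1

# Assumed datapath and original schedule for the Figure 7 example, with n3 already unscheduled.
fus = {"ALU1": {"add", "sub"}, "ALU2": {"add", "sub"}, "MUL1": {"mul"}}
schedule = {"n1": 0, "n2": 0, "n4": 1, "n5": 2}
binding = {"n1": "ALU1", "n2": "ALU2", "n4": "ALU2", "n5": "ALU1"}
patched = set()

n_states = place_operation("n6", "mul", ["n1", "n2"], ["n5"], schedule, binding, fus, 3, patched)
patched.add(schedule["n5"])  # n5 is a modified operation: only its control signals change
```

In this run the multiplier is free in the second state, so n6 fits without adding a state, and the second and third states become the patched ones, which corresponds to s2 and s3 being replaced by s4 and s5 in the example. Had the multiplier been busy, the fallback branch would have opened an additional state.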
EXPERIMENTAL RESULTS
Tool Implementation
We have implemented the proposed patch compiler in the Cyneum synthesis and optimization framework, which we have developed recently. Internally, the LLVM compiler infrastructure [7] is used for analyzing an input C program and building a CDFG in SSA form. Given a pair of C programs before and after EC, a Δ-CDFG is constructed by analyzing the difference between the two CDFGs. The datapath organization and the scheduling and binding information corresponding to the original C program are also given as inputs to the compiler. After patch compilation, the enhanced datapath and the patch memory data are generated in synthesizable Verilog HDL.
Energy Efficiency Comparison
In this section, we compare the energy efficiency of the proposed accelerators against fixed-function accelerators and programmable accelerators. As a benchmark design, a C description of the 8x8 inverse discrete cosine transform (IDCT) is used. We designed five types of accelerators: fixed-function, 16-state patchable, 32-state patchable, 128-state patchable, and fully programmable accelerators. Every accelerator implements 99 states and takes 727 cycles to complete the execution. Note that we present the 128-state patchable accelerator only as a reference. Since an engineering change is assumed to be small (10 ∼ 20%) compared to a whole design, a 16-state or 32-state patchable accelerator should be sufficient.
The fixed-function accelerator shown in Figure 4 (a) is synthesized by a typical high-level synthesis algorithm. Then, the fully programmable accelerator shown in Figure 4 (b) is generated from the fixed-function accelerator by replacing the hardwired controller with a horizontal microcoded controller and the distributed registers with a centralized register file having multiple read and write ports.
Using a standard cell library from Nangate implemented in the virtual 45nm technology FreePDK45 [1], the designs are synthesized using Synopsys Design Compiler Ultra with a high-effort option including gated clock optimization. All memory elements such as the control memory, register/constant file and local store are implemented using flip-flops, i.e., no SRAM is used in the designs. The mapped netlists are placed and routed using Cadence SoC Encounter. Then, we simulated one whole execution of the 8x8 IDCT using Synopsys VCS. Using the simulation data in VCD format, the energy consumption is calculated using Synopsys PrimeTime PX. Table 1 presents the comparison of the five accelerators with respect to their post-layout area, operating frequency and energy efficiency. Figure 11 shows the energy breakdown of the five accelerators.
As for the area, the programmable accelerator is about 7X larger than the fixed-function accelerator due to the control memory. The area of the 32-state patchable accelerator is about 43% larger than that of the fixed-function accelerator. As for the operating frequency, the programmable accelerator is much slower, mainly due to the access time of the control memory and the register file. In contrast, the 32-state patchable accelerator is competitive with the fixed-function accelerator. As for the energy efficiency, the programmable accelerator is 5X less efficient than the fixed-function accelerator due to the programmable controller and the centralized register file. When the patching mechanism is not used, the 32-state patchable accelerator shows an efficiency degradation of only a few percent. When the patching mechanism is fully used, the degradation of energy efficiency is 14%. As can be seen from the energy breakdown, the register energy of the proposed accelerator does not increase significantly because the distributed registers are mostly used and the register file is not used frequently. These results clearly demonstrate the effectiveness of the proposed accelerators.
Patch Size & Runtime Evaluation
Next, we evaluated the proposed compilation method using the four benchmarks described in Table 2. As in [5], engineering change examples are obtained by iteratively applying a random graph perturbation to the original CDFG. Figure 12 shows the three types of graph perturbation. The first type selects an operation node randomly and changes the node type randomly. The second type selects a data dependence edge randomly and changes the source of the edge randomly. The third type inserts a new operation node of a random type at a randomly selected place. In each iteration, one of the three types is randomly chosen and applied to the CDFG. If a node has no outgoing edge after the graph perturbations are applied, the node is removed from the CDFG. The degree of engineering change is estimated by the number of graph perturbations. Figure 13 (a) presents the average patch size with respect to the engineering change size. The rate of increase depends on many factors such as the original scheduling, binding, datapath structure, and control complexity. If the original scheduling is very tight, any engineering change may introduce a new state. Figure 13 (b) presents the average register file size. The graph shows that a large register file is not necessary for the engineering changes in this experiment. Finally, Figure 13 (c) presents the average patch compilation runtime. This demonstrates that the patch compilation method is applicable to practical designs.
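The three perturbation types and the pruning rule described above can be sketched in Python as follows. This is an illustrative model only, not the experimental setup: the function `perturb`, the type list `OP_TYPES`, the graph representation, and the example graph are hypothetical. Three further assumptions are made: designated output nodes are never pruned, a rewired edge always takes its new source from nodes that precede the destination so the graph stays acyclic, and a new node is inserted by splitting a randomly chosen edge.

```python
import itertools
import random

OP_TYPES = ["add", "sub", "mul", "cmp", "shift"]  # hypothetical operation types

def perturb(order, types, edges, outputs, steps, seed=0):
    """Apply `steps` random perturbations; return the new (order, types, edges).

    order:   node names in a topological order
    types:   node name -> operation type
    edges:   list of (source, destination) data dependences
    outputs: nodes that are kept even without outgoing edges (assumption)
    """
    rng = random.Random(seed)
    order, types, edges = list(order), dict(types), list(edges)
    fresh = (f"p{i}" for i in itertools.count())  # names for inserted nodes
    for _ in range(steps):
        kind = rng.choice(["retype", "rewire", "insert"])
        if kind == "retype" or not edges:
            # Type 1: change the type of a random node.
            types[rng.choice(order)] = rng.choice(OP_TYPES)
        else:
            i = rng.randrange(len(edges))
            src, dst = edges[i]
            if kind == "rewire":
                # Type 2: new source drawn from nodes preceding dst, so no cycle is created.
                edges[i] = (rng.choice(order[:order.index(dst)]), dst)
            else:
                # Type 3: insert a new node of random type on the chosen edge.
                new = next(fresh)
                order.insert(order.index(dst), new)
                types[new] = rng.choice(OP_TYPES)
                edges[i] = (src, new)
                edges.append((new, dst))
        # Remove nodes left without outgoing edges, repeating until none remain.
        while True:
            live = {s for s, _ in edges} | set(outputs)
            dead = {n for n in order if n not in live}
            if not dead:
                break
            order = [n for n in order if n not in dead]
            for n in dead:
                del types[n]
            edges = [(s, d) for s, d in edges if d not in dead]
    return order, types, edges

base_order = ["n1", "n2", "n3", "n4", "n5"]
base_types = {"n1": "add", "n2": "add", "n3": "sub", "n4": "shift", "n5": "add"}
base_edges = [("n1", "n3"), ("n2", "n3"), ("n3", "n5"), ("n4", "n5")]
new_order, new_types, new_edges = perturb(base_order, base_types, base_edges, {"n5"}, steps=20, seed=1)
```

The `steps` argument plays the role of the engineering change size: more perturbations yield a modified design that differs more from the original.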
CONCLUSIONS
This paper first proposed a novel energy-efficient patchable accelerator which enables post-silicon engineering change. The proposed accelerator can achieve high energy efficiency by implementing the controller mostly in hardwired logic and providing a control patching mechanism. Then, we proposed a patch compilation method which takes a pair of an original design and a modified design. The experimental results demonstrated that the proposed accelerators offer energy efficiency competitive with fixed-function accelerators and can achieve about 5X higher efficiency than existing programmable accelerators.
