Application-Specific Instruction set Processors (ASIPs) have become an increasingly popular platform for embedded systems because of their high performance, flexibility, and short turn-around time. The hardware extensions in ASIPs can speed up program execution. However, they also incur area overhead and extra static energy consumption. Traditional datapath merging techniques reduce the circuit overhead by reusing hardware modules for executing multiple operations. However, they introduce structural hazards when multiple custom instructions execute in sequence, and hence reduce the performance improvement. In this article, we introduce a pipelined configurable structure for the hardware extension in ASIPs, so that structural hazards can be remedied. With multiple subgraphs of operations selected, we design a novel operation-to-hardware mapping algorithm based on Integer Linear Programming (ILP) to automatically construct a resource-efficient pipelined configurable functional unit. Different resource sharing schemes affect both the hardware overhead and the overall performance improvement. We analyze the design trade-offs between resource efficiency and performance improvement. Finally, we present our design space exploration results, setting the optimization objective to area, to area and delay, and to delay, respectively.
INTRODUCTION
Application-Specific Instruction set Processors (ASIPs) have become a promising design platform for modern embedded systems, satisfying their demanding requirements on performance, cost, power consumption, and turn-around time. With a customized Instruction Set Architecture (ISA) and custom hardware extensions, the performance of an ASIP is improved greatly over general-purpose processors for the selected applications. Over the past decade, we have seen many industrial successes in ASIP design in the embedded system domain, such as the Tensilica Xtensa processor [Gonzalez 2000], CoWare LisaTek products [CoWare Inc. 2012], Altera Nios/NiosII [Altera Corp. 2012], and Xilinx MicroBlaze [Xilinx Inc. 2012].
39:2 H. Lin and Y. Fei
While traditional ASIP design focuses on performance improvement [Goodwin and Petkov 2003; Sun et al. 2004], with the proliferation of battery-powered electronic devices, energy efficiency has become an imperative consideration in modern embedded processor design. Much research in ASIP design has tackled the energy-efficiency issue, including optimizing individual processor components, for example, the cache, register file, and instruction fetch stage [Lin and Fei 2009; Ravindran 2007; Ravindran et al. 2007], and managing the power consumption of the entire processor [Fei et al. 2004; Lee et al. 2003]. However, most of this work targets dynamic energy consumption and does not consider static energy specifically. Leakage energy consumption has become an important issue in modern electronic circuits. When the process technology shrinks below 65 nm, the leakage power becomes comparable to the dynamic power [Kim et al. 2003b]. Although some recent technologies, like the "high-k dielectric technology" [Intel 2012] adopted in top-of-the-line processes, can effectively suppress the leakage current, static energy remains an important component of the total energy consumption.
Custom hardware extensions in ASIP design are able to reduce the execution time and dynamic energy effectively. However, the static energy consumption caused by these extensions may greatly offset the dynamic energy reduction. The more custom instructions are implemented, the more custom hardware extensions are needed to support their execution, and possibly more static energy consumption. Improving the resource efficiency of custom hardware extensions in ASIPs, that is, lowering the hardware overhead for selected custom instructions, has been considered an effective method to reduce the die area overhead, chip cost, and also overall energy consumption.
When implementing resource-efficient custom logic for ASIPs, traditional resource sharing techniques, such as datapath merging and reconfigurable datapath synthesis [Brisk et al. 2004; de Souza et al. 2005; Huang and Malik 2001; Moreano et al. 2005; van der Werf et al. 1992; Zuluaga and Topham 2008], are widely used to generate a configurable hardware extension that shares functional units among multiple custom instructions. However, such resource sharing introduces potential structural hazards when executing custom instructions, because these techniques do not consider the possibility of custom instructions being executed in parallel. At any time, the shared functional units are used exclusively by one custom instruction. For many complicated embedded applications, multiple subgraphs within one basic block can be chosen as custom instructions to accelerate the application. If they are close together in the execution sequence, some custom instructions may compete for the single shared hardware extension in certain clock cycles. Therefore, traditional resource sharing algorithms become unsuitable. We observed that for the given testbench applications, multiple custom instructions are selected and many of them are multicycle. Therefore, such contention for common resources exists and will slow down program execution. We address this issue by pipelining the shared custom hardware.
In this article, we propose a pipelined configurable custom hardware structure to remedy the structural hazards, yielding optimal resource efficiency and minimum static energy. We formulate our hardware minimization problem into an Integer Linear Programming (ILP) problem, and solve it for optimal resource sharing among multiple candidate custom instructions. We also analyze the trade-off between resource efficiency and performance improvement.
The rest of the work is organized as follows. In Section 2, we discuss the motivation for a pipelined configurable Custom Functional Unit (CFU), and briefly describe the proposed CFU structure. In Section 3, we formulate the resource sharing problem into two phases, operation scheduling solved by an ILP formulation, and operation to operator assignment solved by a proposed heuristic with linear complexity. We then demonstrate the experimental setup and results in Section 4. Trade-offs between resource efficiency and performance improvement in ASIPs are also shown. Following that, we conclude our work in Section 5.
EXPLORING RESOURCE SHARING OF HARDWARE EXTENSIONS IN ASIP DESIGN
In this section, we discuss the unique features of resource sharing for hardware extensions in ASIP design, and propose our pipelined configurable CFU structure. We assume that the widely used technique for custom instruction generation, operation fusion [Goodwin and Petkov 2003], is adopted. Selected groups of operations in a program's basic blocks are fused into single complex operations. Both the execution time and the dynamic energy consumption of the application can be reduced greatly. Figure 1 illustrates the traditional resource sharing process among multiple custom instructions. Each custom instruction performs a subgraph of basic operations. To reduce the area overhead, the custom hardware is shared by the two instructions, and includes a superset of the functional units needed by the different operation subgraphs. For example, if subgraph 1 needs 2 adders and 1 multiplier and subgraph 2 needs 3 adders, the configurable custom functional unit contains 3 adders and 1 multiplier. Configurable interconnections, that is, MUXes, are inserted so that different custom instructions get executed on different subsets of functional modules. For this kind of custom hardware synthesis, traditional resource sharing techniques such as multiple graph and path merging [Moreano et al. 2005; Zuluaga and Topham 2008] are applied. The work in de Souza et al. [2005] formulates the datapath merging problem as a maximum compatible clique searching problem, and evaluates the performance and complexity of several typical heuristic solutions. In Zuluaga and Topham [2008], the problem is converted into a substring matching problem and a heuristic is developed to find reasonable solutions. The authors also analyze the effect of resource sharing on the critical path delay, and demonstrate the trade-off between area overhead and critical path delay through experimental results.
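The superset construction in the Figure 1 example can be illustrated with a short sketch. The function name `merged_units` and the dictionary encoding of a subgraph's needs are assumptions made for this illustration; the sketch only counts functional units and ignores MUXes and interconnections.

```python
from collections import Counter

def merged_units(*subgraphs):
    """Per-type unit count of a CFU able to host any ONE subgraph at a time.

    Each subgraph is a dict {unit type: number of units needed}. This models
    only the unit count of traditional (non-pipelined) datapath merging.
    """
    merged = Counter()
    for needs in subgraphs:
        merged |= Counter(needs)  # Counter union keeps the per-type maximum
    return dict(merged)

# Numbers taken from the Figure 1 example in the text.
subgraph1 = {"adder": 2, "multiplier": 1}
subgraph2 = {"adder": 3}

shared = merged_units(subgraph1, subgraph2)
separate_total = sum(subgraph1.values()) + sum(subgraph2.values())
shared_total = sum(shared.values())
```

With these inputs the shared unit needs four functional units instead of the six required by two separate implementations.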
However, these techniques share the assumption that each custom instruction is executed on the configurable custom hardware exclusively, that is, the functional units are used in a time-multiplexed fashion. Our resource sharing technique differs from the previous work in that it considers the specific features of executing custom instructions on ASIPs. Figure 2 illustrates the partial pipeline of an extended processor. In this processor, we assume out-of-order execution, that is, instructions are issued and executed whenever their resources and operands are ready. Without resource sharing in the custom hardware, a functional unit is generated for each custom instruction and exists in parallel with the baseline functional units. For example, two independent custom instructions, ci1 and ci2 as shown in Figure 2, can be issued from the instruction queue and executed simultaneously in different custom functional units (CFU1 and CFU2), with the control flow marked in solid lines in the figure. The instruction-level parallelism of the program is well maintained. However, with resource sharing, the two custom functional units are merged into one, shown as the shaded module (CFU) in Figure 2. At any time, only one custom instruction can occupy the merged CFU. As a result, when instruction ci1 is in execution, instruction ci2 has to wait for the hardware resource to become available. That is, the traditional resource sharing method causes structural hazards for custom instructions and therefore may degrade the performance improvement. In fact, because of the I/O port constraints of the register file, multiple independent custom instructions can be generated in a frequently executed basic block. When these custom instructions are multicycle and executed consecutively, structural hazards may arise and cause performance degradation.
Resource Sharing of Hardware Extensions by Multiple Custom Instructions
To remedy such structural hazards and thereby improve the performance of an ASIP, we introduce a novel pipelined configurable hardware structure and develop our resource sharing algorithm for the CFUs. In the previous work [Brisk et al. 2004; Zuluaga and Topham 2008], the authors discussed implementing resource sharing on a pipelined structure. However, their "postresource-sharing" pipelining method decouples the pipeline generation from resource sharing, and therefore may result in suboptimal solutions. Moreover, in Brisk et al. [2004], the authors proposed a coarse-grained method for implementing the pipeline stages, assuming each operation has a one-cycle execution delay, whereas in fact several operations in a sequence can be assigned to the same pipeline stage as long as the accumulated delay does not exceed one clock cycle. Our proposed fine-grained technique would yield better performance improvement. Figure 3(a) illustrates the architecture of an example configurable CFU in our flow. Functional modules are arranged in rows and connected in a feed-forward manner, that is, signals transfer from the outputs of components in upper-level rows to the inputs of components in the subsequent rows. MUXes are inserted to allow configuration of interconnections, and control signals for the MUXes are generated by the custom instruction decoding logic. Thus, different custom instructions can change the connection of functional modules in the configurable CFU. For a selected subgraph to be implemented on the shared hardware, as shown in Figure 3(b), its operations are mapped onto the dark functional components in the configurable CFU, shown in Figure 3(a). If the critical path delay of the datapaths exceeds one clock cycle of the processor, registers are inserted into the configurable CFU.
Resource Sharing of Pipelined Custom Hardware Extension 39:5
Fig. 3. Mapping a subgraph onto a configurable custom functional unit.
Hence, custom instructions can be fed into the shared hardware extension in a pipelined sequence. With this pipelined structure, in any one execution cycle, multiple multicycle custom instructions can share the resource without conflict, because different custom instructions use functional units at different pipeline stages. Considerable area reduction can be achieved through our novel resource sharing algorithm proposed in Section 3. Figure 4 shows the cycle-accurate execution under the three different resource sharing scenarios in a single-issue out-of-order processor. We assume ci1 is executed in four cycles and ci2 is executed in two cycles. When there is no resource sharing, different instructions have to be executed on separate hardware in the same cycle to avoid structural hazards. In traditional resource sharing without pipelining, the processor pipeline has to be stalled for the second instruction due to the structural hazard. With our resource sharing with pipelining, the performance is as good as in the first case, and the hardware overhead is greatly reduced.
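The three scenarios can be compared with a minimal timing sketch. It is not the cycle-accurate model behind Figure 4: the function `finish_times`, the mode names, and the simplifications (one instruction issued per cycle, independent instructions, a strictly linear CFU pipeline) are assumptions made for illustration.

```python
def finish_times(latencies, mode):
    """Cycle in which each custom instruction completes, in program order.

    latencies: execution cycles of each custom instruction.
    mode: "separate"  - one CFU per instruction (no sharing)
          "shared"    - one merged CFU, occupied exclusively
          "pipelined" - one merged CFU with pipeline registers
    Assumes single issue (instruction n may start in cycle n at the earliest)
    and no data dependencies between the instructions.
    """
    if mode not in ("separate", "shared", "pipelined"):
        raise ValueError(f"unknown mode: {mode}")
    finish = []
    cfu_free = 0  # first cycle in which the exclusive shared CFU is free
    for issue_slot, cycles in enumerate(latencies):
        start = issue_slot
        if mode == "shared":
            # Structural hazard: wait until the previous instruction leaves.
            start = max(start, cfu_free)
            cfu_free = start + cycles
        # In "pipelined" mode, instructions entering in different cycles
        # occupy different pipeline stages, so no stall is modeled.
        finish.append(start + cycles)
    return finish

# ci1 takes four cycles and ci2 takes two, as assumed in the text.
no_sharing = finish_times([4, 2], "separate")
exclusive = finish_times([4, 2], "shared")
pipelined = finish_times([4, 2], "pipelined")
```

Under these assumptions, only the exclusive sharing case delays ci2; the pipelined case matches the completion times of the case without sharing.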
Structure of the Configurable CFU
The configurable CFU architecture proposed is similar to the Configurable Compute Accelerator (CCA) presented in Clark et al. [2005] . However, different from their design, we introduce pipelining within the shared hardware. Moreover, unlike the previous work which predefines a CCA structure and develops an algorithm to identify and map appropriate DFGs onto it, our approach automatically generates the optimal configurable custom hardware for the target application.
In the following sections, we assume that the subgraphs of operations for custom instructions have been selected by certain synthesis tools, like the one in Atasu et al. [2003] , and focus on our resource sharing algorithm with the proposed pipelined hardware architecture.
SOLUTION TO THE RESOURCE SHARING PROBLEM
In this section, a new resource sharing problem for ASIP design is identified in our custom hardware generation process for better performance and energy efficiency. The selected subgraphs of operations can be implemented in a pipelined datapath if they need multiple cycles for execution. The problem then becomes how to schedule the operations of different subgraphs into the execution cycles in the pipelined datapath with configurable hardware. The functional modules in a pipeline stage can be shared among different subgraphs for custom instructions. The goal is to maximize the resource sharing so as to reduce the area overhead and static energy consumption of the configurable custom hardware.
Although the concept of resource sharing has been widely used in both High-Level Synthesis (HLS) [Coussy et al. 2009] and previous custom hardware synthesis in ASIP design, the resource sharing problem we describe differs distinctly from both of them. HLS targets resource sharing within one DFG, and it shares resources between different cycles. In our problem, we do not allow operations in the same DFG to share resources. Previous research on resource-efficient custom hardware synthesis in ASIPs, like de Souza et al. [2005] and Moreano et al. [2005], has not considered the potential structural hazards of merging multiple datapaths, and the shared resources are not pipelined for multiple custom instructions to execute on them simultaneously. Figure 5 depicts the three different types of resource sharing, that is, sharing in HLS, traditional resource sharing in custom hardware synthesis, and our proposed resource sharing in a pipelined structure. In Figure 5, each row illustrates one type of resource sharing. The dotted arrows demonstrate examples of sharing in each resource sharing type. The solid bars indicate the places where registers are inserted.
Problem Definition
We next show that our problem can be transformed into a process of operation scheduling and mapping onto hardware generated on the fly. Figure 6 shows such a process, where two subgraphs of operations, $g_1$ and $g_2$, are represented by two Directed Acyclic Graphs (DAGs) with nodes for operations and edges for data dependencies between operations. We assume that the gray nodes have the same operation type, that is, they can share the same operator in the custom hardware. Operations in these two graphs can be scheduled to different virtual stages while maintaining their data dependencies, for example, in an As-Soon-As-Possible (ASAP) or As-Late-As-Possible (ALAP) manner. When the operations are mapped onto the custom hardware, these virtual stages are in turn mapped to the CFU's hardware pipeline stages. A hardware stage can cover several consecutive virtual stages, depending on the latency of each virtual stage. The edges crossing the boundaries of hardware pipeline stages represent positions for pipeline registers. For a scheduling scheme of multiple DAGs, we estimate the critical path delay and group virtual stages into pipeline stages. Next, after the operation scheduling for each subgraph is done, we develop a heuristic method to determine the actual resource sharing scheme so as to reduce the interconnection complexity. Only operations of different datapaths that are of the same type and at the same virtual stage are allowed to share functional modules, such as the two gray nodes at virtual stage 4 in Figure 6. Finally, the areas of the needed functional modules at each pipeline stage and of the MUXes for interconnection configuration are summed up for hardware overhead estimation. Different scheduling schemes of the DAGs affect both the number of pipeline stages for each DFG and the overall hardware overhead, that is, the resource efficiency.
As illustrated in Figure 7 , custom instruction 1 always takes two cycles to execute, and custom instruction 2 can take one or two cycles, depending on the scheduling plan. For the first scheduling plan, three adders are needed in pipeline stage 1 and custom instruction 2 executes in one cycle. For the second scheduling plan, two adders are needed in pipeline stage 1 and custom instruction 2 executes in two cycles, with more sharing between the two subgraphs. These two scheduling schemes demonstrate possible trade-off between execution delay and custom hardware area overhead. Among all the scheduling plans for the DAGs, we select the one that can reduce the hardware overhead most and meanwhile satisfy the performance requirement.
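The effect of the scheduling plan on the adder count can be reproduced with a small sketch. The two DAGs below are hypothetical stand-ins chosen to match the adder counts quoted for Figure 7, not the actual graphs of the figure; `operator_counts` and the operation names are likewise assumptions. Operations of the same DAG never share a unit, so each stage needs the per-type maximum over the DAGs.

```python
from collections import defaultdict

def operator_counts(schedule, op_info):
    """Units needed per (type, virtual stage) when different DAGs share units.

    schedule: {operation: virtual stage}
    op_info:  {operation: (dag id, operation type)}
    """
    per_dag = defaultdict(int)
    for op, stage in schedule.items():
        dag, kind = op_info[op]
        per_dag[(dag, kind, stage)] += 1
    units = defaultdict(int)
    for (dag, kind, stage), count in per_dag.items():
        units[(kind, stage)] = max(units[(kind, stage)], count)
    return dict(units)

# Hypothetical DAGs: DAG 1 has two adders feeding a third; DAG 2 has three
# adders, the last of which may be placed in stage 1 or stage 2.
op_info = {
    "a1": (1, "add"), "a2": (1, "add"), "a3": (1, "add"),
    "b1": (2, "add"), "b2": (2, "add"), "b3": (2, "add"),
}
plan1 = {"a1": 1, "a2": 1, "a3": 2, "b1": 1, "b2": 1, "b3": 1}
plan2 = {"a1": 1, "a2": 1, "a3": 2, "b1": 1, "b2": 1, "b3": 2}

units_plan1 = operator_counts(plan1, op_info)
units_plan2 = operator_counts(plan2, op_info)
```

In this hypothetical case, plan 1 needs four adders in total and plan 2 needs three, at the cost of a longer schedule for DAG 2.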
Integer Linear Programming Solution for Resource Sharing
To solve the resource efficiency problem, we develop a novel algorithm for operation scheduling and resource sharing for multiple DAGs based on Integer Linear Programming (ILP). In contrast to the previous iterative algorithms [Brisk et al. 2004; Zuluaga and Topham 2008] , we take multiple DAGs in at the same time and find an optimal CFU template to implement all of them based on a one-time ILP solving process. Both the area overhead and delay should be estimated in the objective function to guide the exploration for optimal solutions. We present the entire ILP problem formulation as follows.
Primary Variable Definition. For each operation in the DAGs, we define a set of binary variables $\{s_{i,l}\}$, where $i$ is the index of the operation (unique across all the DAGs), and $l$ is the index of the virtual stage. If operation $i$ is scheduled in virtual stage $l$, $s_{i,l}$ is assigned 1, otherwise 0. Clearly, for an operation $i$, exactly one of these variables is assigned 1.
Parameter Definition. For each operation $i$, a set of parameters $\{type_{i,k}\}$ is defined, where $k$ is the index of operation types. $type_{i,k} = 1$ indicates that the operation belongs to type $k$. Similarly, for each operation $i$, a set of parameters $\{group_{i,j}\}$ is defined, where $j$ is the index of DAGs. $group_{i,j} = 1$ indicates that the operation belongs to DAG $j$. The values of these parameters are determined by the given DAGs.
Assistant Variable Definition. To evaluate the delay and execution cycles of each DAG, we add a set of assistant variables. $C_l$ represents the delay of virtual stage $l$, and equals the largest delay of the functional units in stage $l$. $AC_l$ is the accumulated delay from the primary input to stage $l$, that is, $AC_l = \sum_{t=1}^{l} C_t$. If $AC_l$ exceeds $n \cdot T_{cycle}$ while $AC_{l-1}$ is within $n \cdot T_{cycle}$, virtual stage $l$ needs to be placed in pipeline stage $n+1$. $T_{cycle}$ is the clock cycle of the processor.
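The grouping rule above amounts to placing virtual stage l in pipeline stage ceil(AC_l / T_cycle). A minimal sketch follows; the function name `pipeline_stages`, the use of integer picoseconds, and the example delays are assumptions made for illustration (the 2000 ps cycle corresponds to the 500 MHz clock used later in the experiments).

```python
from itertools import accumulate

def pipeline_stages(stage_delays_ps, t_cycle_ps):
    """Map each virtual stage to a pipeline stage (numbered from 1).

    stage_delays_ps: delay C_l of each virtual stage, as positive integers.
    t_cycle_ps: processor clock cycle T_cycle, in the same unit.
    """
    accumulated = accumulate(stage_delays_ps)  # AC_l = C_1 + ... + C_l
    # Integer ceiling of AC_l / T_cycle, avoiding floating-point rounding.
    return [-(-ac // t_cycle_ps) for ac in accumulated]

# Hypothetical virtual-stage delays under a 2000 ps clock cycle.
stages = pipeline_stages([800, 900, 700, 1500], 2000)
```

Here the accumulated delays are 800, 1700, 2400, and 3900 ps, so the first two virtual stages fall into pipeline stage 1 and the last two into pipeline stage 2.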
Constraints. There are several rules for the resource sharing that should be reflected by the constraints in ILP. First of all, when scheduling operations to different virtual stages and implementing them by functional components, the data dependency within each DAG should be maintained. In other words, if there is an edge connecting two operation nodes in a DAG, the source node of the edge (parent node) should be assigned to a virtual stage earlier than the virtual stage where the destination node (child node) is assigned. With the variables definition described before, this constraint is presented in Eq. (1), where node i is the parent node of node j.
Obviously, one operation can only be assigned to one virtual stage, as represented in Eq. (2).
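Eqs. (1) and (2) are not reproduced in this text. One formulation consistent with the description, using the variables defined above, is sketched here; it is an assumed reconstruction, not necessarily the exact form used by the authors. Following the convention of the preceding paragraph, $j$ in Eq. (1) denotes the child operation of edge $(i, j)$ rather than a DAG index.

```latex
% Eq. (1), assumed form: parent i is scheduled strictly before child j.
% The sum \sum_l l \cdot s_{i,l} evaluates to the virtual stage of operation i.
\sum_{l} l \cdot s_{i,l} + 1 \;\le\; \sum_{l} l \cdot s_{j,l}
  \qquad \text{for every edge } (i, j) \text{ in a DAG}

% Eq. (2), assumed form: each operation occupies exactly one virtual stage.
\sum_{l} s_{i,l} = 1
  \qquad \text{for every operation } i
```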
Objective Function. For this problem, we want to find the optimal operation scheduling for each DAG, so that: (1) when implementing these operations on hardware stages in the CFU, as many operations as possible can share the functional components in the CFU and hence the hardware cost of the CFU is minimized; (2) the execution cycles of each DAG (representing different custom instructions) will not be increased much by resource sharing. This reflects the trade-off between resource efficiency and performance for each custom instruction. We put both the area and delay estimation into a unified objective function.
For each virtual stage, the number of functional unit instances needed for each operation type should be equal to the largest number of operations of this type assigned to this stage among all the DAGs. This is a MAX() function. The total hardware overhead is a summation of these MAX() functions over each stage and each type, multiplied by the unit area of each type. Since the MAX() function is not a linear function, we add additional variables and constraints to convert this objective function into a linear form.
Here $i$, $j$, $k$, $l$ are the operation index, DAG index, operation type index, and virtual stage index, respectively. $M_{j,k,l}$ in Eq. (3) represents the number of operations of type $k$ in DAG $j$ that are assigned to virtual stage $l$. $X_{k,l}$ denotes the number of operators of type $k$ at virtual stage $l$. The MAX() function is hence converted to the set of constraints presented in inequality (4).
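Eq. (3) and inequality (4) are not reproduced in this text. A form that agrees with the definitions of $M_{j,k,l}$ and $X_{k,l}$ given above is sketched below as an assumed reconstruction. Because $type_{i,k}$ and $group_{i,j}$ are constant parameters, the right-hand side of Eq. (3) remains linear in the variables $s_{i,l}$.

```latex
% Eq. (3), assumed form: operations of type k from DAG j placed in stage l.
M_{j,k,l} = \sum_{i} type_{i,k} \cdot group_{i,j} \cdot s_{i,l}
  \qquad \forall\, j, k, l

% Inequality (4), assumed form: X_{k,l} bounds every DAG's demand from above.
X_{k,l} \;\ge\; M_{j,k,l}
  \qquad \forall\, j, k, l
```

Since the objective minimizes the area, each $X_{k,l}$ is driven down to $\max_j M_{j,k,l}$ at the optimum, which is how the MAX() function is expressed with linear constraints only.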
To estimate the execution cycle count of each DAG, a set of integer variables $K_j$ represents the cycles needed for the $j$th DAG. Inequality (5) gives the estimate of $K_j$: $K_j \cdot T_{cycle}$ should be $AC_L$ rounded up to a whole number of clock cycles, where virtual stage $L$ is the last stage of DAG $j$. $AC_l$ is described earlier in "Assistant Variable Definition". These inequalities exist for every operation $i$ in DAG $j$, that is, every $i$ with $group_{i,j} = 1$.
Note that $AC_l \cdot s_{i,l}$ is a product of two variables, which is not linear and must be linearized for ILP. We take the approach in Wang et al. [2006] for a product $C = B \cdot A$, where $A$ is a binary variable and $M$ is an upper bound of $B$. It can be linearized as follows.
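Inequality (5) and the linearization in Eq. (6) are not reproduced in this text. The block below gives an assumed form of (5) that matches the description, followed by the standard big-M linearization of a product of a binary variable and a bounded nonnegative variable. The exact constraints in Wang et al. [2006] may be written differently.

```latex
% Inequality (5), assumed form: the cycle count of DAG j covers the
% accumulated delay of the stage in which each of its operations is placed.
K_j \cdot T_{cycle} \;\ge\; AC_l \cdot s_{i,l}
  \qquad \forall\, l,\ \forall\, i \text{ with } group_{i,j} = 1

% Eq. (6), standard big-M form for C = B \cdot A,
% with A \in \{0, 1\} and 0 \le B \le M:
C \le M \cdot A, \qquad
C \le B, \qquad
C \ge B - M \cdot (1 - A), \qquad
C \ge 0
```

When $A = 0$ the first and last constraints force $C = 0$; when $A = 1$ the second and third force $C = B$. Applied to (5), $B$ corresponds to $AC_l$ and $A$ to $s_{i,l}$.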
The final objective function is presented linearly in Eq. (7), where $A_k$ is the unit hardware area of operator type $k$. We use the coefficient $\alpha$ to adjust the weight of resource sharing versus delay. Both the resource overhead and the delay are normalized. $AREA$ is the summation of the total area cost of each DAG if implemented separately. $T_j$ is the number of cycles needed for DAG $j$ if implemented separately with the operations in an ASAP scheduling. The resource sharing problem is to find the best set of $s_{i,l}$ values so as to minimize the objective function.
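Eq. (7) is not reproduced in this text. The sketch below shows one objective that is consistent with the description: the area term follows directly from the definitions of $A_k$, $X_{k,l}$, and $AREA$, and the weighting matches the later statement that $\alpha = 1$ optimizes for area only. The exact normalization of the delay term is an assumption.

```latex
% Eq. (7), assumed form: weighted sum of normalized area and normalized delay.
\min \quad
  \alpha \cdot \frac{\sum_{k} \sum_{l} A_k \cdot X_{k,l}}{AREA}
  \;+\;
  (1 - \alpha) \cdot \frac{\sum_{j} K_j}{\sum_{j} T_j}
```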
For all the preceding equations, the ranges of $i$, $j$, $k$ are determined by the actual number of operations in all the DAGs, the number of operation types, and the number of DAGs, respectively. The range of $l$, that is, the number of possible virtual stages in the CFU, is determined by the total number of operations in the DAGs, considering the extreme case when each operation is assigned to a separate virtual stage, as shown in Figure 8. With all these equations and constraints presented, the resource sharing problem is now formulated as an ILP problem. We revisit Figure 6, which shows a possible scheduling plan. The delay of each operation is marked next to the node, normalized to the processor's clock cycle. The delay of each virtual stage is estimated and annotated on the left. Based on this, we assign hardware pipeline stages to virtual stages, and the pipeline stages are marked on the right. We explore different possible scheduling plans and find the optimal solution, which minimizes the objective function. The ILP problem scale increases as the total number of operation nodes in the DAGs increases. For the ILP model above, the total number of variables is $O(N^2)$, where $N$ is the number of operations.
Reducing Interconnections in the CFU
With the operation scheduling algorithm, we have minimized the functional units needed for the proposed CFU structure. Within each virtual stage, we can arbitrarily select operations of the same type from different subgraphs to share a functional unit. However, depending on the data dependencies between operations in each subgraph, two different functional resource sharing schemes may result in different numbers of MUXes, that is, different interconnection complexity in the hardware implementation. Figure 9 shows an example. Suppose the resource allocation for virtual stage 1 is already done, and we are now allocating resources for virtual stage 2 and mapping operations onto them. At virtual stage 2, there are three multiplications in the two subgraphs, and only two multipliers are needed for them. Multiplications 1 and 2 from subgraph 1 are mapped onto the two multipliers, respectively. In plan a, multiplication 3 from subgraph 2 shares a multiplier with multiplication 1, and in plan b with multiplication 2. We see that plan b needs one more MUX.
To minimize the overall area overhead, the interconnection needs to be optimized as well. Previous research formulates the interconnection optimization problem as a maximum compatible clique searching problem, which is NP-complete, and finds a solution based on heuristics. However, that work targets a nonpipelined hardware structure, as in the "full sharing case" in Figure 5. Within our proposed CFU structure, operators are pipelined and arranged in rows in a feed-forward style. Hence, we only need to consider the interconnections between rows. Because there is no backward data flow in the hardware, we can consider interconnection generation stage by stage. We propose a heuristic with linear complexity for operation binding and MUX insertion.
We now consider binding operations in two adjacent virtual stages to operators (the actual number of operators of each kind in each stage is already determined, as shown in the previous section). Assume the operation sharing for the upper row (virtual stage) is fixed. When multiple operations are mapped onto the same operator, this operator has several output connections to the operations in the following row, like the dark operator (adder) shown in Figure 9. We analyze two different sharing schemes. In the first one, in Figure 9(a), operations which have the same parent operator (multiplications 1 and 3, considering the left input of the two operations) are mapped onto the same operator. In the second case, in Figure 9(b), operations which have different parent operators (multiplications 2 and 3) are mapped onto the same operator. Since the inputs of the operator now come from different sources at different times, the operator needs a MUX at its input to execute different subgraphs. Our goal is to choose an appropriate sharing scheme (with the number of operators already determined) to minimize the interconnection overhead. Algorithm 1 shows the pseudocode for generating the operation sharing scheme in each virtual stage. It takes the operation scheduling solution from the ILP as input, and generates the actual operation-to-operator binding for all the subgraphs, targeting interconnection minimization.
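The difference between the two sharing schemes can be made concrete by counting the MUXes that a given binding implies. This sketch is not Algorithm 1: it only evaluates a binding rather than constructing one, and the function `count_muxes`, the source names, and the right-hand inputs of the multiplications are hypothetical values chosen to mirror the situation described for Figure 9.

```python
def count_muxes(binding, op_inputs):
    """Number of operator input ports that need a MUX under a binding.

    binding:   {operation: operator it is mapped onto}
    op_inputs: {operation: tuple of signal sources, one per input port}
    A port needs a MUX when the operations bound to its operator draw that
    input from more than one distinct source.
    """
    sources = {}
    for op, operator in binding.items():
        for port, src in enumerate(op_inputs[op]):
            sources.setdefault((operator, port), set()).add(src)
    return sum(1 for srcs in sources.values() if len(srcs) > 1)

# Hypothetical inputs: multiplications 1 and 3 share the same parent adder
# on their left input, while multiplication 2 has a different parent.
op_inputs = {
    "mul1": ("adderA", "in1"),
    "mul2": ("adderB", "in2"),
    "mul3": ("adderA", "in3"),
}
plan_a = {"mul1": "M1", "mul2": "M2", "mul3": "M1"}  # mul3 shares with mul1
plan_b = {"mul1": "M1", "mul2": "M2", "mul3": "M2"}  # mul3 shares with mul2

muxes_a = count_muxes(plan_a, op_inputs)
muxes_b = count_muxes(plan_b, op_inputs)
```

With these hypothetical inputs, plan a needs one MUX (on the right input of M1) and plan b needs two, so binding operations with a common parent onto the same operator saves one MUX.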
The complexity of the algorithm is linear in the total number of operations in all subgraphs. The algorithm starts from the row connected to the primary inputs (virtual stage 1), and considers the assignment of operations to operators stage by stage. The ordering process at statement 4 is not linear, and could have a complexity of $O(M^2)$, where $M$ is the total number of subgraphs. However, because $M$ is very small compared to the number of operations, the term $M^2$ can be treated as a constant multiplier of $O(N)$ when we estimate the complexity of the entire algorithm. The whole process is still $O(N)$, where $N$ is the total number of operations in the subgraphs. Operations are assigned to operators based on the three criteria in lines 15 to 19. The first criterion tries to place operations with common parent operators on the same operator. The second criterion favors operators that have more output edges: the more output edges an operator has, the more likely it is that its child operations can share operators and thus reduce the interconnection. The third criterion adds an operator when no existing one is available.
EXPERIMENTAL RESULTS
In this section, we present the experimental setup and show the resource sharing results based on our proposed pipelined CFU structure, using the ILP operation scheduling algorithm and the linear operation-to-operator assignment algorithm. First we implement an ASIP custom instruction synthesis flow. We extract a set of subgraphs from each testbench for custom instruction implementation. We take the SUIF and MachSUIF framework [SUIF Compiler System 2012; Machine SUIF 2012] and develop our own passes for front-end compiling, program profiling, and DFG extraction. Based on the profiling information, we implement a DFG exploration and custom instruction selection algorithm similar to the one proposed in Atasu et al. [2003]. The best subgraphs for custom instructions are selected to speed up the application under the constraints of the register file I/O ports. The timing and area information of the different types of functional units in the CFU is estimated based on logic synthesis using Design Compiler from Synopsys Inc. [2012] with the 0.13 μm process library, and we set the processor clock frequency to 500 MHz in the experiment. Note that although we adopt one specific custom instruction selection technique to provide subgraphs for custom hardware generation, our algorithms are independent of the custom instruction generation technique. In the experiments we evaluate the effect of resource sharing on performance and explore the trade-off between area overhead and performance gain with the custom instructions already given. The selected subgraphs are taken in as DAGs in our ILP model. Our program utilizes LPsolver APIs [LPsolver 2012] to solve the ILP model for operation scheduling, and then allocates operators and generates the configurable CFU. We randomly select a suite of benchmark applications from MiBench [2012], which is a commercially representative embedded benchmark set, to evaluate the effect of resource sharing using the proposed algorithms. Table I presents the statistics of the selected subgraphs from each testbench, where the input/output port constraint for the selected subgraphs is set to 4/2. Column 2 gives the number of subgraphs chosen for each testbench, and Column 3 the total number of operations in these subgraphs, which gives an idea of the input size of the ILP model.
In the experiments, we first focus on reducing the area overhead and hence set the coefficient α in the objective function of Eq. (7) to 1. Table II gives the statistics of the configurable CFU generated by our ILP models for the different testbenches. Column 2 gives the total number of pipeline stages. Columns 3 to 8 show the number of functional units of each type and the number of Muxes needed for configuring the interconnection. Column 9 gives the total area overhead of our pipelined shared CFU. Column 10 shows the original overhead, obtained by implementing the datapath of each custom instruction separately without pipelining and summing up the areas. Column 11 gives the area reduction rate. The average area reduction rate achieved in our experiments over all the testbenches is 41.2%, with a maximum reduction rate of 58.9%. Because the related papers use different testbench sets, it is not possible to conduct a fair comparison of the area reduction with the traditional resource sharing techniques. The authors in Brisk et al. [2004] claim an average area reduction of 55.4% over another set of testbenches, similar to our results. However, their technique considers pipelining only after the resource sharing finishes. Our approach considers pipelining together with resource sharing, which can achieve better performance and also allows us to explore the trade-offs between performance and area reduction. Custom hardware extensions take up a large portion of the whole ASIP area because they normally include multiple large functional modules such as multipliers, and the base processor we adopt is a simple RISC-style processor, namely the Brownie processor from ASIPSolution Inc. [2012]. The synthesis results from Design Compiler show that the custom hardware extension takes an average of 47.6% of the total processor area (without on-chip cache). Table III shows the area reduction rate of the entire extended processor using our pipelined resource sharing algorithm.
Column 2 gives the size of the baseline processor. Column 3 shows the area of the extended processor without resource sharing. Column 4 gives the area of the extended processor with our resource sharing technique. Column 5 gives the area reduction rate of the whole extended processor core. Analysis of the leakage current of transistors [Kim et al. 2003a] shows that, for the same supply voltage, leakage power, which is the dominating part of the static power, depends linearly on the size of the transistor. Therefore, we assume a linear dependency between the static power and the size of the hardware (the number of normalized-size transistors in the hardware). Based on this assumption, our proposed technique has the potential to reduce the static energy of the extended processor by 25.1% on average.
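The reduction rates reported in Tables II and III, and the static energy estimate derived from them, follow simple arithmetic that can be sketched as below. The function names and the area values are hypothetical and serve only to illustrate the calculation; the static power estimate relies on the linear area-to-leakage assumption stated above.

```python
def area_reduction_rate(unshared_area: float, shared_area: float) -> float:
    """Fraction of area saved by sharing, relative to the unshared design."""
    return (unshared_area - shared_area) / unshared_area


def processor_reduction_rate(base_area: float,
                             cfu_unshared: float,
                             cfu_shared: float) -> float:
    """Area reduction of the whole extended processor (base core + CFU)."""
    return area_reduction_rate(base_area + cfu_unshared,
                               base_area + cfu_shared)


def static_power_reduction(base_area: float,
                           cfu_unshared: float,
                           cfu_shared: float) -> float:
    """Static power saving under the ASSUMED linear area-to-leakage model.

    If static power is proportional to area, the relative saving in static
    power equals the relative saving in total area.
    """
    return processor_reduction_rate(base_area, cfu_unshared, cfu_shared)


# Hypothetical areas in arbitrary units (not values from our tables).
cfu_rate = area_reduction_rate(100.0, 60.0)
core_rate = processor_reduction_rate(100.0, 100.0, 60.0)
```

With these hypothetical numbers, a 40% reduction of the CFU area translates into a 20% reduction of the extended processor area, because the unchanged base core dilutes the saving.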
As discussed in Section 2, using the pipelined CFU structure can remedy structural hazards between different custom instructions, and therefore the proposed technique can achieve a higher performance gain than traditional resource sharing techniques when multiple adjacent multicycle custom instructions are selected from the application. Our experiments verify such a scenario. To show the effect of our resource sharing and pipelining approach on the execution speedup, we estimate the performance gain achieved by our CFU in a single-issue out-of-order processor, as shown in Table IV. Column 2 presents the total number of execution cycles of the applications on the baseline processor. Column 3 shows the execution cycles using the CFU generated with the traditional resource sharing technique, where one custom instruction can only be executed after the previous one finishes execution. Column 4 shows the execution cycles using our pipelined CFU. In this case, custom instructions can enter the execution stage in a pipelined fashion. Columns 5 and 6 give the corresponding execution speedups. We can see that for some testbenches, the pipelined CFUs achieve a better performance gain than the nonpipelined CFUs. Other testbenches do not show much difference in the speedup because some of the generated custom instructions are single-cycle instructions, for which pipelining does not help. As described in Section 3, there is a possible trade-off in scheduling between achieving more resource efficiency and reducing the execution cycles of custom instructions in a pipelined architecture. Next, we change the coefficient α in the objective function to explore the design space. We use the same DAGs from Table I. The coefficient α is set to 1, 0.5, and 0 to represent three different cases, that is, targeting the area reduction only, both area and delay, and the cycle delay of each custom instruction only.
We name the three cases "Area", "Area-Time", and "Time", respectively. Figures 10 and 11 illustrate the design trade-off. In Figure 10, the area of the generated custom hardware is normalized to the total area overhead when implementing the custom instructions on separate hardware. In Figure 11, we estimate the average execution cycles (weighted by the execution frequency of each custom instruction) of the custom instructions executed on the generated hardware for each testbench, and normalize them to the minimum cycles obtained when they are executed on separate hardware and all the operations are scheduled in an ASAP manner. We can see that our "Time" case has the same average cycle count as the minimum one, that is, a value of 100%. For most testbenches, the trade-off between area reduction and cycle delay is explicit, that is, the resource sharing plan with the largest area reduction normally has the longest average cycle delay. With more pipeline stages, the functional resources can be shared more efficiently between different DFGs. We also observe that when balancing the two design objectives by setting α between 0 and 1, the cycle count can be reduced greatly without increasing the area overhead much, for example, in qsort, dijkstra, and adpcm d. As an example, we show in Figure 12 the three different CFUs obtained for the same testbench, namely adpcm d, with different design objectives. We omit the interconnections between functional units for simplicity. Blocks of different shapes represent different types of functional units, and the solid horizontal lines divide the functional units into different pipeline stages. We can see that the CFU generated in the "Area" case contains the fewest functional units but has more pipeline stages than the CFUs generated in the other two cases. The CFU in the "Time" case has the fewest pipeline stages but the largest number of functional units. The CFU generated in the "Area-Time" case achieves a balance between the two design objectives. It has less area overhead than the "Time" case and fewer pipeline stages than the "Area" case. We also present the selected custom instruction subgraphs of this testbench in Figure 13.
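The metric plotted in Figure 11 can be sketched as follows: the cycle count of each custom instruction is weighted by its execution frequency, and the weighted average is normalized to the same quantity under the ASAP schedule on separate hardware. The function names, the dictionary encoding, and the numbers are hypothetical and are chosen only to illustrate the calculation.

```python
from typing import Dict


def weighted_avg_cycles(cycles: Dict[str, int],
                        freq: Dict[str, int]) -> float:
    """Average cycles per custom instruction, weighted by execution frequency."""
    total_executions = sum(freq.values())
    return sum(cycles[ci] * freq[ci] for ci in freq) / total_executions


def normalized_cycles(cycles: Dict[str, int],
                      asap_cycles: Dict[str, int],
                      freq: Dict[str, int]) -> float:
    """Weighted average cycles relative to the ASAP minimum (1.0 means 100%)."""
    return weighted_avg_cycles(cycles, freq) / weighted_avg_cycles(asap_cycles, freq)


# Hypothetical profile: two custom instructions and their execution counts.
freq = {"ci0": 300, "ci1": 100}
asap = {"ci0": 2, "ci1": 3}          # minimum cycles on separate hardware
area_case = {"ci0": 3, "ci1": 4}     # cycles on a CFU optimized for area only

time_ratio = normalized_cycles(asap, asap, freq)
area_ratio = normalized_cycles(area_case, asap, freq)
```

In this hypothetical profile, a schedule matching the ASAP cycle counts yields a ratio of 1.0, as in the "Time" case, while the area-only schedule yields 3.25 / 2.25, or about 144% of the minimum.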
When solving the ILP problems, we run the LPsolver on a host workstation with a 2.80 GHz Intel Xeon CPU and 6 GB of memory. Table V presents the statistics of solving the ILP model for each testbench. Columns 2 and 3 show the number of variables and constraints for each benchmark. Column 4 shows the time consumed to solve the ILP problems. The solving time varies considerably between testbenches even at a similar scale, which demonstrates that the solving time depends on the specific DAG in addition to the number of variables and constraints. We set the maximum solving time to 1800 seconds to limit the time cost. For most of the testbenches, the difference between the optimal result (estimated by the LPsolver with relaxed constraints) and the best result found within this time limit is within 5%. In addition, we observe that since the operation-to-operator mapping process (including the interconnection generation) has linear complexity, its execution time is very small and can be neglected when compared to the time used in operation scheduling.
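The quality check applied when the solver reaches the time limit can be sketched as a relative gap between the best solution found so far and the bound estimated from the relaxed problem. The exact gap definition used by the solver is not stated here, so the formula below (gap measured relative to the relaxed bound, for a minimization objective) is an assumption, as are the function names and the sample values.

```python
def relative_gap(incumbent: float, relaxed_bound: float) -> float:
    """Relative distance of the best solution found from the relaxed bound.

    ASSUMPTION: minimization objective with a strictly positive bound,
    and the gap is measured relative to the relaxed bound.
    """
    if relaxed_bound <= 0:
        raise ValueError("this sketch assumes a strictly positive bound")
    return (incumbent - relaxed_bound) / relaxed_bound


def within_tolerance(incumbent: float,
                     relaxed_bound: float,
                     tolerance: float = 0.05) -> bool:
    """True if the solution found so far is within the given gap (5% here)."""
    return relative_gap(incumbent, relaxed_bound) <= tolerance


# Hypothetical objective values at the 1800-second limit.
gap = relative_gap(104.0, 100.0)
```

With these hypothetical values, the solution found at the time limit is 4% above the relaxed bound and would be accepted under the 5% criterion.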
