Abstract-Customised processor performance generally increases as additional custom instructions are added. However, performance is not the only metric that modern systems must take into account; die area and energy efficiency are equally important. Resource sharing during synthesis of instruction set extensions (ISEs) can significantly reduce the die area and energy consumption of a customised processor. This may increase the number of custom instructions that can be synthesized with a given area budget. Resource sharing involves combining the graph representations of two or more ISEs which contain a similar sub-graph. This coupling of multiple sub-graphs, if performed naively, can increase the latency of the extension instructions considerably. And yet, as we show in this paper, an appropriate level of resource sharing provides a significantly simpler design with only modest increases in average latency for extension instructions.
Based on existing resource-sharing techniques, this study presents a new heuristic that controls the degree of resource sharing between a given set of custom instructions.
Our main contributions are the introduction of a parametric method for exploring the trade-offs that can be achieved between instruction latency and implementation complexity, and the coupling of design-space exploration with fast area-delay models for the operators comprising each ISE. We present experimental evidence that our heuristic exposes a broad range of design points, allowing advantageous trade-offs between die area and latency to be found and exploited.
This has an impact on the number of ASIC design starts, not least designs involving application-specific processors with instruction set extensions. Application-specific instruction set processors (ASIPs) can be deployed easily on FPGA technologies, but FPGAs cannot compete with ASIC implementations in die area (and therefore unit cost), energy efficiency, and maximum clock rate. Kuon and Rose [8] show that ASICs have an area advantage of at least 5x, a delay advantage of 3x or 4x, and a dynamic energy advantage of 14x. In mobile or low-power devices, where high performance and low cost are essential attributes, the standard cell ASIC approach remains highly competitive in all three key axes of performance.
Our work addresses these issues by focusing on the problem of how to explore the design space of customised processors which may support a wider collection of extensions, perhaps from an entire application domain, or indeed a large number of extensions from a single complex application.
Section II summarizes the prior work related to this topic, after which section III outlines our motivation for this research and presents the technical problem we address. Section IV then describes our proposal for a parameterised resource-sharing heuristic. Experimental methods and results are presented in sections V and VI, followed by concluding remarks in section VII.
I. INTRODUCTION
The customisation of a processor through instruction set extensions is now a widely adopted technique in high performance embedded systems. One of the key challenges in the field of processor customisation is how to increase processor speed across an application domain, without replicating logic which could otherwise be shared.
Most research to this point has assumed that each unique application will demand a uniquely customised processor. However, the non-recurring engineering cost of producing an ASIC design increases with the introduction of each new technology generation. The approximate cost of creating a new SoC design has grown from $1 million in 1994 to current estimates of $20-50 million by 2010 [14]. A significant factor in the cost of each new chip design is the cost of mask production, which has grown from $100 thousand at 0.35 µm to almost $9 million for a 45 nm design [14].
II. RELATED WORK
There is a significant body of previous work on automatic identification and selection of ISEs to create application-specific processors [2], [6], [16]-[18], [20]. However, minimizing the area required to implement a set of ISEs is equivalent to the problem of constructing a minimal-cost weighted supergraph of a set of graphs, which is NP-complete [4]. A heuristic approach to this problem is presented by Brisk et al., which transforms a set of ISEs into a single hardware datapath based on the classical problem of finding maximal subsequences and substrings thereof in the graph representation of ISEs [3]. The aim of their work is to maximize die area reduction through the construction of a consolidation graph representing merged ISEs. Similarly, Moreano et al. [13] introduce a heuristic that uses the construction of a compatibility graph to reduce the problem of merging two datapaths to a maximum-weight clique problem. Since this problem is NP-complete, they propose non-exact methods to solve it in polynomial time.
Our work differs from [3] and [13] by introducing latency constraints into the merging process, whereas they focus only on maximising the area savings.
Hardware resource sharing is also a design goal in the field of high-level synthesis. Zaretsky et al. present an algorithm for dynamically generating templates of recurring patterns for resource sharing in CDFGs [21], and Martinez and Kuchcinski present a constraint-solving approach based on graph-matching to implement resource sharing in CDFGs [12].
For the sake of simplicity and tractability these heuristics all assume a fixed cost for the area and delay of each type of operator. This is known to be idealistic [19]. This is corroborated by results in our paper, which show a wide range of die area and timing possibilities for typical arithmetic operators synthesised by commercial tools.
In [7], Ghiasi et al. present a polynomial-time algorithm to maximise the delay assigned to operations in a data-flow graph (DFG). The problem is solved by injecting delay into units until all the paths in the graph become critical.
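To make the notion of slack behind this kind of delay budgeting concrete, the following is a minimal sketch that computes, for each operator of a small DFG, how much extra delay it could absorb without lengthening the critical path. It is not the algorithm of [7]; the function name `node_slack`, the example graph and the delay values are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

def node_slack(edges, delay):
    """Return {node: slack} for a DAG given as (src, dst) edges and per-node delays.

    Slack is the extra delay a node could absorb on its own without
    increasing the length of the critical path (illustrative definition).
    """
    succs, preds = defaultdict(list), defaultdict(list)
    for u, v in edges:
        succs[u].append(v)
        preds[v].append(u)

    # Topological order (Kahn's algorithm); the list grows while we scan it.
    indeg = {n: len(preds[n]) for n in delay}
    order = [n for n in delay if indeg[n] == 0]
    for n in order:
        for s in succs[n]:
            indeg[s] -= 1
            if indeg[s] == 0:
                order.append(s)

    # Forward pass: earliest finish time of every node.
    finish = {}
    for n in order:
        finish[n] = delay[n] + max((finish[p] for p in preds[n]), default=0)
    critical = max(finish.values())

    # Backward pass: latest finish time that keeps the critical path unchanged.
    latest = {}
    for n in reversed(order):
        latest[n] = min((latest[s] - delay[s] for s in succs[n]), default=critical)

    return {n: latest[n] - finish[n] for n in delay}

# Hypothetical example: a multiplier and a shifter both feed an adder.
example_edges = [("mul", "add"), ("shl", "add")]
example_delay = {"mul": 4, "shl": 1, "add": 2}
example_slack = node_slack(example_edges, example_delay)
```

In this example only the shifter is off the critical path, so it is the only operator that could be given a slower, and presumably smaller, implementation.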
Our paper builds on the work of Brisk et al. by introducing a parametric heuristic which uses the theory of timing budget management developed by Ghiasi et al. to allocate timing slack within a consolidation graph to non-critical nodes, thereby reducing their area-complexity.
The work of Lee et al. [10], Lorenz et al. [11], and Cheng and Tyson [5] has demonstrated that custom instruction set extensions have a significant potential to reduce dynamic energy consumption. By helping to reduce die area, our work also aims to reduce static energy consumption.
III. MOTIVATION
An ASIP design will be most cost-effective if it can address not a single application, but rather a whole class of applications. For example, an ASIP capable of efficient processing across the complete range of video standards has greater economy of scale compared with one which handles only MPEG2. When extending an instruction set to cover a complete class of applications we can expect larger numbers of extension instructions to be identified, effectively representing the union of the extension instructions required by each application in the class. Even with a single complex application it is possible to find large numbers of potential extension instructions, each of which adds to the die area of the system. Clark et al. highlight the difficulty of finding exact subgraph matches in order to reuse application-specific instructions across multiple applications in a single domain [6]. However, to avoid bloating the die area with large numbers of extension instructions, it is important to identify and exploit such commonality between instructions and, where possible, to share hardware resources when this represents a good tradeoff between die area and execution time.
A. Problem definition
In this work we assume that instruction set extensions have been identified by a previous compiler phase, and they are represented as a collection of directed acyclic graphs (DAGs) annotated with execution frequency. The problem we address is how to merge such a collection of graphs to reduce the overall die area, whilst minimising the increase in execution latency.
Depending on the alignment of shareable paths in ISE graphs, we may find that the resulting latency is almost unchanged after merging, or we may find that latency increases significantly for some or all merged operations. Naturally we want to avoid merging a frequent operation with an infrequent operation if such a merge would add to the latency of the frequent one. Thus the optimisation process becomes highly complex when instruction latencies may be modified by merging. Figure 1 illustrates how the sharing of one node within graphs (a) and (b) creates a significantly longer path with multiplexers to isolate the graphs according to the operation they implement. As a minimum, the latency will increase due to multiplexing, but this form of sharing also creates the potential for structural hazards between extension instructions. To avoid creating functional units from merged extensions that produce outputs at different times, a pipelined implementation of the graph in figure 1(c) may pipeline all paths to be the same length. In that case, the latency of both operations could be as large as the sum of the latencies of the graphs 1(a) and 1(b).
To evaluate the impact of graph-merging on die area and delay requires models for the operators of an ISE which reflect the wide variations in complexity that can be achieved under different timing constraints and logical context. For example, the marginal cost of an adder following a multiplier is less than the cost of an adder in isolation. Yehia et al. present examples of how arithmetic optimizations can reduce the combined latency of sequential operations [19]. Modern logic synthesis tools have the ability to perform similar arithmetic optimizations, such as folding ADD or SUBTRACT operations into the carry-save tree of a combinational multiplier. Even a single isolated arithmetic operator can be synthesised to a wide range of speed-area design points by specifying timing or area constraints during logic synthesis.
more than a factor of 2 as timing constraints are varied. For a 32-bit fixed-point adder the relative variation is even greater.
Design objectives will not always stipulate the extremes of minimum execution time, nor minimum die area: there are many possible intermediate points in the area-delay relationship, any one of which may be ideal for a given system. Our goal has been to develop a parametric resource-sharing algorithm that will enumerate extensive regions of the design space through the settings of a small number of real-valued parameters. Such an algorithm could be used in iterative design-space exploration methods, or may provide a means through which machine-learning based approaches can be trained to understand the characteristics of the design space. Those are topics for future work and are not addressed in this paper.
IV. PARAMETRIC RESOURCE-SHARING HEURISTIC
The proposed heuristic is derived from a path-based resource-sharing algorithm introduced by Brisk et al. [3].
A DFG is a DAG represented by a set of vertices V and a set of edges E, where vertices are operators, inputs or outputs, and edges indicate the data dependencies between them. A path within a DFG is a sequence of vertices that traverses the graph, through the edges, from an input to an output.
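The path definition above can be sketched as a simple depth-first enumeration. This is only an illustration under assumed data structures: the function name `enumerate_paths`, the edge-list representation and the example graph are hypothetical and not taken from the original implementation.

```python
def enumerate_paths(vertices, edges):
    """List every input-to-output path of a DAG.

    vertices: iterable of vertex names (inputs, operators, outputs).
    edges:    iterable of (src, dst) data dependencies.
    An input is a vertex with no predecessor; an output has no successor.
    """
    succs = {v: [] for v in vertices}
    has_pred = set()
    for u, v in edges:
        succs[u].append(v)
        has_pred.add(v)

    paths = []

    def walk(v, prefix):
        prefix = prefix + [v]          # copy, so sibling branches stay independent
        if not succs[v]:               # reached an output vertex
            paths.append(prefix)
            return
        for s in succs[v]:
            walk(s, prefix)

    for v in vertices:
        if v not in has_pred:          # start only from input vertices
            walk(v, [])
    return paths

# Hypothetical ISE: out = (a * b) + b
example_vertices = ["a", "b", "mul", "add", "out"]
example_edges = [("a", "mul"), ("b", "mul"), ("mul", "add"),
                 ("b", "add"), ("add", "out")]
example_paths = enumerate_paths(example_vertices, example_edges)
```

Note that the number of paths can grow exponentially with graph size in the worst case, so exhaustive enumeration of this kind is only practical for graphs of modest size.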
Resource sharing is induced by the search for maximum common substrings between two paths. A maximum common substring is a subsequence of vertices that maximizes area reduction. The area of a substring is given by the sum of the areas of each operation within the substring.
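As a minimal sketch of this search, the function below finds the common substring of two paths with the largest summed area using a standard dynamic-programming scan. It assumes that each path is a list of operator types, that two vertices match when their types are equal, and that each operator type has a single fixed, non-negative area; the name `max_area_common_substring` and the area values are hypothetical.

```python
def max_area_common_substring(path_a, path_b, area):
    """Return (substring, total_area) for the max-area common substring.

    path_a, path_b: lists of operator types along two paths.
    area:           {operator type: non-negative area} (assumed fixed costs).
    """
    best_area, best = 0.0, []
    # prev[j] holds (accumulated area, length) of the common run that ends
    # at path_a[i-2] and path_b[j-1]; cur[j] is the same for path_a[i-1].
    prev = [(0.0, 0)] * (len(path_b) + 1)
    for i in range(1, len(path_a) + 1):
        cur = [(0.0, 0)] * (len(path_b) + 1)
        for j in range(1, len(path_b) + 1):
            if path_a[i - 1] == path_b[j - 1]:
                run_area, run_len = prev[j - 1]
                cur[j] = (run_area + area[path_a[i - 1]], run_len + 1)
                if cur[j][0] > best_area:
                    best_area = cur[j][0]
                    best = path_a[i - cur[j][1]:i]
        prev = cur
    return best, best_area

# Hypothetical operator areas (arbitrary units) and two operator paths.
example_area = {"mul": 10, "add": 2, "sub": 2, "shl": 1, "xor": 1}
example_a = ["mul", "add", "shl", "add"]
example_b = ["sub", "mul", "add", "xor", "shl", "add"]
example_best = max_area_common_substring(example_a, example_b, example_area)
```

Both `mul, add` and `shl, add` are common substrings of length two here, but the first one is selected because it saves more area.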
The proposed heuristic is described in algorithm 1. The algorithm receives as input a set of $n$ DFGs $G_{in}$, where each $G_i \in G_{in}$ represents an ISE to be synthesised. The algorithm is parameterised by three threshold values $\alpha_T$, $\beta_T$ and $\theta_T$. Each of these is given a real value between 0 and 1. The output of the process is another set of graphs $G_{out}$ containing the result of resource sharing.
The algorithm is divided into a global and a local phase. During each phase there is an exhaustive search for a maximum substring, comparing all pairs of paths belonging to different graphs. During global merging the maximum substring is referred to as MaxStrGlobal and in local merging as MaxStrLocal.
The global phase operates on $G_{out}$, which is initially copied directly from $G_{in}$. Consequently, before any resource sharing is applied, the number of input graphs is the same as the number of output graphs, i.e. $m = n$, where $m = |G_{out}|$. For each $G_i \in G_{out}$ a set of paths $P_i$ is created with all the possible paths found in $G_i$. $P$ aggregates all the sets of paths from $P_1$ to $P_m$. Every path in $P_i$ is compared with all other paths that belong to $P_{j \neq i}$ in order to find the MaxStrGlobal between two DFGs. The graphs containing MaxStrGlobal, i.e. $G_x$ and $G_y$, will be merged into one graph $G'$. The process then switches to the local phase, where a MaxStrLocal is searched for among all pairs of paths of the merged graph $G'$ in which one path comes from $G_x$ and the other from $G_y$. Once MaxStrLocal is found, the paths are merged. The iterative search for further merging finishes when no further MaxStrLocal instances can be found. The process then goes back to the global phase, where the number of graphs has decreased by one. This loop finishes when no MaxStrGlobal is found or when there is only one graph left in $G_{out}$.
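The search step of the global phase can be sketched as follows. This covers only the selection of the pair of graphs that shares the best substring, not the merging itself or the local phase, and it is a brute-force illustration rather than the implementation used in this work. It assumes paths are lists of operator types with fixed per-type areas; the names `substrings` and `max_str_global` and all data are hypothetical.

```python
from itertools import combinations

def substrings(path):
    """All contiguous, non-empty substrings of a path, as tuples."""
    return {tuple(path[i:j])
            for i in range(len(path))
            for j in range(i + 1, len(path) + 1)}

def max_str_global(paths_by_graph, area):
    """Return (area, gx, gy, substring) for the best cross-graph substring.

    paths_by_graph: {graph name: list of paths (lists of operator types)}.
    area:           {operator type: area} (assumed fixed costs).
    Returns (0, None, None, ()) when no two graphs share an operator.
    """
    best = (0, None, None, ())
    for gx, gy in combinations(sorted(paths_by_graph), 2):
        # Every substring that occurs on some path of gy.
        subs_y = set().union(*(substrings(p) for p in paths_by_graph[gy]))
        for path in paths_by_graph[gx]:
            for sub in substrings(path) & subs_y:
                sub_area = sum(area[op] for op in sub)
                if sub_area > best[0]:
                    best = (sub_area, gx, gy, sub)
    return best

# Hypothetical input: three ISEs described by their operator paths.
example_area = {"mul": 10, "add": 2, "sub": 2, "shl": 1, "xor": 1}
example_graphs = {
    "G1": [["mul", "add"], ["shl", "add"]],
    "G2": [["mul", "add", "sub"]],
    "G3": [["shl", "xor"]],
}
example_choice = max_str_global(example_graphs, example_area)
```

In this example G1 and G2 would be selected for merging, since they share the `mul, add` sequence, whereas G3 shares only a shifter with G1.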
A. Alpha and Alpha Threshold
Every graph $G_i$ has an associated value $\alpha_i$.
(1)
where $F_i$ is the normalised execution frequency of $G_i$, defined as the execution frequency of $G_i$ divided by the maximum execution frequency in the set $G_{in}$; $L_i$ is the original latency of $G_i$, i.e. before the merging process; $L'_i$ is the latency of $G_i$ after being merged with other graphs; and $M_i$ is the area corresponding to operations in $G_i$ that can be merged with other graphs, divided by the total area that could be merged in the whole process.
$\alpha_T$ is therefore a parameter that serves to omit graphs from the merging process if their corresponding ISE is executed very frequently and if their latency after resource sharing is large. This effect can be partially offset when the level of sharing found in $G_i$ is high. The value of $\alpha$ associated with each graph is compared with $\alpha_T$ to decide if the graph will be included in the merging process. As the value of $\alpha$ decreases, the probability that the graph is implemented separately increases.
B. Beta and Beta Threshold
Every $G_i$ has an associated value $\beta_i$. The $\beta_T$ parameter tends to leave graphs unmerged if their latency is much larger than that of the rest of the ISEs. This is indicated by the difference between the average latency of all input graphs and the latency of the graph in question. The value of $\beta$ associated with every graph is compared with the value of $\beta_T$ in order to decide if the graph will be included in the merging process: if $\beta_i$ is greater than $\beta_T$, $G_i$ will not be considered during the merging process, thus preventing $G_i$ from affecting the latency of the other graphs.
When global merging is finished, the values of $\beta$ and $\alpha$ are calculated for every $G_i \in G_{in}$. These values indicate whether any input graph is to be left separate. The set of graphs $G^*$ keeps track of the graphs that will not be included in merging.
If $G^* \neq \emptyset$, the merging process will start again from $G_{out} = G_{in} \setminus G^*$.
LG-Lx X (1 A Y±Ay AGI) (4) LGI~Ax +Ay 0Y LGI [7] .
Each operator in the graph is initially assigned a latency and area given by its minimum delay point. Then a zero-slack algorithm is applied in order to relax the area of the operators that are off the critical path. An iterative slack distribution process takes place until no further slack is found in the graph. At this point, each operator has an area value that corresponds to the maximum permissible delay of the operator such that the critical path of the graph is not increased. The sum of the areas of all the operators in the graph determines the estimated ISE area.
2008 Symposium on Application Specific Processors (SASP 2008) (5)
We extracted a set of basic blocks from the Linear Predictive Coding (LPC) program, from the UTDSP benchmark suite, using an existing ISE identification technique. The DAGs from these blocks were presented as the initial set of $n$ ISEs $G_{in} = \{G_1, \ldots, G_n\}$ to our resource-sharing heuristic.
Additionally, we include summarised results of performing the same experiment with two other benchmarks: the ADPCM encoder from the UTDSP benchmark suite [9] and the ADPCM encoder/decoder from the SNU-RT benchmark suite [1]. The intention is to demonstrate that the results shown are consistent independently of the input set. The characteristics of the input sets can be seen in
A. Algorithm Implementation
The parameterised resource-sharing algorithm was implemented as a system that takes input graphs expressed in XML. It performs resource sharing and outputs a description of the resulting merged logic in Verilog to enable subsequent logic synthesis and integration with an existing processor core.
For [15] is not an issue in this process. Commutability of operations is exploited where any input of the ISE is part of the inputs of a two-input vertex in order to balance the assignment of inputs to the multiplexers.
The final area of the merged graphs in our experiments includes the contribution of all multiplexers inserted.
VI. RESULTS
The resource-sharing solutions found by executing the algorithm several times with varying parameters, using the input set extracted from the LPC program, are plotted in figure 3. Five solutions that correspond to Pareto points of the design space have been highlighted. We expect the solutions found by the resource-sharing techniques described in [3] and [13]
figure 5, three values for $\beta_T$ and $\alpha_T$ were chosen
Figure 6 shows the summarised results of the experiments performed with the three input sets described in
