Abstract-Efficiency and flexibility are critical, but often conflicting, design goals in embedded system design. The recent emergence of extensible processors promises a favorable tradeoff between efficiency and flexibility, while keeping design turnaround times short. Current extensible processor design flows automate several tedious tasks, but typically require designers to manually select the parts of the program that are to be implemented as custom instructions.
In this work, we describe an automatic methodology to select custom instructions to augment an extensible processor, in order to maximize its efficiency for a given application program. We demonstrate that the number of custom instruction candidates grows rapidly with program size, leading to a large design space, and that the quality (speedup) of custom instructions varies significantly across this space, motivating the need for the proposed flow. Our methodology features cost functions to guide the custom instruction selection process, as well as static and dynamic pruning techniques to eliminate inferior parts of the design space from consideration. Further, we employ a two-stage process, wherein a limited number of promising instruction candidates are first selected, and then evaluated in more detail through cycle-accurate instruction set simulation and synthesis of the corresponding hardware, to identify the custom instruction combinations that result in the highest program speedup or maximize speedup under a given area constraint.
We have evaluated the proposed techniques using a state-of-the-art extensible processor platform, in the context of a commercial design flow. Experiments with several benchmark programs indicate that custom processors synthesized using automatic custom instruction selection can result in large improvements in performance (up to 5.4X, average of 3.4X), energy (up to 4.5X, average of 3.2X), and energy-delay product (up to 24.2X, average of 12.6X), while speeding up the design process significantly.
I. INTRODUCTION
Efficiency and flexibility are two of the most important driving factors in embedded system design. Efficient implementations are required to meet the tight cost, timing and power constraints present in embedded systems. Flexibility, albeit tough to quantify, is equally important: it allows system designs to be easily modified or enhanced in response to bugs, evolution of standards, market shifts, or user requirements, during the design cycle and even after production. Various implementation alternatives for a given function, ranging from custom-designed hardware to software running on embedded processors, provide a system designer with differing degrees of efficiency and flexibility. Unfortunately, it is often the case that these are conflicting design goals. While efficiency is obtained through custom hardwired implementations, flexibility is best provided through programmable implementations. Hardware/software partitioning, i.e., separating a system's functionality into embedded software (running on programmable processors) and custom hardware (implemented as co-processors or peripheral units), is one approach to achieve a good balance between flexibility and efficiency [1]. However, the increasing scale and complexity of embedded Systems-on-Chips (SOCs), together with decreasing market cycles, provide a constant push towards increasing embedded software content.

While it is known that application-specific instruction set processors (ASIPs) provide a good tradeoff between efficiency and flexibility, their relatively large design turnaround times (compared to software implemented on pre-designed and proven embedded processors) have, in part, prevented their widespread use in SOCs. The recent availability of configurable and extensible processors promises a favorable tradeoff between efficiency and flexibility, while keeping design turnaround times short. The emergence and success of companies that offer configurable and extensible embedded microprocessor and DSP cores as a primary product testifies to the benefits of this approach.

Realizing the potential of extensible processors requires the development of supporting tools and methodologies that enable system designers to achieve design turnaround times that are comparable to all-software solutions. State-of-the-art extensible processor design flows automate several tedious tasks and aid the design process by providing a high-level interface to perform processor customization, automatic generation of register-transfer level (RTL) hardware descriptions, a retargetable software tool chain for the customized processor (including compilers, assemblers, debuggers, and binary utilities), and design automation scripts to support verification, logic synthesis, and physical design. However, they require designers to manually select parts of the program and design the custom instructions that implement them. As shown in this paper, these are daunting tasks for large programs, and are further complicated when performance improvements need to be attained subject to other considerations, such as constraints on hardware area overheads and clock period. Our work addresses the above need by providing a systematic methodology and automation algorithms for the application-specific customization of extensible processors.

Acknowledgments: This work was supported in part by DARPA under contract no. DAAB07-00-C-L516 and in part by the NJCST Center on Embedded System-on-a-Chip Design.
A. Related Work
We briefly trace related research work along the lines of ASIP architectures and overall design methodologies, application-specific instruction set selection, and compilation techniques for ASIPs.

Instruction set selection refers to the problem of defining a custom instruction set that will result in the most efficient processing for a given application or domain. Recent work has also focused on reducing power consumption in ASIPs. A concept of instruction subsetting is introduced in [26] to create an ASIP from a more general processor, such as a DSP.
In [27], a low-power ASIP is synthesized from a customized ASIC using power estimation techniques derived from high-level synthesis. Synthesis of power-efficient hypermedia processors is addressed in [28]. Case studies of power optimization of ASIP cores are presented in [29], [30].
B. Paper Overview and Contributions
The contributions of this work include:
Instead of designing an entire instruction set, we propose an automatic methodology for the selection of custom instructions to augment an extensible processor in order to maximize its efficiency for a given application program.¹
We demonstrate the need for such a methodology by illustrating the size and complexity of the custom instruction design space, and present an analysis of the issues and tradeoffs involved in instruction set extension.
We develop cost functions and pruning techniques to guide the custom instruction selection process.
Unlike most previous work, our methodology involves actual logic synthesis and cycle-accurate instruction set simulation of promising candidate custom instructions to accurately estimate performance benefits and hardware overheads.
We not only consider every custom instruction independently, but also consider combinations of custom instructions, and choose a combination that can maximize performance under area constraints.
We have performed experiments, in the context of a commercial design flow (using Tensilica's Xtensa), indicating that custom processors generated through the proposed methodology result in significant improvements in performance, energy, and energy-delay product. Our results are derived from post-synthesis technology-mapped netlists of the entire original and optimized processor cores.
II. MOTIVATION
Current extensible processor flows typically require designers to manually design the custom instructions that are used to augment the base processor platform. Given an application program, this involves profiling of the program source code on the base processor, identification of the most performance-critical sections of the program (the "hot spots"), formulating custom instructions by specifying new hardware resources and the operations they perform, and rewriting the source code of the program to directly invoke the custom instructions.

The number of candidate custom instructions for a given program grows exponentially with the program size. It is not uncommon for functions with a few tens of operations to contain several hundred instruction candidates. We used the tools developed in our paper to identify all possible (unique) instruction candidates for a C function BYTESWAP(), derived from the Tensilica training kit, which is shown in Fig. 1. As the name indicates, this function swaps the order of bytes in a word (e.g., for endian conversion). The function, although quite small, contains 482 potential custom instructions!

Most realistic programs contain a function hierarchy with a large number of functions. In well-optimized software, it is also often the case that there are several "critical" functions, i.e., no single or small set of functions is responsible for a large fraction of the total program execution time [31]. In such scenarios, a combination of several custom instructions may be necessary to achieve the desired performance.

Two custom instructions that appear to lead to the same speedups may result in hardware that impacts the critical path (and hence the clock period) to different extents.

¹ Although we directly target performance improvements, we demonstrate that energy and energy-delay product are also significantly reduced.
If the addition of a custom instruction causes a violation of the original processor's clock period, the designer can choose one of the following options: (i) reduce the amount of computation performed in the instruction, (ii) split it into multiple instructions, (iii) multi-cycle the execution of the instruction, or (iv) simply accept the clock period penalty. Different choices may be optimal, depending on various factors.
When constraints on hardware overheads are present (which is frequently the case), the designer needs to judiciously select the hardware resources employed. In fact, given a set of operations to be performed by a custom instruction, there exists an area-delay tradeoff based on the number of resources used. Similar arguments apply to storage resources (registers, lookup tables, etc.) used in custom instructions. When multiple instructions need to be selected from a large set of candidates, the tradeoffs involved are complex, and can be difficult to identify manually, leading to a need for methodologies such as the one proposed in this paper.
Example 1: In order to illustrate the variation of speedup over the set of all candidate instructions (the "design space"), we generated and evaluated all possible custom instructions for the BYTESWAP() function. The function iterates 10,000 times and consumes 130K cycles on the base processor core. Fig. 2 plots the execution cycle savings resulting from each of the 482 candidate instructions. The instructions are ordered according to their size, location in the source code, and relative importance in the profile.
IV. METHODOLOGY AND ALGORITHMS
In this section, we describe our design methodology for automatically generating custom instructions to get performance improvements and/or energy reductions. We first provide an overview and then describe the important steps in detail.
A. Overview 
Although manual creation of custom instructions allows for human ingenuity to be applied to create high-quality hardware, considering the large and complex design space of custom instructions, designers can benefit significantly from tools and methodologies to explore this design space automatically. The proposed methodology could be used to explore the large design space and short-list the most promising instruction candidates, which human designers can further refine.

Fig. 2. Performance variation across the custom instruction design space.
III. BACKGROUND
We chose Tensilica's Xtensa [3] as our target platform, since its architecture was designed from scratch to be customizable, allowing it to be efficiently tailored to target applications, to obtain SOCs with optimized performance, power, code size and die size.
Its key features are as follows:
The Xtensa processor provides the means to select a wide range of architectural parameters in the base processor core, e.g., whether to include generic instructions (e.g., multiply-accumulate) and floating point co-processors, how to configure the register file, the memory/cache architecture, and the exception/interrupt mechanisms, and whether to include test and debugging support.
The designer can extend the base processor by designing custom instructions for application-specific computations.
A GNU-based software tool suite is automatically generated to match the exact configuration specified in the processor generator, including a GNU C/C++ compiler, assembler, linker, debugger, diagnostics, reference test benches, cycle-accurate instruction set simulator (ISS), and standard libraries. This enables rapid design, verification, and integration of application-specific hardware and software, and removes a major bottleneck that has prevented consideration of ASIPs as the processing element of choice in SOC design.
Designers use the Tensilica Instruction Extension (TIE) language [2] to define custom instructions. The instructions can be either single-cycle or multi-cycle. Instead of invoking custom instructions at the assembly level, calls to TIE instructions can be directly inserted into high-level language (e.g., C, C++) descriptions of the application program. This eases the designers' burden, so that they can put more emphasis on the functionality of the program and on selecting the best instructions. The Xtensa instruction set defines a limited opcode range and encoding formats for custom instructions. TIE instructions can have at most two input and one output operand fields in the instruction. If a custom instruction needs additional inputs and/or outputs, it can implicitly read them from or write them to some internal state registers, which are defined in the TIE specification.

Step 1 generates the program dependence graphs [32], which include: (i) the control dependence graph to identify the control blocks of the program, (ii) the data dependence graph to get data predecessor and successor information, and (iii) the control flow graph to indicate how control flows through the application. At the same time, the program is simulated and then profiled both at the function level and line level, to determine where the hot spots are (step 2).
Step 3 ranks the control blocks of the program in descending order of potential for improvement. The ranking criteria may include performance (from a profiler), energy (from an energy estimator) or energy-delay product (from both), depending on the optimization objective. In this work, we focus on performance as the metric to drive custom instruction selection. However, our results (Section V) show that energy and energy-delay product are also significantly reduced in the process. A control block is a sequence of program statements or control blocks that can be executed sequentially. It is different from a basic block in that a control block can recursively contain other control blocks, while a basic block cannot contain other basic blocks. In our work, custom instructions are selected inside control blocks. However, it is possible to cross control block boundaries in the case of conditional statements that can be transformed to equivalent arithmetic/logic expressions (e.g., if and case statements that can be translated into "?" and ":" operators).
Steps 4 and 5 generate and select custom instruction templates, respectively. Although these two steps are explained separately for the sake of clarity, their implementation may be combined for the sake of efficiency. A template is a set of program statements that is a candidate for implementation as a custom instruction. Since the number of templates grows exponentially with program size, it is necessary to prune the search space. Our pruning techniques are explained in detail in Sections IV-B.1 and IV-B.2.
Each promising candidate selected in step 5 is extracted from the C program (step 6), and transformed to a format (TIE) that describes the opcode, operand, states, user-registers, computations, etc. (step 7). It is then compiled (using Tensilica's TIE compiler [3]) to get the Verilog RTL description of the additional hardware that will augment the base processor (step 8). The RTL description is synthesized (using Synopsys Design Compiler [33]) to get the timing and area information (step 9). If the new instruction cannot be fit in the base processor core's clock period, either the number of cycles used by the new instruction is increased, or the clock period is increased, and the custom instruction generation phase is repeated (step 11). Hence, for a single custom instruction template, there may be several versions with varying clock period and number of cycles. At the same time, the original C code is transformed by replacing the appropriate statements with a call to the custom instruction (step 10). Then, for each version of the custom instruction, the new C program is compiled and profiled using a cycle-accurate ISS to get the performance improvement (step 12). Steps 5 through 12 are iterated for every selected template.
After each individual custom instruction has been verified, a subset (combination) of instructions is chosen to get the maximum performance improvement under the given area constraint, depending on the selection criteria (step 14). This step is described in further detail in Section IV-B.4. The hardware corresponding to the selected custom instruction combination is built and synthesized (steps 15 and 16). If the timing and/or area constraint is not satisfied, the next best custom instruction combination is selected (step 17). Otherwise, the modified C program is compiled and profiled again to get the final performance improvement and/or energy reduction (step 18). After having selected the custom instruction combination, the whole processor is built and synthesized (steps 19 and 20).
B. Details
In this section, we describe in detail the important steps of our algorithm. Section IV-B.1 describes the template generation method, Section IV-B.2 describes the template selection algorithm, Section IV-B.3 describes the instrumentation of the program to invoke the custom instructions, and Section IV-B.4 details the custom instruction combination selection method.
B.1 Template generation
Although template generation and selection are represented as distinct steps in Fig. 3, in our implementation, they are interleaved in order to improve efficiency. In the generation phase, some templates, which have low potential for performance improvement, are not generated. We refer to this as a static pruning technique.
We propose to generate templates in three phases. In the first phase, we generate basic templates. A basic template consists of a single node in the program dependence graphs that satisfies the given selection (pruning) criteria. In the second phase, we generate dependent templates. A dependent template is a fully connected subgraph of the data dependence graph. Hence, each node of a dependent template is connected to some other node in the template through a variable. Dependent templates are generated by using a basic template as a seed, checking data dependencies of the basic template, and including combinations of data dependence predecessors and successors if they satisfy the selection criteria. In the third phase, we generate independent templates. In this step, we use both basic and dependent templates as seeds, and add nodes that are independent of the seed template. The following example illustrates the template generation process.

Example 2: Fig. 4(a) shows a small fragment of C code corresponding to a single control block, and its data dependence graph. Each node represents a single statement in the C program. Each node also has a weight that represents the fraction of the total program execution time spent in that node. The dotted lines in Fig. 4(a) indicate data dependencies with operations that belong to other control blocks. Fig. 4(b) shows all possible templates that can be generated from this graph. Nodes 1, 2, 3, and 4 form basic templates. Templates 5, 6, and 7 are generated in the dependent template generation phase. The independent template generation phase generates templates 8 through 13. It combines every basic or dependent template with templates they are independent of, i.e., template 4.
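The template space of Example 2 can be reproduced with a small brute-force sketch. This is not the three-phase generator itself: it simply enumerates every subset of nodes, keeps the legal ones, and labels them basic, dependent, or independent afterwards. The graph shape (1 -> 2 -> 3, with node 4 independent) is an assumption inferred from the text, since Fig. 4 is not reproduced here, and the function names are illustrative only. A subset is treated as legal when merging it into one node would not create a data dependence cycle.

```python
from itertools import combinations

def reachable(graph, start):
    """All nodes reachable from `start` along data dependence edges."""
    seen, stack = set(), [start]
    while stack:
        for succ in graph.get(stack.pop(), ()):
            if succ not in seen:
                seen.add(succ)
                stack.append(succ)
    return seen

def is_legal(graph, template):
    """Reject templates where an outside node sits between two template nodes."""
    template = set(template)
    for node in template:
        for outside in reachable(graph, node) - template:
            if reachable(graph, outside) & template:
                return False  # merging would create a dependence cycle
    return True

def is_connected(graph, template):
    """True if the template nodes form one connected piece (edges undirected)."""
    template = set(template)
    neighbours = {n: set() for n in template}
    for src in template:
        for dst in graph.get(src, ()):
            if dst in template:
                neighbours[src].add(dst)
                neighbours[dst].add(src)
    start = next(iter(template))
    seen, stack = {start}, [start]
    while stack:
        for nb in neighbours[stack.pop()]:
            if nb not in seen:
                seen.add(nb)
                stack.append(nb)
    return seen == template

def enumerate_templates(graph):
    """Brute-force enumeration; exponential in the number of nodes."""
    nodes = sorted(graph)
    result = {"basic": [], "dependent": [], "independent": []}
    for size in range(1, len(nodes) + 1):
        for subset in combinations(nodes, size):
            if not is_legal(graph, subset):
                continue
            if size == 1:
                kind = "basic"
            else:
                kind = "dependent" if is_connected(graph, subset) else "independent"
            result[kind].append(subset)
    return result

# Assumed dependence graph for Example 2: 1 -> 2 -> 3, node 4 independent.
example = {1: [2], 2: [3], 3: [], 4: []}
templates = enumerate_templates(example)
print({kind: len(found) for kind, found in templates.items()})
# {'basic': 4, 'dependent': 3, 'independent': 6}
```

Under these assumptions the sketch yields 13 templates, matching the count in Example 2, and it rejects the subset {1, 3}. The exhaustive loop also makes the exponential growth of the candidate space visible, which is why pruning is needed in practice.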
Note that nodes 1 and 3 cannot be combined to form a template. To explain that, we need to consider the fact that all the statements corresponding to a template will be replaced by a single statement (call to the custom instruction) in the optimized C program, i.e., they are merged into a single node in the dependence graph. In that case, node 2 will be the data predecessor of the new node (since it generates data that is used by node 3), as well as its data successor (since it uses data generated by node 1). That will introduce a data dependence cycle inside a control block, which is illegal, since it changes the program's functionality.

In Example 2, we enumerated all possible templates for the sake of clarity. In practice, the number of candidate templates may potentially be very large, even for programs of moderate size. Hence, it is necessary to use pruning criteria to select good templates while discarding less promising ones. Any metric to evaluate the templates should consider the following factors:

Amdahl's law [34] suggests that the fraction of the original program's execution time that a template accounts for presents a bound on the performance improvement achievable when it is converted into a custom instruction. Hence, templates that have a larger cumulative weight are more desirable.

Blindly applying the first criterion would result in the degenerate solution where the largest possible template is always chosen. However, it is often the case that the largest template does not result in the best speedup. That is in part because each template has an inherently different scope for optimization, i.e., efficiency when implemented as a custom instruction. Given two templates that account for the same fraction of total execution time on the original processor, the number of cycles required to execute them when implemented as a custom instruction is an indicator of their optimization potential.

Many extensible processors, including the Xtensa processor, impose a limit on the number of operand fields that can be specified in the instruction format. Also, the general-purpose register file in the processor has a specific number of read and write ports, imposing a limit on the number of general-purpose registers that can be used in a custom instruction. This bottleneck can be overcome by defining custom registers (called state or user-defined registers) whose use is hardwired into the instruction. However, the use of state registers imposes an additional overhead. When other computations generate (or use) data that are used (or generated) by the custom instruction, the contents of the state registers need to be written to or read from either memory or the processor's general-purpose registers. The overhead for data transfer is determined by the number of "excess" input and output variables of a given instruction template.

Considering all the above factors, we use the following equation to rank candidate templates:

[Equation (1)]

where In and Out are the number of inputs and outputs of the template, respectively, α is the number of inputs that can be encoded in the instruction, β is the number of outputs that can be encoded, and γ is the number of cycles required by the template when implemented as a custom instruction. The numerator in Equation (1) [...]. Since a custom instruction can have at most α inputs and β outputs specified in the instruction (the exact values of α and β are dependent on the processor architecture), if the number of inputs is greater than α, a cycle is needed for each additional input to load it into a user-defined state register. If the number of inputs is less than α, this term is zero. A similar explanation holds for the number of additional outputs.

It bears mentioning that the Priority metric presented in Equation (1) is a coarse-grained metric, because it does not consider detailed architectural effects such as pipeline stalls. Depending on the program structure, compiler, and base processor architecture, some templates may cause pipeline delays, while others may not. Also, because of pipeline stalls, the time spent in storing values to state registers or reading values from state registers can be masked in some templates but not in others, resulting in variations in the speedup obtained by seemingly similar templates.

Estimating pipeline stalls at the C program level is quite difficult, since it requires a lot of information regarding the processor's micro-architecture and compiler optimizations. However, for our purpose, an approximate metric suffices, since we employ it only as a pruning criterion, and to identify groups of promising templates. As indicated in Fig. 3, we actually evaluate the most promising templates using a cycle-accurate ISS and synthesis of the additional hardware, and select the best from among them.

Since templates having higher values of the Priority metric are likely to get more performance speedup or energy reduction, we first consider those templates as seeds when generating new templates. In order to achieve this, we preserve a ranked index of the templates, and traverse the list in decreasing order of Priority when choosing a seed. Further, we set a Priority threshold to determine the templates considered for further analysis, those below the threshold being discarded. For the sake of computational efficiency, the threshold mechanism is dynamically enforced during template creation itself (rather than as a post-processing step). While generating templates, we preserve the highest priority (Priority_highest) seen thus far. After we generate a new template, we compute its priority and compute the ratio Priority_current / Priority_highest. If the ratio is below the threshold, we do not add it to the template list. Fig. 5 shows the relationship between the threshold ratio and the number of templates generated for an example program, RGBtoCMYK, which performs pixel color conversion. The number of templates displays a sharp decrease if the threshold ratio is above 0.2. If the threshold ratio is set too high, the searched design space is limited and the solution may be trapped into local optima. On the other hand, if the threshold ratio is set too low, not many templates are pruned out. In our experiments, we found that setting the threshold ratio between 0.1 and 0.15 achieved reasonable reductions in the number of templates generated, with little or no impact on quality.

B.2 Template selection

After all the templates in a control block have been generated, the template selection step separates those templates into several groups based on ranges of the Priority values (e.g., all templates having Priority within 5% of Priority_highest form the first group, etc.). This "binning" is performed because, as mentioned earlier, the Priority metric is only a coarse-grained indicator of the actual performance improvement, and hence more detailed evaluation is necessary to discern the best template from a group of templates with similar Priority values. We examine templates group by group, and all the templates within a group are evaluated in detail (using the ISS and hardware synthesis), without regard to their exact Priority value. We first attempt to generate custom instructions from all templates in the highest priority group (steps 6 to 12 of Fig. 3). If all templates from a group fail to generate a custom instruction (e.g., because of timing and area constraint violations), we move to the next best group. This procedure continues until at least one template from a group succeeds in generating a custom instruction, or all groups have been tried.
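The ranking, dynamic pruning, and binning steps described above can be sketched as follows. The exact form of Equation (1) is not available in this text, so the `priority` function below is an assumed form pieced together from the surrounding description: the template's cumulative weight divided by its custom-instruction cycle count plus one extra cycle per input or output beyond what the instruction format can encode. The defaults `alpha=2` and `beta=1` follow the stated TIE limit of two input and one output operand fields; all function names and the sample numbers are hypothetical.

```python
def priority(weight, n_in, n_out, cycles, alpha=2, beta=1):
    """Assumed coarse ranking metric (not the paper's verified Equation (1))."""
    extra_in = max(0, n_in - alpha)     # inputs that must go through state registers
    extra_out = max(0, n_out - beta)    # outputs that must go through state registers
    return weight / (cycles + extra_in + extra_out)

def prune_dynamically(candidates, threshold=0.1):
    """Keep a template only if its priority is at least threshold * best-so-far."""
    kept, highest = [], 0.0
    for name, prio in candidates:       # candidates arrive in generation order
        highest = max(highest, prio)
        if prio >= threshold * highest:
            kept.append((name, prio))
    return kept

def bin_by_priority(templates, width=0.05):
    """Group templates by how far (in steps of `width`) they fall below the best."""
    highest = max(prio for _, prio in templates)
    bins = {}
    for name, prio in templates:
        index = int((1 - prio / highest) / width)
        bins.setdefault(index, []).append(name)
    return [bins[i] for i in sorted(bins)]  # best group first

# Hypothetical templates: (name, priority).
candidates = [("A", 0.30), ("B", 0.29), ("C", 0.20), ("D", 0.02)]
kept = prune_dynamically(candidates, threshold=0.1)
groups = bin_by_priority(kept)
print(groups)  # [['A', 'B'], ['C']]
```

In this illustration, template D falls below 10% of the best priority seen so far and is never added to the list, while A and B land in the same group and would both be sent to detailed evaluation regardless of which one scores slightly higher.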
Ideally, the custom instruction should fit in one cycle at the original processor core's clock period. However, sometimes the additional hardware required to implement the custom instruction results in an increase in the critical path and violates the original clock period constraint. In such situations, our tool tries two options: (i) increasing the clock period (this is equivalent to slowing down all the instructions), and (ii) increasing the number of cycles for the custom instruction until the original clock period constraint is satisfied. Specifically, we first generate the custom instruction as a single-cycle instruction, and find the smallest clock period it can be synthesized to fit in. If the clock period thus obtained is greater than the base processor core's clock period, we iteratively increase the number of cycles of the custom instruction until increasing the cycle count further does not help reduce the critical path. For each number of cycles, we find the shortest clock period that can accommodate the custom instruction. Hence, each selected template may result in several different versions of a custom instruction, each having a different clock period and number of cycles.
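The version exploration described above can be sketched as a short loop. The callback `min_period` is a hypothetical stand-in for a synthesis run that reports the shortest clock period at which the instruction fits when spread over a given number of cycles; the `max_cycles` cap and the table of periods are likewise assumptions made only for illustration.

```python
def explore_versions(min_period, base_period, max_cycles=8):
    """Return (cycles, clock_period) versions for one custom instruction template.

    min_period(cycles) is a hypothetical callback standing in for synthesis.
    """
    versions = [(1, min_period(1))]     # start with the single-cycle version
    while versions[-1][1] > base_period and len(versions) < max_cycles:
        cycles = versions[-1][0] + 1
        period = min_period(cycles)
        if period >= versions[-1][1]:
            break                       # more cycles no longer shorten the critical path
        versions.append((cycles, period))
    return versions

# Hypothetical synthesis results (clock period in ns per cycle count).
table = {1: 9.0, 2: 6.0, 3: 4.5, 4: 4.5}
print(explore_versions(lambda c: table[c], base_period=5.0))
# [(1, 9.0), (2, 6.0), (3, 4.5)]
```

Each returned pair is one version of the instruction: the first entries trade a longer processor clock period for fewer cycles, while the last one meets the base clock period at the cost of a higher cycle count.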
For each control block in the application program, all successfully generated custom instructions are compared in terms of their actual speedup and area, and the ones that best fit the selection criteria are chosen and passed on to the subsequent phase in which instruction combinations are selected. If we only consider the number of execution cycles as the objective (without regard to area or clock period), it is only necessary to preserve disjoint templates that result in the best speedup. However, if we also consider area and clock period as constraints or objectives, we can apply the notion of Pareto optimality [35] to remove inferior or dominated templates from further consideration. For example, suppose that execution cycles and clock period are the only parameters of interest. Consider two templates, A and B, where the set of nodes that constitute A is a superset of those that constitute B. If both A and B can be implemented without increasing the base processor core's clock period, we retain A and discard B. Naturally, as more dimensions (area, energy, etc.) are added, Pareto optimality translates into stronger conditions that need to be satisfied to discard a candidate, resulting in a larger number of candidate instructions being passed on to the subsequent phase.
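A minimal sketch of the Pareto filtering idea follows. It assumes each candidate is summarized by a tuple of objectives in which lower is better (here execution cycles, clock period, and area); the candidate names and values are hypothetical and do not come from the experiments.

```python
def dominates(a, b):
    """True if `a` is no worse than `b` everywhere and strictly better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_front(candidates):
    """Drop every candidate that is dominated by some other candidate."""
    return {
        name: objs
        for name, objs in candidates.items()
        if not any(dominates(other, objs) for other in candidates.values())
    }

# Hypothetical candidates: (execution cycles, clock period in ns, area in gates).
candidates = {
    "A": (70_000, 5.0, 1200),
    "B": (90_000, 5.0, 1500),   # worse than A in cycles and area, equal in period
    "C": (60_000, 5.5, 2000),   # fewest cycles, but slower clock and more area
}
print(sorted(pareto_front(candidates)))  # ['A', 'C']
```

With only cycles and clock period as objectives, fewer candidates survive; adding area as a third objective is what keeps C alive here, which illustrates why more dimensions lead to more candidates being passed on.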
B.3 Custom instruction insertion
When a template is selected and needs to be evaluated in detail using an ISS, a call to the custom instruction needs to be inserted back into the original C program (step 10 of Fig. 3). However, for some templates, no proper location for insertion can be found using only the above method, as illustrated in the following example.

Example 3: Fig. 6(a) shows an example code fragment and its data dependence graph. As can be seen from the code, the control flow dependencies are 1 → 2 → 3 → 4 → 5. Suppose that the template under consideration is {1,2,5}. Note that the template becomes a single node in the modified data dependence graph. Node 3 is the data dependence successor of the new node, and node 4 is its data dependence predecessor. In other words, the data flow after insertion of the custom instruction would be 4 → {1,2,5} → 3, which contradicts the control flow of the original program. However, if nodes 3 and 4 are independent, it does not matter which one executes first. If we can exchange the order of nodes 3 and 4, we find that the template can be inserted back into the C program without affecting functional correctness, as shown in Fig. 6(b). In the modified C code, one output is implicitly written to a state register. The statement t=RUR(0) in Fig. 6(b) assigns the value from state register 0 to variable t.

The above example illustrates that the insertion of calls to a custom instruction may require the reordering of other program statements that are not part of the template itself. This requires that the data dependence information inside a control block be complete and exact. In practice, it is known that exact data dependence graph extraction is difficult when array or pointer accesses are present [36]. One solution is to have a preprocessing step, which adds code (i) at the beginning of a set of computations involving arrays and pointers, to read all required values into temporary scalar variables, and (ii) at the end, to write back temporary scalar variables into arrays or pointer contents. If the compiler is smart enough, the penalty of introducing the additional variables can be reduced to a minimum.
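The reordering argument of the example can be sketched in a few lines of Python. This is a minimal illustration, not the tool's implementation: the function name `collapse_and_order`, the label `"T"` for the merged node, and the edge list used below are all assumptions, since the dependence graph of Fig. 6(a) is only described in the text. The sketch collapses the template into one node and topologically sorts the result; if no order exists, the template cannot be inserted by reordering alone.

```python
from collections import defaultdict, deque

def collapse_and_order(nodes, edges, template):
    """Collapse `template` into one node "T" and return a legal statement
    order, or None if the collapsed dependence graph contains a cycle."""
    merged = "T"  # hypothetical label for the custom-instruction node

    def rep(n):
        return merged if n in template else n

    # Nodes of the collapsed graph, in original program order.
    merged_nodes = []
    for n in nodes:
        if rep(n) not in merged_nodes:
            merged_nodes.append(rep(n))

    succ = defaultdict(set)
    indeg = defaultdict(int)
    for u, v in edges:
        ru, rv = rep(u), rep(v)
        if ru != rv and rv not in succ[ru]:
            succ[ru].add(rv)
            indeg[rv] += 1

    # Kahn's algorithm: repeatedly emit a node with no pending predecessor.
    ready = deque(n for n in merged_nodes if indeg[n] == 0)
    order = []
    while ready:
        n = ready.popleft()
        order.append(n)
        for m in sorted(succ[n], key=merged_nodes.index):
            indeg[m] -= 1
            if indeg[m] == 0:
                ready.append(m)
    return order if len(order) == len(merged_nodes) else None

# Assumed data dependencies: 1->2, 2->3, 2->5, 4->5 (nodes 3 and 4 independent).
print(collapse_and_order([1, 2, 3, 4, 5],
                         [(1, 2), (2, 3), (2, 5), (4, 5)], {1, 2, 5}))
```

With the assumed edges, node 4 is scheduled before the merged node and node 3 after it, matching the exchange of nodes 3 and 4 described in the example. Adding a dependence from node 3 to node 4 would create a cycle through the merged node, and the function would return `None`.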
If the template requires state or user-defined registers, reads from and writes to state registers need to be performed explicitly in the application program. This code is also automatically generated by our tool as part of the custom instruction insertion, in contrast with the register assignment for general-purpose registers, which is performed by the compiler. However, the overhead associated with transfer of data to/from state registers can be minimized if we carefully select the variables that are put into state registers. We illustrate this through the following example.
Example 4: Fig. 7(a) shows a fragment of C code. Note that variable offset is assigned a value before the loop and is not changed during the loop. Fig. 7(b) shows the modified program after a custom instruction is inserted to perform the computation offset + i*j. This expression requires three input operands (offset, i, j), while only two operands can be read at a time from general-purpose registers. Hence, an additional (user-defined) state register is created to store one of the operands. We have three natural choices that follow. Fig. 7(b
B.4 Custom instruction combination selection
After template selection, there may still be several custom instruction candidates in the program, with each candidate having several versions (due to variations in clock period and number of clock cycles). The next step is to select a subset (or combination) of custom instruction candidates that best satisfies the performance, area and energy requirements. The inclusion of one custom instruction could either reduce or enhance the performance/energy benefits of another custom instruction. Thus, the custom instruction candidates that have survived scrutiny so far also need to be evaluated considering their inter-dependencies. Clearly, this search space is large. Hence, methods to efficiently explore the search space need to be employed. If all the selected custom instructions are non-overlapping, and if the optimization criterion is maximizing performance under an area constraint, the selection problem can be stated as follows:
Problem 1: Given a set of non-overlapping custom instructions, with each instruction having several versions (not selecting the instruction can also be considered as a degenerate version with zero area overhead and no performance benefit), find a version for each instruction such that the performance is maximized while area is under a certain threshold.
A more mathematical representation is as follows:
Problem 2: Given n non-overlapping custom instructions, cus- subject to: 
In the above equations, Cycles_orig is the number of cycles of the original program running on the base (unaugmented) processor core. AREA is the maximum total area allowed for the custom instruction hardware. Equation (3) ensures that exactly one version of each custom instruction is chosen. Equation (4) makes sure that the area of the selected custom instructions does not exceed the maximum area constraint. The cost function (Equation (2)) is the ratio of the execution time of the program with custom instructions to the execution time of the original program. The first factor in the numerator is the number of cycles of the program with custom instructions, and the second factor is the clock period of the new processor.
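One formulation consistent with the description of the cost function and the two constraints can be sketched as follows. All symbols other than the cycle count of the original program and AREA are assumptions introduced for illustration and may differ from the original notation: $x_{ij} \in \{0,1\}$ selects version $j$ of instruction $i$, $k_i$ is the number of versions of instruction $i$ (including the degenerate one), $c_{ij}$ is the number of cycles saved, $a_{ij}$ is the area overhead, $p_{ij}$ is the clock period required by that version, and $P_{orig}$ is the clock period of the base processor. The sketch also assumes that cycle savings of different instructions add up.

```latex
\begin{align*}
\text{minimize}\quad
  & f = \frac{\Bigl(\mathit{Cycles}_{orig} - \sum_{i=1}^{n}\sum_{j=1}^{k_i} c_{ij}\,x_{ij}\Bigr)
             \cdot \max_{i,j}\bigl(p_{ij}\,x_{ij}\bigr)}
            {\mathit{Cycles}_{orig}\cdot P_{orig}} \\
\text{subject to}\quad
  & \sum_{j=1}^{k_i} x_{ij} = 1 \qquad \text{for } i = 1,\dots,n, \\
  & \sum_{i=1}^{n}\sum_{j=1}^{k_i} a_{ij}\,x_{ij} \le \mathit{AREA}, \\
  & x_{ij} \in \{0,1\}.
\end{align*}
```

Under these assumptions, the degenerate version of an instruction has $c_{ij} = 0$, $a_{ij} = 0$ and $p_{ij} = P_{orig}$, so selecting nothing gives $f = 1$.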
Note that the above formulation is only an approximation, and makes certain assumptions about how the area overheads and clock period will behave when combinations of custom instructions are included. Specifically, we assume that the area overhead will behave in an additive manner (we can also consider additive behavior with a constant shared overhead), and that the clock period is governed by a max-function. Since this approximation may introduce some error, we find not just the best solution, but the best k ones, using a branch-and-bound algorithm, and evaluate all of them through logic synthesis.

The branch-and-bound algorithm for custom instruction combination selection works as follows. First, all custom instructions are sorted in descending order of the metric maxj (2). The order computed above is used for branching, i.e., we make branching decisions on instructions strictly in the above order. Each branching decision consists of choosing a specific version of the instruction under consideration. At each point visited in the branch-and-bound decision tree, we compute three values: (i) the current cost function f, (ii) a lower bound of the cost function, f_lb, which is the cost function obtained by including all custom instructions that have not been visited yet, while the maximum clock period remains as the current clock period, and (iii) the total area of already selected custom instructions, CA. We pop the next custom instruction from the sorted list, go through each version and compute the same three values again. If CA is greater than the maximum area constraint, we bound. If the lower bound f_lb is worse than the best k solutions, we bound; otherwise, we consider the next custom instruction in the sorted list for branching. If the current f is within the k best solutions seen thus far, the current solution is stored in the result array.
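The search can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: the function name `select_combinations`, the tuple layout `(cycles_saved, area, period)`, additive cycle savings, and the sort key (largest cycle saving over an instruction's versions) are all hypothetical choices. Unlike the procedure described above, the sketch records only complete assignments rather than intermediate nodes.

```python
import heapq

def select_combinations(instrs, cycles_orig, base_period, max_area, k=3):
    """Return up to k best (cost, choice) pairs, lowest cost first.
    instrs[i] is a list of versions (cycles_saved, area, period), including
    the degenerate version (0, 0.0, base_period); choice[i] is a version index."""
    # Branching order: assumed metric, largest obtainable cycle saving first.
    order = sorted(range(len(instrs)),
                   key=lambda i: max(v[0] for v in instrs[i]), reverse=True)
    # best_rest[d]: most cycles that instructions order[d:] could still save.
    best_rest = [0] * (len(order) + 1)
    for d in range(len(order) - 1, -1, -1):
        best_rest[d] = best_rest[d + 1] + max(v[0] for v in instrs[order[d]])

    def cost(saved, period):
        return (cycles_orig - saved) * period / (cycles_orig * base_period)

    best = []  # heap of (-cost, selection); best[0] is the current k-th best

    def visit(depth, saved, area, period, choice):
        if area > max_area:
            return  # bound: area constraint violated
        lower_bound = cost(saved + best_rest[depth], period)
        if len(best) == k and lower_bound >= -best[0][0]:
            return  # bound: this subtree cannot improve on the k-th best
        if depth == len(order):
            entry = (-cost(saved, period), sorted(choice.items()))
            if len(best) < k:
                heapq.heappush(best, entry)
            else:
                heapq.heappushpop(best, entry)
            return
        i = order[depth]
        for j, (c, a, p) in enumerate(instrs[i]):
            choice[i] = j
            visit(depth + 1, saved + c, area + a, max(period, p), choice)
            del choice[i]

    visit(0, 0, 0.0, base_period, {})
    return [(-neg, dict(sel)) for neg, sel in sorted(best, reverse=True)]

# Two hypothetical instructions; version 0 of each is "not selected".
candidates = [
    [(0, 0.0, 1.0), (300, 6.0, 1.0), (400, 8.0, 1.2)],
    [(0, 0.0, 1.0), (200, 5.0, 1.0)],
]
for f, choice in select_combinations(candidates, 1000, 1.0, 10.0, k=2):
    print(round(f, 3), choice)
```

The lower bound is valid because adding instructions can only keep or lengthen the clock period, while the cycle count can drop by at most the best remaining savings. Returning the k best combinations rather than one leaves room for logic synthesis to correct errors in the additive-area and max-period approximations.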
The above procedure can be easily extended as follows to the case when candidate custom instructions are generated from overlapping templates. Suppose that, at a given point in the decision tree, a custom instruction (say, instruction i) is chosen. We find all the other custom instructions that have overlap with instruction i, and force the procedure to avoid choosing them. If we backtrack to a different part of the decision tree and reverse the decision to include instruction i, these constraints are removed.
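The constraint handling for overlapping templates can be illustrated in isolation. The sketch below is an assumption-laden simplification: it ignores versions, cost, and area, and only enumerates which sets of instructions may be chosen together. The name `feasible_selections` and the `overlaps` mapping are hypothetical.

```python
def feasible_selections(n, overlaps):
    """Yield every set of instruction indices that contains no overlapping pair.
    overlaps maps an index to the set of indices whose templates overlap it."""
    blocked = [0] * n  # number of currently chosen instructions that forbid i
    chosen = []

    def visit(i):
        if i == n:
            yield frozenset(chosen)
            return
        yield from visit(i + 1)          # branch 1: leave instruction i out
        if blocked[i] == 0:              # branch 2: include i if not forbidden
            chosen.append(i)
            for o in overlaps.get(i, ()):
                blocked[o] += 1          # forbid everything overlapping i
            yield from visit(i + 1)
            for o in overlaps.get(i, ()):
                blocked[o] -= 1          # backtracking lifts the constraints
            chosen.pop()

    yield from visit(0)

# Instructions 0 and 1 overlap; instruction 2 is independent of both.
for selection in feasible_selections(3, {0: {1}, 1: {0}}):
    print(sorted(selection))
```

A counter per instruction is used instead of a flag so that an instruction blocked by two chosen neighbors stays blocked until both decisions are reversed.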
V. EXPERIMENTAL RESULTS
We have implemented the flow described in Section IV by integrating several commercial and public-domain tools with our custom tools. Our tool takes a C program as input and outputs custom instructions and the modified C program. The data between commercial tools and our program are exchanged through files and scripts. The GNU-based compiler, simulator, and profiler tools provided by Tensilica are used to simulate the program and gather information about execution cycles (steps 2, 12, and 18 in Fig. 3). We use the Aristotle analysis system [32] to generate the program dependence graphs (step 1 in Fig. 3). The program dependence graphs are generated at the source code level. Hence, it is easy to back-annotate to the original C program.
After a promising new instruction or instruction combination is identified, our tool automatically outputs it in TIE format, and invokes Tensilica's TIE compiler to transform the instruction specification to RTL Verilog code (steps 7 and 15 in Fig. 3). We then use Synopsys Design Compiler [33] to synthesize the RTL circuit and map it to NEC's commercial 0.18µm technology library [37] (steps 9 and 16 in Fig. 3). The area and clock period information extracted from the synthesized, mapped netlists is used to drive the selection of the final instruction combination that is used to augment the processor.
We evaluated the proposed techniques using six example benchmarks. Byteswap is a function to swap the order of bytes in a word. It is mostly used for little-endian to big-endian conversion and vice versa. Add4 adds the values of the four bytes in one word and returns the sum. RGBtoCMYK is a color conversion program. Alphablend blends two 24-bit pixels. Popcount implements the population count function, which counts the number of 1's in a word. Rand is a function for ISAAC (indirection, shift, accumulate, add, and count), which is used as a fast cryptographic random number generator. Our experiments are run on a 440 MHz SUN Ultra10 workstation with 1 GB main memory. The area constraint is set to 10% of the original processor's total area in all experiments.
The time to completely generate and select custom instructions from original C programs (steps 1 to 18 in Fig. 3) varies from less than an hour to over six hours, depending on the number of iterations. Most of the time in the design flow is spent in synthesis (Design Compiler [33]: steps 9 and 16 in Fig. 3), simulation (xtrun [3]) and profiling (xt-gprof [3]: steps 2, 12, and 18 in Fig. 3).

[...] benchmark programs, running on a base processor (without any TIE extensions), and on the customized processors generated by our tool. We also report the area overheads incurred due to the addition of extra hardware to the processor. Note that the only difference between the two processor versions used for each program is the presence of custom instructions; all other processor parameters are kept unchanged. The results in Table I [...]

The results indicate that processors customized using instructions automatically generated by our tool can achieve a performance improvement of up to 5.4X (average of 3.4X) over the base processor cores. Energy consumption is reduced by up to 4.5X (average of 3.2X), and energy-delay product is reduced by up to 24.2X (average of 12.6X), while the average area increase is only 1.8%.
VI. CONCLUSIONS
Current design flows based on extensible processors require designers to manually identify and design custom instructions to accelerate parts of the application program. In this work, we have developed an automatic flow to generate custom instructions or instruction combinations that maximize the performance improvement for a given program, under constraints on the overhead due to the additional hardware. We have implemented this flow using a combination of commercial and public-domain tools, and our own in-house tools. Our experiments thus far have demonstrated promising results, indicating that automatic generation of custom instructions can result in large improvements in performance, energy, and energy-delay product, while significantly reducing design turnaround time.
REFERENCES

