Performance optimization is the primary design goal in most digital signal processing (DSP) and numerically intensive applications.
Introduction
The growing demand for high-speed integrated circuits has created an increasing need for synthesis tools targeted toward performance optimization. The increasing complexity of application-specific integrated circuits (ASICs), in general, indicates that performance-oriented high-level synthesis tools will become especially important. Different tasks in high-level synthesis have varying degrees of effect on performance. One potentially powerful transformation in the synthesis process is instruction set mapping. Instruction set mapping refers to replacing groups of primitive hardware units (instructions) with more complex and more powerful units, as shown in Figure 1. By replacing primitive instructions with more complex instructions which execute faster, overall performance can be improved.
The most fundamental problem in instruction set mapping is pattern, or template, matching. One of the most influential works on template matching was provided by Aho et al., who suggested an optimal tree matching approach [1, 2] for instruction set mapping in compilers. Chu and Rabaey used a simulated annealing based algorithm for optimizing area and clock speed during chaining and module selection [7]. This paper proposes a novel approach to template matching in high-level synthesis which does not require the harsh trade-offs between speed and quality demanded by previous approaches and, further, has the capability to handle general cyclic flowgraphs. The paper also discusses the more general problem of mapping hardware instruction sets using a novel optimization goal (performance).
Template Matching
The instruction set mapping algorithm can be divided into three major components, as shown in Figure 2: template matching, instruction selection, and clock selection. Template matching, the central problem in instruction set mapping, refers to the matching of templates representing complex instructions to a control/data flowgraph, or CDFG, consisting of only primitive instructions. Since the interpretation of the templates is application-specific, template generation is not discussed here. The fundamental difficulty in template matching lies in the fact that the number of template matches can be quite large and expensive to enumerate.
A few terms must be defined before proceeding. Nodes in the CDFG are referred to as graph nodes, while nodes in the templates are referred to as template nodes. A graph node x and a template node y in template t are said to have a node match between them if some group of nodes in the CDFG including x can be replaced by t such that y replaces x. A matching between some set of graph nodes and an entire template is referred to as a template match.
The output of the proposed template matching algorithm is a list of node matches (match list) for each graph node. The total number of node matches for a CDFG is polynomially bounded. In addition, many template matches share node matches, meaning that the node matches often require less space. The first step in the proposed algorithm is simply to create initial (potentially incorrect) match lists for all graph nodes by comparing only the types of the nodes being matched, and the types and number of parents and children. Invalid node matches are then systematically eliminated by finding node matches for which either a parent or child of the matched graph node does not have a proper node match to a parent or child of the template node. Figure 3 shows a simple example of the execution of the basic algorithm. The node match between graph node v and template node a is invalid because the child w of v does not match template node b, the child of a. This step is repeated until no other invalid node matches are found. Some invalid matches may remain in certain degenerate cases, but are eliminated by a simple post-processing step.
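As an illustration, the elimination step can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation described in this paper: the dictionary-based graph encoding and the names `build_parents` and `prune_match_lists` are hypothetical, the initial lists compare node types only, and a node match is kept only while every template neighbor of the template node is supported by some neighbor of the graph node. The post-processing for degenerate cases is not modeled.

```python
from collections import defaultdict

def build_parents(children):
    """Invert a node -> children mapping into a node -> parents mapping."""
    parents = defaultdict(list)
    for node, kids in children.items():
        for kid in kids:
            parents[kid].append(node)
    return parents

def prune_match_lists(g_type, g_children, t_type, t_children):
    """Return, for each graph node, the set of template nodes it may match.

    Assumption: every node appears as a key of g_type / t_type.
    """
    g_parents = build_parents(g_children)
    t_parents = build_parents(t_children)

    # Step 1: initial (potentially incorrect) match lists, by node type only.
    matches = {x: {y for y in t_type if t_type[y] == g_type[x]} for x in g_type}

    # Step 2: repeatedly drop node matches lacking support from neighbors.
    changed = True
    while changed:
        changed = False
        for x in g_type:
            for y in list(matches[x]):
                kids_ok = all(
                    any(c in matches[w] for w in g_children.get(x, []))
                    for c in t_children.get(y, [])
                )
                parents_ok = all(
                    any(p in matches[u] for u in g_parents.get(x, []))
                    for p in t_parents.get(y, [])
                )
                if not (kids_ok and parents_ok):
                    matches[x].discard(y)
                    changed = True
    return matches

# Hypothetical example in the spirit of Figure 3: template a -> b,
# graph fragments u -> x (a valid cover) and v -> w (w has the wrong type).
g_type = {"u": "add", "x": "mul", "v": "add", "w": "sub"}
g_children = {"u": ["x"], "v": ["w"]}
t_type = {"a": "add", "b": "mul"}
t_children = {"a": ["b"]}
result = prune_match_lists(g_type, g_children, t_type, t_children)
```

In this example the match between v and a is removed because the only child of v, w, has no match to b, while the matches (u, a) and (x, b) survive.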
The matches found by this simple approach are complete matches. The approach can be extended to include partial matches as well. These are cases in which portions of the template remain unmatched. By using these types of matches, the same instructions can be used more often, allowing for higher resource utilization. Partial matching can be differentiated into two classes. The simplest is partial matching by unused resources. While each child of a graph node may need to be matched, no child of a template node has to be matched (except as required to match the children of the graph node). This class of partial matching requires only minor changes to the basic algorithm.
The second class of partial matching is by identities, algebraic or otherwise, as shown in Figure 4. While this problem is difficult in general, it can be made tractable by applying simple heuristics. Unlike the case of complete matches, these heuristics cannot be guaranteed to find all cases of partial matching.
Figure 4: Partial Matching by Identities (panel labels: Template, CDFG, Cover)
The key to the algorithm is the concept of bypassability. A template node is said to be bypassable on some input if its output value can be set equal to this input by setting the other inputs to constants without inducing side effects. More specifically, bypassing a template node y should not affect any outputs of the template which are descendants of y. By determining bypassability during pre-processing, node matches can be verified independently during the main portion of the algorithm without considering the entire template. Also, during pre-processing, at most one bypassable input should be selected for each template node. This simplifies many steps in the matching process. In addition to pre-processing, post-processing is required to ensure that nodes are not both matched and bypassed.
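The pre-processing step that selects a bypassable input can be illustrated with a small Python sketch. Everything here is an assumption made for illustration: the `IDENTITY` table, the two-operand template encoding, and the name `select_bypass_inputs` are hypothetical, and the side-effect condition is approximated conservatively by requiring that the operand forced to a constant be a template primary input that feeds no other template node.

```python
# Hypothetical identity table: for each operation, the input positions that
# can be passed through and the constant the other input must be set to.
IDENTITY = {
    "add": {0: 0, 1: 0},  # x + 0 = x and 0 + x = x
    "mul": {0: 1, 1: 1},  # x * 1 = x and 1 * x = x
    "sub": {0: 0},        # x - 0 = x (only the first input passes through)
}

def select_bypass_inputs(template):
    """Pick at most one bypassable input position per template node.

    template maps node -> (op, [operand0, operand1]); operand names that
    are not template nodes are treated as primary inputs of the template.
    """
    # Count how many template nodes consume each signal.
    fanout = {}
    for _node, (_op, operands) in template.items():
        for src in operands:
            fanout[src] = fanout.get(src, 0) + 1

    bypass = {}
    for node, (op, operands) in template.items():
        for pos in sorted(IDENTITY.get(op, {})):
            other = operands[1 - pos]
            # Conservative side-effect check (an assumption of this sketch):
            # the operand tied to a constant must be an unshared primary input.
            if other not in template and fanout[other] == 1:
                bypass[node] = pos
                break  # at most one bypassable input per node
    return bypass

# Multiply-accumulate template: s = (x * y) + acc.
mac = {"m": ("mul", ["x", "y"]), "s": ("add", ["m", "acc"])}
mac_bypass = select_bypass_inputs(mac)

# Template with a shared input k: q = (x + k) * k.
shared = {"p": ("add", ["x", "k"]), "q": ("mul", ["p", "k"])}
shared_bypass = select_bypass_inputs(shared)
```

For the multiply-accumulate template, the adder is bypassable on its first input (setting acc to 0), so the template can also cover a lone multiplication. In the second template the multiplier is not bypassable, since its only candidate constant input k is shared with the adder.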
The entire template matching algorithm can be implemented as a polynomial time algorithm which has been found to be reasonably efficient in practice.
Instruction Selection
Given that the node matches for each node have been found by the template matching algorithm, template matches can be constructed to replace groups of primitive instructions (nodes in the CDFG) with more complex instructions (represented by the templates). These template matches must be selected so that the overall critical path length is optimized.
Although the critical path is the optimization target, optimizing nodes in critical paths only is ill-advised since the critical paths change as nodes are replaced by templates. A common means by which to overcome this difficulty is to use the concept of the epsilon-critical network: the set of nodes lying in paths having lengths within some empirically derived constant ε of the critical path length. These nodes presumably are those which are likely to become critical as well as those that already are.
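One way to compute an epsilon-critical network is sketched below in Python. This is an illustrative sketch, not the procedure used in this work: it assumes an acyclic graph (a cyclic flowgraph would first need its loop-carried edges removed), interprets the constant as a fraction of the critical path length, and uses the hypothetical name `epsilon_critical`. A node belongs to the network if the longest path passing through it is within that fraction of the critical path length.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def epsilon_critical(delay, succs, eps=0.10):
    """Return (critical path length, set of epsilon-critical nodes).

    delay maps node -> execution delay; succs maps node -> successor list.
    """
    preds = {n: [] for n in delay}
    for n, outs in succs.items():
        for m in outs:
            preds[m].append(n)

    # static_order() yields every node after all of its predecessors.
    order = list(TopologicalSorter(preds).static_order())

    # head[n]: longest path ending at n (inclusive of n's own delay).
    head = {}
    for n in order:
        head[n] = delay[n] + max((head[p] for p in preds[n]), default=0)

    # tail[n]: longest path starting at n (inclusive of n's own delay).
    tail = {}
    for n in reversed(order):
        tail[n] = delay[n] + max((tail[s] for s in succs.get(n, [])), default=0)

    critical = max(head.values())
    # Longest path through n; n's delay is counted in both head and tail.
    through = {n: head[n] + tail[n] - delay[n] for n in delay}
    network = {n for n in delay if through[n] >= (1 - eps) * critical}
    return critical, network

# Hypothetical example: paths a->b (20), c->b (19), and e->f (10).
delay = {"a": 10, "b": 10, "c": 9, "e": 5, "f": 5}
succs = {"a": ["b"], "c": ["b"], "e": ["f"]}
critical, network = epsilon_critical(delay, succs, eps=0.10)
```

Here node c is not on the critical path, but the path through it is only 5% shorter, so it is included in the network; nodes e and f are excluded.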
For each graph node in the epsilon-critical network (ε ≈ 10%), the proposed algorithms construct a single template match for each of the node matches of the graph node. An optimal match is then selected from among these and is placed in the CDFG. The process is then repeated until the critical path length can no longer be improved.
A single node match does not necessarily uniquely determine the template match to be constructed, particularly when partial matching is considered. The algorithm applies heuristics to select node matches from among the possible choices. The criteria for inclusion in the template match include maximizing the number of epsilon-critical nodes in the template match as well as maximizing the improvement (locally) in the total execution time for nodes in the CDFG. Once constructed, the template matches are compared by a heuristic cost function. This function favors template matches which locally improve the execution delays of the nodes and which cover graph nodes with few node matches (since these graph nodes are less likely to be covered by other template matches).
This algorithm generates a complete graph cover which optimizes the critical path. However, before the delays of the nodes and templates can be known, a clock period must be selected as is discussed in the next section.
Clock Selection
Selecting the proper clock frequency can have a dramatic effect on performance. The proposed algorithm for clock selection iteratively executes the instruction selection algorithm for different clock frequencies, using the best result as the final solution. This iterative approach is necessary since the optimal clock frequency depends on how the CDFG is finally covered and vice versa. The key to the approach is to try as few clock frequencies as possible so as to minimize the running time. The minimum clock period considered is the minimum latency of any operator (since smaller clock periods would generally be impractical even if the delay were theoretically less). The maximum clock period and the resolution between clock periods are assigned values appropriate to the technology.
The number of clock periods to be considered can be reduced significantly by the following observations.
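The dependence of delay on the clock period can be illustrated with a short Python sketch. It makes simplifying assumptions that are not part of the proposed algorithm: the cover is held fixed instead of re-running instruction selection for each period, the critical path is a simple chain of operators, each operator occupies a whole number of cycles, and all times are integers in an arbitrary unit. The names `path_time`, `candidate_periods`, and `best_period` are hypothetical.

```python
import math

def path_time(latencies, period):
    """Total time of a chain when each operator is rounded up to whole cycles."""
    return sum(math.ceil(lat / period) for lat in latencies) * period

def candidate_periods(latencies, max_period, step):
    """Periods from the minimum operator latency up to max_period."""
    periods = []
    t = min(latencies)
    while t <= max_period:
        periods.append(t)
        t += step
    return periods

def best_period(latencies, max_period, step):
    """Period giving the shortest chain time (ties go to the smaller period)."""
    return min(
        candidate_periods(latencies, max_period, step),
        key=lambda t: (path_time(latencies, t), t),
    )

# Hypothetical chain of three operators with latencies 40, 40, and 25.
latencies = [40, 40, 25]
times = {t: path_time(latencies, t) for t in candidate_periods(latencies, 40, 5)}
chosen = best_period(latencies, 40, 5)
```

For this chain the total time is 125 at a period of 25 but only 120 at a period of 40, because the shorter period wastes part of a cycle on each of the two slow operators. This non-monotonic behavior is why several periods must be evaluated.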
1. A clock period T1 which is a multiple of another clock period T2 cannot yield a shorter critical path than T2.
2. If all operators and templates have a shorter delay for clock period T1 than for T2, then T1 will yield a shorter critical path than T2.
These observations are used to prune obviously inferior clock periods. In addition, the algorithm uses min-bound estimation techniques to quickly eliminate clock periods which could never have a shorter delay than the delay already computed for some other clock period.
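The two pruning observations can be sketched in Python as follows. This is a hedged illustration under stated assumptions: periods and latencies are integers, the delay of an operator at a given period is its latency rounded up to a whole number of cycles, and the names `op_time` and `prune_periods` are hypothetical. The min-bound estimation step is not modeled.

```python
import math

def op_time(latency, period):
    """Delay of one operator when rounded up to whole clock cycles."""
    return math.ceil(latency / period) * period

def prune_periods(periods, latencies):
    """Drop periods that the two observations show to be inferior."""
    periods = sorted(set(periods))
    keep = []
    for t1 in periods:
        # Observation 1: a multiple of another candidate cannot beat it.
        multiple = any(t2 < t1 and t1 % t2 == 0 for t2 in periods)
        # Observation 2: some other candidate is strictly faster
        # for every operator.
        dominated = any(
            t2 != t1
            and all(op_time(lat, t2) < op_time(lat, t1) for lat in latencies)
            for t2 in periods
        )
        if not (multiple or dominated):
            keep.append(t1)
    return keep

# Hypothetical candidates and operator latencies.
remaining = prune_periods([25, 30, 35, 40, 50], [40, 40, 25])
```

In this example the period 50 is removed as a multiple of 25, and the periods 30 and 35 are removed because 25 gives a shorter delay for every operator. Only 25 and 40 remain to be evaluated by instruction selection.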
Experimental Results
These algorithms have been implemented as part of the module selection facilities of the Hyper high-level synthesis environment [6]. The benchmark set used for testing and evaluation represents a wide variety of CDFG structures (regular, irregular, recursive, and non-recursive). The results were generated using a library of 45 templates representing groups of chained units (some of which are commutative permutations of others). Table 1 shows the throughput improvement of the sample benchmarks. The original clock period is selected to be the minimum clock period for which all instructions require one clock cycle to execute. For the designs shown, the throughput doubled on average (a 27% increase from clock selection and a 57% increase from instruction selection, which compound as 1.27 × 1.57 ≈ 1.99). Table 2 shows the impact on area for these benchmarks. The active area estimates represent only arithmetic/logic units, registers, multiplexers, and I/O. While the active 
Conclusions
A methodology and a set of algorithms for performance optimization using instruction set mapping have been presented. The proposed algorithms incorporate a new template matching approach, as well as methods for partial matching and clock selection. Experimental results show significant improvements in performance without unreasonable area penalties. The ideas presented provide an ideal starting point for the investigation of many template matching problems in CAD and compiler domains.
