Introduction
The motivation for this work stems from the need for incorporating design reuse into architectural and high-level synthesis. Since the output of high-level synthesis typically consists of a datapath netlist of generic RTL components and a state sequencing table, there is the important back-end task of realizing the generic datapath components using existing libraries. While module generators and logic synthesis tools can be used to map RTL components into standard cells or layout geometries, they cannot effectively reuse the data book libraries of higher-level components commonly used in design practice. In this paper, we present High-Level Library Mapping (HLLM), a technique for implementing generic RT-level datapath components with technology-specific RTL library components with the goal of facilitating design reuse. Our approach addresses the criticism of designers who feel that architectural and high-level synthesis tools cannot effectively reuse datapath components present in existing RTL data books, as well as complex handcrafted datapath components that are often manually reused in architectural designs.
Design reuse plays an important role in the complete design process: system design typically goes through various phases, starting with prototypes implemented in FPGAs, down to a final design implementation composed of custom or standard library parts. Furthermore, existing outputs of high-level design often need to be migrated to new libraries and/or technologies. In such design scenarios, techniques for high-level library mapping are required to effectively support design reuse.
Design reuse is actively practiced at the logic level in current design methodologies: logic-level designs can be retargeted to technology-specific gates and flip-flops using a variety of traditional technology mapping techniques. However, as architectural and high-level synthesis tools mature and move the design process to higher levels of abstraction, there is a need to upgrade design reuse techniques to these higher levels of abstraction. Register-transfer (RT) level components such as ALUs and register files are good candidates for design reuse, since they are often hand optimized or created using module generators. The acceptance of high-level design (i.e., refining the behavior of a design into a netlist of RT-level components) has led to a need for design reuse techniques that can port RT-level designs across various vendor libraries.
We describe a novel approach to design reuse through high-level library mapping (HLLM) that maps an RT-level component from one library onto another RT-level component from a different library. High-level library mapping explores the effect of different RT libraries on a given high-level design output and permits the coupling of architectural and high-level synthesis with databook libraries and complex handcrafted RT-components. Specifically, we present a high-level library mapping technique for arithmetic-logic units (ALUs) derived from the output of high-level design. An ALU can perform a well-defined set of arithmetic, logic, and comparison functions. ALUs are typically optimized and placed in the library, thus becoming ideal candidates for design reuse. Our approach can reuse ALUs from a standard library, from datapath generators, or even handcrafted components.
In this paper, we define the high-level library mapping problem for ALUs, describe two algorithms to solve the problem, and provide experimental results to validate its effectiveness. Specifically, we demonstrate the versatility of this approach by applying HLLM to ALUs drawn from different libraries. We also compare the HLLM approach versus a traditional logic synthesis approach and demonstrate the advantage of HLLM for complex datapath components. This paper is organized as follows. Section 2 describes related work. Section 3 defines high-level library mapping for ALUs based on their functional behavior. Section 4 discusses the overall approach to the mapping problem. Section 5 presents some algorithms that map a source ALU onto a target ALU. Section 6 demonstrates the comprehensiveness and quality of designs produced by our approach. Section 7 concludes with a summary and future work.
Related work
There has been a fair amount of related work in applying design reuse at different levels of design. In the context of high-level design and synthesis, reuse techniques can be divided into two categories. One school of thought attempts to directly incorporate components from a technology library into the high-level design phase. This approach combines two steps of the high-level design process (design synthesis and technology mapping), and provides a direct approach to the reuse of RT-level library components. Although this approach can yield good results for a specific library, it requires a lot of effort in tuning the synthesis and refinement tools to accommodate variances in RT-level technology components. Furthermore, changes in the RT-library may necessitate a complete rewrite of the core synthesis algorithms that implement HLS with the RT-level components. Work described in [2], [11], [15] and [18] falls into this category.
The other (more traditional) school of thought solves the problem using a two-phase approach. First, the design refinement phase maps the behavior into some intermediate (generic) level, and second, technology mapping realizes the final design by mapping this intermediate (generic) level design using cells from a technology library. We can classify the component mapping problem into four approaches, based on the levels of building blocks used to realize a generic component:
Logic-level Mapping. At the logic level, a component's functionality can be described using Boolean equations for the transformation of the inputs into outputs. These equations can then be mapped to logic-level technology-specific components such as gates, flip-flops and latches. For example, the ALU in Figure 1 can be described with Boolean equations for each output (O0, OCOUT and OZERO) in terms of the inputs I0, I1, ICIN and C. Each of these equations can be mapped to components from a logic-level technology library (e.g., NOR gates) using structural (tree-based), Boolean matching or rule-based mapping. At the logic level, we have well-characterized primitive cells and technology mapping provides good results for small and random logic designs. As soon as the complexity of the circuit grows, the run-time of logic-level tools becomes prohibitive. [6] presents an investigation of the relationship between logic-level and high-level synthesis and presents some basic tradeoffs. It is commonly known that designs produced by logic synthesis for regularly-structured datapath components are often of poor quality, indicating the need to apply mapping techniques at higher levels of abstraction. MILO [22] is one approach that combines logic-level mapping techniques with microarchitectural optimization to realize a netlist of RT-level components.
Functional Decomposition. At the RT-level, regular-structured datapath components can be mapped to MSI-level blocks from a technology library. Each component can be functionally and/or structurally decomposed into smaller building blocks based on well-defined techniques for building datapath components of larger sizes. For instance, an ALU can be implemented as separate AU and LU blocks that are MUXed at the output. Alternatively, an ALU can be built using replicated bit-slices of one-bit ALUs.
The choice of such construction schemes leads to a design space of alternative implementations, where the RT-level component is represented as a hierarchical tree of alternative decompositions using library primitives. The root of the tree represents the source component (i.e., the one to be mapped), while leaves of the tree consist of the MSI/SSI-level blocks from the technology library. Figure 2 shows a sample decomposition tree for an ALU. This ALU is realized by composing the leaf cell blocks (such as 4-bit adders, FAs, MUX2, gates) from a technology-specific component library. The DTAS system [10] follows this mapping approach. The functional decomposition approach is useful when the target component bit widths are much smaller than that of the source component.
High-Level Library Mapping. This is a source-to-target component mapping approach where the source and target components have overlapping functionality and are of approximately equal size and complexity. In this approach, the source component is implemented using the target component, with a minimal amount of glue logic to satisfy the design constraints.
Our work investigates HLLM for RT-level datapath components that have well-defined functional behavior. The source and target components are described as a set of well-characterized functions; these functional specifications are then used to drive the source-to-target mapping process. To the best of our knowledge, prior to this work, there has not been any work that investigates HLLM for RT-level components.
System-level mapping. At the system level, mapping can be performed between system-level components such as processors, memories and interface units. MICON [3] is one approach that tries to reuse off-the-shelf system-level parts such as processors, memories and peripherals to build a single-board computer system. The input to MICON is a set of system-level specifications that describe the functionality of the required computer in terms of the type of processor, amount and type of memory, etc., along with the design constraints (board size, cost, etc.). MICON generates a design (netlist of the above components) that satisfies the requirements given to the system. The work on HLLM described in this paper therefore complements existing mapping approaches at the logic and system levels, and bridges the gap between these two design levels.
Problem Definition
The high-level library mapping problem for ALUs is based on functional specifications of the source and target components, which are compared with respect to a "canonical" functional representation to derive an effective mapping result. When the functionality of the target component does not exactly match that of the source component, the target component may need to be padded with additional (glue) logic. Figure 3 illustrates this high-level mapping approach between a source and a target ALU.
We therefore define the high-level library mapping problem in terms of a Source component (S), a Target component (T) and a set of Mapping rules (R) that maps the source component onto the target component. In order to establish equivalence between the source and the target components, each component is described in terms of a set of RT-functions that are defined using a canonical representation of the component. A mapping rule in R describes an alternative for implementing a function in S using a function in T; each source function can potentially be implemented by different target functions. The task of high-level library mapping, then, reduces to selecting a set of rules, one for each function in source component S, that realizes the best mapping of S on T with respect to the cost function (e.g., area or delay).
In the rest of this section we discuss our assumptions, the canonical representation used for 
Assumptions
In this work, we make the following assumptions:
All data and arithmetic use the 2's complement representation.
A source component can perform only one function at a time. For example, a comparator can implement several RT-functions (e.g., EQ, NEQ, GT, LT, etc.), but only one function is performed at a time.
We restrict ourselves to arithmetic, logic and comparison functions. These functions are defined using a universal ALU that performs a set of canonical arithmetic, logic and comparison functions (as described in the next section).
Each of the target component's RT-functions should either be a canonical RT-function or a simple negation of a canonical RT-function.
The source (S) and the target (T) components have the same bit-widths.
These assumptions are made to keep the problem size tractable and to facilitate illustration of the high-level mapping approach. We believe that these assumptions can be relaxed once the basic HLLM approach is defined and well understood.
The universal ALU
In order to provide a reference model for HLLM, we define a universal ALU (U) that performs a canonical set of ALU functions: 5 arithmetic functions (ADD, SUB, RSUB, INC, DEC), 16 logic functions (all Boolean functions of 2 variables), and 6 comparison functions (EQ, NEQ, GT, GEQ, LT, LEQ). Using these canonical ALU functions, we can build any other ALU, including library-specific ALUs. These canonical ALU functions are described in more detail later in this section. Figure 4 shows an n-bit universal ALU. In this ALU, I0[n] and I1[n] are the primary inputs, O0[n] the primary output, ICIN the carry input, OCOUT and OVF the carry output and overflow signals, and CS[m] the control input.
Canonical ALU functions
Each canonical ALU function defines a functional mapping between the inputs and the outputs of the universal ALU. Note that an ALU function need not use all the ports of a universal ALU. Table 1 describes the canonical representation of the arithmetic functions, assuming that the data is in 2's complement form. The logic functions are described in Table 2. These functions use only primary inputs (I0[n], I1[n]) and the primary output (O0[n]). Table 3 describes the canonical representation of 6 comparison functions. These comparison functions use primary inputs (I0[n] and I1[n]) and the primary output O1. The universal ALU therefore has a total of 27 canonical ALU functions (5 arithmetic, 16 logic and 6 comparison).
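As a quick sanity check on the function count, the following minimal Python sketch enumerates the canonical function set. The variable and function names are illustrative and do not come from the tables; encoding each logic function as a 4-entry truth table is an assumption made here for convenience, not the representation used in Table 2.

```python
from itertools import product

# Canonical arithmetic and comparison functions, as named in the text.
ARITHMETIC = ["ADD", "SUB", "RSUB", "INC", "DEC"]
COMPARISON = ["EQ", "NEQ", "GT", "GEQ", "LT", "LEQ"]


def all_logic_functions():
    """Return every Boolean function of two variables.

    Each function is encoded (illustrative choice) as a tuple of four
    output bits for the input pairs (0,0), (0,1), (1,0), (1,1).
    """
    return list(product((0, 1), repeat=4))


LOGIC = all_logic_functions()
TOTAL = len(ARITHMETIC) + len(LOGIC) + len(COMPARISON)
print(len(ARITHMETIC), len(LOGIC), len(COMPARISON), TOTAL)
```

Enumerating truth tables makes it explicit why there are exactly 16 logic functions: two inputs give four input combinations, and each combination can independently produce 0 or 1.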
Representation of library components
[Figure: (a) a library component S with ports A[n], B[n], C[n], Cin, Cout and L; (b) its description using the canonical ADD function with ports I0, I1 and ICIN.]
Logic function representation
Each logic function of the ALU is described using a standard minterm representation of two primary inputs. Note that we have four minterms with two inputs A and B, namely $\overline{A}\,\overline{B}$, $\overline{A}B$, $A\overline{B}$ and $AB$.
A specific logic function selects a subset of these four minterms. As an example, the OR function is given by the following minterms: $\overline{A}B$, $A\overline{B}$ and $AB$. In other words, when one or more of these three minterms are active, the output of the OR function is 1. A set of logic functions is implemented by ANDing the minterms for each function with the corresponding control lines and feeding the output to an OR gate. As an example, Figure 6 shows an implementation for two logic functions OR and XOR.
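The minterm scheme can be sketched for a single bit as follows. This is an illustrative model only: the dictionary names, the use of a prime (') to denote a complemented input, and the one-control-line-per-function encoding are assumptions made for the sketch and are not taken from Figure 6.

```python
# Each minterm is active for exactly one (a, b) input combination.
# A prime (') marks a complemented input in this sketch.
MINTERMS = {"A'B'": (0, 0), "A'B": (0, 1), "AB'": (1, 0), "AB": (1, 1)}

# Minterm subsets for the two logic functions used in the text's example.
LOGIC_FUNCS = {
    "OR": {"A'B", "AB'", "AB"},
    "XOR": {"A'B", "AB'"},
}


def logic_unit(a, b, controls):
    """AND every minterm with its function's control line, then OR the results.

    controls maps a function name to 0 or 1 (hypothetical one-hot control).
    """
    out = 0
    for name, minterms in LOGIC_FUNCS.items():
        select = controls.get(name, 0)
        for m in minterms:
            active = int((a, b) == MINTERMS[m])
            out |= active & select
    return out


for a in (0, 1):
    for b in (0, 1):
        print(a, b, logic_unit(a, b, {"OR": 1}), logic_unit(a, b, {"XOR": 1}))
```

With exactly one control line asserted, the unit behaves as the selected function; sharing minterms between functions (here OR and XOR share two) is what a real implementation could exploit to reduce gate count.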
Mapping rule representation
A mapping rule describes how to implement one canonical function using another canonical function. Given a set of mapping rules, we can implement all the functions in the source component, including the ones that are not present in the target component. Let SF and TF be the source and the target canonical functions respectively. Let us define port names for the source and target canonical components (CS and CT) as shown in Figure 7. A mapping rule describes the mapping of CS ports onto CT ports such that SF is implemented using TF.
A rule is described in a tabular fashion similar to the library component representation. Table 4 lists sample mapping rules. Each rule describes the implementation of a source function using a target function and indicates the port mappings required to implement the mapping. Note that each source function can be implemented using several alternative target functions. For example, the source ADD function in Table 4 could be implemented using the target ADD function (rule "AA1"), or with the target SUB function (rule "AS1"). The input and output port entries indicate the connectivity and additional logic required to implement the mapping rule. For instance, the second rule "AS1" in Table 4 implements the source function ADD with the target function SUB by inverting the right input (SI1).
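The idea behind rule "AS1" can be checked numerically with a small sketch. The definition of the target SUB function below (I0 plus the bitwise complement of I1 plus the carry input, modulo 2^n) is an assumption, since the canonical arithmetic table is not reproduced in this text; under a different carry convention the rule would also need to adjust the carry input. The function names and the 4-bit width are likewise illustrative.

```python
N = 4                 # illustrative bit-width
MASK = 2 ** N - 1     # keeps values within n bits


def invert(x):
    """Bitwise complement restricted to n bits."""
    return ~x & MASK


def target_sub(ti0, ti1, ticin):
    # ASSUMPTION: canonical SUB computes TI0 + NOT(TI1) + TICIN (mod 2^n).
    return (ti0 + invert(ti1) + ticin) & MASK


def source_add_via_as1(si0, si1, sicin):
    # Rule "AS1": drive the right target input with the inverted SI1.
    return target_sub(si0, invert(si1), sicin)


mismatches = [
    (a, b, c)
    for a in range(2 ** N)
    for b in range(2 ** N)
    for c in (0, 1)
    if source_add_via_as1(a, b, c) != (a + b + c) & MASK
]
print("mismatches:", len(mismatches))
```

Under the stated assumption the two inversions cancel, so the only glue logic needed is the inverter on SI1.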
For each source logic function, there is a rule that implements the function from scratch without using any target function. To this rule, we add another entry in the table: a list of minterms corresponding to the source logic function. The primary output for a logic rule is given by the logic output (LO). Table 5 lists some sample logic rules. For example, the source AND logic function could be implemented from scratch (rule "AN") by adding the minterm $AB$. [13] contains an extensive list of rules for all the ALU functions.

| Rule name | Source function | Target function | TI0 | TI1 | TICIN | SO0 | SOCOUT | SOVF | SO1 | LV |
|---|---|---|---|---|---|---|---|---|---|---|
| ANAN | AND | AND | SI0 | SI1 | - | TO0 | - | - | - | - |
| AN | AND | - | - | - | - | TL0 | - | - | - | $AB$ |
| ANNA | AND | NAND | SI0 | SI1 | - | $\overline{\text{TO0}}$ | - | - | - | - |
| XO | XOR | - | - | - | - | TL0 | - | - | - | $\overline{A}B$, $A\overline{B}$ |

Table 5: Sample mapping rules for logic functions. TI0, TI1 and TICIN are input ports; SO0, SOCOUT, SOVF and SO1 are output ports.
The cost function
A good cost function captures the important characteristics of an efficient source-to-target component mapping. We use a cost function based on two criteria: (a) an area metric, represented by the gate-count of the hardware overhead, and (b) a delay metric, represented by the worst case delay or max-delay of the generated design.
Gate-count
Mapping S on T requires extra hardware that could arise due to:
Routing data from the inputs of S to the inputs of T and from the output of T to the output of S.
Mapping a function in S onto some other function on T.
Generating a function (for example, a logic function) of S from basic gates.
Mismatch between the canonical functions and the functions in S and T.
Mapping the control lines of S onto the control lines of T.
In our current formulation, we use the gate-count (GC) of the extra hardware as a measure of the hardware cost. Specifically, we use the number of equivalent 2-input gates as the cost function to guide our algorithm. Note that the actual cost of a design should also include the cost of the target component. However, since this cost function is used just to compare two designs, it does not matter if we exclude the component cost in our cost function, because this portion of the cost will be present in each design. Therefore, we use the gate-count of the hardware overhead as an area optimization criterion.
Max-delay
We use the worst-case delay of the design as a delay metric. The worst-case delay, max-delay (MD), for a design is given by the maximum delay through all paths of the design. It is an approximation of the delay of the resultant design.
As an example, consider the design shown in Figure 8, which shows a sample source component (S) mapped to a target component (T). Let $MD_t$ be the maximum delay through the component T. We can calculate MD for the design S in the following way:
1. Calculate the max-delay to the inputs of T ($MD_i$). It is given by the maximum of the delays through all the paths shown by thin lines in Figure 8.
2. Calculate the max-delay to the output of T ($MD_o$). It is given by the maximum of the delays through all the paths shown by thick lines in Figure 8.
3. Calculate the worst case delay for the logic circuit ($MD_l$). It is given by the maximum delay through all the paths shown by dashed lines in Figure 8.
4. The MD of the design, then, is given by the maximum of the two figures:
$$MD = \max(MD_i + MD_o + MD_t,\; MD_l + MD_o)$$
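The max-delay computation in step 4 can be written as a one-line function. The sketch below is illustrative; the function and parameter names are chosen here, and the delay values in the usage lines are made-up numbers, not measurements from any library.

```python
def max_delay(md_i, md_o, md_t, md_l):
    """Worst-case delay of the mapped design.

    md_i: max delay of the glue logic feeding the inputs of T
    md_o: max delay of the glue logic after the output of T
    md_t: max delay through the target component T itself
    md_l: max delay through the separate logic circuit
    """
    through_target = md_i + md_o + md_t
    through_logic = md_l + md_o
    return max(through_target, through_logic)


# Hypothetical delays: in the first case the path through T dominates,
# in the second the path through the logic circuit dominates.
print(max_delay(md_i=2, md_o=1, md_t=10, md_l=3))
print(max_delay(md_i=2, md_o=1, md_t=10, md_l=15))
```

The second call shows the case where the critical path bypasses T entirely, so the result does not depend on the delay of T.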
Note that unlike the previous area metric (gate-count) calculation, we cannot ignore the delay of the target component T. This is because MD for all the designs may not include the delay of T; the worst case path might pass through the logic unit.
Overall Approach
Figure 9 illustrates our overall approach to the mapping problem. The inputs to the system consist of the source component (S), the target component (T) and the mapping rule database (R). As mentioned before, both S and T are library components and they are described as source canonical and target canonical components using the representation discussed in the previous section. The mapping rule database (R) contains all the rules required to map one RT-function onto another RT-function. The rule database is also represented in a tabular fashion. In the first step, the mapping algorithm implements the source component using the target canonical component. In the second step, the target canonical function is mapped to the actual target component. The output of the system is an implementation of S on the target component T with some additional (glue) logic surrounding T. This work focuses primarily on the first mapping step and describes two mapping algorithms for it. The second step consists of simple tasks such as the matching of port names and is relatively trivial.
We illustrate each of the steps in the approach with a simple walk-through example. Let S be an arithmetic unit that can perform three functions: ADD, SUB and RSUB. Let T be another arithmetic unit that can perform two functions: ADD, SUB. These two components are shown in the accompanying figures. The mapping rule database (R) contains an extensive set of rules to map one RT-function onto another. From this database, we extract rules that map a source function onto a target function. Table 6 shows some interesting rules that have been extracted for our example.
In the next step, the mapping algorithm implements S onto a canonical ALU (C). This canonical ALU uses only those functions that are present in T. This mapping is achieved by finding a set of rules, one for each source function, such that the cost of extra hardware (i.e., gate count) is minimized. The set of selected rules provides the connectivity between the ports of S and C. Figure 12 shows one such solution in terms of the selected set of rules and the canonical implementation.
The final step involves mapping the canonical ALU (C) onto the target component (T). This is usually a simple process since we restrict T to be very close to C (see Section 3.1). This step connects the ports of C to the ports of T. Figure 13 shows the final implementation.
Note that generation of the final design requires solving many other subproblems such as bitwidth mapping, control mapping, secondary input and output mapping, port name mapping, etc. Control mapping refers to the task of mapping the control lines of the source component to the control lines of the target component and is achieved by finding the Boolean expression in terms of source control lines for each target control line. Again, it is important to note that the work described in this paper focuses on the algorithms for the functional mapping of the source to the target component, which is the heart of the mapping problem (the first mapping step in Figure 9).
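| Rule name | Source function | Target function | TI0 | TI1 | TICIN | SO0 | SOCOUT | SOVF | SO1 |
|---|---|---|---|---|---|---|---|---|---|
| AA1 | ADD | ADD | SI0 | SI1 | SICIN | TO0 | TOCOUT | TOVF | - |
| AS1 | ADD | SUB | SI0 | $\overline{\text{SI1}}$ | SICIN | TO0 | TOCOUT | TOVF | - |
| SA1 | SUB | ADD | SI0 | $\overline{\text{SI1}}$ | SICIN | TO0 | TOCOUT | TOVF | - |
| SS | SUB | SUB | SI0 | SI1 | SICIN | TO0 | TOCOUT | TOVF | - |
| RA1 | RSUB | ADD | SI0 | SI1 | SICIN | TO0 | TOCOUT | TOVF | - |
| RS | RSUB | SUB | SI1 | SI0 | SICIN | TO0 | TOCOUT | TOVF | - |

Table 6: Selected set of rules for mapping example. TI0, TI1 and TICIN are input ports; SO0, SOCOUT, SOVF and SO1 are output ports.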
Mapping Algorithms
The task of mapping a source component (S) onto target component (T) is accomplished by selecting a set of mapping rules, one for each function in source component S, that realizes the best mapping of S on T with respect to the cost function (e.g., area or delay). Recall that not only are there multiple mapping rules for each source function, but the selections of mapping rules for the various source functions are also interdependent. For example, consider the rules presented in Table 6. If we decide to use the rule "AS1" for mapping the source function ADD, the rule "SA1" for source function SUB would lead to an efficient implementation since it shares the inverted SI1 factor for the right primary input. Thus a strategy is required to select a mapping rule out of multiple alternatives for each source function. For some formulations, the order in which source functions are selected is also important. Our detailed technical report [13] contains all the algorithmic formulations we considered; in this paper we present two algorithms (greedy and dynamic programming) that perform this selection of mapping rules in an efficient manner. These algorithms take as input the function tables of T and S along with the selected rule set, and map a source component (S) onto the canonical component (C). This corresponds to the first (and major) mapping step in Figure 9. The algorithms generate as output the required mapping in terms of the set of rules and port connectivity.
The search space
Since many feasible mappings exist and since we use a constructive approach, we need to define the search space for our mapping problem. The search space is built by applying different sets of valid mapping rules for the mapping problem. Each of the algorithms mentioned in this section goes through the partial solutions and finally leads to a complete solution. We introduce the search space for the mapping example discussed in the last section. It is described by the tree shown in Figure 14. The leaves of this tree represent a complete solution whereas internal nodes represent partial solutions. A solution (partial as well as complete) is represented by storing the list of inputs Figure 14 represents the structure shown in Figure 15(b).
At the root of the search tree, we have a null partial solution. At each level, a function in f(S) is selected and all the mapping rules for this function are explored. An internal node represents the partial solution using the rules that are on the path from the root to this node. Figure 14 shows some of the partial solutions along with a few complete solutions. Some of the algorithms (Greedy) could be illustrated using this tree of the search space. Note that a trivial way of finding an optimal solution is the exhaustive method that generates all the leaf nodes and selects the best mapping. Of course, this is a very time-inefficient solution.
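The exhaustive method can be sketched on the walk-through example as follows. Everything numeric here is hypothetical: the rule names follow Table 6, but the port mappings are simplified to the two data inputs (the entry for RA1 in particular is a guess), and the cost model (one gate per distinct inverted signal plus one gate per extra signal multiplexed onto a target port) is a toy stand-in for the gate-count metric, ignoring carry ports and control mapping.

```python
from itertools import product

# Hypothetical, simplified port mappings; "~" marks an inverted signal.
RULES = {
    "ADD": {"AA1": {"TI0": "SI0", "TI1": "SI1"},
            "AS1": {"TI0": "SI0", "TI1": "~SI1"}},
    "SUB": {"SA1": {"TI0": "SI0", "TI1": "~SI1"},
            "SS": {"TI0": "SI0", "TI1": "SI1"}},
    "RSUB": {"RA1": {"TI0": "~SI0", "TI1": "SI1"},
             "RS": {"TI0": "SI1", "TI1": "SI0"}},
}


def glue_cost(port_maps):
    """Toy gate count: shared inverters are counted once, and every extra
    distinct signal on a target port costs one multiplexer leg."""
    drivers = {}
    for ports in port_maps:
        for port, signal in ports.items():
            drivers.setdefault(port, set()).add(signal)
    inverters = {s for sigs in drivers.values() for s in sigs if s.startswith("~")}
    mux_legs = sum(len(sigs) - 1 for sigs in drivers.values())
    return len(inverters) + mux_legs


def exhaustive(rules):
    """Generate every leaf of the search tree and keep the cheapest one."""
    funcs = list(rules)
    leaves = [dict(zip(funcs, names))
              for names in product(*(rules[f] for f in funcs))]

    def cost(sel):
        return glue_cost([rules[f][n] for f, n in sel.items()])

    best = min(leaves, key=cost)
    return best, cost(best), len(leaves)


best, best_cost, n_leaves = exhaustive(RULES)
print(best, best_cost, n_leaves)
```

The number of leaves is the product of the rule counts per function, which is why this approach stops being practical as soon as the source component has many functions or the rule database is large.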
We now discuss the actual algorithms used to solve the mapping problem. Note that the algorithms discussed here are independent of the cost functions and that these algorithms could be applied with either of the cost functions discussed before. 
Greedy algorithm
The Greedy algorithm starts with a null partial solution. It first sorts all the source functions in increasing number of mapping rules. Next it selects one source function at a time and generates a set of new partial solutions, one for each rule for the selected function. Each of these new partial solutions has a feasible mapping for all the source functions considered so far. Out of these partial solutions, the algorithm chooses the one with minimum cost. This process is repeated until it has mapped all the source functions onto some target function and we have a complete solution. Figure 16 shows an application of this algorithm. Note that at each step (each source function), we keep track of the single best solution out of all the partial solutions generated. The complexity of the algorithm is $O(nm)$, where n is the number of source functions and m is the average number of mapping rules per function.
Algorithm 5.1: Greedy algorithm
INPUT: f(S), f(T), r(ST).
OUTPUT: A set of rules, one for each function in f(S), with minimum cost.
1. f_s(S) = sorted f(S) in increasing number of mapping rules.
2. rule-set = ∅.
3. while ∃ an unmapped function ∈ f_s(S) loop
    3.1 F = first unmapped function ∈ f_s(S).
    3.2 r = rule that implements F with minimum additional cost.
    3.3 rule-set = rule-set + r.
   end loop
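A minimal executable sketch of the greedy selection is given below. It is not the implementation used in the experiments: the rule names follow Table 6, but the simplified port mappings (RA1 in particular) and the toy cost model (one gate per distinct inverter plus one per extra multiplexed signal on a target port) are assumptions made for illustration.

```python
# Hypothetical, simplified port mappings; "~" marks an inverted signal.
RULES = {
    "ADD": {"AA1": {"TI0": "SI0", "TI1": "SI1"},
            "AS1": {"TI0": "SI0", "TI1": "~SI1"}},
    "SUB": {"SA1": {"TI0": "SI0", "TI1": "~SI1"},
            "SS": {"TI0": "SI0", "TI1": "SI1"}},
    "RSUB": {"RA1": {"TI0": "~SI0", "TI1": "SI1"},
             "RS": {"TI0": "SI1", "TI1": "SI0"}},
}


def glue_cost(port_maps):
    """Toy gate count: distinct inverters plus extra multiplexer legs."""
    drivers = {}
    for ports in port_maps:
        for port, signal in ports.items():
            drivers.setdefault(port, set()).add(signal)
    inverters = {s for sigs in drivers.values() for s in sigs if s.startswith("~")}
    mux_legs = sum(len(sigs) - 1 for sigs in drivers.values())
    return len(inverters) + mux_legs


def greedy(rules):
    # Step 1: order source functions by increasing number of mapping rules.
    order = sorted(rules, key=lambda f: len(rules[f]))
    chosen = {}  # Step 2: empty rule set.
    for func in order:  # Step 3: map one function at a time.
        partial = [rules[g][chosen[g]] for g in chosen]
        # Step 3.2: the rule with minimum cost given the rules chosen so far
        # (ties go to the first rule listed).
        chosen[func] = min(
            rules[func],
            key=lambda name: glue_cost(partial + [rules[func][name]]),
        )
    total = glue_cost([rules[g][chosen[g]] for g in chosen])
    return chosen, total


mapping, total = greedy(RULES)
print(mapping, total)
```

Because each decision is final, the result depends on the order in which functions are visited; on this toy data the greedy choice happens to reach the same cost as exhaustive enumeration, which is not guaranteed in general.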
Dynamic programming algorithm
The Dynamic Programming (DP) technique performs a better global search as compared to greedy algorithms [7]. Instead of making decisions based only on the mapping of the current function, it keeps track of the best solutions for all subsets of functions. Typically, it works in a bottom-up fashion. The algorithm starts with the mapping for subsets of single functions, followed by the mapping for subsets of two functions and so on until it has the mapping for the entire set of source functions. Each of the partial mappings for a subset of functions is stored in a table and subsequently used for building the partial solutions for subsets of bigger size.
Algorithm 5.2 is a dynamic programming algorithm that keeps track of the k (bucket size) best partial solutions. Note that the number of partial solutions increases exponentially with the size of the function set. We restrict ourselves to a limited number (k) of partial solutions for each subset. Of course, by doing so we may sacrifice optimality with the advantage of requiring bounded storage space.
2. for i = 0 to (col-row-1) do
    2.1 min_k = k-best of combine(Table(row, row+i), Table(row+i+1, col)).
end for
As mentioned before, the dynamic programming algorithm builds up a table of partial solutions in a bottom-up fashion. This table is indexed by the number of source functions (n) for both row and column. An entry Table(i, j) represents a set of partial solutions for source function i to source function j. Figure 17 shows the table with the bucket size of 2 for our walk-through example. Note that the table is upper-triangular.
This algorithm iteratively fills up the table (Table) with partial solutions. It starts by filling diagonal entries by generating the mapping for single functions. Each diagonal entry represents a set of partial solutions that map exactly one source function. Then it fills up the entries corresponding to two-function sets and so on. Function CreateEntry creates the list of partial solutions for a set of source functions using the partial solutions generated so far. It generates the entry Table(row, col) using the previously generated entries in the Table. The final solution is given by the entry in the top row and right-most column. For our example shown in Figure 17, Table(0, 2) lists a few solutions that represent the complete mapping. The complexity of the algorithm is $O(n^3 k^2)$, where n and k are the number of source functions and the bucket size respectively.
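The table-filling scheme can be sketched as follows. This is an illustrative reading of the description above, not the original implementation: the simplified port mappings (RA1 in particular), the toy cost model, the tie-breaking order, and the names `k_best` and `dp_map` are all assumptions.

```python
# Hypothetical, simplified port mappings; "~" marks an inverted signal.
RULES = {
    "ADD": {"AA1": {"TI0": "SI0", "TI1": "SI1"},
            "AS1": {"TI0": "SI0", "TI1": "~SI1"}},
    "SUB": {"SA1": {"TI0": "SI0", "TI1": "~SI1"},
            "SS": {"TI0": "SI0", "TI1": "SI1"}},
    "RSUB": {"RA1": {"TI0": "~SI0", "TI1": "SI1"},
             "RS": {"TI0": "SI1", "TI1": "SI0"}},
}
FUNCS = list(RULES)


def glue_cost(port_maps):
    """Toy gate count: distinct inverters plus extra multiplexer legs."""
    drivers = {}
    for ports in port_maps:
        for port, signal in ports.items():
            drivers.setdefault(port, set()).add(signal)
    inverters = {s for sigs in drivers.values() for s in sigs if s.startswith("~")}
    mux_legs = sum(len(sigs) - 1 for sigs in drivers.values())
    return len(inverters) + mux_legs


def cost_of(solution):
    """A solution is a tuple of (function, rule name) pairs."""
    return glue_cost([RULES[f][name] for f, name in solution])


def k_best(candidates, k):
    """Keep the k cheapest distinct solutions (ties broken by name)."""
    return sorted(set(candidates), key=lambda s: (cost_of(s), s))[:k]


def dp_map(k=2):
    n = len(FUNCS)
    table = {}
    # Diagonal entries: partial solutions that map exactly one function.
    for i, f in enumerate(FUNCS):
        table[i, i] = k_best([((f, name),) for name in RULES[f]], k)
    # Entries for function ranges of growing size, bottom-up.
    for span in range(1, n):
        for row in range(n - span):
            col = row + span
            candidates = []
            for i in range(col - row):
                candidates += [left + right
                               for left in table[row, row + i]
                               for right in table[row + i + 1, col]]
            table[row, col] = k_best(candidates, k)
    best = table[0, n - 1][0]  # top row, right-most column
    return dict(best), cost_of(best), table


mapping, total, table = dp_map(k=2)
print(mapping, total)
```

Each entry combines at most k solutions from the left range with k from the right range, which is where the k^2 factor in the stated complexity comes from; a larger bucket size trades storage and run-time for a better chance of keeping the partial solution that leads to the optimum.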
Experimental Results
We performed two sets of experiments to validate our approach. The rst set of experiments tests the comprehensiveness of our approach in terms of mapping arithmetic components between di erent libraries. The second set of experiments compares the metrics of the designs generated by our approach against the ones generated by the traditional logic synthesis approach. First, we demonstrate the comprehensiveness of our approach, followed by the comparative study of design quality. In this section, we present experimental results that establish the generality of our approach across di erent source and target libraries using algorithms discussed in the last section. We considered a variety of ALUs, both source and target, in our experiments. These ALUs vary in terms of the library they come from, number and the set of functions they perform. We present the mapping results for ALUs from four libraries : GENUS 12], CASCADE 5], VDP300 21] and AMD 1]. GENUS contains an ALU generator parametrized by the set of functions, bit-width, etc. The ALUs in CASCADE and VDP300 have a xed set of functions. The AM2901 ALU is a commonly used 8-function ALU.
We covered a wide range of ALUs in terms of the number of functions they perform. Starting from a simple uni-function adder, the most complex ALU had 32 functions. Note that an ALU, as we have de ned, can have only 27 distinct canonical functions. Thus, some ALUs in our experiments have variants of the canonical functions repeated in the function set. For example, one of the ALUs in our experiments has 9 ADD functions, each with di erent port con gurations. The ALUs in our experiments perform di erent sets of functions, covering di erent functional categories: arithmetic, logic and comparison. We also chose ALUs of di erent bit-widths.
Note that the examples in our experiments were restricted by the availability of design metrics and not by the limitations of our approach. For example, the delay for target components in a library were available only for speci c bit-widths such as 8, 16, 48. Thus, all our experiments are for ALUs with one of the above bit-widths. Recall that our approach is independent of bit-widths and that it will require same amount of computation for all bit-widths. Similarly, we had to restrict ourselves to only those libraries that provide design metrics. Even though we considered other libraries such as XBLOX 23], LSI 16], Toshiba gate array 20] etc., we could not run our algorithms due to lack of metric data (gate counts, performance) for these libraries. We also had to restrict our mapping examples to ALUs with xed sets of functions, since these are the only ALUs supported by some of these libraries.
The table in Figure 18 summarizes the metrics for the designs generated by our mapping approach for seven examples. Detailed designs for each of these examples are available in [13]. The table describes the source and target components in terms of the library name and the set of functions they perform, the design metrics (gate-count and max-delay) and the run-time for generating the design. We present metrics for the designs produced by two algorithms: the greedy algorithm and the dynamic programming algorithm with a bucket size of k=2. These algorithms were run with two cost functions as optimization criteria: gate-count (GC) and max-delay (MD). Recall that gate-count is an approximation of the extra hardware required to implement the source ALU, whereas max-delay represents the maximum delay through all the ports of the generated design. The run-time column shows the execution time (user + system) for the given example on a Sparc 2 (sun-670-mp for examples 6 and 7).
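The two cost functions described above can be illustrated with a minimal Python sketch. It is not the implementation used in our experiments: the `CandidateDesign` record, its field names, the helper `pick_best` and all numeric values are hypothetical and serve only to show how gate-count sums the extra hardware while max-delay takes the worst delay over all ports.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class CandidateDesign:
    """Hypothetical record for one generated mapping (illustrative only)."""
    name: str
    glue_gate_counts: Dict[str, int]   # extra hardware per glue block, in generic gates
    port_delays_ns: Dict[str, float]   # delay through each port of the design


def gate_count(design: CandidateDesign) -> int:
    """GC cost: total extra hardware needed to implement the source ALU."""
    return sum(design.glue_gate_counts.values())


def max_delay(design: CandidateDesign) -> float:
    """MD cost: maximum delay over all ports of the generated design."""
    return max(design.port_delays_ns.values())


def pick_best(candidates: List[CandidateDesign],
              cost: Callable[[CandidateDesign], float]) -> CandidateDesign:
    """Return the candidate that minimizes the chosen cost function."""
    return min(candidates, key=cost)


# Made-up candidates: A uses less glue logic, B has a shorter worst-case path.
design_a = CandidateDesign("A", {"mux": 40, "inv": 16}, {"out": 30.0, "cout": 22.0})
design_b = CandidateDesign("B", {"mux": 64, "xor": 16}, {"out": 25.0, "cout": 24.0})

best_for_area = pick_best([design_a, design_b], gate_count)
best_for_delay = pick_best([design_a, design_b], max_delay)
```

In this made-up case the two criteria select different designs, which is why each example is run once per cost function.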
As mentioned before, the seven examples in our experiments are from different libraries and are of varying complexity. Example 1 maps a GENUS ALU with 2 arithmetic and 2 comparison functions onto a VDP ADD-SUB component. Example 2 implements a GENUS ALU with all the 
Analysis of results
From the table in Figure 18, we observe that the greedy algorithm provides a quick way of mapping a source ALU onto a target ALU. The run-time for the greedy algorithm ranges from 4.1 to 16.3 seconds. In some cases, dynamic programming produces better designs than the greedy approach, at the cost of longer run-times. For example, consider the results for Example 3 in Figure 18: the designs generated by dynamic programming with delay optimization have a lower delay (68.10) than the designs generated by the greedy algorithm (72.10).
In summary, our approach is versatile in the sense that it can map ALUs from one library onto another. We have demonstrated this versatility by applying it to ALUs from a wide variety of libraries. Our approach can also handle ALUs of diverse complexity in terms of bit-width, number of functions and set of functions. Our algorithms generate designs of high quality, and often optimal designs. Finally, we have implemented two algorithms: the greedy algorithm can be used for quick results, whereas the dynamic programming algorithm generates better designs at the expense of longer run-times.
Goodness
In order to quantify the goodness of our HLLM approach, we performed several experiments comparing HLLM for arithmetic components against the traditional approach using a commercial logic synthesis system [19]. In the comparison flow, the left branch shows our HLLM approach, which maps the source component to high-level arithmetic library macros (drawn from the Synopsys DesignWare library [8]), together with additional glue logic. We label this path as approach I in our summary of results. The right branch shows the traditional logic synthesis approach, where each source arithmetic component is described by logic equations (generated by GENUS [12]) and then mapped onto gate-level cell library primitives using logic synthesis. We tried the Synopsys Design Compiler [19] with two levels of optimization scripts: low (results labeled II) and medium (results labeled III). Note that using the high optimization script was often not feasible, since it led to run-times on the order of weeks without termination. We present the gate-count, max-delay and run-time for each approach (I, II and III) and use these numbers to evaluate the goodness of our HLLM approach (I).
For each input source component, we perform two sets of experiments. In the first, we generate designs that are optimized for area (minimum gate-count). For this set of experiments, both the logic synthesis tool and our algorithm are configured to generate the best-area designs. In the other set of experiments, we configured our algorithm and the logic synthesis tool to optimize the worst-case delay.
Figure 20: Our approach versus logic synthesis: optimized for area
... various macros in the DesignWare library [8]. Columns 2 through 4 describe the source and target components. The sixth column in this figure reports area, delay and run-time for the designs generated by our approach (I). In this table, gate-count is measured in 2-input generic gates, delay is in nanoseconds and run-time is in minutes. Note that the run-time for the designs generated by our approach is the sum of two numbers: the first is the time taken by our algorithm to perform the mapping, and the second is the time taken by the logic synthesis tool to optimize the glue logic and map the macro onto the target technology. The last two columns in Figure 20 present metrics for the designs produced by the logic synthesis approach with low (II) and medium (III) optimization effort. These two columns also report the percentage difference between the design metrics for II and III as compared to I. Figures 21 and 22 graphically present the area and delay metrics from Figure 20. The three columns for each design in these figures represent the metrics for the designs generated by the three approaches I, II and III.
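The bookkeeping behind the reported run-times and percentage differences can be sketched as follows. This is a minimal illustration that assumes the percentage difference is taken relative to the metric of approach I; the function names and the numbers in the usage lines are hypothetical and are not taken from Figure 20.

```python
def hllm_total_runtime(mapping_minutes: float, glue_synthesis_minutes: float) -> float:
    """Run-time of approach I: mapping time plus the logic synthesis time
    spent on the glue logic and on mapping the macro to the target technology."""
    return mapping_minutes + glue_synthesis_minutes


def percent_difference(hllm_value: float, synthesis_value: float) -> float:
    """Percentage by which a logic synthesis metric (II or III) exceeds
    the corresponding HLLM metric (I). Assumed definition: relative to I."""
    if hllm_value <= 0:
        raise ValueError("baseline metric must be positive")
    return 100.0 * (synthesis_value - hllm_value) / hllm_value


# Illustrative numbers only (not measured data).
runtime_i = hllm_total_runtime(0.5, 2.0)       # 2.5 minutes in total
area_gap = percent_difference(1000.0, 5870.0)  # 487.0, i.e. 487% larger than I
```

Under this definition, a positive value means that the logic synthesis design is larger, slower or more expensive to generate than the HLLM design.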
Analysis of results
From Figures 20, 21 and 22, and from similar results obtained for delay-optimized designs, we observe that the designs from our approach outperform the designs produced by the logic synthesis approach. Logic synthesis designs with low optimization (II) are as much as 487% larger and 191% slower, and require 637% extra run-time. Even with medium optimization (III), designs are 313% larger and 191% slower, at the cost of an even larger difference in run-time (758%). Logic synthesis with the high optimization script ran for weeks on these examples, and many runs had to be terminated prior to completion. In order to demonstrate the effectiveness of design reuse through our approach, we also recorded the area and delay values of the macros used in the four examples shown in Figure 20. The ratios of the DesignWare macro area to the area of the complete design produced by our approach (I) are 0.69, 0.52, 0.49 and 0.55, respectively, for the four examples. The corresponding ratios for the delay are 0.79, 0.51, 0.51 and 0.63. The high values of these ratios demonstrate that we have been able to use these macros effectively without adding too much glue logic. Note that the amount of glue logic also depends on the degree of functional difference between the source and the target component. For example, a source component with many new logic functions may result in more glue logic.
We conclude this analysis with two comments. First, note that in this experiment we have compared metrics from netlists of generic gates. The effects of regularity are more pronounced when these designs are mapped onto layout, where designs from our approach would perform even better. Second, the reason we have been able to outperform the traditional logic synthesis approach is that logic synthesis works well for optimizing random or control logic, but is unable to exploit the regular structures inherent in datapath components. The logic equations for a moderately sized component (a 32-bit ALU in our example) are too big to be handled by the traditional logic synthesis approach. This indicates the need for our HLLM approach, which complements traditional logic synthesis. Thus, a combination of the two approaches, logic synthesis for control logic and our approach for datapath components, would be a good design strategy for coupling the outputs of architectural synthesis and high-level design with technology libraries and logic/layout synthesis.
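The macro-to-design ratios reported above can be turned into an estimate of the share that is not covered by the reused macro. The short sketch below uses the eight ratios quoted in the text. It assumes that the remainder (one minus the ratio) is attributable to glue logic, which is an interpretation on our part rather than a separately measured quantity; the helper name `remainder_share` is hypothetical.

```python
from typing import List

# Ratios of DesignWare macro metric to complete-design metric (approach I),
# as quoted in the text for the four examples of Figure 20.
area_ratios = [0.69, 0.52, 0.49, 0.55]
delay_ratios = [0.79, 0.51, 0.51, 0.63]


def remainder_share(ratios: List[float]) -> List[float]:
    """Share of each complete-design metric not accounted for by the macro,
    rounded to two decimals. Assumed here to correspond to glue logic."""
    return [round(1.0 - r, 2) for r in ratios]


glue_area_share = remainder_share(area_ratios)
glue_delay_share = remainder_share(delay_ratios)

for i, (a, d) in enumerate(zip(glue_area_share, glue_delay_share), start=1):
    print(f"example {i}: glue share of area ~ {a:.2f}, of delay ~ {d:.2f}")
```

Read this way, the reused macro accounts for roughly half or more of both area and delay in every example.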
Summary and Future Work
We presented a novel library mapping approach at the RT-level based on the functional specification of the source and target components. HLLM can reuse components from standard libraries, datapath generators or even handcrafted components. Specifically, we formulated and solved the problem of ALU mapping. We presented two algorithms (greedy and dynamic programming) for HLLM. These algorithms can be used to generate either area-optimized or delay-optimized designs.
In our experiments, we demonstrated the versatility of our approach by applying HLLM to ALUs drawn from a wide variety of libraries. We also demonstrated the superiority of our approach over traditional logic synthesis for complex ALU components in all three metrics: area, delay and run-time. We believe that the HLLM approach needs to complement logic synthesis and traditional technology mapping techniques to bridge the output of architectural synthesis with RT-level databook libraries, module generators and handcrafted components. Future work will investigate such interactions.
Acknowledgement
