State-of-the-art configurable logic platforms, such as Field-Programmable Gate Arrays (FPGAs), consist of a heterogeneous mixture of different component types. Compared to traditional homogeneous configurable platforms, heterogeneity provides speed and density advantages. This is due to the replacement of inefficient programmable logic and routing with specialized logic and fixed interconnect in components such as memories, embedded processor units, and fused arithmetic units. Given the increasing complexity of these components, this article introduces a method to automatically propose and explore the benefits of different types of fused arithmetic units. The methods are based on common subgraph extraction techniques, meaning that it is possible to explore different subcircuits that occur frequently across a set of benchmarks. A quantitative analysis is performed of the various fused arithmetic circuits identified by our tool, which are then automatically synthesized to an ASIC process, providing a study of the speed and area benefits of the components. The results of this study provide bounds on the performance of heterogeneous FPGAs: by incorporating coarse-grain components which match the specific needs of a set of benchmarks we show that significant improvements in circuit speed and area can be made.
INTRODUCTION
For embedded systems, reconfigurable devices provide designers with high throughput and cost-effective platforms. When designing reconfigurable devices, the architects must be aware of the types of circuit that users intend to map onto the platform. This means that device architectures must be designed with the performance of the different systems for which they are intended in mind. Given the widespread use of reconfigurable architectures for high throughput computation, there have been several advances in reconfigurable chip design for this domain which are particularly evident in Field-Programmable Gate Arrays (FPGAs).
Modern FPGAs consist of a variety of different resource types to implement a range of functionalities. Logic resources based on LookUp Tables (LUTs) are used to implement fine-grain operations, and can be configured to implement virtually any digital circuit. However, for some functions, the fine-grain fabric in FPGAs is inefficient for several reasons. The lookup tables can be inefficient when compared to gates for particular logic functions. More critically, the routing fabric required to connect blocks uses programmable switches which are slow and consume a considerable amount of area when compared to fixed connections. To speed up and improve the logic density of particular types of computation, embedded memories and fused arithmetic blocks have been incorporated into FPGA fabrics [Altera Corporation 2005b; Xilinx 2004a] .
This article focuses on fused arithmetic components in FPGAs. An example of a commercial fused arithmetic unit is the Xtreme Slice [Xilinx 2004b], found in Xilinx's Virtex 4 family of FPGAs. A simplified structure of the Xtreme Slice is shown in Figure 1. The configuration multiplexers are one of the central features of this embedded core, providing flexibility that allows support for a wide variety of arithmetic functions. However, flexibility has an associated cost with regard to the metrics of delay, area, and power consumption. One of the main aims of this article is to quantify this cost.
This article proposes a method for automatically discovering potential fused arithmetic components by examining common connection patterns between circuit netlists. In doing so, it is possible to identify complex computation patterns leading to specialized embedded components that are more efficient than existing FPGA resources. A tool has been developed to create silicon cores from representations of the common connection patterns. These cores can be directly compared with those currently on FPGAs, thus indicating potential speed and density advantages that could improve future FPGA architectures.
The main contributions of the article can be summarized as follows:
- a methodology for extracting arithmetic subcircuits that occur frequently across a set of benchmarks [Smith et al. 2007];
- introduction of a logic optimization phase, to increase the discovery of common subgraphs;
- a study of common arithmetic subcircuits that occur across a variety of benchmark circuits;
- a quantitative analysis of the speed and area trade-offs of common arithmetic circuits, including comparisons to a commercial 90nm FPGA.
The remainder of this article is organized as follows. In Section 2 related work is discussed. The problem of common subgraph extraction for fused arithmetic component generation is formally defined in Section 3. Section 4 details the design flow and algorithm used to extract commonly occurring arithmetic subcircuits. Section 4.3 presents the methodology that is employed to synthesize and compare the commonly occurring patterns. Comparisons between subcircuits implemented in the generated silicon cores and those implemented in a commercially available device are presented in Section 5. Finally, conclusions are drawn in Section 6.
BACKGROUND
Considerable research effort has been focused on configurable architecture exploration in recent years. It has been estimated that designs implemented on heterogeneous FPGA architectures are on average approximately 20 times larger and 3-4 times slower than when implemented as an ASIC [Kuon and Rose 2007] . In order to close this gap, there are various aspects of configurable architectures that can be targeted: routing fabric, fine-grain logic fabric, and coarse-grain fabric. In this article the focus is on improving the coarse grain arithmetic blocks, with a potential knock-on benefit in the reduction of routing requirements. It is the intention of this article to examine how parts of a design can be improved by automatically detecting common arithmetic patterns in benchmarks and implementing them as their own hard block in silicon. Detecting common connection patterns is a well-known problem, and has been employed in fields such as molecular chemistry, ASIC template generation [Chowdary et al. 1999] , and custom instruction set generation [Cone et al. 1977; Cong et al. 2004; Kastner et al. 2002] . In the context of the article we present here, existing work involving custom instruction set generation is particularly relevant, as it deals with arithmetic operations commonly implemented by field-programmable hardware.
In Cong et al. [2004] , subgraph extraction is used in instruction set generation for Application-Specific Instruction set Processors (ASIP) implemented on reconfigurable fabrics. The method employed for the problem is hierarchical. First C codes are transformed into Control Data Flow Graphs (CDFGs) and a pattern library is generated by examining each application CDFG. This pattern library is then examined to ascertain the potential speedup of the new custom instructions (patterns), subject to a set of constraints (area and I/O).
Similarly, in Kastner et al. [2002] , subgraph extraction is used in custom instruction set generation for soft processors implemented in reconfigurable fabrics. The method uses template generation using a single benchmark to create a library of subgraphs. The algorithm uses a profiling step to examine potential subgraphs and locally optimal graph covering methods in application mapping.
ISEGEN [Biswas et al. 2005] also uses subgraph extraction for custom instruction set generation. The methods employed use min-cut partitioning heuristics in order to minimize the external communication to the custom instruction units proposed by the framework. Contrary to our proposed framework, Biswas et al. [2005] focus on multiple-output rather than single-output graphs. This choice is discussed further in Section 3.
Common subgraph detection methods have also been used for the purpose of discovering hard cores for FPGAs: Aravind and Sudarsanam [2005] present a methodology from the point of view of reducing area and configuration overheads. However, in this article we present, for the first time, a quantitative analysis of possible components identified by common subgraph extraction, including a comparison of their speeds and areas in both FPGA and ASIC for 90nm technology.
Configurable architecture generation has also been approached in the context of coarse-grain configurable devices. For instance, in Compton and Hauck [2007], architectures are generated for a specific set of benchmarks, and methods are employed to reduce the number of multiplexers and connecting wires. Similar to Compton and Hauck [2007], we focus on low-level hardware optimizations, but concentrate on generating individual components that can be inserted into reconfigurable fabrics. Moreover, we base our methods on formal techniques and focus on commercial FPGA architectures, consisting of a mixture of coarse- and fine-grain components.
The problem of common subgraph discovery encompasses that of technology mapping. Traditional technology mapping algorithms for homogeneous fine-grain FPGAs, consisting of lookup tables, consider how to pack logic netlists into appropriately sized functional blocks. This is a well researched topic, exemplified by FlowMap [Cong and Ding 1994] and SIS [Sentovich et al. 1992]. With advances in FPGA technology, synthesis must also consider heterogeneous components such as embedded memories and complex DSP blocks, capable of many different fused arithmetic operations.
The problem of technology mapping to heterogeneous FPGAs is particularly problematic, as it is difficult to infer components such as DSP blocks and multipliers from logic netlists. Thus, recent efforts have attempted to infer such components from HDL descriptions. Odin [Jamieson and Rose 2005] is an example of a tool for technology mapping to modern heterogeneous FPGAs from Verilog code, and is used in this article to parse benchmark circuits into their constituent components. A graph matching algorithm, similar to that employed by Odin, is also used to determine which benchmark structures can then be matched by the common subgraphs.
PROBLEM DEFINITION
In this article, we focus on the common subgraph extraction problem in the context of arithmetic circuits. This motivates the following definitions and problem formulation.
We use word-level netlists as formal representations of benchmark circuits after parsing and elaboration of Verilog, where the node set corresponds to word-level arithmetic operations such as addition (a type). The edge set corresponds to the flow of data between these operations. An order is associated with the edge set in order to express precedence and noncommutative operations; for example, the netlist computing $(a + b) - c$ is not the same as the netlist computing $c - (a + b)$. The type set $T$ can be considered as the set of all atomic computational nodes available in the elaborated Verilog, such as $T = \{add, mult, reg, sub\}$, and the type function $T$ associates nodes with types. We further identify a subset $T_{NC}$ of types for which the ordering of inputs matters, for example, noncommutative arithmetic operations such as subtraction.
and such that the order on $E$ is preserved in $E'$ for edges from nodes with types in $T_{NC}$.
This definition captures the concept of "part of a computation" while the final conditions ensure that: (i) if two nodes are part of a subgraph, then so are the relevant edges, and (ii) if the order of inputs to a computation is important, then it is preserved.
Definition 3. Two graphs are said to be isomorphic iff they are identical up to a relabeling of the vertex set.
Isomorphism is useful in this context, because if two word-level netlists are isomorphic, they compute the same function.
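The role of the edge ordering can be illustrated with a minimal Python sketch. The class, field, and node names below are assumptions made for illustration and are not taken from the tool; the sketch stores an ordered driver list per node and compares the functions computed by two small netlists, sorting inputs only for types outside the noncommutative subset.

```python
from dataclasses import dataclass, field

# Assumed noncommutative subset; the example type set in the text is
# {add, mult, reg, sub}.
NONCOMMUTATIVE = {"sub"}


@dataclass
class Netlist:
    """Word-level netlist: typed nodes plus an ordered driver list per node."""
    types: dict = field(default_factory=dict)   # node name -> type
    inputs: dict = field(default_factory=dict)  # node name -> ordered drivers

    def add_node(self, name, node_type, drivers=()):
        self.types[name] = node_type
        self.inputs[name] = list(drivers)

    def signature(self, name):
        """Expand a node into a nested tuple describing what it computes."""
        kind = self.types.get(name)
        if kind is None:              # primary input: no type recorded
            return name
        args = [self.signature(d) for d in self.inputs[name]]
        if kind not in NONCOMMUTATIVE:
            args.sort(key=repr)       # order is irrelevant for commutative types
        return (kind, *args)


# (a + b) - c
g1 = Netlist()
g1.add_node("s", "add", ["a", "b"])
g1.add_node("out", "sub", ["s", "c"])

# c - (a + b), with the adder inputs listed in the opposite order
g2 = Netlist()
g2.add_node("s", "add", ["b", "a"])
g2.add_node("out", "sub", ["c", "s"])

same_adder = g1.signature("s") == g2.signature("s")
same_output = g1.signature("out") == g2.signature("out")
print(same_adder, same_output)
```

In this sketch the two adders compare equal because addition commutes, while the two outputs differ because the subtracter operands are swapped.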
Definition 4. Given two netlists G 1 and G 2 , a common subgraph G is a netlist that is isomorphic to a subgraph of G 1 and also isomorphic to a subgraph of G 2 .
A common subgraph is therefore a potential candidate for implementation as a hard embedded core in a reconfigurable fabric, since it represents functionality shared across two or more benchmark netlists. However, we often aim to be more flexible in the extraction of embedded cores; for example, Figures 2(a) and 2(b) show two subgraphs exhibiting multiply accumulate functionality. Firstly, we may wish to combine the add and subtract functionality into a combined adder/subtracter, which makes sense from a hardware point of view. After this merging procedure, if two operations differ only in their latency, we may wish to add optional pipeline registers, allowing both operations to be mapped to the same core; see Figure 2(c). This observation motivates the following definition.
with respect to a contraction set C ⊆ T is a graph with V ⊂ V and for which: (i) there is a one-to-one correspondence between edges in E and paths
A contraction is thus a way of absorbing a number of nodes of a path into a single edge, allowing the common subgraph algorithm to bypass certain information that is necessary for synthesis of the graphs only. We therefore have a very flexible conceptual framework for defining a candidate hard embedded core: it is any common subgraph of the respective minimal contractions of two word-level netlists. By setting C = {reg} we can, for example, achieve optional pipeline registers. For the remainder of this article, this is the contraction set used.
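A contraction with C = {reg} can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the dictionary-based netlist encoding and the function name are hypothetical, and each contracted node is assumed to have exactly one input, as a register does.

```python
def contract(types, inputs, contraction_set=frozenset({"reg"})):
    """Bypass every node whose type is in contraction_set.

    types:  dict mapping node -> type
    inputs: dict mapping node -> ordered list of driver nodes
    Assumes each contracted node has exactly one input (true for registers).
    """
    def resolve(node):
        # Follow a chain of contracted nodes back to the first kept node.
        while types.get(node) in contraction_set:
            node = inputs[node][0]
        return node

    kept_types = {n: t for n, t in types.items() if t not in contraction_set}
    kept_inputs = {n: [resolve(d) for d in inputs[n]] for n in kept_types}
    return kept_types, kept_inputs


# Pipelined multiply-add: m = a * b, r = register(m), s = r + c
pipelined_types = {"m": "mult", "r": "reg", "s": "add"}
pipelined_inputs = {"m": ["a", "b"], "r": ["m"], "s": ["r", "c"]}

# Unpipelined multiply-add: m = a * b, s = m + c
plain_types = {"m": "mult", "s": "add"}
plain_inputs = {"m": ["a", "b"], "s": ["m", "c"]}

contracted = contract(pipelined_types, pipelined_inputs)
print(contracted == (plain_types, plain_inputs))
```

After contraction the pipelined and unpipelined multiply-add netlists are identical, which is what allows both to map onto one core with an optional pipeline register.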
A further constraint on the common subgraph extraction algorithm is that the common subgraph should have only one output, a problem known as Multiple-Input Single-Output (MISO) graph extraction. In configurable hardware design there are several considerations that are important in hardware generation. Configurable routing wires are capacitance heavy, and logic resources require relatively large multiplexers to connect their inputs and outputs through the FPGA routing fabric. The single-output simplification thus suits this particular context well.
COMMON SUBGRAPH EXTRACTION ALGORITHM AND SYNTHESIS
An overview of the tool flow used in this work is shown in Figure 3. The netlists for use in the common subgraph extraction framework are required to be word level. A combination of the publicly available tools Icarus Verilog [Williams 1999] and Odin [Jamieson and Rose 2005] is used to parse Verilog benchmarks. All technology-specific optimizations available in Odin are turned off, as simple arithmetic graphs are required to explore potential embedded arithmetic units. The output of Icarus/Odin is a flattened netlist of the basic components: standard logic operators, such as AND/OR; multiplexers; registers; arithmetic components; memories; and relational operators, for example, greater than/less than.
The netlists produced by Icarus/Odin are fed through some simple optimization procedures (discussed in Section 4.1). This allows the discovery of many more arithmetic common subgraphs, by reducing logic between arithmetic components. The resulting netlists are then fed into the common subgraph extraction algorithm (discussed in Section 4.2).
The common subgraph extraction algorithm operates on pairwise combinations of benchmarks. The common subgraphs must thus be examined across the benchmark set to assess the wordlength requirements of incorporating such a block into an FPGA fabric and the benefits of using such a block. Once the wordlength requirements are known, the block can be synthesized.
Constant Propagation
The netlists produced by Odin are optimized for use with heterogeneous architectures, and are intended for use in conjunction with further synthesis and optimization tools. Thus it is important to perform some logic optimization on the netlists. The aim of this phase is to remove logic chains between arithmetic components and facilitate discovery of common fused arithmetic subgraphs. This motivates the introduction of one of the new features of our tool over Smith et al. [2007] : a constant propagation and simplification phase.
The logic optimization phase introduced is relatively straightforward, and was motivated through observation of the Icarus/Odin netlists. The handling of register input signals was found to be a particularly inefficient part of the Verilog parser toolset. Many of the generated signals that feed into registers involve propagation of constants through logic chains; a simplified example of such a combination is shown in Figure 4, which clearly can be reduced by applying simple logic identities. Odin netlists involve one- and two-input gates only, and the identities supported are simple transformations based on constant propagation through standard gates (e.g., 1.A = A), and logic identities such as A.A = A. The logic identities are applied iteratively until no more gates can be eliminated. The impact of these reductions is discussed in the results section.
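The iterative application of identities can be sketched as follows. This is a minimal Python illustration rather than the tool's implementation: the gate encoding, the function name, and the restriction to two-input AND/OR gates are assumptions, and the identity set shown (1.A = A, 0.A = 0, 0+A = A, 1+A = 1, A.A = A, A+A = A) is an assumed extension of the two examples given in the text.

```python
def simplify(gates):
    """Apply constant-propagation identities until no gate can be eliminated.

    gates: dict mapping gate name -> (op, in0, in1), op in {"and", "or"};
    inputs are gate names, primary-input names, or the constants 0 and 1.
    Returns (remaining gates, alias map for the eliminated gates).
    """
    gates = dict(gates)
    alias = {}  # eliminated gate -> signal or constant that replaces it

    def resolve(sig):
        while sig in alias:
            sig = alias[sig]
        return sig

    changed = True
    while changed:
        changed = False
        for name, (op, a, b) in list(gates.items()):
            a, b = resolve(a), resolve(b)
            identity, absorbing = (1, 0) if op == "and" else (0, 1)
            if a == b:                      # A.A = A, A+A = A
                result = a
            elif absorbing in (a, b):       # 0.A = 0, 1+A = 1
                result = absorbing
            elif a == identity:             # 1.A = A, 0+A = A
                result = b
            elif b == identity:
                result = a
            else:
                gates[name] = (op, a, b)    # keep the gate, with resolved inputs
                continue
            alias[name] = result
            del gates[name]
            changed = True
    return gates, alias


# A constant fed through a short logic chain ahead of a useful gate.
chain = {
    "g1": ("and", 1, "x"),
    "g2": ("or", "g1", 0),
    "g3": ("and", "g2", "g2"),
    "g4": ("and", "g3", "y"),
}
remaining, replaced = simplify(chain)
print(remaining)
```

In this example three of the four gates are eliminated, leaving a single AND of x and y.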
Common Subgraph Extraction Algorithm
The common subgraph extraction algorithm used for this work is based on one in Bunke et al. [2002] , and has been developed to operate on large, directed graphs containing arithmetic components. The algorithm identifies potential fused arithmetic units by finding common graphs between two benchmarks, and a matching algorithm is employed to calculate the frequency of common graphs across the entire benchmark set.
The algorithm for extracting common subgraphs focuses on arithmetic components and is described in Algorithm 1. The algorithm starts by traversing both graphs and finds two matching nodes using the "ISMATCH" function, which in this case only has to check that the types of the two nodes (multipliers and adders or subtracters) are compatible. A new node is then created for the common subgraph structure, which acts as a seed node to grow the rest of the common subgraph. The algorithm then recursively tries to add nodes to this graph.
Algorithm 1. Common Subgraph Extraction Top-Level Algorithm
      ...
      add nodes to G
    end if
  end for
end for
Algorithm 2, "FAST MATCH", describes the recursive part of the algorithm, which adds nodes to the common subgraph. For each node in the common subgraph, a node-pair $v = (v_1, v_2)$ is created, and functions $r_1(v)$ and $r_2(v)$ are created to return the appropriate benchmark node from the common subgraph node. The algorithm proceeds by examining each node in the common subgraph and the benchmark nodes to which it refers, obtained through the mapping functions $r_1(v)$ and $r_2(v)$. The procedure must thus determine whether a new node from some combination of the inputs to the benchmark nodes can satisfy the problem constraints. The four problem constraints are as follows: the two potential benchmark nodes matched must be of the correct type; the ordering of nodes must be consistent (i.e., if there is a subtraction involved, the nodes must be connected on the correct port); both nodes must have the same connection pattern within the common subgraph; and only one node of the common subgraph may drive an output. The recursive procedure enumerates all possibilities of inputs for each node within the common subgraph to determine whether a new node can be added to the common subgraph. If a node can be created, it is added to the common subgraph along with the appropriate connections and the algorithm continues by recursively trying to add nodes to the common subgraph. On exiting a recursive step, the common subgraph node is removed from the common subgraph so that a different combination of benchmark nodes can be tried in creating a common subgraph node.
In Algorithms 1 and 2, the "ISMATCH" function is used to test the compatibility of two benchmark nodes to the common subgraph. It must first test whether the two node types of the pair are compatible. It then has to ensure that adding such a node will satisfy other problem constraints, such as that the outputs of this potential node are not used outside the common subgraph, as the common subgraph is restricted to single-output graphs. The function also has to determine whether by adding this component, and any combination of components connected to and from this component, a graph satisfying the problem constraints can be extracted. In the common subgraph extraction add and subtract nodes are allowed to be matched to create a component with a configuration port.
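Two of the checks described above, type compatibility with add/subtract merging and the single-output restriction, can be sketched in Python together with the enumeration of seed pairs. The function names, the merge table, and the example netlists are hypothetical; the real "ISMATCH" function also checks port ordering and connection patterns, which this sketch omits.

```python
# Assumed compatibility classes: add and sub may be merged into one
# configurable adder/subtracter; other types must match exactly.
MERGE_CLASS = {"add": "addsub", "sub": "addsub", "mult": "mult"}


def types_compatible(t1, t2):
    """True if two node types may be matched onto one common subgraph node."""
    c1, c2 = MERGE_CLASS.get(t1), MERGE_CLASS.get(t2)
    return c1 is not None and c1 == c2


def internal_only(node, members, fanout):
    """True if every consumer of node lies inside the candidate subgraph.

    Non-root nodes must pass this check for the subgraph to be single-output.
    """
    return all(c in members for c in fanout.get(node, ()))


def seed_pairs(types1, types2):
    """Enumerate candidate seed pairs over all node pairs of two benchmarks."""
    return [(u, v)
            for u, t1 in types1.items()
            for v, t2 in types2.items()
            if types_compatible(t1, t2)]


bench1 = {"m1": "mult", "a1": "add", "r1": "reg"}
bench2 = {"m2": "mult", "s2": "sub"}
seeds = seed_pairs(bench1, bench2)
print(seeds)

# m1 feeds only a1, so it may sit inside a subgraph {m1, a1};
# a1 drives a node outside the subgraph, so it can only be the root.
fanout1 = {"m1": ["a1"], "a1": ["out"]}
m1_ok = internal_only("m1", {"m1", "a1"}, fanout1)
a1_ok = internal_only("a1", {"m1", "a1"}, fanout1)
```

The register node produces no seed pair because only arithmetic types appear in the assumed merge table.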
The algorithm keeps copies of all graphs that satisfy the constraints, and thus terminates when all combinations of seed nodes from the two benchmarks have been examined. The pairwise nature of Algorithm 1 results in $N_{t1} N_{t2}$ comparison operations, where $N_{t1}$ and $N_{t2}$ represent the total number of nodes in each of the two benchmark circuits. The complexity of the recursion (Algorithm 2) is dependent on the size of the largest arithmetic common subgraph $N_c$. As the size of the common subgraph is expanded a node at a time, the number of comparisons in each branch of the recursion increases linearly, as there is an additional node whose inputs must each be compared. For a common subgraph of size $n$, the number of computations performed by the recursion $f_n$ is thus $f_n = 2n + k_{n-1} f_{n-1}$, where $k_{n-1}$ represents the number of possible recursive branches in the preceding stage leading to the common subgraph of size $n$. Further, $k_n$ in the worst case will grow linearly due to the number of free inputs to the nodes in the common subgraph being dependent on the size of the subgraph. However, as the size of the common subgraph approaches $N_c$, the "ISMATCH" function prunes the design space, minimizing the number of recursive steps. In the worst case the algorithm has exponential time complexity with respect to the maximum common subgraph size. However, in practice the size of the common subgraphs is low, resulting in manageable runtimes. Moreover, the number of arithmetic nodes is relatively small, meaning the "ISMATCH" function prunes the design space further. The common subgraphs from all combinations of the benchmark set (17 benchmarks) can be extracted in less than half an hour on a conventional workstation (Pentium 4 2.4 GHz, 1GB memory, running Windows XP).
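To give a feel for how the recurrence behaves, the following Python sketch evaluates it numerically. The base case $f_0 = 0$ and the two branching functions are assumptions chosen for illustration; the text does not specify them.

```python
def recursion_cost(n, branches):
    """Evaluate f_n = 2n + k_{n-1} * f_{n-1}, with f_0 = 0 as an assumed base case.

    branches(m) returns k_m, the number of recursive branches at size m.
    """
    f = 0
    for size in range(1, n + 1):
        f = 2 * size + branches(size - 1) * f
    return f


# One branch per stage: the cost grows only polynomially.
single_branch = [recursion_cost(n, lambda m: 1) for n in range(1, 5)]

# Branching that grows linearly with subgraph size (k_m = m): the cost
# grows much faster, in line with the worst-case behaviour described above.
linear_branch = [recursion_cost(n, lambda m: m) for n in range(1, 6)]

print(single_branch, linear_branch)
```

With a single branch per stage the cost grows slowly, whereas linearly growing branching makes each term several times larger than the last, which is why pruning by "ISMATCH" matters in practice.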
Algorithm 2. FAST MATCH
for all input ports i on node $r_1(u)$ do
  for all input ports j on node $r_2(u)$ do
    ...
    $u_2$ = node on port j of $r_2(u)$
    ...
  end for
end for
An important consideration for the device architect is the frequency of the common subgraphs. The more frequent the graphs are, the better they are as candidates to be used in the configurable fabric. In this article, a graph covering approach is taken to this problem. A complete enumeration of potential nonoverlapped coverings is made for each benchmark, with the mapping that covers the most nodes being chosen as the maximum covering.
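The selection of a maximum covering from nonoverlapping matches can be sketched as an exhaustive search. The Python below is a hedged illustration only: the function name and the representation of each match as a set of benchmark node names are assumptions, and no claim is made that the tool enumerates coverings in this order.

```python
def max_covering(matches):
    """Exhaustively pick pairwise-disjoint matches that cover the most nodes.

    matches: list of frozensets, each holding the benchmark nodes of one
    occurrence of the common subgraph.
    Returns (number of nodes covered, list of chosen matches).
    """
    best = (0, [])

    def search(index, used, chosen):
        nonlocal best
        if len(used) > best[0]:
            best = (len(used), list(chosen))
        for i in range(index, len(matches)):
            if used.isdisjoint(matches[i]):      # nonoverlapped coverings only
                chosen.append(matches[i])
                search(i + 1, used | matches[i], chosen)
                chosen.pop()

    search(0, frozenset(), [])
    return best


# Three candidate multiply-add matches; the middle one overlaps both others.
candidates = [
    frozenset({"m1", "a1"}),
    frozenset({"a1", "a2"}),
    frozenset({"m2", "a2"}),
]
covered, chosen = max_covering(candidates)
print(covered, len(chosen))
```

Here the frequency of the subgraph in the benchmark is the number of chosen matches, and the overlapping middle match is rejected in favour of the two disjoint ones.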
During the process of evaluating the frequency of the common subgraphs it is also possible to obtain information about the required wordlengths of the structures for the process of synthesizing the graph as an embedded core. For each benchmark the maximum graph covering is used and the largest wordlength for each component is selected so that the synthesized core is capable of supporting the wordlength requirements of all graph coverings in the benchmark set. Modern synthesis tools are capable of combining multiplier blocks together in order to account for wordlengths beyond that supported by an individual block. This is advantageous for smaller wordlengths, particularly for blocks such as Altera's DSP block [Altera Corporation 2005a] , which is fracturable into many combinations of wordlengths other than the maximum. However, for benchmarks with greater precision requirements or dynamic range, blocks with greater wordlengths are particularly advantageous, as they negate the need to employ the slower fine-grain logic of an FPGA. This trade-off will be examined in Section 5.
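The wordlength selection rule described above amounts to taking a per-component maximum over all matched instances. A minimal Python sketch follows; the function name, component labels, and wordlength values are invented for illustration.

```python
def required_wordlengths(instances):
    """Return the largest wordlength seen for each component.

    instances: list of dicts, each mapping a component of the common
    subgraph to its wordlength in one matched instance.
    """
    required = {}
    for instance in instances:
        for component, width in instance.items():
            required[component] = max(width, required.get(component, 0))
    return required


# Hypothetical wordlengths of one multiply-add subgraph in three coverings.
coverings = [
    {"mult": 16, "add": 32},
    {"mult": 18, "add": 24},
    {"mult": 8, "add": 16},
]
core_widths = required_wordlengths(coverings)
print(core_widths)
```

A core synthesized with these widths supports every instance in the list, at the cost of unused bits for the narrower ones.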
Common Subgraph Synthesis
After common subgraph extraction and technology mapping, the tool can be used to compare components implemented in FPGA and ASIC, to ascertain the speed and area trade-offs. This is done automatically by creating HDL files for each of the common subgraphs and each of the respective technology mapped subcircuits of the benchmarks. These HDL files are then compiled to FPGA and ASIC.
In order to perform synthesis, two design flows have been used. The FPGA design flow used is that for the 90nm Xilinx Virtex 4 FPGA, using Xilinx Integrated Software Environment (ISE) 8.2 [Xilinx 2006 ]. The fastest speed grade Virtex 4 LX45 was used. Two runs of synthesis are performed: once to ascertain the number of components required by the structure from the benchmark, and a second time to ascertain the speed. To evaluate speed, the benchmark subcircuit is placed between registers and the critical path delay is observed. These registers ensure that delays associated with IO pads and the required routing do not increase the critical path. The delay from an input register to a logic input is also accounted for to ensure the delay calculated is not skewed by routing delays to and from the FPGA component.
To synthesize the common subgraphs as embedded cores, the Cadence digital IC design suite (version 5.2) was used. This uses a combination of RTL Compiler [Cadence Design Systems 2005b] and SoC Encounter [Cadence Design Systems 2005a], and incorporates full place and route. The Xilinx Virtex 4 is manufactured in UMC's 90nm technology, but due to the unavailability of UMC's technology files, the STMicroelectronics 90nm ASIC library [STMicroelectronics 2006] was used at the appropriate voltage (1.2V).
The Virtex 4 DSP blocks have been designed at a specific trade-off point in the area/time space, selected by the manufacturer. Hence to perform a reasonable comparison to the commercial architecture three versions of each subgraph identified by our tool have been synthesized covering a range of potential cores in the speed-area design space. This is done by first optimizing for speed: an attempt to force the component to operate at 2 GHz is made, which is an unreasonable constraint, but provides us with the fastest core possible given the synthesis flow. The estimated operating frequency of this implementation is ascertained through the synthesis tools. In order to find a balanced area-speed implementation of each core, the constraint on the operating frequency is relaxed to a half of its corresponding fastest core. The minimum area point in the design space is then found by relaxing the timing constraint to a third of the actual operating frequency of the fastest core. These frequencies were found experimentally to evenly cover the area-speed design space for the embedded cores.
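The derivation of the three constraint points is simple arithmetic, sketched here in Python. The function name and the 900 MHz figure are hypothetical and chosen only to make the example concrete.

```python
def timing_targets(fastest_mhz):
    """Derive the three synthesis frequency targets from the fastest core.

    fastest_mhz is the operating frequency reported by synthesis when the
    deliberately unreachable 2 GHz constraint is applied.
    """
    return {
        "fastest": fastest_mhz,
        "balanced": fastest_mhz / 2,   # constraint relaxed to one half
        "min_area": fastest_mhz / 3,   # constraint relaxed to one third
    }


targets = timing_targets(900.0)  # 900 MHz is a made-up figure
print(targets)
```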
RESULTS

Benchmarks and Logic Minimization
To explore the common subgraph extraction and synthesis methodology, a set of industrial and academic designs was used. These are a subset of those used in Jamieson and Rose [2005], spanning a variety of application domains such as DSP and scientific computing. Only a subset of the designs is used, as several of the benchmarks in the original suite are essentially replicated with differences only in the memory components. Statistics of each benchmark are presented in Table I, including synthesis of the Verilog code onto a Virtex 4 device. Some of the benchmarks map particularly well to the Virtex 4 architecture, for example, the DSP benchmarks "iir", "iir1", "fir 3 8 8", and "fir 24 16 16" all take full advantage of the DSP blocks. However, "fir scu rtl" does not, implementing only 8 multipliers in DSP blocks (note that the tools take advantage of the fracturable nature of the DSP blocks in this case).
The common subgraph extraction algorithm works on pairwise combinations of benchmarks. In order to find an upper bound on the size of the subgraphs that exist in the benchmark set, the common subgraph extraction technique was applied to each benchmark with the same benchmark used as both input graphs to the algorithm. This is equivalent to finding the largest arithmetic subgraph for each benchmark. The results of this study are shown alongside the benchmark statistics in Table I . The results show that there is potential for some relatively large subgraphs when compared to the DSP blocks currently used in state-of-the-art FPGAs. The results in Table I also show the importance of the logic minimization phase. This is particularly the case with the FIR filter benchmarks: inefficiently generated combinatorial logic exists between arithmetic nodes in the benchmark netlists, meaning that the logic minimization phase allows the design space of potential heterogeneous blocks to be significantly extended. This is a significant improvement over the previous reported results [Smith et al. 2007 ].
An analysis of benchmark critical paths indicates potentially significant benefits of instantiating fused hard arithmetic components in FPGA fabrics. When implemented in a Xilinx Virtex 4 device, approximately 75% of all paths within 20% of the most critical path involve arithmetic components. For over a third of the benchmarks 100% of these most critical paths involve arithmetic components. This means that there are significant potential system-wide benefits in terms of timing. Moreover, approximately 25% of benchmark area is devoted to arithmetic components, implying that there are also significant area savings on offer. We also note that over 50% of multipliers and approximately 40% of adders have connections to other arithmetic components, providing a strong motivation for the analysis performed in the following subsections.
Generated Cores
In order to generate the set of potential fused arithmetic cores, the subgraph extraction algorithm was run on all pairwise combinations of benchmarks. The results for the common subgraphs shown shortly include those subgraphs that exist in three or more benchmarks, that is, do not include self-similar subgraphs, or those existing in only two benchmarks.
The maximum common subgraphs extracted from all pairwise combinations of benchmarks are shown in Table II. Included in the table are the frequencies of the graphs across the entire benchmark set and the area and delay metrics for each component when synthesized by the ASIC tools to its minimum area-delay product. To simplify the results, graphs that occur with a frequency of two or less are not presented.
Relative to common subgraph extraction for custom instruction set generation, the size of the subgraphs extracted between benchmark pairs is small. This is predominantly because such work on instruction set generation looks for common graphs within a single application (see Biswas et al. [2005] for example). It does not make sense to do this for configurable hardware, where the fabric must be used for a number of different benchmarks. However, when compared to the self-similarity results in Table I , it is interesting to see that the approach we propose is capable of covering a large proportion of the arithmetic components within a given benchmark circuit, despite the constraint of a single output.
Compared to the results reported in Smith et al. [2007], the combination of four additional benchmarks and the logic optimization phase increases the number of common subgraphs from 9 to 15 (a 33% improvement). The predominant factor here is the logic minimization phase, as can be seen by examining the breakdown of common subgraph frequencies by benchmark (see Table III). The breakdown shows that the logic optimized FIR benchmarks ("fir 3 8 8" and "fir 24 16 16") used in both studies allow the discovery of pipelined adder-subtracters and more complex multiply-add blocks in graphs 13, 14, and 15. This was not possible without the logic optimization phase.
Table II includes the maximum output wordlength required for the generated components, as well as the minimum, maximum, and mean resources required when each of the subgraphs within the benchmark set has been implemented on a Virtex 4 device. These were obtained from the synthesis of each matching substructure across all benchmarks. From these figures it is evident that when the wordlengths fit correctly, the Xilinx DSP slices can be used without the need for any fine-grain logic. This is due to carry chains between DSP slices. However, the large wordlength requirements of some structures within the benchmarks necessitate the use of fine-grain logic unless the cores generated by our methodology are employed. The maximum wordlength required, and Virtex 4 resource usages for the smallest and largest wordlength components are also given.
The common subgraphs in Table II can be grouped according to type: graphs 1-6 represent simple cascaded adder/subtracter circuits, graphs 7-10 represent multiply accumulators, and graphs 11 and above represent more complex structures extracted from the benchmark set. In the case of the cascaded adder components, the most flexible component is graph 6, which occurs 42 times. This means approximately 25% of adder components in the entire benchmark set can be supported by including this type of component. This is a significant proportion of the computational components in the benchmark set, and supports the inclusion of this type of component in FPGA architectures. Similarly, the most flexible multiply accumulate component, graph 10, occurs 54 times. This covers approximately 42% of multiply components in the benchmark set. Given that the component can be configured to perform normal multiplication (by grounding the external input pins to the adder/subtracter), this also supports the notion of including this component. The area improvements are predominantly due to the support of longer wordlengths: 39% of the subgraphs of this type require additional slices when implemented on an FPGA, the area penalty for which is substantial.
[Table III: breakdown of the common subgraphs from Table II by benchmark circuit. Columns: Benchmark; Graph Number from Table II (1-15).]
Some of the larger components provide an extra insight into the computational requirements of the benchmarks and show that these can provide significant benefits. For example, graph 14 identified by the common subgraph procedure contains a multiply accumulate as well as an additional multiplier. By using the component with an additional multiplier unit instead of a more basic multiply accumulate, it is possible to cover more of the multiplier nodes across the benchmark set (49% instead of 42%). This is at the expense of not covering several addition units.
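The trade-off between multiplier coverage and adder coverage can be made concrete with a small sketch. The node counts and match counts below are toy values chosen for illustration and are not taken from our benchmark set; the `Candidate` type and the `coverage` function are likewise hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    multipliers: int  # multiplier nodes inside one instance of the component
    adders: int       # adder nodes inside one instance of the component
    matches: int      # non-overlapping matches found across the benchmarks


def coverage(c: Candidate, total_multipliers: int, total_adders: int):
    """Return (multiplier coverage, adder coverage) as fractions in [0, 1]."""
    mult_cov = c.matches * c.multipliers / total_multipliers
    add_cov = c.matches * c.adders / total_adders
    return mult_cov, add_cov


# Toy benchmark set: 100 multiplier nodes and 200 adder nodes in total.
mac = Candidate("multiply-accumulate", multipliers=1, adders=1, matches=40)
mac_plus_mult = Candidate("MAC + multiplier", multipliers=2, adders=1, matches=25)

for cand in (mac, mac_plus_mult):
    mult_cov, add_cov = coverage(cand, total_multipliers=100, total_adders=200)
    print(f"{cand.name}: multipliers {mult_cov:.1%}, adders {add_cov:.1%}")
```

In this toy setting the larger component covers more multiplier nodes with fewer matches, but leaves more adder nodes uncovered, which mirrors the behaviour observed for graph 14.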
Comparison to Virtex 4
In this section, each subgraph identified and synthesized by our framework is discussed and compared to an existing device family. Each subgraph in each benchmark was synthesized individually, both on the Virtex 4 platform and using the cores generated by the ASIC synthesis tools. Timing information was taken from the estimates in the synthesis results. Area information was estimated by synthesizing an ASIC core similar to that found in the Virtex 4 device. To verify these figures and to obtain the area of the slice components, we compared against the area figures reported in Beauchamp et al. [2008], in which die photos were obtained for a Virtex II device.
In order to compare fairly, each subcircuit identified during the node covering phase is mapped such that the wordlength requirements and corresponding delays are appropriately evaluated. For example, a subcircuit may include an 8-bit multiply accumulate with 16 output bits. This subcircuit would not need to use all of the 36 output bits of the 18x18-bit DSP blocks in the Virtex 4, hence the tools evaluate the delays to the appropriate output bit. Similarly, our flow might identify an 18x18-bit multiply accumulate, and the evaluation of this component must only consider the propagation to output bits that are used. Thus the subcircuit speed is evaluated consistently in both ASIC and FPGA synthesis flows.
Table IV shows the relative speed and area ranges of the cascaded adder components. The geometric mean of the relative improvements seen has been used as the figure of merit. The comparison is made on a component-to-component basis for each subgraph identified: because relative factors are being averaged rather than absolute quantities, the geometric mean is the appropriate average. The range given in the table accounts for the fact that we have synthesized three different components for each subgraph, each with varying time and area constraints (see Section 4.3 for details of these implementations).
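The choice of the geometric mean as the figure of merit can be illustrated with a short sketch. The ratios used here are illustrative values, not measurements from our experiments, and the function names are hypothetical.

```python
import math


def relative_improvements(fpga_values, core_values):
    """Per-component ratio of FPGA cost to core cost (delay or area).

    A ratio above 1 means the generated core is better for that component.
    """
    return [f / c for f, c in zip(fpga_values, core_values)]


def geometric_mean(ratios):
    """Geometric mean of strictly positive ratios, computed via logarithms."""
    if not ratios:
        raise ValueError("at least one ratio is required")
    if any(r <= 0 for r in ratios):
        raise ValueError("ratios must be strictly positive")
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))


# Illustrative delays: one component is 2x faster on the core, one is 2x slower.
ratios = relative_improvements(fpga_values=[4.0, 1.0], core_values=[2.0, 2.0])
print(ratios)
print(geometric_mean(ratios))      # close to 1: the two factors cancel
print(sum(ratios) / len(ratios))   # arithmetic mean overstates the benefit
```

A 2× improvement and a 2× degradation cancel under the geometric mean, whereas the arithmetic mean of the same ratios would suggest a net gain; this is why relative factors are averaged geometrically.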
The minimum and maximum values show the relative differences between the generated component and synthesis on a Xilinx FPGA for the fastest and slowest individual graph from the benchmark set. For the cascaded adder components this means that, in terms of logic speed, the component always provides an advantage over Xilinx's lookup-based logic components used to implement this kind of functionality. However, in terms of area the component does not always provide an advantage. This is mostly because several adder components use a relatively small number of bits, and the large adder, which supports up to 64 bits, is area inefficient for these smaller adders. Conversely, when all of the bits of the component are used, area savings of around 50× are possible, accompanied by speed advantages of around 5×.
Table V gives data on the area and speed differences of the multiply accumulate components. In this case there is a much more obvious trade-off between speed and area: the fastest component generated by the ASIC synthesis tool has a better geometric average speed than the Xilinx-synthesized version; however, the corresponding geometric average in area shows that the component is larger than the Xilinx version. This is because this type of component matches the fused arithmetic resources (DSP slices) that exist on the Virtex 4 FPGA. In a similar manner to the cascaded adder components, when the wordlength requirements of the subcircuits better match the generated core, significant improvements can be made (a maximum of around 7× in both area and speed). The potential area improvements are much lower here than in the case of the cascaded adder because the DSP slice component has much less inherent flexibility than the fine-grain slices used to implement addition.
Table VI shows the advantages and disadvantages of using more complex components that were extracted from the benchmark set.
These components can provide an FPGA with significant area savings of over 15× for an individual component. Again, this is when the wordlength requirements best match the generated component. An interesting characteristic not shown in the table is that some subcircuits can be constructed entirely from Xilinx DSP slices. This is due to the 3-input adder in conjunction with the carry chains between DSP slices (this is shown in Figure 1). In fact, all of graphs 11-15 can be implemented using a single column of DSP slices if the wordlength requirements are sufficiently low. This removes the need for the relatively high capacitance routing resources.
Tables IV-VI also show the sum of the internal routing wires across the benchmark set used for each core. In configurable devices, the functional fabric is not the only part of the device that is used: the routing fabric also contributes a significant portion of device area. The number of segments used in routing a design determines the number of multiplexers and pass transistors, and hence the substrate area used, so it is an interesting figure of merit for the potential embedded cores. Because the cores generated in our flow are chosen to support the largest wordlength possible, these wires would not be evident should the proposed cores be implemented. The routing segments used in each FPGA implementation were extracted using the Xilinx tools and isolated from the input and output routing segments. In some cases the number of these wires is significant: for instance, once pipeline registers are added to the cascaded adder components, a large jump in the routing segments required is seen, as the adders can no longer fit into a single slice without some sort of retiming. However, if graph 11 is considered, only a relatively small number of routing segments is used for such a relatively complex core. This is due to the efficient carry chains implemented in the XtremeDSP slices used in the Virtex 4 device.
CONCLUSION
In this article, we have presented a methodology for extracting commonly occurring patterns in circuit netlists. This has led to the quantification of the potential benefits, in terms of area, delay, and configurable routing segments, of a set of arithmetic components. The reported improvements indicate that there are arithmetic cores that have the potential to improve FPGA logic density and performance by significant amounts.
There is potential for significant future work. For example, we intend to further address the system-level benefits of new embedded silicon cores: the cost associated with routing to and from such components, and how this affects the performance of the entire system, is an important factor. We also intend to extend our benchmark set to cover a larger domain through the use of publicly available benchmarks such as those from the Extensible, Programmable, and Reconfigurable Embedded Systems Group. Further study could also examine Multiple-Input Multiple-Output (MIMO) graphs in order to find more complex structures within the benchmark set.
