Abstract-Programmable
I. INTRODUCTION

RECENT innovations in field-programmable gate array (FPGA) architecture have led to the development of hybrid FPGA families [3] that combine diverse sets of logic resources on the same silicon substrate. To support wide-fanin, low logic-density subcircuits, such as finite-state machines, some contemporary FPGA architectures [4] contain SRAM-configurable programmable logic arrays (PLAs). Unlike fine-grained lookup tables (LUTs), PLAs can implement sets of logic functions with minimal interconnect, the most area-expensive resource in contemporary FPGAs [3]. For product term (Pterm)-based PLA structures, this area efficiency often comes at the cost of increased minimum delay for PLA paths versus corresponding LUT paths, requiring resource balance. When coupled with fine-grained LUTs, PLAs provide an integrated programmable resource that can be used in many digital system designs to support noncritical-path control logic for LUT-based datapaths. Current industry technology mapping tools [5], [6] do not provide automated techniques to partition user designs across heterogeneous logic resources, limiting the usefulness of hybrid devices. This work presents an automated technology mapping tool, hybridmap, that automatically partitions user designs to a collection of LUTs and PLAs so that an area-optimized solution is achieved. This involves packing as much logic as possible into available PLAs, thus minimizing required LUTs.
As shown in Fig. 1, contemporary FPGA architectures generally contain highly optimized blocks which include both logic and routing resources. These blocks are replicated in a vendor-specific pattern throughout the device. A coarse-grained structure, such as an embedded PLA, is allocated once per group of fine-grained LUTs. Hierarchical routing resources provide required connectivity among nonadjacent device structures.
Our new hybridmap tool automates PLA logic extraction and subsequent PLA and LUT mapping. Since FPGA devices generally contain proportionally more LUT than PLA resources, design subgraphs, initially targeted at LUTs, must be retargeted at PLAs. As a result, subgraph resource estimation forms a significant part of our approach. The developed system integrates a series of new graph search heuristics with novel cost functions to identify subgraphs quickly. Hybridmap can be run to target two distinct objectives: area minimization without regard to design performance and timing-constrained area minimization. The area to be minimized is defined in terms of the post-mapping $k$-LUT count required to implement the design. When the technology mapping objective is unconstrained area minimization, hybridmap attempts to minimize LUT count by packing as much logic as possible into PLAs. When a timing constraint is specified, hybridmap controls the PLA packing process so that LUT count is minimized subject to prespecified timing constraints.
Hybrid device mapping takes place through a series of interrelated steps. Input to hybridmap is represented as a directed acyclic graph (DAG). Following preprocessing, a breadth-first search algorithm generates logic subgraphs satisfying the input, output, and Pterm constraints of the target PLA. An estimator of the number of Pterms required by a subgraph (Pterm count estimator) determines if a candidate subgraph meets the Pterm constraint of the target PLA. Unlike minimization approaches, such as Espresso [7], the estimator is sufficiently fast to allow embedding within the inner loop of subgraph extraction. Following extraction, candidate PLA subgraphs are ranked based on LUT coverage and mapped to available PLAs. The logic not mapped to PLAs is implemented in LUTs. When mapping under timing constraints, a delay estimator is integrated into the design flow to evaluate the effect of subgraph extraction on design performance.
To illustrate the effectiveness of hybridmap, the tool has been used to map a set of the Microelectronics Center of North Carolina (MCNC) [8] benchmark circuits to hybrid devices. Results were obtained by mapping to Altera's Apex20KE devices [1] for both unconstrained and timing-constrained area minimization. When mapping under timing constraints, hybridmap reduces required LUTs by 8% by mapping covered logic to PLAs. This value increases to 14% if timing constraints are not considered. This allows a larger design to be packed into a specific device or the same design to be packed into a smaller, less costly device.
Section II presents background material for hybrid technology mapping. Section III motivates our approach through the analysis of circuit data. In Section IV, our technology mapping approach for unconstrained mapping is described, while in Section V, timing-constrained mapping is discussed. Details of the target FPGA architecture (the Altera Apex20KE) are described in Section VI. Experimental results obtained by applying hybridmap to a collection of benchmark circuits are presented in Section VII. Finally, Section VIII summarizes our research and outlines directions for future work.
II. BACKGROUND
A. Problem Definition
A hybrid LUT/PLA FPGA device consists of a collection of $k$-input LUTs and multi-input, multi-output PLAs. The Pterm-based PLA resource supports a fixed number of inputs, outputs, and Pterms. For PLAs, the input and output counts define the PLA structural constraint and the Pterm count defines the PLA functional constraint. For a target device containing a fixed number of PLAs, each of which can be configured within these input, output, and Pterm limits, the technology mapping objectives of unconstrained area minimization and timing-constrained area minimization can be defined as follows:
• Unconstrained area minimization. Given an input circuit, locate a mapping to circuit subgraphs and $k$-input LUTs such that the input, output, and logic constraints of the corresponding resources are satisfied, the number of post-PLA mapping LUTs is minimized, and this number is less than the number of LUTs available in the device.
• Timing-constrained area minimization. Given an input circuit, locate a mapping to circuit subgraphs and $k$-input LUTs such that the input, output, and logic constraints of the corresponding resources are satisfied, the number of post-PLA mapping LUTs is minimized, and this number is less than the number of LUTs available in the device. The mapped circuit should operate at a specified minimum clock frequency.
Theorem 1: The unconstrained area minimization problem for hybrid FPGAs containing a bounded number of LUT and PLA resources is NP-complete for general networks.
Proof: It has been shown that the area-optimal technology mapping problem for $k$-bounded networks is NP-complete [9]. The subgraph extraction problem is more general than the $k$-bounded, single-output problem since multi-output subgraphs can consist of a collection of $k$-bounded, single-output networks. Therefore, unconstrained area minimization is NP-complete.
Theorem 2: The timing-constrained area minimization problem for hybrid FPGAs containing a bounded number of LUT and PLA resources is NP-hard for general networks.
Proof: It has been shown that delay-bounded technology mapping for heterogeneous FPGAs with LUTs and memory blocks is NP-hard [10] . Since subgraph implementation in a PLA block is more restrictive than subgraph implementation in a memory block, the timing-constrained area minimization problem for FPGAs with LUT and PLA resources is also NP-hard.
B. Terminology
Input to hybridmap is a combinational circuit represented as a DAG containing combinational nodes and interconnection edges. Each node is a complex gate implementing a local function as a sum-of-products representation of its input signals. For a given node, its fanin is the set of nodes that drive it and its fanout is the set of nodes that it drives. The cone of a node is the set of its transitive fanins. The depth of a node is the length of the longest path, in terms of available LUT resources, from primary inputs to the node. The required signal arrival time is the time point at which a signal is needed at the input or output of a node. The slack value of a node is the difference between the required signal arrival time at the output of the node and the depth of the node. For a specific hybrid device, a subgraph is considered PLA-feasible if the input, output, and Pterm counts of the subgraph satisfy the constraints of the target PLA resource.
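The depth and slack definitions above can be illustrated with a short sketch. It is not taken from hybridmap: it assumes a unit-delay model in which every non-input node adds one level, a hypothetical `fanin` dictionary as the DAG representation, and a required arrival time at every sink equal to the largest depth in the circuit.

```python
from graphlib import TopologicalSorter

def depth_and_slack(fanin):
    """fanin maps each node to the list of nodes that drive it (hypothetical
    format); primary inputs map to an empty list. Unit delay is assumed."""
    order = list(TopologicalSorter(fanin).static_order())  # drivers come first
    fanout = {v: [] for v in order}
    for v, drivers in fanin.items():
        for u in drivers:
            fanout[u].append(v)

    # Depth: longest path from any primary input, one level per node.
    depth = {}
    for v in order:
        drivers = fanin.get(v, [])
        depth[v] = 1 + max(depth[u] for u in drivers) if drivers else 0

    # Required time: sinks must arrive by the largest depth (assumption);
    # every other node must arrive one level before its earliest consumer.
    latest = max(depth.values())
    required = {}
    for v in reversed(order):
        if not fanout[v]:
            required[v] = latest
        else:
            required[v] = min(required[w] - 1 for w in fanout[v])

    slack = {v: required[v] - depth[v] for v in order}
    return depth, slack

circuit = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"], "e": ["a"]}
depth, slack = depth_and_slack(circuit)
```

In this example node `e` sits on a short path and is the only node with positive slack, which is the kind of noncritical logic that a timing-aware flow could move to a slower resource.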
C. Related Work
Logic synthesis targeting FPGAs has been researched extensively and numerous technology mapping approaches for LUT-based FPGAs [11] have been developed. These approaches have two main objectives, area and delay minimization, and can be characterized by input representation. Input network types include tree-type [12], [9], MFFC-type [13], and general networks [14], [15]. To address delay minimization issues, delay models such as the unit-delay model [16], net-delay model [17], and edge-delay model [18] have been proposed.
To date, most research in FPGA technology mapping has focused on FPGAs containing a homogeneous type of resource, although recently, technology mapping algorithms for devices with LUTs of differing input sizes [19], [20] and PLAs have been presented. In [21], a technology mapping algorithm for devices with fixed-input-count, single-output macrocells was presented. The algorithm determines the minimum-height feasible cut for circuit nodes and their cones. A heuristic technique is described that exhaustively enumerates Pterm mapping options and generates area- and delay-efficient design implementations. Another technique [22], based on dag-map [14], minimizes delay for macrocell architecture mapping. Since this architecture contains a homogeneous set of logic resources, it is not necessary to consider tradeoffs between resources with different mapping qualities (e.g., low versus high fanin, logic density) in making mapping decisions.
The introduction of coarse-grained memory elements in commercial devices (e.g., Altera's Flex10K [23] and Xilinx's Virtex [24] ) has motivated graph search approaches that identify suitable logic partitions for implementation in unused block memories. Technology mapping approaches, described in [10] , [25] , and [26] , configure unused embedded memory blocks as large multi-output ROMs to increase device utilization. These memory packing algorithms provide insight into the hybrid LUT/PLA FPGA mapping problem by identifying portions of an input logic design that are appropriate for implementation in a restricted resource. While memory blocks that have not been used to implement memory functions can be leveraged to implement combinational functions with extended logical depth, limited memory-input counts currently restrict the breadth of logic functions that can be implemented. As a result, wide-fanin subcircuits, such as finite-state machines, must be migrated to device LUTs.
Recently, Kaviani [27] investigated both the architectural parameters of hybrid FPGA architectures and supporting technology mapping approaches. The described technology mapping approach [27] for hybrid LUT/PLA architectures applies partial collapsing and partitioning to isolate wide-fanin logic nodes with single outputs. Input sharing is then used to determine the nodes to be merged into a PLA. The target devices in our work contain wide-fanout PLAs, which are structures with more than three outputs, necessitating subgraph-based approaches rather than node-based approaches.
Lin and Wilton [2] developed a technology mapping algorithm targeting hybrid architectures. Input to the PLA-mapping algorithm is first pre-mapped to four-LUTs using LUT technology mapping tools [15]. Subsequently, for each node in the mapped circuit, a local graph search collects transitive fanins of the node such that the overall input, output, and Pterm count of the node set satisfies the PLA constraints. The collected set of nodes is not fanout-free and the intermediate nodes that drive nodes outside of the collected node set are implemented as subgraph outputs. Each subgraph is subsequently mapped to PLAs. The remainder of the design that is not mapped to PLAs is the LUT partition. Since the PLA logic extraction approach is localized, the algorithm is unable to identify and cover reconvergent paths. In this paper, a hill-climbing phase is performed during subgraph generation to cover reconvergent paths with PLA logic. A preliminary version of our hybrid technology mapping approach was previously presented [28]; it did not provide for this extended search or for timing-constrained mapping.
III. METHODOLOGY MOTIVATION
Previous approaches to hybrid mapping have focused on node-collapsing techniques where multiple single-output, fanout-free cones are merged into a multi-output node. This technique can be used effectively to map subcircuits to PLAs with small numbers of outputs [27]. For wide-fanout PLAs, graph-based combinational node search approaches are needed. Subgraph logic, mapped to Pterms, ideally shares a significant number of inputs. Several procedures used by our advanced hybridmap approach are motivated by subgraph and design statistics.
As the number of PLA outputs increases, it would seem likely that opportunities for Pterm input sharing across outputs would also increase. The solid plot in Fig. 2 indicates that this is indeed the case. The figure shows that as the number of outputs per PLA grows, the average number of Pterm-based subgraph outputs driven by each input increases. Data points for the graph were collected from 1000 subgraphs extracted from a collection of MCNC [8] benchmarks discussed in Section VII. For low PLA output count values, little input sharing is present, but as the number of PLA outputs increases, sharing becomes more prevalent. Our experiments show that circuitry targeted to PLAs should be identified via subgraph identification rather than first isolating single-output logic cones followed by cone merging to obtain multiple outputs. The dashed plot in Fig. 2 furthers this assessment. The plot shows the average number of outputs driven by each Pterm in a subgraph grows as a function of subgraph output count. Shared output values greater than one indicate Pterm sharing among multiple outputs.
Unlike subgraph input and output counts, subgraph Pterm counts can be difficult to estimate quickly. Subgraphs with Pterm counts which exceed PLA constraints can be eliminated from mapping consideration. Fig. 3 shows the number of subgraphs requiring a given number of Pterms prior to and after subgraph logic minimization. A total of 1800 subgraphs were considered where each of the subgraphs contains a maximum of 32 inputs and 16 outputs to match a target PLA architecture of 32 inputs and 16 outputs. The figure illustrates that the number of Pterm-feasible subgraphs identified prior to logic minimization is relatively small (about 28%). As shown in Fig. 3, experimentation with Espresso [7] indicates that 10% more subgraphs become PLA-feasible following logic minimization. A larger pool of PLA-feasible subgraphs provides greater flexibility in choosing PLA partitions that provide a maximal reduction in post-mapped LUT count. This finding motivates our development of a fast Pterm count estimator for subgraphs which can estimate post-minimization Pterm counts without performing exhaustive logic minimization.
IV. METHODOLOGY FOR UNCONSTRAINED AREA MINIMIZATION
A. Software Overview
As shown in Fig. 4 , hybridmap, targeting area minimization without timing constraints, uses a collection of heuristics to perform hybrid technology mapping. Input circuits are represented in gate-level form (.blif format) as a forest of trees composed of combinational nodes. Each node in a tree represents a logic function in sum-of-products form. Specific processing steps include the following.
• Design preprocessing. Input circuit logic is initially reduced using SIS [29] technology independent optimization scripts script.algebraic or script.rugged. Resulting complex gates are then decomposed to two-input gate form using Huffman-tree decomposition script dmig [15] to facilitate high density PLA packing.
• LUT identification and subgraph extraction and merging. Hybridmap performs cone-based clustering on the two-input gate representation to heuristically identify $k$-input, one-output LUTs. Following LUT identification, subgraph generation builds an initial set of PLA-feasible subgraphs and subgraph merging combines subgraphs to maximize logic coverage in PLAs. These phases integrate a subgraph search algorithm, which identifies reconvergent paths, and new Pterm count and area estimators to quickly evaluate subgraph fitness.
• Subgraph selection. After completing subgraph generation and merging, subgraphs are ranked based on the number of $k$-LUTs covered by each subgraph. The subgraphs with the highest rank are packed into the available PLAs while the rest of the user design is marked as the $k$-LUT partition.
• Vendor-specific computer-aided design (CAD). The design partitions are output in hierarchical (two-level) VHSIC hardware description language format and presented to vendor synthesis and place and route tools for mapping to the target architecture.
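The subgraph selection step in the list above can be sketched as a simple greedy ranking. This is an illustrative sketch only: the tuple layout, the function name `select_subgraphs`, and the overlap check are assumptions rather than details of hybridmap.

```python
def select_subgraphs(candidates, num_plas):
    """candidates: list of (name, lut_coverage, node_set) tuples (hypothetical
    format). Returns the names packed into PLAs and the nodes they cover."""
    # Rank candidates by the number of LUTs each one would cover.
    ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
    chosen, covered = [], set()
    for name, _coverage, nodes in ranked:
        if len(chosen) == num_plas:
            break
        if covered & nodes:
            continue  # safeguard: keep PLA partitions mutually exclusive
        chosen.append(name)
        covered |= nodes
    return chosen, covered

candidates = [("s1", 5, {1, 2, 3}), ("s2", 7, {3, 4}), ("s3", 2, {5})]
chosen, covered = select_subgraphs(candidates, num_plas=2)
```

Any node left outside `covered` would belong to the LUT partition.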
B. LUT Identification
After technology independent optimization and decomposition, hybridmap combines sets of two-input gates into $k$-input, one-output nodes using a dual-phase approach from dag-map [14]. A depth assignment phase identifies the depth of each graph node in terms of $k$-LUT delay values. This assignment is made via a search performed in topological order from circuit primary inputs to primary outputs [14]. Following depth assignment, nodes are clustered based on their node depth values. Clusters are formed via a search performed from circuit primary outputs to primary inputs [14]. During subsequent subgraph generation, the clusters are used for area and depth estimation purposes only. The original, unclustered graph is used for the subgraph search.
C. Subgraph Generation
The subgraph generation process identifies input- and output-feasible subgraphs from the input graph using a localized graph search approach. Input-output (I/O) feasible subgraphs are combinations of nodes which meet the structural constraints of the PLA. For a selected node in the graph, a forward traversal phase collects a tree consisting of the node's transitive fanouts and identifies a set of nodes driving no more leaves than the available output count of the PLA resource. During a second phase, an inverse search traverses transitive fanin nodes of these leaves to identify subgraph input signals and the nodes that can be absorbed into the PLA. Following each subgraph identification, a Pterm count estimator evaluates the number of Pterms required to implement the subgraph.
1) Basic Approach: As shown in Fig. 5, a subgraph is identified by starting at a seed node, selected in topological order from primary inputs to primary outputs. During a forward traversal phase, starting from the seed node, its transitive fanouts are visited iteratively in a breadth-first fashion to identify a tree rooted at the seed node. At each iteration step, a new set of tree leaves is identified. In Fig. 5(a), for a selected node, the shaded nodes form the new leaves after the second iteration. New leaves are added in a breadth-first fashion until their number exceeds the output constraints of the target PLA. At the end of the forward search, the leaves of the tree are designated as the root set of the traversal. In Fig. 5(a), the shaded leaf nodes form the root set of the traversal starting at the selected node.
Once the root set has been determined, an inverse traversal of the graph iteratively collects transitive fanins of the root set in a breadth-first fashion to determine the subgraph associated with the root set. At each iterative step in the basic approach, fanin nodes to the root set are included only if their fanout is limited to the current subgraph. Collection of fanin nodes continues until the subgraph input count exceeds the input count of the PLA resource. As shown in Fig. 5(b), subgraphs constructed in this way that meet the allowed Pterm count can be targeted to a PLA since their input and output counts are guaranteed to fit PLA constraints. In order to avoid logic duplication and to improve the runtime, seed nodes for subgraph generation are selected only from the nodes that are not already covered by any other subgraph. Additionally, the subgraphs are mutually exclusive and include only those nodes not covered by other subgraphs. Pseudocode for the subgraph generation algorithm is shown in Fig. 6.
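The two traversal phases of the basic approach can be sketched as follows. This is a simplified illustration and not the pseudocode of Fig. 6: the `fanin` dictionary format and all function names are hypothetical, Pterm checking and the exclusion of already covered nodes are omitted, and leaves without fanout are simply kept in the root set.

```python
def build_fanout(fanin):
    """Invert a node -> drivers map into a node -> driven-nodes map."""
    fanout = {}
    for v, drivers in fanin.items():
        for u in drivers:
            fanout.setdefault(u, []).append(v)
    return fanout

def forward_root_set(seed, fanout, max_outputs):
    """Expand the seed's fanouts level by level; return the last leaf set
    that does not exceed the PLA output count."""
    leaves = [seed]
    while True:
        nxt = []
        for v in leaves:
            for w in fanout.get(v, []) or [v]:  # leaves without fanout stay
                if w not in nxt:
                    nxt.append(w)
        if len(nxt) > max_outputs or set(nxt) == set(leaves):
            return set(leaves)
        leaves = nxt

def input_signals(nodes, fanin):
    """Signals that drive the subgraph from outside."""
    return {u for v in nodes for u in fanin.get(v, []) if u not in nodes}

def inverse_collect(roots, fanin, fanout, max_inputs):
    """Absorb fanin nodes whose fanout stays inside the subgraph, one level
    at a time, until the PLA input count would be exceeded."""
    sub = set(roots)
    while True:
        absorb = {u for u in input_signals(sub, fanin)
                  if fanin.get(u) and set(fanout.get(u, [])) <= sub}
        if not absorb:
            return sub
        grown = sub | absorb
        if len(input_signals(grown, fanin)) > max_inputs:
            return sub
        sub = grown

def basic_subgraph(seed, fanin, max_inputs, max_outputs):
    fanout = build_fanout(fanin)
    roots = forward_root_set(seed, fanout, max_outputs)
    nodes = inverse_collect(roots, fanin, fanout, max_inputs)
    inputs = input_signals(nodes, fanin)
    if len(inputs) > max_inputs:
        return None  # not I/O feasible
    return nodes, inputs, roots

fanin = {"g1": ["i0", "i1"], "g2": ["i2", "i3"], "g3": ["g1", "g2"],
         "o1": ["g3"], "o2": ["g3"]}
nodes, inputs, roots = basic_subgraph("g1", fanin, max_inputs=4, max_outputs=2)
```

With a four-input, two-output limit the whole example circuit is absorbed; lowering `max_inputs` to three stops the inverse traversal one level earlier.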
2) Hill Climbing: Taking Advantage of Reconvergent Paths: The basic subgraph generation approach can be augmented to take advantage of the reconvergent nature of some logic. Reconvergent paths originate either as a design artifact or due to recursive logic decomposition and resubstitution operations performed during the technology independent optimization process. These paths begin from a set of graph nodes, diverge to drive inputs of multiple nodes, and converge down to a set of nodes. Fig. 7 shows a circuit where output signals from four gates diverge as inputs to multiple gates before converging into two outputs. These reconvergent paths could be covered with a single PLA subgraph, boosting LUT savings. Alternative PLA covering approaches [2], [22], [27] do not identify reconvergent paths that exceed PLA I/O limits. In our approach, a hill-climbing phase extends forward and inverse traversal beyond the point where PLA input and output limits are exceeded with the expectation that the subgraph will reconverge.
The modified graph search approach starts in the basic subgraph generation mode. During either the forward or inverse traversal phase of the search, the first instance of a PLA output or input constraint violation forces the graph search into the hill-climbing mode. If, at a further point in the graph search, the I/O counts of the subgraph meet PLA input and output constraints, the graph search switches back to the basic approach of subgraph generation. This second phase of the basic subgraph generation algorithm terminates whenever PLA input or output count violations are observed. Fig. 7 shows the hill-climbing phase applied during the forward traversal phase when targeting a four-input, two-output PLA. During experimentation, hill climbing was found to improve LUT coverage versus the basic approach for about 10% of searches.
3) Subgraph Pruning: If the hill-climbing procedure never finds input or output counts that meet the PLA constraints, it will terminate upon reaching a fixed search depth or circuit primary inputs, primary outputs, or flip-flop inputs. As a result, subgraphs generated using hill climbing may violate PLA input, output, or Pterm constraints. These subgraphs can be pruned to fit in a PLA by iteratively removing excess inputs and outputs. As an initial step of pruning, subgraph logic is collapsed to a two-level representation (sum-of-products form) and subgraph outputs are ranked in nondecreasing order by input requirements. Each output whose input count fits within a single $k$-LUT can be implemented using one $k$-LUT. As this represents a minimal penalty in terms of LUT coverage, logic and inputs solely associated with these outputs are removed first from the subgraph. This is followed by minimal multi-LUT removal, if necessary, until structural constraints are met.
The approach is optimal when the paths that are followed in determining a subgraph are the only possible paths.
Lemma 1: In the worst case, subgraph generation is performed over all nodes and each node is visited during each iteration of subgraph generation. This search results in a time complexity that is quadratic in the number of graph nodes.
D. Product Term Count Estimation
During the subgraph generation process, each subgraph satisfying the input and output constraints of the PLA is evaluated as a potential candidate for PLA mapping. Subgraphs are collapsed into two-level form for direct Pterm count verification against the PLA Pterm count constraint.
Since Pterm count must be evaluated for each subgraph, the Pterm count estimator runtime impacts the usability of the estimator. Specifically, the estimator needs to satisfy the following requirements: 1) it has to be fast, requiring runtimes well under a second and 2) it only needs to verify that the post-minimization Pterm count of a subgraph is less than that allowed by the PLA. In order to quickly determine the post-minimization Pterm count, three Pterm count estimators were considered: a statistical technique, Espresso [7] , and a new Pterm estimator.
1) Statistical Approaches: The statistical estimator attempts to predict the post-minimization Pterm count based on initial preminimization subgraph Pterm counts. To explore the usefulness of statistical data, 1000 wide-fanin, wide-fanout subgraphs extracted from the MCNC [8] benchmark circuits described in Section VII were passed through the logic minimization tool, Espresso. It was observed that for a given input and output count the average Pterm count requirement per subgraph prior to and after minimization was 50 with a standard deviation of 10 and 29 with a standard deviation of 9, respectively. For the above-mentioned subgraphs, the most important metric, the Pterm count reduction per subgraph, had a standard deviation of 10, making post-minimization Pterm count prediction unreliable.
2) Espresso: Espresso finds a minimal Pterm count cover for each subgraph output in an iterative manner. During each step, each Pterm (cube) is expanded to explore the possibility of covering other Pterms (cubes) in the representation. The covered Pterms are subsequently removed. Exhaustive enumeration steps and the exact minimization operations in Espresso are characterized by long runtimes (e.g., 10 min) for wide-fanin, wide-fanout subgraphs. The optimal Espresso option opoall attempts to minimize the Pterm count by exploring all $2^n$ output phase patterns for an $n$-output subgraph. The fastest multi-output minimization option in Espresso, opo, minimizes the function together with its set of complemented functions and selects the minimal form.
3) New Estimation Heuristic: Our new Pterm count estimator determines Pterm count by attempting to cover a cube by a maximum of two other cubes. Fig. 8(a) -(c) demonstrate the minimization steps carried out by the estimator. The input to the estimator is a two-level representation of the subgraph logic functions presented in SIS PLA format [29] . Each row indicates a Pterm (or cube) with true (1), complemented (0), or don't care (-) conditions for the input signals and each output column represents a single output. When performing estimation, only a single cube literal is expanded at a time. Minimization operations such as Pterm covering, sharing, and input expansion are applied incrementally to reduce Pterm count. In Fig. 8(a) , for the given output, the first and the second Pterm differ only at the first input. As a result, the first Pterm covers the second one. Similarly, the first Pterm covers the third Pterm, reducing the overall Pterm count to one. Fig. 8(b) illustrates the covering of a single Pterm by two other Pterms. An initial Pterm count of three is required by the representation shown in the Before Expansion column. The second Pterm under this column can be expanded at the fourth input to form the middle two cubes under the After Expansion column. The first two Pterms under the After Expansion column can then be merged to produce the representation shown under the Intermediate Result column in Fig. 8(c) . The final representation is obtained by merging the second and third Pterms in the Intermediate Result. As seen under the Final Result column, the Pterm count is reduced to two.
For our new estimation approach, only one phase pattern per output is considered for a multi-output function, corresponding to complementing one output at a time. Each output is complemented starting from the one driven by the most Pterms and ending with the one driven by the fewest Pterms. An example of output complementation is shown in Fig. 8(d). By choosing to complement only the first output, the logic expressed by the first three Pterms is now covered by the fourth. The phase information below the truth table indicates whether the outputs are represented in true (1) or complemented (0) form.
Lemma 2: Given that each Pterm must be evaluated against every Pterm, the time complexity per subgraph of the estimator is quadratic in the initial number of subgraph Pterms.
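The flavor of incremental Pterm reduction can be illustrated with a deliberately simplified sketch for a single output. It is not the hybridmap estimator: it only removes cubes covered by another cube and merges pairs of cubes that differ in one opposite literal, and it omits single-literal expansion, two-cube covering, and output complementation. The cube strings use 1, 0, and - as in the PLA format described above; the function names are hypothetical.

```python
def covers(a, b):
    """True if cube a covers cube b: each literal of a is '-' or matches b."""
    return all(x == "-" or x == y for x, y in zip(a, b))

def merge(a, b):
    """Merge two cubes that differ in exactly one position holding opposite
    literals (0 versus 1); return None if they cannot be merged."""
    diff = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if len(diff) == 1 and "-" not in (a[diff[0]], b[diff[0]]):
        i = diff[0]
        return a[:i] + "-" + a[i + 1:]
    return None

def reduce_pterms(cubes):
    """Repeat covering and pairwise merging until nothing changes."""
    cubes = list(dict.fromkeys(cubes))  # drop duplicates, keep order
    changed = True
    while changed:
        changed = False
        kept = [c for c in cubes
                if not any(d != c and covers(d, c) for d in cubes)]
        if len(kept) != len(cubes):
            cubes, changed = kept, True
            continue
        for i in range(len(cubes)):
            for j in range(i + 1, len(cubes)):
                merged = merge(cubes[i], cubes[j])
                if merged is not None:
                    rest = [c for k, c in enumerate(cubes) if k not in (i, j)]
                    cubes = list(dict.fromkeys(rest + [merged]))
                    changed = True
                    break
            if changed:
                break
    return cubes

covered_case = reduce_pterms(["-11", "011", "111"])
merged_case = reduce_pterms(["110", "111", "0-1"])
```

Every pass compares cubes pairwise, which mirrors the pairwise evaluation noted in Lemma 2, and each change removes at least one cube, so the loop terminates.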
4) Estimation Comparison: In order to evaluate the efficiency of the Pterm estimator used in hybridmap, wide-fanin and wide-fanout subgraphs were extracted from the MCNC benchmarks [8] listed in Table I (15 subgraphs per benchmark). Subsequently, minimization results were obtained using Espresso (with options opoall and opo) and the new Pterm count estimator. As can be seen from Table I, although opoall generates exact post-minimized Pterm counts, its runtime is prohibitively large to be of use. The option opo achieves 99% accuracy in post-minimization Pterm count estimation in less than 1% of the runtime of opoall. The Pterm estimator is the fastest of all the approaches, generating a result that is 99% accurate compared to the option opoall in about 8% of the time required by the option opo.
E. Area Estimation
To pack PLAs with subgraphs leading to maximal LUT count reduction, candidate subgraphs are ranked based on their LUT coverage. Our area estimator determines post-mapping LUT reduction due to each subgraph based on the following considerations: 1) each primary output (PO) or flip-flop input in the input graph is an output of a LUT or a PLA and 2) except for primary inputs, each LUT or PLA input is also a LUT or PLA output. The LUT identification process in Section IV-B iteratively computes the minimal LUT depth of each node to identify LUT boundaries. A change in LUT depth along an input-output path implies introduction of a new LUT. Nodes with the same depth can be collapsed into a single LUT subject to LUT I/O constraints. The area estimator uses the following strategy: a LUT output is identified at locations where the depth of a node is less than the depth of at least one of its fanout nodes. This test for a node to be the output of a LUT is referred to below as condition (1). When counting the number of LUTs covered by a subgraph, each intermediate node satisfying (1) is counted as the output of a $k$-LUT and the LUT count is incremented by one. Additionally, the input side boundary of the subgraph is evaluated for any LUT count penalty. If a subgraph input node is not already the output of an existing LUT cluster, an extra LUT that generates the input signal is required. For every input node that does not satisfy (1), the LUT count covered by the subgraph is decremented by one to account for the increase in post-mapping LUT count. Fig. 9(a) shows a circuit covered by nine three-LUT clusters. A set of intermediate nodes, together with the output nodes, covers a subgraph with inputs I0, I1, I2, I3, and I4 and outputs O1 and O2. The numbers shown next to each node indicate the depth of the node. When the subgraph is mapped to a PLA, as shown in Fig. 9(b), a total of seven three-LUTs are covered collectively by the intermediate nodes and O1, O2. Inputs I0 and I2 need to be generated to drive the PLA and, hence, the subgraph LUT coverage is decremented by two. As a result, the three-LUT coverage by the subgraph is five. The final LUT count is obtained by subtracting this coverage from the original three-LUT count, which is equal to the post-mapping three-LUT count shown in Fig. 9(b).
Lemma 3: The time complexity of each invocation of the area estimator is linear in the number of graph nodes since in the worst case all nodes could be examined during each invocation.
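The counting strategy of the area estimator can be sketched as follows. This is a minimal illustration under stated assumptions: depths are taken as given from LUT identification, primary outputs are treated as LUT outputs in line with consideration 1) above, primary inputs carry no penalty, and all names and data layouts are hypothetical.

```python
def is_lut_output(v, depth, fanout, primary_outputs):
    """A node is counted as a LUT output if it is a primary output
    (assumption) or its depth is less than that of some fanout node."""
    if v in primary_outputs:
        return True
    return any(depth[v] < depth[w] for w in fanout.get(v, []))

def lut_coverage(nodes, inputs, depth, fanout, primary_inputs, primary_outputs):
    """Estimated number of LUTs removed by mapping `nodes` to a PLA."""
    # One LUT is counted for every subgraph node that is a LUT output.
    covered = sum(is_lut_output(v, depth, fanout, primary_outputs)
                  for v in nodes)
    # One LUT is charged for each subgraph input that is not a primary input
    # and is not already produced at a LUT boundary.
    penalty = sum(1 for u in inputs
                  if u not in primary_inputs
                  and not is_lut_output(u, depth, fanout, primary_outputs))
    return covered - penalty

depth = {"i0": 0, "a": 1, "b": 1, "c": 2, "o": 2}
fanout = {"i0": ["a"], "a": ["b"], "b": ["c"], "c": ["o"], "o": []}
coverage = lut_coverage(nodes={"b", "c", "o"}, inputs={"a"}, depth=depth,
                        fanout=fanout, primary_inputs={"i0"},
                        primary_outputs={"o"})
```

In this small chain, nodes `b` and `o` are LUT outputs, while input `a` shares a depth with `b` and therefore costs one extra LUT, giving a net coverage of one.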
Post-processing by LUT mapping tools, such as Flowmap [15], packs LUTs densely, reducing final LUT counts by about 15% on average. Since both subgraphs and circuit graphs experience the same percentage reduction during post-processing, the area estimator computes the percentage of LUT count covered by each subgraph.
F. Subgraph Merging and Ranking
Following generation, smaller subgraphs can be bin-packed based on input sharing to construct combined implementations that meet PLA input, output, and Pterm requirements. The general gain function for merging two subgraphs $S_1$ and $S_2$ is given as $\mathrm{Gain}(S_1, S_2) = \mathrm{feas}(S_1, S_2) \times \mathrm{Area}(S_1 \cup S_2)$ (2), where Area is the estimate of the $K$-LUT count (obtained as described in Section IV-E) covered by the subgraphs considered for merging. A merged subgraph is judged to be PLA-feasible ($\mathrm{feas} = 1$) if input, output, and post-minimization Pterm counts meet PLA requirements. Violation of PLA constraints sets $\mathrm{feas}$ to zero. Although input and output limits can be evaluated by simple counting, Pterm count verification may require additional invocation of the Pterm count estimator. Given target PLAs, following merging, the feasible subgraphs that 
V. TIMING-CONSTRAINED AREA MINIMIZATION
Timing-constrained area minimization is invoked when a minimum design clock frequency is specified in conjunction with the design. This timing constraint must be met by the delay of the longest combinational path in the circuit when mapped to a LUT-based device. For the designs evaluated in this paper, it is determined that LUT-only mappings achieve the desired minimum frequency and that logic migration to PLAs will maintain rather than improve design performance. Since wide-fanin, wide-fanout PLAs typically incur longer delays than LUTs (e.g., in Apex20KE), hybridmap maps noncritical paths of the original design to PLAs and the remainder to LUTs, keeping the delay of the longest combinational path within specified limits achieved by LUT-only mappings.
The basic flow of timing-constrained mapping is shown in Fig. 10. Steps that have been added to the unconstrained flow, shown in Fig. 4, appear as darkly shaded blocks. The main additions include the following.
• Delay estimation. Timing-constrained mapping requires accurate, iterative evaluation of mapped-circuit performance. Our delay estimator uses LUT packing information to compute the arrival time and delay-slack value of each node in the circuit and the largest combinational delay in the circuit. Logic and estimated routing values are used to approximate mapped-circuit performance.
• Iterative timing-constrained subgraph generation and selection. Following delay estimation, an iterative mapping process is started to partially transfer design logic to PLAs. During each iteration step, subgraphs suitable for PLA implementation are extracted along noncritical paths. After PLA-feasible subgraphs are identified, the highest ranking subgraph is packed into an available PLA and circuit delay is updated to account for the delay perturbation due to the included PLA. If fewer than the available number of PLAs have been packed, the next iteration of subgraph generation and selection is started. The iterative search process continues until no additional PLA resources are available.
Each of the steps is integrated into the hybridmap flow based on user-preference and clock-period specification.
A. Delay Estimation
The goal of the delay estimator is to compute the design critical path and delay slack values in terms of logic and estimated routing delays. The delay estimation process is composed of delay tracing and delay update phases. The delay tracing procedure [30] computes required signal arrival time and slack values associated with each LUT and any subgraph supernode in a partially mapped design. Each circuit node is assigned a required signal arrival time and slack value equal to that of the associated LUT or supernode. Once an available PLA is packed, the delay updating procedure performs LUT reclustering and design depth value update so that remaining PLAs can be packed.
1) Delay Tracing the Network:
The delay tracing procedure uses depth information to identify noncritical paths in the design. For each LUT cluster, the required signal arrival time (RT) is computed at each output and input. Subsequently, the slack value (SV) at each cluster output is computed. A detailed example of circuit delay tracing is presented in [30]. Delay tracing has previously been used for LUT-based technology mapping [13] and technology mapping for FPGAs with LUTs and memory blocks [25].
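The following Python sketch illustrates required-time and slack computation on a small clustered network. It uses the textbook static timing formulation (forward arrival times, backward required times, slack as their difference), which is assumed here to correspond to the delay tracing of [30]; the function name, the graph encoding, and the example clusters are hypothetical.

```python
from graphlib import TopologicalSorter


def trace_delays(fanins, delay):
    """Return (arrival, required, slack, critical_delay) for a DAG.

    fanins: cluster -> list of driving clusters or primary inputs.
    delay:  cluster -> delay of that cluster (missing entries count as 0).
    """
    order = list(TopologicalSorter(fanins).static_order())  # drivers first

    # Forward pass: latest arrival time at each cluster output.
    arrival = {}
    for n in order:
        latest_in = max((arrival[p] for p in fanins.get(n, ())), default=0)
        arrival[n] = latest_in + delay.get(n, 0)
    critical = max(arrival.values())

    fanouts = {n: [] for n in order}
    for n, preds in fanins.items():
        for p in preds:
            fanouts[p].append(n)

    # Backward pass: required time (RT), with sinks required at the critical delay.
    required = {}
    for n in reversed(order):
        if not fanouts[n]:
            required[n] = critical
        else:
            required[n] = min(required[f] - delay.get(f, 0) for f in fanouts[n])

    # Slack value (SV): zero on critical paths, positive elsewhere.
    slack = {n: required[n] - arrival[n] for n in order}
    return arrival, required, slack, critical


# Hypothetical network of four unit-delay LUT clusters.
fanins = {"L1": ["a", "b"], "L2": ["L1", "c"], "L4": ["c"], "L3": ["L2", "L4"]}
delay = {"L1": 1, "L2": 1, "L3": 1, "L4": 1}
arrival, required, slack, critical = trace_delays(fanins, delay)
print(critical, slack)
```

Clusters with positive slack, such as L4 in this example, lie on noncritical paths and are the candidates considered for migration to PLAs.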
2) Reclustering and Updating the Circuit Delay: Prior to packing an available PLA, the depth of each circuit node is computed in terms of LUTs. Once a PLA is packed, the circuit delay values change along the paths through the PLA, necessitating circuit delay updates. The PLA subgraph is initially collapsed into a multi-input, multi-output supernode. The remaining circuit nodes are subsequently reclustered around the supernodes. Fig. 11 shows the original circuit covered by LUT clusters and a PLA subgraph collapsed to form a supernode.
When updating circuit delay values, LUT clusters are reconstructed with respect to the I/O boundaries of supernodes. The delay updating procedure progresses in two phases. During the first phase, the LUT-clustering procedure described in Section IV-B considers all the nodes in the circuit that are not along the paths through the supernodes. Depth values are computed for the selected nodes, which are then grouped to form $K$-LUT clusters. Fig. 11 shows LUT clustering for such selected nodes. During the second phase of the delay updating procedure, the depths along paths through the supernodes are computed using the approach described in Section IV-B. For every supernode $S$, the depth at an output $o$ of $S$ is computed as the sum of the largest depth among the inputs that drive $o$ and $D_{PLA}$, where $D_{PLA}$ is the delay of the PLA resource, that is, $d(o) = \max_{i \in \mathrm{inputs}(o)} d(i) + D_{PLA}$. In Fig. 11, the depths at the supernode outputs O1 and O2 are computed in this way. The delay values computed at the outputs of each supernode can subsequently be used to compute the delay of the nodes driven by the supernodes.
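The second-phase depth update at a supernode can be illustrated with a short Python sketch. The function name, the data layout, and the numeric values are hypothetical and are not taken from Fig. 11; the sketch only shows the rule that an output depth equals the largest depth among its driving inputs plus the PLA delay.

```python
def supernode_output_depths(output_support, depth, pla_delay):
    """Depth at each supernode output: deepest driving input plus PLA delay.

    output_support: supernode output -> supernode inputs that drive it.
    depth:          depth (in LUT levels) already computed for each input.
    pla_delay:      PLA delay expressed in the same units as depth.
    """
    return {
        out: max(depth[i] for i in inputs) + pla_delay
        for out, inputs in output_support.items()
    }


# Hypothetical supernode with three inputs and two outputs.
input_depth = {"I0": 1, "I1": 2, "I2": 0}
support = {"O1": ["I0", "I1"], "O2": ["I2"]}
out_depth = supernode_output_depths(support, input_depth, pla_delay=3)
print(out_depth)
```

The resulting output depths then seed the depth computation for the LUT clusters driven by the supernode.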
B. Timing-Constrained Subgraph Generation
The timing-constrained subgraph generation process is based on the basic subgraph extraction approach described in Section IV-C. The timing-constrained approach searches for subgraphs along noncritical design paths since PLAs inserted along these paths have a higher probability of maintaining LUT-only timing performance. This search does not guarantee LUT-only performance since subsequent routing congestion may lengthen path route lengths, but, as shown in Section VII, in almost all experimental cases, performance is maintained. To control delay, subgraphs are collected only along paths that contain a minimum slack value of $SV_{min}$. In the likely event that a subgraph search encounters a node with slack value less than $SV_{min}$, the subgraph generation process terminates further searches along paths through that node. Thus, during the forward search phase of subgraph extraction, an encountered node whose fanout has a slack value less than $SV_{min}$ is automatically considered a member of the subgraph root set. Similarly, during the backward search, any encountered node whose slack value is less than $SV_{min}$ is automatically considered an input to the subgraph. These cases are shown in Fig. 12 for a subgraph search starting from a seed node. If the PLA delay ($D_{PLA}$) is four and the LUT delay ($D_{LUT}$) is one, then $SV_{min} = D_{PLA} - D_{LUT} = 3$. As shown in Fig. 12(a), during the forward graph search, a node is considered a member of the root set since the slack value of its fanout is less than 3. The darkly shaded nodes belong to the root set and the lightly shaded nodes belong to the intermediate nodes of the subgraph. Similarly, as shown in Fig. 12(b), during the backward graph search, a node is considered an input to the subgraph since its slack value is less than 3. In Fig. 12(b), the unshaded nodes and signal represent the inputs to the subgraph.
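A simplified Python sketch of the slack-gated search follows. It is an illustration only: the function names, graph encoding, and example slack values are hypothetical, the comparison against the threshold is assumed to be strict, and the PLA input, output, and Pterm limits that the real extraction also enforces are omitted. The threshold of 3 matches the value used in the Fig. 12 discussion.

```python
from collections import deque


def forward_search(seed, fanouts, slack, min_slack):
    """Grow a subgraph toward the outputs, stopping at low-slack fanouts."""
    members, roots = {seed}, set()
    queue = deque([seed])
    while queue:
        n = queue.popleft()
        outs = fanouts.get(n, ())
        # A node with no fanout, or with a fanout below the slack threshold,
        # becomes a subgraph root and is not expanded further.
        if not outs or any(slack[f] < min_slack for f in outs):
            roots.add(n)
            continue
        for f in outs:
            if f not in members:
                members.add(f)
                queue.append(f)
    return members, roots


def backward_search(roots, fanins, slack, min_slack):
    """Collect the cone feeding the roots, cutting at low-slack nodes."""
    nodes, inputs = set(roots), set()
    stack = list(roots)
    while stack:
        n = stack.pop()
        for p in fanins.get(n, ()):
            if p in nodes or p in inputs:
                continue
            # Low-slack nodes and primary inputs become subgraph inputs.
            if slack[p] < min_slack or not fanins.get(p):
                inputs.add(p)
            else:
                nodes.add(p)
                stack.append(p)
    return nodes, inputs


# Hypothetical circuit: a -> n1 -> n2 -> n3 -> n4, with c also driving n2.
fanouts = {"a": ["n1"], "c": ["n2"], "n1": ["n2"], "n2": ["n3"], "n3": ["n4"]}
fanins = {"n1": ["a"], "n2": ["n1", "c"], "n3": ["n2"], "n4": ["n3"]}
slack = {"a": 5, "c": 1, "n1": 5, "n2": 5, "n3": 5, "n4": 2}

members, roots = forward_search("n1", fanouts, slack, min_slack=3)
nodes, inputs = backward_search(roots, fanins, slack, min_slack=3)
print(sorted(roots), sorted(nodes), sorted(inputs))
```

In the example, n3 becomes a root because its fanout n4 has a slack of 2, and c becomes a subgraph input because its own slack of 1 falls below the threshold.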
VI. TECHNOLOGY MAPPING TO THE APEX20KE
A. Apex20KE Architecture
To illustrate the benefit of hybridmap, our new technology mapping approach was targeted to Altera Apex20KE devices [1]. This hybrid FPGA architecture contains embedded Pterm blocks with 32 inputs, 16 outputs, and 32 Pterms [4]. As shown in Fig. 13, each Pterm block is composed of macrocell structures which can be fed with any combination of the 32 input signals of either polarity. Each macrocell of an Apex20KE Pterm block can drive either a macrocell output or a neighboring macrocell, but not both. As a result, the macrocell architecture does not allow sharing of Pterms/sum of Pterms across multiple outputs. Inputs from neighboring macrocells (parallel expanders) are utilized whenever it is necessary to use more than one macrocell to implement a selected output.
B. Unconstrained Area Minimization
As each vendor PLA architecture differs greatly, some adjustments to our basic mapping algorithms are required to achieve the best possible PLA mapping results. For example, the Apex20KE device allows for the programmable inversion of an AND gate output. The inverter can be programmably set to merge multiple single-input Pterms into one multi-input Pterm using DeMorgan rules (e.g., $a + b + c = \overline{\bar{a}\,\bar{b}\,\bar{c}}$). Our Pterm count estimator was modified to consider this architectural feature. The average difference in Pterm reduction predicted by the estimator improved by about 5% when specific Apex architectural features were taken into account.
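The DeMorgan merge can be checked exhaustively with a few lines of Python. The three-variable function and the helper names are hypothetical choices made for illustration; the sketch only confirms that an OR of three single-literal Pterms equals one three-input Pterm of complemented literals followed by an inversion, which reduces the Pterm count from three to one.

```python
from itertools import product


def or_of_single_literal_pterms(bits):
    """Three single-input Pterms ORed together: a + b + c."""
    return any(bits)


def inverted_multi_input_pterm(bits):
    """One Pterm a'b'c' followed by the programmable output inverter."""
    return not all(not b for b in bits)


# Compare both forms over the full truth table.
equivalent = all(
    or_of_single_literal_pterms(bits) == inverted_multi_input_pterm(bits)
    for bits in product([False, True], repeat=3)
)
print(equivalent)
```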
C. Timing-Constrained Area Minimization
The implementation of logic functions with greater than two product terms requires special processing to meet timing constraints. The delay equation for a function requiring $n$ macrocells is $D = D_{mc} + (n - 1) D_{pe}$, where $D_{mc}$ and $D_{pe}$ are the delays of a macrocell and parallel expander, respectively. For the Apex20KE, the use of parallel expanders results in a significant delay for wide Pterm functions. For functions requiring more than four Pterms, Pterm outputs can be combined using four-LUTs. This approach also allows for function complement generation that can be used during minimization. The added LUTs lead to a minimal reduction in LUT coverage.
VII. RESULTS
To evaluate the performance of hybridmap, the MCNC benchmarks [8] listed in Table II were mapped to hybrid FPGA architectures. Unless specified otherwise, the target hybrid architecture is the Altera Apex20KE [1] . Results were obtained on a 386-MHz Celeron-based PC containing 128-MB RAM. All experiments, unless noted, involved the use of the Pterm count estimator.
The first experiment conducted with our system evaluated the number of LUTs that could be absorbed into PLAs for a typical design without timing constraints (unconstrained). Two initial representations of input circuits were considered: designs reduced to two-input gates and designs preclustered to four-input nodes (LUTs). Previous embedded memory mapping approaches [26] have used four-LUTs as the basis for subgraph search. For our system, both two-input gate representations and four-bounded representations are supported. For the benchmarks listed in Table II, two separate input graph representations were constructed. For the two-bounded case, each input netlist was optimized with SIS (script.algebraic) and dmig [15] to create two-bounded nodes. For the four-bounded (four-LUT) case, each netlist was optimized with SIS and Flowmap [15]. In both cases, resulting circuits were processed by hybridmap. Results in Table II 
A. Pterm Count Estimation
In a second experiment, the benefit of the Pterm-count estimator was evaluated. Table III indicates leftover LUT counts for hybridmap run both with and without Pterm estimation. As shown in Table III, the Pterm estimator allows for approximately 4% improved overall LUT coverage for a modest increase in overall hybridmap run time. In some cases, the use of the Pterm estimator reduced run time since subgraphs were more aggressively extracted, reducing the search space. All designs were processed starting from a two-input gate representation.
B. Comparisons to Previous Work
As discussed in Section II-C, several previous approaches to hybrid technology mapping with no timing constraints have been developed. In a customized experiment, hybridmap was compared to results reported in [31] for hybrid devices containing low-fanout PLAs (few inputs, outputs, and Pterms). As shown in Table IV, the results obtained from hybridmap when targeting these small PLAs were competitive with previously reported work. The LUT counts noted in the fifth and eighth columns indicate remaining LUTs after logic has been mapped to PLAs. The optimized benchmarks used for these experiments were previously used in [31] and were obtained from Kaviani. Note that the number of PLAs per device varied from design to design. (Fig. 12, row 1). The target architecture for both cases is the Apex20KE architecture with PLA counts ranging from 1 to 10. The input to the PLA mapping approach described in [2] is a four-LUT mapped circuit from which PLA subgraphs are extracted based on a search algorithm. The uncovered nodes in the design at the end of the PLA mapping process are counted to obtain the final LUT count. The benchmark circuits used by Lin and Wilton were obtained from Lin and were input to hybridmap. In order to maintain a common ground for comparison, no preprocessing operations such as logic optimization or two-input gate decomposition were performed on the input benchmark circuits. The post-mapping four-LUT count of each benchmark circuit was obtained by counting the number of nodes in the four-LUT partition output of hybridmap. It can be seen that, as the number of available PLAs increases, hybridmap achieves better LUT coverage than the approach presented in [2]. A maximum improvement of 22% was achieved for 10 PLAs per device.
The improved PLA packing density obtained by hybridmap is attributed mainly to two of the procedures available in hybridmap but not in [2] : subgraph-based logic extraction accentuated by hill climbing and Pterm estimation targeting Apex20KE devices.
Since the work described in this paper is the first reported timing-constrained mapping approach for hybrid FPGAs, it was not possible to compare against previous work in this area.
C. Mapping to Altera Apex20KE Devices
As a final experiment, hybridmap was applied in unconstrained and timing-constrained mode to the benchmark circuits listed in Table V. For delay and area comparison purposes, designs were initially mapped entirely to LUTs using Altera Quartus v2000.02 [5], the commercially available Altera tool set. The critical path of each mapped circuit was then used as the minimum clock frequency constraint for hybridmap. Following this initial mapping, hybridmap was applied to the initial (non-Quartus mapped) circuits and Pterm and LUT partitions were created. These partitions were subsequently mapped to EP20KE device resources using Quartus with speed as the synthesis objective. For each design, the smallest EP20KE device which could support LUTs-only mapping and design pin constraints was targeted. Table V compares the area and delay values obtained for four-LUT implementation against those obtained for hybridmap timing-constrained and unconstrained implementations. LUT values indicate the number of LUTs remaining after hybridmap processing. Note that not all designs have logic mapped to Pterms in the timing-constrained mode. For designs with a zero in column 6, logic could not be migrated to Pterms without increasing the circuit critical path. The bottom row of the table presents the arithmetic sum of the results obtained for the benchmark circuits. Hybridmap meets the delay value incurred by a purely four-LUT implementation and packs about 8% of initial design LUTs into Pterms. For the unconstrained case, all designs had some logic mapped to Pterms. Overall, 14% of logic could be mapped from LUTs to Pterms. It was estimated with hybridmap that if Pterm blocks required 2X, rather than 3X, the delay of LUTs, timing-constrained LUT coverage would rise from 8% to 11.8%, approaching the 14% unconstrained coverage.
VIII. CONCLUSION
Hybrid devices facilitate efficient area and delay tradeoffs for FPGA designs. In this paper, heuristic techniques to automatically identify design partitions for implementation in PLA-based logic resources have been described. A subgraph extraction approach based on heuristic search and hill-climbing was found to quickly identify feasible PLA subgraphs, including those with reconvergent paths. A Pterm-count estimator has been developed which is fast enough to be used in the inner loop of subgraph generation. An area estimator further guides the subgraph selection process by estimating the LUT count coverage due to each subgraph. Hybridmap has been developed to support both unconstrained and timing-constrained area optimization. Evaluation of the technology mapping tool using Altera's Apex20KE devices [1] reveals that, on average, 8% of four-LUT area can be transferred to Pterms while preserving device timing performance and 14% can be transferred without timing constraints. This provides additional space for subsequent design additions or for migration to a smaller FPGA device.
