In this paper, we study the technology mapping problem for a novel FPGA architecture that is based on k-input single-output PLA-like cells, or k/m-macrocells. Each cell in this architecture can implement a single-output function of up to k inputs and up to m product terms. We develop a very efficient technology mapping algorithm, k_m_flow, for this new type of architecture. Experimental results show that our algorithm achieves depth optimality in practically all cases. Furthermore, we show that k/m-macrocell based FPGAs are practically equivalent to traditional k-LUT based FPGAs with only a relatively small number of product terms (m ≈ k+3). We also investigate the total area and delay of k/m-macrocell based FPGAs on various benchmarks and compare them with commonly used 4-LUT based FPGAs. The experimental results show that k/m-macrocell based FPGAs can outperform 4-LUT based FPGAs in terms of both delay and area after placement and routing by VPR.
Introduction
Field Programmable Devices (FPDs) have been widely used to implement small to medium size digital circuits. There are two major types of FPDs: Field Programmable Gate Arrays (FPGAs), which usually consist of small programmable logic cells such as k-input single-output lookup tables, and Complex Programmable Logic Devices (CPLDs), which are based on multiple-input, multiple-output PLA-like logic cells. Both FPGAs and CPLDs have been widely used.
Most commonly used FPGAs are based on k-input single-output lookup tables (k-LUTs). Every k-LUT can implement any function with no more than k inputs. In practice, k is usually small (for example, 4-LUTs are widely used in commercial FPGAs), because the area of a k-LUT grows exponentially with k. On the other hand, PLA-based devices usually have large basic cells. Each cell can have a large number of inputs (typically between 30 and 40). A PLA cell also normally has multiple outputs (16, for example). As a result, a single PLA cell is able to implement multiple functions with wide inputs. Unlike a lookup table, however, each cell can only implement functions with no more than m product terms.
Rose et al. [16] showed that a 4-input, single-output LUT cell yields the smallest FPGA area of any k-LUT cell for a wide range of [...]. The study in [15] investigated the best granularity for PLA-based CPLDs and found that the total CPLD area is smallest if each basic cell has 8-10 inputs, 3-4 outputs, and 12-13 product terms. The number of product terms is restricted to grow linearly as the input size increases [14]. In practice, however, most commercially available CPLDs use much larger PLA-like logic cells. Since FPGAs use small programmable cells, they often offer high density and high capacity, at the price of possibly larger and somewhat unpredictable delays, as a critical path may need to go through multiple levels of programmable cells connected by programmable interconnect. On the other hand, CPLDs are usually faster, as their programmable cells are much larger, which results in fewer levels of logic. (The worst-case delay in a CPLD also tends to be more predictable, as the number of logic levels on the worst-case delay path is usually determined by the architecture and can be estimated by the designer.) However, CPLDs usually offer considerably lower logic density.
We believe that this is due to two reasons: (a) it is inherently difficult to map logic into multi-output PLA-like programmable cells, as most technology mapping techniques are developed for single-output logic cells; and (b) the difficulty associated with synthesis and mapping for PLA-based CPLD devices has in turn resulted in very limited studies on this topic. The only related works we can find are DDMap [14] in the early 1990s, a fast heuristic partition method for PLA-based architectures proposed in [10], and TEMPLA [14] in 1998. (In comparison, there are much more extensive studies on LUT-based FPGAs, which are briefly summarized in Section 3.1.)
The need to reduce the logic levels (and the associated interconnects) to improve circuit performance, the intention to avoid the mapping problem for multi-output functions, and the hope to leverage the large body of research results on synthesis and mapping for LUT-based FPGAs all seem to suggest that we should consider FPGAs with LUTs that have a much larger number of inputs. However, the area of a k-LUT grows exponentially with respect to k, so using k-LUTs with large k may considerably lower chip density. Therefore, we have to explore other alternatives. We noticed that the functions mapped into large LUTs usually use considerably fewer product terms than the lookup table capacity [15]. This leads us to consider an FPGA architecture based on k-input single-output PLA-like logic cells. Each cell can implement a single-output function of up to m product terms and up to k inputs. Such a cell is called a "k/m-macrocell" throughout this paper. A k/m-macrocell differs from a k-LUT in that each macrocell can implement only a subset of all possible k-input functions. A k/m-macrocell is also different from a general PLA-like block used in most CPLD devices, as each k/m-macrocell has a single output. If we choose m to be small, k/m-macrocells are much smaller than k-LUTs. Therefore, it is possible to use k/m-macrocells with a larger input size in order to obtain smaller logic depth and less interconnect without lowering the chip capacity considerably.
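As a rough illustration of the k/m constraint, the sketch below represents a single-output function as a sum-of-products cover and checks whether that cover fits into a k/m-macrocell. The cube encoding (a dictionary from input name to required value) and the function name fits_km_macrocell are assumptions made for this example only; the check takes the cover as given and does not attempt to minimize it.

```python
from typing import Dict, List, Set

# A cube (product term) maps each input it depends on to the value it requires.
# This encoding is an assumption made for illustration only.
Cube = Dict[str, bool]


def support(cover: List[Cube]) -> Set[str]:
    """Return the set of distinct inputs used by any product term."""
    return {name for cube in cover for name in cube}


def fits_km_macrocell(cover: List[Cube], k: int, m: int) -> bool:
    """True if the cover uses at most k inputs and at most m product terms."""
    return len(support(cover)) <= k and len(cover) <= m


# f = a*b + c'*d : 4 inputs, 2 product terms
f = [{"a": True, "b": True}, {"c": False, "d": True}]

# 3-input parity (a xor b xor c): every minimal cover needs 4 product terms.
parity3 = [
    {"a": True, "b": False, "c": False},
    {"a": False, "b": True, "c": False},
    {"a": False, "b": False, "c": True},
    {"a": True, "b": True, "c": True},
]
```

The parity example shows why a k/m-macrocell implements only a subset of the k-input functions: a 3-LUT can realize parity3, while a macrocell limited to m = 3 product terms cannot.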
In this paper, we develop a very efficient technology mapping algorithm, named k_m_flow, for this new type of architecture. Experimental results show that our algorithm achieves depth optimality in practically all cases. Furthermore, we show that k/m-macrocell based FPGAs are practically equivalent to traditional k-LUT based FPGAs with only a relatively small number of product terms (m ≈ k+3). We also investigate the total area and delay of k/m-macrocell based FPGAs on various benchmarks and compare them with commonly used 4-LUT based FPGAs. The results show that k/m-macrocell based FPGAs can outperform 4-LUT based FPGAs in terms of both delay and area after placement and routing by VPR.
The rest of this paper is organized as follows. Section 2 formulates the problem. Section 3 introduces a technology mapping algorithm for k/m-macrocell-based FPGAs. Section 4 further investigates the area and delay of k/m-macrocell-based architecture. We draw our conclusions based on experimental results and discuss the future work in Section 5.
Throughout this paper, the letter k is used to denote the input size of a macrocell, or the input size of a LUT in FPGA. The letter m is used to represent the maximum number of product terms that one macrocell can implement.
Definitions and Problem Formulation
A Boolean network can be represented as a directed acyclic graph (DAG) where each node represents a logic gate and a directed edge (i, j) exists if the output of gate i is an input of gate j. A primary input (PI) node has no incoming edge and a primary output (PO) node has no outgoing edge. We use input(v) to denote the set of nodes which are fanins of gate v. We assume the network is 2-bounded, that is, for each node v in the network, |input(v)| ≤ 2. Any network can be fully decomposed into a 2-bounded network without deteriorating the mapping quality [6].
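The network model above can be sketched with a plain adjacency structure. The dictionary encoding (node name mapped to its fanin list), the example network, and the helper names are assumptions chosen for illustration.

```python
from typing import Dict, List

# Hypothetical encoding: node name -> list of fanin node names.
# Primary inputs have an empty fanin list.
Network = Dict[str, List[str]]

network: Network = {
    "a": [], "b": [], "c": [],
    "g1": ["a", "b"],
    "g2": ["b", "c"],
    "out": ["g1", "g2"],
}


def inputs_of(net: Network, v: str) -> List[str]:
    """input(v): the fanin nodes of gate v."""
    return net[v]


def primary_inputs(net: Network) -> List[str]:
    """Nodes with no incoming edge."""
    return [v for v, fanins in net.items() if not fanins]


def primary_outputs(net: Network) -> List[str]:
    """Nodes with no outgoing edge, i.e. nodes that drive no other node."""
    driven = {u for fanins in net.values() for u in fanins}
    return [v for v in net if v not in driven]


def is_2_bounded(net: Network) -> bool:
    """True if |input(v)| <= 2 for every node v."""
    return all(len(fanins) <= 2 for fanins in net.values())
```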
A cone at v, denoted as C_v, is a subgraph consisting of v and its predecessors such that any path connecting a node in C_v and v lies entirely in C_v. The notation input(C_v) is also used to represent the set of distinct nodes outside C_v which supply inputs to the gates in C_v. A maximum cone at v, also called the fanin network of v and denoted as N_v, is a cone consisting of v and all of its predecessors.
Several concepts about cuts in a network will be used in our discussion. Given a network N with a source s and a sink t, a cut (X, X') is a partition of the nodes in the network such that s ∈ X, t ∈ X', and no node in X' provides input to any node in X. Clearly, X' may be considered as a cone at t inside network N. Therefore, we can apply the previous definitions of k/m-feasibility to cuts. A cut (X, X') is said to be k-feasible if and only if X' is a k-feasible cone.
We use two delay and area models to evaluate the quality of a mapping solution. Throughout the discussion of the technology mapping algorithm (Section 3), unit delay and unit area models are used. That is, variation of interconnect delay and routing area is not directly considered during technology mapping of the original network. Each k/m-macrocell contributes a constant delay independent of the function it implements. Each cell is counted as one unit when we evaluate the area, hence the total area of the mapping solution equals the total number of macrocells. Such simplification is reasonable because layout information is not yet available. For the architecture comparison in Section 4, however, we use more accurate delay and area models that take the interconnect into account, as we use a well-known FPGA placement and routing tool (VPR [2]) to obtain the total area and critical path delay after layout. To avoid confusion, we use "depth" and "number of macrocells" in Section 3 to refer to the delay and area under the unit delay and unit area models.
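The cone definitions can be sketched as follows. The network encoding and function names are assumptions for illustration, and the sketch assumes the usual definition that a cone is k-feasible when it has at most k distinct inputs.

```python
from typing import Dict, List, Set

Network = Dict[str, List[str]]  # node -> fanins (assumed encoding)

network: Network = {
    "a": [], "b": [], "c": [], "d": [],
    "g1": ["a", "b"],
    "g2": ["c", "d"],
    "g3": ["g1", "g2"],
}


def fanin_network(net: Network, v: str) -> Set[str]:
    """N_v: node v together with all of its predecessors."""
    seen: Set[str] = set()
    stack = [v]
    while stack:
        u = stack.pop()
        if u not in seen:
            seen.add(u)
            stack.extend(net[u])
    return seen


def cone_inputs(net: Network, cone: Set[str]) -> Set[str]:
    """input(C): distinct nodes outside the cone that feed gates inside it."""
    return {u for w in cone for u in net[w] if u not in cone}


def is_k_feasible(net: Network, cone: Set[str], k: int) -> bool:
    """Assumed definition: the cone has at most k distinct inputs."""
    return len(cone_inputs(net, cone)) <= k
```

For example, the cone {g1, g2, g3} at g3 has the four inputs a, b, c, d, so under this definition it is 4-feasible but not 3-feasible.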
Technology Mapping for k/m-macrocells

Overview
A k/m-macrocell can be considered as a k-LUT with an additional restriction that it can only implement logic functions with no more than m product terms. Therefore, it is natural to start with the k-LUT mapping problem since it has been intensively studied in the past few years.
Currently, there are three major approaches to LUT-based FPGA mapping: tree-based mapping (e.g., Chortle-crf and Chortle-d [8], [9]), flow-based mapping (e.g., FlowMap [3]), and cut-enumeration-based mapping [4]. See [5] for a more comprehensive survey. Tree-based mapping algorithms partition the network into trees and handle each tree separately. Each individual tree can be mapped optimally, but the a priori tree partitioning often compromises the mapping quality. These are usually fast heuristic algorithms. Flow-based mapping relies on the max-flow min-cut theorem and the computation of network flow. It can generate a depth-optimal mapping solution in polynomial time. However, flow-based algorithms lack flexibility, as they find only one or two depth-optimal min-cuts for every node. On the other hand, cut-enumeration-based approaches find many, if not all, possible cuts for every node. They offer high flexibility and can achieve optimality under more constraints, but they are considerably slower than tree-based or flow-based methods.
The approach we present here, called k_m_flow, is a hybrid of flow computation and cut enumeration. We first try to find a k/m-feasible cut for every node by flow computation. If this fails, we turn to cut enumeration.
Algorithm
The k_m_flow algorithm consists of two phases: labeling the network and mapping the network into macrocells. The labeling phase tries to find a k/m-feasible cut for every node for depth minimization. The mapping phase generates the k/m-macrocells in the mapping solution according to the labels and cuts obtained in the labeling phase.
Labeling Phase
For every node v, let N_v be the fanin network consisting of node v and all its predecessors. We also define label*(v), the optimal mapping depth of v, to be the minimum depth of the k/m-macrocell mapping solution for N_v. The labeling phase for k/m-macrocell mapping is similar to that in the FlowMap algorithm. It finds a k/m-feasible cut for every node v and computes a label for v so as to minimize the depth of the k/m-macrocell implementing node v in the mapping solution. Ideally, we would like the computed label to be equal to the optimal mapping depth, that is, label(v) = label*(v) for every node v in the network, as is the case with the FlowMap algorithm for k-LUT mapping. However, this is more difficult to achieve for k/m-macrocell based mapping due to the non-monotone properties of the clustering constraints and the optimal labels, as presented in the next subsection.
Non-monotone Clustering Constraints and Optimal Mapping Depths
The fundamental difficulty of k/m-macrocell based FPGA mapping is that the constraints on the number of inputs and the number of product terms of a k/m-macrocell are not monotone clustering constraints. That is, a cone C_v is k-infeasible [...] (Figure 1). In addition, the optimal k/m-macrocell mapping depth is not monotone either. The optimal mapping depth is monotone if label*(v) ≥ label*(u) whenever u is an input to v. Figure 1 shows that the optimal mapping depth is not monotone: there, one node has an optimal label of 1 while one of its inputs has an optimal label of 2. Note that for the LUT mapping problem, it was shown in [3] that the optimal mapping depth is monotone.
Depth Optimal Mapping Algorithm
Given a cut (X, X') in Nv, the height of the cut, denoted as h(X,X'), is the maximum label in input(X'), i.e.
h(X, X') = max{ label(v) | v ∈ input(X') }
(It is assumed that every node in input(X') has a label.)
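A minimal sketch of the cut height computation, assuming the network is stored as a fanin dictionary and the cut is identified by its sink side X'. The encoding, the example labels, and the function name cut_height are illustrative assumptions.

```python
from typing import Dict, List, Set

Network = Dict[str, List[str]]  # node -> fanins (assumed encoding)

network: Network = {
    "a": [], "b": [], "c": [], "d": [],
    "g1": ["a", "b"],
    "g2": ["c", "d"],
    "g3": ["g1", "g2"],
}

# Hypothetical labels: primary inputs at 0, first-level gates at 1.
labels: Dict[str, int] = {"a": 0, "b": 0, "c": 0, "d": 0, "g1": 1, "g2": 1}


def cut_height(net: Network, label: Dict[str, int], x_prime: Set[str]) -> int:
    """h(X, X'): the maximum label among the nodes in input(X')."""
    boundary = {u for w in x_prime for u in net[w] if u not in x_prime}
    return max(label[u] for u in boundary)
```

With these labels, the cut whose sink side is {g3} has height 1, while the cut whose sink side is {g1, g2, g3} has height 0.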
Cut(x) is the set of k-feasible cuts for node x. The notation "(C_x − x, x)" refers to the cut that cuts off the single node x. "⊗_k" is a merging operator defined on two cut sets: "S1 ⊗_k S2" merges every cut cut1 in S1 with every cut cut2 in S2 and keeps only the k-feasible cuts in the result.
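The merging operator can be sketched as below. This follows a common formulation of cut enumeration rather than the exact procedure of this paper: each cut is identified by its leaf set input(X'), and for every fanin u of x we either stop at u or continue through one of the cuts of u. The encoding and the names merge_k and enumerate_cuts are assumptions.

```python
from itertools import product
from typing import Dict, FrozenSet, List, Set

Network = Dict[str, List[str]]  # node -> fanins (assumed encoding)
Cut = FrozenSet[str]            # a cut is identified by input(X'), its leaf set


def merge_k(s1: Set[Cut], s2: Set[Cut], k: int) -> Set[Cut]:
    """S1 (x)_k S2: merge every pair of cuts and keep the k-feasible results."""
    return {c1 | c2 for c1, c2 in product(s1, s2) if len(c1 | c2) <= k}


def enumerate_cuts(net: Network, order: List[str], k: int) -> Dict[str, Set[Cut]]:
    """Compute Cut(x) for every node, visiting nodes in topological order."""
    cuts: Dict[str, Set[Cut]] = {}
    for x in order:
        if not net[x]:          # primary input: there is nothing to cut
            cuts[x] = set()
            continue
        result: Set[Cut] = {frozenset()}
        for u in net[x]:
            # Either stop at fanin u, or continue through one of u's cuts.
            choices = {frozenset([u])} | cuts[u]
            result = merge_k(result, choices, k)
        cuts[x] = result
    return cuts


network: Network = {
    "a": [], "b": [], "c": [], "d": [],
    "g1": ["a", "b"],
    "g2": ["c", "d"],
    "g3": ["g1", "g2"],
}
topo_order = ["a", "b", "c", "d", "g1", "g2", "g3"]
```

The number of cuts can grow quickly with k, which is consistent with the observation that enumeration is slower than flow computation.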
After the enumeration process, we check Cut(v) to see if there is an m-packable cut. If there exists a k/m-feasible cut, node v can be labeled as mlevel; otherwise, it is labeled as mlevel+1.
The pseudocode for the labeling phase is shown in Figure 2.
Mapping Phase
The second phase of our algorithm generates the k/m-macrocells in the mapping solution. For every node v, if we found a k/m-feasible cut (X, X') in the labeling phase, then we can create a k/m-macrocell map_node(v) for v to implement the function of X', with input(map_node(v)) = input(X'). If no k/m-feasible cut was found during the labeling phase (which may occur in cases 1 and 3), we can create a k/m-macrocell to implement the function of the single node v. After generating macrocells for every node, we need to remove redundant cells that do not fan out to any other macrocell. This procedure can be optimized by using a list to keep track of "visible" nodes and generating macrocells only for "visible" nodes. The detailed algorithm is shown in Figure 2.
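The "visible" list idea can be sketched as a backward traversal from the primary outputs. This is only an illustration under assumed data structures, not the algorithm of Figure 2: cut_inputs[v] is assumed to hold input(X') of the cut chosen for v (or the fanins of v when no cut was found), and the function names are hypothetical. The second function evaluates the result under the unit delay model.

```python
from typing import Dict, FrozenSet, List, Set


def generate_macrocells(
    primary_inputs: Set[str],
    primary_outputs: List[str],
    cut_inputs: Dict[str, FrozenSet[str]],
) -> Dict[str, FrozenSet[str]]:
    """Create macrocells only for nodes reachable ("visible") from the outputs."""
    macrocells: Dict[str, FrozenSet[str]] = {}
    visible = list(primary_outputs)
    while visible:
        v = visible.pop()
        if v in macrocells or v in primary_inputs:
            continue
        macrocells[v] = cut_inputs[v]           # input(map_node(v)) = input(X')
        visible.extend(sorted(cut_inputs[v]))   # its inputs become visible
    return macrocells


def mapping_depth(macrocells: Dict[str, FrozenSet[str]], outputs: List[str]) -> int:
    """Depth under the unit delay model: each macrocell adds one level."""
    depth: Dict[str, int] = {}

    def visit(v: str) -> int:
        if v not in macrocells:                 # primary input
            return 0
        if v not in depth:
            depth[v] = 1 + max(visit(u) for u in macrocells[v])
        return depth[v]

    return max(visit(v) for v in outputs)


# Hypothetical labeling result for a small network with gates g1, g2, g3.
cuts = {
    "g1": frozenset({"a", "b"}),
    "g2": frozenset({"c", "d"}),
    "g3": frozenset({"a", "b", "g2"}),   # g1 is absorbed into the cell of g3
}
cells = generate_macrocells({"a", "b", "c", "d"}, ["g3"], cuts)
```

In this example no macrocell is created for g1, because it is covered by the macrocell of g3 and never becomes visible.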
Properties of the k_m_flow Algorithm
We can prove the following properties for the algorithm discussed above:
1) If a node v is labeled as label(v), then it can be implemented with a depth no more than label(v). That is, label(v) is an upper bound on the depth of v in the mapping solution.
2) If case 3 never happens when mapping a specific circuit, then the mapping solution is depth optimal. Indeed, it is just the same as k-LUT mapping.
3) For any given circuit, if the optimal depth for k-LUT based mapping is d1, the optimal depth for k/m-macrocell based mapping is d2, and the depth of the k_m_flow mapping result is d3, then d1 ≤ d2 ≤ d3.
Area Enhancement
After obtaining a k/m-macrocell mapping solution, we want to further reduce the number of k/m-macrocells used in the mapping solution without increasing its depth.
For every k/m-macrocell v, we try to pack as many of its predecessors as possible with it into a single k/m-macrocell. Clearly, we need to guarantee that the new k/m-macrocell is still k/m-feasible. In order to do so, we try to combine [...]. Therefore, the above greedy packing process is repeated until no more nodes can be packed. The detailed packing algorithm, k_m_pack, is shown in Figure 3.
On average, the total number of macrocells in the mapping solution may be reduced by about 6% after the above packing process.
Experimental Results
Our algorithm, k_m_flow, has been implemented in C within the Berkeley SIS and UCLA RASP [7] framework. We chose a set of 16 MCNC benchmarks to test k_m_flow on a Sun Ultra II workstation with 512 MB of memory.
Table 1 Description of 16 benchmark circuits
In order to find the optimal mapping depth for each benchmark and compare it with the k_m_flow mapping solution, we implemented an algorithm called k_m_enumerate. The k_m_enumerate algorithm finds the depth-optimal mapping solution by exhaustive cut enumeration on the entire network, as proposed in Section 3.2.1.2. We would like to point out that k_m_enumerate is impractical to use for large k. We use it only to collect data for analyzing the depth optimality of the results of k_m_flow.
In Table 2, we list the mapping depths generated by k_m_flow and k_m_enumerate under different k and m. The data is in the form "x/y", where "x" is the depth of the mapping solution generated by k_m_flow and "y" is the optimal mapping depth obtained by k_m_enumerate under the specified k and m. A question mark "?" means that the optimal depth is not yet known because of the extremely long runtime and large memory requirement of k_m_enumerate for large k. From Table 2, we can see that although k_m_flow cannot guarantee depth optimality in theory, in practice it is almost always able to find the depth-optimal mapping solution.
Table 3 Total mapping depth of k/m-macrocell vs. k-LUT
Table 4 Total number of k/m-macrocells vs. total number of k-LUTs on 16 MCNC benchmarks
We also compare the k/m-macrocell mapping solutions generated by k_m_flow with the k-LUT mapping solutions generated by FlowMap. Table 3 shows the total mapping depth of k/m-macrocells vs. k-LUTs on 16 benchmarks. Table 4 shows the total number of macrocells vs. the total number of k-LUTs on 16 benchmarks. FlowMap is the depth-optimal k-LUT mapping algorithm based on flow computation. Since k/m-macrocells can be considered as k-LUTs with an additional m-product-term constraint, the optimal depth of the k-LUT mapping solution is a lower bound on the optimal depth of the k/m-macrocell mapping solution.
Table 5 Quick success rate of flow computation
[...] nodes inside. An exhaustive cut enumeration on a small network with no more than 50 nodes usually runs very fast. Therefore, k_m_flow should be an efficient algorithm for generating k/m-macrocell mapping solutions for medium k. For large k, the cone may be large, and even the local cut enumeration may take a long time to finish. Table 6 shows the total CPU time (in seconds) needed to generate all the mapping solutions for the 16 benchmarks.
Since the quick success rate is usually very high, in practice, skipping local enumeration has little impact on the mapping quality but reduces the runtime.
Investigation of k/m-macrocell Based Architectures
In Section 3 we use unit area and unit delay models to evaluate the quality of our k/m-macrocell mapping algorithm. In order to collect more accurate delay and area information for drawing architectural conclusions, we use VPR [2], an FPGA placement and routing tool developed at the University of Toronto, to place and route our k/m-macrocell-based architecture, and we compare this architecture with the traditional 4-LUT-based architecture in terms of total area and critical path delay. Figure 4 shows the schematic diagram of the logic block used in our k/m-macrocell-based architecture (we call it the k/m logic block), and Figure 5 shows the logic block used in the 4-LUT-based architecture (we call it the 4-LUT logic block) [2]. Since the area of a logic block is greatly affected by the total number of I/O pins of the block and the number of transistors in the block, we use the [...] Table 7. Therefore, we can estimate that the area of a k/m logic block is 4-6 times as large as the area of a 4-LUT logic block for k=7-10, m=10-13. As we have not done any simulation of the k/m logic block, we do not have an accurate delay for it. A rough estimate of the delay of the k/m logic block is that it is 2 times slower than a 4-LUT for k between 7 and 10, based on the observation that the number of transistors on the longest path a signal passes through in the k/m logic block is about 3 times that in a 4-LUT logic block. The total area is the sum of the routing area and the logic block area; the critical path delay is the sum of the interconnect delay and the logic block delay. The routing area and interconnect delay are estimated by VPR.
Figure 6 k/m-macrocell
Experimental Setting of VPR
The authors of VPR have extensively studied area/delay trade-offs for 4-LUT and cluster-based logic blocks. They proposed a detailed 4-LUT-based FPGA architecture under TSMC's 0.35 µm, 3.3 V process [2]. The 4-LUT logic block they proposed is exactly the one shown in Figure 5. We compare our k/m-macrocell-based architecture with their 4-LUT-based architecture by changing only the area and delay of the logic block in the architecture file. VPR reports the routing area in number of min-width transistors and the critical path delay in seconds. We add the logic block area to the routing area to obtain the total area of each mapping solution.
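The area and delay bookkeeping described here amounts to simple arithmetic, sketched below. The record fields, function names, and all numeric values are hypothetical and serve only to show how the totals are combined; they are not VPR outputs or results from this study, and the sketch assumes the same per-block area and delay for every logic block.

```python
from dataclasses import dataclass


@dataclass
class LayoutResult:
    """Hypothetical per-circuit numbers after placement and routing."""
    num_blocks: int            # logic blocks in the mapping solution
    routing_area: float        # in min-width transistors
    interconnect_delay: float  # in seconds, along the critical path
    logic_levels: int          # logic blocks on the critical path


def total_area(r: LayoutResult, block_area: float) -> float:
    """Total area = routing area + logic block area."""
    return r.routing_area + r.num_blocks * block_area


def critical_path_delay(r: LayoutResult, block_delay: float) -> float:
    """Critical path delay = interconnect delay + logic block delay."""
    return r.interconnect_delay + r.logic_levels * block_delay


# Made-up example values, for illustration only.
example = LayoutResult(num_blocks=10, routing_area=1000.0,
                       interconnect_delay=4e-9, logic_levels=3)
area = total_area(example, block_area=50.0)
delay = critical_path_delay(example, block_delay=1e-9)
```

Changing only block_area and block_delay while keeping the rest of the flow fixed mirrors the idea of comparing two architectures by editing the logic block entries alone.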
Experimental Result
We compared the k/m-macrocell based architecture with the 4-LUT-based architecture by running VPR on the two kinds of mapping solutions of the 16 MCNC benchmarks under the experimental settings mentioned above. The k/m-macrocell mapping solutions are obtained by running the k_m_flow algorithm and then performing k_m_pack to further reduce the number of macrocells. The 4-LUT mapping solutions are obtained by running FlowMap followed by greedy-pack. Average area and delay are shown in Table 8. [...] When k is small, most of the area is devoted to routing. As k increases, the routing area decreases, but the increase in logic block area could exceed the decrease in routing area. Since the area of k/m-macrocell blocks does not grow exponentially as that of k-LUTs does, the total area decreases. Since the logic depth and routing area decrease, the total delay decreases.
Conclusions and Future Work
In this paper, we have studied a novel FPGA architecture based on k/m-macrocells and proposed a k/m-macrocell technology mapping algorithm, named k_m_flow, which produces optimal mapping depths in most cases. Using this algorithm, we showed that k/m-macrocell based FPGAs are similar to k-LUT based FPGAs in terms of mapping depth and the number of cells used. The high quick success rate (Table 5) suggests that a k/m-macrocell can provide flexibility similar to that of a lookup table, while each k/m-macrocell is much smaller than a k-LUT. We have analyzed the delays and areas of k/m-macrocell based FPGAs using VPR and compared the results with those of traditional 4-LUT based FPGAs. Our comparison showed that k/m-macrocell based FPGAs can significantly outperform 4-LUT based FPGAs in both delay and area when the k/m logic block has no more than 3 times the delay and no more than 6 times the area of a 4-LUT logic block.
We are extending this work in several directions. First, we plan to perform a detailed layout of a k/m-macrocell (including the necessary transistor sizing) to collect more accurate area and delay information and compare it with that of a k-LUT based logic cell. These more accurate area and delay models will be fed into VPR to obtain more accurate area and delay results. Second, we plan to study an FPGA architecture with clusters of k/m-macrocells and compare it with an architecture with clusters of k-LUTs, as most modern FPGAs use LUT clusters for density and performance enhancement.
