An Efficient Algorithm for Performance-Optimal FPGA Technology Mapping with Retiming
Jason Cong and Chang Wu
Abstract-It is known that most field programmable gate array (FPGA) mapping algorithms consider only combinational circuits. Pan and Liu [22] recently proposed a novel algorithm, named SeqMapII, of technology mapping with retiming for clock period minimization. Their algorithm, however, requires $O(K^3 n^5 \log(Kn^2) \log n)$ run time and $O(K^2 n^2)$ space for sequential circuits with $n$ gates. In practice, these requirements are too high for targeting $K$-lookup-table-based FPGA's implementing medium or large designs. In this paper, we present three strategies to improve the performance of the SeqMapII algorithm significantly. Our algorithm works in $O(K^2 n_l n |P_v| \log n)$ run time and $O(K |P_v|)$ space, where $n_l$ is the number of labeling iterations and $|P_v|$ is the size of the partial flow network. In practice, both $n_l$ and $|P_v|$ are less than $n$. Area minimization is also considered in our algorithm based on efficient low-cost $K$-cut computation.
Index Terms-Expanded circuit, field programmable gate array (FPGA), lookup table, retiming, technology mapping.
I. INTRODUCTION
THE technology mapping and synthesis problem for field programmable gate arrays (FPGA's) is to produce an equivalent circuit for a given circuit using only specific programmable logic blocks (PLB's). More specifically, without synthesis, the PLB's in a mapping solution form a cover of gates in the original circuit, possibly with overlap. There are a variety of different PLB architectures. In this paper, we consider a generic type of PLB: the $K$-input lookup table ($K$-LUT), which has been widely used in current FPGA technology [1], [18], [30]. Most of the previous LUT mapping algorithms optimize either area (e.g., [13], [14], and [20]) or delay (e.g., [5], [15], and [21]). The algorithms in [4] and [7] consider both delay and area. The algorithms in [25] and [27] consider routability. A comprehensive survey of FPGA mapping algorithms is given in [6]; however, most of these approaches apply only to combinational circuits. For sequential circuits, these approaches assume that the positions of flip-flops (FF's) are fixed so that the entire circuit can be partitioned into combinational subcircuits, each of which is mapped separately. A major limitation of these approaches is that they do not consider mapping and retiming simultaneously. In fact, the optimal mapping solutions for all combinational subcircuits may not lead to an optimal mapping solution for the entire sequential circuit due to the effect of retiming.
Manuscript received March 11, 1997. This work was supported in part by the National Science Foundation under Young Investigator Award MIP9357582 and by grants from Xilinx and Lucent Technologies under the California MICRO program. This paper was recommended by Associate Editor A. Saldanha.
The authors are with the Computer Science Department, University of California, Los Angeles, CA 90095 USA.
Publisher Item Identifier S 0278-0070(98)06759-1.
Retiming is a technique of moving FF's within the circuit without changing the circuit behavior. For single-phase clock and edge-triggered FF's, Leiserson and Saxe [16], [17] solved the retiming problem of minimizing the clock period or the number of FF's. Several FPGA synthesis and mapping algorithms have been proposed specifically for sequential circuits. The approach in [19] does not consider retiming; rather, its objective is the proper packing of LUT's with FF's to minimize the number of configurable logic blocks for Xilinx FPGA's [30]. The methods in [23] and [29] are heuristics that consider loopless sequential circuits. Touati et al. [28] proposed an approach of retiming specifically for Xilinx FPGA's after mapping, placement, and routing. A significant advancement was made recently by Pan and Liu [22]. They proposed a novel algorithm, named SeqMapII, to find a mapping solution with the minimum clock period under retiming. Similar to the FlowMap algorithm [5], their algorithm works in two phases: the labeling phase and the mapping generation phase. They introduced the idea of expanded circuits to represent all possible $K$-LUT's under retiming and node replication. An iterative method is used to compute labels for all nodes. The time and space complexities for SeqMapII are $O(K^3 n^5 \log(Kn^2) \log n)$ and $O(K^2 n^2)$, respectively, for a circuit with $n$ gates [22].$^1$ Although the SeqMapII algorithm runs in polynomial time, it has two shortcomings: 1) too many candidate values ($O(Kn^2)$) need to be considered for each label update and 2) the expanded circuits are too large ($O(Kn^2)$ nodes) for computing the optimal solutions. Experimental results show that the run time of SeqMapII for computing the optimal solutions is too long in practice (e.g., more than 12 h of CPU time for a design of 134 gates on a SPARC5 workstation).
In this paper, we present three strategies to significantly improve the performance of label computation, which is the most time-consuming step in SeqMapII [22]. First, we prove that the monotone property of labels holds for sequential circuits, then develop an efficient label update to speed up the algorithm by a factor of $\log(Kn^2)$. Second, we propose a new approach of $K$-cut computation on partial flow networks, which are much smaller than the expanded circuits used in SeqMapII, while guaranteeing the optimality of the results.
Our experimental results show that the average numbers of nodes in the partial flow networks are far less than $n$, which is a big improvement over the number of nodes in the expanded circuits used in SeqMapII [22]. Last, strongly connected-component (SCC) partitioning and heuristic label ordering are used to eliminate much redundant label computation to further speed up the algorithm. In practice, our algorithm works in $O(K^2 n^3 \log n)$ time and $O(Kn)$ space according to our experimental results. Area reduction is also considered in our algorithm by choosing a low-cost $K$-cut for every node. As a result, our algorithm is 2.8 10 times faster than SeqMapII-opt for computing optimal solutions, and even eight times faster than SeqMapII-heu, which uses very small expanded circuits as a heuristic. Furthermore, our algorithm reduces LUT count by 28% and FF count by 27%, and achieves clock periods 23% shorter, as compared with SeqMapII-heu [22].
The remainder of this paper is organized as follows. Section II presents the problem formulation and definitions. Section III gives a review of the approach by Pan and Liu [22]. Our improved algorithm is presented in Section IV. The experimental results are presented in Section V, followed by conclusions and future work in Section VI.$^2$
II. PROBLEM FORMULATION AND DEFINITIONS
Given a sequential circuit, the technology mapping problem for $K$-LUT-based FPGA's is to construct an equivalent circuit consisting of $K$-LUT's and FF's. For performance optimization, we study the following problem.
Problem 1: For a sequential circuit, find an equivalent LUT circuit with the minimum clock period under retiming.
As in [5] and [22] , the unit delay model is used in this paper, which assumes that the delay of each LUT is one and the delay of each net is zero or a constant.
A mapping solution in which the output signals of all LUT's are from the original circuit is called a simple mapping solution [22] . As shown in Fig. 1(b) , the outputs of LUT and LUT are the outputs of and in the original circuit shown in Fig. 1(a) . But the output of LUT in Fig. 1(c) is one clock cycle ahead of the output of in the original circuit. Pan and Liu [22] showed that there exists a simple mapping solution whose clock period under retiming is equal to the minimum clock period among all mapping with retiming solutions. This means that there is no need to move FF's before mapping in order to get an optimal solution to Problem 1. Furthermore, they proposed to solve the decision version of the problem.
Problem 2: Given a target clock period $\phi$, determine the existence of a simple mapping solution whose clock period under retiming is no more than $\phi$.
As in [5] and [22], the results in this paper apply only to $K$-bounded networks, i.e., each gate in the network has at most $K$ fan-ins.$^3$ In the remainder of this paper, all circuits are assumed to be $K$-bounded.
$^2$ An extended abstract of this paper appears in [10].
$^3$ When a circuit is not $K$-bounded, we can use the gate decomposition algorithms in [2], [3], and [8] to decompose gates with more than $K$ fan-ins.
We use $G(V, E, W)$ or $G$ to denote the retiming graph [17] of a sequential circuit, where $V$ is the set of nodes representing the gates in the circuit, $E$ is the set of edges representing the connections between the gates, and $W$ is the set of edge weights. Edge $e(u, v)$ denotes the connection from gate $u$ to gate $v$, and $w(e)$ denotes the number of FF's on the connection. The path weight, denoted $w(p)$, of a path $p$ is the sum of the weights of all edges on the path.
is a subgraph of consisting of node and all nodes that have paths to .
For a simple mapping solution $M$ and a given clock period $\phi$, the edge length, denoted $l(e)$, of an edge $e$ is defined to be $-\phi \cdot w(e) + 1$. The path length, denoted $l(p)$, of a path $p$ is $\sum_{e \in p} l(e)$. The $l$-value of a node $v$ in a mapping solution $M$ is the maximum length of all paths from primary inputs (PI's) to $v$ in $M$. We define $l^M(v) = \infty$ if there is a path from one of the PI's to $v$ going through a (feedback) loop of positive length. It was shown that for a mapping solution $M$ and a given $\phi$, the retimed clock period is no more than $\phi$ if and only if $l^M(v) \le \phi$ for every primary output (PO) $v$ [22]. The label of node $v$, denoted $l(v)$, is defined to be the minimum of the $l$-values of the $K$-LUT's rooted at $v$ among all mapping solutions. Clearly, there is a mapping solution with retimed clock period of no more than $\phi$ if and only if $l(v) \le \phi$ for every PO $v$ [22].
Now let us introduce the definition of $K$-cuts. In a directed graph with one sink and one source, a cut $(X, \overline{X})$ is a partition of the graph such that the sink is in $\overline{X}$ and the source is in $X$. The node cut-set $V(X, \overline{X})$ is the set of nodes in $X$ that are connected directly to nodes in $\overline{X}$. If $|V(X, \overline{X})| \le K$, a cut is called a $K$-feasible cut, or $K$-cut for short. A cut is a min-cut if $|V(X, \overline{X})|$ is minimum. To determine the existence of a $K$-cut, one can compute the max-flow from the source to the sink and decide whether it is larger than $K$ (see [5] for details). This process is called $K$-cut computation in this paper.
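The feasibility test behind Problem 2 can be illustrated with a small sketch. It assumes that the edge length is $1 - \phi \cdot w(e)$ (a reconstruction of the definition above, whose symbols did not survive extraction) and computes $l$-values by longest-path relaxation; the function names `l_values` and `meets_period` and the edge-list encoding are hypothetical and are not taken from the paper.

```python
import math

def l_values(nodes, edges, pis, phi):
    """Longest-path l-values of a mapped circuit, or None if they diverge.

    edges: list of (u, v, w), where w is the number of FF's on edge u -> v.
    Assumed edge length: 1 - phi * w (unit LUT delay, reconstructed).
    """
    l = {v: (0 if v in pis else -math.inf) for v in nodes}
    for _ in range(len(nodes)):
        changed = False
        for u, v, w in edges:
            cand = l[u] + 1 - phi * w
            if cand > l[v]:
                l[v] = cand
                changed = True
        if not changed:
            return l
    # Still changing after |V| rounds: a loop of positive length is reachable.
    return None

def meets_period(nodes, edges, pis, pos, phi):
    """True if every PO has an l-value of at most phi."""
    l = l_values(nodes, edges, pis, phi)
    return l is not None and all(l[v] <= phi for v in pos)

# Two LUT's in a feedback loop that carries one FF.
nodes = ["pi", "a", "b"]
edges = [("pi", "a", 0), ("a", "b", 0), ("b", "a", 1)]
print(meets_period(nodes, edges, {"pi"}, {"b"}, 2))  # True
print(meets_period(nodes, edges, {"pi"}, {"b"}, 1))  # False
```

In this toy circuit the loop has two LUT's and one FF, so its length is zero at $\phi = 2$ and positive at $\phi = 1$, which is why only the first query succeeds.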
III. REVIEW OF THE SEQMAPII ALGORITHM
The SeqMapII algorithm works in two phases: label computation and mapping generation. The label computation starts with a lower bound on the value of each node label and repeatedly improves the lower bounds until they all converge to the node labels. The initial lower bounds are zero for the PI's and for the other nodes. Based on the current lower bound of each node , Pan and Liu [22] presented a procedure to determine a new lower bound . They introduced the concept of expanded circuits for each node to represent all possible -LUT's rooted at with consideration of retiming and node replication. An expanded circuit of node with control number is a directed acyclic graph (DAG) rooted at formed by replication of nodes in , such that has the property that all paths from a node to the root have the same number of FF's. If a replication of node passes FF's before reaching the root in , we denote it and call its weight . The control number of is the shortest distance (in terms of the number of edges) between the root and each leaf in that is not a replication of a PI in . For example, Fig. 2(b) -(e) is four expanded circuits of node shown in Fig. 2 (a) with control number from zero to three, respectively.
Pan and Liu [22] showed that to examine all -LUT's for a node , it sufficed to examine all the -LUT's that can be derived from the -cuts in . With the assumption that the weight of each edge is at most one, it was shown that the numbers of nodes and edges in are bounded by and , respectively, where is the number of gates in the original circuit [22] . In an expanded circuit of node , the height of a -cut is defined as based on the current lower bounds of node labels. The new lower bound is computed as This value is determined by binary search among candidates in the set of and performing a -cut computation for each candidate value. The computation time for every is based on network flow computation. The labels of all nodes can be determined in time because there is a total of label updates [22] . 4 SeqMapII [22] is the first polynomial algorithm to find a mapping solution with the optimal clock period under retiming. However, two major shortcomings make this approach inefficient in practice. First, the expanded circuit is too large [ nodes and edges], which requires prohibitively large memory and run time for circuits with more than 1000 gates. Second, too many values ( ) have to be considered when computing the new lower bound of each node label.
IV. TURBOMAP ALGORITHM
In this section, we present three strategies to improve the label computation of the SeqMapII algorithm, which is the most time-consuming step. First, we prove the monotone property of the node labels and develop a new procedure for computing a tighter lower bound with a single $K$-cut computation. Second, we propose a new approach to compute $K$-cuts on much smaller partial flow networks, which are built incrementally during the $K$-cut computation. Third, SCC partitioning and depth-first-search (DFS) ordering are used to eliminate much redundant label computation and reduce the number of labeling iterations to further speed up the algorithm.
A. Label Update with Single K-Cut Computation
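In SeqMapII [22], to compute the new lower bound for a node, it is necessary to perform binary search among all possible candidate values, which requires $O(\log(Kn^2))$ $K$-cut computations. In our approach, we compute a tighter lower bound with a single $K$-cut computation to speed up the algorithm by a factor of $\log(Kn^2)$. Let
$$L(v) = \max\{\, l(u) - \phi \cdot w(e) \mid e(u, v) \in E \,\}.$$
We update the lower bound on the value of the label of $v$ as follows:
$$l_{new}(v) = \begin{cases} L(v) & \text{if there exists a } K\text{-cut } (X, \overline{X}) \text{ with } h(X, \overline{X}) \le L(v) \\ L(v) + 1 & \text{otherwise.} \end{cases}$$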
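The update rule can be sketched in a few lines. The sketch assumes the reconstructed form of the rule, namely that the candidate bound is the maximum of $l(u) - \phi \cdot w(e)$ over the fan-in edges and that the result is either this value or one more; the $K$-cut test is passed in as a callback, and the names `update_label` and `exists_k_cut_with_height` are hypothetical.

```python
def update_label(v, fanin_edges, lower, phi, exists_k_cut_with_height):
    """One label update using a single K-cut test.

    fanin_edges: list of (u, w) pairs for the edges u -> v with w FF's.
    lower: dict of current lower bounds on the labels.
    exists_k_cut_with_height(v, h): oracle that reports whether a K-cut
        of height at most h exists for v (hypothetical callback).
    """
    bound = max(lower[u] - phi * w for u, w in fanin_edges)
    return bound if exists_k_cut_with_height(v, bound) else bound + 1

lower = {"a": 2, "b": 3}
fanins_of_v = [("a", 0), ("b", 1)]
print(update_label("v", fanins_of_v, lower, 2, lambda v, h: True))   # 2
print(update_label("v", fanins_of_v, lower, 2, lambda v, h: False))  # 3
```

Only one oracle call is made per update, in contrast to a binary search over candidate values that would call it once per probe.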
Obviously, can be computed with single -cut computation. Recall that this result is similar to Lemma 2 in [5] , which applies to combinational circuits only. The correctness of our approach is based on the fact that , where is the lower bound computed in SeqMapII [22] . This can be proved based on the monotone property of node labels. In a sequential circuit, which has a mapping solution with a clock period of no more than a given , we say that the set of its node labels is monotone if for any edge , .
Theorem 1 (Monotone Property): In a sequential circuit that has a mapping solution with a clock period of no more than a given $\phi$ under retiming, the node labels are monotone. That is, $l(u) - \phi \cdot w(e) \le l(v)$ for every edge $e(u, v)$ in the retiming graph of the circuit.
Proof: For each edge in the original circuit, there exists a simple mapping solution such that . Let LUT denote the -LUT rooted at in . We consider the following two cases.
Case 1: is a fan-in to LUT . According to the definition of -values, the -values of and in satisfy . Since by definition, we have .
is covered inside LUT , as shown in Fig. 3(a) . Let be the number of FF's on edge and be the number of FF's on edge LUT for each fan-in of LUT . In formation of LUT , we have to push those FF's on back to LUT 's fan-in edges as shown in Fig. 3(b) . 5 Let be the sub-DAG rooted at inside LUT . 6 The edge weight after retiming of each fan-in edge of is . The edge weight for the rest of the fan-ins of LUT remains unchanged, i.e., . Since can be covered by a -LUT by replicating explicitly outside LUT to form LUT , we get another simple mapping solution as shown in Fig. 3(d) . Note that the weight of 5 Note that the FF's cannot be pushed down to the output of LUT v , as M is a simple mapping solution. 6 Note that the FF's inside H u need to be pushed back on edge e(x i ; H u ) as well. For ease of presentation, we assume that wx includes both the numbers of FF's originally on edge e(x i ; H u ) and from inside H u . edge LUT is because those FF's were pushed back only on edges LUT . The -value in is
Since every input of LUT is also an input of LUT This concludes that for any edge in the original circuit. Let one iteration denote the computation process where is updated once for every node (in an arbitrary order). We prove the following.
Theorem 2: For a sequential circuit that has a mapping solution with a clock period under retiming of no more than a given , the inequalities hold during every iteration.
Proof: It is clear that based on the definitions of and . We prove by mathematical induction. Initially, . Now suppose that holds for every node at the current iteration. We prove that the newly updated lower bound holds for every node . Since , we have . The result is contradictory to the assumption that there is no -cut in with height of no more than .
Note that the above results depend on neither the order of label update nor the number of iterations performed. Therefore, we can conclude that the inequalities hold for every node during every iteration.
B. K-Cut Computation on Partial Flow Networks
In this subsection, we present a new approach that decides the outcome of each label update by max-flow computation on partial flow networks, which are much smaller than the flow network of the expanded circuit used in [22]. With the approach in SeqMapII [22], one needs to build the expanded circuit, construct the corresponding flow network, and then decide the existence of a $K$-cut by max-flow computation. In TurboMap, however, we construct the flow network incrementally without constructing the entire expanded circuit. More importantly, we construct a flow network that is just large enough to determine whether a $K$-cut exists. Note that all the previous flow-computation-based FPGA mapping algorithms [5], [7], [23] build the entire flow network before the max-flow computation. Thus, they are less efficient than our approach and can be improved in a similar way.
The basic idea of our algorithm is that, although the flow network for the expanded circuit is very large, the union of the shortest augmenting paths (in terms of the number of edges) is usually much smaller. (Note that we only need to determine whether or not the value of a max-flow is no more than $K$, so searching for shortest augmenting paths is enough.) So if we start from the root node and grow the flow network incrementally during the $K$-cut computation, the flow network constructed would be much smaller than the full one. We call the flow networks constructed by our approach the partial flow networks.
When updating the label lower bound for node , we construct the partial flow network, denoted , directly from . As shown in Fig. 4(c) , the edge direction in is reversed from that in [shown in Fig. 4(a) ]. The is the source of , and all the for PI will be connected to the sink of . A node in an expanded circuit of is critical if . The basic idea of partial flow network construction is to perform breadth-first search on during the construction of . We maintain a first-in, first-out (FIFO) queue , which initially includes only node , the source of . Each time, we fetch a node from to process and add new nodes to the end of until is empty or has an edge to the sink . Suppose is the current node fetched from the queue. If has fan-ins in the partially constructed , we put the fan-ins to the end of . If, however, does not have fanins and is not a PI in , we create the fan-in edges for and add new nodes to the flow network as follows. For each fan-in edge of in , we create two nodes and if they have not been created and put first and then to the end of . We add a new edge and assign the flow capacity to be if is a critical node, or "1" otherwise. Then we add edge with flow capacity of . If is a PI in , we connect to the sink with flow capacity of and find one augmenting path. Whenever an augmenting path is found, we augment the path, clear , and start from again to search for another shortest augmenting path until no more augmenting paths exist [in this case, we assign ], or we find the th augmenting path [here, a -cut does not exist and we assign ]. Let us look at an example of constructing the partial flow network for node in shown in Fig. 4(a) . Node is a primary input with . Each black bar represents an FF. For and , suppose . We now compute . Since , we only need to decide whether or . The construction of the partial flow network is shown step by step in Fig. 4(b)-(g) . At first, we create as the source of the partial flow network and put it in an FIFO queue . 
Then, for getting from , nodes and edges with flow capacity of will be created based on edges in . Nodes will be put to the end of . In the following, will be fetched successively from , and nodes and edges will be created. Since is critical [because ], the flow capacity of edge is . On the other hand, is not critical [because ], so the flow capacity of edge is one. The current flow network is shown in Fig. 4(c) . Since is a PI, a new edge with flow capacity of will be created, and one shortest augmenting path is found. After augmenting this path, we clear and start from again to search for another augmenting path. The flow network is shown in Fig. 4(d) . When reaching , which was created in previous steps, we need to create the fan-out edges of and add two pairs of nodes and . The new flow network is shown in Fig. 4(e) . We then create the fan-out edges of and add two nodes , as shown in Fig. 4 (f). Now we find and augment another augmenting path. The new flow network is shown in Fig. 4(g) .
As shown in Fig. 5 , after finding three augmenting paths, there exist no more. The value of max-flow is and therefore . The light shaded area is the partial flow network , which corresponds to the expanded circuit in this example. As shown in Fig. 5 , is much smaller than the entire flow network corresponding to . The heavy shaded area shown in Fig. 5 , of the min-cut in , forms a -LUT rooted at . Since only the first shortest augmenting paths are searched, the incrementally constructed flow networks are much smaller than the flow networks corresponding to the large . Our experimental results on MCNC and ISCAS benchmarks show that the average numbers of nodes in the partial flow networks are always far less than . As a result, each label update takes only time and space in practice.
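The construction described above can be approximated by the following sketch, which grows a node-split flow network lazily during breadth-first search and stops as soon as more than $K$ units of flow have been pushed. It is a simplification under stated assumptions: the expanded circuit is given as a finite DAG through the hypothetical callbacks `fanins`, `is_pi`, and `is_critical`, a capacity of $K + 1$ stands in for infinity, and the bound on the augmenting path length is omitted.

```python
from collections import deque

def has_k_cut(root, fanins, is_pi, is_critical, K):
    """True if the max-flow from the root to the PI side is at most K."""
    INF = K + 1                      # any capacity above K acts as infinite
    SINK = (None, "sink")
    source = (root, "out")
    cap = {source: {}, SINK: {}}     # residual capacities, built on demand
    expanded = set()

    def add_edge(u, v, c):
        cap.setdefault(u, {})
        cap.setdefault(v, {})
        cap[u][v] = cap[u].get(v, 0) + c
        cap[v].setdefault(u, 0)      # reverse residual edge

    def expand(x_out):
        # Create the split node pairs of the fan-ins of x on first visit.
        expanded.add(x_out)
        for y in fanins(x_out[0]):
            y_in, y_out = (y, "in"), (y, "out")
            if y_in not in cap:
                add_edge(y_in, y_out, INF if is_critical(y) else 1)
                if is_pi(y):
                    add_edge(y_out, SINK, INF)
            add_edge(x_out, y_in, INF)

    flow = 0
    while flow <= K:
        parent = {source: None}
        queue = deque([source])
        while queue and SINK not in parent:
            x = queue.popleft()
            if x[1] == "out" and x not in expanded and not is_pi(x[0]):
                expand(x)
            for y, c in cap[x].items():
                if c > 0 and y not in parent:
                    parent[y] = x
                    queue.append(y)
        if SINK not in parent:
            return True              # no augmenting path left: flow <= K
        path, y = [], SINK
        while parent[y] is not None:
            path.append((parent[y], y))
            y = parent[y]
        push = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= push
            cap[v][u] += push
        flow += push
    return False                     # more than K units of flow were found

fan = {"r": ["a", "b"], "a": ["p1", "p2"], "b": ["p2", "p3"]}
print(has_k_cut("r", lambda x: fan.get(x, []),
                lambda x: x.startswith("p"), lambda x: False, 2))  # True
```

In the example, nodes `a` and `b` form a cut of size two, so the test succeeds for $K = 2$; marking `a` as critical forces the cut down to the PI's and the same query then fails.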
To bound the partial flow networks to be no larger than the flow network of , we add an additional criterion to the partial flow network construction. Since the shortest path from a leaf to the root in is bounded by and each node needs to be split into two nodes to construct the corresponding flow network, the shortest path from the root to a leaf in the flow network of is bounded by . Therefore, we limit the augmenting path length to be no more than in . In other words, if the shortest augmenting path length from the source to during the construction of is , we mark as a leaf and connect it directly to the sink without growing further to 's fan-ins. Let be the corresponding flow network of . We prove that , and there is a -cut in if and only if there is a -cut in . Theorem 3:
. Proof: Let denote the shortest path length (in terms of the number of edges) from the root to in the residue flow network after pushing flows. Based on Lemma 27.8 in [11] , the shortest path distance in the residual graph is monotonically increasing if we always augment the shortest augmenting paths. So for any leaf . This implies that the shortest path distance from the root to every leaf in is less than . So . One important property of the min-volume min-cut is that every node in is reachable from the source in the residual graph of a maximum flow based on Lemma 6 in [5] .
Since is -feasible, has at most nodes. The reason is that if , at least one node in the original circuit has duplications in with different weights. (Note that each node in becomes a node pair with the same weight in .) Since every node is reachable from the PI's (nodes isolated from the PI's can be deleted easily with a preprocessing), there must be a path of in the original circuit from a PI to . Thus, for each duplication , there is a unique duplication of in . It means that the max-flow in is at least , and the cut is not -feasible. Let be a maximum flow with . Since any is reachable from the source and , the shortest path length in the residual graph . Therefore, for any flow with , . Note that if and only if . Thus, any in is also in . Since any is connected directly to a node , where is a node pair and will be created together in , thus .
C. SCC Partitioning and DFS Ordering
An SCC of a retiming graph is a maximal set of vertices $S$ such that for every pair of nodes $u$ and $v$ in $S$, there are both paths $u \leadsto v$ and $v \leadsto u$. Clearly, the labels of two nodes $u$ and $v$ are mutually dependent only if they are in one SCC. For $u$ and $v$ in different SCC's, e.g., $u \leadsto v$ but $v \not\leadsto u$, the label of $u$ can be computed before the label of $v$ without iterations between $u$ and $v$.
TurboMap first computes all the SCC's and sorts them in a topological order in linear time based on the algorithm in [12] . Then it labels the nodes in each SCC together. For different SCC's, the labels are computed separately in the topological order of SCC's from PI's to PO's. Our results show that with the SCC partitioning, the computation time of TurboMap can be reduced by 50% on average.
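As an illustration of this step, the sketch below computes the SCC's of a small graph and returns them in topological order from inputs to outputs. It uses Tarjan's algorithm as a stand-in, which may differ from the linear-time algorithm cited as [12]; the function name `sccs_in_topological_order` and the adjacency-dict encoding are assumptions, and the recursive form is only suitable for small graphs.

```python
def sccs_in_topological_order(graph):
    """graph: dict mapping each node to a list of its successors.

    Returns a list of SCC's (each a list of nodes), ordered so that every
    SCC appears before the SCC's it feeds.
    """
    index, low = {}, {}
    stack, on_stack, sccs = [], set(), []

    def visit(v):
        index[v] = low[v] = len(index)
        stack.append(v)
        on_stack.add(v)
        for w in graph.get(v, ()):
            if w not in index:
                visit(w)
                low[v] = min(low[v], low[w])
            elif w in on_stack:
                low[v] = min(low[v], index[w])
        if low[v] == index[v]:       # v is the root of an SCC
            comp = []
            while True:
                w = stack.pop()
                on_stack.discard(w)
                comp.append(w)
                if w == v:
                    break
            sccs.append(comp)

    for v in graph:
        if v not in index:
            visit(v)
    sccs.reverse()                   # Tarjan emits SCC's sinks-first
    return sccs

g = {"pi": ["a"], "a": ["b"], "b": ["a", "c"],
     "c": ["d"], "d": ["c", "po"], "po": []}
for comp in sccs_in_topological_order(g):
    print(sorted(comp))
```

Labeling would then iterate inside one SCC until its labels converge before moving to the next SCC in the returned order.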
The computation order is also important to the rate of convergence of labeling computation. In [22] , it is proved that the number of labeling iterations is bounded by , which is too large in practice. In TurboMap, we compute the node labels in the order based on DFS from the outputs to the inputs of each SCC. If node is searched after node , we update the label of before in every iteration. Our experimental results show that if there is a feasible solution for a target clock period, all the node labels can be computed in 5-20 iterations in almost all the cases, which is much less than the worst case bound of under an arbitrary computation order.
D. Area Minimization
After obtaining the minimum clock period and all the node labels, we can form a mapping with retiming solution by implementing, for each node $v$, the LUT rooted at $v$ that is given by the min-cut computed during label computation. Although this solution has the minimum clock period after retiming, the area may not be good. In this subsection, we propose a heuristic to compute a low-cost $K$-cut for every node to reduce the number of LUT's. Since the number of FF's is undetermined before retiming, we reduce only the number of LUT's during mapping, while minimizing the number of FF's during a postprocessing of retiming [17].
There are two approaches to reduce the numbers of LUT's in final mapping solutions. First is to enlarge the volume of each -cut, as in [5] . Second is to maximize the sharing of fan-ins between LUT's as follows.
Similar to CutMap [7], we define the cost of a node to be zero if the node has already been marked as an LUT root, and one otherwise. Initially, all PI's, PO's, and nodes with large fan-out numbers are marked as LUT roots. Our approach begins with a FIFO queue containing all PO's. For every node $v$ fetched from the queue, we compute a low-cost $K$-cut, put all nodes in the cut-set at the end of the queue, and mark them as LUT roots. If the cost of the cut for $v$ computed during labeling is zero, the cut has the minimum cost. Otherwise, we try to compute another $K$-cut for $v$ with smaller cost, using two additional max-flow computations in a heuristic way.
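The queue-driven traversal and the cut cost can be sketched as follows. This is a simplified illustration that works on plain node names and ignores the FF weights of expanded-circuit nodes; the cut selection itself is delegated to a hypothetical callback `choose_cut`, and the names `cut_cost` and `select_lut_roots` are not from the paper.

```python
from collections import deque

def cut_cost(cut, lut_roots):
    """Number of cut nodes that are not yet marked as LUT roots."""
    return sum(1 for u in cut if u not in lut_roots)

def select_lut_roots(pos, pis, initial_roots, choose_cut):
    """Walk from the PO's toward the PI's, marking cut nodes as LUT roots.

    choose_cut(v, roots): returns the cut-set chosen for node v
        (hypothetical callback; ideally a low-cost K-cut).
    """
    roots = set(initial_roots) | set(pis) | set(pos)
    chosen = {}
    queue = deque(pos)
    while queue:
        v = queue.popleft()
        if v in chosen or v in pis:
            continue
        chosen[v] = list(choose_cut(v, roots))
        for u in chosen[v]:
            roots.add(u)
            queue.append(u)
    return roots, chosen

fan = {"po": ["x", "y"], "x": ["i1", "i2"], "y": ["i2", "i3"]}
roots, chosen = select_lut_roots(
    ["po"], {"i1", "i2", "i3"}, [], lambda v, r: fan[v])
print(sorted(chosen))                 # ['po', 'x', 'y']
print(cut_cost(["x", "y"], {"x"}))    # 1
```

Each key of `chosen` corresponds to one LUT in the final solution, so cuts that reuse already marked roots add no new LUT's.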
First, we compute a zero-cost min-cut for node by constructing a new flow network and computing the max-flow. Let the cut capacity of a node be the edge capacity of the edge connecting the node pair of the node in the new flow network. 7 We assign the cut capacity of to be zero if is noncritical, i.e., , and has already be marked 7 Recall that each node in the expanded circuit corresponds to a pair of nodes in the flow network. as an LUT root. Otherwise, we assign the cut capacity to be . If the min-cut in the new flow network is -feasible, it corresponds to a zero-cost -cut on . Otherwise, there does not exist any zero-cost -cut on . In this case, if the mincut we computed happens to have cost one, it corresponds to a min-cost -cut on . If, however, the cost of the min-cut is larger than one, we try to find a lower cost -cut with one additional -cut computation by assigning different cut capacities to LUT roots and non-LUT roots based on the min-cut size on , as shown in Table I . The cut capacity is assigned in such a way that a min-cut in the new network corresponds to a min-cost cut on if the min-cut is -feasible. For example, suppose on one min-cut size is three with cost of two, and there exists another cut with a cut size of four and cost of one. Based on the assignment table, the cut capacity of the min-cut is . The cut capacity of the other cut is , however, so it is the min-cut on the new flow network and can be found through max-flow computation. Note that this approach is not guaranteed to find the min-cost -cut on because the min-cut on the new flow network may not be -feasible. For the previous example, if there is also a cut with cut size of and cost of zero, the cut capacity is six. It is also a min-cut on the new flow network but not -feasible, and cannot be implemented in one LUT. In the case that the min-cut found on the new flow network is not -feasible, we keep the min-cut on for node . 
After getting the low-cost -cut of every node , we then try to pack single-output fan-ins of the cut into (or LUT ). Our experimental results show that this approach is very efficient (with only two additional max-flow compu- 
E. Summary of the TurboMap Algorithm
In the preceding subsections, we have presented three strategies to speed up the label computation of the SeqMapII algorithm [22] and one heuristic method to reduce the area. For a target clock period, our algorithm, named TurboMap, performs SCC partitioning at first. Then, in topological order from the PI's to the PO's, TurboMap computes the node labels for each SCC separately. For each SCC, a number of efficient label update iterations are performed.
To find the minimum clock period, TurboMap performs a binary search using the upper bound of the clock period computed by FlowMap [5] on each combinational subcircuit independently. After getting the minimum clock period and the low-cost $K$-cut for every node, TurboMap generates the mapping solution and then performs LUT packing [5] and retiming to achieve the minimum clock period [17], [22].
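The outer search over the target clock period can be sketched as a standard binary search. The sketch assumes that feasibility is monotone in the clock period and that the supplied upper bound is itself feasible; `feasible` is a hypothetical callback standing in for one full label computation.

```python
def min_clock_period(upper_bound, feasible):
    """Smallest integer period in [1, upper_bound] accepted by feasible.

    feasible(phi): True if a mapping with retimed clock period at most
        phi exists (hypothetical callback, assumed monotone in phi).
    """
    lo, hi = 1, upper_bound
    while lo < hi:
        mid = (lo + hi) // 2
        if feasible(mid):
            hi = mid          # mid works, so the optimum is at most mid
        else:
            lo = mid + 1      # mid fails, so the optimum is larger
    return lo

print(min_clock_period(7, lambda phi: phi >= 3))  # 3
```

The number of label computations is logarithmic in the upper bound, which matches the $\log n$ factor in the stated run time.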
Theorem 5: For a $K$-bounded sequential circuit with $n$ nodes, the TurboMap algorithm produces a $K$-LUT mapping solution with the minimum clock period under retiming in $O(K^2 n_l n |P_v| \log n)$ time, where $n_l$ is an upper bound on the labeling iteration number and $|P_v|$ is an upper bound on the numbers of nodes in the partial flow networks.
Proof: Each $K$-cut computation on the partial flow network takes $O(K^2 |P_v|)$ time and $O(K |P_v|)$ space. Each label update iteration needs $n$ $K$-cut computations. The label computation for a given target clock period takes $O(K^2 n_l n |P_v|)$ run time with an $O(K |P_v|)$ space requirement. Clearly, the minimum clock period under retiming is less than $n$. With binary search, the total run time of label computation is $O(K^2 n_l n |P_v| \log n)$. There are at most $n$ low-cost $K$-cut computations, each of which takes $O(K^2 |P_v|)$ run time, for $O(K^2 n |P_v|)$ run time in total. So the total run time of TurboMap is $O(K^2 n_l n |P_v| \log n)$.
In practice, $n_l < n$ and $|P_v| < n$, so TurboMap can be finished in $O(K^2 n^3 \log n)$ time. Since the original SeqMapII algorithm takes $O(K^3 n^5 \log(Kn^2) \log n)$ time and $O(K^2 n^2)$ space to compute optimal mapping solutions with the minimum clock period under retiming [22], TurboMap is about $K n^2 \log(Kn^2)$ times faster with $Kn$ times less memory. In fact, our monotone property (first presented in [10]) was also adopted by the authors of SeqMapII to improve its performance in a later publication [24].
V. EXPERIMENTAL RESULTS
The TurboMap algorithm has been implemented in the C programming language on Sun SPARC workstations and incorporated into the SIS package [26] and the RASP FPGA synthesis package [9] . The test set includes 13 MCNC finite state machines (FSM's) and four large ISCAS benchmarks as used in [10] . The initial gate-level circuits for technology mapping are shown in Table II . Columns PI and PO list the numbers of primary inputs and primary outputs, respectively, of the circuits. Columns GATE and FF list the numbers of gates and FF's, respectively, in the circuits. Column lists the clock periods of the circuits after retiming but before technology mapping.
The experiments were run on a SUN SPARC5 workstation with 96 MB memory.
$K$ is set to five. FlowMap followed by retiming is performed to get an upper bound (shown in column in Table II) of the clock period for each example. LUT packing [5] and retiming are performed as postprocessing steps to get the final mapping solutions. Table III shows the comparison of TurboMap with SeqMapII [22]. SeqMapII has a parameter selecting for . We choose (SeqMapII-opt, which guarantees the optimal solution) and (SeqMapII-heu, which was used in the experiments by Pan and Liu [22] as a heuristic method), respectively, for each example. Columns LUT and FF list the numbers of LUT's and FF's, respectively, in the final solutions. The columns list the minimum clock periods of the final solutions. The CPU columns list the CPU time in seconds. Note that both TurboMap and SeqMapII-opt can obtain mapping solutions with the minimum clock periods under retiming, but TurboMap is 2.8 10 times faster. Moreover, TurboMap is more than 8 times faster than SeqMapII-heu, which may generate suboptimal solutions. Compared with SeqMapII-heu, TurboMap produces mapping solutions with 23% smaller clock periods, 28% fewer LUT's, and 27% fewer FF's.
To show the effect of our $K$-cut computation on partial flow networks, we compare the numbers of nodes of the partial flow networks with those of and used in SeqMapII [22]. The results are shown in Table IV. The column NODE lists the number of nodes in the retiming graph for each example, which is the sum of the numbers of the PI's, the PO's, and the gates. The column FF lists the number of FF's in the original circuits. The columns with subscripts and list the maximum and average numbers of nodes, respectively, of the partial flow networks or the expanded circuits over all nodes in the original circuits and generated by each algorithm. For the last four examples, cannot be generated due to either time or space limitations. The results show that the average sizes of the partial flow networks are only slightly larger than and 314 times smaller than . This result, together with our efficient label update and SCC partitioning with DFS ordering, provides an explanation of why TurboMap is significantly faster than SeqMapII-opt.

Table V shows the comparison of TurboMap with the conventional design flow of using FlowMap [5] or CutMap [7] for mapping each combinational circuit independently followed by optimal retiming, whose results are listed in columns "FlowMap + Retiming" and "CutMap + Retiming," respectively. The results show that TurboMap can reduce the clock periods by 14% on average compared with both methods, with 4% fewer LUT's as compared with "FlowMap + Retiming" but 4% more LUT's as compared with "CutMap + Retiming." TurboMap also uses 49% more FF's to reduce the clock period. Note that the number of FF's will not affect area significantly because it is usually much less than the number of LUT's in a mapping solution. The final PLB count of FPGA's will be determined by the number of LUT's.
VI. CONCLUSIONS AND FUTURE WORK
We presented an improved algorithm, named TurboMap, for technology mapping with retiming for optimal clock periods. We proved the monotone property of node labels. Three strategies are used to improve on the performance of SeqMapII: efficient label update with a single $K$-cut computation, much smaller partial flow networks, and SCC partitioning with DFS ordering. Area reduction is also considered. The experimental results show that TurboMap is about 2.8 10 times faster than SeqMapII in computing optimal solutions. TurboMap is even 8 times faster than the SeqMapII-heu heuristic algorithm. As a result, we conclude that optimal mapping for minimum clock period under retiming can be carried out efficiently for large circuits in practical use. Furthermore, there is no area overhead compared to conventional approaches to sequential circuits (FlowMap [5] or CutMap [7] followed by retiming). In our future work, we want to develop an efficient algorithm to compute the initial state of the mapping solution and combine resynthesis and pipelining techniques to further reduce the clock periods. We also want to investigate the open issues in this work of whether and whether the label computation can converge in time in the worst case.
