Abstract-Due to the precedence constraints among vertices, the partitioning problem for time-multiplexed field-programmable gate arrays (TMFPGAs) is different from the traditional one. In this paper, we first derive logic formulations for the precedence-constrained partitioning problems and then transform the formulations into integer linear programs (ILPs). The ILPs can handle the precedence constraints and minimize cut sizes simultaneously. To enhance performance, we also propose a clustering method to reduce the problem size. Experimental results based on the Xilinx TMFPGA architecture show that our approach outperforms the list-scheduling (List), the network-flow-based (FBP-m) (Liu and Wong, 1998), and the probability-based (PAT) (Chao, 1999) methods by respective average improvements of 46.6%, 32.3%, and 21.5% in cut sizes. Our approach is practical and scales well to larger problems; the empirical runtime grows close to linearly in the circuit size. More importantly, our approach is very flexible and can readily extend to partitioning problems with various objectives and constraints, which makes the ILP formulations superior alternatives for the TMFPGA partitioning problems.
I. INTRODUCTION
Time-multiplexed field-programmable gate arrays (TMFPGAs) improve logic efficiency by dynamically reusing hardware. Currently there is fast-growing interest in TMFPGAs for reconfigurable computing. In TMFPGAs, a large design can be partitioned into multiple stages to share the same smaller physical device in different time frames. Several different architectures have been proposed, e.g., the Xilinx architecture [17], the virtual element gate array [13], the dynamically programmable gate array [3], [7], dharma [1], etc. All these models allow dynamic reuse of logic blocks and wire segments by reprogramming on-chip static random access memory (SRAM) bits.
Fig. 1 shows the Xilinx TMFPGA configuration model [17]. The TMFPGA emulates a single large design through multiple configurations. A circuit configuration can be partitioned into multiple stages and stored in the configuration memory planes (CMPs). The TMFPGA can hold only one active configuration in any time frame. Each configuration is called a microcycle and one pass through all microcycles is called a user cycle. All combinational logic is evaluated and flip-flop values are updated in one user cycle. The target architecture consists of an array of augmented XC4000-style configurable logic blocks (CLBs) [17], [18]. Each CLB includes microregisters (MRs) to store the intermediate values of combinational logic for use in later microcycles and also to hold the flip-flop values for use in the next user cycle. A microcycle starts with saving all CLB results of the previous microcycle in MRs and then a new configuration is loaded into the active configuration memory.
Manuscript received June 1, 2000; revised February 26, 2001. This work was supported in part by the National Science Council of Taiwan, R.O.C., under Grant NSC-89-2215-E-009-054. This paper was recommended by Associate Editor C.-K. Cheng.
G.-M. Wu is with the Department of Information Management, Nan-Hua University, Dalin Chiayi 622, Taiwan (e-mail: gmwu@mail.nhu.edu.tw).
J.-M. Lin is with the Department of Computer and Information Science, National Chiao Tung University, Hsinchu 30010, Taiwan (e-mail: gis87808@cis.nctu.edu.tw).
Y.-W. Chang is with the Department of Electrical Engineering, National Taiwan University, Taipei 106, Taiwan (e-mail: ywchang@cc.ee.ntu.edu.tw).
Publisher Item Identifier S 0278-0070(01)08887-X.
Because the logic and interconnect needed for a circuit are time-multiplexed on a TMFPGA, its partitioning problem is different from traditional ones. (This partitioning is similar to the scheduling problem in high-level synthesis [12].) The major difference is that the execution order of circuit elements must follow the precedence constraints in the TMFPGAs. The TMFPGA partitioning problem has been studied in the recent literature [4]-[6], [15], [17]. Chang and Marek-Sadowska in [4] and [5] presented the list-scheduling methods (List) for buffer-register and cut-size minimization for various TMFPGA architectures. Liu and Wong in [15] proposed a network-flow-based algorithm (FBP-m) for multistage precedence-constrained partitioning for the Xilinx-like TMFPGAs. Recently, Chao et al. in [6] proposed a probability-based approach (PAT) for the partitioning for the Xilinx-like TMFPGAs. The probability-based approach combines second-order information and a stochastic-gain function [9], the Fiduccia and Mattheyses partitioning-based iterative-improvement method [10], and the maximum fan-out-free cone-based clustering [8]. It gives the best results among the previous works for the TMFPGA partitioning problem.
In this paper, we present generic integer linear programming (ILP) formulations for the multistage precedence-constrained partitioning problems. We begin with a mathematical description of the partitioning objectives and constraints, which can easily be translated into integer linear programs. Unlike most existing methods, which can consider the precedence constraints and cut sizes only in some local stages at a time, the ILP-based method can consider those for all stages simultaneously and, thus, has a more global perspective to optimize given objectives. To enhance performance, we also propose a clustering method to reduce the problem size; the clustering provides a tradeoff between runtime and solution quality (in terms of CLB and interconnection costs). Experimental results based on the Xilinx TMFPGA architecture show that our approach outperforms the list-scheduling (List), the network-flow-based (FBP-m) [15], and the probability-based (PAT) [6] methods by respective average improvements of 46.6%, 32.3%, and 21.5% in cut sizes. More importantly, our algorithm is very practical and scales well to larger problems; the empirical runtime grows close to linearly in the circuit size. Its runtimes range from 38 min for the smallest circuit (s820) to about 6 h for the largest circuit (s35932). Moreover, our approach is very flexible and can readily extend to partitioning problems with various objectives and constraints, e.g., buffer-register minimization [4]. The flexibility makes the ILP formulations superior alternatives for the TMFPGA partitioning.
The remainder of this paper is organized as follows. Section II formulates the TMFPGA partitioning problem. Section III presents the ILP formulations for the problem. Section IV proposes a clustering method to reduce runtime. Section V extends our approach to the TMFPGA partitioning problems with various objectives and constraints.
In the Xilinx TMFPGA architecture [17], the following precedence constraints must be satisfied: 1) each combinational vertex must be scheduled in a stage no later than all its output vertices and 2) each flip-flop vertex must be scheduled in a stage no earlier than all its input and output vertices. These constraints define a partial temporal ordering on the vertices in the circuit. Let Pre(v) be the precedence of a vertex v. For two vertices $v_1$ and $v_2$, we define $\mathrm{Pre}(v_1) \le \mathrm{Pre}(v_2)$ if $v_1$ must be scheduled no later than $v_2$. In other words, in order to produce the correct result in one user cycle for a partitioned time-multiplexed circuit, the virtual CLBs must be evaluated in the proper order (see Fig. 2) and the MRs used in a microcycle cannot exceed the number of actual CLBs in a TMFPGA. We call this type of partitioning precedence-constrained partitioning.
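The two precedence rules above can be checked mechanically for a given stage assignment. The following is a minimal sketch, assuming a circuit given as a fan-out map; the names `precedence_violations`, `stage`, `fanout`, and `flip_flops`, as well as the three-vertex example, are hypothetical and not taken from the paper.

```python
from typing import Dict, List, Set, Tuple


def precedence_violations(
    stage: Dict[str, int],
    fanout: Dict[str, List[str]],
    flip_flops: Set[str],
) -> List[Tuple[str, str]]:
    """Return every connection (u, v) whose stage assignment breaks a rule."""
    bad = []
    for u, outputs in fanout.items():
        for v in outputs:
            ok = True
            # Rule 1: a combinational vertex is no later than its outputs.
            if u not in flip_flops and stage[u] > stage[v]:
                ok = False
            # Rule 2: a flip-flop is no earlier than its outputs ...
            if u in flip_flops and stage[u] < stage[v]:
                ok = False
            # ... and no earlier than its inputs.
            if v in flip_flops and stage[v] < stage[u]:
                ok = False
            if not ok:
                bad.append((u, v))
    return bad


# Hypothetical loop a -> b -> ff -> a, where ff is a flip-flop.
example_fanout = {"a": ["b"], "b": ["ff"], "ff": ["a"]}
feasible = {"a": 1, "b": 2, "ff": 2}
infeasible = {"a": 2, "b": 1, "ff": 2}
print(precedence_violations(feasible, example_fanout, {"ff"}))    # []
print(precedence_violations(infeasible, example_fanout, {"ff"}))  # [('a', 'b')]
```

An empty result means the assignment respects the partial temporal ordering; any returned pair identifies a connection that would be evaluated out of order.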
By the above constraints, we formulate the MP-constrained partitioning (MPCP) problem for TMFPGAs as follows.
MP-Constrained Partitioning Problem:
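Input: Given a circuit G = (V, E) and the number of MPs p in a TMFPGA.
Problem: Determine the precedence-constrained partitioning of V into p stages $V_1, V_2, \ldots, V_p$ with the following objectives.
1) Balance objective: Minimize $\max\{w(V_i) \mid 1 \le i \le p\}$.
2) Min-cut objective: Minimize $\max\{|I_i| \mid 1 \le i \le p\}$.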
The logic in different stages shares the same physical CLBs in a time-multiplexed manner. Hence, the CLBs used in a microcycle cannot exceed the number of actual CLBs in a TMFPGA, i.e., the number of vertices in a stage must not exceed the number of actual CLBs. The balance objective, minimize $\max\{w(V_i) \mid 1 \le i \le p\}$, is to balance the sizes of stages so that the design can fit into a smaller physical device. The min-cut objective minimizes the maximum cut between successive stages. Fig. 3 shows a part of a design that has been partitioned into four MPs in a TMFPGA. Assume that a vertex requires a CLB and an interconnection requires an MR. For example, the partitioning shown in Fig. 3(a) needs five CLBs and five MRs while that shown in Fig. 3(b) uses only three CLBs and three MRs. Therefore, the partitioning shown in Fig. 3(b) is desirable.
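The two objectives can be evaluated for any candidate partitioning with a few lines of code. The sketch below assumes unit vertex weights, two-terminal connections, and that a connection whose endpoints lie in stages a < b occupies one MR at every stage boundary it spans; the function name `stage_costs` and the example circuit are hypothetical.

```python
from typing import Dict, List, Tuple


def stage_costs(
    stage: Dict[str, int], edges: List[Tuple[str, str]], p: int
) -> Tuple[int, int]:
    """Return (largest stage size, largest cut between successive stages)."""
    sizes = [0] * (p + 1)  # sizes[j]: number of vertices in stage j (1-based)
    for s in stage.values():
        sizes[s] += 1
    cuts = [0] * (p + 1)   # cuts[k]: connections crossing boundary k | k+1
    for u, v in edges:
        lo, hi = sorted((stage[u], stage[v]))
        for k in range(lo, hi):
            cuts[k] += 1
    return max(sizes[1:]), max(cuts[1:p], default=0)


# Hypothetical four-vertex circuit partitioned into two stages.
edges = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "c")]
print(stage_costs({"a": 1, "b": 1, "c": 2, "d": 2}, edges, 2))  # (2, 2)
print(stage_costs({"a": 1, "b": 2, "c": 2, "d": 2}, edges, 2))  # (3, 2)
```

Under these assumptions the first value corresponds to the CLB requirement and the second to the MR requirement of the partitioning.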
III. ILP FORMULATIONS FOR THE MPCP PROBLEM
In this section, we first describe the ILP formulations for the MPCP problem. To reduce the total execution time of a user cycle, we add the ILP formulations for timing associated with the temporal precedence graph (TPG).
A. Two-Terminal Nets
The notations used in our formulations are defined as follows. Suppose that a hypergraph G(V, E) with n vertices and m nets is partitioned into p stages.
In the MPCP problem, given a circuit G(V, E) and a TMFPGA with p MPs, the circuit is partitioned into p stages without violating the precedence constraints and the balance objective, such that the cost of the partitioning is minimized, where the cost consists of the maximum numbers of interconnections and CLBs needed in a stage. This cost can be minimized by the ILP formulations presented in the following.
The variables used in the formulations are as follows.
$M_c$
Integer variable that denotes the number of CLBs needed in the TMFPGA.
$M_r$
Integer variable that denotes the number of MRs (or interconnections) needed in the TMFPGA.
$x_{i,j}$
0-1 integer variable associated with $v_i$. $x_{i,j} = 1$ if $v_i$ is assigned to the stage j; otherwise, $x_{i,j} = 0$.
By the precedence constraint, the input vertex of a net must be assigned to a stage no later than its output vertex. Therefore, the two OR terms in the above equation cannot be zero at the same time. Equation (1) is mathematically equivalent to (2).
For a net $e_i \in E_c$, $y_{i,p}$ is always equal to zero, since the data of a combinational node is only used in the current user cycle.
The objective function is used to minimize the number of MRs. [...] which is formulated in constraint (7). Constraints (8) and (9) ensure that the precedence relations of the graph will be preserved. For any [...] ensures that the input vertex of the net will be scheduled in a stage no later (earlier) than its output vertex.
Constraint (10) states that any cut size between two adjacent stages cannot exceed $M_r$, because an interconnection in a cut needs an MR to save its signal.
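To illustrate what the two-terminal formulation optimizes, the following sketch enumerates every stage assignment of a tiny circuit and keeps the one with the smallest $M_r$. This exhaustive search is only a stand-in for an ILP solver and is not the formulation of this paper; it assumes combinational two-terminal nets only, a fixed CLB bound `max_clbs` in place of the variable $M_c$, and hypothetical names throughout.

```python
from itertools import product
from typing import Dict, List, Optional, Tuple


def solve_small_instance(
    vertices: List[str],
    edges: List[Tuple[str, str]],
    p: int,
    max_clbs: int,
) -> Optional[Tuple[int, Dict[str, int]]]:
    """Return (M_r, stage assignment) with minimum M_r, or None if infeasible."""
    best = None
    for assignment in product(range(1, p + 1), repeat=len(vertices)):
        stage = dict(zip(vertices, assignment))  # plays the role of x[i][j]
        # Precedence: the driver of a net is no later than its sink.
        if any(stage[u] > stage[v] for u, v in edges):
            continue
        # CLB bound: no stage holds more than max_clbs vertices.
        if any(assignment.count(j) > max_clbs for j in range(1, p + 1)):
            continue
        # Cut between stages k and k+1: nets with stage[u] <= k < stage[v].
        cuts = [
            sum(1 for u, v in edges if stage[u] <= k < stage[v])
            for k in range(1, p)
        ]
        m_r = max(cuts, default=0)
        if best is None or m_r < best[0]:
            best = (m_r, stage)
    return best


# Hypothetical circuit: a drives b, c, and d; b drives d.
result = solve_small_instance(
    ["a", "b", "c", "d"],
    [("a", "b"), ("a", "c"), ("a", "d"), ("b", "d")],
    p=2,
    max_clbs=2,
)
print(result)  # (2, {'a': 1, 'b': 2, 'c': 1, 'd': 2})
```

The enumeration grows as $p^n$ and is only usable for toy instances, which is why an ILP solver with branch and bound is the practical choice.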
B. Multiterminal Nets
In this section, we present the ILP formulation for multiterminal nets. We redefine the 0-1 integer variable $y_{i,k}$ associated with a multiterminal net $e_i = (v_i \to \langle v_{j_1}, v_{j_2}, \ldots, v_{j_q} \rangle)$. $y_{i,k} = 1$ if the net introduces an interconnection between stages k and k + 1; otherwise, $y_{i,k} = 0$. [...] $1 \le t \le q$). Therefore, we can rewrite constraint (9) for the net $e_i$ as follows: [...]
Considering the movable range estimated by the ASAP and ALAP scheduling, we reformulate constraints (7)-(9) [...] derived from the ASAP and ALAP scheduling is a necessary, but not sufficient, condition for v to be feasible in the bounded-delay precedence-constrained partitioning.
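A movable range of the kind mentioned above can be estimated as follows. This is a sketch under stated assumptions rather than the exact procedure of this paper: the circuit is a combinational DAG, path length is counted in vertices, and at most D vertices of any path fit in one stage. The name `movable_ranges` is hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter
from math import ceil
from typing import Dict, List, Tuple


def movable_ranges(
    fanout: Dict[str, List[str]], p: int, d: int
) -> Dict[str, Tuple[int, int]]:
    """Return {vertex: (earliest stage, latest stage)} for p stages, bound d."""
    preds = {v: set() for v in fanout}
    for u, outputs in fanout.items():
        for v in outputs:
            preds.setdefault(v, set()).add(u)
    order = list(TopologicalSorter(preds).static_order())

    head = {}  # vertices on the longest path ending at v (inclusive)
    for v in order:
        head[v] = 1 + max((head[u] for u in preds[v]), default=0)
    tail = {}  # vertices on the longest path starting at v (inclusive)
    for v in reversed(order):
        tail[v] = 1 + max((tail[w] for w in fanout.get(v, ())), default=0)

    # ASAP: a path of head[v] vertices needs at least ceil(head/d) stages.
    # ALAP: symmetric bound counted back from the last stage p.
    return {v: (ceil(head[v] / d), p - ceil(tail[v] / d) + 1) for v in order}


# Hypothetical chain a -> b -> c -> d, three stages, D = 2.
chain = {"a": ["b"], "b": ["c"], "c": ["d"], "d": []}
print(movable_ranges(chain, p=3, d=2))
```

Restricting the variables $x_{i,j}$ to stages inside each vertex's range removes variables from the ILP, but, as the text notes, lying inside the range is only a necessary condition for feasibility.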
In the following, we present the TPG and translate the constraints associated with the TPG (TPG constraints) into ILP formulations; these constraints precisely restrict the relative positions of nodes.
A pair of vertices $(v_i, v_j)$ is critical if the length of the longest path between $v_i$ and $v_j$ is equal to D + 1. (We do not need to consider those pairs of vertices with their longest paths greater than D + 1.) Thus, $v_i$ and $v_j$ cannot be assigned to the same stage if the pair $(v_i, v_j)$ is critical. Given a circuit G = (V, E), its corresponding TPG $G_t = (V_t, E_t)$ is constructed by $V_t = \{V_c \mid V_c \in V\}$ and $E_t = \{(v_i \to \langle v_j \rangle) \mid \text{pair } (v_i, v_j) \text{ is critical}\}$. Fig. 5 shows a TPG example that contains eight vertices $v_1, v_2, \ldots, v_8$. Assume that the delay bound in a stage is equal to two (i.e., D = 2). Then, two vertices cannot be assigned to the same stage if there exists a path between them and they are not consecutive. As an example in Fig. 5(a), the length of the longest path associated with $(v_1, v_7)$ is equal to three; therefore, they cannot be assigned to the same stage and we connect $(v_1, v_7)$ with a directed edge [see Fig. 5(b)]. Similarly, there are directed edges $(v_2, v_7)$, $(v_2, v_6)$, and $(v_5, v_8)$ in the TPG shown in Fig. 5.
For a sequential circuit, depth is the length of the longest path of the combinational part. Therefore, for a sequential circuit G(V, E), we can remove all nets in $E_f$ and all flip-flop vertices and obtain a directed acyclic graph $G_a(V_a, E_a)$. The TPG $G_t(V_t, E_t)$ can then be constructed from $G_a$ by a method similar to that discussed earlier. If there is no path between two vertices in $G_a$, there is no precedence relation between them. Thus, it suffices to consider every pair of connected vertices in $G_a$. For each vertex $v_i$, we add the edge $(v_i \to \langle v_j \rangle)$ to $E_t$ if the length of the longest path between $v_i$ and $v_j$ (a successor of $v_i$) equals D + 1. Given a circuit G(V, E), algorithm TPG_Generation shown in Fig. 6 generates a TPG associated with G(V, E).
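The construction described above can be sketched as a longest-path computation from every vertex of the acyclic graph. This is not the TPG_Generation algorithm of Fig. 6, which is not reproduced here; it is an illustration that assumes path length is counted in vertices (so two directly connected vertices have length two), and the name `generate_tpg` and the example graph are hypothetical. `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set, Tuple


def generate_tpg(fanout: Dict[str, List[str]], d: int) -> Set[Tuple[str, str]]:
    """Return the critical pairs (v_i, v_j) whose longest path has d + 1 vertices."""
    preds = {v: set() for v in fanout}
    for u, outputs in fanout.items():
        for v in outputs:
            preds.setdefault(v, set()).add(u)
    order = list(TopologicalSorter(preds).static_order())

    tpg_edges = set()
    for src in order:
        longest = {src: 1}  # longest[v]: vertices on the longest src -> v path
        for v in order:     # topological order, so longest[v] is final here
            if v not in longest:
                continue
            for w in fanout.get(v, ()):
                longest[w] = max(longest.get(w, 0), longest[v] + 1)
        tpg_edges.update(
            (src, v) for v, length in longest.items() if length == d + 1
        )
    return tpg_edges


# Hypothetical DAG: v1 -> v2 -> v3 -> v4 plus a shortcut v1 -> v3, with D = 2.
dag = {"v1": ["v2", "v3"], "v2": ["v3"], "v3": ["v4"], "v4": []}
print(sorted(generate_tpg(dag, 2)))  # [('v1', 'v3'), ('v2', 'v4')]
```

The pair (v1, v4) has a longest path of four vertices, which exceeds D + 1, so it is not added; it is already separated by the critical pairs that lie on that path.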
Two vertices in a critical pair cannot be assigned to the same stage. Therefore, we can incorporate the critical pairs into the MPCP formulation to ensure that the execution time of every stage does not exceed D. The constraint can be formulated as follows:
for each $(v_i \to \langle v_j \rangle) \in E_t$.
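One possible way to write the critical-pair restriction with the assignment variables $x_{i,k}$ of Section III-A is sketched below. It is an assumed linearization that is consistent with the statement that two vertices in a critical pair cannot share a stage; it is not necessarily the exact form used in the formulation summarized in Fig. 7.

```latex
x_{i,k} + x_{j,k} \le 1, \qquad k = 1, 2, \ldots, p,
\quad \text{for each } (v_i \to \langle v_j \rangle) \in E_t .
```

Since $x_{i,k}$ and $x_{j,k}$ are 0-1 variables, the inequality forbids both from being one for the same stage k, at the cost of p inequalities per TPG edge.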
Our ILP formulation is summarized in Fig. 7 and Theorem 1 states the correctness of the formulation.
Theorem 1: The problem MPCP has a solution if and only if all vertices V in G can be partitioned into p stages in the TMFPGA under the precedence and delay-bound constraints. 
D. Complexity of the PDMPCP Problem
The complexity of the PDMPCP problem can be analyzed in terms of the numbers of variables and constraints. In the PDMPCP problem, the number of stages is given and the values of the 0-1 variables $y_{i,k}$ [...] We analyze the number of equations needed for a circuit in the following. In Fig. 7, the numbers of equations required for constraints (21)-(23) and (28) are one, p, n, and p, respectively. For each two-terminal (multiterminal) net, we need an equation for constraint (24) or (25) [(26)
IV. SOLUTION SPACE REDUCTION
An effective clustering algorithm can greatly improve the quality of the precedence-constrained partitioning and speed up the partitioning algorithm by reducing the problem size. Sankar and Rose [16] proposed a new clustering metric that is effective in clustering traditional circuits, but may not generate feasible clusters due to the precedence constraints in the TMFPGA partitioning. Based on Sankar and Rose's algorithm [16], we propose in the following a clustering method that can consider the precedence constraints during clustering. Our clustering algorithm begins by randomly choosing some vertices as seeds. Each unclustered vertex connected to those seeds is assigned scores used to decide to which cluster the vertex belongs. The score $w_{v,c}$ for each candidate vertex v associated with a cluster c has two components:
1) the number of connections between the candidate vertex v and the cluster c being considered, with each connection weighted by the fan-out of the net on which it lies;
2) the number of nets that would be completely absorbed if this candidate vertex v were added to the cluster c.
A net is said to be absorbed by a cluster if all the vertices on that net are contained within that single cluster. Let $N_{v,c}$ denote the set of nets shared between the candidate vertex v and the cluster c, $P_v$ the set of pins on net $e \in N_{v,c}$, and $A_{v,c}$ the set of nets absorbed by adding the candidate vertex v to the cluster c; then the score can be expressed as [...]
With this function, vertices on low fan-out nets and on nets that are about to be absorbed are preferred when building the clusters. For a candidate vertex, we pick the highest score associated with a cluster and add the vertex to the cluster. This process is repeated until all vertices are clustered. The result is a netlist of clusters with absorbed nets removed.
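The two score components can be illustrated with a small function. The exact score expression is not legible in the text above, so the combination used here is an assumption: each shared net contributes the reciprocal of its pin count (so low fan-out nets weigh more) and each absorbed net contributes one. The name `cluster_score` and the example netlist are hypothetical.

```python
from typing import Dict, Set


def cluster_score(vertex: str, cluster: Set[str], nets: Dict[str, Set[str]]) -> float:
    """Score a candidate vertex against a cluster (assumed form, see lead-in)."""
    shared = 0.0   # component 1: shared nets, weighted by 1 / number of pins
    absorbed = 0   # component 2: nets fully inside the cluster after the merge
    merged = cluster | {vertex}
    for pins in nets.values():
        if vertex not in pins or not (pins & cluster):
            continue  # the net is not shared between the vertex and the cluster
        shared += 1.0 / len(pins)
        if pins <= merged:
            absorbed += 1
    return shared + absorbed


# Hypothetical netlist: each net is the set of vertices (pins) on it.
nets = {
    "n1": {"a", "b"},            # shared with the cluster and absorbed
    "n2": {"a", "b", "c", "d"},  # shared, but d stays outside the cluster
    "n3": {"a", "e"},            # not connected to the cluster
}
print(cluster_score("a", {"b", "c"}, nets))  # 1.75
```

In this example the two-pin net contributes 0.5 plus one absorbed net, and the four-pin net contributes 0.25, which reflects the stated preference for low fan-out nets and nets that are about to be absorbed.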
However, the above procedure might not satisfy the precedence constraints, since it may generate cycles in the graph. For example, in Fig. 8(a), vertices $v_1$, $v_2$, and $v_3$ are clustered in clusters $C_1$, $C_2$, and $C_3$, respectively. Considering the vertex $v_4$, if the score $w_{v_4,C_1}$ is the highest among the scores associated with $v_4$, then $v_4$ will be clustered in $C_1$, which violates the precedence constraints because this clustering causes a cycle $\langle C_1, C_2, C_3, C_1 \rangle$ as shown in Fig. 8(b).
To ensure that no cycle is generated, we cluster according to the topological order of vertices in circuits. Moreover, we check whether any cycle is created when clustering. The algorithm is named precedence-constrained clustering and is summarized in Fig. 9. Theorem 2 gives the time complexity of the algorithm.
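The cycle check can be sketched as follows: a vertex is tentatively placed in a cluster, the graph of clusters induced by the circuit connections is rebuilt, and the placement is rejected if that graph is no longer acyclic. This is an illustration rather than the algorithm of Fig. 9; the names `cluster_graph_is_acyclic` and `try_add` and the example connections are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import CycleError, TopologicalSorter
from typing import Dict, List, Tuple


def cluster_graph_is_acyclic(
    cluster_of: Dict[str, str], edges: List[Tuple[str, str]]
) -> bool:
    """Check that connections between distinct clusters form a DAG."""
    preds: Dict[str, set] = {}
    for u, v in edges:
        if u in cluster_of and v in cluster_of:  # skip unclustered vertices
            cu, cv = cluster_of[u], cluster_of[v]
            if cu != cv:
                preds.setdefault(cv, set()).add(cu)
    try:
        TopologicalSorter(preds).prepare()  # raises CycleError on a cycle
    except CycleError:
        return False
    return True


def try_add(
    vertex: str,
    cluster: str,
    cluster_of: Dict[str, str],
    edges: List[Tuple[str, str]],
) -> bool:
    """Add vertex to cluster only if the cluster graph stays acyclic."""
    trial = dict(cluster_of)
    trial[vertex] = cluster
    if cluster_graph_is_acyclic(trial, edges):
        cluster_of[vertex] = cluster
        return True
    return False


# Hypothetical connections reproducing the kind of cycle described for Fig. 8.
edges = [("v1", "v2"), ("v2", "v3"), ("v3", "v4")]
clusters = {"v1": "C1", "v2": "C2", "v3": "C3"}
print(try_add("v4", "C1", clusters, edges))  # False: C1 -> C2 -> C3 -> C1
print(try_add("v4", "C3", clusters, edges))  # True
```

Rebuilding the cluster graph for every trial is simple but not the cheapest option; an incremental reachability check would avoid repeated work on large circuits.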
V. EXTENSIONS
Our approach is very flexible; it can also handle the precedence-constrained partitioning of different forms with only minor modifications. In this section, we extend the ILP formulations to the buffer-register minimization partitioning (BRMP) problem that was first investigated in [4] and CLB-constrained stage minimization.
A. Buffer-Register Minimization Partitioning
In this section, we extend our approach to the BRMP problem addressed in [4]. The original problem is addressed on the dharma [1] architecture. As mentioned in [4], buffer registers are needed because the time-multiplexed nature of TMFPGAs means that only a portion of the circuit implemented on the chip is present at any given time instance. Thus, there is a need to buffer signals until they are no longer needed. Fig. 10 shows the partitioning with different buffer-register requirements. Buffer registers are used to store signals among nonadjacent stages. Three buffer registers are needed in the partitioning shown in Fig. 10(a), while only two buffer registers are required in that shown in Fig. 10(b). We denote the set of buffer registers needed in the stage i as $B_i$ and the total number of buffer registers in $B_i$ as $|B_i|$. Based on the model in [4], $|B_i| = |\{e \mid \text{the fan-out vertex } v \text{ of net } e \text{ is assigned earlier than the stage } i \text{ and one of } v\text{'s outputs is assigned later than the stage } i\}|$. We formulate the BRMP problem as follows.
Buffer-Register Minimization
where [...] is a user-specified parameter and [...] 0. The new constraint (38) states that each stage cannot contain more than $M_b$ buffer registers.
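The definition of $|B_i|$ given in this subsection translates directly into a counting routine. The sketch below assumes one net per driving vertex and a circuit given as a fan-out map; the name `buffer_registers` and the example assignment are hypothetical.

```python
from typing import Dict, List


def buffer_registers(
    stage: Dict[str, int], fanout: Dict[str, List[str]], p: int
) -> Dict[int, int]:
    """Return {i: |B_i|} for stages 1..p, one net per driving vertex."""
    counts = {}
    for i in range(1, p + 1):
        counts[i] = sum(
            1
            for v, outputs in fanout.items()
            # The driver is earlier than stage i and some output is later.
            if stage[v] < i and any(stage[w] > i for w in outputs)
        )
    return counts


# Hypothetical circuit: a drives b and d, b drives c, c drives d.
fanout = {"a": ["b", "d"], "b": ["c"], "c": ["d"]}
stage = {"a": 1, "b": 2, "c": 3, "d": 4}
per_stage = buffer_registers(stage, fanout, 4)
print(per_stage)                # {1: 0, 2: 1, 3: 1, 4: 0}
print(max(per_stage.values()))  # 1
```

The largest value over all stages is the quantity that the bound $M_b$ has to cover for the assignment to be feasible.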
B. CLB-Constrained Stage-Minimization Partitioning
In the Xilinx TMFPGA, each stage must be stored in a CMP. Each MP is a very large word of memory. Therefore, minimizing the number of stages (i.e., the variable p or the number of MPs needed) allows the design to fit into a TMFPGA with a smaller number of MPs. The significance of minimizing the number of stages required is twofold: 1) it is possible to implement a circuit in a TMFPGA with fewer MPs and 2) a TMFPGA with the fixed number of MPs can accommodate a larger circuit design.
The CLB-constrained stage-minimization partitioning (CCSMP) problem can be described as follows.
CLB-Constrained Stage-Minimization Partitioning Problem:
Input: Given a circuit G = (V, E) and the number of CLBs in a TMFPGA.
Problem: Determine the precedence-constrained partitioning with the following objective. 1) Minimize $p + M_r$.
The CCSMP problem considers the numbers of stages and interconnections simultaneously. The variables used in the formulations are as follows.
p
Integer variable that denotes the number of MPs needed in the TMFPGA. Note that p is a variable here while it is a constant in the previous discussions. 
VI. EXPERIMENTAL RESULTS
The programs for our system (the ASAP and ALAP scheduling, i.e., bounded-delay precedence-constrained partitioning presented in Section III-C, clustering, and ILPs) were written in the C++ language and the ILPs were solved using the LINDO package [14] on a PC with a Pentium II 300 microprocessor and 512-MB RAM. LINDO starts with a feasible linear programming solution and searches for optimal integer solutions using the branch-and-bound method. To reduce the runtime, we let LINDO search at most five feasible solutions. We tested on the MCNC Partitioning93 benchmark circuits [2] used in [4]-[6] and [15]. Columns 2-4 in Table I list the number of vertices, nets, and primary input-outputs in the circuits, respectively. In column 5, depth refers to the number of vertices on the longest critical path.
We compared our method with the list-scheduling method List [4], [5], the network-flow-based approach FBP-m [15], and the probability-based approach PAT [6] on the Xilinx TMFPGA model in which a circuit was partitioned into eight stages. The size of a stage is bounded by the balance factor 5%. This is the same as in [5], [6], and [15]. The results are shown in Table II. Columns 2-4 in Table II [...] The results show that our method on average reduces the maximum numbers of MRs required by 46.6%, 32.3%, and 21.5%, compared with List, FBP-m, and PAT, respectively, which demonstrates the effectiveness of our ILP approach. Our approach is practical and scales well to larger problems. As shown in Fig. 11, in which the runtime is plotted as a function of the circuit size, the empirical runtime grows close to linearly in the circuit size. The runtimes depend on the numbers of 0-1 variables and range from 38 min for the smallest circuit s820 to about 6 h for the largest circuit s35932. (In the ILP formulation, the numbers of variables and constraints needed by the largest circuit s35932 are 11 328 and
VII. CONCLUSION
We have presented generic ILP formulations for a set of multistage precedence-constrained partitioning problems and a clustering method for reducing problem sizes. Experimental results have shown the effectiveness of the ILP-based approaches. The ILP-based formulations are so flexible that they can readily apply to the partitioning problems with various objectives and constraints. The flexibility makes the ILP formulations superior alternatives to the TMFPGA partitioning.
