Introduction
In high-level synthesis, a synchronous data-flow graph (DFG) is mapped onto a set of modules, registers, and interconnections [1]. An example of a DFG is shown in Fig. 1. The data-flow graph represents an iterative algorithm such as a digital signal processing algorithm. A DFG can be nonrecursive or recursive. A recursive DFG contains feedback loops (or cycles) and therefore has an inherent lower bound on its iteration period, called the iteration bound [2, 3]. High-level synthesis consists of scheduling and resource allocation, where the goal is to assign each operation in the DFG to an execution time on a particular processor. In this paper we consider time-constrained scheduling, where the required resources are minimized while satisfying the iteration period specification.
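To make the iteration bound concrete, the following minimal sketch computes it as the maximum, over all cycles of the DFG, of the total computation time of the cycle divided by the number of delays in the cycle. The graph below is a hypothetical example, not the DFG of Fig. 1, and the function name and data layout are assumptions made for illustration.

```python
from fractions import Fraction

def iteration_bound(comp_time, edges):
    """Iteration bound of a small DFG by enumerating its simple cycles.

    comp_time: {node: computation time}
    edges: list of (src, dst, delays); this layout is an illustrative assumption.
    """
    adj = {}
    for src, dst, delays in edges:
        adj.setdefault(src, []).append((dst, delays))
    best = Fraction(0)

    def dfs(start, node, time_sum, delay_sum, visited):
        nonlocal best
        for nxt, d in adj.get(node, []):
            if nxt == start:
                # A cycle is closed: ratio of computation time to delay count.
                total_delays = delay_sum + d
                if total_delays > 0:
                    best = max(best, Fraction(time_sum, total_delays))
            elif nxt not in visited and nxt > start:
                # Only visit nodes larger than the start so that each cycle
                # is enumerated once, from its smallest node.
                dfs(start, nxt, time_sum + comp_time[nxt],
                    delay_sum + d, visited | {nxt})

    for s in sorted(comp_time):
        dfs(s, s, comp_time[s], 0, {s})
    return best

# Hypothetical recursive DFG with two feedback loops.
times = {"A": 2, "B": 2, "C": 4}
graph = [("A", "B", 0), ("B", "A", 1), ("B", "C", 0), ("C", "A", 2)]
bound = iteration_bound(times, graph)  # loops give 4/1 and 8/2
```

Cycle enumeration is exponential in general, so this is only suitable for small graphs; it is meant to illustrate the definition rather than to be an efficient algorithm.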
Finding an optimal schedule during synthesis of a DFG is an NP-complete problem [4]. Therefore, many heuristic schedulers have been proposed [1], [5]-[8]. While these schedulers generate reasonable schedules in short CPU time, the optimality of the schedule is not guaranteed. Integer linear programming (ILP) solutions have been proposed to solve the scheduling problem during high-level VLSI synthesis of DSP algorithms [9]-[15].
In this paper we extend the ILP model of [15] to solve the problem of module selection while scheduling. The objective of module selection is the automatic assignment of each operation to a processor from a library, so as to synthesize a system using less silicon area and lower power. Module selection during scheduling has been addressed in [18]-[20] in the context of heuristic scheduling and in [14] for scheduling large-grain signal processing algorithms by ILP. In this paper we support fine-grain signal processing algorithms. One common way to build a library of components is to include bit-serial and bit-parallel units [21]. If both types of units are used and must communicate, then it is essential to include a serial-to-parallel (or parallel-to-serial) data format converter. Thus, we include support for the cost and computational latency of data format conversion, which has not been considered before.
This paper is organized as follows. In section 2, blocked scheduling and unfolding are discussed. Module selection and data format conversion are demonstrated in section 3. The time assignment ILP model with module selection and data format conversion is presented in section 4. Section 5 contains the processor allocation ILP model with support for unfolding factor minimization. In section 6, scheduling results of several benchmarks are presented.
Blocked schedules
Fig. 2(a) shows a critical path method (CPM) schedule [1, 5] of the DFG in Fig. 1. In this schedule a single iteration does not overlap subsequent iterations. The minimum possible iteration period for a CPM scheduler is equal to the critical path length of the DFG. There are many cases where it is not possible to reduce the critical path time to the iteration bound through DFG transformations such as retiming and pipelining [16, 22]. For example, no matter how the DFG in Fig. 1 is retimed, the critical path will always be greater than the iteration bound of 4 u.t. (units of time).
This limitation is overcome by schedulers which overlap multiple iterations [6]-[9], [23]. These schedulers schedule a single iteration of the DFG but allow subsequent iterations to overlap the first. In an overlapped schedule, each node computation is folded into T_r equivalence classes and executed every T_r u.t., where T_r is the iteration period. This is sometimes referred to as loop unrolling or functional pipelining [6, 23]. An overlapping schedule automatically supports retiming and functional pipelining. The minimum possible iteration period for an overlapping scheduler is limited by the longest execution time of a single node or the iteration bound, whichever is larger. Moreover, the processor utilization may not be optimal in overlapped schedules, as shown in Fig. 2(b).
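The two properties of an overlapped schedule stated above can be sketched in a few lines: the lower limit on the iteration period, and the folding of a start time into one of the T_r equivalence classes. Integer time units are assumed here, and the function names and node times are hypothetical.

```python
import math

def min_overlapped_period(iteration_bound, node_times):
    """Smallest iteration period reachable by an overlapped schedule.

    Assumes integer time units, so a fractional iteration bound is rounded up.
    node_times: {node: execution time} (illustrative layout).
    """
    return max(math.ceil(iteration_bound), max(node_times.values()))

def time_class(start_time, t_r):
    """Fold an absolute start time into one of the t_r equivalence classes."""
    return start_time % t_r

# Hypothetical example: a node of 6 u.t. dominates an iteration bound of 4 u.t.
period = min_overlapped_period(4, {"A": 1, "B": 6, "C": 2})
cls = time_class(9, 4)
```

The first function shows why unfolding or cyclo-static techniques are needed when a single node is longer than the iteration bound: the overlapped schedule cannot go below that node's execution time.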
Unfolding [3] and cyclo-static techniques [24] can be used to guarantee a rate-optimal and processor-optimal schedule even when there exists a node whose computation time exceeds the iteration bound. Both of these techniques require scheduling multiple iterations of the DFG. We call a multiple-iteration schedule a blocked schedule. While an iteration is repeated in a non-blocked schedule, a block of iterations is repeated in a blocked schedule. An example of a blocked schedule for the DFG in Fig. 1 is shown in Fig. 2(c). In this case, a schedule of 12 u.t., representing three iterations of the iteration period of 4 u.t., is repeated in every processor. Here the processor utilization is optimized since the number of processors is reduced from 4 in Fig. 2(b) to 3 in Fig. 2(c). The blocked schedule can always achieve the iteration bound of the DFG with optimal processor utilization. An apparent disadvantage of unfolding is the need to schedule multiple executions of each node.
The blocked scheduler proposed in [15] generates an abstracted blocked schedule of the original DFG, like the ones shown in Fig. 2(e) and Fig. 2(f), without explicitly unfolding the original DFG. The complete blocked schedule can be generated from the abstracted schedule by simply repeating the schedule while exchanging processor assignments such that an iteration of any node is executed in a single processor. For example, while expanding the abstracted schedule of Fig. 2(e) to generate the complete schedule in Fig. 2(c), the allocations of P3 and P2 must be exchanged so that B2 is completed in processor P3. Thus it is possible to generate the abstracted schedule by scheduling the original DFG without considering multiple iterations.
Let the unfolding factor of a processor be defined as the length of one period of the blocked schedule of the processor divided by the iteration period. In the processor-optimal blocked schedule shown in Fig. 2(c), the unfolding factor of each processor is 3. It is important to note that the unfolding factors of processors need not be identical. For example, in the blocked schedule shown in Fig. 2(d), which is also processor-optimal, the unfolding factors of processors P1, P2, and P3 are 2, 2, and 1, respectively. From the viewpoint of control circuit cost, smaller unfolding factors are preferable since the cost of the control circuit could be proportional to the length of the iteration period and therefore proportional to the sum of the processor unfolding factors. Thus, of these two processor-optimal schedules, the one in Fig. 2(d), whose unfolding factors sum to 5 rather than 9, is preferable.
Generating a blocked schedule by a single complicated ILP model requires a long solution time. In our approach, a blocked schedule is generated by two ILP models to improve the solution time without degrading the optimality.
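The definition of the unfolding factor can be checked numerically. The sketch below uses the iteration period of 4 u.t. and the 12 u.t. block period quoted for Fig. 2(c); the block periods of 8, 8, and 4 u.t. for Fig. 2(d) are derived here from the stated factors 2, 2, and 1 and are an assumption, as is the function name.

```python
def unfolding_factor(block_period, iteration_period):
    """Length of one period of a processor's blocked schedule divided by
    the iteration period (must divide evenly)."""
    if block_period % iteration_period != 0:
        raise ValueError("block period must be a multiple of the iteration period")
    return block_period // iteration_period

T_R = 4  # iteration period of the example, in u.t.

# Fig. 2(c): every processor repeats a 12 u.t. block.
factors_c = [unfolding_factor(p, T_R) for p in (12, 12, 12)]
# Fig. 2(d): block periods assumed from the stated factors 2, 2, and 1.
factors_d = [unfolding_factor(p, T_R) for p in (8, 8, 4)]

sum_c = sum(factors_c)
sum_d = sum(factors_d)
```

Both schedules use three processors, so the sum of the unfolding factors is the quantity that distinguishes them when control circuit cost is considered.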
Extension to support module selection and data format conversion
In the scheduling examples discussed in Fig. 2, we assume a node has a predetermined computation time. We extend the DFG as follows to support module selection. First consider a library of processors with varying processor types as shown in Table 1. These processors are derived assuming the use of 16-bit word length fixed-point arithmetic. The computational latency, C, represents the time from an input to its associated output. The pipeline period, L, represents the time between successive operations. The cost, m, represents the cost in terms of area (i.e., the equivalent number of full adders) of each processor. The input and output data formats, I and O, represent the digit size, i.e., the number of bits processed in every clock cycle in each processor. A 4-bit digit-serial architecture processes the data 4 bits at a time. These architectures may be derived using the techniques described in [21].
Each node in the DFG can be assigned to one of the processors in the library. For example, the DFG in Fig. 3(a) represents a biquad filter. Nodes 1, 2, 3, and 4 can be assigned to any of the adders, Abp, Ahp, or Ads, in Table 1. Similarly, nodes 5, 6, 7, and 8 can be assigned to any of the multipliers, Mbp, Mhp, or Mds, in Table 1. Furthermore, to support data format conversion, we include a library of data format converters which convert between all possible data formats listed in the library of processors. The library of processors in Table 1 requires data format converters as shown in Table 2. Each of the data format converters is classified according to its conversion type, its conversion latency, C, its pipeline period, L, and its cost, m. The conversion latency is the time between the input of the first digit and the output of the first digit of converted data. For example, it is 0 for a bit-parallel to half-word parallel converter (v_{bp,hp}) since the first half-word is available as soon as bit-parallel data is input. The conversion latency for a 4-bit digit-serial to bit-parallel converter (v_{ds,bp}) is 3 since it takes 3 u.t. to input and store three digits, and the converted data is output when the last digit is input.
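The two latency examples above suggest a simple rule, sketched below: when the output digit is wider than the input digit, the converter must store all but the last input digit needed to form the first output digit; otherwise the first output digit is available immediately. This rule is a generalization assumed for illustration and is only checked against the two cases given in the text; the values in Table 2 remain authoritative. Digit sizes of 16, 8, and 4 bits are assumed for the bit-parallel, half-word parallel, and 4-bit digit-serial formats of a 16-bit word.

```python
def conversion_latency(in_digit_bits, out_digit_bits):
    """Time units between the first input digit and the first output digit.

    Assumed rule (not taken from Table 2): a wider output digit needs
    out/in input digits, of which all but the last must be stored first.
    """
    if out_digit_bits <= in_digit_bits:
        return 0  # the first output digit is contained in the first input digit
    if out_digit_bits % in_digit_bits != 0:
        raise ValueError("digit sizes must divide evenly")
    return out_digit_bits // in_digit_bits - 1

BP, HP, DS = 16, 8, 4  # assumed digit sizes in bits for a 16-bit word

latency_bp_to_hp = conversion_latency(BP, HP)  # bit-parallel to half-word parallel
latency_ds_to_bp = conversion_latency(DS, BP)  # 4-bit digit-serial to bit-parallel
```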
When the processor library is limited to just two processors, Ahp and Mhp, in Table 1, a blocked schedule of the DFG of Fig. 3(a) can be obtained as shown in Fig. 3(b). This schedule requires two Ahp adders and two Mhp multipliers with a total cost of 384 units. When the processor library is expanded to include all the processors and data format converters in Tables 1 and 2, nodes are assigned to processors and data format conversions are inserted as shown in Fig. 3(c). Fig. 3(d) shows the abstracted blocked schedule. Nodes 1 and 6 are assigned to slower and less expensive processors. Data format conversions, symbolized by a box in Fig. 3(c), are necessary between nodes 2 and 6, and between nodes 1 and 2. One converts from half-word parallel to 4-bit digit-serial and the other converts from 4-bit digit-serial to half-word parallel. The blocked schedule with module selection and data format conversion has a total cost of only 290 units compared to the original cost of 384.
ILP model for time assignment
We define a time assignment ILP model to derive the cost-optimal architecture for the given DFG. The time assignment model assigns a start time to each node within the DFG so as to satisfy precedence constraints, while performing module selection and data format conversion.
The following terminology is used.
The DFG is defined as (N, E), where N is the set of nodes and E is the set of edges.
W_e is the number of delays on edge e ∈ E.
T_r is the given iteration period.
PROC is the library of available processors.
F_i denotes the subset of processors, F_i ⊆ PROC, capable of executing node i ∈ N.
Each processor t ∈ PROC has computational latency C_t, pipeline period L_t, and cost m_t.
A binary variable x_{i,j,t} = 1 means that node i starts at time j on a processor of type t.
FORM is the set of all the data formats.
I(t) and O(t) are, respectively, the input and output data formats of a processor of type t.
CONV is the library of available converters.
v_{qr} denotes a data format converter which converts data from format q to format r.
Each data format converter v has conversion latency C_v, pipeline period L_v, and cost m_v.
A binary variable y_{i,j,v} = 1 means that a data format converter of type v is used and the conversion for the output data of node i starts at time j.
LB_i and UB_i are the lower bound and the upper bound of the time at which the computation of node i can start.
LB_{i,v} and UB_{i,v} are the lower bound and the upper bound of the time at which a converter of type v could start converting the data output from node i.
These bounds determine the scheduling range of node i and are calculated as in [7].
Inequalities (5) and (6) ensure the precedence constraints from processor to converter and from converter to processor, respectively. In the case that the output format of the converter and the input format of the processor are different, there is no need to constrain the precedence relation between them. In that case, inequality (6) is automatically satisfied.
Inequalities (7) and (8) are used to count the number of processors and the number of converters of each type. In an overlapped schedule with an iteration period of T_r, there are T_r time partitions. Each time unit j_0 belongs to the time partition j_0 mod T_r, and nodes assigned to times belonging to the same time partition are executed concurrently. Such nodes must be assigned to different processors. The parameter k_1 in constraint (7) is used to fold a time into its time partition. The parameter p is used to handle structural pipelining. When a node is assigned to a processor whose pipeline period is longer than the iteration period, the processor must be counted multiple times, since the node occupies the processor for more than one iteration period. This accounts for the second term in constraint (7). The same applies to (8).
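The counting idea behind inequalities (7) and (8) can be illustrated with a modulo reservation table: every time unit during which a node occupies its processor is folded into its time partition, and the busiest partition determines how many processors of that type are needed. This is a minimal sketch of the counting principle, not the ILP constraint itself; the function name, the tuple layout, and the example assignments are assumptions.

```python
from collections import Counter

def processors_required(assignments, t_r):
    """Number of processors of each type needed by a given time assignment.

    assignments: list of (processor_type, start_time, pipeline_period);
    this layout is an illustrative assumption.
    """
    usage = {}
    for proc_type, start, period in assignments:
        table = usage.setdefault(proc_type, Counter())
        # Fold every occupied time unit into its time partition (mod t_r).
        # A pipeline period longer than t_r wraps around and is counted
        # more than once in some partitions.
        for t in range(start, start + period):
            table[t % t_r] += 1
    return {p: max(table.values()) for p, table in usage.items()}

# Hypothetical assignment with an iteration period of 4 u.t.
example = [
    ("add", 0, 1),
    ("add", 4, 1),  # folds into the same partition as the first adder node
    ("add", 1, 1),
    ("mul", 2, 6),  # pipeline period longer than the iteration period
]
needed = processors_required(example, 4)
```

In the example, the single multiplier node occupies two processors because its pipeline period of 6 u.t. spans more than one iteration period of 4 u.t.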

ILP model for processor allocation
The processor allocation model allocates node computations to processors to support unfolding using the start times and module selection provided by the time assignment model. Allocating data format conversions to data format converters can be performed in the same way as allocating node computations to processors. Therefore, only the allocation of node computations to processors is considered here. The goal of the allocation is to minimize the unfolding factor while supporting blocked schedules.
Precise calculation of unfolding factor
In the time assignment model, the node assignment constraint (2) ensures that node i has one start time and is assigned to one processor. The converter assignment constraint (3) ensures that a data format converter of type v_{qr} is used if an edge (a, b) exists, node a is assigned to a processor whose output data format is q, and node b is assigned to a processor whose input data format is r.
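One plausible way to write a node assignment constraint with the stated meaning is given below. It is a sketch built only from the terminology of the time assignment model (the binary variables x_{i,j,t}, the processor subsets F_i, and the bounds LB_i and UB_i); the exact form of constraint (2) may differ.

```latex
% Assumed form: each node gets exactly one start time and one processor type.
\sum_{t \in F_i} \; \sum_{j = LB_i}^{UB_i} x_{i,j,t} = 1,
\qquad \forall\, i \in N
```

Because every x_{i,j,t} is binary, the sum equals 1 only when exactly one pair of start time j and processor type t is selected for node i.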
As discussed in section 2, the processor allocation in Fig. 2(f) is preferable to that in Fig. 2(e), since the sum of the processor unfolding factors in the corresponding complete schedule of Fig. 2(d) is less than that of Fig. 2(c). For the purpose of calculating unfolding factors, we only need to consider the allocation of nodes whose computations cross a multiple of the iteration period.
First we calculate the unfolding factors assuming there are no nodes with a computation time (pipeline period) longer than the iteration period. Then we modify our calculations to include nodes with computation times longer than the iteration period. Let a computation of a node be divided at a multiple of the original iteration period as illustrated in Fig. 4(a). Let the first portion be called the head of the node and the second portion be called the tail of the node. The head is assigned to the end of an iteration cycle and the tail is assigned to the beginning of an iteration cycle, as shown in Fig. 4(a).
The head and the tail of a node are allocated either to an identical processor or to two distinct processors. Nodes are divided into node groups depending on their allocation as follows.
[Definition: Node group]
A node group is the set of nodes such that either the head or tail of a node in a node group is allocated to a processor to which the tail or head of another node in the same node group is allocated. There may exist at most one node in a node group whose head is allocated to a processor to which no tail is allocated.
When this occurs we say a head is allocated alone.
The schedule is unfolded by a factor such that all the computations of nodes in the node group are executed one after another on an identical processor. In the processor allocation in Fig. 2(f), there are two node groups: one consists of node B and the other consists of node C. The head of node B is allocated alone to processor P1. Since the head and the tail of node C are allocated to the identical processor, P3, the schedule is not unfolded and the unfolding factor for P3 is 1. On the other hand, the head and tail of node B are allocated to distinct processors P1 and P2. The schedules of these processors must be unfolded twice so that the head and the tail of node B are executed consecutively on an identical processor.
In the processor allocation in Fig. 2(e), there is only one node group, consisting of nodes B and C. The head of node B is allocated alone. In this case, the schedules of these processors must be unfolded 3 times as shown in Fig. 2(c) so that the head and the tail of node B are executed consecutively on an identical processor, the head of node C is executed on the same processor as the tail of node B, and the head and the tail of node C are executed consecutively on an identical processor. Consequently, when a head in the node group is allocated alone, the unfolding factor of a processor which executes the computations of nodes in the node group is equal to the number of nodes in the node group plus one.
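The node group rule can be sketched as a small procedure: nodes that share a processor through their heads and tails are merged into one group, and each processor of a group gets an unfolding factor equal to the group size, plus one if some head in the group is allocated alone. The processor names for Fig. 2(e) are assumed from the description above (head of B on P1, tail of B and head of C on P2, tail of C on P3); the function name and dictionaries are illustrative.

```python
def processor_unfolding_factors(head_proc, tail_proc):
    """Unfolding factor of each processor from head and tail allocations.

    head_proc, tail_proc: {node: processor} for nodes split into a head
    and a tail (no bodies). The layout is an illustrative assumption.
    """
    nodes = sorted(head_proc)
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    # Nodes whose head or tail share a processor form one node group.
    owner = {}
    for n in nodes:
        for proc in (head_proc[n], tail_proc[n]):
            if proc in owner:
                parent[find(n)] = find(owner[proc])
            else:
                owner[proc] = n

    tail_procs = set(tail_proc.values())
    size, head_alone = {}, {}
    for n in nodes:
        root = find(n)
        size[root] = size.get(root, 0) + 1
        # A head is allocated alone if no tail is allocated to its processor.
        if head_proc[n] not in tail_procs:
            head_alone[root] = True

    factors = {}
    for n in nodes:
        root = find(n)
        factor = size[root] + (1 if head_alone.get(root) else 0)
        for proc in (head_proc[n], tail_proc[n]):
            factors[proc] = factor
    return factors

# Allocation described for Fig. 2(f).
fig_2f = processor_unfolding_factors({"B": "P1", "C": "P3"},
                                     {"B": "P2", "C": "P3"})
# Allocation assumed for Fig. 2(e).
fig_2e = processor_unfolding_factors({"B": "P1", "C": "P2"},
                                     {"B": "P2", "C": "P3"})
```

The two results reproduce the unfolding factors 2, 2, 1 and 3, 3, 3 discussed for the two allocations.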
We can calculate the unfolding factors more precisely as follows. Let a binary variable g_{i1,i2} = 1 if node i1 and node i2 are in the same node group; otherwise g_{i1,i2} = 0. By definition, g_{i,i} = 1 since node i is always in the same node group as itself. The unfolding factors are then obtained from these variables using a max operation over each node group. The max operation guarantees that the unfolding factors for all the nodes in each node group will be the same.
When the computation time of a node is greater than the iteration period, the node must be divided into bodies as well as a head and a tail, as illustrated in Fig. 4(b). In this case, the unfolding factor is increased by the number of bodies of the node. Therefore, γ_i is redefined, again distinguishing whether the head of node i is allocated alone, so that it includes w_k, the number of bodies of node k. β_i can be calculated as in (10).
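The relations described in this subsection can be summarized in one possible form, shown below as a sketch under stated assumptions rather than as the exact equations. Here γ_i is the per-node quantity, β_i is the unfolding factor associated with node i, and a_i is a hypothetical indicator introduced only for this illustration.

```latex
% Assumed form; a_i is a helper indicator that is not part of the model's notation.
a_i =
\begin{cases}
  1 & \text{if the head of node } i \text{ is allocated alone,} \\
  0 & \text{otherwise,}
\end{cases}
\qquad
\gamma_i = \sum_{k \in N} g_{i,k}\,(1 + w_k) + a_i,
\qquad
\beta_i = \max_{k \,:\, g_{i,k} = 1} \gamma_k .
```

For the allocation of Fig. 2(e), with one node group containing B and C, no bodies, and the head of B allocated alone, this gives γ_B = 3 and γ_C = 2, so the max operation yields an unfolding factor of 3 for both nodes, in agreement with Fig. 2(c).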
ILP model for processor allocation with unfolding factor minimization
First we identify those nodes, in the original set of nodes, whose computation times cross a multiple of the iteration period. Let N1 denote the set of such nodes. The N1 nodes are divided into heads, bodies, and tails as discussed above. Let S2, S3, and S4 denote the set of heads, the set of tails, and the set of bodies, respectively. The allocation model operates on the set of nodes M = (N − N1) ∪ S2 ∪ S3 ∪ S4, where N is the original set of nodes.
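The division of a computation into a head, bodies, and a tail can be sketched as follows. A computation is cut at every multiple of the iteration period that lies strictly inside its execution interval; the first piece is the head, the last piece is the tail, and any full periods in between are bodies. The function name, the half-open interval convention, and the example values are assumptions made for illustration.

```python
def split_computation(start, duration, t_r):
    """Split [start, start + duration) at multiples of the iteration period.

    Returns None when the computation does not cross a multiple of t_r,
    i.e., when the node does not belong to N1.
    """
    end = start + duration
    first_cut = (start // t_r + 1) * t_r      # first multiple strictly after start
    cuts = list(range(first_cut, end, t_r))   # multiples strictly inside the interval
    if not cuts:
        return None
    bounds = [start] + cuts + [end]
    segments = list(zip(bounds[:-1], bounds[1:]))
    return {"head": segments[0], "bodies": segments[1:-1], "tail": segments[-1]}

# Hypothetical node starting at time 3 with a computation time of 10 u.t.
# and an iteration period of 4 u.t.
parts = split_computation(3, 10, 4)
n_bodies = len(parts["bodies"])
```

The number of bodies obtained this way corresponds to the quantity w_k used in the calculation of the unfolding factor.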
The following terminology is used.
I_t is the set of computations which are assigned to a processor of type t.
PS_t is the set of M_t processors of type t.
K is a sufficiently large positive integer.
Since all the heads are allocated at the end of the iteration cycle, no two heads are ever allocated to an identical processor. Therefore, we can fix the allocation of heads prior to the solution of the processor allocation ILP model. The parameter P_i denotes the processor to which the head of node i is allocated. This simplifies the computation of node groups, i.e., the values of g_{i1,i2}.
We minimize the sum of the unfolding factors (12) subject to the following constraints (13)- (21).
Constraint (13) ensures that each computation is allocated to one processor. Constraint (14) prevents more than one computation from being allocated to the same processor during the same time class. Constraints (15) to (18) [...]. The expression in (19) is equal to γ_i, since the last term becomes 0 if the head of node i is allocated alone and 1 otherwise. Constraints (19) and (20) find the maximum β over all the nodes in the same node group. If there exists a node group where a head is allocated alone, then there must exist a tail which is allocated to a processor to which no head is allocated. A separate variable gives the unfolding factor of the processor to which a tail would be allocated alone; it is computed by (21). The cost function (12) minimizes the sum of the unfolding factors of the head, body, and tail of each node.
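The requirement expressed by constraints (13) and (14) can be checked on a candidate allocation with a few lines of code. The sketch below is a feasibility check, not the ILP formulation: a dictionary maps each computation to exactly one processor, and the function reports every pair of computations that occupy the same processor in the same time class. All names and the example data are hypothetical.

```python
def allocation_conflicts(allocation, time_classes):
    """List conflicts of the form (processor, time_class, first, second).

    allocation: {computation: processor}, so each computation has one processor.
    time_classes: {computation: set of time classes it occupies}.
    Both layouts are illustrative assumptions.
    """
    busy = {}
    conflicts = []
    for comp, proc in sorted(allocation.items()):
        for cls in sorted(time_classes[comp]):
            # Record the first computation seen in this slot; any later
            # computation in the same slot is a conflict.
            first = busy.setdefault((proc, cls), comp)
            if first != comp:
                conflicts.append((proc, cls, first, comp))
    return conflicts

# Hypothetical allocation: computations a and b overlap in time class 1 on P1.
bad = allocation_conflicts({"a": "P1", "b": "P1", "c": "P2"},
                           {"a": {0, 1}, "b": {1, 2}, "c": {1}})
# Moving b to its own processor removes the conflict.
good = allocation_conflicts({"a": "P1", "b": "P3", "c": "P2"},
                            {"a": {0, 1}, "b": {1, 2}, "c": {1}})
```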
Results
We scheduled several DFGs to demonstrate the effectiveness of our models. All the ILP models were solved using the ILP solver GAMS/OSL [25] on a SparcStation 2. To show that our model is able to derive optimal solutions, it is applied to the scheduling of the 5th order elliptic wave filter (EWF), which has been used in [5], [7]-[13]. In this case, a single processor type is assumed for each operation type, i.e., either nonpipelined multiplier and adder (symbolized as '*' and '+') or pipelined multiplier and adder ('*p' and '+'). The specifications of these processors are the same as in [11]. The number of processors required in each case is equal to or better than that in [11]. Shown in Table 3 are the numbers of processors in the solution for each iteration period, T_r, which compares to the latency f as described in [11]. Though [11] shows the result of resource-constrained scheduling, our model derived the same results for most cases and a better result for one case. With 1 pipelined multiplier and 2 adders, the approach in [11] required 18 units of time for the iteration period, while our approach requires 17 units of time for the iteration period under the same resource constraints.
We scheduled several benchmarks using our models for a given iteration period with the components shown in Tables 1 and 2. The 4th order lattice and Jaumann filter benchmarks have been used in [7], the 4 stage pipelined lattice filter benchmark has been used in [8], and the 16 point FIR filter benchmark has been used in [5]-[8]. Table 4 [...]. Table 5 contains the results of the processor allocation ILP model. This table shows the minimum unfolding factor necessary to achieve the processor allocation and the CPU time required to solve the ILP model. B is the sum of the unfolding factors of all the processors and max β is the maximum unfolding factor. Bars ('-') in both of these columns mean that the processor allocation is obvious since the number of processors and the number of converters are 1. A bar in the max β column alone means there exists no node computation which crosses a multiple of the iteration period and therefore B is 0.
Conclusion
We have proposed two new ILP models for the time-constrained scheduling problem. Our models perform module selection and data format conversion while automatically retiming, pipelining, and unfolding the DFG in an implicit manner. We have run several benchmarks to demonstrate the utility of these models. The ILP model is very attractive [...]
