Abstract: In high-level synthesis, a data-flow graph (DFG) description of an algorithm is mapped onto a register-transfer-level description of an architecture. Each node of the DFG is scheduled to a specific time and allocated to a processor. In this paper we present new integer linear programming (ILP) models which generate a blocked schedule for a DFG with automatic retiming, pipelining, and unfolding while performing module selection and data format conversion. A blocked schedule is a schedule which overlaps multiple iterations of the DFG to guarantee a processor-optimal schedule. During module selection, an appropriate processor is chosen from a library of processors to construct a cost-optimal architecture. We also include the cost and latency of data format conversions between processors of different implementation styles, and we present a new formulation for minimizing the unfolding factor of the blocked schedule. The approach presented in this paper is the only systematic approach proposed so far to include implicit unfolding and to perform synthesis using non-uniform processor styles and data format converters.
I. Introduction
In high-level synthesis, a synchronous data-flow graph (DFG) is mapped onto a set of modules, registers, and interconnections [1], [2]. An example of a DFG is shown in Fig. 1. The data-flow graph represents an iterative algorithm such as a digital signal processing algorithm. A DFG can be non-recursive or recursive. A recursive DFG contains feedback loops (or cycles) and therefore has an inherent lower bound on its iteration period called the iteration bound [3], [4]. High-level synthesis consists of scheduling and resource allocation, where the goal is to assign each operation in the DFG to an execution time on a particular processor. Scheduling can be either time-constrained, where the goal is to minimize the number of resources for a given iteration period, or resource-constrained, where the goal is to minimize the iteration period for a given set of resources. Time-constrained scheduling is the focus of this paper.
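The iteration bound mentioned above can be illustrated with a short sketch. It assumes the usual definition (the maximum, over all cycles, of the total computation time in the cycle divided by the number of delays in the cycle) and enumerates simple cycles by depth-first search, which is only practical for small graphs. The graph, node times, and function name are hypothetical and are not the DFG of Fig. 1.

```python
from fractions import Fraction


def iteration_bound(exec_time, edges):
    """Return max over simple cycles of (sum of node times) / (delays in cycle).

    exec_time: dict node -> computation time (node names must be comparable)
    edges: list of (src, dst, delays)
    Cycles without delays are ignored here; a valid DFG should not contain them.
    """
    succ = {}
    for a, b, w in edges:
        succ.setdefault(a, []).append((b, w))
    best = Fraction(0)

    def dfs(start, node, time, delays, visited):
        nonlocal best
        for nxt, w in succ.get(node, []):
            if nxt == start:
                total = delays + w
                if total > 0:
                    best = max(best, Fraction(time, total))
            elif nxt not in visited and nxt > start:
                # Only visit nodes larger than the start so that each
                # simple cycle is enumerated exactly once.
                dfs(start, nxt, time + exec_time[nxt], delays + w, visited | {nxt})

    for s in sorted(exec_time):
        dfs(s, s, exec_time[s], 0, {s})
    return best


# Hypothetical recursive DFG: cycle A-B-C-A has time 4 and 2 delays (ratio 2),
# cycle A-B-A has time 3 and 1 delay (ratio 3).
times = {"A": 1, "B": 2, "C": 1}
graph = [("A", "B", 0), ("B", "C", 0), ("C", "A", 2), ("B", "A", 1)]
print(iteration_bound(times, graph))  # prints 3
```

Exact fractions are used because the iteration bound need not be an integer.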
Finding an optimal schedule during synthesis of a DFG is an NP-complete problem [5], [6]. Therefore, many heuristic schedulers have been proposed [2], [7]-[18]. While these schedulers generate reasonable schedules in short CPU time, the optimality of the schedule is not guaranteed. Integer linear programming (ILP) solutions have been proposed to solve the scheduling problem during high-level VLSI synthesis of DSP algorithms [19]-[28]. The ILP model in [27], [28] operates on the original DFG and generates a blocked schedule which automatically retimes [29], pipelines [30], and unfolds [4] the DFG. We present a new formulation to minimize the unfolding factor. The ILP formulation is attractive because additional constraints are easily added to the scheduling problem. Moreover, it is possible to find an optimal solution using the ILP approach. In this paper we extend the ILP model of [27], [28] to solve the problem of module selection while scheduling. Most synthesis systems assume each function in a DFG to be implemented by a predetermined hardware processor. In this paper we perform automatic allocation of each operation to a library of processors. The problem of module selection during scheduling has been addressed [13], [14], [31]-[33] in the context of heuristic scheduling. Module selection during scheduling by ILP was reported in [26] for scheduling large-grain signal processing algorithms. In this paper we support fine-grain signal processing algorithms.
Each module in the library may be described by several parameters including computational latency, pipeline period, digit-size, and cost. Typically these processors differ according to their implementation styles. For example, an add operation may be performed by either a carry lookahead adder or a ripple carry adder [34] . During synthesis the fast, expensive implementations are used only where necessary. This allows us to search a wider design space and thus we can synthesize a system using less silicon area and lower power.
Furthermore, we include support for calculating the cost and latency of data format conversion, which has not been considered before. Data format conversion allows us to accurately model the multiple implementation styles available in a library of components. One common way to build a library of components is to include bit-serial and bit-parallel units [35]. The bit-parallel units are faster but require more area. If both types of units are used and must communicate, then it is essential to include a serial-to-parallel (or parallel-to-serial) data format converter. We include the cost and computational latency of data format conversion in our ILP synthesis system.

This paper is organized as follows. In section II, blocked scheduling and unfolding are discussed. The time assignment ILP model for blocked scheduling with module selection and data format conversion is presented in section III. To improve the solution time of the ILP models, a novel technique called schedule range folding and a model division technique are presented in sections IV and V, respectively. Section VI contains the processor allocation ILP model with support for unfolding factor minimization. In section VII, results of several benchmarks obtained by using our synthesis system are presented.

[Figure: schedule charts with processor rows P1, P2, and P3; horizontal axis: Time (0 to 15)]
II. Blocked Schedules
Critical path method (CPM) schedulers [2], [9], [36], [37] schedule a single iteration at a time and do not overlap subsequent iterations. The minimum possible iteration period for a CPM scheduler is equal to the critical path length of the DFG. An example of a critical path schedule for the DFG in Fig. 1 is shown in Fig. 2(a). The iteration period, $T_r$, for this schedule is 9 u.t. (units of time), equal to the length of the critical path A-B-C. Retiming and pipelining [7], [29], [38] could be used to reduce the critical path time; however, it is not always possible to reduce the critical path time to the iteration bound. For example, the DFG in Fig. 1 is retimed to a minimal critical path of 5 u.t., as shown in Fig. 3, which is greater than the iteration bound of 4 u.t.
To further exploit the potential parallelism among iterations, schedulers have been developed which overlap multiple iterations [7] , [10] , [11] , [19] , [39] . These schedulers schedule a single iteration of the DFG but allow subsequent iterations to overlap the first. This is sometimes referred to as loop unrolling or functional pipelining [7] , [39] and has also been used in design of high-performance compilers [37] , [16] . An overlapping schedule automatically supports retiming and functional pipelining.
The minimum possible iteration period for an overlapping scheduler is limited by the longest execution time of a single node or the iteration bound, whichever is larger. Moreover, even when it is possible to achieve the iteration bound in overlapped schedules, the processor utilization may not be optimal. An example of an overlapped schedule for the DFG in Fig. 1 is shown in Fig. 2(b). Although the iteration bound is achieved in this example, the processor utilization is not optimal.
Unfolding [4] and cyclo-static techniques [40] , [41] can be used to obtain a schedule where the iteration period equals the iteration bound and the processor utilization is maximized. Both of these techniques require scheduling multiple iterations of the DFG. We call a multiple iteration schedule a blocked schedule because such a schedule effectively handles a block of iterations. An example of a blocked schedule for the DFG in Fig. 1 is shown in Fig. 2(c) . In this schedule, three iterations are scheduled before the schedule repeats. In other words, a schedule of 12 u.t., representing a block of three iterations of 4 u.t., is repeated in every processor. An apparent disadvantage of unfolding is the need to schedule multiple executions of each node.
In [27], [28] it was demonstrated that it is possible to generate a blocked schedule from the original DFG, like the one shown in Fig. 2(f), without explicitly unfolding the original DFG or using cyclo-static techniques. Let the blocking factor of a processor be defined as the length of one repetition of the blocked schedule divided by the iteration period. For example, in the processor-optimal blocked schedule shown in Fig. 2(c), the blocking factor of all three processors is 3. It is important to note that in the schedule shown in Fig. 2(d), which is also processor-optimal, the blocking factor of $P_1$ and $P_2$ is 2, while the schedule of processor $P_3$ repeats every 4 u.t. and requires no further blocking.
If there exist processors whose blocking factor is $B$, then a period of $B T_r$ u.t. must be controlled. If other processors whose blocking factor is $B'$ coexist, then a controller must control a period of $\mathrm{LCM}(B, B')\,T_r$ u.t., where $\mathrm{LCM}(B, B')$ denotes the least common multiple (LCM) of $B$ and $B'$. It can be assumed that the cost of the control circuit is proportional to the LCM of the blocking factors. Thus, the schedule in Fig. 2(d), where the LCM of blocking factors is 2, is superior to the schedule in Fig. 2(c), where the LCM of blocking factors is 3. The objective of the processor allocation is to minimize the cost of the control circuit while achieving the minimum number of processors.
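As a worked instance of this rule, the controlled periods for the two schedules can be computed from the iteration period of 4 u.t. and the blocking factors quoted above for Fig. 2(c) and Fig. 2(d). The only assumption made here is that a processor requiring no further blocking contributes a blocking factor of 1.

```latex
\begin{align*}
\text{Fig. 2(c):}\quad & \mathrm{LCM}(3, 3, 3)\,T_r = 3 \cdot 4 = 12 \text{ u.t.} \\
\text{Fig. 2(d):}\quad & \mathrm{LCM}(2, 2, 1)\,T_r = 2 \cdot 4 = 8 \text{ u.t.}
\end{align*}
```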
In our approach, a blocked schedule is generated without explicit unfolding and the scheduling is performed on the original DFG. Our approach to generating the blocked schedule requires two ILP models: a time assignment model and a processor allocation model. The time assignment ILP model finds the start times for each node by folding the nodes into equivalent time partitions, as in an overlapping scheduler, with the addition of support for unfolding nodes whose execution times are longer than the iteration period. The processor allocation ILP model finds an optimal processor allocation for each of the nodes in the DFG using the result of the time assignment model. While many synthesis systems perform implicit pipelining and retiming, the proposed approach is the first to consider implicit unfolding in the synthesis of heterogeneous architectures.
III. ILP Model for Time Assignment
We present a time assignment ILP model to derive the optimal architecture with the minimized number of processors with implicit retiming, pipelining, and unfolding. Initially, we assume each node in the DFG is preassigned to a single processor style based on the operation type. A processor style may represent a function like an addition or multiplication or may be multifunctional like an ALU. Processors in this model can be either pipelined or non-pipelined and may execute multi-cycle operations. Each processor style is represented by its computational latency, pipeline period, and cost. The computational latency represents the time from an input to its associated output. The pipeline period represents the time between successive initiations of operations on the processor.
Next we extend the time assignment ILP model to support automatic module selection, where appropriate processor styles are selected from a library of processors. Processors selected from a library may input and output data in different formats. In that case, data format converters are automatically inserted as needed in the synthesized architecture to convert the data format from one processor to the next.
A. Time assignment ILP model for blocked scheduling
In this section, an ILP model to generate the time assignment of a blocked schedule is presented.
The following terminology is used.
The DFG is defined as $(N, E)$, where $N$ is the set of nodes and $E$ is the set of edges. $W_e$ is the number of delays on edge $e \in E$.
$T_r$ is the given iteration period. A binary variable $x_{i,j} = 1$ means that a computation of node $i$ starts at time step $j$. $C_i$ and $L_i$ are respectively the computational latency and the pipeline period of node $i$. $LB_i$ and $UB_i$ are the lower bound and the upper bound of the time at which the computation of node $i$ can start. These bounds determine the scheduling ranges of nodes, calculated as in [10], [27].
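A minimal sketch of a model consistent with this terminology and with the constraint descriptions that follow is given below, numbered to match the references in the text. It is an assumption rather than the exact formulation: $M_t$ (the number of processors of type $t$), $m_t$ (the cost of such a processor), and $G_t$ (the set of nodes executed on type $t$) borrow the notation used later for module selection, and the index $k_1$ is assumed to range over the integers for which $LB_i \le j - p + k_1 T_r \le UB_i$.

```latex
\begin{align}
\text{minimize}\quad & \sum_{t} m_t M_t \\
\text{subject to}\quad & \sum_{j=LB_i}^{UB_i} x_{i,j} = 1, && \forall i \in N \\
& \sum_{j=LB_b}^{UB_b} j\,x_{b,j} - \sum_{j=LB_a}^{UB_a} j\,x_{a,j} \ge C_a - T_r W_e, && \forall e = (a, b) \in E \\
& \sum_{i \in G_t} \sum_{p=0}^{L_i - 1} \sum_{k_1} x_{i,\,j - p + k_1 T_r} \le M_t, && 0 \le j \le T_r - 1,\ \forall t
\end{align}
```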
Constraint (2) ensures that node $i$ has only one start time. Inequality (3) ensures the precedence constraints are satisfied. For each edge $e = (a, b)$, the computation of node $b$ must start at least $C_a - T_r W_e$ time units after the start of the computation of node $a$. Inequalities (4) are used to support unfolding nodes whose pipeline period is larger than the iteration period. Such a node must be assigned to more than one processor.
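The processor count implied by a given time assignment can be illustrated with a small sketch. It assumes the start times are already known (in the ILP they are the unknowns) and that a node occupies its processor for a number of consecutive time steps equal to its pipeline period. The function and argument names are hypothetical. Each occupied step is folded modulo the iteration period, and the busiest time class gives the number of processors required for that processor style.

```python
def processors_needed(start, pipeline_period, t_r):
    """Return the peak number of busy processors over time classes 0..t_r-1.

    start: dict node -> start time step
    pipeline_period: dict node -> time steps the node occupies its processor
    t_r: iteration period
    """
    busy = [0] * t_r
    for node, s in start.items():
        for p in range(pipeline_period[node]):
            busy[(s + p) % t_r] += 1  # fold the occupied step into its time class
    return max(busy)


# Hypothetical examples with an iteration period of 4.
print(processors_needed({"A": 0, "B": 2}, {"A": 2, "B": 2}, 4))  # prints 1
print(processors_needed({"C": 0}, {"C": 6}, 4))                  # prints 2
```

In the second call a single node with a pipeline period of 6 wraps around an iteration period of 4, so two processors are required, which is the unfolding case described above.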
B. Extending the DFG to Support Module Selection and Data Format Conversion
We extend the DFG as follows to support module selection. First consider a library of processors with varying implementation styles. Each processor is labeled according to its type. Various types of processors can be obtained by various implementation styles which may vary according to data size, such as bit-serial, digit-serial, or bit-parallel; or according to architecture such as carry lookahead versus ripple carry. Each processor is described by its computational latency, pipeline period, cost, and data format. A node in the DFG can be assigned to a particular type from a subset of processor types in the library. Second we include a set of data format converters to convert between all possible data formats. Each of the data format converters is classified according to its conversion type, its computational latency, its pipeline period, and its cost.
For example, the DFG in Fig. 4(a) represents a biquad filter. A library of processors is given as shown in Table I. The subscripts bp, hp, and ds indicate the style of a processor: $A_{bp}$ is a bit-parallel adder, $A_{hp}$ is a half-word parallel adder, and $M_{ds}$ is a 4-bit digit-serial multiplier. In this table, $C$ denotes the computational latency, $L$ the pipeline period, $m$ the cost, $I$ the input data format, and $O$ the output data format. For simplicity we assume that only processors $A_{hp}$, $A_{ds}$, $M_{hp}$, and $M_{ds}$ can be used. Nodes 1, 2, 3, and 4 can be assigned to either processor $A_{hp}$ or processor $A_{ds}$. Similarly, nodes 5, 6, 7, and 8 can be assigned to either processor $M_{hp}$ or $M_{ds}$. More specifically, processors $A_{hp}$ and $A_{ds}$ represent two different adder implementations, while processors $M_{hp}$ and $M_{ds}$ represent two different multiplier implementations. In addition, we include the library of data format converters shown in Table II, as required by the library of processors in Table I. In Table II, $C$ denotes the latency of data format conversion, $L$ the pipeline period, and $m$ the cost.
For the DFG of Fig. 4(a), the use of module selection can reduce the cost of an architecture even when the cost of data format conversion is included. When the processor library is limited to just two processors, $A_{hp}$ and $M_{hp}$ in Table I, and the iteration period is 7 u.t., a blocked schedule such as the one in Fig. 4(b) can be obtained. This schedule requires two $A_{hp}$ adders and two $M_{hp}$ multipliers, with a total cost of 384 units. When the processors $A_{hp}$, $A_{ds}$, $M_{hp}$, and $M_{ds}$ and the necessary data format converters in the libraries can be used, an assignment of nodes to processors and converters is determined as in Fig. 4(c), and a blocked schedule such as the one in Fig. 4(d) can be obtained. In this case, nodes 1 and 6 are assigned to slower and less expensive processors. A data format conversion, symbolized by a square in Fig. 4(c), is necessary and is automatically inserted between nodes 2 and 6 and between nodes 1 and 2. The blocked schedule with module selection and data format conversion has a total cost of only 290 units, compared to the original cost of 384.
C. ILP model for automatic module selection and data format conversion
In this section we present an ILP model for time assignment with automatic module selection and data format converter insertion. This model is an extension to the one presented in section III-A.
The following terminology is used in addition to the terminology defined in section III-A.
A binary variable $x_{i,j,t} = 1$ means that node $i$ starts at time $j$ on a processor of type $t$.
PROC is the library of available processors. $C_i^t$ and $L_i^t$ are respectively the computational latency and the pipeline period of node $i$ when it is executed on a processor of type $t \in \mathrm{PROC}$. $m_t$ denotes the cost of a processor of type $t$. $F_i$ denotes the subset of processors, $F_i \subseteq \mathrm{PROC}$, capable of executing node $i \in N$. $G_t$ denotes the set of nodes whose computation can be executed on a processor of type $t$. $in$ denotes an imaginary node generating input data to the DFG.
$N_{in}$ denotes the set of nodes whose computation uses the input data sample to the DFG. Thus, $N_{in} \subseteq N$. $n_o \in N$ denotes a node whose output data is the output data sample of the DFG. FORM is the set of input and output formats for all the processors.
$f_{in}$ and $f_{out}$ denote the data formats of the input and output of the DFG, respectively. $I(t)$ and $O(t)$ are respectively the input and output data formats of processor $t$. They may or may not be identical.
CONV is the library of available converters.
$v_{qr}$ denotes a data format converter which converts data from format $q$ to format $r$. Each data format converter, $v$, has conversion latency $C_v$, pipeline period $L_v$, and cost $m_v$. $V_v$ is the set of nodes which could be assigned to a processor whose output data may be converted by a converter of type $v$. It may contain $in$. A binary variable $y_{i,j,v} = 1$ means that the conversion for the output data of node $i$ starts at time $j$ using a data format converter of type $v$. $LB_i^v$ and $UB_i^v$ are the lower bound and the upper bound of the time at which a converter of type $v$ could start converting the data output from node $i$. These bounds determine the scheduling range of node $i$ and are calculated as in [10], [27].
We describe the ILP model as follows. The model minimizes the total cost of processors and converters (5) while satisfying assignment, precedence, and resource counting constraints. The following assignment constraints are necessary.
Minimize COST $= \sum_{t \in \mathrm{PROC}} m_t M_t + \sum_{v \in \mathrm{CONV}} m_v M_v$, where $M_t$ and $M_v$ denote the number of processors of type $t$ and the number of converters of type $v$, respectively.
The node assignment constraint (6) ensures that node $i$ has one start time and is assigned to one processor. The converter assignment constraint (7) ensures that a data format converter of type $v_{qr}$ is used if node $a$ is assigned to a processor with data format $q$ and has an immediate successor node $b$ assigned to a processor with data format $r$. Constraint (8) forces a converter of type $v_{f_{in},r}$ to be used in the case when a node in $N_{in}$ is assigned to a processor whose data format, $r$, is different from the format of the input, $f_{in}$. The scheduling range of these converters is $[0, T_r - 1]$, and they may be assigned to any time since they are never in a recursive loop. Similarly, constraint (9) ensures that a converter of an appropriate type is used if the data format of the output of the DFG is specified. If the input and output data formats are not specified for the DFG, then constraints (8) and (9) may be eliminated.
We must also include the following precedence constraints and inequalities to count the number of processors and converters. The precedence relation from a node to another node must be satisfied whether or not data format conversion is needed between the nodes. On the other hand, precedence relations from node computations to data format conversions and from data format conversions to node computations should be satisfied only when data format converters are used. Therefore we need three kinds of precedence constraints: from processor to processor (10), from processor to converter (11), and from converter to processor (12). In the processor to processor precedence constraint (10), the data format conversion time is taken into account. If an edge $e = (a, b)$ exists, the computation of node $b$ must start at least the computational latency of node $a$, plus the conversion latency if a converter is required, minus $T_r W_e$ time units after the start of node $a$. Constraints (11) and (12) represent the processor to converter precedence constraint and the converter to processor precedence constraint. Inequalities (13) and (14) are used to count the number of processors and the number of converters of each type.
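The rule behind the converter assignment constraint, namely that a converter is needed wherever the output format of a producer differs from the input format of its consumer, can be sketched for a fixed assignment as follows. This is only an illustration: in the ILP the assignment is itself a decision variable, and the edge list, the format tables, and the function name below are hypothetical (the actual entries of Table I are not assumed).

```python
def required_converters(edges, assigned_type, in_format, out_format):
    """Return {(q, r): [edges]} listing the format conversions an assignment implies.

    edges: list of (a, b) pairs, data flowing from node a to node b
    assigned_type: dict node -> processor type
    in_format / out_format: dict processor type -> data format
    """
    needed = {}
    for a, b in edges:
        q = out_format[assigned_type[a]]  # format produced by node a
        r = in_format[assigned_type[b]]   # format expected by node b
        if q != r:
            needed.setdefault((q, r), []).append((a, b))
    return needed


# Hypothetical assignment: nodes 1 and 6 on digit-serial units, nodes 2 and 3
# on half-word parallel units; each type is assumed to use one format for
# both input and output.
fmt = {"A_hp": "hp", "A_ds": "ds", "M_hp": "hp", "M_ds": "ds"}
types = {1: "A_ds", 2: "A_hp", 3: "A_hp", 6: "M_ds"}
edges = [(2, 6), (6, 1), (1, 2), (2, 3)]
print(required_converters(edges, types, fmt, fmt))
```

With these inputs, conversions are reported only on the edges between nodes 2 and 6 and between nodes 1 and 2. The sketch works per edge; sharing one conversion among several successors of the same node is not modeled.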
IV. Schedule Range Folding
The solution time of ILP models strongly depends on the number of integer (binary) variables. To reduce the ILP solution time it is important to tightly bound the variables. Initially the time assignment for each node was bounded by solving the scheduling range problem [10] , [27] . In this section, we propose a technique to further reduce the number of integer variables in the time assignment ILP model.
A. New variables in schedule range folding
The number of processors is counted using constraint (4). In this constraint, the parameter $k_1$ is used to fold a node into its equivalent time class prior to counting the nodes in each time class. Instead of folding during the solution of the ILP model, we can perform the folding prior to generating the ILP model to derive an ILP model with folded scheduling ranges. This is called schedule range folding. In the case where the scheduling range of node $i$, $UB_i - LB_i + 1$, is longer than the iteration period, the variables $x_{i,j}$ are kept only for $LB_i \le j \le \min\{UB_i, LB_i + T_r - 1\}$, and binary variables $z_{i,k}$, $k = 1, \ldots, Z_{UB_i}$, indicate the iteration cycle in which the computation starts.
Consequently, by schedule range folding, the number of variables to indicate the time assignment for node $i$ is reduced from $UB_i - LB_i + 1$ to at most $T_r + Z_{UB_i}$.
The exact time at which the computation of node $i$ is started can be given as $\sum_{j=LB_i}^{\min\{UB_i, LB_i + T_r - 1\}} j\,x_{i,j} + T_r \sum_{k=1}^{Z_{UB_i}} k\,z_{i,k}$. To simplify notation, we assume that $\sum_{j \in R_i} j\,x_{i,j}$ means $\sum_{j=LB_i}^{\min\{UB_i, LB_i + T_r - 1\}} j\,x_{i,j}$ in the remainder of the paper.
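The decomposition of a start time into a position inside one iteration period plus a whole number of iteration periods can be illustrated as follows. This is a sketch under one assumption that is not stated in the text: the number of iteration-cycle variables is taken to be the integer part of $(UB_i - LB_i)/T_r$. The function names are hypothetical.

```python
def fold_start_time(s, lb, ub, t_r):
    """Split start time s in [lb, ub] into (j, k) with s == j + t_r * k.

    j lies in the folded range [lb, lb + t_r - 1]; k counts whole iteration periods.
    """
    if not lb <= s <= ub:
        raise ValueError("start time outside the scheduling range")
    k = (s - lb) // t_r
    j = s - t_r * k
    return j, k


def variable_counts(lb, ub, t_r):
    """Return (variables without folding, variables with folding) for one node."""
    unfolded = ub - lb + 1
    x_vars = min(ub, lb + t_r - 1) - lb + 1  # one variable per folded time step
    z_vars = (ub - lb) // t_r                # assumed number of cycle variables
    return unfolded, x_vars + z_vars


# Hypothetical node with scheduling range [3, 14] and iteration period 4.
print(fold_start_time(13, 3, 14, 4))  # prints (5, 2), since 13 = 5 + 4 * 2
print(variable_counts(3, 14, 4))      # prints (12, 6)
```

The saving grows with the ratio of the scheduling range to the iteration period; when the range is no longer than one period, nothing is folded and the counts coincide.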
B. Extension to module selection
By using schedule range folding, the assignment constraint (6) and the processor counting constraint (13) are rewritten over the folded scheduling ranges.
The computational latency of node $a$ may vary from one processor type to another, depending on the processor type to which node $a$ is assigned. Let $F_a$ denote the set of processor types capable of executing the computation of node $a$. Let $t_a \in F_a$ denote a processor type and $C_a^{t_a}$ denote the computational latency of node $a$ on this processor.
The precedence constraints for a directed edge $e = (a, b)$ in the case where node $a$ is assigned to a processor of type $t_a$ are given by (18) and (19). By applying the constraints (18) and (19) to every processor type $t_a \in F_a$, the precedence relations are enforced for every possible assignment of nodes to processor types.
By simple calculation, it is shown that the following theorem holds.
Theorem 1: Inequalities (18) and (19) impose the same constraint as inequality (10) .
We must also fold the scheduling ranges of data format conversions. In a similar way as for node computations, binary variables $yz_{i,f,k}$ are used in addition to binary variables $\tilde{y}_{i,j,v}$. The variable $yz_{i,f,k} = 1$ indicates that the data output by node computation $i$ is converted into the format $f$ in the $(k+1)$-th iteration cycle by a data format converter of type $v_{qf}$. The $yz$ variables are prepared for each data format $f$ since the constraints (10)-(12) are applied to each data format $f \in \mathrm{FORM}$.
Example: In Table III we compare the solution of the time assignment models with and without (w/o) schedule range folding for synthesizing the 4-stage pipelined lattice filter from [42] using the libraries of processors and converters of Tables I and II in section VII. We synthesize this DFG using the three time assignment models discussed in the next section. The table contains the number of constraints, the number of variables, and the CPU times in seconds for the ILP models. These CPU times are measured by using the ILP solver GAMS/OSL [43] on a 75 MHz SPARC workstation. It can be seen that schedule range folding slightly increases the CPU time for small ILP models. This occurs because the variables $z_{i,k}$ are loosely constrained. When the ILP model gets larger, however, the reduction in the number of variables greatly improves the CPU time. In addition to schedule range folding, there exists another technique to improve the solution time for the time assignment ILP models.
V. Model Division
The complete model for time assignment discussed in the previous section is very time consuming to solve because of the large number of variables and constraints. Therefore we replace this model with a series of three smaller models each of which takes a significantly smaller amount of time to solve.
The objective of the first model is to reduce the pool of processor types for succeeding models. The first model minimizes the cost function (5) under the constraints (6), (10), (13), and (21). It derives the cost-optimal architecture by taking into account the cost of at most one data format converter of each type if such a data format converter should be used. If all the $M_v$ are 0, then the cost-optimal solution, in which the derived architecture consists of processors of an identical data format, has been obtained. Otherwise we must proceed with the second model.
The second model checks whether or not the required number of converters of each type in the first model solution is one. This model is the same as the complete model except that the assignment of nodes to processor types is fixed as given in the solution of the first model. If the number of converters is either one or zero for each converter type, the solution must be truly cost optimal. Otherwise, we proceed with the third model.
The third model is the complete model except that the pool of processor types is reduced by eliminating the processor types not used in the result of the first model. The solution to the third model may not be cost optimal because the pool of processor types has been reduced. However, since the converter cost is usually smaller than the processor cost, minimizing processor cost in the first ILP model generally leads to the complete cost optimal architecture. Therefore, reducing the pool of processors by the first model and minimizing the converter cost by the third model can generate the optimal or a very near optimal architecture.
Example: The three models and the original model are applied to the 4-level pipelined lattice filter [42]. Table IV shows the synthesized architecture, the cost of the architecture, the number of constraints, the number of integer variables in each ILP model, and the CPU time (in seconds) to solve the ILP model. The first model indicates that some converters should be used, but not exactly how many. The second ILP model indicates that more than one $v_{bp,ds}$ converter is required. Hence we proceed with the third ILP model, where we assume that only $A_{hp}$ adders, $A_{ds}$ adders, and $M_{bp}$ multipliers are available. The final architecture synthesized by the third ILP model has the same cost as the one obtained from the complete ILP model. The sum of the CPU times required to solve the first, second, and third ILP models is 53.7 seconds, which is much smaller than the CPU time of 768 seconds used to solve the complete model. Thus, in many cases, we can improve the CPU time and still achieve a cost-optimal solution.
VI. Processor Allocation
In processor allocation nodes are allocated to particular processors to support unfolding using the start times and module selection provided by the time assignment model. By treating the data format conversions as node computations and converters as processors, the allocation of data conversions to converters can be treated in the same way as the allocation of nodes to processors. Therefore, only the allocation of nodes to processors is considered here. The goal of the allocation is to minimize the LCM of unfolding factors to minimize the control circuit while maintaining the minimum number of processors determined by time assignment.
A. Node group
A node group is a set of node computations which are executed on an identical processor in the blocked schedule. Let each node computation which crosses a multiple of the iteration period be divided into two portions at the multiple of the iteration period as illustrated in Fig. 5 . Let the portion allocated to time slots at the end of the iteration period be called head and the other portion tail. If a head of a node computation and a tail of another node computation are allocated to an identical processor, then these node computations are in the same node group since these node computations must be executed one after another on the processor.
In the processor allocation of Fig. 2(e), there exists only one node group, consisting of node computations B and C. On the other hand, in the processor allocation of Fig. 2(f), there exist two node groups: one consisting of node computation B, and the other consisting of node computation C.
If the total number of node computations which cross a multiple of the iteration period is smaller than the required number of processors determined by time assignment, then imaginary node computations are introduced. An imaginary node computation crosses a multiple of the iteration period, and the lengths of its head and tail are infinitely short. In the case of the time assignment shown in Fig. 2, the required number of processors is 3 and only B and C cross a multiple of the iteration period. Hence an imaginary node computation is assumed. Now two precise processor allocations are possible, as shown in Fig. 6(a) and (b). In these figures, a is the introduced imaginary node computation.
In the processor allocation in Fig. 6(a), there exists one node group {B, C, a}. On the other hand, in the processor allocation in Fig. 6(b), there exist two node groups {B, a} and {C}.
The blocking factor of the processors which execute the node computations of a node group is the same as the number of these node computations. In the case that node computations have bodies in addition to heads and tails, as illustrated in Fig. 5(b), the blocking factor must be increased by the total number of bodies of those node computations.
The two processor allocations shown in Fig. 6(a) and (b) are the same as the abstracted schedules shown in Fig. 2, respectively. Therefore, the LCM of blocking factors is three for the processor allocation in Fig. 6(a) and two for the processor allocation in Fig. 6(b).
Now we can determine the blocking factors uniquely for each set of node groups, that is, for each processor allocation. If a blocking factor of two is preferable to minimize the LCM of blocking factors, then the processor allocation of Fig. 6(b) would be chosen. If a blocking factor of three is preferable, then the processor allocation of Fig. 6(a) would be chosen.
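The relation between a processor allocation, its node groups, and the resulting LCM of blocking factors can be sketched in a few lines. The sketch assumes the allocation is given as a one-to-one mapping from each node computation to the node computation whose head shares a processor with its tail, so that node groups are the cycles of this mapping. The mappings used for Fig. 6(a) and (b) are assumptions chosen to reproduce the node groups {B, C, a} and {B, a}, {C}; the order inside each group and all function names are hypothetical.

```python
from math import gcd


def node_groups(next_on_processor):
    """Return the cycles of a one-to-one mapping node -> successor node."""
    groups, seen = [], set()
    for start in next_on_processor:
        if start in seen:
            continue
        group, node = [], start
        while node not in seen:
            seen.add(node)
            group.append(node)
            node = next_on_processor[node]  # follow tail -> head on one processor
        groups.append(group)
    return groups


def blocking_factors(groups, bodies=None):
    """Blocking factor of a group: its size plus the bodies of its members."""
    bodies = bodies or {}
    return [len(g) + sum(bodies.get(n, 0) for n in g) for g in groups]


def lcm_all(values):
    result = 1
    for v in values:
        result = result * v // gcd(result, v)
    return result


alloc_a = {"B": "C", "C": "a", "a": "B"}  # assumed mapping for Fig. 6(a)
alloc_b = {"B": "a", "a": "B", "C": "C"}  # assumed mapping for Fig. 6(b)
print(lcm_all(blocking_factors(node_groups(alloc_a))))  # prints 3
print(lcm_all(blocking_factors(node_groups(alloc_b))))  # prints 2
```

Enumerating mappings like this is only practical for very small examples; the ILP model searches over them implicitly.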
B. Identification of node groups
Two node computations $i$ and $j$ are given orders $k$ and $k + 1$, respectively, if the tail of $i$ and the head of $j$ are allocated to an identical processor. Fig. 7 shows an example of processor allocation and the orders given to node computations based on the processor allocation. In Fig. 7(b), a vertex indicates a node computation and an arrow between vertices indicates that the tail and the head of two node computations are allocated to an identical processor. At first, node computation A is assumed to be given the order 1. Then, node computation C is given the order 2 since the tail of A and the head of C are allocated to an identical processor $P_3$. Similarly, node computation D is given the order 3. At this point an exception arises: although the tail of D and the head of A are allocated to an identical processor, these two node computations are not given successive orders.
This exception is used to count the number of node computations in a node group. If the tail of node computation $i$ (whose order is $k_i$) and the head of node computation $j$ (whose order is $k_j$) are allocated to an identical processor but $k_i > k_j$, that is, the exception occurs, then the number of node computations in the node group is computed as $k_i - k_j + 1$.
C. ILP model for processor allocation
Let each node computation i be labeled with an integer $l_i$ such that no two node computations are given the same label. $y(i, j) = 1$ means that the tail of node computation i is allocated to the processor where the head of node computation j is preallocated.
$z(v, j) = 1$ means that node computation $v \in N_2$ is allocated to the processor where the head of node computation j is allocated prior to generation of the ILP model. $g(i, k) = 1$ means that node computation i is given the order k, where k is a positive integer.
$w_i$ is the number of bodies of node computation i. $NG_i$ is the node group in which i is the youngest ordered node computation. $Q(k) = 1$ means that there exist one or more node groups whose blocking factor is k. The objective of the ILP model is shown in equation (22), where the least common multiple of blocking factors is minimized. Equation (23) ensures that each tail is allocated to one of the processors. Equation (25) ensures that no two tails are allocated to the same processor. Equation (26) orders the node computations. Inequality (27) ensures that no two node computations are given the same order.
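To make the objective concrete, the following Python sketch replaces the ILP with a brute-force search over tail allocations on a toy instance. It is not the ILP model itself: the node names, the `allowed` set of feasible (tail, head) pairings, and all function names are hypothetical, and exhaustive enumeration is only practical for very small cases. Each candidate gives every tail to exactly one head's processor, the resulting closed loops are the node groups, and the allocation with the smallest LCM of blocking factors is kept.

```python
from functools import reduce
from itertools import permutations
from math import gcd


def lcm_all(values):
    """Least common multiple of a list of positive integers."""
    return reduce(lambda a, b: a * b // gcd(a, b), values, 1)


def node_groups(succ):
    """Split a tail->head successor map (a permutation) into its loops."""
    seen, groups = set(), []
    for start in succ:
        group, node = [], start
        while node not in seen:
            seen.add(node)
            group.append(node)
            node = succ[node]
        if group:
            groups.append(group)
    return groups


def best_allocation(nodes, bodies, allowed):
    """Return (lcm, successor map, sorted blocking factors) of the best case."""
    best = None
    for perm in permutations(nodes):
        succ = dict(zip(nodes, perm))  # tail of i -> processor of head of perm
        if any(pair not in allowed for pair in succ.items()):
            continue                   # pairing not feasible in this toy case
        factors = [len(g) + sum(bodies[n] for n in g)
                   for g in node_groups(succ)]
        cost = lcm_all(factors)
        if best is None or cost < best[0]:
            best = (cost, succ, sorted(factors))
    return best


nodes = ["A", "C", "D"]
bodies = {"A": 0, "C": 0, "D": 0}
# Hypothetical feasible pairings (tail of first, head of second).
allowed = {("A", "C"), ("C", "D"), ("D", "A"), ("A", "A"), ("D", "C")}
result = best_allocation(nodes, bodies, allowed)
print(result)
```

In this toy instance two allocations are feasible: a single group of three (LCM 3), and a group of one plus a group of two (LCM 2). The search keeps the second.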
Inequalities (28) and (29) identify the node groups. In inequality (28), node computation j is given the order $k + 1 + w_i$ if $y(i, j) = 1$ and the order of node computation i is k. Inequality (29) takes into account the exception in giving orders. The exception can be restricted to occur only when $y(i, j) = 1$ and $l_i \geq l_j$. Thus inequalities (28) are applied when $l_i < l_j$ and inequalities (29) are applied when $l_i \geq l_j$.
Inequality (30) extracts the blocking factor. Inequality (31) handles the case i = j, that is, a node group that consists of only one node computation.
Inequalities (32)-(45) calculate the LCM of blocking factors for the case in which the maximum number of processors is less than or equal to 6. If more than 6 processors are used, then we must add inequalities to support blocking factors greater than 6. Such inequalities can be derived by applying the method described in [44].

Several DFGs were simulated to demonstrate the effectiveness of the models. All the ILP models were solved using the ILP solver GAMS/OSL [43] on a 75 MHz SparcStation 20.
To demonstrate the ability of our model to derive optimal solutions, we apply it to the fifth-order elliptic wave filter (EWF). We assume a single processor for each operation type. We compare our results to those in [21]. Table V contains the results for a number of iteration periods, $T_r$. In this table, '3p,' '3,' and '+' denote the number of pipelined multipliers, the number of non-pipelined multipliers, and the number of adders, respectively. The computational latencies are 2 for the pipelined and non-pipelined multipliers and 1 for the adder. The pipeline periods are 1 for the pipelined multiplier and the adder and 2 for the non-pipelined multiplier. We assume that the cost of a processor is 1 for all processor types. Although the result in [21] was obtained by resource-constrained scheduling, our model derived the same results for most cases and a better result for one case. With 1 pipelined multiplier and 2 adders, the approach in [21] required an iteration period of 18 u.t., while our approach achieves the same resource utilization at an iteration period of 17 u.t. Table V also shows the number of constraints, the number of integer variables, and the CPU time in seconds for our ILP models.
A more practical design utilizes the library of processors described in Table I. These processors are derived assuming 16-bit word-length fixed-point arithmetic. A 4-bit digit-serial adder and a half-word parallel adder are provided along with a bit-parallel adder. The 4-bit digit-serial architecture processes the data 4 bits at a time. Similarly, bit-parallel, half-word parallel, and digit-serial multipliers are provided. These architectures may be derived using the techniques described in [35]. Table I shows the computational latency C, the pipeline period L, the cost m, and the input and output data formats I and O of each processor. The cost is determined by counting the number of equivalent full adders used in each processor. Note that the computational latencies of the adders are the same but their pipeline periods vary. Table II shows the conversion latency C, the pipeline period L, and the cost m of each converter. Table VI also shows the number of processors of each type and the cost of these processors for the solutions determined in Table VI. For example, in the case of the 4-stage pipelined lattice filter with an iteration period $T_r = 3$, the synthesized architecture consists of two $A_{bp}$ adders, seven $A_{ds}$ adders, five $M_{bp}$ multipliers, and six $v_{bp,ds}$ converters.
In these benchmarks, precedence relations are considered only within each strongly connected component (SCC). Precedence relations between SCCs are ignored in order to simplify the time assignment ILP models, since our primary objective is to find the cost-optimal architecture and such inter-SCC precedences do not affect the cost of the architecture. Hence, the synthesis result may not give an executable schedule if the DFG consists of more than one SCC, because inter-SCC precedence relations are not maintained. In that case, we run another time assignment ILP model to generate a complete schedule satisfying both intra- and inter-SCC precedence relations. This time assignment ILP model begins with the processor type assignments determined by the first, second, and third ILP models and solves for new starting times. Table VII shows the result of this second time assignment for those DFGs with more than one SCC.

Table VIII contains the results of the processor allocation ILP model. This table shows the minimum LCM of unfolding factors necessary to achieve the processor allocation, the number of variables and the number of equations in the ILP model, and the CPU time required to solve the ILP model. B is the list of unfolding factors of all the processors, and LCM is the LCM of these unfolding factors. Bars ('-') in both of these columns mean that the processor allocation is obvious, since the number of processors and the number of converters is one. A bar in the LCM column alone means that there exists no node computation which crosses a multiple of the iteration period, and therefore B is 1.

Fig. 8 shows the processor allocation result for the 4-stage pipelined lattice filter (PLF) with the iteration period $T_r = 5$ u.t. In Fig. 8, white boxes represent node computations, and the number in a box is the name of the node. The letters a through f denote imaginary node computations.
The blocking factors are 4 and 1 for the $A_{ds}$ adders, 1 for the $M_{bp}$ multipliers, 6 and 3 for the $v_{bp,ds}$ converters, and 2 for the $v_{ds,bp}$ converters. LCM(2, 3, 4, 6) = 12 is the minimum LCM achieved in the processor allocation.
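The LCM quoted for this allocation can be checked with a few lines of Python. The dictionary keys simply restate the processor names from the text; the helper name `lcm_of` is an assumption for illustration. The last lines also list every LCM value that can arise when all blocking factors are at most 6, the range covered by the LCM inequalities of the model.

```python
from functools import reduce
from itertools import combinations
from math import gcd


def lcm_of(factors):
    """Least common multiple of an iterable of positive integers."""
    return reduce(lambda a, b: a * b // gcd(a, b), factors, 1)


# Blocking factors reported for the 4-stage PLF allocation.
blocking = {
    "A_ds": [4, 1],
    "M_bp": [1],
    "v_bp,ds": [6, 3],
    "v_ds,bp": [2],
}
all_factors = [b for group in blocking.values() for b in group]
plf_lcm = lcm_of(all_factors)
print(plf_lcm)

# Every LCM reachable from a non-empty set of blocking factors in 1..6.
reachable = sorted({lcm_of(c)
                    for r in range(1, 7)
                    for c in combinations(range(1, 7), r)})
print(reachable)
```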
VIII. Conclusion
We have proposed two new ILP models for the time-constrained scheduling problem. The first model performs automatic module selection and data format converter insertion while automatically retiming, pipelining, and unfolding the DFG. The second model determines processor allocation with the objective of minimizing the least common multiple of blocking factors to minimize control cost. For each model we have determined methods to significantly reduce the CPU time necessary to generate a solution. We have run several benchmarks to demonstrate the utility of these models. The ILP model is very attractive because we can easily add additional constraints to our models to impose new requirements such as minimum latency, minimum interpro-
