In this paper, we present for the first time a mathematical framework for solving a special instance of the scheduling problem in control-flow dominated behavioral VHDL descriptions, given that the timing of I/O signals has been completely or partially specified. It is based on a code-transformational approach which fully preserves the VHDL semantics. The scheduling problem is mapped onto an integer linear program (ILP) which can be constrained to be solvable in polynomial time, but still permits optimizing the statement sequence across basic block boundaries.
Introduction
As opposed to logic synthesis, architectural synthesis can considerably optimize a design by exploiting the potential given by the incomplete timing of the input description. Scheduling, which fixes the timing of operations in a design, is thus one of the key problems in architectural synthesis [11]. Scheduling for data-flow dominated designs is relatively well understood and covered by a large number of algorithms. These mostly deal with optimizing resource utilization, latency or pipelining and operate on acyclic graphs [11]. They are specialized to deal with a large share of arithmetic operations and moderate control flow, as is characteristic of digital signal processing (DSP) applications and the like.

Control-flow dominated designs, however, are characterized by a large share of nested loops and conditionals with only few arithmetic operations. The number of paths in a modestly sized design of only a few hundred lines of code can be on the order of millions. Common examples are controllers, parsing engines for packetized data streams (e.g. in ATM and video compression/decompression applications) and protocol processors.
Typically, these systems are also characterized by complex timing constraints, i.e. the number of clock cycles to elapse between a pair of operations is either upper or lower bounded or statically fixed and must not be changed during scheduling. This is particularly true for the interface timing, which must adhere to a fixed protocol if the component is to be used along with components re-used from earlier projects or highly optimized full-custom building blocks.

Siemens AG, Semiconductor Division, Balanstr. 73, D-81549 München, Germany
Control-flow dominated designs are represented using cyclic control/data flow graphs to account for loops. Similarly, timing constraints are not limited to operations within a single basic block (i.e. a code sequence of maximum length in which control flow does not branch), but can span an arbitrary number of conditionals and loops. This type of timing constraint is termed path-activated in [21]. Control flow and timing constraints thus interact in a complex manner. YEN and WOLF show in [21] that even without taking allocation costs into account, the problem of finding a feasible schedule under path-activated timing constraints is NP-complete.

This paper presents an approach to scheduling control-flow dominated VHDL descriptions by behavioral code transformation. It allows partial or full specification of the I/O timing and of path-activated timing constraints. The scheduling problem is formulated in terms of an integer linear program [10], the problem size of which is independent of the number of paths. To allow computationally efficient evaluation of design alternatives within an interactive design environment, we impose restrictions on the partial order of statements, thereby making the ILP problem solvable in polynomial time. However, the model still permits optimizing the statement sequences across basic blocks to meet performance and/or resource constraints. Scheduling can optimize for the number of states, for the length of each path, or for the number of registers.
Related Research
Apart from data-flow oriented schedulers which have been extended to efficiently handle conditional branches (such as [18]), only few approaches exist which explicitly deal with control-flow dominated descriptions. Path-based scheduling [3] aims at scheduling each path as fast as possible. Although there is a mechanism for specifying overall timing constraints, specification of exact timing relationships between pairs of I/O operations is not supported. Path-based scheduling considers all paths explicitly and does not permit optimizing the sequence of operations. Also, the start of each path is defined to be at the beginning of a loop, thus possibly yielding non-optimum results after scheduling. In [12] this is relaxed by allowing a path to start at arbitrary points; however, all paths still need to be considered explicitly.
Tree-based scheduling [7] does optimize the statement sequence, but by considering all paths its complexity is also exponential in the number of paths. It does not support specification of the I/O timing.
Relative scheduling [9] does not consider path-activated constraints and breaks up cyclic control-flow. In [20] the design and its timing constraints are initially represented separately using automaton models which are subsequently merged into a product automaton from which a feasible schedule is derived. For complex constraints, the product automaton grows exponentially in size which makes the approach impractical for large designs.
Other ILP-based approaches include [15, 6, 4]. All of them, though, are targeted towards data-flow dominated designs with only moderate control flow. Moreover, the associated ILP problems are all exponential in complexity and hence computationally inefficient for large designs. The approach presented in [13] permits global optimization of the execution order of operations. It relies on representing all possible schedules in a compressed OBDD representation. The OBDD essentially represents an ILP formulation and hence shares some of the drawbacks of the conventional ILP techniques mentioned above.
Of particular interest within the context of this work are [8] and [21]. In [8], a methodology for HDL-based specification is proposed which has been adopted in the Synopsys Behavioral Compiler. There are three different scheduling modes, depending on how tightly the timing is to be specified: in cycle-fixed mode the exact temporal relationship between I/O operations is specified in terms of clock cycles in the behavioral description. Superstate-fixed mode permits "stretching", i.e. clock cycle insertion at certain points. Free-floating mode permits specification of overall timing constraints but leaves the exact position of clock cycle transitions open to the scheduler. Similar approaches have been presented earlier in [16, 19]; however, unlike the one presented in this paper, these did not globally optimize the order of statements across basic blocks.
The approach presented in [21] is based on so-called Behavior Finite State Machines (BFSMs) which permit more powerful specification of timing constraints than VHDL does, if it has not been semantically enhanced.
Our approach can be classified in between the latter two. It relies on a methodology similar to the one suggested in [8], but allows the combination of all three types of timing specifications within a single description. The BFSM model supports more complex timing constraints; however, it cannot directly be incorporated into a VHDL-based design and simulation environment. Moreover, its scheduling algorithm is based on a heuristic which might not always find a solution even though a feasible schedule exists. Due to its exact nature, the ILP formulation presented here will always find an optimum solution if one exists, assuming a partial order on certain operations.
Modeling and Problem Definition

Timing Specification
We consider single multiple-wait VHDL processes in which the timing behavior is specified in relation to the same edge of a single clock signal clk using the following wait-statement: wait until clk = '1'. In this paper, this statement is referred to simply as "wait-statement". For instance, two consecutive writes of an output signal ack between which exactly two clock cycles must elapse (as in a tight communication protocol) can be specified as shown in Program 1 (a).

Interface timing can be specified incompletely using so-called wait-sources. A wait-source is an abstract statement which is replaced during scheduling by as many wait-statements as required to fulfill all frequency and resource constraints. It can thus be identified with a superstate which is broken up into "ordinary" states during scheduling. The number of wait-statements into which a wait-source may be converted can be bounded by passing an upper and/or lower bound to the wait-source. This is useful in specifying more flexible communication protocols. Program 1 (b) shows a code fragment from a module which, after receiving a request req, must carry out some action and issue an acknowledge ack no sooner than two and no later than four clock cycles after the request.
Program 1 VHDL modeling examples
The waits generated by the wait-source can be arbitrarily distributed on the path between the two I/O operations; in particular, they can be useful in satisfying resource and/or frequency constraints within the block that carries out some-action.
Graph Model
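The VHDL specification is mapped onto two graph structures: a weighted control/data flow graph (CDFG) and a flow graph. The CDFG(V, E, w) is a weighted, directed, cyclic graph whose node set $V$ is the union of three subsets, among them the virtual nodes $V_{vir}$, and whose edge set $E$ is the union of three subsets, among them the virtual edges $E_{vir}$ and the data-dependence edges $E_{du}$, along with a weight function $w$ on the edges.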
Definition of the Scheduling Problem
Scheduling determines the start time of an operation at clock cycle level or, in other words, assigns each operation a state in which its execution is initiated. In the model proposed above, each wait-statement can be identified with a state in the corresponding state machine. Consequently, the set of statements on a path between two wait-statements can in turn be associated with a state. Scheduling a statement s into a different state requires a code transformation to bind s to a different wait-statement.

Consider the code sequence in Program 2 (a). Let us call the states with which the two wait-statements can be identified $q_i$ and $q_j$, respectively. Note that due to the simulation semantics the values of out1 and out2 will only be visible to the environment two clock cycles after the multiplications have been initiated. This description has a resource requirement of two multipliers and two adders/subtracters, since both multiplications and both additions/subtractions are initiated in state $q_i$ and state $q_j$, respectively. Assuming that each multiplication requires two time units and each addition requires one time unit, the maximum delay in states $q_i$ and $q_j$ is four time units.

Program 2 (b) shows the same chunk of code after code transformation. It is easy to see that both code sequences are semantically equivalent. In particular, the external behavior of the circuit has not been touched even though an additional state was created. The delay on each of the two paths from the start of the code sequence until the output signals out1 and out2 change is still two clock cycles. However, the transformed code has a resource requirement of only one adder/subtracter and one multiplier. Furthermore, the maximum delay has been reduced to three time units.
To mathematically model scheduling by code transformation, each node $v \in V$ is assigned a schedule-variable $s(v)$ whose value reflects the relative change in state of node $v$ in the new schedule with regard to the initial schedule. Based on the schedule-variables, we recompute the temporal relationship of any pair of nodes (and hence their associated VHDL statements) after scheduling. The "distance" $d_q(v_i, v_j)$ of two operations $v_i$, $v_j$ in terms of controller states (or equivalently, the delay in clock cycles) is implicitly given in the original specification and can be derived from the CDFG's edge weights. Given $d_q(v_i, v_j)$, the distance $d'_q(v_i, v_j)$ of the two operations after scheduling can be computed from the schedule-variables as follows:

$$d'_q(v_i, v_j) = s(v_j) - s(v_i) + d_q(v_i, v_j). \tag{1}$$

We can enforce certain properties of the scheduled code by constraining $d'_q(v_i, v_j)$ for selected pairs of nodes. For a data dependence $(v_i, v_j) \in E_{du}$, for example, this distance must obviously be greater than or equal to zero: $d'_q(v_i, v_j) \ge 0$. In other words, $v_j$ must be executed in the same state (through chaining) or in a later state than $v_i$.

Note that this model does not yet take wait-sources into account, which also contribute to $d_q$. The number of wait-statements generated by a wait-source $v$ will be its schedule value $s(v)$. We will deal with wait-sources separately in Section 4.2 and modify the model accordingly.
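Eq. (1) can be made concrete with a few lines of Python. This is a minimal sketch, not part of the original tool: the node names `mul` and `add` and the dictionary layout are hypothetical, and the sign convention (a positive schedule value moves a node to a later state) is inferred from Eq. (1).

```python
def distance_after_scheduling(d_q, s, vi, vj):
    """Eq. (1): distance in controller states between vi and vj after scheduling."""
    return s[vj] - s[vi] + d_q[(vi, vj)]

# Hypothetical example: a multiplication 'mul' feeds an addition 'add'
# that sits one state later in the initial schedule.
d_q = {("mul", "add"): 1}

# Moving 'add' one state earlier chains it with 'mul' in the same state.
chained = {"mul": 0, "add": -1}
print(distance_after_scheduling(d_q, chained, "mul", "add"))    # 0: still feasible

# Moving it two states earlier would place it before its producer.
too_early = {"mul": 0, "add": -2}
print(distance_after_scheduling(d_q, too_early, "mul", "add"))  # -1: violates distance >= 0
```

The second assignment is rejected by the data-dependence condition because the distance after scheduling becomes negative.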
Obviously, a code transformation can be expressed in terms of an assignment to schedule-variables. Such an assignment is termed feasible if it does not alter the semantic properties of the original description. The scheduling problem dealt with in this paper can hence be defined as follows:

Given a weighted control/data flow graph CDFG(V, E, w),

In the following section we will define more rigorously what exactly makes an assignment feasible and how this can be expressed in terms of linear inequalities on the set S.
Feasible Assignments
Basic ILP Model
We require that the overall structure of the flow graph be retained, i.e. we will only allow non-control flow statements such as assignments to be moved in and out of basic blocks.
For each basic block $b \in B$ in a flow graph $FG(B, E_B)$ we define an entry node entry($b$) and an exit node exit($b$) which uniquely mark the entry and exit points of basic block $b$. These nodes may be defined so that for two basic blocks $b_i$ and $b_j$, exit($b_i$) = entry($b_j$) if and only if $(b_i, b_j) \in E_B$.
Then, it can be shown that the following two (in)equalities guarantee that the assignment to S will retain the structure of the flow graph:
If there is a data dependence from a node $u$ to a node $v$, i.e. $(u, v) \in E_{du}$, $u$ may, to preserve the code semantics, never be scheduled after $v$. This is enforced by the following constraint:

$$s(v) - s(u) + \min_{p \in P(u,v)} d_q^p(u, v) \ge 0 \tag{4}$$

where $P(u, v)$ is the set of all paths from $u$ to $v$ over the control-flow and virtual edges and $d_q^p(u, v)$ the delay on a path $p$.

It can be shown that Eq. (4) guarantees that data dependencies as defined by the set $E_{du}$ will be preserved after scheduling. However, while moving nodes into other basic blocks, we must also ensure that no other data dependencies are violated. In [14] a set of basic code transformations is defined, based on which any relocation of a statement may be described by composition. In the following we will, for the sake of brevity, concentrate on boosting-up, which moves a statement from an if- or case-branch into the basic block preceding the branching statement (preamble for short in the following).

To determine whether boosting a node would violate data dependencies in other branches, we exploit data flow information gathered at the entry and exit points of basic blocks during global data flow analysis.

The interface timing of a circuit, i.e. the temporal relationship between pairs of signal access operations, is determined by the number of controller states that are traversed between the two signal access operations. To guarantee that the external timing before and after scheduling coincide at clock cycle level, the number of controller states on each path between a pair of I/O operations must not be changed during scheduling, i.e.

$$d'_q(v_i, v_j) = d_q(v_i, v_j).$$

Substituting Eq. (1) into the latter equation returns $s(v_i) = s(v_j)$. The assignment to the schedule-variables of all I/O operations must hence be the same:

$$s(v_1) = s(v_2) = \dots = s(v_n)$$

if the nodes in $V_{io}$ are labeled $v_1$ through $v_n$.
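The data-dependence and I/O constraints of this subsection need only the minimum delay between a pair of nodes, not an enumeration of paths. The sketch below illustrates this under stated assumptions: edge weights are taken to be non-negative counts of wait-statements, Dijkstra's algorithm is swapped in as a generic shortest-path routine, and the graph, node names and the `(u, v, bound)` triple encoding `s[v] - s[u] >= bound` are hypothetical.

```python
import heapq

def min_state_distance(edges, source):
    """Minimum number of controller states from source to every reachable node.

    edges maps (u, v) to a non-negative weight, assumed here to be the
    number of wait-statements crossed along that edge.
    """
    adjacency = {}
    for (u, v), weight in edges.items():
        adjacency.setdefault(u, []).append((v, weight))
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale queue entry
        for v, weight in adjacency.get(u, []):
            if d + weight < dist.get(v, float("inf")):
                dist[v] = d + weight
                heapq.heappush(heap, (d + weight, v))
    return dist

def data_dependence_constraint(edges, u, v):
    """Encode s[v] - s[u] >= -d_min(u, v) as the triple (u, v, bound)."""
    return (u, v, -min_state_distance(edges, u)[v])

def io_constraints(io_nodes):
    """Encode s[a] == s[b] for consecutive I/O nodes as two inequalities each."""
    pairs = zip(io_nodes, io_nodes[1:])
    return [c for a, b in pairs for c in ((a, b, 0), (b, a, 0))]

# Hypothetical fragment: an if/else inside a loop; the else branch holds a wait.
edges = {
    ("a", "branch"): 0,
    ("branch", "then"): 0, ("branch", "else"): 1,
    ("then", "join"): 1, ("else", "join"): 1,
    ("join", "a"): 1,          # loop back edge makes the graph cyclic
}
print(min_state_distance(edges, "a")["join"])          # 1 (via the then-branch)
print(data_dependence_constraint(edges, "a", "join"))  # ('a', 'join', -1)
print(io_constraints(["in1", "out1", "out2"]))
```

One constraint per node pair is produced no matter how many paths connect the pair, which is the property that keeps the problem size independent of the path count.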
Suppose a resource constraint of $n_r$ instances of a functional unit of type $r$ has been defined. Then, any path in the control flow with more than $n_r$ operations of type $r$ and a path weight of zero must be partitioned into at least two states. The corresponding constraint ranges over the set of node pairs violating the condition outlined above, which can be computed in $O(|V|^2)$ time using a breadth-first technique. Instead of considering paths in the control flow, frequency constraints can be formulated in a similar way by considering chains in the data flow for which the delay exceeds some upper bound $\Delta_{max}$.
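One way to read the partitioning requirement is that every violating pair must end up at least one state apart after scheduling. The following is a sketch under assumptions, not a reproduction of the authors' formula: the label $V_r$ for the set of violating node pairs is introduced here purely for illustration, and the distance expression follows Eq. (1).

```latex
s(v_j) - s(v_i) + d_q(v_i, v_j) \;\ge\; 1
\qquad \forall\, (v_i, v_j) \in V_r
```

A constraint of this shape involves exactly two schedule-variables with coefficients $+1$ and $-1$, and it forces $v_j$ into a later state than $v_i$ even where no data dependence requires that order.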
Modifications for Wait-Sources
In the presence of wait-sources the number of states between two nodes can obviously not be determined statically, since it is not known how many wait-statements (or controller states) are generated by a wait-source. We thus have to generalize the definition of $d_q$.
Given a pair of nodes $(v_i, v_j)$ and the set of wait-sources $V_s$ between $v_i$ and $v_j$, the distance between $v_i$ and $v_j$ taking into account wait-sources can be computed as follows:
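A sketch of this computation, assuming that each wait-source $w$ lying on a path $p$ contributes exactly its schedule value $s(w)$ in additional states. The symbols $\hat{d}_q^{\,p}$ and $V_s^p$ (the wait-sources on path $p$) are hypothetical notation introduced here, not taken from the original.

```latex
\hat{d}_q^{\,p}(v_i, v_j) \;=\; d_q^{\,p}(v_i, v_j) \;+\; \sum_{w \in V_s^p} s(w)
```

Under this reading the distance is no longer a constant but a sum of schedule-variables, so a constraint built from it can involve more than two unknowns.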
The constraints have to be modified accordingly. Since it is no longer possible to find the minimum distance between a pair of nodes, each of the constraints including a minimum term has to be replaced by a set of constraints, one for each possible path between the two nodes. This guarantees that the constraints for each path will be met simultaneously. Moreover, the following constraint ensures that for a wait-
Objective Functions
The constraint equations defined in the preceding subsections are all linear in the unknown schedule-variables; together with a linear objective function we thus arrive at an integer linear programming problem [10]. In the presence of wait-sources the minimum number of wait-statements will be generated with the following objective function: $\sum_{w \in V_s} s(w)$. Note that this implicitly minimizes the length of each path containing wait-sources, since only as many wait-statements will be generated as are required to satisfy all resource and frequency constraints.
Similarly, cost functions can be formulated to minimize the number of controller states or the number of registers, either by minimizing the "maximum cut" or by maximum chaining. Minimization of registers is important from a low-power aspect, since the clock net accounts for a large percentage of the overall power budget. For more information on the objective functions, the reader is referred to [19].
Complexity
Let us first look at the complexity of a scheduling problem without wait-sources. The number of ILP variables in the model presented in the preceding sections is $O(|V|)$, the number of constraints is $O(|V|^2)$. In general, solving an integer linear program is an NP-hard problem [5]. Note, however, that each of the constraints (2)-(7) involves only two ILP variables and that the corresponding coefficients are +1 and -1 in each equation. The constraint matrix defined by constraints (2)-(7) is a (0, +1, -1) matrix in which no row has more than one +1 and one -1. This makes the constraint matrix totally unimodular, and it follows that the optimum solution to the ILP problem can be found in polynomial time [10], as opposed to the approach in [21].
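Constraints with one +1 and one -1 per row are difference constraints, and their feasibility can be decided by a Bellman-Ford style relaxation. The sketch below uses this textbook technique for illustration only; it is not the ILP solver used in this work, it ignores the objective function, and the node names and the `(u, v, c)` triple encoding `s[v] - s[u] >= c` are hypothetical.

```python
def solve_difference_constraints(nodes, constraints):
    """Find integers s with s[v] - s[u] >= c for every (u, v, c), or None.

    Each constraint is rewritten as s[u] <= s[v] - c and relaxed repeatedly,
    starting from s = 0 everywhere (an implicit source at distance 0).
    """
    s = {n: 0 for n in nodes}
    for _ in range(len(nodes)):
        changed = False
        for u, v, c in constraints:
            if s[v] - c < s[u]:
                s[u] = s[v] - c
                changed = True
        if not changed:
            return s
    return None  # still changing after |V| rounds: contradictory constraints

# Hypothetical instance: a data dependence, two I/O nodes tied together,
# and a resource pair that must be split into different states.
nodes = ["a", "b", "c", "io1", "io2"]
constraints = [
    ("a", "b", -1),                        # data dependence, initial distance 1
    ("io1", "io2", 0), ("io2", "io1", 0),  # equal schedule values for I/O
    ("a", "c", 1),                         # resource pair, initial distance 0
]
solution = solve_difference_constraints(nodes, constraints)
print(solution)  # {'a': -1, 'b': 0, 'c': 0, 'io1': 0, 'io2': 0}

# Contradictory requirements are detected rather than looping forever.
print(solve_difference_constraints(["a", "b"], [("a", "b", 1), ("b", "a", 0)]))  # None
```

Because all bounds are integers, the relaxation yields integer schedule values without any branching, which mirrors the integrality guarantee that total unimodularity gives the ILP.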
Moreover, the size of the ILP problem is independent of the number of paths in the description. Constraints (4)-(5) and (7) do consider pairs of nodes between which more than one path may exist; however, only one constraint per pair is generated, considering only the minimum delay over these paths.
Obviously, we are not solving the general scheduling problem, which belongs to the class of NP-complete problems. Polynomial solvability is achieved by imposing a partial order on the pairs of statements to which the resource constraint applies: Eq. (7) implicitly defines node $v_j$ to be scheduled after node $v_i$, even though this order may not be imposed by any data dependence. In practice, we have found that this is not really a restriction, since the target applications are control-flow dominated and usually only consist of a few blocks of data-flow intensive code. Also note that due to the static nature of data-flow analysis, the mobility of a node may change when some other node is moved, so that the mobility as defined in Section 4.1 may actually over-constrain the ILP problem.
In the presence of wait-sources the total unimodularity of the constraint matrix is lost. Since in this case a distance $d_q$ is not a constant term but a sum of schedule variables, there may be more than two non-zero entries in each row of the constraint matrix. Furthermore, as explained in Section 4.2, some constraints are replaced by sets of constraints, so that the problem size theoretically becomes dependent on the number of paths. In practice, though, we have observed that due to the relatively small number of wait-sources in a typical application (cf. Section 5), the penalty in runtime is tolerable.
Results
The scheduling algorithm was implemented in the VOTAN (VHDL Optimization, Transformation and ANalysis) synthesis platform, an interactive framework for analysis and optimization of behavioral VHDL code designed to be used as a front-end optimizer to logic synthesis.
Comparing our scheduling technique with related work is difficult due to the fact that we require an initial schedule or at least, in the case of wait-sources, some notion of the part of the code in which "superstates" are to be specified. Table 1 reviews the benchmark examples on which we ran the scheduler. qrs is a circuit for heart rate monitoring, mw is part of an automotive control circuit and x.25 is the send process from an X.25 communications protocol.
Of particular interest in this context are the qrs and mw examples, which are both heavily control-dominated. The qrs description is made up of only 272 lines of VHDL code (LOC) but contains approx. $6.7 \cdot 10^6$ paths. The mw description is 845 LOC containing more than $4 \cdot 10^9$ paths. The x.25 example is of modest size with only 74 LOC and 12 paths. Table 1 lists the number of registers and states in the original description and in the optimized descriptions scheduled for register and state minimization subject to the resource constraints given in the first column. The gains in columns three and four are the reductions in the number of registers and states obtained with the appropriate objective function. The CPU time t for solving the ILP problems was measured on a SPARCstation 10 clocked at 70 MHz with 320 MB of memory. The ILP solver used was a public domain solver from TU Eindhoven [2]. Note that the gains, particularly for the large qrs and mw examples, are significant while the CPU times required were relatively low.
Conclusion
We have presented an ILP formulation to solve the scheduling problem in control-flow dominated behavioral descriptions in polynomial time by restricting the partial order of statements. The size of the problem is independent of the number of paths in the VHDL description. Furthermore, its structure permits optimization of statement sequences across basic block boundaries. The model supports complete or partial specification of the timing of I/O signals, which is retained during scheduling. Subject to resource and performance constraints, this scheduling approach allows optimization of the number of registers and controller states, which traditionally belong to the domain of RT-synthesis, at the architectural level by means of code transformation. It is thus well-suited for optimizing VHDL descriptions prior to logic synthesis.
