Abstract: This paper describes an exact solution methodology, implemented in Rensselaer's Voyager design space exploration system, for solving the scheduling problem in a 3-dimensional (3D) design space: the usual 2D design space (which trades off area and schedule length), plus a third dimension representing clock length. Unlike design space exploration methodologies that rely on bounds or estimates, this methodology is guaranteed to find the globally optimal solution to a 3D scheduling problem. Furthermore, this methodology efficiently prunes the search space, eliminating provably inferior design points through: (1) a careful selection of candidate clock lengths, and (2) tight bounds on the number of functional units or on the schedule length. Both chaining and multi-cycle operations are supported.
I. Introduction
High-level synthesis is the design task of converting a behavioral description of a digital system into a register-transfer level design that implements that behavior. One of the central problems in high-level synthesis is the scheduling problem: the problem of mapping operations onto control steps (csteps) in the proper order. The scheduling problem is usually formulated in one of three ways, depending on the goal: (1) Time-Constrained Scheduling (TCS), which minimizes the number of resources when the number of control steps is fixed, (2) Resource-Constrained Scheduling (RCS), which minimizes the number of control steps when the number of functional units is fixed, or (3) Time- and Resource-Constrained Scheduling (TRCS), which determines whether or not a feasible schedule exists when both the number of functional units and the number of control steps are fixed.
The process of solving the scheduling problem can be viewed as the process of exploring a 2-dimensional (2D) design space, with axes representing time (schedule length) and area (ideally total area, but often simplified to functional unit area). This 2D design space is shown in Figure 1, where feasible designs lie in the shaded region, and infeasible designs lie in the white region. Optimal designs lie on the curve between the two regions, and represent the tradeoff between time and area.
A. The 3D Design Space
In reality, however, this 2D design space is only a small part of a much larger design space. One such larger design space is presented by De Micheli in [1], and is illustrated in Figure 2. Here the design space for high-level synthesis is viewed as a 3-dimensional (3D) space, with axes representing not only schedule length and area, but clock (cycle) length as well.
A typical scheduling algorithm explores only one 2D slice of this larger 3D design space: the slice corresponding to a fixed clock length chosen a priori by the designer. This clock length depends on many factors, including the delays of the functional units, storage elements, glue logic, and wiring, as well as controller delays. Some of those values are unknown before scheduling, and can therefore only be estimated at this stage in the design process.
Unfortunately, the designer must specify a clock length (or at least the data path component of the clock length) before scheduling. Lacking detailed information, the designer is forced to make an ad hoc and frequently arbitrary guess at the clock length (footnote 1). This ad hoc choice eliminates an entire dimension of the search space, so even an optimal scheduler will explore only the corresponding 2D slice of the design space, and will produce a schedule that is optimal only for that one clock length. A better schedule may exist for a different clock length, but will not be found.
To motivate the need to explore this 3D design space, consider the problem of scheduling the well-known Elliptic Wave Filter (EWF) benchmark [6, p. 206] under a variety of resource constraints, to find the fastest possible schedule. Assume that the VDP100 module library [7], [8] is used, which has a multiplication delay of 163 ns and an addition delay of 48 ns.
Forced to select a clock length for the scheduling algorithm, the designer would probably choose a clock length of either 48 ns or 163 ns, the execution delay of addition or multiplication respectively. Given those clock lengths, an optimal scheduler that supports multi-cycle operations (such as the ILP-based scheduler [9] in our Voyager design space exploration system) would produce the results shown in the rows labeled "48" and "163" in Table I. Now consider the other rows of Table I, which represent other, perhaps less obvious, choices for the clock length. For each resource constraint, the fastest design corresponds to a clock length of 24 ns, a design that would not be found by a scheduling methodology limited by ad hoc guesses (footnote 2). Thus it is important to explore a number of candidate clock lengths to find the globally optimal solution.
B. Exploring the 3D Design Space
A variety of methodologies can be used for design space exploration. The methodologies may use exact algorithms to find optimal solutions, may use heuristic algorithms to find lower and upper bounds on the optimal solution, or may use heuristic algorithms to estimate the optimal solution. In general, the tradeoff between these three types of methodologies is one of solution quality versus computation time. This paper is concerned with finding optimal (exact) solutions to the scheduling problem in the 3D design space.

Exhaustive Search:
    read in DFG, module library, and any constraints
    for each clock length
        optimally schedule the DFG
    present the best result(s) to the user for evaluation
Fig. 3. Exhaustive Search of the 3D Design Space (Impractical)

Footnote 1: [...]tion [3], or reclocking [4] to determine the final clock length. However, these techniques generally do not change the relative scheduling of the operations, and do not perform tradeoffs involving resource sharing, so they do not explore the high-level design space as fully as scheduling techniques. Nevertheless, the later use of these techniques, possibly in conjunction with other transformations [5], can serve as a valuable complement to our methodologies.

Footnote 2: This small clock length also results in a larger number of control steps, and thus a larger and more complex control unit. However, note that a clock length of 55 ns, which is more comparable to the ad hoc guesses, results in a schedule almost as fast as the one corresponding to a 24 ns clock, and faster than those corresponding to the ad hoc guesses.
One exact methodology for optimally solving this 3D scheduling problem is shown in Figure 3. This methodology exhaustively explores all potential clock lengths and all feasible schedules, and guarantees a globally optimal solution. Unfortunately, the computation time for this methodology is too high to be practical for all but the simplest examples.
In contrast, this paper presents a more efficient exact methodology, implemented in the Voyager design space exploration system, for optimally solving this 3D scheduling problem. This methodology makes the problem tractable through: (1) careful pruning of provably inferior points from the design space, and (2) provably efficient exact algorithms for solving the individual problems.
However, even this solution methodology is only the first step toward the larger design space exploration problem that eventually needs to be solved. As described here, our methodology does not consider the module selection or type mapping problems, and does not support loops or conditionals (footnote 3). It also does not incorporate register, wiring, or controller area, and only partially incorporates the delays associated with the controller and wiring. Nevertheless, the work described here can serve as a foundation for an exact solution methodology that incorporates each of these factors, either by adding extra dimensions to the search space or by adding other stages to the methodology.
II. Methodology Overview
This paper presents two methodologies for solving the clock determination and scheduling problem that are guaranteed to find the globally optimal design and that are far more efficient than an exhaustive search of the design space. One methodology solves the Time-Constrained 3D Scheduling (TCS-3D) problem (Figure 4), while the other solves the Resource-Constrained 3D Scheduling (RCS-3D) problem (Figure 5). Both methodologies are implemented in Rensselaer's Voyager design space exploration system.
The core of each methodology is based roughly on the exhaustive search of Figure 3. Each methodology computes a set of candidate clock lengths, and then, for each candidate clock length, optimally solves the scheduling problem. However, a straightforward implementation of this core methodology takes far too long to run, even for small benchmarks. Thus it is important to (1) solve the scheduling problem for only a small, provably minimal set of candidate clock lengths, and (2) solve the scheduling problems as efficiently as possible so that an optimal solution is found in a reasonable amount of time.
To solve the scheduling problem, both methodologies use an Integer Linear Programming (ILP) formulation (Section IV) that was developed after a careful formal analysis of that problem. This analysis was presented earlier in [9], where we proved that this formulation, in particular the formulation of the TRCS problem, is well-structured and can be solved efficiently.
Since the search spaces for the TCS and RCS problems are each larger than that of the TRCS problem, these methodologies solve the TCS and RCS problems by generating the missing constraints, in effect converting each into an easier-to-solve TRCS problem. For the TCS problem, the methodology computes constraints on the number of functional units of each type; for the RCS problem, it computes a time constraint on the length of the schedule. Since these constraints can also be found efficiently, the entire methodology is efficient.
A. Time-Constrained 3D Scheduling (TCS-3D)
Voyager's methodology for solving the time-constrained 3D scheduling problem is outlined in Figure 4. This methodology begins by reading in the data flow graph (DFG), the execution delays for the relevant functional units in the module library, and the overall time constraint.
The minimal set of candidate clock lengths is then determined (see Section III), based on the execution delays of the relevant functional units in the module library, and for chained designs, on the structure of the DFG. For the EWF and the module library described earlier, 10 candidate clock lengths would be generated (in the absence of chaining). For each of these clock lengths, time-constrained scheduling is then performed, and the results are presented to the user for evaluation.
To solve the TCS problem efficiently, Voyager's ILP formulation of the TRCS problem (described in Section IV) is used as follows. First, tight lower bounds on the number of functional units of each type are computed (using a method sketched out in Section V). These bounds are then used as resource constraints, and the TRCS problem is solved as a decision problem. If TRCS produces a feasible schedule, then that schedule is guaranteed to be optimal; if not, the resource constraints are increased, and this process is repeated.

This TCS-3D solution methodology is relatively efficient for the following reasons. First, the functional unit lower bounds can be computed in polynomial time, by solving at most two Linear Programs (LPs). Second, TRCS is solved as a decision problem, rather than an optimization problem, using a formulation that is well-structured and requires few, if any, branches in a branch-and-bound search [9]. Finally, the functional unit lower bounds are highly accurate [11] (in almost every case they lead immediately to a feasible solution), so in practice the lower bounds seldom have to be increased to solve TRCS again. Thus the TCS-3D problem can be solved quickly, even for medium-sized benchmarks (see Section VII).
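The loop described above (start from the functional unit lower bounds, solve TRCS as a decision problem, and relax the bounds only on failure) can be sketched as follows. This is a minimal illustration under stated assumptions, not Voyager's implementation: `trcs_feasible` is a hypothetical brute-force backtracking search standing in for the ILP, operations are assumed to be listed in topological order, functional units are assumed non-pipelined, and the order in which resource constraints are increased (by total unit count) is an assumption, since the text does not specify it.

```python
from itertools import product

def trcs_feasible(ops, preds, cycles, limits, n_steps):
    """Brute-force stand-in for the TRCS decision problem (not the paper's ILP).
    ops: {name: type}, assumed listed in topological order.
    preds: {name: [predecessor names]}; cycles: {type: csteps per operation}.
    limits: {type: available units}; n_steps: time constraint in csteps.
    Returns {name: start cstep} or None if no feasible schedule exists."""
    order = list(ops)
    usage = {t: [0] * n_steps for t in limits}
    start = {}

    def place(idx):
        if idx == len(order):
            return True
        op = order[idx]
        t, dur = ops[op], cycles[ops[op]]
        # An operation may start only after every predecessor has finished.
        earliest = max((start[p] + cycles[ops[p]] for p in preds.get(op, [])),
                       default=0)
        for s in range(earliest, n_steps - dur + 1):
            if all(usage[t][u] < limits[t] for u in range(s, s + dur)):
                for u in range(s, s + dur):
                    usage[t][u] += 1
                start[op] = s
                if place(idx + 1):
                    return True
                for u in range(s, s + dur):   # undo and try the next start step
                    usage[t][u] -= 1
                del start[op]
        return False

    return dict(start) if place(0) else None

def solve_tcs(ops, preds, cycles, lower_bounds, n_steps):
    """Try the lower bounds first, then add units one at a time (assumed order)."""
    types = sorted(lower_bounds)
    for total_extra in range(len(ops) + 1):
        for extra in product(range(total_extra + 1), repeat=len(types)):
            if sum(extra) != total_extra:
                continue
            limits = {t: lower_bounds[t] + e for t, e in zip(types, extra)}
            sched = trcs_feasible(ops, preds, cycles, limits, n_steps)
            if sched is not None:
                return limits, sched
    return None  # time constraint cannot be met with any number of units
```

For a toy graph with two 2-cstep multiplications feeding one 1-cstep addition, a 5-cstep constraint is met at the lower bound of one multiplier, while a 3-cstep constraint forces one increase to two multipliers.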
The efficiency of the methodology can be further increased if the goal is to find the schedule with the fewest functional units. In this case, before each TCS problem is solved as a TRCS problem, the FU lower bounds are compared to the number of FUs required in the best previous schedule. If the new bounds are smaller, then the TRCS problem is solved as explained above; if the new bounds are larger, then there is no need to solve the TRCS problem since it would require more functional units than the best solution found so far.
B. Resource-Constrained 3D Scheduling (RCS-3D)
Voyager's methodology for solving the resource-constrained 3D scheduling problem is similar (see Figure 5). This methodology reads in a resource constraint, and generates a minimal set of candidate clocks using the clock length determination algorithm described in Section III. For each of these clock lengths, resource-constrained scheduling is then performed.
To solve the RCS problem efficiently, Voyager's ILP formulation of the TRCS problem (Section IV) is used as follows. First, a tight lower bound on the overall length of the schedule is computed (Section VI). This bound is then used as a time constraint, and the TRCS problem is solved as a decision problem. If TRCS produces a feasible schedule, then that schedule is guaranteed to be optimal; if not, the time constraint is increased, and this process is repeated. The RCS problem can be solved quickly, even for medium-sized benchmarks (see Section VII).
The efficiency of the methodology can be further increased if the goal is to find the shortest schedule. In this case, before each RCS problem is solved as a TRCS problem, the schedule length lower bound is compared to the length of the best previous schedule. If the new bound is smaller, then the TRCS problem is solved as explained above; if the new bound is larger, then there is no need to solve the TRCS problem since it would result in a longer schedule than the best solution found so far.
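The pruning rule above can be sketched as a short loop over candidate clock lengths. This is an illustrative sketch only: `lower_bound` and `solve_rcs` are hypothetical callables standing in for the bound of Section VI and the exact scheduler, both are assumed to return schedule lengths in the same time unit so that results for different clock lengths are comparable, and a bound equal to the best length so far is also skipped, since it cannot improve on that solution.

```python
def rcs_3d(candidate_clocks, lower_bound, solve_rcs):
    """Return ((best_length, best_clock), clocks_actually_scheduled).
    lower_bound(c) and solve_rcs(c) are assumed to return schedule
    lengths in the same time unit (for example, nanoseconds)."""
    best = None
    solved = []
    for c in candidate_clocks:
        # Skip clock lengths whose bound cannot beat the best schedule so far.
        if best is not None and lower_bound(c) >= best[0]:
            continue
        solved.append(c)
        length = solve_rcs(c)
        if best is None or length < best[0]:
            best = (length, c)
    return best, solved

# Made-up numbers for illustration; they are not taken from Table I.
bounds = {163: 1000, 55: 900, 48: 950, 24: 800}
lengths = {163: 1100, 55: 920, 48: 1000, 24: 850}
best, solved = rcs_3d([163, 55, 48, 24], bounds.get, lengths.get)
```

In this made-up run the 48 ns candidate is never scheduled, because its lower bound already exceeds the schedule found for 55 ns.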
C. Advantages of this Solution Methodology
In summary, Voyager's exact solution methodology has a two-fold advantage over previous methodologies: (1) guaranteed optimal results, and (2) solution techniques based on efficient pruning of the search space.
Unlike other design space exploration methodologies that rely on bounds or estimates to make the problem tractable, this methodology generates the minimal set of candidate clock lengths that could possibly correspond to the optimal design, and then optimally solves either the TCS or RCS problem for each of those clock lengths. Thus it is guaranteed to find the globally optimal result.
Furthermore, although this methodology may appear at first glance to perform exhaustive scheduling, in reality it is quite efficient for three reasons. First, a minimal set of candidate clock lengths is generated, and scheduling is performed for only those few values. Second, instead of directly solving the TCS or RCS problem, the missing constraints are generated, converting that problem into a TRCS problem with a smaller search space; moreover, those constraints are tight, and are also generated efficiently. Finally, a TRCS formulation is used that is well-structured [9], and therefore usually finds an optimal solution with few branches.
III. Determining Candidate Clock Lengths
One of the most important parameters needed by any scheduling algorithm is the length of the system clock (footnote 4).
Determining this clock length requires a detailed analysis of the clock skew, wire delays, glue logic delays, setup and propagation delays of the storage elements, etc. [13]. However, all such quantities are largely unknown during high-level synthesis. Fortunately, although such a detailed analysis is necessary later in the design process, it is not needed during high-level synthesis, where only the macroscopic structure of the circuit is determined.
One appropriate model of the clock length during high-level synthesis is presented by Chaiyakul and Gajski in [14]. Here the clock length is assumed to have 3 components: datapath delay, control delay, and wire delay. For the moment, we will use only the datapath delays to determine the clock length, and will ignore the control and wire delays, realizing that the actual clock length will be longer due to those delays; this limitation will be addressed later in Section III-C. We will also assume a bus-based architecture with a point-to-point interconnection topology, meaning there exists only one bus between any two functional unit and/or storage unit ports.
Definition 1: Let $t_s(\mathrm{reg})$ and $t_p(\mathrm{reg})$ be the setup time and propagation delay of the registers, and let $t_p(\mathrm{interconnect})$ be the interconnect propagation delay. If the delay of a functional unit of type $k$ is denoted as $\mathrm{delay}(k)$, the execution delay $d_k$ for a register-to-register transfer executing an operation of type $k$ is given as

$d_k = \mathrm{delay}(k) + t_s(\mathrm{reg}) + t_p(\mathrm{reg}) + t_p(\mathrm{interconnect}).$

Throughout the remainder of this section, the set $D$ will be used to denote the set of all $d_k$ values found in the given DFG.
The remainder of this section describes Voyager's methodology for choosing a set of provably non-inferior candidate clock lengths. Section III-A describes the methodology in the absence of chaining, and then Section III-B describes the extensions necessary to support chaining. Finally, Section III-C discusses how controller delay could be included as well.
A. Determining Candidate Clock Lengths in the Absence of Chaining
Before discussing Voyager's methodology for determining candidate clock lengths, it is necessary to have a measure of the quality of one clock length with respect to other clock lengths for a particular operation. One such measure that is commonly used is operation slack, defined as follows:
Definition 2: For a given clock length $c$, the slack $s_k$ of an operation of type $k$ is given by

$s_k = c \, \lceil d_k / c \rceil - d_k.$
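Definition 2 can be checked numerically. The sketch below assumes the slack takes the form $s_k = c \lceil d_k / c \rceil - d_k$, which matches the 55 ns worked example given in the discussion of chaining later in this section; the function name `slack` is hypothetical.

```python
from math import ceil

def slack(c, d_k):
    """Unused time left in the last clock period when an operation with
    execution delay d_k is run (possibly multi-cycled) at clock length c."""
    return c * ceil(d_k / c) - d_k

# VDP100 delays quoted in the text: multiplication 163 ns, addition 48 ns.
mult_slack_at_55 = slack(55, 163)   # 3 cycles of 55 ns = 165 ns, so 2 ns slack
add_slack_at_24 = slack(24, 48)     # 2 cycles of 24 ns fit exactly, so 0 ns slack
```

A clock length that divides an execution delay exactly leaves no slack for that operation type, which is one reason a small clock such as 24 ns can outperform the ad hoc choices of 48 ns and 163 ns.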
Voyager's methodology determines a minimal set of candidate clock lengths in a range $[\underline{c}, \overline{c}]$. This range is bounded by $\underline{c}$, the minimum clock length possible for implementing the design's controller, and $\overline{c}$, the largest $d_k$ (or maximum chain length when chaining is considered). One of the goals of the Voyager 3D design space exploration methodology is to find the minimal set of non-inferior clock lengths $c$ in this range that need to be examined in order to find the globally optimal solution.
Unfortunately, the clock determination problem is usually ignored in favor of ad hoc decisions or estimates, which, as demonstrated later, can ignore much of the design space and lead to an inferior design. For example, several previous clock estimation schemes [15], [16] use the delay of the slowest functional unit as the estimated clock length. A more realistic approach is used in [7], in which a contiguous range of integer candidate clock lengths is heuristically evaluated in an attempt to provide some guidance as to the "best" clock length to choose.
However, all of these approaches choose the clock length before, and independent of, scheduling. Thus they are at best estimates, since it is never possible to guarantee that a better schedule with a different clock length does not exist. Therefore it may seem at first that the globally optimal solution to the 3D scheduling problem cannot be found without optimally solving the scheduling problem for every possible clock length, which would be a prohibitively expensive exhaustive search.
Fortunately, this exhaustive search is not necessary. In [17], Corazao et al. combined clock length determination with the problem of operation template matching, and made some suggestions to reduce the number of candidate clock lengths. However, the number of candidate clock lengths can be reduced even further, as shown in our Theorem 1 below (a similar observation was made by Chen et al. in [18], but presented without proof).
The following theorem shows that only certain clock lengths in the range $[\underline{c}, \overline{c}]$ must be explored to find the globally optimal clock length $c^*$, when chaining is not considered, and when clock lengths are not assumed to be integers:

[...]

[...], chaining refers to the technique of scheduling two or more data-dependent operations into the same control step, using the otherwise wasted "slack" time that remains in the clock period after the first operation finishes. Commonly used in industry, these chains of two or more operations may include a variety of arithmetic operations, logic operations, etc. At the register-transfer level, the chain is implemented by connecting the output of one functional unit directly to the input of the following functional unit (i.e., without an intervening register).
Definition 3: Let $t_s(\mathrm{reg})$ and $t_p(\mathrm{reg})$ be the setup time and propagation delay of the registers, and let $t_p(\mathrm{interconnect})$ be the interconnect propagation delay. If the total delay of the functional units involved in chain $ch$ is denoted as $\mathrm{delay}(ch)$, the chain delay $d_{ch}$ for a register-to-register transfer executing a chain $ch$ is given as

$d_{ch} = \mathrm{delay}(ch) + t_s(\mathrm{reg}) + t_p(\mathrm{reg}) + t_p(\mathrm{interconnect}).$
In the discussion of chaining that follows, the length of the chain will denote the number of operations that are chained together in sequence. For simplicity, the discussion will be limited to chains of length 2, although the methodologies can be extended to handle longer chains at the cost of a larger solution space and a corresponding increase in execution time. We categorize chaining into three types, since treating each type differently allows us to more fully reduce the set of candidate clock lengths. These three types (footnote 5) are summarized as follows, and are illustrated in Figure 7:
type I chaining: the entire chain must execute within a single control step
type II chaining: the chain may execute over multiple control steps (as may one or more of the operations), but the final operation in the chain must start and finish within the last control step
type III chaining: the chain may execute over multiple control steps (as may one or more of the operations), but the first operation in the chain must start and finish within the first control step.

Note that type I chaining (the classical form of chaining) is a special case of type II and type III chaining.
To determine the minimal set of candidate clock lengths $CK$ when chaining is allowed, the set of operation execution delays $D$ must be considered in conjunction with the set $D_{ch}$ of possible chain delays in a given DFG. Thus, the task of determining the candidate clock lengths for resource-constrained scheduling with chaining consists of two steps: (1) finding the set $D_{ch}$ of all possible chain delays due to potential chains in the DFG, and (2) using $D$ and $D_{ch}$ to determine a set $CK'$ that represents the minimal set of candidate clock lengths that must be explored for the given type of chaining.
B.1 Finding All Possible Chain Delays in the DFG
To find all possible chain delays due to potential chains in the DFG, Voyager performs a depth-first search of the DFG, using a lookahead equal to the maximum chain length allowed. This recursive algorithm, shown in Figure 8, finds all the chain delays in a DFG when invoked with the call chain_clocks(source, 0, 0). The algorithm runs in $O(n^l)$ time, where $n$ is the number of operations in the DFG, and $l$ is the maximum allowable chain length.
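A depth-first enumeration of chain delays with bounded lookahead might look like the sketch below. It is not the algorithm of Figure 8, which is not reproduced in this text: the function name `chain_delays`, the adjacency-list DFG representation, and the `overhead` parameter (standing in for the register and interconnect terms of Definition 3, defaulting to zero) are all assumptions made for illustration.

```python
from typing import Dict, List, Set

def chain_delays(succ: Dict[str, List[str]],
                 op_type: Dict[str, str],
                 fu_delay: Dict[str, float],
                 max_len: int = 2,
                 overhead: float = 0.0) -> Set[float]:
    """Collect the delay of every chain of 2..max_len data-dependent operations.
    succ maps each operation to its successors in the DFG; op_type maps each
    operation to its type; fu_delay maps each type to its unit delay."""
    found: Set[float] = set()

    def extend(op: str, length: int, delay: float) -> None:
        length += 1
        delay += fu_delay[op_type[op]]
        if length >= 2:                  # a chain needs at least two operations
            found.add(delay + overhead)
        if length < max_len:             # lookahead limited to max_len operations
            for nxt in succ.get(op, []):
                extend(nxt, length, delay)

    for op in op_type:                   # a chain may begin at any operation
        extend(op, 0, 0.0)
    return found

# Toy DFG (hypothetical): one multiplication feeding two additions in series.
succ = {"m1": ["a1"], "a1": ["a2"], "a2": []}
op_type = {"m1": "mul", "a1": "add", "a2": "add"}
fu_delay = {"mul": 163, "add": 48}
d_ch = chain_delays(succ, op_type, fu_delay)
```

On this toy graph the sketch returns the delays 211 (multiplication followed by addition) and 96 (two additions). Each start node explores at most $l$ levels of successors, in line with the $O(n^l)$ bound quoted above.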
For the AR-lattice filter benchmark [19] and the VDP100 module library [8], the algorithm returns the set $D_{ch} = \{211, 96\}$, corresponding to the operation sequences $\{\times, +\}$ and $\{+, +\}$.

[...]

However, this set of candidate clock lengths can be pruned even further, because many of these candidate clock lengths will not actually support a type II chain. For example, consider the clock length of 55 ns, which is too long to multi-cycle an addition, and thus can only multi-cycle a multiplication. However, the slack left after a multi-cycle multiplication is $55 \lceil 163/55 \rceil - 163 = 2$ ns, which is less than $d_{add}$, so the chain cannot be completed and 55 should be removed from $CK'$. Applying this process further to the AR-lattice filter example gives the minimal set of candidate clock lengths to explore as $CK' = \{106, 71, 53\}$.

[...]

Since the additional delays are not available at the beginning of the design process, most of the previous work in high-level synthesis has concentrated solely on the functional unit delay ($\mathrm{delay}(k)$), or possibly the functional unit and register delays, arguing that the interconnect and controller delays cannot be accurately determined before scheduling. However, even rough estimates of those additional delays, determined from a previous iteration of the design process and treated as constants in the current iteration, can sometimes be exploited to produce a better schedule.

Footnote 5: Note that we do not consider type IV chaining, in which all operations in the chain are multicycled, since this is not really chaining: one operation is not completed within the wasted slack left by another.

Footnote 6: Note that we only consider $D_{ch}$, rather than $div(D_{ch})$, since type I chaining does not allow chains to be scheduled over multiple cycles.
The additional controller-related delays can be accurately determined after logic synthesis [22], or estimated from an RT-level design [23], and then used in the current design iteration. For example, the controller delays determined in the previous iteration can be used in the current iteration [24], or the system can attempt to predict the incremental change over the previous delays (the work presented in [25] is a first step in this direction). Similarly, the interconnect delays can be determined from the maximum value in the previous iteration [24], or from models of the layout tools [26], and the register delays can be treated as constants, or measured more accurately in conjunction with a detailed retiming model [27].
Once the controller and interconnect related delays have been determined, the new values of $d_k$ and $d_{ch}$ can be used with the techniques described in Sections III-A and III-B to more accurately calculate the candidate clock lengths. For pipelined controllers, care must be taken to prevent chaining over conditional statements.
IV. Optimally Solving the Scheduling Problem
In high-level synthesis, the basic scheduling problem is the problem of determining the control step in which each operation will execute. After a careful formal analysis of the scheduling problem [9], we were able to develop well-structured formulations of those problems, in particular the TRCS problem. We began by characterizing the set of feasible schedules in terms of assignment, precedence, and resource constraints. We then used polyhedral theory to analyze these constraints to determine the structure of the corresponding scheduling polytope, and we proved that our precedence constraints lead to the tightest possible description of that polytope. Finally, once this analysis was complete, that structure was exploited to develop a provably well-structured Integer Linear Programming (ILP) formulation of the TRCS problem.
This section briefly introduces Voyager's ILP formulation of the scheduling problem, and describes the modifications necessary to support chaining [28]. Most previous ILP formulations have considered only type I chaining [29], [30], or a combination of types I and II [31]. An ILP formulation that supports all three types of chaining is presented in [32], but the encompassing methodology does not allow for multicycling of non-chained operations due to clock length restrictions in the ILP formulation. In contrast, this section describes how all three types of chaining can be incorporated into our ILP formulation while still allowing multicycling of non-chained operations. For a more detailed description of this formulation in the absence of chaining, see [9].
A. ILP Formulation of the Scheduling Problem
Voyager's formulation of the TRCS problem can be summarized in terms of three sets of constraints: (1) assignment constraints (A), which ensure that each operation is scheduled onto exactly one cstep; (2) precedence constraints (P), which ensure that each operation is always scheduled after all of its predecessors; and (3) resource constraints (R), which ensure that the schedule does not use more than the available number of functional units of each type. The TRCS problem is the problem of determining whether or not a feasible schedule exists that satisfies these constraints, and can be written succinctly as

$\min \{\, 0^T x \mid M_a x = 1,\; M_t x \le 1,\; M_r x \le m,\; x \text{ integer} \,\}$

where $0$ is a vector of zeros, and $M_a$, $M_t$, and $M_r$ are the coefficient matrices due to the assignment constraints, precedence constraints, and resource constraints, respectively.
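The three constraint classes can be illustrated by a small checker that tests whether a given schedule satisfies (A), (P), and (R). This is only a sketch of what the constraints mean, not the ILP formulation itself: the function name `check_schedule` and the data layout are hypothetical, chaining is not modeled, and functional units are assumed to be non-pipelined, so a multi-cycle operation occupies its unit for every cstep it spans.

```python
from collections import Counter

def check_schedule(start, ops, preds, cycles, limits, n_steps):
    """start: {op: start cstep}; ops: {op: type}; preds: {op: [predecessors]};
    cycles: {type: csteps per operation}; limits: {type: available units};
    n_steps: time constraint in csteps."""
    # (A) assignment: every operation has exactly one start step, and the
    # operation finishes within the time constraint.
    if set(start) != set(ops):
        return False
    if any(s < 0 or s + cycles[ops[o]] > n_steps for o, s in start.items()):
        return False
    # (P) precedence: an operation starts only after each predecessor finishes.
    for o, ps in preds.items():
        for p in ps:
            if start[o] < start[p] + cycles[ops[p]]:
                return False
    # (R) resources: per type and cstep, busy units must not exceed the limit.
    busy = Counter()
    for o, s in start.items():
        for step in range(s, s + cycles[ops[o]]):
            busy[(ops[o], step)] += 1
    return all(n <= limits[t] for (t, _), n in busy.items())

ops = {"m1": "mul", "m2": "mul", "a1": "add"}
preds = {"a1": ["m1", "m2"]}
cycles = {"mul": 2, "add": 1}
ok = check_schedule({"m1": 0, "m2": 0, "a1": 2}, ops, preds, cycles,
                    {"mul": 2, "add": 1}, 3)
```

The same schedule fails the check when only one multiplier is available (R is violated) or when the addition is moved to cstep 1 (P is violated).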
The TCS and RCS problems can be defined similarly. Since the RCS problem minimizes the number of control steps, a sink operation $o_d$ is introduced, and the formulation ensures that it is scheduled in the last control step by making it the successor of all operations that had no successors in the original DFG. The total number of control steps can then be computed as [...]

To support chaining in these formulations, the precedence constraints described above must be modified. Section IV-B first discusses some modifications to the schedule intervals that are necessary, and then Section IV-C presents the new precedence constraints.

[...]

However, these ASAP and ALAP times must be determined differently when chaining is allowed. This section discusses only the modifications necessary to determine the ASAP values when chaining is supported; the modifications for the ALAP values are analogous.
In the case of type I chaining, the ASAP cstep for each operation can be determined by placing as many operations as the maximum chain length will allow into the current cstep, while ensuring that the sum of the delays of these chained operations does not exceed the chosen clock.
However, for type II or type III chaining, the situation becomes more complex due to the interplay between the maximum chain length and the delays of chained operations. Given a DFG, consider an operation sequence $i \rightarrow j \rightarrow k$ in which $i$ and $j$ could be chained, as could $j$ and $k$; the ASAP time for $k$ would then be calculated as: [...]
C. Modifying Precedence Constraints to Support Chaining
Once the new schedule intervals have been determined, the precedence constraints can be modified to support all three types of chaining, as described in this section. However, during scheduling, time is measured in discrete control steps, or clocks, rather than in continuous time.
In particular, when a clock of length $c$ is used without chaining, the previous continuous-time relation is replaced with the following discrete-time relation:

$s(j) \ge s(i) + \lceil d_i / c \rceil$

where $s(i)$ and $s(j)$ denote the control steps in which $o_i$ and $o_j$ respectively start execution. Note that $s(j) \ge s(i) + 1$, which is consistent with the assumption that $o_j$ cannot be chained with $o_i$.
When chaining is supported, the above discrete-time relation is modified to

$s(j) \ge s(i) + \lfloor d_i / c \rfloor \qquad (1)$

to allow $o_j$ to begin in the same control step as the one in which $o_i$ finishes execution. In other words, from each operation $o_i$, arcs $a_{ik}$ must be added to its nearest successors $o_k$ that cannot be chained with it. Determining whether $o_k$ can be chained with $o_i$ depends on the type of chaining and the maximum chain length, so the decision to add arc $a_{ik}$ is also affected by the type of chaining and the maximum chain length.
After adding the new arcs for chaining, we use the modified discrete-time relation (1) to generate the precedence constraints in a similar manner as the non-chained precedence constraints. A detailed description of the non-chained precedence constraints can be found in [9].
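The effect of relation (1) on the earliest start step of a successor can be illustrated with a short sketch. The chained case uses the floor form of relation (1); the non-chained case assumes the corresponding ceiling form, $s(j) \ge s(i) + \lceil d_i / c \rceil$, which is an assumption consistent with the remark that $s(j) \ge s(i) + 1$ without chaining. The function name is hypothetical.

```python
from math import ceil, floor

def earliest_successor_step(s_i, d_i, c, chaining):
    """Smallest control step in which a successor o_j may start, given that
    o_i starts in step s_i, has execution delay d_i, and the clock length is c.
    With chaining this bound is necessary but not sufficient: the slack left
    in that step must also accommodate the chained operation."""
    steps = floor(d_i / c) if chaining else ceil(d_i / c)
    return s_i + steps

# Delays from the text: addition 48 ns, multiplication 163 ns; clock 55 ns.
add_no_chain = earliest_successor_step(0, 48, 55, chaining=False)
add_chain = earliest_successor_step(0, 48, 55, chaining=True)
mul_no_chain = earliest_successor_step(0, 163, 55, chaining=False)
mul_chain = earliest_successor_step(0, 163, 55, chaining=True)
```

With a 55 ns clock, a successor of an addition may start in step 1 without chaining and in step 0 with chaining; a successor of a three-cycle multiplication may start in step 3 without chaining and in step 2, the step in which the multiplication finishes, with chaining.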
V. Bounding the Number of Functional Units to Solve TCS More Efficiently
As discussed earlier in Section II, it is important to generate tight lower bounds on the number of functional units (FUs) of each type, so that those bounds can be used as resource constraints to convert the TCS problem into an easier-to-solve TRCS problem. Furthermore, these bounds must be computed efficiently.
This FU lower-bounding problem can be viewed as a relaxation of the FU minimization problem. While many different FU lower-bounding problems can be formed by relaxing the minimization problem in different ways, most are [...]

A different approach, described more formally in [11], is used in Voyager. This approach starts with an ILP formulation of the FU minimization problem (minimize the number $m_k$ of FUs of type $k \in K$). The problem is then relaxed to a generic description of an entire class of FU lower-bounding problems (the problems above are special cases of this generic class). From this class, the FU lower-bounding problem that produces the tightest possible bound is selected and solved. This approach formalizes an entire class of FU lower-bounding problems and is guaranteed to produce the tightest possible bound in that class; this bound was verified to be exact in most of our experiments.
Furthermore, the solution to this FU lower-bounding problem can be found in polynomial time by solving at most two LPs, even though the original formulation was an ILP formulation. Such an LP-based relaxation is chosen because we want bounds that are as tight as possible, to increase the efficiency of solving the TCS problem. However, Voyager also has a suite of more efficient heuristic algorithms [10] that may produce less accurate bounds and are suitable for a quick first pass over the design space.
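For intuition only, the simplest member of this family of relaxations drops every precedence constraint and just counts how many control steps each FU type must be busy. The sketch below is that counting bound, not the LP-based bound used in Voyager. It assumes non-pipelined multi-cycle units, a control-step budget of $\lfloor T / c \rfloor$, and a hypothetical operation mix; the function name is illustrative.

```python
import math

def counting_fu_lower_bounds(op_counts, delays, clock_ns, time_constraint_ns):
    """Lower bound on FUs per type when all precedence constraints are ignored."""
    steps = time_constraint_ns // clock_ns     # control steps available
    bounds = {}
    for fu_type, count in op_counts.items():
        # Each operation keeps one unit busy for ceil(d / c) control steps.
        busy_steps = count * math.ceil(delays[fu_type] / clock_ns)
        bounds[fu_type] = math.ceil(busy_steps / steps)
    return bounds

# Hypothetical mix of 16 multiplications and 12 additions (not taken from the tables).
bounds = counting_fu_lower_bounds(
    op_counts={"mult": 16, "add": 12},
    delays={"mult": 163, "add": 48},
    clock_ns=82,
    time_constraint_ns=902,
)
```

Because it ignores the DFG structure, this bound can be much looser than the tightest bound in the class described above, which is why it is suitable only as a quick first estimate.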
VI. Bounding the Length of the Schedule to Solve RCS More Efficiently
This section presents a method of generating a tight lower bound on the schedule length, so that the RCS problem can be solved more efficiently. The method is similar in principle to the method presented in the previous section for solving the FU lower-bounding problem.
One early formulation of the schedule-length lower-bounding problem in the presence of resource constraints is presented in [19]; however, the bounds produced by that approach are very loose. More recent algorithms that produce tighter bounds are those

In much the same manner as FU lower-bounding, the ILP formulation of the schedule-length minimization problem can be relaxed to a generic description of an entire class of schedule-length lower-bounding problems (the problems above are special cases of this generic class). From this class, the lower-bounding problem that produces the tightest possible bound is chosen and solved. This approach formalizes an entire class of schedule-length lower-bounding problems, and is guaranteed to produce the tightest bound of all possible precedence relaxations in polynomial time.

[8], giving a datapath delay of 48ns for addition, 56ns for subtraction, and 163ns for multiplication. For each benchmark, we performed Time-Constrained 3D Scheduling (TCS-3D) and Resource-Constrained 3D Scheduling (RCS-3D), with and without chaining, using the methodologies presented in Section II.
A. AR Filter
The TCS-3D results for the AR filter are presented in Table II. They show, for each of two time constraints, those clock lengths from the candidate set that lead to a feasible schedule (the other clock lengths lead to infeasible schedules regardless of the number of functional units available).
For a time constraint of 902ns, eight clock lengths (82ns, 55ns, 41ns, 33ns, 28ns, 24ns, 21ns, and 19ns) led to the minimum number of functional units. Of these, the schedule for the 82ns clock ($\lceil d_{mult}/2 \rceil$) requires the fewest control steps (and thus potentially the smallest controller), and so

To find the fastest possible design, the critical path length was used to derive the tightest possible time constraint of 760ns. For this time constraint, only one clock length (24ns) led to a feasible schedule, and thus to the optimal 3D schedule.
The RCS-3D results for the AR filter are presented in Tables III-VI. Some schedule lengths of interest are shown in boldface, and those that were lower-bounded by the RCS-3D methodology are shown in gray along with the lower-bounded schedule length. As described in Section II-B, the TRCS problem was not solved for those clock lengths, since each would result in a schedule that was longer than the shortest schedule found for the previous clock lengths.
In the absence of chaining, schedule lengths are shown for every candidate clock length in Table III. In this table, the fastest schedules correspond to clock lengths of 55ns and, when sufficient resources are available, 24ns. Again, it is interesting to note that neither of these clock lengths is an obvious ad hoc guess (55 is $\lceil d_{mult}/3 \rceil$, and 24 is both $\lceil d_{mult}/7 \rceil$ and $\lceil d_{add}/2 \rceil$), which means that the fastest schedule might be missed using more conventional methodologies. Furthermore, although the clock length of 24ns would correspond to a larger number of control steps (and perhaps a larger controller), that small clock length does result in the optimal 3D schedule, because the smaller clock lengths tend to reduce the operation slack.
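The clock lengths discussed above all have the form $\lceil d / k \rceil$ for some operation delay $d$ and integer $k$. The following sketch enumerates every value of that form down to a cutoff. It is an illustration under assumptions: the function name and the `min_clock_ns` cutoff are hypothetical, and the minimal candidate set computed by Voyager may prune this list further.

```python
import math

def ceiling_clock_candidates(delays_ns, min_clock_ns):
    """All values ceil(d / k), k = 1, 2, ..., that are at least min_clock_ns."""
    candidates = set()
    for d in delays_ns.values():
        for k in range(1, d + 1):
            clock = math.ceil(d / k)
            if clock < min_clock_ns:
                break                      # larger k only gives smaller clocks
            candidates.add(clock)
    return sorted(candidates, reverse=True)

# Datapath delays reported for the experiments; the 19ns cutoff is an assumption.
delays_ns = {"add": 48, "sub": 56, "mult": 163}
candidates = ceiling_clock_candidates(delays_ns, min_clock_ns=19)
```

For these delays the list contains 82, 55, 41, 33, 28, 24, 21, and 19, the eight clock lengths reported for the 902ns time constraint, along with the three full operation delays.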
Similarly, in the presence of chaining, schedule lengths are shown for every candidate clock length in Tables IV-VI. For a given clock length (such as 163ns), chaining usually improved the schedule when there were sufficient resources available, but provided no improvement when the number of resources was small. Furthermore, for some clock lengths and resource constraints, one type of chaining provided the largest improvement, while for other clock lengths and resource constraints, another type provided the largest improvement.
Finally, looking at all these results over all candidate clock lengths, note that type II chaining gave the fastest overall chained schedule (742ns), slightly faster than the fastest unchained schedule (744ns), and with half the number of control steps. Type III chaining's performance was poorer (770ns), but at least tied with the unchained schedule at that same clock length (55ns). The best schedule from type I chaining (the "standard" form of chaining), however, was considerably worse than the best unchained schedule (844ns vs. 744ns).
B. Elliptic Wave Filter (EWF)
The TCS-3D results for the EWF are presented in Table VII. Again, they show, for each of two time constraints, those clock lengths from the candidate set that lead to a feasible schedule.
For a time constraint of 1394ns, three clock lengths (55ns, 48ns, and 24ns) led to the minimum number of functional units. Of these, the schedule for the 55ns clock ($\lceil d_{mult}/3 \rceil$) requires the fewest control steps (and thus potentially a smaller controller), and so would be preferable. Note that this is a different clock length than the one chosen for the AR filter, illustrating the importance of taking the structure of the DFG into account.
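The tie-break on control steps can be checked with simple arithmetic. The sketch below assumes that a time constraint $T$ and clock length $c$ allow at most $\lfloor T / c \rfloor$ control steps. This budget is only an upper bound on the number of control steps in the reported schedules, and the function name is hypothetical.

```python
def control_step_budget(time_constraint_ns, clock_ns):
    """Largest number of control steps that fits within the time constraint."""
    return time_constraint_ns // clock_ns

# The three clock lengths that reached the minimum number of FUs for the EWF.
budgets = {clock: control_step_budget(1394, clock) for clock in (55, 48, 24)}
preferred_clock = min(budgets, key=budgets.get)   # fewest control steps
```

Under this assumption the 55ns clock allows 25 control steps, against 29 for 48ns and 58 for 24ns, which is consistent with preferring the 55ns schedule.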
To find the fastest possible design, the critical path length was used to derive the tightest possible time constraint of 1035ns. For this time constraint, only one clock length (24ns) led to a feasible schedule, and thus to the optimal 3D schedule.
The RCS-3D results for the EWF are shown in Tables VIII-XI. In the absence of chaining, the 24ns and 55ns clock lengths correspond to the fastest schedules. Again, for a given clock length, chaining usually improved the schedule when there were sufficient resources available. However, neither form of chaining was able to find an overall faster schedule than the fastest unchained schedule.
C. Discrete Cosine Transform (DCT)
The TCS-3D results for the DCT are presented in Table XII. The first set of results is for a time constraint of 500ns, corresponding to a design that will run at 2MHz. Eight clock lengths produced feasible schedules, but only one (24ns) led to the minimum number of functional units. To find the fastest possible design, the critical path length was used to derive the tightest possible time constraint of 434ns, and only one clock length (24ns) led to a feasible schedule and thus to the optimal 3D schedule.
The RCS-3D results for the DCT are presented in Table XIII. In the absence of chaining, the 56ns clock length ($d_{sub}$) corresponds to the fastest schedule. This time, not only could type I chaining not find an overall faster schedule than the fastest unchained schedule, but it could not even improve the schedule for a given clock length over the unchained schedule, probably due to the severe resource constraints.
D. Methodology Run Times
Voyager's design space exploration methodology consists of three main tasks: computing the minimal set of candidate clock lengths, computing tight bounds on the number of functional units or on the schedule length, and solving the TRCS problem. The minimal set of candidate clock lengths can be computed quickly, and the bounds can be computed by solving at most two linear programs in polynomial time, as discussed in Sections V and VI. Finally, the TRCS formulation used in Voyager is well-structured, meaning that it converges on the optimal solution faster than an arbitrary formulation.
To motivate the need for solving the TCS or RCS problem by first computing bounds and then solving the resulting TRCS problem, consider the result of solving the TCS problem directly for a time constraint of 1394ns and a 24ns clock on the EWF benchmark. Even with a well-structured formulation such as Voyager's, solving this problem directly took over an hour of CPU time (using LINDO on a Sun SPARCstation 2). In contrast, we spent only 1.51 sec to compute the lower bounds on the number of functional units, and only 7.75 sec to solve the TRCS problem, solving the same problem in two orders of magnitude less time! On a larger benchmark, the DCT, for a time constraint of 500ns and a 24ns clock, we spent 8.28 sec to compute the lower bounds on the number of functional units, and 2.62 sec to solve the TRCS problem. Again, directly solving the TCS problem for this case took over an hour.
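The size of this speedup follows directly from the reported times. The short calculation below treats "over an hour" as exactly 3600 seconds, which is an assumption that makes the resulting ratios lower bounds; the variable names are illustrative.

```python
DIRECT_TCS_SECONDS = 3600.0          # "over an hour": assumed lower bound

# Bound computation time plus TRCS solution time, in seconds, as reported above.
ewf_total = 1.51 + 7.75              # EWF, 1394ns constraint, 24ns clock
dct_total = 8.28 + 2.62              # DCT, 500ns constraint, 24ns clock

ewf_speedup = DIRECT_TCS_SECONDS / ewf_total
dct_speedup = DIRECT_TCS_SECONDS / dct_total
```

Both ratios exceed 300, so the two-step approach is more than two orders of magnitude faster than the direct solution in both cases.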
In general, the best designs for each example were generated within seconds. However, for very small clock lengths (e.g. 19ns), the ILP for the TRCS problem becomes quite large, and in some cases would have taken hours to find the exact solution. Fortunately, even in those cases the bounds were produced fairly quickly, and could often obviate the need to solve the TRCS problem for those clock lengths as described in Sections II-A and II-B.
VIII. Summary and Future Work
This paper has defined a new problem, the 3D scheduling problem, and has presented an exact solution methodology to solve that problem without resorting to a time-consuming exhaustive search. This solution methodology is exact: it is guaranteed to find the optimal clock length and schedule. Furthermore, it is efficient: it prunes inferior points in the design space through a careful selection of candidate clock lengths (an important design parameter too often determined by guesswork or estimates), and through tight bounds on the number of functional units or the length of the schedule. It can optimally solve medium-sized problems in seconds, as opposed to more conventional techniques that might require hours. Thus it eliminates the
