In this paper we examine the multi-criteria optimization involved in scheduling for data path synthesis (DPS). The criteria we examine are the area cost of the components and the schedule time. Scheduling for DPS is a well-known NP-complete problem. We present a method to find non-dominated schedules using a combination of restricted search and heuristic scheduling techniques. Our method supports design with architectural constraints such as the total number of functional units, buses, etc. The schedules produced have been taken to completion using GABIND [1] and the results are promising.
Introduction
Data path synthesis (DPS) involves first the scheduling of operations, and then the allocation and binding of abstract design entities to their physical counterparts. At the end of DPS we are required to find one or more "optimized" implementations for a design problem input to the synthesis system. The objective of optimization in this situation is multi-fold in the sense that we seek to optimize not only the area cost estimate of the data path but also its performance, measured as a function of the length of the schedule. This makes the synthesis problem a multi-criteria optimization problem. In the present paper we consider the scheduling problem, and the criteria being considered are area (estimated through the cost of individual components) and performance of the final design.
A feature of most multi-criteria optimization problems is that the criteria are often non-commensurate and sometimes conflicting. It is therefore difficult to combine the criteria into a single cost function. We take the approach of representing the cost of a design as a tuple of costs of the individual objectives. This is similar to the approach taken in Stewart et al. [2]. One cost tuple is said to be better than another distinct tuple if the cost of each criterion of the first tuple is no worse than the corresponding cost of the other tuple. A design whose cost tuple is better than that of another design is said to dominate that design. The global problem of optimization is to find the set of designs which are not dominated by any other design. The set of feasible designs satisfying the design parameters constitutes the design space. Each design point in the design space corresponds to an estimate of hardware requirement and performance computed as a function of the schedule time. Thus an algorithm for DPS needs to consider techniques not only for scheduling and allocation but also for a systematic exploration of the design space to locate these non-dominated designs. The starting point of design space exploration often revolves around the basic scheduling problem. We do design space exploration (DSE) using a combination of controlled search and heuristic scheduling techniques.
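The dominance relation defined above can be sketched in a few lines. This is only an illustration under stated assumptions: costs are tuples such as (area cost, schedule time) in which lower is better, and the names `dominates` and `non_dominated` are hypothetical, not part of the tool described in this paper.

```python
from typing import Iterable, List, Tuple

# Assumption: a cost tuple is e.g. (area_cost, schedule_time); lower is better.
Cost = Tuple[float, ...]


def dominates(a: Cost, b: Cost) -> bool:
    """True if a is distinct from b and no worse than b on every criterion."""
    return a != b and all(x <= y for x, y in zip(a, b))


def non_dominated(points: Iterable[Cost]) -> List[Cost]:
    """Return the points that are not dominated by any other point."""
    unique = list(dict.fromkeys(points))  # drop exact duplicates, keep order
    return [p for p in unique if not any(dominates(q, p) for q in unique)]
```

For instance, among the tuples (10, 4), (8, 6), (12, 4), (8, 7) and (6, 9), the tuple (12, 4) is dominated by (10, 4) and (8, 7) by (8, 6), so three non-dominated points remain.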
Conventional scheduling algorithms require a time constraint or a specification of the available FUs. In a practical DPS situation neither the appropriate time constraint nor the appropriate FU requirement will be known in advance. Through DSE we systematically explore several combinations of time constraints and hardware resource configurations that are feasible. We use the concept of multi-criteria optimization and arrive at several configurations with different performance and FU requirement estimates. We employ a multi-objective search approach to perform design space exploration and scheduling. In our scheme we have a state space generation mechanism coupled with an estimator for obtaining various <hardware cost, performance> estimates. A controlled depth first branch and bound is used to determine the hardware cost estimate and produce a partial schedule for a given time constraint. This actually corresponds to a localized exact or near exact exploration of a region of the entire design space. In order to contain the combinatorial explosion, the computational effort to be spent on DSE can be controlled by certain parameters. While designing these tools we also permit the designer to impose design parameters and then examine the design space for possible designs which satisfy these parameters. We have chosen these parameters to reflect some important architectural aspects, such as the number of buses, the number of FU sites, the number of storage access points, etc., over which the designer may wish to have some control.
At the heart of the DSE mechanism is the controlled search based Resource Estimation and Partial Scheduling (REPS) algorithm. The basic DSE technique makes use of the REPS algorithm to estimate the hardware requirement, as tightly as possible, so that the design parameters are also satisfied. REPS also returns a partial or complete schedule depending on the situation. This way the design points are computed. Scheduling, however, is an NP-hard problem [3] and for large problem instances it may be necessary to settle for a restricted search. In this case the design points obtained are approximate (lower bounds) and the schedules may be partial in the sense that the degree of freedom of some operation may still be more than one. To meet this situation a local DSE mechanism has been developed to explore the neighborhood of such a design point to obtain one or more non-dominated design points for which feasible schedules will exist. The local DSE mechanism also produces a feasible schedule for each design point that it returns. Such schedules are obtained using existing scheduling techniques which perform scheduling using the precedence constraints and sometimes the available hardware resources.
In the following we present details of our solution to the problem of design space exploration (DSE) to generate a set of schedules which will represent non-dominated designs. The inputs for design space exploration are explained in the next section. The estimates used by REPS for hardware cost and schedule time are discussed in section 3. REPS itself is presented in section 4. The overall DSE mechanism (which uses REPS) is then explained in section 5. The experimental results are presented in section 6. We conclude in section 7.
Inputs to DSE

Operation Precedences
This is the most important input to the DSE algorithm. For practical design examples there will be a number of basic blocks (b.b.), and for each b.b. there will be precedence constraints on the operations in that b.b. The precedence constraints between operations are restricted to be partial orders. This is not a severe restriction because a sequence of operations in a basic block gives rise to such constraints [4]. Each type of operation is also assigned an execution time, to indicate the number of time steps over which the operation will execute. The execution time of an operation is determined by the speed of the hardware implementation of that type of operation.
Design Parameters
In order to explore the designs which are possible for a given behavioural specification in reasonable time and in a structured manner, it is desirable to guide the design process with some user specified parameters. These parameters should be simple and easily visualizable by the design engineer. Our design space exploration (DSE) scheme uses the following parameters.
NFUS This indicates the number of sites where hardware operators will be clustered. However, FUs need not be formed during scheduling. No two hardware operators at the same site may receive inputs or deliver outputs in the same time step. The hardware operators in the FUs are in this sense mutually exclusive. Clearly, NFUS simple operations may be performed in the same time step. The clustering has been done to facilitate the optimizations at the time of physical design. NFUS has an effect on the controller cost: a small value of NFUS compared to the total number of operation units will permit significant optimizations in the design of the controller, while a comparable value will permit little redundancy and therefore less optimization. NFUS constrains the scheduling algorithm, which has to ensure that no more than NFUS simultaneous operations take place in a time step.
NBUS This is the maximum number of logically distinct buses in the system. Presently, each communication path between units is abstracted as a bus. In our model of implementation two or more units are connected to each main bus. The connection may be switched or direct.
NVREF This is the maximum number of distinct variable references permitted in any time step. Reading and writing to a variable are considered distinct accesses. Variables that are simultaneously accessed cannot be packed into the same single port memory. This parameter is used to keep a check on the number of simultaneous variable accesses. Though the above parameters are independent they are well correlated. It may be expected that $\text{NBUS} \approx 3\,\text{NFUS}$ and $\text{NVREF} \approx 3\,\text{NFUS}$. The exact equality may not hold for several reasons. In some cases the value generated by an operation could have to be stored in more than one storage location. Such a situation is depicted in figure 1 where the + operation is annotated with two labels, indicating that the output of this operation would have to be transferred to the locations where these variables have been mapped. In general, such variables will not be mapped to the same location, and the transfers will be distinct. For specifications where additional transfers are frequent, a slightly higher value of NBUS and NVREF than $3\,\text{NFUS}$ could be desirable.
Measures for DSE
We shall often have to consider a region of the design space and efficiently determine the design point or points in this region which will be feasible and worth retaining. In general we shall have to resort to scheduling to answer this question. However, scheduling could be computationally intensive. We shall, therefore, rely on heuristic measures not only to aid scheduling but also to arrive at our decision as early as possible. Therefore, estimates which try to compute area by actual floor planning schemes are ruled out at this stage. We shall first indicate the type of measures that we would like to compute and then suggest some basic methods to compute them.
Estimates of Hardware Requirement
We would like to estimate the hardware requirement to check for the feasibility of a design region and to locate feasible and possibly non-dominated design points. We would, therefore, like to estimate the following: i) hardware operators, ii) storage elements, iii) buses and iv) switching elements. The maximum number of operations to be executed in any time step determines the minimum number of FUs that will be required. This should not exceed the number specified as the design parameter. The bus requirement also needs to be determined to ensure that the other design parameter is not violated. For the purpose of DSE it is desirable to estimate these as accurately as possible. However, at this early stage of design it is difficult to have reliable estimates of all the three types of RTL components mentioned above. Among these, it is easiest to estimate the requirement of hardware operators. It is also possible to estimate the storage requirement before scheduling has been done [5], but this estimate is, relatively, less reliable. It is most difficult to estimate the switch requirement before scheduling has been performed. After scheduling, the hardware operator requirement and the storage requirement can be estimated more accurately. For straight-line code the minimum storage cost can be easily obtained using the left edge algorithm [6]. At this stage a better estimate of the switch requirement can also be obtained for a point to point interconnection scheme. For a bus based interconnection scheme a reasonable estimate of switch requirement can be obtained after transfers have been mapped to buses [7]. Thus, while working with incomplete schedules it may be computationally inefficient to include the switch cost. When scheduling is complete the design can be evaluated more accurately by using better estimates of storage, switch and hardware operator cost.
In the next subsection we indicate the computation of lower bound estimates for specific hardware operators, the total number of operations in any time step and the bus requirement.
Estimators for DSE
Estimation of Resources for Specific Operations
Given a DAG, we would like to estimate the number of hardware operators of each kind needed to schedule the DAG in (say) $n$ time steps. This estimate can be obtained as a lower bound. The method of determining the lower bound is similar to the techniques proposed in [8, 9]. We first introduce the notion of a window, which we shall use to compute the estimates. A contiguous sequence of time steps is referred to as a window. Given a DAG to be scheduled in $n$ time steps, there can be $n$ windows of size one, $n - 1$ windows of size two, ..., and one window of size $n$. Thus there can be a total of $n(n+1)/2$ windows with $n$ time steps. For determining the estimate it is necessary to determine the earliest and latest times at which each operation in the DAG (of a b.b.) may be scheduled. These are most conveniently determined from the ASAP and ALAP schedules, where the operations are scheduled as early or as late as possible, respectively.
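The ASAP and ALAP times and the enumeration of windows can be sketched as follows. This is a minimal illustration, not the implementation used in our tool: the DAG is assumed to be given as a successor map `succ`, execution times as `delay`, time steps are numbered from 1, and the names `asap_alap` and `windows` are hypothetical.

```python
from typing import Dict, List, Tuple


def asap_alap(succ: Dict[str, List[str]], delay: Dict[str, int], n: int
              ) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Earliest and latest start steps (1-based) for a schedule of n steps."""
    pred: Dict[str, List[str]] = {o: [] for o in succ}
    for o, children in succ.items():
        for c in children:
            pred[c].append(o)
    # Topological order (Kahn's algorithm); the list grows while it is scanned.
    indeg = {o: len(pred[o]) for o in succ}
    order = [o for o in succ if indeg[o] == 0]
    for o in order:
        for c in succ[o]:
            indeg[c] -= 1
            if indeg[c] == 0:
                order.append(c)
    asap: Dict[str, int] = {}
    alap: Dict[str, int] = {}
    for o in order:  # as early as possible: after all predecessors finish
        asap[o] = max((asap[p] + delay[p] for p in pred[o]), default=1)
    for o in reversed(order):  # as late as possible: finish before successors
        alap[o] = min((alap[c] for c in succ[o]), default=n + 1) - delay[o]
    if any(asap[o] > alap[o] for o in order):
        raise ValueError("n is shorter than the critical path")
    return asap, alap


def windows(n: int) -> List[Tuple[int, int]]:
    """All (start step i, size j) windows inside steps 1..n."""
    return [(i, j) for j in range(1, n + 1) for i in range(1, n - j + 2)]
```

For a schedule of four time steps, `windows(4)` enumerates ten windows, in agreement with the count $n(n+1)/2$.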
The construction of the lower bound is now explained. First consider any window in the given DAG of $j$, $j \le n$, steps and starting at time step $i$, $1 \le i \le n$. Consider any operation $o$ in the DAG; let the earliest time step where it can be scheduled be $t_{a,o}$ and the latest time step where it can be scheduled be $t_{l,o}$. If $t_{a,o} \ge i$ and $t_{l,o} < i + j$, then in each and every possible schedule of the DAG $o$ must lie in the aforesaid window. Let the operation $o$ be of type $x$. Let there be a total of $m$ operations of type $x$ restricted, in the same manner, to lie in this window; then at least $l^r_{x,i,j} = \lceil m/j \rceil$ hardware operators are needed to realize the operations of type $x$ in the DAG. Let $l^r_x = \max_{i,j} l^r_{x,i,j}$, $1 \le i \le n$, $1 \le j \le n - i + 1$. Then $l^r_x$ too is a lower bound on the number of hardware operators for operations of type $x$. Similarly let $L^r_x$ be the maximum of $l^r_x$ over all the DAG's. This too is a lower bound. This is the principle that has been used to derive the l.b.s on the number of operation units of type $x$ for the design. If $C_x$ is the cost per unit for an operator of type $x$ then the estimate of the resource cost is defined as $\sum_x L^r_x C_x$.
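The window-based bound can be illustrated with a short sketch. It assumes unit-time operations and that the ASAP and ALAP steps are already available as dictionaries; the function names `operator_lower_bound` and `resource_cost` are hypothetical and chosen only for this example.

```python
from math import ceil
from typing import Dict, List


def operator_lower_bound(asap: Dict[str, int], alap: Dict[str, int],
                         op_type: Dict[str, str], n: int) -> Dict[str, int]:
    """l^r_x per type x: maximum over all windows (i, j) of ceil(m / j)."""
    bound: Dict[str, int] = {}
    for i in range(1, n + 1):              # window start step
        for j in range(1, n - i + 2):      # window size, window ends by step n
            count: Dict[str, int] = {}
            for o in asap:
                # o is confined to the window in every feasible schedule
                if asap[o] >= i and alap[o] < i + j:
                    count[op_type[o]] = count.get(op_type[o], 0) + 1
            for x, m in count.items():
                bound[x] = max(bound.get(x, 0), ceil(m / j))
    return bound


def resource_cost(bounds_per_dag: List[Dict[str, int]],
                  unit_cost: Dict[str, int]) -> int:
    """Sum over types x of L^r_x * C_x, with L^r_x the maximum over all DAGs."""
    types = set().union(*bounds_per_dag)
    return sum(unit_cost[x] * max(b.get(x, 0) for b in bounds_per_dag)
               for x in types)
```

As a small check, take three additions and one subtraction to be scheduled in two steps, with two additions fixed to step 1 and one addition fixed to step 2. The window of size one at step 1 forces two adders, so the bound is two adders and one subtracter.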
Estimation on the Total Number of Operations per Time Step
This metric is required to ensure that the parameter NFUS is not violated. This metric is found in a manner that is very similar to the method explained above for the previous metric. Only, in this case, no distinction is made between the different types of operations, and all the operations occurring in a window are counted. Therefore, this estimate is also obtained as a lower bound.
Estimation for Buses
The bus requirement is estimated by examining the transfers that take place in various windows. Each operand of an operation contributes to a transfer. Transfers also arise due to variable assignments. As usual we consider the transfers that will be restricted within the window under consideration and then compute the lower bound on the number of concurrent transfers. Common variables which form inputs to operations need to be handled carefully. For the purpose of computing a lower bound, transfers arising from the same variable to operations which are neither ancestors nor descendants of one another may be counted only once; otherwise they may be considered distinct.
Estimation for Variable Accesses
The number of distinct variable accesses is determined by examining the variable accesses that take place in various windows. Each input and output operand of an operation contributes to a variable access. As usual we consider the accesses that will be restricted within the window under consideration to compute the lower bound. Input operands named by the same variable need careful handling. As in the l.b. determination for buses, variable accesses by operations which are neither ancestors nor descendants of one another are counted only once; otherwise they may be considered distinct.
While working with a behavioural specification (BS) and the associated parameters, we would like to have an idea of the cost of the various designs that are possible without having to go through those designs in full detail. At this early design stage the hardware cost can be estimated with only a limited accuracy. The schedule time is dependent on the duration of the clock cycle and the number of time steps in the schedule. We will, in general, only be concerned with the number of time steps. However, when the intermediate representation of the BS consists of multiple b.b.'s the effective schedule time of the design needs to be suitably defined. We now examine the estimation of schedule time.
The above estimation methods are applicable to individual DAG's. For multiple DAG's these estimators need to be applied to each of those DAG's. The global l.b. is obtained by merging the individual l.b.'s.
Search Algorithm for REPS
The resource estimation and partial scheduling algorithm uses the estimators described in the previous section to determine the resource cost for a given schedule time. It also returns a complete schedule if required. This requirement is controlled by a threshold. If the threshold W is set to one then it returns a complete schedule. If W > 1 then it returns a partial schedule in the sense that the degrees of freedom (DOF) of all operations are suitably reduced but some may still have non-zero DOF. It is, however, ensured that the DOF of all operations will be less than W. We have found experimentally that these estimators work better for smaller DAG's. Thus the REPS algorithm partitions the DAG, if necessary, into smaller DAG's, applies the estimator to these partitions and combines the estimates for the different partitions to arrive at the final estimate. The schedules of the partitions are combined to return the partial schedule obtained. The REPS algorithm does a systematic search of the problem space using DAG partitioning as the state space decomposition procedure. The details are now explained.
DAG Partitioning
A threshold W on the maximum size DAG for which the estimate will be accepted without further partitioning is specified by the designer. The partitioning scheme involves splitting the $n$ time steps in which to schedule (a partition of) the DAG, if $n > W$, into $\lceil n/W \rceil$ bands, each of at most W time steps. Each operation of the DAG is restricted to lie in only one of these bands. For operations whose ASAP and ALAP times, $t_a$ and $t_l$, lie within a band, nothing needs to be done. For other operations it is necessary to take a decision regarding the band where the operation should be restricted to be scheduled. A poor decision regarding the band where the operation should be placed could give rise to a high and sub-optimal resource cost estimate. A search must, therefore, be conducted on the DAG to take the right set of decisions. We have employed a depth first branch and bound scheme. The process of decomposition continues recursively till none of the partitions of the current DAG is larger than W. It may be noted that if W = 1 then the search would finally produce a complete schedule of the graph, to minimize the resource cost. Such an algorithm would, in general, fail to find a schedule in reasonable time for relatively large problems. By having W > 1 we are able to reduce the amount of search that will be incurred. The resource requirement of a particular type of resource in the design is the maximum requirement of that resource over all the partitions of a particular DAG. We now explain the search mechanism.
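The banding step can be sketched as below. This is an illustration only: the names `split_into_bands` and `classify` are hypothetical, time steps are numbered from 1, and the ASAP and ALAP steps are assumed to be given. Operations that do not fit in a single band are the ones on which the search has to take a decision.

```python
from typing import Dict, List, Tuple


def split_into_bands(n: int, W: int) -> List[Tuple[int, int]]:
    """Split steps 1..n into ceil(n / W) bands of at most W steps each."""
    return [(s, min(s + W - 1, n)) for s in range(1, n + 1, W)]


def classify(asap: Dict[str, int], alap: Dict[str, int],
             bands: List[Tuple[int, int]]
             ) -> Tuple[Dict[str, int], List[str]]:
    """Map each operation to its band index, or list it as undecided."""
    fixed: Dict[str, int] = {}
    undecided: List[str] = []
    for o in asap:
        home = [k for k, (lo, hi) in enumerate(bands)
                if lo <= asap[o] and alap[o] <= hi]
        if home:
            fixed[o] = home[0]      # time frame lies inside one band
        else:
            undecided.append(o)     # time frame crosses a band boundary
    return fixed, undecided
```

With eight time steps and W = 5, two bands covering steps 1 to 5 and 6 to 8 are obtained; an operation whose time frame is steps 4 to 7 is left undecided.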
The Search Scheme
The memory requirement for storing the partial solutions is high. Thus we have chosen depth first branch and bound (DFBB), whose memory requirement is minimal. In the search scheme the partitioned DAG's are treated like separate DAG's. If the number of time steps within which the DAG needs to be scheduled does not exceed W then no more repartitioning is done and the current estimates are accepted. Otherwise, it is split into two smaller DAG's. The splitting is done near the middle so that the two sub-problems generated are of similar size. If there are one or more operations crossing the boundary then all the possibilities of distributing these operations need to be tested. This is where the search comes in. We perform the search by explicit backtracking. In order to keep track of the moves a stack (stack1) is used. For an operation that crosses the partition boundary there are three moves to be made, which are i) it has to be scheduled in the top half, ii) it has to be scheduled in the lower half and iii) its original freedom has to be restored. The first two moves are forward moves, while the third move is there to perform backtracking. The first move is performed right away while the other two moves are pushed onto the stack (stack1), in that order. After making a move the ASAP and ALAP schedules are recomputed. When the move made is of the forward type, the resource estimate is computed. This is a lower bound estimate and may increase as the depth of the search increases. If this estimate exceeds the estimate of the best design found so far, then the current move is rejected and backtracking is initiated. Move rejection followed by backtracking also takes place if the resource estimate after the move is found to be infeasible with respect to the design parameters. Initially there is no solution and so at the beginning a dummy solution of very high cost is assumed. This solution is replaced by the first (partial) feasible solution that is found.
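A much simplified sketch of this backtracking scheme is given below, for a single partition boundary `t`. It is not the REPS implementation: it assumes unit-time operations, ignores the design parameters and the recursive repartitioning, and prunes a move whose bound equals the best cost as well as one that exceeds it, since such a move cannot lead to a strictly better design. All names (`frames`, `bound_cost`, `dfbb_split`, `limit`) are hypothetical.

```python
from math import ceil, inf


def frames(order, pred, succ, n, limit):
    """ASAP/ALAP steps under per-operation (lo, hi) limits, or None."""
    asap, alap = {}, {}
    for o in order:
        asap[o] = max([limit.get(o, (1, n))[0]] + [asap[p] + 1 for p in pred[o]])
    for o in reversed(order):
        alap[o] = min([limit.get(o, (1, n))[1]] + [alap[c] - 1 for c in succ[o]])
    if any(asap[o] > alap[o] for o in order):
        return None                      # the restrictions are infeasible
    return asap, alap


def bound_cost(asap, alap, op_type, n, unit_cost):
    """Cost of the window-based lower bound on the operators of each type."""
    bound = {}
    for i in range(1, n + 1):
        for j in range(1, n - i + 2):
            count = {}
            for o in asap:
                if asap[o] >= i and alap[o] < i + j:
                    count[op_type[o]] = count.get(op_type[o], 0) + 1
            for x, m in count.items():
                bound[x] = max(bound.get(x, 0), ceil(m / j))
    return sum(unit_cost[x] * k for x, k in bound.items())


def dfbb_split(succ, op_type, unit_cost, n, t):
    """Confine every operation to steps 1..t or t+1..n at the least bound cost."""
    pred = {o: [] for o in succ}
    for o, children in succ.items():
        for c in children:
            pred[c].append(o)
    indeg = {o: len(pred[o]) for o in succ}
    order = [o for o in succ if indeg[o] == 0]
    for o in order:                      # topological order, list grows
        for c in succ[o]:
            indeg[c] -= 1
            if indeg[c] == 0:
                order.append(c)
    limit = {}                           # current restrictions on operations
    best_cost, best_limit = inf, None    # dummy solution of very high cost
    stack1 = [("start", None)]
    while stack1:
        kind, o = stack1.pop()
        if kind == "restore":            # backtracking move
            del limit[o]
            continue
        if kind == "top":
            limit[o] = (1, t)
        elif kind == "bottom":
            limit[o] = (t + 1, n)
        fr = frames(order, pred, succ, n, limit)
        if fr is None:
            continue
        asap, alap = fr
        cost = bound_cost(asap, alap, op_type, n, unit_cost)
        if cost >= best_cost:
            continue                     # reject the move
        crossing = [p for p in order if asap[p] <= t < alap[p]]
        if not crossing:                 # partitioning is complete
            best_cost, best_limit = cost, dict(limit)
            continue
        nxt = crossing[0]
        # Popped in the order: top half, lower half, restore freedom.
        stack1.extend([("restore", nxt), ("bottom", nxt), ("top", nxt)])
    return best_cost, best_limit
```

For four independent additions to be scheduled in four steps with the boundary after step 2, the sketch first records a solution needing two adders (all operations in the top half) and later replaces it by one needing a single adder, with two operations in each half.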
Backtracking has been illustrated through example 1.
Example 1 Consider REPS for a hypothetical b.b. containing only plus and minus operations, to be scheduled in eight time steps. Assume that the value of W is five and the current l.b. on the requirement of adders and subtracters for the best solution identified so far is <2, 3>. Partitioning is required and the partition boundary may be taken to be time step 5 3; 7; o > from the stack. In this particular case no more moves need to be made to compute the resource requirement for the b.b., which is now put back at the head of the list. The procedure estimate now returns the computed l.b. for the b.b. It also leaves the list the way it was when it was invoked.
When the list becomes empty, it is assured that the sizes of all the (partitioned) b.b.'s are less than or equal to W. When this condition is satisfied no more partitioning needs to be done, and the set of stacked (stack2) b.b.'s constitutes the partial schedule. If W = 1, then this is also the complete schedule and corresponds to a feasible design point. Otherwise, the design point found is an approximate one. If this point corresponds to a design with a better (lower) resource estimate than that of the best stored design then it replaces that design; otherwise it is rejected and backtracking is initiated. The algorithm terminates with a failure if there exists a partition where the design parameters NFUS and NBUS cannot possibly be satisfied.
It may be noted that if, instead of appending the b.b.'s to the list, they were pushed back at the head, then balancing of block sizes would not have been achieved. The blocks at the head of the list would be partitioned till they became simple and made way for the blocks behind. Thus a lot of time would be spent in refining the blocks at the front of the list, which could go to waste if a block at the end of the list turned out to have an unacceptably high value of l.b. on the resource requirement.
The requirement for each resource is generated in the form of a tuple <m, w, j>, where m is the number of units of that entity occurring in a window of size w in the b.b. j. Such tuples are generated for the maximum number of operations per time step, the bus requirement, the storage access point requirement and the requirement of hardware operators for each type of operation. $\lceil m/w \rceil$ is the l.b. on that resource. Tuples, instead of the resource requirement, are generated because this information is needed by the exploration heuristic used in the DSE tool (section 5.1).
The search mechanism explained above has one anomaly. The problem is that when REPS is run with a relaxed time constraint the search space turns out to be far larger than when REPS is run with a tighter time constraint. This situation is addressed by running an approximate scheduling algorithm on the current b.b. before going ahead to partition it into smaller b.b.'s. Basically, an approximate scheduling algorithm of the time and resource constrained type is required. If a time constrained scheduling algorithm is used then the final resource requirement after scheduling should not exceed the l.b. estimate of the resource requirement, while satisfying the design parameters. If a resource (hardware operators) constrained scheduling algorithm is used then the time steps required should not exceed the available number of time steps. If the approximate scheduling algorithm terminates successfully then the current b.b. may be assumed to satisfy the l.b. on the resource estimate and the remaining b.b.'s may be examined.
The pseudo code for the search scheme is given in figure 4. The first line of the procedure REPS checks whether the block currently being handled is simple. A block is considered to be simple if either i) the block is to be scheduled in no more than W time steps or ii) the block can be scheduled using the approximate algorithm without violating either the current time constraint or the current level of resource estimate. We illustrate the working of REPS through the following example. Example 3 We consider the DAG shown in figure 5 for scheduling in ten time steps. We first note that DAG's of this type pose a difficulty for the hardware l.b. estimator described earlier. If we compute the l.b. for scheduling in nine time steps then the estimator will report a requirement of two adders, whereas three adders will actually be needed.
For illustrating the working of REPS we consider a schedule in ten time steps for the aforementioned DAG. In figure 6 we illustrate the main actions taken by REPS. An inspection of figure 5 reveals that two adders will be required. In the start state (state 0) of figure 3 the correct l.b. is obtained, but the approximate scheduling algorithm fails to schedule using two adders in ten time steps. Hence partitioning of the DAG is required. We choose the fourth time step for partitioning. The time frame of the operation marked 1 in figure 5 spans across this time step. We restrict it to be scheduled on or before time step four (state 1). This decision does not complete the partitioning of the DAG. Partitioning is completed after the operation marked 2 in figure 5 is scheduled after time step four (state 2) in figure 3. The l.b. continues to be three and this time the approximate scheduling algorithm succeeds in finding a schedule without violating the l.b. This case is recorded as the current best solution. Backtracking is initiated. Since the l.b. of the parent state (state 1) matches the current cost, the other child of state 1 is not generated. Backtracking is continued to state 0. Now the other option of scheduling the operation marked 1, above or at time step 4, is exercised, leading to state 3. The l.b. continues to be two and the partitioning is completed by restricting the node marked 3 in figure 5 to on or before time step 4 (state 4). The approximate scheduling algorithm succeeds and a new, better solution is recorded. Backtracking is initiated and continues to the start state, and the algorithm terminates. In each of the dashed boxes the status of the queue is also shown. After partitioning, in state 2 and state 4, the two smaller DAG's are entered into the queue. The schedule of the approximate algorithm which obtained the best solution (for this problem) is accepted as the schedule.
It must be noted, however, that in some cases partial schedules may be returned when W > 1 and the l.b.'s and u.b.'s from solutions obtained by approximate algorithms do not match. is simple multi-cycle and the other is when the implementation is pipelined. We consider only simple arithmetic pipelined implementations because these are the ones that are most commonly used in practice in DPS. When multi-cycle or pipelined operations are present they may sometimes cross partition boundaries. In such situations the computation of the lower bound is a little more involved. Operations that are implemented by multi-cycle hardware operators will be referred to as multi-cycle operations. Similarly, operations having a pipelined implementation will be referred to as pipelined operations.
REPS handles multi-cycle operations in the following manner. Suppose that the time frame of a multi-cycle operation of $k$ time steps crosses the partition boundary set at time $t$. Up to $k$ possibilities need to be examined. These are initiating the operation at times earlier than time step $t - k$, initiating the operation at times $t - k, \ldots, t$, and at times later than $t$. Initiation of the operation at specific times is the additional overhead for handling multi-cycle operations. When more than a single operation crosses the partition boundary, partitioning is initiated with the operation requiring the least number of cycles for its execution. Handling of pipelined operations is as follows. Consider a $p$-stage pipelined implementation, with a stage delay of $d$, of an operation of type $x$. The result of such an operation will be obtained $p - 1$ time steps after initiation.
Therefore, while scheduling, the number of time steps to complete operations of type $x$ should be taken as $p$. The l.b. is obtained as for a multi-cycle operation, the stage delay $d$ being used in place of the number of time steps $k$ of the multi-cycle operation.
Scheme for DSE
We now describe the overall scheme for design space exploration. At the heart of the DSE technique is the resource estimation and partial scheduling algorithm (REPS), which is repeatedly invoked with varying time constraints. The time constraints with which REPS is invoked are determined by the exploration heuristic in section 5.1. With each invocation REPS either indicates that the time constraint is not feasible or it returns a design point and a schedule. The design points thus obtained are used to build the design space. When a new design point is obtained one of three conditions will be true. i) The point is dominated by existing design points. In this case the design point has to be discarded. ii) The point dominates a set of the existing design points. All the dominated points have to be discarded and the new point has to be incorporated in the design space. iii) It neither dominates, nor is it dominated by, other design points. The point is simply incorporated in the design space. In this manner the fastest design requiring maximum hardware, the slowest design requiring minimum hardware and intermediate non-dominated designs are obtained. If the parameter W is set to one then we have a complete schedule and the corresponding hardware requirement. If W > 1 then REPS will generally return a partial schedule and an approximate hardware requirement. In the latter case it is desirable to obtain the complete feasible schedules, which will be needed for performing subsequent allocation and binding. These schedules will have to be obtained using approximate scheduling algorithms. The detailed scheme of obtaining complete schedules from approximate design points is explained in sections 5.2 and 5.3.
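The handling of a newly obtained design point can be sketched as an incremental update of the stored set. This is an illustration under assumptions: a design point is reduced to a tuple (hardware cost, number of time steps) with lower values being better, and `update_design_space` is a hypothetical name.

```python
from typing import List, Tuple

# Assumption: a design point is (hardware_cost, time_steps); lower is better.
Point = Tuple[float, float]


def dominates(a: Point, b: Point) -> bool:
    """True if a is distinct from b and no worse on every criterion."""
    return a != b and all(x <= y for x, y in zip(a, b))


def update_design_space(space: List[Point], new: Point) -> List[Point]:
    """Return the stored non-dominated set after considering `new`."""
    # Condition i): the new point is dominated (or already stored): discard it.
    if new in space or any(dominates(p, new) for p in space):
        return space
    # Conditions ii) and iii): drop whatever the new point dominates, then add it.
    return [p for p in space if not dominates(new, p)] + [new]
```

Starting from the points (10, 4) and (6, 9), adding (8, 6) simply extends the set; adding (9, 4) afterwards removes (10, 4); a later (9, 5) is discarded because (9, 4) dominates it.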
Exploration Heuristic
The resource cost estimation scheme described above requires the number of time steps for each b.b. to be specified. To start with, the number of time steps for each DAG is set to its critical length, and then REPS is invoked. The resulting resource requirements are computed from the tuples, as explained above, and examined. It was mentioned in section 4.2 that the requirement for each hardware resource, or the requirement of FUs, buses, etc., is generated in the form of a tuple <m, w, j>, where m is the number of units of that entity occurring in a window of size w in the b.b. j. In case any of the design parameters is violated a corrective action is taken as follows. Suppose that a design parameter $X$ having the value $v_X$ is violated, i.e. $\lceil m_X / w_X \rceil > v_X$.
Consider the effect of adding $i$, $i > 0$, time steps to the DAG of the b.b. $j_X$. The earliest time $t_{a,o}$ of each operation $o$ remains unaltered, but $t_{l,o}$ goes up by $i$. Therefore, each operation previously restricted to lie in a window of size $w$ will now lie in a window of size $w + i$. A minimal number of time steps $t_X > 0$ is added to $w_X$ so that $\lceil m_X / (w_X + t_X) \rceil \le v_X$. REPS is invoked after making the correction. The DSE retains the set of mutually non-dominating design points that have been found. When a design point, characterized by its time constraints and resource cost estimate, is found to be feasible it is compared with the stored design points. If it is dominated by any point then it is not included in the set. If it dominates any point of the set then it replaces that point. Exploration continues with a new set of constraints, generated as follows. For each operation O whose requirement exceeds unity, we identify the DAG's where it is required maximally. In each of these DAG's we determine the time t by which the time constraint of that DAG should be relaxed so that the new requirement of the operator will be one less. This is the heuristic used to conduct the exploration of the design space. Exploration is terminated when the resource requirements of all the operations become unity.
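The window-widening arithmetic above has a simple closed form, sketched below. The sketch assumes that the requirement implied by a tuple is the ceiling of m/w, as in the violated condition stated earlier, and that m, w and v are positive integers; the function names are hypothetical and chosen only for this illustration.

```python
def ceil_div(a: int, b: int) -> int:
    """Ceiling of a / b for positive integers."""
    return -(-a // b)

def min_extra_steps(m: int, w: int, v: int) -> int:
    """Smallest t >= 0 with ceil(m / (w + t)) <= v (0 if nothing is violated)."""
    if ceil_div(m, w) <= v:
        return 0
    # ceil(m / (w + t)) <= v  <=>  w + t >= m / v  <=>  t >= ceil(m / v) - w
    return ceil_div(m, v) - w

def steps_to_save_one_unit(m: int, w: int) -> int:
    """Smallest t > 0 that lowers the requirement ceil(m / w) by one unit."""
    current = ceil_div(m, w)
    if current <= 1:
        raise ValueError("requirement is already unity")
    return min_extra_steps(m, w, current - 1)

print(min_extra_steps(7, 2, 2))      # 2, since ceil(7/4) = 2 but ceil(7/3) = 3
print(steps_to_save_one_unit(7, 2))  # 1, since ceil(7/3) = 3 < ceil(7/2) = 4
```

The first function corresponds to the corrective action for a violated design parameter; the second corresponds to the exploration step that relaxes a DAG just enough to save one unit of an operator.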
Scheduling Schemes for Use with DSE
We have noted that REPS generates <hardware cost, performance> estimates and a schedule for a given design input. When the grain of partitioning is a single time step, the schedule obtained is necessarily complete. The schedule is obtained using a combination of successive partitioning and application of approximate scheduling. For a larger grain of partitioning the schedules obtained may be partial. In a partial schedule an operation, instead of being confined to a single time step, is restricted to lie in a partition, which could extend over a few (about five) time steps. For subsequent allocation and binding complete schedules are needed. Most of the existing scheduling algorithms, like FDLS [10], can be adapted to work with the partially scheduled DAG's generated by REPS. However, the performance of such modified heuristic algorithms may not match the performance of the original algorithm.
There is a second and more important aspect that needs to be addressed. It may be noted that the resource estimates are lower bounds and not exact estimates. It is, therefore, quite possible that for a time constraint and a set of hardware operators, as indicated by a design point, a feasible solution might not exist. Even if such a solution does exist, it might be missed by the approximate scheduling algorithm. However, feasible solutions will be present in the neighborhood of a design point. We would like to have schedules with pragmatic hardware requirement and performance. We, therefore, resort to a systematic generation of schedules in the neighborhood of a design point reported by REPS and retained by the DSE mechanism as a non-dominated design point. We rely on existing scheduling algorithms and use them in an appropriate framework. Such a local exploration scheme should be capable of examining the neighborhood of a design point for feasible non-dominated solutions using approximate scheduling algorithms. The choice of polynomial time techniques here is emphasized, for otherwise an exact method could be used to obtain the schedule in the first place. In the next sub-section we examine the local exploration scheme.
Thus after the first phase of DSE we have a set of design points. With each design point we also have the set of partitioned DAG's which had led to its FU estimate component. At this juncture we complete the schedules of these partitioned DAG's using standard algorithms like FDLS [10] or the scheduling method proposed in [9]. The solutions obtained from this completion give us upper bound (u.b.) estimates. If these match the lower bound (l.b.) estimates obtained through DSE, we can terminate with accurate design points and schedules. On the other hand, if the u.b.'s and the l.b.'s differ, we explore around the estimated design point for feasible schedules leading to non-dominated <performance, FU requirement> design points. That is, we make a limited search (in polynomial time) around the estimated design points obtained earlier. Our study of some list scheduling algorithms shows that these algorithms usually terminate with optimal solutions for small DAG's. Therefore, in our state space generation we performed decomposition in a balanced manner to ensure that the sub-problems generated after DSE are small and more suitable for existing scheduling algorithms.
procedure relax(dpoint)
{
    set of non-dominated designs = ∅
    try to find schedule with constraints specified in dpoint
    incorporate schedule and actual design point corresponding to schedule into the set of non-dominated designs
    while (dpoint could not be satisfied)
    {
        dpoint1 = dpoint
        relax time constraint of dpoint1
        while (dpoint1 is not dominated by a design in the set of non-dominated designs)
        {
            try to find schedule with constraints specified in dpoint1
            incorporate schedule and actual design point corresponding to schedule into the set of non-dominated designs
            if (dpoint1 is satisfied) then break
            relax time constraint of dpoint1
        }
        relax resource constraint on dpoint
        if (dpoint is dominated by a design in the set of non-dominated designs) then break
        try to find schedule with constraints specified in dpoint
        incorporate schedule and actual design point corresponding to schedule into the set of non-dominated designs
    }
}
Figure 7: Heuristic Relaxation Scheme for Local Exploration.
Local Exploration
Given a design point for which only a partial schedule is available, we first try to schedule using the available time and resource constraints to check for the existence of a solution. If such a schedule is found then we are done. If scheduling fails for the time and resource constraints indicated by the design point, then the time constraint as well as the resource constraint can be relaxed. The relaxation of these constraints also constitutes a search space. We have adopted a heuristic relaxation scheme, as shown in figure 7. The algorithm works in polynomial time. This relaxation scheme effects both resource and time constraint relaxation. When the initial scheduling attempt fails, it initiates time relaxation on a copy of the design point, keeping the original one for resource relaxation. The time is relaxed in steps and for each new constraint a schedule is found. The actual schedule found may not satisfy the constraint; nevertheless the schedule, along with the actual design point corresponding to it, is incorporated in the set of non-dominated designs. It is necessary to incorporate a schedule even if it does not satisfy the design constraint, to accommodate the inadequacy of the approximate scheduling algorithm and to ensure proper termination of the approximate scheduling scheme. The time constraint is relaxed till the new design point is dominated by one of the designs in the set of non-dominated designs. This marks the end of a run of time constraint relaxations. Now the resource constraint on the original design point is relaxed and the entire process is repeated till the new resource constraint turns out to be dominated by one of the designs in the set of non-dominated designs.
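The relaxation scheme described above can be sketched in executable form as follows. This is a minimal sketch under several assumptions of ours: a design point is a (time steps, number of FUs) pair, each relaxation adds one time step or one FU, "dominated" is read as "some stored design is no worse in both criteria", and `toy_scheduler` is a purely hypothetical stand-in for an approximate scheduler such as FDLS.

```python
from typing import Callable, List, Tuple

Point = Tuple[int, int]  # assumed (time steps, number of FUs); lower is better

def no_worse(a: Point, b: Point) -> bool:
    """True if a is no worse than b in every criterion."""
    return a[0] <= b[0] and a[1] <= b[1]

def incorporate(front: List[Point], new: Point) -> None:
    """Insert `new` in place, keeping only mutually non-dominated points."""
    if any(no_worse(p, new) for p in front):
        return
    front[:] = [p for p in front if not no_worse(new, p)]
    front.append(new)

def relax(dpoint: Point, schedule: Callable[[int, int], Point]) -> List[Point]:
    """Explore around dpoint by alternating time and resource relaxation."""
    front: List[Point] = []

    def attempt(constraint: Point) -> bool:
        actual = schedule(*constraint)       # design point actually achieved
        incorporate(front, actual)           # kept even if constraint is missed
        return no_worse(actual, constraint)  # was the constraint satisfied?

    satisfied = attempt(dpoint)
    while not satisfied:
        trial = (dpoint[0] + 1, dpoint[1])   # time relaxation on a copy
        while not any(no_worse(p, trial) for p in front):
            if attempt(trial):
                break
            trial = (trial[0] + 1, trial[1])
        dpoint = (dpoint[0], dpoint[1] + 1)  # resource relaxation on original
        if any(no_worse(p, dpoint) for p in front):
            break
        satisfied = attempt(dpoint)
    return front

def toy_scheduler(time_limit: int, fus: int,
                  n_ops: int = 12, critical_path: int = 3) -> Point:
    """Hypothetical scheduler: n_ops unit operations packed onto `fus` units."""
    steps = max(critical_path, -(-n_ops // fus))  # ignores time_limit
    return (steps, fus)

print(relax((3, 2), toy_scheduler))  # [(6, 2), (4, 3), (3, 4)]
```

Starting from the lower-bound estimate (3, 2), the toy scheduler can only achieve six steps with two FUs, so the sketch records (6, 2), then relaxes the resource constraint and records (4, 3) and finally (3, 4), at which point the constraint is met and exploration stops.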
The approximate scheduling algorithm that has been used here is the force directed list scheduling algorithm [10]. However, any other approximate scheduling algorithm can also be used. The quality of the design space obtained will be governed by the quality of the approximate scheduling algorithm.
Example 4. We summarize the working of the overall DSE technique through this example. We refer to figure 8. In this figure the filled circles indicate the actual non-dominated design points of a hypothetical design that we would like to uncover through DSE. For small problems where REPS can be run with W = 1, these points are directly obtained. We consider a scenario where REPS is invoked with W > 1. The design points returned by REPS are indicated by '+' and by a second marker. These are approximate design points. We do not consider the points shown with the second marker because they are dominated by the points marked '+'. The feasible schedules in the neighbourhood of these points (indicated by the large circles) are now found by local exploration. These points are marked by empty circles. It may be noted that some of these points coincide with the design points found by REPS (marked '+'). These are optimal schedules because the l.b. and the u.b. costs are identical. In other cases these are distinct.
6 Experimentation
The techniques proposed in this paper have been implemented and tested. The implementation has been done in C in a UNIX environment. We have performed DSE on some common examples like Facet [11], differential equation solver [5] and elliptic wave filter [12]. We now describe our experimentation. Tables 1, 2 and 3 indicate the design points obtained after design space exploration of Facet, Diffeq. and Elliptic Wave Filter, respectively. All these designs are for single cycle implementations of the operations. The first two columns indicate design parameters. The design points . Each block of rows in a table indicates the design points obtained for a particular set of parameters. For Facet the design points obtained by DSE match the actual designs obtained after allocation and binding. This is also true for the elliptic wave filter example in table 3. For Diffeq. the actual implementations of the design points indicated in rows 2, 3 and 4 of table 2 require an additional adder in each case. For two FUs and seven time steps for Diffeq. the operations scheduled in three time steps were as follows: < ? + >; < ? ? >; and < + ? >. Therefore, although at most one + and one ? are scheduled in any time step, it is not possible to have an FU configuration using two FUs where at least a + or ? is not repeated. For the case with three FUs and seven time steps, however, such a problem did not exist. Yet an additional + was used to keep the switch cost low. The implementation of Diffeq. using two FUs in seven time steps is especially nice, requiring only three switches. The design point indicated in row 2 is for designing with only two FUs, whereas there are four types of operations distributed over seven time steps. For the design point indicated in row 3 of table 2 the number of time steps is four, exactly equal to the length of the critical path in the data flow graph. The data path for Diffeq. using two FUs in six time steps is shown in figure 9.
The filled circles indicate switched connections and unfilled circles indicate unswitched connections. The variables in each memory are indicated. (A colouring of the lifetime conflict graph of these variables has to be performed to determine the actual number of cells needed in the memory.) The directions of the arrows indicate read or write ports. No direction is specified for read/write ports.
Conclusion
A given behavioural specification can have a large number of RTL implementations. We can partially characterize RTL implementations by means of design parameters like the number of FUs and the number of buses, when we consider a bus based data path. Even for a given set of design parameters a number of designs are possible which differ in their hardware requirement and performance. These designs constitute a design space which needs to be systematically explored to find the non-dominated designs. The early part of this design space exploration (DSE) problem revolves around the basic scheduling problem, which is NP-hard. We have proposed a scheme for doing this exploration using a combination of controlled search and approximate scheduling techniques. The search is based on depth first branch and bound (DFBB). DFBB has the advantage of requiring minimum space on the host machine where it has to run. It is necessary to conserve space because the storage for a single (partial) solution is itself considerable. We have used a balanced problem decomposition scheme for the DFBB. This has the advantage of partitioning the original problem into smaller subproblems of nearly equal sizes. The importance of doing DSE has been demonstrated through experimentation on randomly generated DAG's of various sizes. Through this experimentation we have noted that the design space for a problem typically has a number of non-dominated design points whose hardware requirements and performances are quite arbitrary. We have also highlighted the importance of working with design parameters to obtain schedules for which data paths having a fixed number of functional unit sites and buses may be constructed. This is advantageous for the subsequent construction of data paths from these schedules. We have applied our DSE techniques to some common examples like Facet, differential equation solver and elliptic wave filter and constructed data paths from the schedules obtained. We have noted close conformity between the estimates obtained with DSE and the actual hardware used in the data paths.
