I. INTRODUCTION

THE OBJECTIVE of high-level synthesis is to convert a behavioral hardware description to a description of RTL (Register Transfer Level) structure that implements the behavior [1]. The main procedure of high-level synthesis can be divided into three major phases. The first phase converts the input description to a control/dataflow graph (CDFG), which represents the control flow and data dependence relations between operations [2]. The second phase is scheduling, which assigns operations to the appropriate control steps [3]-[8], [26], [27]. The allocation of functional units can precede, follow, or be combined with the scheduling phase. The third phase is data-path synthesis, which allocates hardware resources such as registers and busses, and binds the operations of the CDFG to functional units [9]-[16].

This paper concentrates on the scheduling process. Core design decisions such as the number of hardware resources, clock cycle time, and implementation styles (pipeline, multi-cycle operation, etc.) are made during the scheduling process. Furthermore, these decisions have a strong influence on the following data-path synthesis phase. Thus, an efficient scheduler is essential in high-level synthesis. During the last decade, many scheduling algorithms have been published. Since scheduling is an intractable problem, most high-level synthesis systems rely on heuristic approaches. Most of the heuristic approaches are based on a constructive algorithm, in which operations are selected and fixed in a control step one by one until all the operations are fixed.
The simplest scheduling method is to locate operations as soon as possible (ASAP) [9] or as late as possible (ALAP). The Emerald system [10] uses the ASAP scheduling method. If there is no resource limitation, the method yields the minimal number of control steps (c-steps); otherwise, it may produce a schedule that requires too many hardware resources to implement. To deal with this defect, ASAP scheduling with conditional postponement of operations is proposed in the MIMOLA [11] system and the Flamel [17] system. Another method to avoid the above problem is list scheduling [3], in which all the operations are sorted in a topological order using the data dependence relations indicated in the CDFG, and then each operation is placed in a c-step determined by a heuristic priority function such as urgency, described in MAHA [12], or mobility, used in Slicer [4]. Force-directed scheduling (FDS) [5], used in the HAL system, is based on a more global view. In FDS, operations are assigned to c-steps one by one based on an evaluation using a distribution graph, which represents the distribution of fixed operations and unscheduled operations in each c-step. The distribution of unscheduled operations is probabilistic rather than deterministic.
The constructive algorithms, however, do not guarantee a globally optimal solution. An Integer Linear Programming (ILP) approach which finds the globally optimal solution is proposed in ALPS [6]. Though ALPS yields the optimal solution, it is not adequate for some large examples because ILP is an exponential time algorithm by nature. Another problem with the ILP approach is that it is sometimes very difficult to accommodate various design styles in an ILP formulation, because some constraints or objective functions are difficult to express in a linear form. Another class of algorithms, which provides a probabilistically optimal result, is the stochastic approach, such as simulated annealing [8].
The above scheduling methods are usually applied to a single basic block. On the other hand, path-based scheduling [26] and percolation-based scheduling [27] can somewhat exploit parallelism across basic block boundaries. Path-based scheduling finds all possible execution paths and optimizes each of them, and then all the execution paths are overlapped to minimize the number of control states. Percolation-based scheduling starts with an optimal schedule and applies semantic-preserving transformations to maximize parallelism.

0278-0070/93$03.00 © 1993 IEEE
The scheduling and data-path synthesis are closely interrelated. Since it is hard to consider all the details of data-path synthesis during scheduling process, an optimal schedule does not necessarily lead to an optimal synthesis result after data-path synthesis is finished. In many practical situations, it is, therefore, economical to obtain a good solution using a fast heuristic algorithm because many iterations of scheduling and data-path synthesis are usually required to obtain a satisfactory synthesis result.
In this paper, we address the scheduling process under real world constraints such as chained operations, multicycle operations, mutually exclusive operations, pipelined data paths and so on. We propose a new scheduling algorithm, called FAMOS, which iteratively improves an initial schedule by selecting multiple exchange pairs with maximal cumulative gain, as originally proposed by Kernighan and Lin for min-cut graph partitioning [18]. The min-cut approach has been widely used for placement and floorplanning applications [20], [21]. The proposed scheduling algorithm is effective in terms of both computation time and solution quality. The algorithm can escape from local minima and tends to reach a globally optimal solution, as demonstrated on several examples. A graph model which includes information on the real world constraints is also presented. For examples appearing in the literature, our algorithm has produced optimal results in a shorter computation time than the earlier works.
The paper is organized as follows. Section II states the scheduling problem to be solved. Section III describes the graph model, and Sections IV-VI explain the FAMOS scheduling algorithm with the analysis of time complexity. Extensions to include real world situations are described in Section VII. Section VIII gives experimental results and comparisons with other scheduling methods. The paper ends with discussions and conclusions.
II. PROBLEM DESCRIPTION

A behavioral hardware description consists of CDFG blocks, where each CDFG block contains not only straight-line code but also conditional branches. However, a block does not include statements such as GOTO or LOOP. The goal of this paper is to find a schedule having minimal hardware cost for a CDFG block under the constraint of a fixed execution time, which is given by the number of c-steps. This scheduling problem is called mincost scheduling in this paper. The counterpart problem is mintime scheduling, whose objective is to find a schedule having minimal execution time under the constraint of a fixed hardware cost. An extension for mintime scheduling is briefly described in Section VII.
We assume that a CDFG block is given as input. The problem of how to manage all the CDFG blocks in a hardware description is important, but is not the focus of the paper. Flamel [17], for instance, has used block-level transformations to find an implementation that minimizes the execution time while satisfying a user-supplied constraint on the cost of the hardware implementation.
III. GENERATION OF WEIGHTED PRECEDENCE GRAPH
The input to the scheduling consists of a CDFG and some hardware information such as the clock cycle time, the execution delay of each operation, and the number of c-steps. A graph called the Weighted Precedence Graph (WPG) is generated to accommodate this information. WPG is a directed acyclic graph having weights on both nodes and edges, which contains not only precedence relations but also the minimal number of clock cycles between operations. Each node in WPG corresponds to an operation, and two nodes are connected by a directed edge if there is a precedence relation between them. An edge weight denotes the minimal number of c-steps between a pair of operations, while a node weight represents the minimal number of c-steps necessary to cover the delay of the corresponding operation. Additional edges that do not appear in the CDFG may be inserted in WPG for the case of chained operations. Chained operations occur when several operations can be executed consecutively in one cycle time. If chained operations are not allowed, the topologies of CDFG and WPG are the same except that edges and nodes have weights in WPG. Fig. 1 shows a small example. The example contains four additions, each of which takes half a cycle. Two consecutive additions can be assigned to the same c-step, while two additions which are not consecutive cannot be assigned to the same c-step. Note that in Fig. 1(b) one additional thick edge is included between two additions to denote this fact. The practical meaning of edge (oi, oj) with weight wij can be represented by the following inequality,

$$cs(o_j) - cs(o_i) \ge w_{ij} \tag{1}$$

where cs(oi) denotes the identifier of the c-step where operation oi is scheduled. The node weight is used for multi-cycle operations and pipelined functional units, as explained in Section VII. Fig. 2 shows another example, the 16-point FIR filter quoted from SEHWA [7]. It is assumed that one multiplication takes one cycle while an addition takes half a cycle. New precedence relations are shown as thick arrows in Fig. 2, which indicate that two non-consecutive additions have to be assigned to different c-steps.
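As a small illustration of how the edge weights constrain a schedule, the sketch below checks a candidate assignment against the reading cs(oj) - cs(oi) >= w for every edge (oi, oj) of weight w. The function name `violated_edges`, the edge-list representation, and the four-addition chain are assumptions made for this example; they are not taken from Fig. 1.

```python
def violated_edges(edges, cs):
    """Return the WPG edges (oi, oj, w) for which cs[oj] - cs[oi] >= w fails.

    edges: list of (oi, oj, w) triples; cs: operation -> c-step identifier.
    """
    return [(oi, oj, w) for (oi, oj, w) in edges if cs[oj] - cs[oi] < w]


# Hypothetical chain a1 -> a2 -> a3 -> a4 of half-cycle additions.
# Zero-weight edges allow chaining; weight-1 edges separate additions
# that are not consecutive.
edges = [("a1", "a2", 0), ("a2", "a3", 0), ("a3", "a4", 0),
         ("a1", "a3", 1), ("a2", "a4", 1)]

good = {"a1": 1, "a2": 1, "a3": 2, "a4": 2}   # two additions per c-step
bad = {"a1": 1, "a2": 1, "a3": 1, "a4": 2}    # three chained additions

good_violations = violated_edges(edges, good)
bad_violations = violated_edges(edges, bad)
```

An empty result means the schedule respects every precedence relation; any returned edge identifies a pair of operations placed too close together.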
A CDFG block is represented by a directed acyclic graph because statements such as GOTO and LOOP are not allowed in a block. WPG is constructed by drawing directed edges between two operations which cannot be assigned to the same c-step. If there is no path between two operations in CDFG, there is no precedence relation between them. For each operation oi, the search for its successors to represent the precedence relation is based on the calculation of the cumulative execution delay from operation oi to its successors. If the sum of the execution delay of operation oi and that of its successor is less than one cycle time, both of them can be assigned to the same c-step. In this case, we draw a precedence edge having zero weight and then recursively invoke a new search to find grand-successors to which the cumulative delay from oi is greater than one cycle time. If we find an operation oj (a successor or grand-successor of oi) that cannot be assigned to the same c-step as oi, we draw a precedence edge with a weight greater than zero. Since oi and the successors of oj clearly cannot be assigned to the same c-step, it is redundant to find precedence relations between oi and the successors of oj. The procedure for the generation of WPG is given below.
WPG_Generation:
begin
    Assign zero weight to all edges in CDFG;
    for each operation oi
    begin
        node weight of oi = ⌈execution delay of oi / cycle_time⌉;
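The search described in the text can be sketched in Python as follows. This is an illustrative reading of the prose, not the authors' procedure: the names `build_wpg` and `ceil_div` are hypothetical, delays are assumed to be integers in the same unit as the cycle time, chaining is assumed to be allowed when the cumulative delay does not exceed one cycle (so that two half-cycle additions can share a c-step), and the non-zero edge weight is assumed to be the number of whole cycles elapsed before the blocked successor can start.

```python
def ceil_div(a, b):
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


def build_wpg(succ, delay, cycle_time):
    """Build node and edge weights of a WPG from a CDFG.

    succ:  operation -> list of direct successors in the CDFG (a DAG)
    delay: operation -> execution delay (integer, same unit as cycle_time)
    Returns (node_weight, edge_weight), edge_weight keyed by (oi, oj).
    """
    node_weight = {op: ceil_div(delay[op], cycle_time) for op in delay}
    edge_weight = {}

    def search(oi, op, elapsed):
        # elapsed: cumulative delay from the start of oi through op.
        for nxt in succ.get(op, []):
            key = (oi, nxt)
            if elapsed + delay[nxt] <= cycle_time:
                # nxt can be chained with oi: zero-weight edge, then keep
                # searching among the successors of nxt.
                edge_weight[key] = max(edge_weight.get(key, 0), 0)
                search(oi, nxt, elapsed + delay[nxt])
            else:
                # nxt cannot share a c-step with oi (assumed weight rule).
                w = ceil_div(elapsed, cycle_time)
                edge_weight[key] = max(edge_weight.get(key, 0), w)
                # Successors of nxt are not searched: they are redundant.

    for oi in delay:
        search(oi, oi, delay[oi])
    return node_weight, edge_weight


# Hypothetical chain of four half-cycle additions (cycle time 100 units).
succ = {"a1": ["a2"], "a2": ["a3"], "a3": ["a4"], "a4": []}
delay = {"a1": 50, "a2": 50, "a3": 50, "a4": 50}
node_weight, edge_weight = build_wpg(succ, delay, 100)
```

For this chain the sketch produces zero-weight edges between consecutive additions and weight-1 edges from a1 to a3 and from a2 to a4, which matches the role of the additional thick edges described in the text.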
IV. BASIC SCHEDULING ALGORITHM
The scheduling process of FAMOS is an iterative algorithm which incrementally improves an initial schedule. The iteration strategy was taken from a graph partitioning algorithm [19] that selects a set of trials producing maximal cumulative gain. The original graph partitioning problem is to partition n nodes into two subsets of size ⌈n/2⌉ such that the sum of the weights of the edges connecting the two subsets is minimal. Graph partitioning is NP-complete, but there are good heuristics [18].

FAMOS scheduling starts from an initial feasible schedule, usually the result of ASAP or ALAP scheduling, or one that is produced using a constructive scheduling method. All possible c-steps for each operation are first identified from the results of both ASAP and ALAP scheduling. An operation may sit anywhere between its ASAP c-step and its ALAP c-step as long as the precedence relation is not violated.
Consider the differential equation example shown in Fig. 3, which is taken from FDS [5]. ASAP and ALAP schedules are shown in Fig. 3(a) and Fig. 3(b), respectively, and the possible c-steps of each operation are shown in Fig. 3(c). Operations 1-5 are not taken into account further during the scheduling process since their c-steps have already been uniquely determined.
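The range of possible c-steps for each operation can be sketched as below. This is a minimal illustration under stated assumptions: the function name `cstep_ranges` and the small four-operation graph are hypothetical (it is not the graph of Fig. 3), c-steps are numbered from 1, the WPG is given as (oi, oj, w) triples with non-negative weights, and the standard-library `graphlib` module (Python 3.9+) supplies the topological order.

```python
from graphlib import TopologicalSorter


def cstep_ranges(nodes, edges, num_csteps, node_weight=None):
    """Return operation -> (ASAP c-step, ALAP c-step) for a WPG."""
    nw = node_weight or {}
    preds = {n: set() for n in nodes}
    for oi, oj, _ in edges:
        preds[oj].add(oi)
    order = list(TopologicalSorter(preds).static_order())

    # ASAP: push every successor at least w c-steps after its predecessor.
    asap = {n: 1 for n in nodes}
    for n in order:
        for oi, oj, w in edges:
            if oi == n:
                asap[oj] = max(asap[oj], asap[n] + w)

    # ALAP: the latest start that still finishes within num_csteps.
    alap = {n: num_csteps - nw.get(n, 1) + 1 for n in nodes}
    for n in reversed(order):
        for oi, oj, w in edges:
            if oj == n:
                alap[oi] = min(alap[oi], alap[n] - w)

    return {n: (asap[n], alap[n]) for n in nodes}


# Hypothetical example: m1 -> m2 -> s1 and a1 -> s1, three c-steps.
nodes = ["m1", "m2", "s1", "a1"]
edges = [("m1", "m2", 1), ("m2", "s1", 1), ("a1", "s1", 1)]
ranges = cstep_ranges(nodes, edges, num_csteps=3)
```

Operations whose ASAP and ALAP c-steps coincide (m1, m2, and s1 here) are fixed and need not be considered further, while a1 may be placed in c-step 1 or 2.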
The strategy is an iterative improvement. Given an initial schedule, our algorithm improves the quality of the schedule by selecting a set of tuples that gives the maximal cumulative gain, where a tuple is defined as tuple = (operation, c-step). The tuple is selected when the operation is moved into the c-step. Each iteration pass tries to identify such a set of tuples. If the cumulative gain of the set of tuples selected in such a way is positive, the operations belonging to the set are moved to the corresponding c-steps, and then a new iteration pass is attempted again upon the result of a previous iteration pass. This process is continued until the maximal cumulative gain is not positive.
To find a set of tuples during one iteration pass, we repeatedly select tuples one at a time. A tuple that has been selected is never visited again during the iteration pass, that is, the tuple is locked. It is important to note that there is no requirement that the operation which is chosen for movement has a positive gain. An operation is moved only because it provides the best selection function value out of all operations which are free to move, not because it provides an overall improvement. This scheme was originally proposed by Kernighan and Lin [18] and helps escape from local minima by utilizing the fact that a movement of an operation which results in a schedule worse than the previous schedule may lead to an eventual improvement. One iteration pass is terminated when there are no more tuples to visit. Then, we select the portion of the sequence of selected tuples which provides the minimal hardware cost including functional units and registers.

The objective of our scheduling scheme is to minimize the amount of hardware resources necessary to implement the given WPG, where one operation can be assigned to one of k c-steps that are provided from the results of ASAP and ALAP scheduling, where k ≥ 1, with the constraint that there is no violation of the precedence relation. This is a more general problem than graph partitioning, where there are only two possible places for each node. A tuple is selected based on the selection function explained in detail in the following section. The selection function is defined such that it causes the most balanced distribution of operations in each c-step and reduces the number of functional units required. This concept is similar to that of FDS [13].
The overall procedure of our scheduling method is described below.
Scheduling-Algorithm-of-FAMOS:
repeat {
    all tuples are unlocked;
    count = 1;
    while there is a movable operation

Let us illustrate the scheduling process using the example shown in Fig. 3. We have assumed that the available functional units are multipliers and ALU's, and that the cost of a multiplier is 5 and that of an ALU is 1. The ALU is capable of performing addition, subtraction and comparison. ASAP scheduling is applied for the initial solution. Fig. 4(a) shows the sequence of tuples selected during the first iteration pass, with the change of hardware cost, which corresponds to the total cost of the functional units required. The hardware cost varies as an operation is moved from one c-step to another, as depicted in Fig. 4(b). When there are no further movable operations, the sequence of tuples which gives the minimal hardware cost is identified. The point is marked by a dark circle in Fig. 4(b). As a result, tuples (7,3), (6,2), (9,3), (9,4) and (8,3) are accepted, but the remaining tuples are rejected. At this point, two multipliers and two ALU's are required for the implementation of the data-path. With the obtained result as a new initial schedule, another iteration pass can be performed; however, a better solution is not generated because the solution obtained after the first pass happened to be optimal, as depicted in Fig. 4(c). Two other optimal solutions obtained by two additional moves, (11,3) and (10,2), respectively, are also shown in Fig. 4(b). It is another advantage of our algorithm that multiple optimal solutions can be obtained simultaneously, where the tie can be broken with further evaluation among them using another cost function.
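The pass structure described above (select the best free tuple, lock it, and finally keep the prefix of moves with the lowest cost) can be sketched as follows. This is a simplified illustration rather than the FAMOS implementation: the names `hardware_cost`, `one_pass`, and `improve` are hypothetical, the selection value is assumed to be simply the hardware cost after the move (a stand-in for the selection function of Section V), only functional-unit cost is counted, and the three-operation example is invented.

```python
def hardware_cost(cs, op_type, unit_cost):
    """Sum over unit types of unit cost times peak operations per c-step."""
    count = {}
    for op, step in cs.items():
        key = (op_type[op], step)
        count[key] = count.get(key, 0) + 1
    peak = {}
    for (kind, _), n in count.items():
        peak[kind] = max(peak.get(kind, 0), n)
    return sum(unit_cost[kind] * n for kind, n in peak.items())


def feasible(cs, edges):
    """True when every WPG edge (oi, oj, w) satisfies cs[oj] - cs[oi] >= w."""
    return all(cs[oj] - cs[oi] >= w for oi, oj, w in edges)


def one_pass(cs, ranges, edges, op_type, unit_cost):
    """One iteration pass: move, lock, and keep the best prefix of moves."""
    cs = dict(cs)
    locked = set()
    best_cost, best_cs = hardware_cost(cs, op_type, unit_cost), dict(cs)
    while True:
        candidates = []
        for op, (lo, hi) in ranges.items():
            for k in range(lo, hi + 1):
                if k == cs[op] or (op, k) in locked:
                    continue
                trial = dict(cs)
                trial[op] = k
                if feasible(trial, edges):
                    cost = hardware_cost(trial, op_type, unit_cost)
                    candidates.append((cost, op, k))
        if not candidates:
            return best_cs, best_cost
        # The best free tuple is taken even if it worsens the schedule.
        cost, op, k = min(candidates)
        locked.add((op, k))
        cs[op] = k
        if cost < best_cost:
            best_cost, best_cs = cost, dict(cs)


def improve(cs, ranges, edges, op_type, unit_cost):
    """Repeat passes until a pass no longer lowers the hardware cost."""
    cost = hardware_cost(cs, op_type, unit_cost)
    while True:
        new_cs, new_cost = one_pass(cs, ranges, edges, op_type, unit_cost)
        if new_cost >= cost:
            return cs, cost
        cs, cost = new_cs, new_cost


# Hypothetical example: two multiplications and one ALU operation.
op_type = {"m1": "mul", "m2": "mul", "a1": "alu"}
unit_cost = {"mul": 5, "alu": 1}          # costs as assumed in the text
edges = [("m1", "a1", 1)]
ranges = {"m1": (1, 2), "m2": (1, 2), "a1": (2, 3)}
asap = {"m1": 1, "m2": 1, "a1": 2}        # initial cost: 2*5 + 1 = 11
final_cs, final_cost = improve(asap, ranges, edges, op_type, unit_cost)
```

In this example the first pass moves m2 to c-step 2, which removes one multiplier, then continues through cost-neutral and cost-increasing moves until every tuple is locked, and finally keeps only the prefix ending at the cheapest point.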
V. SELECTION FUNCTION

In selecting a good candidate operation to be moved, the most important factor is the balance of the number of functional units required among all c-steps. This is a necessary condition for obtaining the minimal number of functional units. The balance of the number of functional units among c-steps can be obtained by decreasing the number of functional units in the c-step where the density of functional units is maximal. Therefore, the selection function used in our work is defined as follows:
where Si(j, k) denotes the selection value of a tuple (oi, k) when operation oi, which is currently located at c-step j, is moved to c-step k, and cost(oi) is the hardware cost of the functional unit that performs oi. In (2.b), Change of Density Gradient (CDG) is defined as the difference between the Density Gradient (DG) before the movement and that after the movement, i.e.,

$$\mathrm{CDG} = \mathrm{DG}_{\text{before movement}} - \mathrm{DG}_{\text{after movement}}$$

In Fig. 5, two different selection values, S,(2, 3) and Ss(1, 2), are calculated to be 20 and 40, respectively. Before the calculation of a selection value, it is checked whether the move of the operation violates the precedence relation indicated in WPG. In each iteration, the selection value is calculated for each free tuple.
VI. TIME COMPLEXITY AND DATA STRUCTURE

When we implement the proposed scheduling algorithm straightforwardly, the time complexity of FAMOS scheduling in the worst case is O(s^2 m^2), where s is the number of c-steps and m is the total number of operations. The time complexity is derived with the following considerations.
1) Each iteration pass generates a move set which contains a sequence of moves having the maximal cumulative gain. The maximal number of allowable moves is n, where n is the number of tuples.
2) Within each iteration, selecting the tuple having the maximal selection function value requires O(n) time when implemented in a straightforward fashion, and modifying the selection function values of at most n tuples takes O(n) time.
3) The maximal number of possible tuples for an operation is s, if all the operations are totally independent. Therefore, the total number of tuples, n, is bounded by s · m.
4) The total number of iteration passes is typically 2-5, i.e., independent of the problem size, as is the case with the typical min-cut procedure.
If we evaluate the selection function exhaustively for all free tuples and choose a tuple having the maximal selection value, it takes O(n) time for each selection. To reduce the time complexity, one can consider more efficient data structures. One of those is a heap data structure for extracting a tuple having maximal selection value, as shown in the upper part of Fig. 6 .
The heap data structure has the property that the value at each node is at least as large as the values of its children, which implies that the largest element is available at the root of the heap. The update of selection values as well as the locking of the tuples is achieved in an incremental fashion using the linked list data structure shown in the lower part of Fig. 6. Let us assume that each node in the heap stores a tuple and its selection value. The initial construction of this data structure requires O(n + m log m) time, where m is the number of operations. When a tuple is selected from the root of the heap, the corresponding operation is moved to the corresponding c-step and the tuple is removed from the linked list. After such a move, it is necessary to update the selection values of all the remaining tuples. However, we can significantly reduce the update time by confining the possible update to the two c-steps which the operation was moved from and to. Therefore, we need only modify the selection values of tuples which are at those c-steps. Tuples which meet these conditions can be easily found using the linked list data structure. For each such tuple found, we update the selection value and then propagate the tuple upward through the heap if the updated selection value of the tuple is greater than that of its parent. O(log m) time is sufficient for the update of the heap for each tuple found. Therefore, for each move, the update of both the selection values and the heap data structure can be carried out in O(m log m) time. Thus, the time complexity with the heap data structure is O(nm log m) = O(s m^2 log m).
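The role of the heap can be illustrated with Python's standard `heapq` module. Note the substitution: `heapq` has no operation for raising the key of an entry in place, so this sketch uses lazy invalidation (stale entries are skipped when popped) instead of the upward propagation described in the text. The class name `TupleHeap` and its methods are hypothetical.

```python
import heapq


class TupleHeap:
    """Max-heap of (operation, c-step) tuples keyed by selection value."""

    def __init__(self):
        self._heap = []      # entries: (-value, version, tuple)
        self._live = {}      # tuple -> (value, version) of the valid entry
        self._version = 0

    def set(self, tup, value):
        """Insert a tuple or update its selection value."""
        self._version += 1
        self._live[tup] = (value, self._version)
        heapq.heappush(self._heap, (-value, self._version, tup))

    def lock(self, tup):
        """Remove a tuple from further consideration."""
        self._live.pop(tup, None)

    def pop_max(self):
        """Return (tuple, value) with the largest value, or None if empty."""
        while self._heap:
            neg_value, version, tup = heapq.heappop(self._heap)
            if self._live.get(tup) == (-neg_value, version):
                del self._live[tup]
                return tup, -neg_value
        return None


heap = TupleHeap()
heap.set(("o7", 3), 20)
heap.set(("o6", 2), 40)
heap.set(("o7", 3), 50)      # updated selection value supersedes 20
first = heap.pop_max()
second = heap.pop_max()
third = heap.pop_max()
```

With lazy invalidation each update costs one push, at the price of stale entries remaining in the heap until they are popped and discarded.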
Bucket sorting can be used as another possible data structure for selecting a tuple with the maximal selection value. Using bucket sorting, the selection can be finished in constant time if the bucket size is properly determined. The linked list of tuples is also useful for incrementally updating the selection values of tuples. Therefore, the time complexity with bucket sorting is reduced to O(s m^2).
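A bucket structure of this kind might look like the sketch below. It assumes that selection values are integers within a known bound, which the text does not state explicitly; the class name `BucketList` and its methods are hypothetical.

```python
class BucketList:
    """Buckets indexed by integer selection value in [-bound, +bound]."""

    def __init__(self, bound):
        self._offset = bound
        self._buckets = [set() for _ in range(2 * bound + 1)]
        self._where = {}     # tuple -> index of the bucket holding it
        self._top = -1       # highest bucket index that may be non-empty

    def remove(self, tup):
        """Drop a tuple (for example, when it is locked)."""
        idx = self._where.pop(tup, None)
        if idx is not None:
            self._buckets[idx].discard(tup)

    def set(self, tup, value):
        """Insert a tuple or move it to the bucket of its new value."""
        self.remove(tup)
        idx = value + self._offset
        self._buckets[idx].add(tup)
        self._where[tup] = idx
        self._top = max(self._top, idx)

    def pop_max(self):
        """Return (tuple, value) from the highest non-empty bucket."""
        while self._top >= 0 and not self._buckets[self._top]:
            self._top -= 1
        if self._top < 0:
            return None
        tup = self._buckets[self._top].pop()
        del self._where[tup]
        return tup, self._top - self._offset


buckets = BucketList(bound=10)
buckets.set(("a", 1), 3)
buckets.set(("b", 2), -4)
buckets.set(("a", 1), 7)     # value update moves the tuple to another bucket
top = buckets.pop_max()
nxt = buckets.pop_max()
empty = buckets.pop_max()
```

Insertion and update take constant time; the downward scan in `pop_max` is bounded by the number of buckets, which is why the bucket range has to be chosen with care.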
In the analysis of time complexity, we assumed that the number of tuples for each c-step is equal to the number of operations. However, in practical cases, the number of tuples for each operation is less than linearly proportional to the problem size, and can be assumed to be bounded by some constant. This contrasts with the length of the critical path (the number of c-steps), which usually increases as the problem size increases. If we assume that the number of tuples for each c-step is independent of the problem size, the time complexity is reduced to O(sm log m) and O(sm) using the heap data structure and the bucket data structure, respectively.
VII. EXTENSIONS FOR REAL WORLD CONSTRAINTS
In previous sections, we assumed that one clock cycle is required for each operation, which is not always true in real situations. We extend the algorithm to accommodate typical real world situations.
A. Mutually Exclusive Operations
Two operations are mutually exclusive if they cannot be executed concurrently, which occurs when there are IF-THEN-ELSE or CASE statements in the behavioral description. Operations in different branches are mutually exclusive. Mutually exclusive operations can be easily handled by regarding them as indistinguishable operations if they are of the same type of functional unit, since they can share the same functional units even though they are assigned to the same c-step. For each c-step in which mutually exclusive operations are assigned, only one is added to the density for each set of mutually exclusive operations that can be performed in the same functional unit.
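The density rule for mutually exclusive operations can be sketched as below for the simple case of non-nested conditionals. The tagging scheme is an assumption made for illustration: each conditional operation carries a hypothetical (condition, branch) tag, operations in different branches of the same condition are taken to be mutually exclusive, and all operations passed in are assumed to use the same type of functional unit.

```python
from collections import Counter


def density_with_exclusion(ops_in_step, branch_of):
    """Functional units needed in one c-step for one unit type.

    ops_in_step: operations assigned to the c-step
    branch_of:   operation -> (condition id, branch id); absent if the
                 operation is unconditional (hypothetical tagging scheme)
    """
    unconditional = 0
    per_branch = Counter()
    for op in ops_in_step:
        tag = branch_of.get(op)
        if tag is None:
            unconditional += 1
        else:
            per_branch[tag] += 1
    # Only the busiest branch of each condition contributes to the density.
    worst = {}
    for (cond, _branch), n in per_branch.items():
        worst[cond] = max(worst.get(cond, 0), n)
    return unconditional + sum(worst.values())


branch_of = {"a2": ("c1", "then"), "a3": ("c1", "else"), "a4": ("c1", "else")}
units = density_with_exclusion(["a1", "a2", "a3", "a4"], branch_of)
```

Here four additions share a c-step, but a2 can share a unit with either a3 or a4, so three units suffice.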
To check whether two operations are mutually exclusive or not, a node coloring scheme is proposed in SEHWA [7], which, however, does not work in some cases. In Bridge [25], the mutual exclusiveness is tested by checking the intersection of active boolean cube sets between two basic blocks. Another easy algorithm is presented in the Appendix.
B. Chained Operations
If the sum of the execution delays of k consecutive operations (k ≥ 2) is less than one cycle time, these k operations can be chained within one cycle time. The algorithm is applicable to the case of chained operations without any modifications because the information on chaining has already been incorporated into WPG by including additional precedence relations.
C. Multi-cycle Operations
Although information on the multi-cycle operation is included in the node weight of WPG, the selection function should be modified to accommodate the fact that a multi-cycle operation takes more than one cycle. This necessitates a different formulation for the density evaluation of (2). For example, if a multiplication requiring two cycles is assigned to c-step 2, one is added to the densities of both c-step 2 and c-step 3. The density in (2) is replaced by the maximal density (MD) out of all the c-steps covered by the operation, as defined below:

$$\mathrm{MD} = \max_{j \le r < j + nw_i} \mathrm{Density}[r]$$

where Density[r] is the density of c-step r, j is the c-step to which oi is assigned, and nwi denotes the node weight of oi, i.e., the minimal number of c-steps required to perform oi.
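The maximal density over the c-steps covered by a multi-cycle operation can be computed as in the sketch below. The function names and the two-multiplication example are hypothetical; c-steps are numbered from 1 and a single type of functional unit is assumed.

```python
def cstep_densities(cs, node_weight, num_csteps):
    """density[r] = operations occupying c-step r (index 0 is unused)."""
    density = [0] * (num_csteps + 1)
    for op, j in cs.items():
        # An operation with node weight nw occupies c-steps j .. j + nw - 1.
        for r in range(j, j + node_weight[op]):
            density[r] += 1
    return density


def maximal_density(density, j, nw):
    """Largest density among the c-steps j .. j + nw - 1."""
    return max(density[r] for r in range(j, j + nw))


# Two hypothetical two-cycle multiplications that overlap in c-step 3.
cs = {"m1": 2, "m2": 3}
node_weight = {"m1": 2, "m2": 2}
density = cstep_densities(cs, node_weight, num_csteps=4)
md_m1 = maximal_density(density, cs["m1"], node_weight["m1"])
```

Although m1 starts in c-step 2, where the density is 1, its maximal density is 2 because it also occupies c-step 3 together with m2.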
D. Pipelined Data-path
Consider a pipelined data-path system with a fixed latency L. In pipelined systems, the operations assigned to c-steps i + pL (for p = 0, 1, 2, ... and i = 0, 1, 2, ..., L - 1) cannot share a functional unit because they are executed concurrently. Thus, the density in (2) is changed to the sum of densities (SD) over c-steps i + pL (p = 0, 1, 2, ...) as follows:

$$\mathrm{SD}[i] = \sum_{p = 0, 1, 2, \ldots} \mathrm{Density}[i + pL], \qquad i = 0, 1, \ldots, L - 1$$
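Folding the per-c-step densities modulo the latency gives the sum of densities for each pipeline slot. The sketch below assumes 0-based c-step indices, matching i = 0, 1, ..., L - 1 in the text; the function name `folded_density` and the sample densities are hypothetical.

```python
def folded_density(density, latency):
    """Sum the densities of c-steps i, i + L, i + 2L, ... for each i < L.

    density: list where density[c] is the density of c-step c (0-based)
    latency: the pipeline latency L
    """
    folded = [0] * latency
    for c, n in enumerate(density):
        folded[c % latency] += n
    return folded


# Hypothetical six-c-step schedule folded with latency L = 3.
sd = folded_density([2, 1, 0, 1, 1, 2], latency=3)
```

C-steps 0 and 3 execute concurrently in the pipeline, so their densities (2 and 1) add up to a requirement of three functional units for slot 0.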
E. Maximal Time Constraints
In the synthesis of interface signals, maximal time constraints between two operations may be specified. This kind of constraint is included in WPG as an edge with a negative weight, which is similar to the approach used in layout compaction [22]. For example, assume a pair of operations oi and oj has both a minimal constraint of two cycles and a maximal constraint of five cycles. For such a case, a directed edge from oi to oj with a weight of 2 is added to WPG, and a directed edge from oj to oi with a weight of -5 is also included. The meaning of edge (oj, oi) with a negative weight w is exactly the same as (1). This negative precedence relation causes the two operations to be separated by no more than |w| clock cycles.
If this negative precedence relation exists, the longest path algorithm [22] is applied to find ASAP and ALAP schedules. Since the maximal time constraints are incorporated into WPG, our scheduling algorithm remains the same as the normal algorithm except that we need to take into account the negative precedence relation of WPG.
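An ASAP schedule in the presence of negative-weight edges can be obtained with a longest-path relaxation in the style of Bellman-Ford, sketched below. This is one standard way to compute longest paths in a graph with cycles and is not necessarily the algorithm of [22]; the function name and the third operation ok in the example are hypothetical.

```python
def asap_with_max_constraints(nodes, edges):
    """Earliest c-steps satisfying cs[oj] - cs[oi] >= w for all edges.

    edges may contain negative weights (maximal time constraints), so the
    graph may have cycles. Raises ValueError if no schedule exists.
    """
    cs = {n: 1 for n in nodes}
    for _ in range(len(nodes)):
        changed = False
        for oi, oj, w in edges:
            if cs[oi] + w > cs[oj]:
                cs[oj] = cs[oi] + w
                changed = True
        if not changed:
            return cs
    # Still changing after len(nodes) rounds: a positive cycle exists.
    raise ValueError("timing constraints are inconsistent")


# oi and oj must be 2 to 5 cycles apart; ok lies on a path between them.
nodes = ["oi", "ok", "oj"]
edges = [("oi", "oj", 2), ("oj", "oi", -5), ("oi", "ok", 1), ("ok", "oj", 3)]
schedule = asap_with_max_constraints(nodes, edges)

# Lengthening the path oi -> ok -> oj to 7 cycles violates the maximum of 5.
conflicting = [("oi", "oj", 2), ("oj", "oi", -5),
               ("oi", "ok", 4), ("ok", "oj", 3)]
try:
    asap_with_max_constraints(nodes, conflicting)
    conflict_detected = False
except ValueError:
    conflict_detected = True
```

A cycle whose weights sum to a positive value means that the minimal and maximal constraints contradict each other, which the sketch reports as an error.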
F. Pipelined Functional Unit
A pipelined functional unit can receive input operands every latency clocks. If the first stage of the pipelined functional unit is empty, the functional unit is available for another processing. This means that we can regard a pipelined functional unit as a functional unit of which execution delay is equal to the number of latency cycles. Data dependence relations with other operations are preserved by the edge weight of WPG which represents the minimal processing time of every input. Thus, for the case of pipelined functional unit, the calculation of density is the same as the case of latency-cycle operations, while the actual data dependence with other operations can be found in the edge weight of WPG.
The only change in WPG that is needed to incorporate a pipelined functional unit is to set the weight of the node corresponding to each operation performed by the pipelined functional unit to the number of latency cycles. The edge weights of WPG, on the other hand, should remain unchanged.
G. Register Cost and Interconnection Cost
IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 12, NO. 10, OCTOBER 1993

To reduce the number of registers required for the data-path, the maximal density of live register variables should be minimized, which is closely related to the number of registers required, as stated in the following theorem.
Theorem: Variables which should be saved for later uses can be assigned to the same number of registers as the maximal number of live variables for all c-step boundaries, if there are no control branches.
The kth c-step boundary indicates the boundary between c-step k and c-step k + 1. According to the above theorem, we monitor the density of live variables at each c-step boundary to minimize the number of registers, and incorporate the change of the density into the objective function. The calculation of the selection function based on the register cost is very similar to the method explained in Section V. If an operation oi is moved from c-step j to c-step k, then the densities of live variables may change only at the c-step boundaries lying between c-steps j and k. Thus, the following selection function Ri(j, k), which reflects the register cost, can be linearly superimposed on the original selection function Si(j, k).
where RD[r] means the number of live variables at the c-step boundary r, and CRDG denotes the change of register density gradient (RDG) as defined below:
where RD' [r] means the number of live variables crossing over c-step boundary r after the operation oi is moved to c-step k.
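The density of live variables at each c-step boundary can be counted as in the sketch below. The simplifying assumptions are that every operation takes one c-step, that the value produced by an operation is live from the end of its c-step until the c-step of its last consumer, and that there are no control branches (as in the theorem). The function name and the four-operation example are hypothetical.

```python
def register_density(cs, consumers, num_csteps):
    """rd[r] = number of values live across boundary r (between r and r+1).

    cs:        operation -> c-step (numbered from 1)
    consumers: producer operation -> list of operations using its result
    """
    rd = [0] * (num_csteps + 1)      # boundaries 1 .. num_csteps - 1 used
    for producer, users in consumers.items():
        if not users:
            continue
        last_use = max(cs[u] for u in users)
        # Live across every boundary from cs[producer] up to last_use - 1.
        for r in range(cs[producer], last_use):
            rd[r] += 1
    return rd


cs = {"m1": 1, "m2": 1, "a1": 2, "a2": 3}
consumers = {"m1": ["a1"], "m2": ["a2"], "a1": ["a2"]}
rd = register_density(cs, consumers, num_csteps=3)
registers_needed = max(rd)
```

The result of m2 stays live across two boundaries because its only consumer is scheduled in c-step 3, and the peak of two live values gives the number of registers under the theorem's assumptions.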
Since the interconnection cost of recent chips is comparable to the cost of functional units, it is necessary to consider the interconnection cost during the scheduling phase. The interconnection cost usually depends on the interconnection style, such as bus-based, point-to-point, or multiplexer-based interconnection. It is hard to estimate the interconnection cost without any knowledge of the interconnection style and the interconnection pattern resulting from data-path synthesis. Therefore, we focus only on bus cost. To minimize bus cost, we have to minimize the number of concurrent data transfers, because the number of busses required is bounded below by the maximum number of concurrent data transfers over all c-steps. The number of concurrent data transfers in a c-step is equal to the number of unique inputs and outputs of the operations that are assigned to the c-step. Thus, the bus cost can be handled in a similar way as for the operations.
H. Approach for Mintime Scheduling
The scheduling algorithm mentioned so far focused on the mincost problem, that is, finding a schedule having minimal hardware cost under the given number of c-steps.
The scheduling algorithm can be extended to the mintime scheduling problem. To determine an upper bound on the number of c-steps, list scheduling is applied before the proposed scheduling is invoked. If the hardware cost produced by the proposed algorithm with one fewer c-step than that of the list scheduling is within the hardware constraints, another scheduling with an even smaller number of c-steps is tried to check whether there is a better solution satisfying the given hardware constraints. This process is repeated until a schedule with a hardware cost greater than the hardware constraint is produced.
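The outer loop of this mintime extension can be sketched as follows. The mincost scheduler is passed in as a callback; the name `mintime_steps`, the callback signature (number of c-steps in, hardware cost out, or None when no schedule fits), and the cost table are hypothetical stand-ins for illustration.

```python
def mintime_steps(upper_bound, mincost, cost_limit):
    """Smallest number of c-steps whose mincost schedule fits cost_limit.

    upper_bound: c-steps used by an initial list schedule that is assumed
                 to satisfy the hardware constraint
    mincost:     callable(num_csteps) -> hardware cost, or None if no
                 schedule exists with that many c-steps
    """
    best = upper_bound
    steps = upper_bound - 1
    while steps >= 1:
        cost = mincost(steps)
        if cost is None or cost > cost_limit:
            break            # constraint exceeded: stop shrinking
        best = steps
        steps -= 1
    return best


# Hypothetical cost table standing in for runs of the mincost scheduler.
cost_table = {6: 4, 5: 6, 4: 11}
steps_found = mintime_steps(7, cost_table.get, cost_limit=6)
```

With a cost limit of 6, schedules with six and five c-steps are accepted, and the attempt with four c-steps exceeds the limit, so the search stops at five.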
VIII. EXPERIMENTAL RESULTS
The proposed scheduling algorithm was implemented in the C language on a SUN4 computer (10 MIPS). To show the effectiveness of the proposed algorithm, five results are tabulated in Tables II-VI, where every result of FAMOS was obtained using the ASAP schedule as an initial solution. For comparisons with other methods, we have normalized the CPU time of each method using the MIPS information shown in Table I. Since the information on the Xerox 1108 LISP machine is not available to us, the normalized CPU time of FDS is not included in the tables.
The first is an example taken from [12], whose result is summarized and compared to ours in Table II. This example contains chained operations when the number of cycles is less than eight. The results of ALPS shown in Tables II-V are optimal, as they are obtained via integer linear programming.
The second example is the differential equation adopted from FDS [5], which contains multi-cycle operations. Multiplications are given a delay of two cycles, while additions are given a delay of half a cycle. In this example, the critical path is six cycles long, and the result is shown in Table III. We have used the extension algorithm for multi-cycle operations explained in Section VII-C.
The third example is the pipelined 16-point digital FIR filter which is taken from SEHWA [7] . For this example, the latency of inputs to the pipeline is equal to three cycles, and there are two types of functional units, multiplier and adder. One cycle time is 100 ns, and a multiplication takes 80 ns while an addition takes 40 ns. The result is summarized in Table IV with that of SEHWA. In SEHWA, the optimal solution was achieved only through the exhaustive search which is rather time-consuming. However, FAMOS produced an optimal solution in a significantly reduced CPU time. The result of percolation scheduling was obtained with a different assumption that latches are inserted only when the chaining is impossible rather than to every c-step boundary.
The fourth example is a fifth-order elliptic wave filter borrowed from [23]. This example contains 26 additions and eight multiplications. A multiplication takes two clock cycles, while an addition takes one clock cycle. The critical path of the wave filter is 17 cycles long. In the case of 18 cycles, as shown in Table V, FDS obtained a solution composed of three adders and two multipliers, whereas FAMOS obtained the optimal solution with two adders and two multipliers.

Table VI shows results for the fifth-order elliptic wave filter using a two-stage pipelined multiplier. For this example, we have used the extension scheme for pipelined functional units. Path-based scheduling uses fewer c-steps than FAMOS because it chains operations and uses multipliers that take one cycle.

TABLE I
INFORMATION ON COMPUTERS

Method            Computer                  MIPS [24]
MAHA [12]         VAX-11/750                1
SEHWA [7]         VAX-11/750                1
FDS [5]           Xerox 1108 LISP machine   n/a
ALPS [6]          VAX-11/8800               12
Path [26]         IBM PC/RT                 6
Percolation [27]  n/a                       n/a
FAMOS             SUN4/280                  10
IX. DISCUSSION
The proposed scheduling algorithm is based on the move acceptance strategy developed by Kernighan and Lin. One drawback of the approach is that the result depends on the initial schedule. In our experiments, however, this did not cause serious problems, because optimal schedules were obtained in all cases when the ASAP schedules were used as initial solutions. Starting from the ALAP schedule, we also obtained optimal solutions except for one case (wave filter example, 21 c-steps), in which one more adder was required. Therefore, in our implementation, we pick the better of the two schedules produced by starting from the ASAP and ALAP schedules, respectively.
We have obtained rather satisfactory results even though we did not explicitly take the successor/predecessor cost [5] into account in the selection function. FAMOS produces good results without this cost because the algorithm has a hill-climbing property: it selects a set of tuples rather than a single tuple during each iteration pass. This property has an effect similar to that of the successor/predecessor cost. However, better schedules may be achievable for large problems by incorporating the successor/predecessor cost into the selection function.
Although many scheduling problems have been addressed so far, there are still many problems to be solved in future work:
1) As stated before, the final schedule depends on the initial schedule, so a constructive scheduling method that yields a good initial schedule and generates various initial schedules is required to explore the design space.
2) A block-level transformation is necessary because a high-level description is usually composed of many CDFG blocks.
3) One assumption of the proposed scheduling is that the type of functional unit for each type of operation has been determined a priori. Consider a mapping from the types of operations to the types of functional units. The assumption allows many-to-one as well as one-to-one mappings, but does not allow one-to-many mappings. This does not mean that only single-function units are permitted in our scheduling algorithm. Since many-to-one mappings are permitted, multi-function units such as ALUs are allowed, provided that there is only one type of functional unit suitable for each type of operation. In real-world situations, however, there may be many types of functional units suitable for one type of operation. In this case it is difficult to calculate the density function. Hence, an algorithm for allocating functional units should be developed to take such a case into account.
4) A digital system consists of a data-path part and a control-path part. The cost of the control path depends on the controller style and is not considered in the scheduling algorithm. To include the cost of the control path, we need to develop a scheme that estimates this cost efficiently.
X. CONCLUSION
In this paper, we have presented a new scheduling algorithm called FAMOS, which improves an initial schedule by selecting a set of tuples in each iteration pass. The algorithm is able to escape from local minima and tends to reach a globally optimal solution. The proposed algorithm has polynomial time complexity in spite of its iterative nature. As shown in the experiments, the algorithm is faster than the previous approaches and is expected to have a significant advantage for large examples that are difficult to solve with exponential-time algorithms such as ILP. The proposed algorithm produced optimal results for many examples in a short computation time, although optimality is not guaranteed. Real-world constraints such as multi-cycle operations, chained operations, and pipelined data-paths are also taken into account in the proposed graph model, called WPG, on which our scheduling algorithm is based. The proposed algorithm is capable of handling diverse design criteria and design styles simply by modifying the selection function.
APPENDIX TESTING OF MUTUAL EXCLUSION
The input CDFG may contain conditional branches. To deal with such a CDFG, we have to determine whether two operations are mutually exclusive. In SEHWA [7], this problem is solved by a node coloring algorithm, which assigns to each operation a color code consisting of a sequence of one or more integers, so that mutual exclusion between operations is tested by comparing their color codes. Bridge [25] tests the mutual exclusion between two basic blocks by checking the intersection of their active cube sets. The algorithm presented in this appendix is based on an interval graph associated with a tree structure. We construct a tree for each set of operations lying between the outermost distribution and join operations. Such a set of operations can be converted to a tree structure.
For example, consider the set of operations depicted in Fig. 8(a), which is a subset of an example quoted from SEHWA. The corresponding tree is shown in Fig. 8(b), in which each node contains a set of operations that are in the same distribution level. In terms of the SEHWA algorithm, the operations in each node of the tree correspond to the operations with the same color code. An interval is assigned to each node of the tree, as shown in Fig. 8(b).
Each node has the property that its interval includes the intervals of its child nodes. To assign such intervals, we first assign an interval of length 1 to each leaf node, from the left-most leaf to the right-most leaf. This is achieved by a depth-first search of the tree with a left-child-first strategy. Each non-leaf node is then assigned an interval containing the intervals of its child nodes. This can be achieved by a post-order traversal, i.e., by assigning an interval to a node when we visit it for the last time.
The mutual exclusion of two operations is easily tested by checking whether their corresponding intervals overlap. If the intervals overlap, the operations are not mutually exclusive; otherwise, they are mutually exclusive.
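The interval assignment and the overlap test can be sketched as follows. This is a minimal illustration, assuming half-open integer intervals and a small hypothetical tree; it does not reproduce the tree of Fig. 8(b), and the class and function names are not taken from the original implementation.

```python
class Node:
    def __init__(self, ops, children=()):
        self.ops = list(ops)            # operations in the same distribution level
        self.children = list(children)  # ordered left to right
        self.interval = None            # half-open interval (lo, hi)

def assign_intervals(node, next_leaf=0):
    """Post-order, left-child-first traversal; returns the next free position."""
    if not node.children:
        node.interval = (next_leaf, next_leaf + 1)   # leaf: interval of length 1
        return next_leaf + 1
    lo = next_leaf
    for child in node.children:
        next_leaf = assign_intervals(child, next_leaf)
    node.interval = (lo, next_leaf)     # covers the intervals of all children
    return next_leaf

def op_intervals(root):
    """Map every operation to the interval of the node that contains it."""
    table, stack = {}, [root]
    while stack:
        node = stack.pop()
        for op in node.ops:
            table[op] = node.interval
        stack.extend(node.children)
    return table

def mutually_exclusive(a, b):
    """Two operations are mutually exclusive iff their intervals do not overlap."""
    return a[1] <= b[0] or b[1] <= a[0]

# Hypothetical tree: the root holds op 1; its left child holds ops 2 and 3 and
# has two leaves (ops 4 and 5); its right child is a leaf holding op 6.
root = Node([1], [Node([2, 3], [Node([4]), Node([5])]), Node([6])])
assign_intervals(root)
iv = op_intervals(root)
print(mutually_exclusive(iv[4], iv[5]))  # sibling branches
print(mutually_exclusive(iv[2], iv[4]))  # parent node and its child
```

Because a parent's interval contains those of its children, an operation in a parent node is reported as not mutually exclusive with any operation below it, while operations in sibling branches receive disjoint intervals.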
However, there is a case that is difficult to convert to a tree structure. For example, consider the CDFG block shown in Fig. 9(a), in which the pairs of operations (5, 8), (5, 9), (6, 8), and (6, 9) are not mutually exclusive. In this case, the node coloring algorithm of SEHWA does not work because it is not possible to assign such color codes to operations 5, 6, 8, and 9.
The node denoted by a dark circle in Fig. 9(b) is inserted to denote that its subtrees are not mutually exclusive. The mutual exclusion is checked using the following rules.
1) If two operations are in the same node of the tree, they are not mutually exclusive.
2) If the deepest common parent node of two operations is not a dark circle, they are mutually exclusive.
3) Otherwise, they are not mutually exclusive.
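The three rules can be sketched as below. The tree is hypothetical and only loosely modelled on the situation described for Fig. 9; the fork nodes are left empty because the figure is not reproduced here. One case is not covered by the stated rules: when one operation's node is an ancestor of the other's, this sketch assumes the pair is not mutually exclusive, in line with the interval-inclusion property described earlier.

```python
class Node:
    def __init__(self, ops=(), dark=False):
        self.ops = set(ops)
        self.dark = dark      # True for the inserted dark-circle node
        self.parent = None

    def add(self, *children):
        for child in children:
            child.parent = self
        return self

def path_to_root(node):
    """Nodes from the given node up to the root, the node itself first."""
    path = []
    while node is not None:
        path.append(node)
        node = node.parent
    return path

def mutually_exclusive(node_of, a, b):
    na, nb = node_of[a], node_of[b]
    if na is nb:
        return False                      # rule 1: same node
    path_a, path_b = path_to_root(na), path_to_root(nb)
    # Assumption (not among the stated rules): an operation in an ancestor
    # node is not mutually exclusive with operations below it.
    if na in path_b or nb in path_a:
        return False
    deepest_common = next(n for n in path_a if n in path_b)
    return not deepest_common.dark        # rules 2 and 3

# Hypothetical tree: a dark root joining two distribution-join blocks.
n5, n6, n8, n9 = Node({5}), Node({6}), Node({8}), Node({9})
upper = Node().add(n5, n6)
lower = Node().add(n8, n9)
root = Node(dark=True).add(upper, lower)
node_of = {5: n5, 6: n6, 8: n8, 9: n9}

print(mutually_exclusive(node_of, 5, 6))  # branches of the same block
print(mutually_exclusive(node_of, 5, 8))  # blocks joined by the dark node
```

In this sketch, operations 5 and 6 (and likewise 8 and 9) sit in different branches of the same block and are reported as mutually exclusive, while every pair that spans the two blocks is not.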
However, this type of mutual exclusion test, in which one operation is in the upper distribution-join block and the other is in the lower distribution-join block, is exceptional and is expected to occur rarely, because such operations are not usually assigned to the same c-step.
