We address the problem of time-stationary control synthesis for pipelined data paths. Control
INTRODUCTION
Pipelining is a widely used approach for designing high-performance digital circuits. As design size increases, pipelined architectures become quite complex, and automatic pipeline synthesis tools are necessary to cope with this complexity. Pipelined data path synthesis problems have been well investigated in [1, 2, 5]. Sehwa [1] performs allocation of functional modules and scheduling of resources, and estimates the cost of registers and interconnections. In [5], a method for module assignment and Register-Transfer (RT) level synthesis of pipelined data paths was presented. Once an RT-level data path is obtained, the corresponding controller can be synthesized.
A time-stationary control mechanism [6] provides the control signals for the entire pipeline
*This work was supported by NSF Grant #MIP-8909677 and by a TRW fellowship. Based on "Automatic synthesis of time-stationary controllers for pipelined data paths", by J.T. Kim
Paulin [7] that such a partitioning minimizes the total controller area.
We describe a method to generate control specifications of pipelined data paths which makes use of the node labelling and mutual exclusion testing techniques described in [1]. The synthesis of the sequencing part follows the behavioral synthesis tasks of scheduling and resource allocation. Once the RT-level data path synthesis is done (i.e., the tasks of module assignment and of multiplexor and register allocation), the control requirements of each RT-level component are known and thus the command logic can be synthesized. Figure 2 describes the flow of tasks in the overall high-level synthesis of data path and control.
Section 2 presents the control specification of pipelined data paths. The [8] dealt with the automatic production of control specifications from high-level behavioral descriptions in control and timing graph form and is designed for interface processors.
Bridge [9] is a high-level synthesis system developed at AT&T Bell Laboratories that performs data path and control path allocation for nonpipelined systems by applying either a local slicing or a global slicing technique. The Bridge system starts with a micro-architecture model and sometimes the implementation can be reduced to a finite state machine or a combinational circuit. Structural synthesis [12] [26]. In [7], Paulin and
In order to avoid any ambiguity, it is important to define the term latency, which will be used extensively throughout this paper.
DEFINITION
The number of time units between two consecutive initiations in a pipeline is called the latency, L, of the pipeline.
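To make the definition concrete, the following minimal sketch shows which pipeline stages are busy at a given time step when a new initiation enters every L time units. The function names, the 0-based numbering of stages and time steps, and the parameter `n_stages` are assumptions made for illustration only; the grouping of stages by their index modulo L is one way to picture why a schedule with latency L gives rise to L groups of overlapping stages.

```python
def active_stages(t, latency, n_stages):
    """Stages (0-based) occupied at time step t (0-based).

    Assumes initiation k enters stage 0 at time k * latency and advances
    one stage per time step, so it occupies stage t - k * latency at time t.
    """
    return [s for s in range(n_stages) if s <= t and (t - s) % latency == 0]


def stage_groups(latency, n_stages):
    """Stages that can be active together in steady state, grouped by s mod L."""
    return [list(range(g, n_stages, latency)) for g in range(latency)]


if __name__ == "__main__":
    # Hypothetical pipeline: 7 stages, latency 3.
    for t in range(8):
        print(t, active_stages(t, latency=3, n_stages=7))
    print(stage_groups(latency=3, n_stages=7))
```

Once the pipeline is full, the set of active stages at time t depends only on t mod L, which is why the sketch yields exactly L groups.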
Loops can be modeled in several different ways, e.g. [13, 14]. Loops with a small number of iterations can be handled by unrolling as in [1]. Loops can also be treated as conditional blocks as described in [15].
Although the technique in this paper can easily be extended to multi-way D-Js, we will assume that the D-J pairs are two-way splits.
Figure 3. In Figure 3(a), Figure 3(a), since NOP nodes and time steps are overlapped. But they can be reduced to four states after rearrangement of the distribute-join block in Figure 3(b).
In this paper, we will use the terms stage, time step, and control step interchangeably.
State Decisions
It is assumed that the CDFG schedule is pipelined with a latency of L and that the total number of stages (or time steps) is n_t. Conditional branches are handled by using an algorithm described in [1].
The algorithm assigns to every node a label consisting of a sequence of one or more integer codes. Using these labels, we can test for mutual exclusion between any pair of nodes (operations) in pseudo-constant time. Before going any further, we define the following terms, which assume a CDFG scheduling with a latency L. Figure 4 shows a scheduled CDFG with node labels. For convenience, we identify the operations in a time step by their node labels between parentheses. Table III shows the state identification procedure. In time step 2, we have 5 nodes and there are two mutual exclusion sets, which are M2,1 (10, 110, 111) and M2,2 (20, 21). There
FIGURE 4 Example Sehwa-1: scheduled CDFG with node labelling, L = 3.
a. We assume that the D operation itself takes negligible time to execute compared to the other operations.
the corresponding input condition is assumed to be a don't care, in which case we just find a state which has the same node label or is compatible with the current state. The procedure is presented in Figure 6. Going back to the example described in Section 2.2, Figure 4 and
The experimental results in Section 4 show that partitioning the groups of states into more than two partitions can result in more area-efficient implementations than two-way partitioning. The multiway partitioning algorithm is a simple modification of the two-way partitioning.
Branch and Bound Method
As noted in Section 2, the presence of loops in a CDFG schedule could easily make the latency L relatively large. Exhaustive search would then become computationally intractable. In these cases, the problem is reformulated as a branch-and-bound algorithm to reduce the search space.
In our branch and bound method a decision tree is constructed as follows. The ith level in the tree represents a possible two-way partition with partition sizes of i and n - i, where n is the total number of distinct objects (or groups). The labels on the edges denote the objects selected. For example, the first level shows the possible partitions of sizes 1 and n - 1, the second level represents the possible partitions of sizes 2 and n - 2, and so on, as shown in Figure 7. We observe that there are n levels in the decision tree.
5. The estimate does not take into consideration the effect of further logic minimization and is only an approximation of the final result.
6. Note that the tree is not balanced.
In order to implement an efficient branch and bound technique, we need to derive a good bounding function which can be easily computed. In our case, the cost function to be minimized is the sum of the areas of the resulting partitions. In the following, we discuss this issue and propose a bounding technique to be used by the search algorithm. Here, we assume that the target implementation is done using PLAs. Let Figure 8(a). Let w and h be the sides of the rectangle which shows the unpartitioned design. Let a and b be the sides of one partition as shown in Figure 8.
Based on these observations we developed a new bounding technique to be used in the Branch and Bound method for our horizontal partitioning. While the decision tree is searched down to the leaf nodes, point P(a, b) in Figure 8 moves from point S to point T through Quadrants II, III, or through point C. Subsequently, the search will go down on the sub-branches until the point P reaches Quadrant IV. Beyond this point, any further move would increase the area f(a, b), and further search through this branch of the tree is not necessary, as can be seen in Figure 9.
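As a worked check of the quadrant argument, assume (this form is an assumption chosen to match the stated behaviour, not a formula given explicitly here) that the two partitions are rectangles of sides a by b and (w - a) by (h - b), so that the cost is the sum of their areas:

```latex
\begin{align*}
f(a,b) &= ab + (w-a)(h-b) \\
       &= \frac{wh}{2} + 2\left(a - \frac{w}{2}\right)\left(b - \frac{h}{2}\right), \\
\frac{\partial f}{\partial a} &= 2b - h, \qquad
\frac{\partial f}{\partial b} = 2a - w .
\end{align*}
```

Under this assumption, f equals wh at both S = (0, 0) and T = (w, h) and equals wh/2 at the centre point C = (w/2, h/2). While a < w/2 and b < h/2 (the quadrant containing S) both partial derivatives are negative, so f decreases as P moves away from S. Once a > w/2 and b > h/2 (the quadrant containing T) both are positive, f exceeds wh/2, and any further growth of a and b can only increase the area. In the two remaining quadrants f stays below wh/2.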
The necessary and sufficient conditions to detect that point P has already reached Quadrant IV are that: (1) the resulting area of the child node is larger than the parent node's (i.e., f(a, b) is now increasing), and (2) f(a, b) of the child node is greater than wh/2. These two
and bound based algorithm. The fewer levels we need to investigate, the better this branch and bound scheme works. In other words, if the minimum point is found at the earlier levels for each branch, then the search space is reduced significantly. To satisfy this objective, the objects are sorted according to the size of their diagonal (i.e., by increasing distance to point S).
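The pruning test can be illustrated with a minimal sketch. It rests on assumptions that go beyond the text: each object (group) is modelled as a pair of hypothetical width and height contributions that add up along a branch, and the cost is taken to be f(a, b) = ab + (w - a)(h - b). The function names and the sample data are illustrative only; an actual PLA area estimate would replace `area`.

```python
from itertools import combinations


def area(a, b, w, h):
    # Assumed cost model: one partition is a x b, the other (w - a) x (h - b).
    return a * b + (w - a) * (h - b)


def exhaustive(objs):
    """Try every two-way split; return (best_area, evaluations)."""
    w = sum(x for x, _ in objs)
    h = sum(y for _, y in objs)
    best, evaluations = w * h, 0
    for k in range(1, len(objs)):              # level k: sizes k and n - k
        for subset in combinations(objs, k):
            a = sum(x for x, _ in subset)
            b = sum(y for _, y in subset)
            best = min(best, area(a, b, w, h))
            evaluations += 1
    return best, evaluations


def branch_and_bound(objs):
    """Depth-first search with the two-condition pruning test."""
    w = sum(x for x, _ in objs)
    h = sum(y for _, y in objs)
    # Sort by increasing diagonal, i.e. increasing distance from point S.
    objs = sorted(objs, key=lambda o: o[0] ** 2 + o[1] ** 2)
    n = len(objs)
    state = {"best": w * h, "evaluations": 0}

    def dfs(start, a, b, depth, parent_area):
        for i in range(start, n):
            if depth + 1 == n:                 # would select every object
                continue
            na, nb = a + objs[i][0], b + objs[i][1]
            child = area(na, nb, w, h)
            state["evaluations"] += 1
            state["best"] = min(state["best"], child)
            # Prune: the area is rising (1) and already above w*h/2 (2).
            if child > parent_area and 2 * child > w * h:
                continue
            dfs(i + 1, na, nb, depth + 1, child)

    dfs(0, 0, 0, 0, w * h)
    return state["best"], state["evaluations"]


if __name__ == "__main__":
    groups = [(3, 5), (2, 2), (4, 1), (1, 6), (5, 3), (2, 4)]  # made-up sizes
    print("exhaustive:      ", exhaustive(groups))
    print("branch and bound:", branch_and_bound(groups))
```

In this additive model the pruned subtrees can only contain larger areas, so both searches return the same minimum while the pruned search evaluates no more nodes than the exhaustive one.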
Heuristic Approach
While the branch and bound approach can significantly cut the search time, its worst-case performance is still exponential. In order to further reduce the partitioning runtime, we developed a heuristic algorithm which is a variation of the branch and bound method. As we mentioned in the previous section, f(a, b) is decreasing with both a and b in Quadrant I. This means that f(a, b) is likely to be large in the earlier levels, since point P is then close to point S in Quadrant I. So, by avoiding the computation of f(a, b) in the earlier levels, we can reduce the overall amount of computation. While the Branch and Bound scheme prevents searching in Quadrant IV, the heuristic approach tries to reduce the search in both Quadrants I and IV. It controls the starting level of the next sub-branch in Quadrant I and adopts the Branch and Bound method in Quadrant IV. As depicted in Figure 10, the tree is searched in a depth-first manner. If the minimum value of the previous sub-branch is found at level l, then in the next sub-branches we start the search from the same level l. Thus, savings can be achieved if the next sub-branch would otherwise start from level k where (1 ≤ k < l), in which case we save l - k calculations. Figure 10 shows an example of the design space with a latency of 6, where the numbers inside the circles show the total normalized PLA area of each partitioning. Since the minimum partition is found in level 3 (1-2-5), only the solutions at levels greater than or equal to level 3 are computed. Thus only the space between lines 1 and 2 is the design space which the heuristic algorithm explores, and the space above line 3 is the one for the branch and bound method.
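One possible reading of the heuristic is sketched below. It is an interpretation rather than the algorithm of Figure 11: a "sub-branch" is taken to be the subtree under each first-level choice, objects are hypothetical (width, height) pairs that add up along a branch, and the cost is assumed to be f(a, b) = ab + (w - a)(h - b). All names are illustrative.

```python
from itertools import combinations


def area(a, b, w, h):
    # Assumed cost model: sum of the areas of the two partitions.
    return a * b + (w - a) * (h - b)


def optimum(objs):
    """Reference answer by exhaustive enumeration (for comparison only)."""
    w = sum(x for x, _ in objs)
    h = sum(y for _, y in objs)
    return min(area(sum(x for x, _ in s), sum(y for _, y in s), w, h)
               for k in range(1, len(objs)) for s in combinations(objs, k))


def heuristic_search(objs):
    """Depth-first search that skips area evaluation above a start level."""
    w = sum(x for x, _ in objs)
    h = sum(y for _, y in objs)
    objs = sorted(objs, key=lambda o: o[0] ** 2 + o[1] ** 2)
    n = len(objs)
    state = {"best": w * h, "evaluations": 0, "start_level": 1}

    def visit(i, a, b, level, parent_area, branch):
        a, b = a + objs[i][0], b + objs[i][1]
        if level == n:                          # all objects selected: no split
            return
        current = None
        if level >= state["start_level"]:       # evaluate only from start level
            current = area(a, b, w, h)
            state["evaluations"] += 1
            if current < branch["min"]:
                branch["min"], branch["level"] = current, level
            # Branch and bound test near point T (Quadrant IV).
            if (parent_area is not None and current > parent_area
                    and 2 * current > w * h):
                return
        for j in range(i + 1, n):
            visit(j, a, b, level + 1, current, branch)

    for i in range(n):                          # one sub-branch per first choice
        branch = {"min": float("inf"), "level": state["start_level"]}
        visit(i, 0, 0, 1, None, branch)
        state["best"] = min(state["best"], branch["min"])
        state["start_level"] = branch["level"]  # reuse level of the last minimum
    return state["best"], state["evaluations"]


if __name__ == "__main__":
    groups = [(3, 5), (2, 2), (4, 1), (1, 6), (5, 3), (2, 4)]  # made-up sizes
    print("heuristic:", heuristic_search(groups), "optimum:", optimum(groups))
```

Because levels above the start level are never evaluated, this variant can miss the true minimum; it trades optimality for fewer area computations.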
Both approaches were implemented in one program as shown in Figure 11. Currently, it is up to the user to decide which method will be used in partitioning. Table III compares the efficiency of the branch and bound method and the heuristic algorithm with respect to the exhaustive search on the quadratic equation solver example depicted in Figure 12. In this example we assume that the values a, b, and c in the equation (ax^2 + bx + c = 0) are 8-bit integers, so that the number of iterations needed to compute the square root is 7. We unrolled the loop completely and ran Sehwa to schedule the CDFG with different values of latency (L up to 11). As explained in Section 2.2, there are L groups of states to be partitioned. In order to show the efficiency of our branch and bound and heuristic approaches, we created additional cases (of latency more than 11) by duplicating the groups, i.e., we use each group twice. For example, the case with a latency of 22 is made up by taking the scheduling with a latency of 11 and using each group twice. The heuristic always finds the optimal partition except when the latencies are 11 and 22. In these cases, the heuristic finds suboptimal solutions which are only 1.2% and 1.8% greater in area than the optimal solution, respectively.
State Encoding
Once the horizontal partitioning of the state table is done, we need to perform state encoding. Given a set of coding constraints, the objective of this procedure is to assign state codes so that the size of the sequencing logic is reduced. We generate coding constraint groups consisting of states having the same next state and matching primary inputs. States in the same coding constraint group can be collapsed into one common PT, thus reducing the number of states. In addition to the saving obtained by horizontal partitioning, we can also reduce the number of bits per state in the two-way partitioning case by assigning even codes to all the next states of one partition (in a PLA, this will set the last column in the OR-plane to all zeros, and in random logic, this will reduce the gate count and the wiring). To decide on a candidate for this reduction, we compute the number of next states in each partition and check if it is less than ⌈log2(total number of states)⌉/2. If this applies to both partitions, we choose the partition that can result in a larger reduction in area. This is always possible since the number of next states in either partition is less than or equal to half of the total number of states, and the number of available codes is always at least equal to the total number of states. Furthermore, PTs not in the current partition are included but their next states are set to don't cares. This allows further minimization by logic optimization tools such as Espresso [19] (for PLAs) or MIS [20] (for random logic), since it reduces the number of literals. The state encoding algorithm is shown in Figure 13. The encoding algorithm can be extended to multi-way partitioning in a straightforward manner by dividing the partitions into two blocks and assigning state codes to each as if it were a two-way partitioning.
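The even-code idea can be sketched as follows. This is only an illustration under stated assumptions, not the algorithm of Figure 13: it ignores the coding constraint groups, uses the minimum code length, and takes the feasibility check to be that the chosen next states fit into the available even codes (half of the code space). The function name and the state names are hypothetical.

```python
def encode_states(states, even_states):
    """Assign binary codes so that every state in even_states gets an even code.

    states:      list of all state names
    even_states: set of next states of the chosen partition (assumed input)
    """
    n = len(states)
    bits = max(1, (n - 1).bit_length())         # ceil(log2(n)) for n >= 2
    even_codes = list(range(0, 2 ** bits, 2))   # last bit 0
    odd_codes = list(range(1, 2 ** bits, 2))    # last bit 1
    if len(even_states) > len(even_codes):
        raise ValueError("not enough even codes for this partition")

    codes = {}
    for s in states:
        if s in even_states:
            codes[s] = even_codes.pop(0)
    spare = odd_codes + even_codes              # leftovers for the other states
    for s in states:
        if s not in codes:
            codes[s] = spare.pop(0)
    return {s: format(c, f"0{bits}b") for s, c in codes.items()}


if __name__ == "__main__":
    all_states = ["S0", "S1", "S2", "S3", "S4"]
    print(encode_states(all_states, even_states={"S1", "S3"}))
```

With every next state of the chosen partition ending in 0, the least significant next-state output of that partition is constant, which is the source of the column saving described above.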
RESULTS
In this section, we present some experimental results which were obtained by applying our approach to two design examples. The first example is from [1]; the second is a reduced instruction set version of the M6502 microprocessor
approach achieves better area savings compared to traditional synthesis methods.
The Sehwa Example
The first example CDFG [1] is shown in Figure 4. In Example Sehwa-2, the same DFG in Figure 4 is scheduled with latency L = 2. Here we use the io_hybrid encoding strategy for NOVA, since the i_exact encoding was computationally infeasible (we ran it for more than 70 hours on a Convex supercomputer). The savings are much greater in this case.
In the case of random logic controllers, we minimized the logic by using the MIS multi-level logic optimizer. Each partition was optimized [27]. [30]. In order to obtain an example of manageable size (which can be handled by Sehwa), we reduced the instruction set to four instructions. This resulted in the CDFG shown in Figure 15. Also, the original specifications were based on a non-pipelined scheduling, which was reflected in the assumed data path. In order to enable us to perform pipelined scheduling with high throughput, we made some modifications to the data path and the CDFG, as follows:
We assume a dual-ported memory in which two memory read operations are permitted to overlap. Only one memory write operation is permitted at a time, though.
We increased the number of various resources, such as registers, in order to enable the overlapped execution of some register operations in the CDFG.
We used Sehwa to schedule the CDFG with latencies L = 4 and 6. For each scheduling, we generated a pipelined RT-level implementation of the data path. Table V shows some statistics on the data paths and state tables for both latency values. Figure 16 shows the data path for L = 6. For both data paths, we used our algorithm to synthesize several implementations of the control part using both PLAs and standard cells. In each case, we generated layouts corresponding to various n-way partitionings of the groups of states for n = 1, 2, 4, 6.
