In this paper we give a model for predicting the shape of cost-speed tradeoff curves for pipelined designs. The model includes prediction of the number of operators, registers and multiplexers from a behavioral specification. It has been verified with the designs generated by an automated pipeline synthesis program, Sehwa. This model was developed as a part of the ADAM Advanced Design Automation System of the University of Southern California.
INTRODUCTION
Synthesizing datapaths automatically is computationally expensive for production designs. Many trial synthesis passes or computations are made to experiment with different sets of modules and varying degrees of concurrency and resource sharing. Sehwa [3], a pipeline datapath synthesis program that is part of the USC ADAM system, is a good example of such software. The input to Sehwa is a dataflow graph and a set of module types which can be used to implement the operations of the dataflow graph. As output, Sehwa gives the number of operators of each type required and the schedule of the dataflow graph. Sehwa also takes into consideration conditional branching within the dataflow graph and resynchronization due to resource conflicts and data dependencies. The scheduling is static and takes into account all possible combinations of conditional branches which can occur. First, Sehwa produces the fastest and the cheapest designs to fix the design boundary. Sehwa then asks the user for a speed or cost constraint and generates several solutions meeting the constraint. The user then changes the constraints and iterates. Finally, the user can perform an exhaustive search in a small part of the design space to tune the design. Once a design is selected, redesigning may occur with a ...

This research was supported by the Semiconductor Research Corporation contract 8601075 and by the Army Research Office Grant DAAG29-83-K-0147.
Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission.
If one could predict approximately where designs would occur in the design space, the search for a satisfactory design could be narrowed considerably.
At the University of Southern California, we have developed a technique for predicting an approximate cost/speed tradeoff curve for pipelined datapaths, from a specification of the desired functional behavior.
The prediction technique presented here is part of the ADAM Advanced Design AutoMation [4] system being constructed at USC, and works in conjunction with Sehwa. However, the technique is general and can be applied to any pipeline synthesis system which produces near-optimal results.
In the next section we give the problem description and the solution approach. Section 3 discusses the lower bound estimates for area and time. Section 4 contains the list of experiments which were conducted using the estimation techniques and Sehwa. The last section concludes with experimental results and observations.

Furthermore, for the sake of estimating lower bounds, we assume a zero resynchronization rate for the pipeline. That is, we assume no data dependency or resource conflicts. This does not affect our results, as resynchronization serves only to increase the delay for the same design area, and such designs lie above our predicted lower bounds.
We also assume $clock\_cycle = maximum(delay_i)$. This assumption is valid for our lower bound estimate studies, and will be explained at the end of Section 3.1. We are able to use the simple estimation technique mentioned above because the theory of pipeline synthesis predicts such curves. We now present the theoretical foundation of our technique.
Operator and AT Curve Estimation
We shall first compute the effective number of operations of a given type, $c\_opn_i$, in a dataflow graph. Consider a dataflow graph with one conditional branch and one type of operation, foo. The YES path from the IF-CONDITION node has 10 foo nodes and the NO path has 5 foo nodes. Also, the remaining dataflow graph (with no conditional branches) has 3 foo nodes. Then, the maximum number of foo nodes which can be executed for any set of inputs is $3 + maximum(5, 10) = 3 + 10 = 13$.
Thus $c\_opn_{foo} = 13$ for this dataflow graph. We can generalize this counting mechanism over a dataflow graph with many conditional branches and operations.
For a dataflow graph with no conditional branching, let $c\_opn_i$, the effective number of operations of a given type, be the same as $opn_i$. For a dataflow graph with conditional branching, we compute $c\_opn_i$ in the following manner. For the portion of the graph in which there is no conditional branching, $c\_opn_i$ is calculated in the same way as $opn_i$. For the conditional branches, we take the longest path emanating from the distribute node and find the number of operations in that path. Then we take the other paths and add to each $c\_opn_i$ the number by which the operation count in the shorter path exceeds the count in the previously traversed paths. For example, in ...

We use $c\_opn_i$ to define the utilization of each operator. A utilization of 1 is optimal.
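The counting rule above can be sketched in a few lines of Python. This is only an illustration for a graph with a single conditional: the function name `effective_op_counts` and the dictionary-based graph summary are assumptions made here, not part of Sehwa or ADAM. Adding, for each path, the excess over previously traversed paths amounts to taking the per-type maximum over all paths, which is what the sketch does.

```python
from collections import Counter


def effective_op_counts(unconditional, branch_paths):
    """Return the effective operation count per operator type.

    unconditional: mapping op type -> count outside the conditional.
    branch_paths:  list of mappings, one per path leaving the
                   distribute node (hypothetical input format).
    """
    # Worst case per type over all paths of the conditional.
    worst = Counter()
    for path in branch_paths:
        for op, count in path.items():
            worst[op] = max(worst[op], count)

    # Unconditional operations are always executed, so they add up.
    effective = Counter(unconditional)
    effective.update(worst)  # Counter.update adds counts
    return effective


# The foo example from the text: 3 unconditional nodes,
# 10 on the YES path, 5 on the NO path.
foo_counts = effective_op_counts({"foo": 3}, [{"foo": 10}, {"foo": 5}])
print(foo_counts["foo"])
```

For the foo example this gives 3 + maximum(5, 10) = 13. Nested conditionals would need the same rule applied recursively, which the sketch does not attempt.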
For a given dataflow graph and a set of possible module types, the right hand side is a constant, and hence $(AT)_{min} = constant$. □
We shall now justify the assumption that $clock\_cycle = maximum(delay_i)$, where the maximum is taken over all types of operators. We make two observations for our justification.
For a given latency we can compute the minimum number of operators of each type required for the implementation of the dataflow graph as $opt_i = \lceil c\_opn_i / latency \rceil$. Also, the area of the implementation, $A$, counting only the operators, can be computed as $A = \sum_{i} (area_i \times opt_i)$. We observe that making latency a constant also fixes the minimum required number of each type of operator. Thus for a fixed latency, the minimum area $A$ of the design becomes a constant.
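As a rough illustration of how such a lower-bound curve could be tabulated, the sketch below sweeps the latency and computes area, time and their product. It is not Algorithm 1 itself. It assumes that the minimum operator count is the ceiling of $c\_opn_i / latency$ (the text mentions a ceiling function in computing $opt_i$), that the time per task is latency times the clock cycle, and that the clock cycle is $maximum(delay_i)$. The function name `at_lower_bound` and all module figures are hypothetical.

```python
import math


def at_lower_bound(c_opn, area, delay, max_latency=None):
    """Tabulate (latency, A, T, A*T) for latency = 1..max_latency.

    c_opn, area, delay: mappings keyed by operator type.
    Registers and multiplexers are ignored in this sketch.
    """
    clock_cycle = max(delay.values())      # assumed: slowest operator
    if max_latency is None:
        max_latency = sum(c_opn.values())  # one operation per cycle
    curve = []
    for latency in range(1, max_latency + 1):
        # Assumed form of the minimum operator count per type.
        opt = {op: math.ceil(n / latency) for op, n in c_opn.items()}
        a = sum(area[op] * opt[op] for op in opt)
        t = latency * clock_cycle          # assumed time per task
        curve.append((latency, a, t, a * t))
    return curve


# Hypothetical module set: 4 additions and 2 multiplications.
curve = at_lower_bound(
    c_opn={"add": 4, "mul": 2},
    area={"add": 10, "mul": 100},
    delay={"add": 20, "mul": 50},
)
for latency, a, t, at in curve:
    print(latency, a, t, at)
```

With these made-up numbers, latencies 1 and 2 divide both operation counts exactly, so the product stays at the same value for both. At latency 3 the ceiling rounds the adder count up and the product rises, which is the departure from a constant curve discussed in the text.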
Consider two pipeline designs of the same dataflow graph with the same latency, and hence the same area, but different clock cycles. The design with the lower clock cycle will have a lower $AT$ (as the latency and area of the two designs are the same). Under the above assumptions, the minimum value the clock cycle can take is $maximum(delay_i)$.

Note that a design with the above computed minimum $(AT)_{min}$ or $(AT)_{lb}$ may not exist. In fact, the best possible actual design might not be one with a minimum clock cycle. However, there can be no design which will have an $AT$ curve lower than the one computed by Algorithm 1 under the above assumptions.

A utilization of 1, and hence $(AT)_{min} = constant$, is not possible for every latency and every operator. In this case, although a lower bound curve $(AT)_{lb}$ does exist, it is not equal to a constant, due to the ceiling function used in computing $opt_i$. Our estimation technique, Algorithm 1, calculates the $(AT)_{lb}$ curve. In case the design is optimal it calculates ...

We shall now discuss the effect of register cost and delay on the curve. It is assumed that each operation is dyadic and produces exactly one output. We shall later show how this restriction can be relaxed. In our discussion we first estimate the number of internal registers which will be required for implementation. The number of registers (internal and external) can be estimated as maximum(number of external registers, number of internal registers).
The external registers are those required at the input or the output of the dataflow graph. We state that a value is consumed if it is not required in any subsequent clock cycles.

Any pipeline implementation of a dataflow graph will require registers to store the output of an operator which will be used in a future clock cycle. It may be used in the immediately next clock cycle, or several clock cycles later. The best case for our lower-bound analysis is the former, i.e. where a value is produced in one clock cycle and is consumed in the immediately next one, thus freeing the register to be used again. $\sum_{i} c\_opn_i$ gives the total number of operations which will be executed ... number of operations will be performed every clock cycle, and hence this number of registers will be required. □
If an operation is not dyadic and produces more than one output, we simply add each additional output to the summation in the above expression.
An example of a dataflow graph with its implementation and scheduling performed by Sehwa is given in Fig. 3. In this case the estimated number of registers and the number required by the implementation are the same.
If one considers the registers required at the input and the output of the dataflow graph, then it is not possible to make a deterministic analysis of the total number of registers required, given only the dataflow graph. This is because the consumption of an input value and the generation of an output value depend on the scheduling of the dataflow graph. A better deterministic minimum estimate of the number of registers required for implementation can be computed. This estimate requires the pipeline schedule of the dataflow graph as well as the latency. Assume that the dataflow graph is pipelined into $n$ stages.
Definition 2: Given a data flow graph and a pipeline schedule, $cut_j$ is defined to be the number of edges cut by the $j$th stageline.

... $total_j$ gives the number of edges crossing a time step. This gives the number of registers required in that clock cycle. As the registers can be shared between any two clock cycles, the minimum number of registers required, assuming all registers can be shared, will be that for the clock cycle with $maximum(total_j)$. □

The above theorem gives an estimate of the number of registers required for a given latency. Thus, to get register estimates over all possible latencies, the above computation is performed with latency varying from 1 to $\sum_{i} c\_opn_i$. A new schedule of the dataflow graph has to be supplied each time the latency is varied.
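The schedule-based register estimate might be computed along the following lines. The exact expression for the per-clock-cycle total did not survive in the text above, so this sketch rests on an explicit assumption: stagelines whose indices are congruent modulo the latency hold live values in the same clock cycle, so their cut counts are summed, and the estimate is the largest such sum. The function name `register_estimate` and the sample cut values are hypothetical.

```python
def register_estimate(cuts, latency):
    """Estimate the registers needed for one schedule and latency.

    cuts:    cuts[j] is the number of edges cut by stageline j.
    latency: clock cycles between successive task initiations.
    Assumption: stagelines j, j + latency, j + 2*latency, ... are
    active in the same clock cycle and cannot share registers.
    """
    if latency < 1:
        raise ValueError("latency must be at least 1")
    totals = [0] * latency
    for j, cut in enumerate(cuts):
        totals[j % latency] += cut
    # Registers can be reused across clock cycles, so the busiest
    # clock cycle determines the estimate.
    return max(totals)


# Hypothetical four-stage schedule.
cuts = [3, 2, 4, 1]
for latency in (1, 2, 4):
    print(latency, register_estimate(cuts, latency))
```

Because a new schedule has to be supplied for each latency, a real sweep would recompute the cut list before every call rather than reuse one list as this toy loop does.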
To evaluate the effect of registers on the lower bound curve, the clock cycle will now be the sum of ...

We now estimate the number of multiplexers required. ... i.e. the design has $O$ terminals to which the registers are connected. For example, for a design having 5 adders, each with two input terminals, the total number of terminals is $O = 2 \times 5 = 10$. Assume that all the registers will be connected to the terminals through multiplexers alone (i.e. no bus interconnect). Furthermore, if we restrict the usage of multiplexers to only one type (i.e. $d$-to-1 multiplexers), then we can calculate a lower bound on the number of multiplexers required for the design, as in the theorem below.

Hypothesis: Let the above equation be correct for $R = n$ and the number of multiplexers computed be $M = x$.
Induction: Let $R = n + 1$, and the number of multiplexers computed be $M = y$. We have two subcases here.
Subcase 1: $x = y$. This is the case where $x(d-1)$ is strictly greater than $(n - O)$. Hence, we can add a register (increase $n$ by 1) without having to increase $x$. We observe that this case arises when there is at least one multiplexer with at least one of its $d$ inputs unconnected, and the additional register can be connected to this input (Fig. 4a).
Subcase 2: $x + 1 = y$. In this case $x(d-1) = (n - O)$, and all the multiplexers have all their $d$ inputs connected. Thus, to connect another register to the design (increasing $n$ by 1), we have to increase the number of multiplexers by 1. Following Fig. 4b we shall observe the effect of connecting this new multiplexer.
To connect the new multiplexer, an already connected register will have to be removed from the design, the new multiplexer connected in place of the removed register, and this removed register added to the input of the new multiplexer.
We now have $d-1$ additional inputs to which $d-1$ additional registers can be connected. Hence, for every new multiplexer which is added to the design, $d-1$ additional registers can be connected without increasing the number of multiplexers. This $d-1$ also gives the divisor in the equation. The new register can now be connected to an input of the new multiplexer.
□
The restriction on using $d$-to-1 multiplexers can be relaxed, in that any $d$-to-1 multiplexer can be replaced at the same level in the tree by a $d'$-to-1 multiplexer such that the replacement does not increase the number of multiplexers. For example, if in a design we have to connect 6 registers to 2 terminals using 4-to-1 multiplexers, then we can replace these by either two 3-to-1 multiplexers (if they exist) or by one 4-to-1 multiplexer and one 2-to-1 multiplexer.
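The multiplexer bound can be checked numerically with a short sketch. The theorem's equation is not legible in the text above, so the form used here, the ceiling of $(R - O)/(d - 1)$, is an assumption reconstructed from the induction argument (each added multiplexer makes room for $d - 1$ further registers, and $d - 1$ is named as the divisor). The function name `mux_lower_bound` and the handling of the case $R \le O$ are also assumptions.

```python
import math


def mux_lower_bound(registers, terminals, d):
    """Lower bound on the number of d-to-1 multiplexers.

    registers: R, number of registers to connect.
    terminals: O, number of operator input terminals.
    d:         number of inputs of each multiplexer (d >= 2).
    Assumed form: ceil((R - O) / (d - 1)); if there are no more
    registers than terminals, no multiplexer is counted.
    """
    if d < 2:
        raise ValueError("a multiplexer needs at least two inputs")
    extra = registers - terminals
    if extra <= 0:
        return 0
    return math.ceil(extra / (d - 1))


# Example from the text: 6 registers, 2 terminals, 4-to-1 multiplexers.
print(mux_lower_bound(6, 2, 4))

# Subcase 1 vs. subcase 2 of the induction, with O = 10 and d = 4:
# going from 13 to 14 registers fills the last free input and
# forces a second multiplexer.
print(mux_lower_bound(13, 10, 4), mux_lower_bound(14, 10, 4))
```

For the example in the text the bound is two multiplexers, matching the two-multiplexer replacements described there.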
Thus if $D_{md}$ is the delay through a $d$-to-1 multiplexer, then the effect of the multiplexers is calculated as $overall\_clock\_cycle = maximum(delay_i) + D_r + D_{md} + \ldots$

Several experiments using Sehwa [3] were conducted. Of these, the three examples described herein are: (i) an AR lattice filter systolic array element [1] which was converted to a data flow graph and expanded for complex operations (Fig. 5); (ii) a random data flow graph generated using a random number generator (Fig. 6); and (iii) a data flow graph with conditional branching (see Fig. 5.1 of Ref. [3]). Three sets of modules with different area-delay parameters were used for each dataflow graph. These area-delay characteristics were obtained from PLEST [2] and are given in Table 1. In this paper we only present the results obtained from one set of modules. The results from the remaining two sets of modules were consistent with the results reported here.

3. The design curve produced by Sehwa, neglecting the cost and delay of registers.
RESULTS AND FUTURE RESEARCH
The curves show that $(AT)_{lb}$ is a good lower-bound approximation for pipelined design. For the AR lattice filter, several optimal design points have been achieved by Sehwa, as can be seen in Fig. 10. This approximation allows us to narrow our search of the design space when synthesizing pipelines. The nonlinearity in the curve drawn on the log scale reflects the non-optimal use of the operators.
In the case of non-optimal designs also, the $(AT)_{lb}$ produced by procedure estimate_lower_bound is a good approximation. This can be seen in the curves for the random dataflow graph (Figs. 8 and 11) and the graph with conditional branches (Figs. 9 and 12). (... the ratio $c\_opn_i / latency$ to be an integer.)
Thus if we increase latency, we decrease $opt_i$ proportionally. However, in many general cases we have a pipeline consisting of one or two very expensive (in terms of area and delay) operators with very low usage, as against several cheap ones with high usage, resulting in non-optimal designs. The design, in this case, will not be optimal if latency is greater than the number of these operations (as $c\_opn_i < latency$ and utilization $< 1$).
If the latency is increased beyond $c\_opn_i$, the expensive operator will remain idle for $(latency - c\_opn_i)$ clock cycles.
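The idle-cycle argument can be made concrete with a small calculation. The utilization formula itself is not legible in the text above, so the definition used here, effective operations divided by the available operator slots ($opt_i \times latency$), is an assumption chosen to agree with the statements that a utilization of 1 is optimal and that an operator idles for $latency - c\_opn_i$ cycles once latency exceeds $c\_opn_i$. The function name `utilization_and_idle` is hypothetical.

```python
import math


def utilization_and_idle(c_opn_i, latency):
    """Return (utilization, idle_slots) for one operator type.

    Assumed definitions: opt_i = ceil(c_opn_i / latency) operators
    are allocated, each offering one slot per clock cycle, so
    opt_i * latency slots exist per latency period.
    """
    opt_i = math.ceil(c_opn_i / latency)
    slots = opt_i * latency
    return c_opn_i / slots, slots - c_opn_i


# Two multiply operations with a latency of 5: a single multiplier
# is busy in 2 of every 5 cycles and idle for the remaining 3.
print(utilization_and_idle(2, 5))

# 26 cheap operations with a latency of 13 divide evenly, so the
# two allocated operators are fully used.
print(utilization_and_idle(26, 13))
```

The operation counts 2 and 26 echo the random dataflow graph mentioned in the text; the latencies are arbitrary illustrative values.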
In the curves for a random dataflow graph (Figs. 8 and 11), one can observe the effects of having 2 expensive operations (multiply) as against 26 cheap operations (add and subtract).
Thus selecting module sets such that the non-optimality is minimized is a major problem. One approach to reducing the imbalance in the operators' cost/delay is to decompose the expensive node into smaller, cheaper nodes in the dataflow graph.
The example having conditional branches shows that the theoretical curve is non-linear and for this type of dataflow graph also, the above procedure is a very good approximation of the results produced by Sehwa.
Comparing the estimated $(AT)_{lb}$ curve with the estimated $(AT)_{lb}$ and estimated register curve in Fig. 10, it is seen that the register cost becomes more significant as the area is reduced.
Thus even though the operators can be optimized, the dataflow graph is so partitioned that it leads to a non-optimal number of registers. It would be interesting to see if a partitioning of the graph which optimizes the use of both registers and operators could be arrived at.
