System clock estimation based on clock wastage minimization by Narayan, Sanjiv & Gajski, Daniel D.
UC Irvine
ICS Technical Reports
Title
System clock estimation based on clock wastage minimization
Permalink
https://escholarship.org/uc/item/11j6p444
Authors
Narayan, Sanjiv
Gajski, Daniel D.
Publication Date
1991-12-05
 
Peer reviewed
System Clock Estimation based on
Clock Wastage Minimization
Sanjiv Narayan
Daniel D. Gajski
Technical Report #91-49 
December 5, 1991 
Dept. of Information and Computer Science 
University of California, Irvine 
Irvine, CA 92717-3425 
(714) 856-8059 
narayan@ics.uci.edu
Abstract 
When synthesizing a hardware implementation from behavioral descriptions, an important 
decision is the selection of a clock cycle to schedule the datapath operations into control steps. 
Most existing behavioral synthesis systems either require the designer to specify the clock cycle 
explicitly or require that the delays of the operators used in the design be specified in multiples 
of a clock cycle. In the absence of any tool to guide the selection of a clock cycle, a bad choice 
of the clock period could adversely affect the performance of the synthesized design. We present 
an algorithm for estimating the system clock based on a clock wastage minimization criterion.
Limitations of previous approaches to the problem are discussed. The results obtained show that
the clock cycle estimated by the Clock Wastage Minimization method produces faster designs than
previous solutions to the problem.
Contents
1 Introduction
2 Problem Definition
3 Previous Work
4 Design Model
5 The Clock Wastage Minimization Algorithm
5.1 Definition of Terms
5.2 Notation
5.3 Algorithm
5.4 The HAL Second Order Differential Equation Example
5.5 Computational Complexity
6 Experimental Results
6.1 Implementation
6.2 Experiments using the Clock Wastage Minimization method
7 Conclusions and Future Work
8 Acknowledgements
9 References
A SCESTCLK : Manual Page
B Benchmarks/Examples used in the Document
B.1 HAL Differential Equation
B.2 Fifth Order Digital Elliptic Filter
B.3 AR Lattice Filter
B.4 Linear Phase B-Spline Interpolated Filter
List of Figures
1 Effect of the clock cycle on DFG completion times (numbers adjacent to the operators represent respective operator delays, dotted lines represent state boundaries)
2 Design Model for Clock Calculation
3 Minimal Wastage Clock Calculation Algorithm
4 Graphical Representation of Clock Wastages for the HAL Differential Equation example
5 Clock Utilization as function of the Clock Cycle for the HAL Differential Equation example
6 The VTI VDP100 Datapath Component Library
8 Effect of Clock Utilization on Completion times for the HAL Differential Equation example
9 Comparing completion times for different functional unit allocations for the Elliptical Filter

1 Introduction
Behavioral Synthesis involves the transformation of a specification or design description into a
set of interconnected micro-architectural components which satisfy the behavior and any specified
constraints. Several tasks are performed during synthesis. Component selection chooses
components from a design library to perform the desired functions. Scheduling determines the
appropriate control step for each operation in the behavioral description. Resource binding
assigns components to the specification's operations.
The scheduler tries to execute as many operations as possible in each control step to extract
as much parallelism as possible. The overall performance (or execution time) of the design
depends on the duration of each control step, i.e., the clock cycle.
Most existing behavioral synthesis systems [Wa90, PaKnGi86, PaKn89, BaMa89] approach
the problem of determining the clock cycle in one of the following ways:
• The designer specifies a clock cycle explicitly.
• All operators are assumed to have identical delays. A unit clock cycle is assumed and each
operator requires exactly one clock to execute.
• The designer specifies the operator delays in multiples of a clock. Once again, a unit clock
cycle is assumed.
The quality, with respect to performance, of the schedule generated by the scheduler depends
on the clock cycle. In each of the approaches outlined above, the designer has nothing more than
intuition to rely on when selecting a clock cycle. A bad choice of the clock cycle given to the
synthesis tools could result in inefficient designs wherein the operators are greatly underutilized
and the performance of the resulting design is unacceptably slow. Let us illustrate this with
the example of Figure 1.
Figure 1(a) shows a dataflow graph (DFG) with 2 multiply operations and 4 add operations.
The delays of the multiply and add functional units are 150 ns and 80 ns respectively. Assume
that we are interested only in the output values o1 and o2. As shown in Figure 1(a), we use
a clock cycle of 380 ns to give us the fastest possible DFG execution time of 380 ns. But we
are using a large number of functional units - 2 multipliers and 4 adders. Since this is not
desirable, we attempt to share functional units amongst the various operations. Figure 1(b)
shows that a reduction of the clock cycle to 150 ns requires only 1 multiplier and 2 adders,
giving us a higher completion time of 450 ns (3 clock cycles). We further reduce the functional
units to 1 multiplier and 1 adder with the same clock cycle as shown in Figure 1(c). However,
the completion time for the DFG is now 600 ns (4 clock cycles). Finally in Figure 1(d), we
see that a clock cycle of 80 ns gives a completion time of 400 ns (5 clock cycles) and uses a
minimal number of functional units - 1 multiplier and 1 adder. With respect to performance,
this implementation is very close to the fastest implementation of Figure 1(a), while using
significantly fewer functional units.
Figure 1: Effect of the clock cycle on DFG completion times (numbers adjacent to the operators
represent respective operator delays, dotted lines represent state boundaries)
From Figure 1 it is evident that the choice of a particular clock cycle influences the sharing
of functional units that will be used in the design as well as the completion time or performance
of the design.
Thus it is essential that clock estimation be an integral part of any synthesis system, which
should provide the designer with feedback as to how various clock cycle values could possibly
affect the design performance.
In Section 3, we discuss previous approaches to estimating the clock. In Section 4, we
explain the underlying design model which forms the basis of our clock cycle calculation. The
Clock Wastage Minimization Algorithm is presented along with a detailed example in Section 5.
Finally the results of estimations on well known examples are presented in Section 6.
2 Problem Definition 
The goal of clock cycle calculation can be defined as follows:
Given a description of a design and a list of components that will be used to
implement the design, determine that value of the clock cycle which will minimize
the execution time for the dataflow graph of the design.
The execution time of a dataflow graph is the length of the schedule determined for it. Since
the scheduling of the operations in the dataflow graph depends on the operator delays and the
clock cycle, the execution time is dependent on the clock cycle.
3 Previous Work 
A few synthesis tools [PaPiMi86, PaPa85, JaMiPa88] have incorporated clock estimation techniques
which are then used to either examine area-time tradeoffs in the design or to guide
synthesis tasks such as scheduling.
In MAHA [PaPiMi86], first the critical path in the dataflow graph is determined. The
maximum delay of any operator in the critical path is chosen as the clock cycle.
The clocking scheme proposed in [PaPa85] computes a lower bound for the clock cycle of a
multistage system as being the longest stage time. Since the longest stage time is at least as
large as the longest operator delay, this scheme computes a clock cycle greater than or equal
to the largest operator delay.
A model for Area-Time Estimation is presented in [JaMiPa88]. The dataflow graph is divided
into a number of time steps. The critical path delay and the number of time steps are used to
compute the lower bound on the clock as given in the following equation:
$$clk = \max\left[ \frac{\text{Critical Path Delay}}{\text{No. of Time Steps}},\ \max(\text{Operator Delay}) \right] \qquad (1)$$
Each of the above approaches assumes that each operation must execute within one clock
cycle. Multi-cycle operations, where an operation could be scheduled in two or more control
steps, are not permitted. Consequently, they are similar to each other in one respect - the clock
cycle calculated by each of the above methods is at least as long as the largest operator delay.
We shall refer to these clock estimation methods as the Maximum Operator Delay methods.
The advantage of the above methods is that they are simple to implement and their algorithmic
complexity is linear in the number of different operator types that will be used to implement
the design.
Let the clock cycle computed by the Maximum Operator Delay (MOD) methods be denoted
by CLK_mod. Let T denote the number of distinct operator types in the DFG and OpDelay(t_i)
denote the delay of an operator of type t_i. From the above analysis of Maximum Operator
Delay methods we can clearly see that,
$$CLK_{mod} \geq \max\left[ OpDelay(t_i) \right] \quad \text{for all operator types } t_i \qquad (2)$$
A shortcoming of these methods is evident in cases where the operators have widely differing
delays. Consider a dataflow graph with a multiplier (delay: 200 ns) and an adder (delay: 50
ns) as the only components. The Maximum Operator Delay method would use a clock of 200
ns to schedule the dataflow graph. Thus whenever an add operation is scheduled in a control
step, the adder is idle for 150 ns in that clock cycle. The utilization of the adder is only 25%.
If instead, the graph was scheduled with a clock cycle of 50 ns, both the adder (delay: 1 clock)
and the multiplier (delay: 4 clocks) can be utilized throughout the clock cycle.
Another approach to clock cycle calculation involves scheduling the dataflow graph for all
possible clock cycle values, and selecting that clock which results in the fastest completion time.
Scheduling algorithms typically have computational complexities of O(n^2 log n), where n is the
number of nodes in the dataflow graph. An exhaustive scheduling technique for all possible
clock cycles will be computationally expensive even for small dataflow graphs. Also, to be
able to schedule a dataflow graph, we require an allocation to have been already determined.
In addition, it is difficult to define the fastest schedule for complex designs, such as a VHDL
description with several concurrent processes.
However, the above exhaustive scheduling method could be feasible, if we could somehow
reduce the search space involved with examining an entire range of clock values. The Clock
Wastage Minimization algorithm presented in this paper achieves this by providing the designer
with fast estimates of a feasible clock cycle. It calculates a clock cycle which would result in
the maximum utilization of all the operators, without performing any scheduling, and
its computational complexity is linear with respect to the number of operations in the entire
design.
4 Design Model 
The underlying design model assumed for the purpose of clock calculation is shown in Figure 2.
A two level bus structure is assumed for the interconnection across the registers and functional
units. This model allows for easy analysis of performance issues since the delay of the tristate
driver can be considered to be constant with respect to the number of the tristate drivers
driving a bus.
Operations can execute over several clocks. Thus if a functional unit has a delay of 90 ns
and the clock cycle is 50 ns, then the functional unit executes in two clock cycles. A typical
register-to-register transfer involves operands being read from registers, an operation performed
on the operands, and the results stored in another register. The delays associated with this
type of register-to-register transfer are the following:
• Delay of the operation.
• Delay associated with two levels of tristate drivers.
• Register setup time and propagation delay.
Let delay(t_i) represent the total delay in a register-to-register transfer involving an operator
of type t_i. Then,
$$delay(t_i) = OpDelay(t_i) + 2 \times (\text{tristate driver delay}) + \text{register setup time} + \text{register propagation delay} \qquad (3)$$
Figure 2: Design Model for Clock Calculation (registers and operators connected through two levels of buses and tristate drivers)
Thus the operator delays used in our algorithm are actually the values, delay(t_i), computed
above. To support operator chaining, links are needed from the output ports of some functional
units directly to the input ports of other functional units. This linking is accomplished by using
the path from a functional unit's output port through one of the buses to some other functional
unit's input port.
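Equation 3 can be illustrated with the component figures listed in Figure 6(a). The sketch below is illustrative only; the names are hypothetical, and it assumes that the 1.0 ns register figure in Figure 6(a) is the value used for the register propagation delay term, since that choice reproduces the delays reported in Figure 6(b).

```python
# Component figures (ns) as listed in Figure 6(a).
TRISTATE_DELAY_NS = 2.6
REG_SETUP_NS = 3.8
# ASSUMPTION: the 1.0 ns register figure is taken as the propagation delay.
REG_PROPAGATION_NS = 1.0

def transfer_delay(op_delay_ns: float) -> float:
    """Equation 3: total register-to-register delay for one operator."""
    return (op_delay_ns
            + 2 * TRISTATE_DELAY_NS      # two levels of tristate drivers
            + REG_SETUP_NS
            + REG_PROPAGATION_NS)

op_delays = {"adder": 38, "multiplier": 153, "subtractor": 46}
delays = {name: round(transfer_delay(d)) for name, d in op_delays.items()}
print(delays)
```

Under that assumption the adder, multiplier and subtractor come out at 48 ns, 163 ns and 56 ns, matching Figure 6(b).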
5 The Clock Wastage Minimization Algorithm 
5.1 Definition of Terms 
A few terms which will be referenced frequently are defined here. 
DFG Completion Time: Represents the execution time of the DFG. If the DFG is scheduled
into C control steps with a clock cycle clk, then the completion time of the DFG, d_DFG, is
defined as:
$$d_{DFG} = C \times clk \qquad (4)$$
Operator Occurrences: This represents the number of occurrences of an operator of type t_i
in the behavioral statements (or the DFG) representing the design and is denoted by occur(t_i).
Operator Wastage: The portion of the clock cycle for which the functional unit implementing
an operation in the DFG is idle is defined as the operator wastage. Wastage for a particular
operator can be defined as the difference between the operator delay and the next higher multiple
of the clock cycle. For a given clock value, clk, the wastage for an operator type t_i is denoted
by Waste(clk, t_i). If an adder has a delay of 80 ns and the clock cycle is 100 ns, then the wastage
involved with every use of the adder is 20 ns, or Waste(100, +) = 20 ns. However, a multiplier
with a delay of 150 ns will take 2 clocks to execute and the wastage will be 50 ns. Note that
in this document, the term Waste(100, ×) represents the wastage involved when a multiplier
is used and the clock cycle is 100 ns.
Average_Wastage/Operator: This represents the average value of the time for which any
operator will be idle in a clock cycle. For a given clock cycle, clk, the individual clock wastages
for each operator are weighted by the number of occurrences of that operator in the dataflow
graph and divided by the total number of operators in the dataflow graph. As an example,
consider a design which has 2 add and 4 multiply operations in the DFG. Assume a clock cycle
of 100 ns. If we use adders and multipliers with delays of 80 ns and 150 ns respectively, we
have,
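$$\begin{aligned}
clk &= 100 \text{ ns (given)} \\
T &= \text{number of operator types} = |\{+, \times\}| = 2 \\
occur(+) &= 2, \quad occur(\times) = 4 \text{ (given)} \\
delay(+) &= 80 \text{ ns}, \quad delay(\times) = 150 \text{ ns (given)} \\
Waste(clk, +) &= 20 \text{ ns}, \quad Waste(clk, \times) = 50 \text{ ns} \\
\text{Total Wastage} &= [occur(+) \times Waste(clk, +)] + [occur(\times) \times Waste(clk, \times)] \\
&= [2 \times 20] + [4 \times 50] \\
&= 240 \text{ ns} \\
\text{Average Wastage/Operator} &= \text{Total Wastage} \div \sum_{i=1}^{T} occur(t_i) \\
&= 240 \div (2 + 4) \\
&= 40 \text{ ns}
\end{aligned}$$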
Clock Utilization: This is defined as the percentage of the clock cycle that is utilized for
useful computation by all the functional units in the implementation. In the above example,
the clock utilization for a clock cycle of 100 ns can be calculated as follows:
$$\begin{aligned}
\text{Clock Utilization} &= 1 - (\text{Average Wastage/Operator} \div clk) \\
&= 1 - (40 \div 100) \\
&= 0.6 \text{ or } 60\%
\end{aligned}$$
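The three definitions above translate directly into code. The following Python sketch is illustrative only (the function and variable names are not from the report); it reproduces the worked example with 2 adds at 80 ns, 4 multiplies at 150 ns and a 100 ns clock.

```python
import math

def waste(clk, delay):
    """Operator wastage: next higher multiple of clk minus the operator delay."""
    return math.ceil(delay / clk) * clk - delay

def average_wastage(clk, delay, occur):
    """Wastage per operator, weighted by the occurrences of each operator type."""
    total = sum(occur[t] * waste(clk, delay[t]) for t in occur)
    return total / sum(occur.values())

def clock_utilization(clk, delay, occur):
    """Fraction of the clock cycle used for useful computation."""
    return 1 - average_wastage(clk, delay, occur) / clk

# Values from the example in the text (delays in ns).
delay = {"+": 80, "x": 150}
occur = {"+": 2, "x": 4}
clk = 100

print(waste(clk, delay["+"]), waste(clk, delay["x"]))   # wastage per operator type
print(average_wastage(clk, delay, occur))               # average wastage/operator
print(clock_utilization(clk, delay, occur))             # clock utilization
```

The printed values correspond to the 20 ns and 50 ns wastages, the 40 ns average wastage and the 60% utilization derived above.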
5.2 Notation 
We use the following variables in our algorithm : 
T : Number of distinct operator types in the DFG.
delay(t_i) : Delay of the operator of type t_i, as computed in Equation 3.
occur(t_i) : The number of occurrences of operators of type t_i in the DFG.
N_ops : Total number of operators in the DFG (N_ops = sum of occur(t_i) for i = 1 to T).
Waste(clk, t_i) : Clock wastage for an operator of type t_i for a specific clock cycle, clk.
CLK_lower : Lowest clock cycle examined by the Wastage Minimization method.
CLK_upper : Highest clock cycle examined by the Wastage Minimization method.
CLK_mod : Clock cycle estimated by the Maximum Operator Delay method (Section 3).
CLK_wm : Clock cycle estimated by the Wastage Minimization method.
5.3 Algorithm 
The Clock Wastage Minimization method for calculating the clock examines the occurrences of
operators in the DFG, and selects a clock cycle so as to minimize the wastage involved when
an operator is idle for a portion of the clock cycle.
Intuitively, the algorithm examines a set of clock cycles within a specific range. For each
clock cycle, the Wastage Minimization method computes the wastages associated with each
operator and determines the clock utilization. The value of the clock cycle which yields the
highest clock utilization is determined to be the best clock cycle. The complete algorithm is
presented in Figure 3. An outline of the method is presented below.
Step 1 : Compute delay(t_i), occur(t_i)
For each operator type t_i in the DFG, delay(t_i) is computed as given in Equation 3 and
occur(t_i) is computed by examining all the nodes in the dataflow graph for occurrences of
operator type t_i.
Step 2 : Determine CLK_lower and CLK_upper
The Wastage Minimization algorithm computes wastage for the operators over a range of
possible clock values from CLK_lower to CLK_upper. CLK_upper is the largest value of delay(t_i)
(over all operator types t_i in the design), as computed in Step 1 above.
Design libraries often specify the maximum clock frequency at which the clock input of a
bistable circuit may be driven such that stable transitions of logic levels are maintained. For
example, the VLSI Technology Inc. VDP100 Datapath Element Library [VTI88] specifies the
maximum clock frequency for a D Flip-Flop as being 75 MHz. Thus the value of CLK_lower that
will be used is 1/(75 MHz), about 13.3 ns, which is rounded up to 14 ns. In case such a maximum
clock frequency is not specified, then CLK_lower is approximated as the smallest value of
delay(t_i) computed in Step 1, over all operator types t_i in the design.
Step 3 : Clock Wastage Calculation Loop
Repeat the following steps for all values of the clock cycle, clk, where CLK_lower < clk <
CLK_upper:
• Compute Waste(clk, t_i), the Clock Wastage for each operator type t_i: For each
operator type t_i, the difference between the operator delay and the next higher clock multiple
can be calculated as follows:
$$Waste(clk, t_i) = \left( \left\lceil \frac{delay(t_i)}{clk} \right\rceil \times clk \right) - delay(t_i) \qquad (5)$$
• Compute Average_Wastage/Operator(clk): This is a measure of how much of the
clock cycle is wasted on the average by each of the operators in the DFG, for a specific
value of the clock, clk. First the total wastage is calculated by weighting the wastage
Waste(clk, t_i) involved with each operator type t_i by the number of its occurrences in the
DFG, occur(t_i). The total wastage is then divided by the total number of operators in the
design to give the average value of the wastage per operator. If T represents the number
of distinct operator types in the design, we can compute:
$$\text{Average\_Wastage/Operator}(clk) = \frac{\sum_{i=1}^{T} \left( occur(t_i) \times Waste(clk, t_i) \right)}{\sum_{i=1}^{T} occur(t_i)} \qquad (6)$$
• Compute Clock_Utilization(clk): The percentage of the clock cycle that is wasted is
the ratio of the Average_Wastage/Operator(clk) calculated above and the clock cycle,
clk. The utilization is then calculated as:
$$\text{Clock\_Utilization}(clk) = 1 - \left( \frac{\text{Average\_Wastage/Operator}(clk)}{clk} \right) \qquad (7)$$
Step 4 : Calculating best clock, CLK_wm
The value of the clock cycle which maximizes the Clock Utilization is selected as the best
clock, CLK_wm.
procedure MINIMUM_WASTAGE_CLOCK ( DFG: dataflow_graph )
/* Given a dataflow graph, return a clock estimate which minimizes
   the wastage of the clock cycle over all the operators in that graph. */
begin
    for all operator types t_i do
        Compute occur(t_i) by examining the DFG
        Compute delay(t_i)
    end for
    Determine CLK_upper and CLK_lower
    Max_Utilization = 0
    for all values of clk, such that CLK_lower < clk < CLK_upper do
        for all operator types t_i do
            Waste(clk, t_i) = ( ceil( delay(t_i) / clk ) × clk ) - delay(t_i)
        end for
        Average_Wastage/Operator(clk) = [ sum over i = 1..T of ( occur(t_i) × Waste(clk, t_i) ) ]
                                        / [ sum over i = 1..T of occur(t_i) ]
        Clock_Utilization(clk) = 1 - ( Average_Wastage/Operator(clk) / clk )
        if [ Clock_Utilization(clk) > Max_Utilization ] then
            Max_Utilization = Clock_Utilization(clk)
            CLK_wm = clk
        end if
    end for
    return ( CLK_wm );
end MINIMUM_WASTAGE_CLOCK;
Figure 3: Minimal Wastage Clock Calculation Algorithm
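The procedure of Figure 3 can be sketched in runnable form as follows. This is an illustrative Python rendering, not the report's implementation (scestclk): the function names are hypothetical, the clock is assumed to be stepped in whole nanoseconds, and the bounds are treated as inclusive, following the "from CLK_lower to CLK_upper" wording of Step 2. The sample data are the operator counts and delays used in Section 5.4.

```python
import math
from collections import Counter

def clock_utilization(clk, delay, occur):
    """Equations 5-7: clock utilization for the clock period clk."""
    total_waste = sum(
        n * (math.ceil(delay[t] / clk) * clk - delay[t])   # Equation 5, weighted
        for t, n in occur.items()
    )
    average_waste = total_waste / sum(occur.values())       # Equation 6
    return 1 - average_waste / clk                          # Equation 7

def minimum_wastage_clock(dfg_ops, delay, clk_lower, clk_upper):
    """Return (CLK_wm, utilization) over integer clocks in [clk_lower, clk_upper]."""
    occur = Counter(dfg_ops)          # Step 1: occurrences per operator type
    best_clk, max_util = None, 0.0
    for clk in range(clk_lower, clk_upper + 1):   # Step 3 (1 ns step is an assumption)
        util = clock_utilization(clk, delay, occur)
        if util > max_util:           # Step 4: keep the first clock with the highest utilization
            max_util, best_clk = util, clk
    return best_clk, max_util

# Operator occurrences and delays (ns) from Section 5.4.
hal_ops = ["x"] * 6 + ["-"] * 2 + ["+"] * 2
hal_delay = {"x": 163, "-": 56, "+": 48}

best_clk, best_util = minimum_wastage_clock(hal_ops, hal_delay, 14, 163)
print(best_clk, round(best_util, 2))
```

For these inputs the sketch selects a 56 ns clock with a utilization of about 92%, in agreement with the result reported in Section 5.4.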
5.4 The HAL Second Order Differential Equation Example 
To illustrate how the Wastage Minimization Algorithm calculates CLK_wm, we apply it to the
Second Order Differential Equation benchmark [PaKnGi86]. The components being used are
from the VTI VDP100 datapath library given in Figure 6.
Figure 4: Graphical Representation of Clock Wastages for the HAL Differential Equation example.
(a) Operator delays and wastages represented graphically, for delay(×) = 163 ns, delay(-) = 56 ns,
delay(+) = 48 ns and a clock cycle of 65 ns. (b) Graphical representation of the average
wastage per operator (24.4 ns).
Step 1 : Compute delay(t_i), occur(t_i)
The values of delay(t_i) are calculated using Equation 3 for each operator type t_i. The values
of delay(t_i) are given in Figure 6(b). The dataflow graph for the differential equation benchmark
has 2 add, 2 subtract and 6 multiply operations. Thus we have,
occur(×) = 6
occur(-) = 2
occur(+) = 2
delay(×) = 163 ns
delay(-) = 56 ns
delay(+) = 48 ns
In Figure 4(a), the delays of the operators are represented graphically as the length of the
lightly shaded regions along the X-axis. The occurrences of the operators are represented by the
thickness of the shaded regions along the Y-axis.
Step 2 : Determine CLK_lower and CLK_upper
Since the VDP100 library specifies that the maximum frequency for clocking registers is 75
MHz, CLK_lower = 1 / 75 MHz = 14 ns. CLK_upper is the largest operator delay, i.e. CLK_upper =
delay(×) = 163 ns.
Step 3 : Clock Wastage Calculation Loop
We shall illustrate the calculation of clock utilization for a clock cycle of 65 ns. Using
Equation 5, we get the wastages for each operator as the difference between the delay and the
next higher multiple of the clock cycle:
Figure 5: Clock Utilization as function of the Clock Cycle for the HAL Differential Equation example.
(Plot of clock utilization (%) against clock cycle (ns).)
$$\begin{aligned}
Waste(65, \times) &= (3 \times 65) - 163 = 32 \text{ ns} \\
Waste(65, -) &= (1 \times 65) - 56 = 9 \text{ ns} \\
Waste(65, +) &= (1 \times 65) - 48 = 17 \text{ ns}
\end{aligned}$$
The wastages for each operator type for a clock cycle of 65 ns are shown as the dark shaded
regions in Figure 4(a). The average wastage per operator for a clock cycle of 65 ns can be
calculated by Equation 6. This is shown graphically in Figure 4(b).
$$\text{Average\_Wastage/Operator}(65) = \frac{6 \times 32 + 2 \times 9 + 2 \times 17}{6 + 2 + 2} = 24.4 \text{ ns}$$
Finally, Clock Utilization for a 65 ns clock cycle can be computed as given in Equation 7,
$$\text{Clock\_Utilization}(65) = 1 - (24.4 / 65) = 0.62 \text{ or } 62\%$$
Step 4 : Calculating best clock, CLK_wm
The Clock Wastage calculation loop of Step 3 is repeated for all values of the clock cycle
from 14 ns to 163 ns. We find that the maximum value of clock utilization of 92% is achieved
at a clock cycle value of 56 ns as shown in Figure 5. This is selected as the best value of the
clock, CLK_wm.
The Maximum Operator Delay method estimated a clock value of 163 ns which resulted in
a utilization of 73%. An analysis of the clock estimation results for this example is presented
in Section 6.
5.5 Computational Complexity 
Let T be the number of operator types in the design, N_ops be the total number of operators in
the DFG and range be the number of clock cycles examined by the method, i.e.,
$$range = CLK_{upper} - CLK_{lower}$$
The occurrences of each operator in Step 1 can be computed in O(N_ops) time. For each of the
range clock values considered (loop of Step 3), Waste(clk, t_i) and Average_Wastage/Operator
can be computed in O(T) time. The overall time complexity is thus O(N_ops + T × range).
It can be shown that the clock cycle value which minimizes the clock wastage is either one
of the operator delays or their divisors. This observation follows from Figure 5 where the peaks
in the utilization curve occur at clock cycle values which are the divisors of the operator delays.
By evaluating clock utilization at these points only, we can significantly reduce the number of
clock cycle values (range) that are examined by the algorithm. In the current implementation
the user can decide whether the entire range of clock cycles or only the divisors of the delays
within the range of clock values CLK_lower and CLK_upper are to be examined.
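The reduced search over divisors of the operator delays might look like the sketch below. It is illustrative only and rests on an assumption about what "divisors" means here: each delay divided by a positive integer k, that is, clock periods into which some operator fits exactly. Exact rational arithmetic (`fractions.Fraction`) is used so that the ceiling in Equation 5 is not disturbed by floating-point rounding. All names are hypothetical, and the sample data are the HAL values of Section 5.4.

```python
import math
from fractions import Fraction

def divisor_candidates(delays, clk_lower, clk_upper):
    """Candidate clocks delay/k (k = 1, 2, ...) lying in [clk_lower, clk_upper]."""
    if clk_lower <= 0:
        raise ValueError("clk_lower must be positive")
    candidates = set()
    for d in delays:
        k = 1
        while Fraction(d, k) >= clk_lower:
            if Fraction(d, k) <= clk_upper:
                candidates.add(Fraction(d, k))
            k += 1
    return sorted(candidates)

def clock_utilization(clk, delay, occur):
    """Equations 5-7 evaluated exactly for a rational clock period."""
    total_waste = sum(
        n * (math.ceil(delay[t] / clk) * clk - delay[t])
        for t, n in occur.items()
    )
    return 1 - total_waste / (sum(occur.values()) * clk)

# HAL example values (ns) from Section 5.4.
delay = {"x": 163, "-": 56, "+": 48}
occur = {"x": 6, "-": 2, "+": 2}

candidates = divisor_candidates(delay.values(), 14, 163)
best = max(candidates, key=lambda c: clock_utilization(c, delay, occur))
print(len(candidates), float(best))
```

For the HAL example this evaluates 18 candidate clocks rather than every integer value between 14 ns and 163 ns, and still arrives at 56 ns.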
(a) Components from the VTI VDP100 Datapath Element Library

Datapath Component    VTI Component Name    Delay
Adder                 VDP3ALU001            38 ns
Multiplier            VDP3MLT001            153 ns
Subtractor            VDP5ALU001            46 ns
Tristate Buffer       VDP3TSB001            2.6 ns
Register              VDP3DFF001            setup time 3.8 ns
                                            hold time 1.0 ns
                                            max. freq. 75 MHz

(b) Functional unit delays used in the Wastage Minimization Algorithm [derived from Equation 3 and (a)]

Operator      delay(t_i) (Equation 3)
Adder         48 ns
Multiplier    163 ns
Subtractor    56 ns

Figure 6: The VTI VDP100 Datapath Component Library
6 Experimental Results 
6.1 Implementation
The Clock Wastage Minimization Algorithm presented in the previous section has been implemented.
The input to the estimator is a SpecChart [NaVaGa91] description representing the
behavior of the design for which the clock has to be determined. In addition, the delays of the
operators as given in Figure 6(b) are provided. The clock estimator outputs that value of the
clock (CLK_wm) which minimizes wastage over all the operators. In addition, it calculates the
utilization of the operators as a percentage of the clock cycle. The manual page for the clock
cycle estimator, scestclk, is given in Appendix A.
We used the VLSI Technology Inc. VDP100 Datapath Element Library [VTI88] to obtain
the delays of the functional units. The datapath elements used are given in Figure 6.
6.2 Experiments using the Clock Wastage Minimization method 
The Clock Wastage Minimization Algorithm was tested on four well known examples: the HAL
Second Order Differential Equation [PaKnGi86], a Fifth Order Elliptical Filter [KuWhKa85],
the AR Lattice filter [JaMiPa88] and a Linear Phase B-Spline Interpolated Filter [PaFe90].
The VHDL for these examples can be found in Appendix B.
To verify that the clock estimate produced by the Wastage Minimization algorithm did
indeed produce faster completion times as compared to the Maximum Operator Delay method,
for each of the benchmarks above, we did the following:
1. We estimated the clock using both the Maximum Operator Delay method (CLK_mod) and
the Clock Wastage Minimization method (CLK_wm). The values of the estimated clock
cycles and clock utilization for both the methods are given in columns A and B of Figure
7. For example, in the case of the HAL Differential Equation example, the Maximum
Operator Delay method estimates a clock of 163 ns (utilization: 73%), whereas the Clock
Wastage Minimization method gives a clock estimate of 56 ns (utilization: 92%).
2. The completion time of a DFG scheduled with an estimated clock cycle is a good measure
of how good that clock estimate is. We scheduled the DFG using a clock cycle equal to
that estimated by both the methods, by employing the scheduling algorithm presented
in [PaGa87]. For all the examples, we specified an allocation of two operators of each
type as an input to the scheduler. For example, the Differential Equation Benchmark was
scheduled with an allocation of 2 adders, 2 subtractors and 2 multipliers. The number of
control steps produced as a result of scheduling without chaining the functional units is
given in column C of Figure 7.
It must be emphasized that scheduling and allocation are not required by the Clock
Wastage Minimization algorithm - they are simply being used to determine
completion time (performance) on the basis of the clock cycle generated by
the method. This enables us to evaluate the quality of the estimate.
3. We calculate the completion time for the DFGs as defined in Equation 4. The completion
times for the DFGs can be found in column D of Figure 7. The improvement in the
completion times produced by the clock estimated by the Clock Wastage Minimization
method compared to the Maximum Operator Delay method is expressed as a percentage
in column E.
$$\text{Performance Improvement} = 1 - \left( \frac{\text{Completion Time}(CLK_{wm})}{\text{Completion Time}(CLK_{mod})} \right) \qquad (8)$$
Similar results for the case when chaining of functional units is permitted during scheduling
are shown in columns F, G, and H.
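As a quick check of Equations 4 and 8, the following Python sketch (illustrative only, with hypothetical function names) recomputes the unchained Differential Equation entries of Figure 7 from their control step counts and clock cycles.

```python
def completion_time(control_steps: int, clk_ns: int) -> int:
    """Equation 4: DFG completion time = control steps x clock cycle."""
    return control_steps * clk_ns

def performance_improvement(t_wm: float, t_mod: float) -> float:
    """Equation 8: relative reduction in completion time."""
    return 1 - t_wm / t_mod

# Differential Equation example, scheduling without chaining (Figure 7).
t_mod = completion_time(4, 163)    # Maximum Operator Delay clock
t_wm = completion_time(10, 56)     # Clock Wastage Minimization clock

print(t_mod, t_wm, f"{performance_improvement(t_wm, t_mod):.1%}")
```

This yields 652 ns and 560 ns, an improvement of 14.1%, as listed in columns D and E.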
It is evident from the results of Figure 7 that the clock cycles estimated by the Clock Wastage
Minimization method produce designs with better performance when compared to clock cycles
Columns: A = estimated clock (ns); B = clock utilization; C, F = control steps;
D, G = completion time (ns); E, H = performance improvement.
Columns C to E: scheduling without chained operators. Columns F to H: scheduling with chained operators.

Example                              Clock Estimation Method       A     B     C    D      E       F    G      H
Differential Equation [PaKnGi86]     Max. Operator Delay           163   73%   4    652    -       4    652    -
                                     Clock Wastage Minimization    56    92%   10   560    14.1%   10   560    14.1%
Fifth Order Digital Elliptic         Max. Operator Delay           163   46%   16   2608   -       11   1793   -
Filter [KuWhKa85]                    Clock Wastage Minimization    24    95%   49   1176   54.9%   47   1128   37.1%
AR Lattice Filter [JaMiPa88]         Max. Operator Delay           163   70%   11   1793   -       10   1630   -
                                     Clock Wastage Minimization    55    92%   26   1430   20.2%   23   1265   22.4%
Linear Phase B-Spline                Max. Operator Delay           163   57%   6    978    -       5    815    -
Interpolated Filter [PaFe90]         Clock Wastage Minimization    24    92%   24   576    41.1%   23   552    32.2%

Figure 7: Comparison of Completion Times of DFGs when scheduled with the clock cycles
estimated by the Maximum Operator Delay and Clock Wastage Minimization methods.
Figure 8: Effect of Clock Utilization on Completion times for the HAL Differential Equation example.
(Plot of DFG completion time against clock utilization (%).)
Allocation:                   (2 add, 2 mult)          (4 add, 2 mult)          (5 add, 1 mult)
Clock Estimation Method       Control   Completion     Control   Completion     Control   Completion
                              Steps     Time           Steps     Time           Steps     Time
----------------------------------------------------------------------------------------------------
Max. Operator Delay           16        2608 ns        14        2282 ns        15        2445 ns
(CLK = 163 ns)
Clock Wastage Minimization    49        1176 ns        47        1128 ns        65        1560 ns
(CLK = 24 ns)
Performance Improvement       -         54.9%          -         50.5%          -         36.2%

Faster completion times are obtained if the clock cycle computed by the
Clock Wastage Minimization method is used, regardless of the final
allocation used to implement the design.

Figure 9: Comparing completion times for different functional unit allocations for the Elliptical Filter.

produced by Maximum Operator Delay methods. We must now verify our initial assertion that
a clock cycle with a higher utilization will give us a better performance of the DFG.
Figure 8 examines whether the clock wastage minimization criterion that we adopted for
selecting a clock cycle is justified. Completion times are plotted against clock utilization values
for the HAL Differential Equation example. The points in the graph are labeled with the
corresponding clock cycle. For example, a 56 ns clock cycle results in a clock utilization of 92%
and the lowest DFG completion time of 560 ns. The figure clearly shows that better performance
can be obtained if we use a clock cycle that maximizes the utilization of the functional units.
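To make the notion of clock utilization concrete, here is a minimal Python sketch under stated assumptions. It assumes that the wastage of an operator is the idle time left in the last clock cycle it occupies, and that utilization is one minus the occurrence-weighted average wastage divided by the clock cycle. The function name, the operator counts (read off the loop body of the HAL benchmark in Appendix B.1, ignoring the loop comparison), and the adder and subtractor delays are all illustrative assumptions; only the 163 ns maximum operator delay is stated in this section.

```python
def clock_utilization(clk, delays, occurrences):
    """Assumed definition: 1 - (occurrence-weighted average wastage) / clk."""
    wasted = 0
    total = 0
    for op, delay in delays.items():
        cycles = -(-delay // clk)  # ceiling division: clock cycles the operator occupies
        wasted += occurrences[op] * (cycles * clk - delay)  # idle time in the last cycle
        total += occurrences[op]
    return 1.0 - (wasted / total) / clk


# Operator counts taken from the loop body of the HAL example (assumption:
# the comparison in the loop condition is not counted).
occurrences = {"mult": 6, "add": 2, "sub": 2}

# ASSUMED operator delays in ns; the adder and subtractor values are illustrative.
delays = {"mult": 163, "add": 49, "sub": 56}

for clk in (163, 56):
    print(f"CLK = {clk} ns: utilization = {100 * clock_utilization(clk, delays, occurrences):.0f}%")
```

With these assumed inputs the sketch yields roughly 73% for a 163 ns clock and 92% for a 56 ns clock, the utilizations listed for this example in Figure 7. A short clock wastes little time on the fast operators, at the cost of spreading the slow operator over several control steps.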
We now examine whether the clock cycle estimated by the Wastage Minimization method will
produce faster designs, regardless of the final allocation of functional units used to implement
the design. To demonstrate this, we scheduled the DFG for the Fifth Order Digital Elliptical
Filter with different allocations for both the clock cycles (CLK_mod and CLK_wm) generated by
the two estimation methods and compared the completion times of the DFG. This is shown
in Figure 9. We observe that for each allocation, the completion time obtained on scheduling
with CLK_wm is much shorter than that obtained by scheduling with CLK_mod. Thus we can
conclude that the clock estimated by the Clock Wastage Minimization algorithm produces
faster completion times (as compared to the one estimated by the Maximum Operator Delay
method) regardless of the allocation used for finally implementing the design.
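The completion times in Figure 9 are consistent with multiplying the number of control steps by the clock cycle. The short Python sketch below recomputes them on that assumption; the names `CLK_MOD`, `CLK_WM`, and `completion_time` are illustrative and do not come from the report.

```python
CLK_MOD = 163  # ns, clock estimated by the Maximum Operator Delay method
CLK_WM = 24    # ns, clock estimated by the Clock Wastage Minimization method

# Control steps from Figure 9: allocation -> (steps with CLK_mod, steps with CLK_wm)
control_steps = {
    "2 add, 2 mult": (16, 49),
    "4 add, 2 mult": (14, 47),
    "5 add, 1 mult": (15, 65),
}


def completion_time(steps: int, clk: int) -> int:
    """Assumed relation: completion time = control steps x clock cycle (ns)."""
    return steps * clk


results = {}
for alloc, (steps_mod, steps_wm) in control_steps.items():
    t_mod = completion_time(steps_mod, CLK_MOD)
    t_wm = completion_time(steps_wm, CLK_WM)
    results[alloc] = (t_mod, t_wm)
    print(f"({alloc}): {t_mod} ns with CLK_mod, {t_wm} ns with CLK_wm")
```

For every allocation the schedule with the 24 ns clock needs more control steps but finishes sooner, because each step is much shorter.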
7 Conclusions and Future Work 
Traditional high-level synthesis systems require the designer to specify the clock cycle explicitly
or to express operator delays in multiples of a clock cycle. We have presented an algorithm
for clock estimation from dataflow graphs, based on clock wastage minimization. This will
provide both designers and synthesis tools with a realistic and useful estimate of the clock cycle
that can be used to implement a design. By using real-life components and examples, we have
shown that the clock estimates produced by our method yield faster execution times for
the designs, as compared to the maximum operator delay methods. We also observe that the
designs scheduled with our clock cycle estimates had faster execution times regardless of the
components finally allocated for implementing the design during synthesis.
We plan to extend our model to incorporate wire delays in the register-to-register path. We
are currently looking into ways by which it will be possible to derive the lower limit of the range
of clock cycles that are examined by the algorithm (CLK_lower), based on the power constraints
specified for the design.
We plan to further extend our algorithm to handle conditional branching in the VHDL
description. In the current implementation, operators are counted exactly once when computing
occur(t_i), even though they may be mutually exclusive because they lie on different conditional
branches. If the description is divided into statement blocks, and the branch probabilities are
known, we can determine the frequency of execution of each statement block. While calculating
occur(t_i), the occurrences of operators of type t_i in each statement block would then be weighted
by the frequency of execution of that block.
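The proposed weighting can be sketched as follows. This is only an illustration of the idea described above, not the report's implementation: the function name `weighted_occurrences`, the block representation, and the example frequencies are all hypothetical.

```python
def weighted_occurrences(blocks):
    """Sum operator counts per type, weighting each block by its execution frequency.

    blocks: iterable of (execution_frequency, {operator_type: count}) pairs.
    """
    occur = {}
    for frequency, op_counts in blocks:
        for op_type, count in op_counts.items():
            occur[op_type] = occur.get(op_type, 0.0) + frequency * count
    return occur


# Hypothetical description: one unconditional block followed by an if/else
# whose branches are taken with probabilities 0.7 and 0.3.
blocks = [
    (1.0, {"add": 2, "mult": 1}),  # always executed
    (0.7, {"mult": 2}),            # "then" branch
    (0.3, {"add": 1, "sub": 1}),   # "else" branch
]

occur = weighted_occurrences(blocks)
for op_type in sorted(occur):
    print(f"occur({op_type}) = {occur[op_type]:.2f}")
```

Operators on mutually exclusive branches then contribute in proportion to how often their branch executes, instead of each being counted once.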
8 Acknowledgements 
This work was supported by the Semiconductor Research Corporation (grant #91-DJ-146). We 
are grateful for their support. The authors would also like to thank Frank Vahid, Allen Wu 
and Lognath Ramachandran for their useful suggestions. 
9 References
[BaMa89]    M. Balakrishnan and P. Marwedel, "Integrated Scheduling and Binding:
            A Synthesis Approach for Design Space Exploration", Proceedings of the
            Design Automation Conference, 1989, pages 68-74.
[JaMiPa88]  R. Jain, M. Mlinar, and A. Parker, "Area-Time Model for Synthesis of
            Non-Pipelined Designs", ICCAD 88.
[KuWhKa85]  S. Kung, H. Whitehouse, and T. Kailath, "VLSI and Modern Signal
            Processing", Prentice-Hall 1985, pp. 258-264.
[NaVaGa91]  S. Narayan, F. Vahid and D.D. Gajski, "System Level Specification and
            Synthesis with the SpecCharts Language", International Conference on
            Computer Aided Design, Santa Clara, November 1991.
[PaFe90]    D. Pang and L. Ferrari, "Unified Approach to General IFIR Filter Design
            using the B-spline Function", Asilomar Conference on Signals, Systems
            and Computers, October 1989.
[PaGa87]    B. Pangrle and D. Gajski, "SLICER: A State Synthesizer for Intelligent
            Silicon Compilation", Proceedings of the International Conference
            on Computer Aided Design, 1987.
[PaKnGi86]  P.G. Paulin, J.P. Knight, and E.F. Girczyc, "HAL: A Multi-Paradigm
            Approach to Datapath Synthesis", Proceedings of the Design Automation
            Conference, 1986.
[PaKn89]    P.G. Paulin and J.P. Knight, "Algorithms for High-Level Synthesis",
            IEEE Design & Test of Computers, December 1989.
[PaPa85]    N. Park and A.C. Parker, "Synthesis of Optimal Clocking Schemes",
            Proceedings of the Design Automation Conference, 1985.
[PaPiMi86]  A.C. Parker, T. Pizzaro, and M. Mlinar, "MAHA: A Program for Datapath
            Synthesis", Proceedings of the Design Automation Conference, 1986,
            pages 461-466.
[VTI88]     VLSI Technologies Inc., "VDP100 1.5 Micron CMOS Datapath Cell Library".
[Wa90]      R. Walker, "A Survey of High-Level Synthesis Systems", Report No. 90-30,
            Rensselaer Polytechnic Institute, October 1990.
A SCESTCLK: Manual Page 
SCESTCLK(L) LOCAL COMMANDS SCESTCLK(L) 
NAME 
scestclk - estimate clock cycle for given SpecChart 
SYNOPSIS 
scestclk [ -wmb ] [ -vd ] [ -a allocfile ] specchart
DESCRIPTION 
scestclk estimates the clock cycle for a given SpecChart and 
an associated allocation file. 
The allocation file contains the types of operators in the 
design and their respective delays. In case the allocation 
file is not specified using the '-a' option below, the clock 
cycle estimator will look in the file called ALLOC in the 
current directory. 
The clock cycle estimator can estimate the clock using
either the Clock Wastage Minimization method or the Maximum
Operator Delay method. In the former, the clock cycle is
computed so as to minimize the wastage involved when a
particular operator is used in a control step. In the latter
method, the largest delay of any operator is used as the
clock cycle.
If no options are specified, the estimator by default
produces non-verbose output using the Clock Wastage
Minimization method and uses the file ALLOC.
OPTIONS 
-w Estimate the clock cycle using the Wastage Minimization 
method 
-m Estimate the clock cycle using the Maximum Operator 
Delay method 
-b Estimate the clock cycle using both the Wastage Minimi-
zation and the Maximum Operator Delay methods 
-v Verbose output which lists the number of occurrences of
each operator and operator delays.
-d Output the results on three files: ",tmpclkest" which
contains the clock cycle estimated, ",tmpclkutil" which
contains the utilization of the operators for that
clock cycle, and ",tmpclkusage" which contains the
number of operators of each type and their delays.
-a allocfile
Allocation file containing the operator delays. The
default allocation file is set to ALLOC.
FILES
./*.sc          SpecChart files.
./*.scp         SpecChart packages.
,tmpclkutil     File containing the clock cycle utilization
                when the '-d' flag is used.
,tmpclkest      File containing the clock cycle
                estimated when the '-d' flag is used.
,tmpclkusage    File containing the number of operators
                of each type in the design and their
                delays.
SEE ALSO 
xscestclk, scestarea 
AUTHOR
Sanjiv Narayan (narayan@ics.uci.edu)
U. C. Irvine Last change: 4 December 1991 1 
B Benchmarks/Examples used in the Document 
B.1 HAL Differential Equation 
Differential Equation Example 
Source: Adapted from example in paper
"HAL: A Multi-Paradigm Approach to Automatic Data Path Synthesis"
by P. Paulin, J. Knight and E. Girczyc
23rd DAC, June 1986, pp. 263-270
Benchmark author: Joe Lis
Copyright (c) 1989 by Joe Lis

entity HAL is
    port (dx: in BIT_VECTOR(0 to 7);
          a: in BIT_VECTOR(0 to 7);
          cntrl: out BIT);
end HAL;

--VSS: design_style BEHAVIORAL

architecture EX of HAL is
begin
process
    variable x,y,u: BIT_VECTOR(0 to 7);
    variable u1,u2,u3,u4,u5,u6,y1: BIT_VECTOR(0 to 7);
begin
    while (x < a) loop
        u1 := u * dx;
        u2 := 5 * x;
        u3 := 3 * y;
        y1 := u * dx;
        x := x + dx;
        u4 := u1 * u2;
        u5 := dx * u3;
        y := y + y1;
        u6 := u - u4;
        u := u6 - u5;
    end loop;
end process;
end EX;
B.2 Fifth Order Digital Elliptic Filter 
Fifth Order Digital Elliptic Filter Benchmark 
Source: 1988 High Level Synthesis Workshop Benchmark 
Example taken from "VLSI and Modern Signal Processing" 
by S.Y. Kung, H.J. Whitehouse, T.Kailath (eds.) 
Prentice-Hall 1985, pp. 258-264 
Benchmark author: Joe Lis 
Copyright (c) 1989 by Joe Lis

entity ELLIPTIC_FILTER is
    port (In_port: in BIT; Out_port: out BIT);
end ELLIPTIC_FILTER;

architecture EX of ELLIPTIC_FILTER is
begin
process
    variable a,b,c,d,e,f,g,h,i,j,k,o: BIT;
    variable t2,t13,t18,t33,t39,t26,t38: BIT;
    variable m21,m24,m9,m30,m40,m36,m16,m6: BIT;
begin
    i := In_port;
    a := i + t2;
    b := a + t13;
    g := t33 + t39;
    e := g + t26 + b;
    d := (m21 * e) + b;
    f := (m24 * e) + g;
    t26 := f + d + e;
    e := m9 * (b + d) + a;
    h := m30 * (f + g) + t39;
    j := t18 + e + d;
    k := t38 + f + h;
    o := m40 * (h + t39);
    t39 := o + h;
    t38 := t38 + (m36 * k);
    t33 := t38 + k;
    t18 := t18 + (m16 * j);
    t13 := t18 + j;
    t2 := e + i + m6 * (a + e);
    Out_port := o;
end process;
end EX;
B.3 AR Lattice Filter 
AR Lattice Filter Example
Source: 1989 26th Design Automation Conference
"Experience with the ADAM Synthesis System"
by R. Jain, K. Kucukcakar, M. Mlinar, A. Parker
Benchmark author: Sanjiv Narayan
Copyright (c) 1991 by Sanjiv Narayan

entity LATTICE_FILTER is
end LATTICE_FILTER;

--VSS: design_style BEHAVIORAL

architecture EX of LATTICE_FILTER is
begin
process
    variable i1, i2, i3, i4, i5, i6, i7, i8, i9, i10, i11, i12, i13, i14 : BIT;
    variable a1, a2, a3, a4, a5, a6, a7, a8 : BIT;
    variable b1, b2, b3, b4 : BIT;
    variable c1, c2 : BIT;
    variable d1, d2, d3, d4 : BIT;
    variable e1, e2 : BIT;
    variable f1, f2, f3, f4 : BIT;
    variable g1, g2 : BIT;
    variable o1, o2, o3, o4 : BIT;
begin
    a1 := i1 * i2;
    a2 := i3 * i4;
    a3 := i1 * i4;
    a4 := i3 * i2;
    a5 := i5 * i6;
    a6 := i7 * i8;
    a7 := i5 * i8;
    a8 := i7 * i6;
    b1 := a1 + a2;
    b2 := a3 + a4;
    b3 := a5 + a6;
    b4 := a7 + a8;
    c1 := i9 + b3;
    c2 := i10 + b4;
    o1 := c1;
    o2 := c2;
    d1 := i11 * c2;
    d2 := i12 * c1;
    d3 := i11 * c1;
    d4 := i12 * c2;
    e1 := d1 + d2;
    e2 := d3 + d4;
    f1 := i13 * e2;
    f2 := i14 * e1;
    f3 := i13 * e1;
    f4 := i14 * e2;
    g1 := f1 + f2;
    g2 := f3 + f4;
    o3 := b1 + g1;
    o4 := b2 + g2;
end process;
end EX;
B.4 Linear Phase B-Spline Interpolated Filter 
Linear Phase B-Spline Interpolated Filter 
Source: Adapted from example in paper "Unified Approach to General
IFIR Filter Design using the B-spline Function" by
D. Pang, and L. Ferrari, Asilomar Conference on Signals,
Systems and Computers, October 1989.
Benchmark author: Joe Lis
Copyright (c) 1989 by Joe Lis

entity LPBFIR_FILTER is
    port (In_port: in BIT);
end LPBFIR_FILTER;

-- VSS: design_style BEHAVIORAL

architecture EX of LPBFIR_FILTER is
begin
process
    variable a0,a1,a2,a3,a4,a5,a6,a7,a8,x0,x1,x2,x3: BIT;
    variable y0,y1,y2,y3,y4,z1,z2,z3,z4: BIT;
    variable m0,m1,m2,m3,m4: BIT;
begin
    a8 := a7 and 1;
    a7 := a6 and 1;
    a6 := a5 and 1;
    a5 := a4 and 1;
    a4 := a3 and 1;
    a3 := a2 and 1;
    a2 := a1 and 1;
    a1 := a0 and 1;
    a0 := In_port and 1;
    x0 := a0 + a8;
    x1 := a1 + a7;
    x2 := a2 + a6;
    x3 := a3 + a5;
    y0 := m0 * x0;
    y1 := m1 * x1;
    y2 := m2 * x2;
    y3 := m3 * x3;
    y4 := m4 * a4;
    z1 := y0 + y1;
    z2 := z1 + y2;
    z3 := z2 + y3;
    z4 := z3 + y4;
end process;
end EX;
