Automated Exploration of the ASIC Design Space for Minimum Power-Delay-Area Product at the Register Transfer Level by Karakaya, Fuat
University of Tennessee, Knoxville
Trace: Tennessee Research and Creative
Exchange
Doctoral Dissertations Graduate School
5-2004
Automated Exploration of the ASIC Design Space
for Minimum Power-Delay-Area Product at the
Register Transfer Level
Fuat Karakaya
University of Tennessee - Knoxville
This Dissertation is brought to you for free and open access by the Graduate School at Trace: Tennessee Research and Creative Exchange. It has been
accepted for inclusion in Doctoral Dissertations by an authorized administrator of Trace: Tennessee Research and Creative Exchange. For more
information, please contact trace@utk.edu.
Recommended Citation
Karakaya, Fuat, "Automated Exploration of the ASIC Design Space for Minimum Power-Delay-Area Product at the Register Transfer
Level. " PhD diss., University of Tennessee, 2004.
https://trace.tennessee.edu/utk_graddiss/2274
To the Graduate Council:
I am submitting herewith a dissertation written by Fuat Karakaya entitled "Automated Exploration of the
ASIC Design Space for Minimum Power-Delay-Area Product at the Register Transfer Level." I have
examined the final electronic copy of this dissertation for form and content and recommend that it be
accepted in partial fulfillment of the requirements for the degree of Doctor of Philosophy, with a major in
Electrical Engineering.
Donald W. Bouldin, Major Professor
We have read this dissertation and recommend its acceptance:
Gregory D. Peterson, Chandra Tan, Michael A. Langston
Accepted for the Council:
Dixie L. Thompson
Vice Provost and Dean of the Graduate School
(Original signatures are on file with official student records.)
To the Graduate Council:
I am submitting herewith a dissertation written by Fuat Karakaya entitled “ Automated
Exploration of the ASIC Design Space for Minimum Power-Delay-Area Product at the
Register Transfer Level.” I have examined the final electronic copy of this dissertation
for form and content and recommend that it be accepted in partial fulfillment of the
requirements for the degree of Doctor of Philosophy, with a major in Electrical Engi-
neering.
Dr. Donald W. Bouldin
Major Professor
We have read this dissertation
and recommend its acceptance:
Dr. Gregory Peterson
Dr. Chandra Tan
Dr. Michael A. Langston
Accepted for the Council:
Dr. Anne Mayhew
Vice Provost and
Dean of The Graduate Studies
(Original signatures are on file with official student records.)
AUTOMATED EXPLORATION OF THE ASIC DESIGN SPACE
FOR MINIMUM POWER-DELAY-AREA PRODUCT AT THE
REGISTER TRANSFER LEVEL
A Dissertation
Presented for the
Doctor of Philosophy
Degree
The University of Tennessee, Knoxville
Fuat Karakaya
May, 2004
DEDICATION
Dedicated to
my wife Selda, my daughters Hatice Nur, and Zeynep Hannan for their love and support
ii
ACKNOWLEDGEMENT
First of all, I would like to thank Professor Donald W. Bouldin, my advisor, for his
guidance, encouragement, support and valuable criticism during this research. I also
thank my other committee members, Dr. Michael A. Langston, Dr. Gregory Peterson
and Dr. Chandra Tan for their guidance and valuable feedback. I would also like to
acknowledge the Defense Advanced Research Projects Agency for providing partial
financial support for this research under grant F30602-01-2-0562.
iii
ABSTRACT
Exploring the integrated circuit design space for minimum power-delay-area
(PDA) product can be time-consuming and tedious, especially when the target
standard-cell library has hundreds of options. In this dissertation, heuristic algorithms
that automate this process have been developed, implemented and validated at the reg-
ister transfer level. In some cases, the PDA product was 1.9 times better than the initial
baseline solution. The parallel search algorithm exhibited 9x speed up when executed
on 10 machines simultaneously. These two new methods also characterize the design
space for the given RTL code by generating power-delay-area points in addition to
the minimum PDA point in case the designer wishes to select a different solution that
is a tradeoff among these metrics. As a final step, these two search algorithms are
integrated into a fully automated ASIC design flow.
iv
TABLE OF CONTENTS
CHAPTER PAGE
1. Introduction                                  1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2 Goals and Expected Contributions . . . . . . . . . . . . . . . . . . . 9
1.3 Thesis Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2. Background                                  12
2.1 Power Consumption in CMOS . . . . . . . . . . . . . . . . . . . . . 12
2.1.1 Dynamic Power . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.1.2 Static Power . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.2 Delay in CMOS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.3 Optimization in CMOS Circuits . . . . . . . . . . . . . . . . . . . . 20
2.3.1 Power Optimization . . . . . . . . . . . . . . . . . . . . . . 20
2.3.2 Delay Optimization . . . . . . . . . . . . . . . . . . . . . . . 24
2.4 ASIC Design Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
2.5 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3. Design Space Search Algorithm: DSSA                    38
3.1 Implementation of DSSA . . . . . . . . . . . . . . . . . . . . . . . 38
3.2 Integrating DSSA into an ASIC Flow . . . . . . . . . . . . . . . . . 44
3.3 Experimental Results for DSSA . . . . . . . . . . . . . . . . . . . . 46
v
CHAPTER PAGE
3.4 Parallel Design Space Search Algorithm: PDSSA . . . . . . . . . . . 54
3.4.1 Implementation of PDSSA . . . . . . . . . . . . . . . . . . . 54
3.4.2 Experimental Results for PDSSA . . . . . . . . . . . . . . . 56
3.5 Pareto Points . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
3.6 Chapter Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
4. Searching the Design Space Using Simulated Annealing           66
4.1 Simulated Annealing . . . . . . . . . . . . . . . . . . . . . . . . . . 66
4.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
4.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . 70
4.4 Chapter Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5. Graphical User Interface (GUI) Design                    76
5.1 MacroGen . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.2 Initiating Design Space Search . . . . . . . . . . . . . . . . . . . . . 80
5.3 Reporting and Analyzing Run Results . . . . . . . . . . . . . . . . . 85
5.4 Initiating Layout . . . . . . . . . . . . . . . . . . . . . . . . . . . . 91
6. Summary and Future Work                          96
BIBLIOGRAPHY                                 99
VITA                                        102
vi
LIST OF TABLES
TABLE PAGE
1.1 The effect of parameter optimization on GOPS/W and GOPS/   . . . 9
3.1 Characteristics of the macros . . . . . . . . . . . . . . . . . . . . . . 46
3.2 Default values of delay and PDAP for the macros . . . . . . . . . . . 47
3.3 Optimum values of delay and PDAP for the macros . . . . . . . . . . 47
3.4 Number of generated points and run times . . . . . . . . . . . . . . . 54
3.5 Run times for DSSA and PDSSA . . . . . . . . . . . . . . . . . . . . 57
3.6 Effect of step-size on PDAP (for multiplier) . . . . . . . . . . . . . . 59
4.1 A general SA algorithm in pseudo code . . . . . . . . . . . . . . . . 67
4.2 Characteristics of the macros . . . . . . . . . . . . . . . . . . . . . . 70
4.3 Constraints of the macros . . . . . . . . . . . . . . . . . . . . . . . . 71
4.4 PDA products . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71
4.5 PDAP improvement ratios . . . . . . . . . . . . . . . . . . . . . . . 71
4.6 Run times . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
vii
LIST OF FIGURES
FIGURE PAGE
1.1 Design level, estimation accuracy and estimation speed. . . . . . . . . 2
1.2 3D design-space . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
1.3 Delay vs. area trade-off curve. . . . . . . . . . . . . . . . . . . . . . 5
1.4 Power vs. area trade-off curve. . . . . . . . . . . . . . . . . . . . . . 6
1.5 Power vs. delay trade-off curve . . . . . . . . . . . . . . . . . . . . . 7
1.6 GOPS/W-PDA/1000 vs. delay curve . . . . . . . . . . . . . . . . . . 10
2.1 Current flow in a CMOS inverter. . . . . . . . . . . . . . . . . . . . . 13
2.2 CMOS inverter while charging and discharging a load. . . . . . . . . 14
2.3 Two-stage inverter and its equivalent circuit for delay calculations. . . 18
2.4 Power optimization methods. . . . . . . . . . . . . . . . . . . . . . . 21
2.5 Before and after clock gating. . . . . . . . . . . . . . . . . . . . . . . 23
2.6 Before and after operand isolation. . . . . . . . . . . . . . . . . . . . 25
2.7 Sample switching activity file. . . . . . . . . . . . . . . . . . . . . . 26
2.8 Inserting buffer chain. . . . . . . . . . . . . . . . . . . . . . . . . . . 28
2.9 Before and after pipelining . . . . . . . . . . . . . . . . . . . . . . . 30
2.10 ASIC design flow. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
2.11 Monte-Carlo design space exploration flow . . . . . . . . . . . . . . 34
3.1 DSSA in pseudo code - Part I . . . . . . . . . . . . . . . . . . . . . . 39
3.2 DSSA in pseudo code - Part II . . . . . . . . . . . . . . . . . . . . . 40
viii
FIGURE PAGE
3.3 Default ASIC design flow . . . . . . . . . . . . . . . . . . . . . . . . 41
3.4 ASIC design flow with optimization . . . . . . . . . . . . . . . . . . 43
3.5 DSSA integrated into an ASIC flow. . . . . . . . . . . . . . . . . . . 45
3.6 Searched space for 16-bit adder . . . . . . . . . . . . . . . . . . . . . 48
3.7 Searched space for 16-bit multiplier . . . . . . . . . . . . . . . . . . 49
3.8 Searched space for 16-bit complex-multiplier . . . . . . . . . . . . . 50
3.9 Searched space for 12-bit 32 tab FIR . . . . . . . . . . . . . . . . . 51
3.10 Searched space for 12-bit 4 pole 32 tab poly FIR . . . . . . . . . . . 52
3.11 Searched space for 12-bit 64-point FFT . . . . . . . . . . . . . . . . 53
3.12 PDSSA in pseudo code . . . . . . . . . . . . . . . . . . . . . . . . . 55
3.13 A sample shell script generated by PDSSA for parallel search . . . . . 56
3.14 Perl code to determine the Pareto points. . . . . . . . . . . . . . . . . 58
3.15 Pareto curve for 16-bit adder. . . . . . . . . . . . . . . . . . . . . . . 60
3.16 Pareto curve for 16-bit multiplier. . . . . . . . . . . . . . . . . . . . . 61
3.17 Pareto curve for 16-bit complex-multiplier. . . . . . . . . . . . . . . 62
3.18 Pareto curve for FIR. . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.19 Pareto curve for poly-FIR. . . . . . . . . . . . . . . . . . . . . . . . 64
3.20 Pareto curve for 64 point FFT. . . . . . . . . . . . . . . . . . . . . . 65
4.1 Searched space for adder . . . . . . . . . . . . . . . . . . . . . . . . 72
4.2 Searched space for multiplier . . . . . . . . . . . . . . . . . . . . . . 73
4.3 Searched space for complex-multiplier . . . . . . . . . . . . . . . . . 74
5.1 Main window of the GUI. . . . . . . . . . . . . . . . . . . . . . . . . 77
ix
FIGURE PAGE
5.2 Initiating MacroGen from the main window. . . . . . . . . . . . . . . 78
5.3 Selecting a macro from MacroGen. . . . . . . . . . . . . . . . . . . . 79
5.4 Viewing specifications for the selected macro. . . . . . . . . . . . . . 81
5.5 Entering values for the generic parameters of the selected macro. . . . 82
5.6 Viewing VHDL file of the selected macro. . . . . . . . . . . . . . . . 83
5.7 Initiating a design space search. . . . . . . . . . . . . . . . . . . . . 84
5.8 Design space search in progress. . . . . . . . . . . . . . . . . . . . . 86
5.9 Tabulated design space search results for the selected macro. . . . . . 87
5.10 Plotting search results for the selected parameters. . . . . . . . . . . . 88
5.11 Generating Pareto curve. . . . . . . . . . . . . . . . . . . . . . . . . 89
5.12 Viewing the statistics of the standard-cells, number of transistors for
the selected design point. . . . . . . . . . . . . . . . . . . . . . . . . 90
5.13 Selecting a design point to send for layout. . . . . . . . . . . . . . . . 92
5.14 Initiating place and route procedure. . . . . . . . . . . . . . . . . . . 93
5.15 Selected design point after placement and routing. . . . . . . . . . . . 94
5.16 Final layout for the selected design point. . . . . . . . . . . . . . . . 95
x
CHAPTER 1
Introduction
The need for integrated circuits that consume less power and area yet are faster than
unoptimized circuits has accelerated the development of new and powerful synthesis
tools. Synthesis tools and the state-of-the-art standard-cell libraries are some of the
crucial parts of successful application-specific integrated circuit (ASIC) design. ASIC
design can be summarized in 3 major phases [1]:
1. Description of the design in a process-independent hardware description lan-
guage (HDL) such as VHDL or Verilog at either the behavioral level or at the
register transfer level (RTL).
2. Mapping the HDL to a specific process to generate a process-dependent net-list.
3. Physical placement and routing of the design.
It is possible to optimize the design during each of these phases to meet the de-
lay, power, and area specifications. In the first phase, the HDL code can be optimized
with the help of HDL power estimation tools [19] (however, the estimates may not be
very accurate; see Figure 1.1). A statistical design space exploration at the behavioral
level can also be conducted [10]. A successful optimization during the second phase
depends on how well the targeted standard-cell library has been characterized, the ca-
1
RTL
GATE Level
POST Layout (GATE Level)
POST Layout (Transistor)
Speed of Estimator
Accuracy
Figure 1.1: Design level, estimation accuracy and estimation speed.
2
pabilities of the synthesis tool and the optimization procedure used. Today’s standard-
cell libraries contain several versions of the same cell with different drive strengths [5],
which makes it easy to target the same design with different sets of constraints and
find a solution. Because of this fact, there are several different power-delay-area so-
lutions for the design. Design space exploration using RTL involves generating a set
of power-delay-area combinations between the lower and the upper boundaries of the
design space (an example 3-D design space is illustrated in Figure 1.2 which was gen-
erated for a 16-bit multiplier). As a result, it is almost impossible or very difficult to
explore the design space manually. Even by using semi-automated search techniques,
thorough design space exploration can be very time-consuming (taking hours or even
days). The main contribution of this work is a fully automated search technique which
explores the design space. This technique does not require any initial target values
for power, delay, and area. It automatically determines the upper and lower bounds of
the design-space for the targeted process and explores the design space to minimize a
given cost function.
1.1 Motivation
Designing efficient chips in terms of power, speed, and silicon area is the principal
goal of an ASIC designer. But as seen from Figure 1.3, Figure 1.4 and Figure 1.5
(these figures were generated experimentally using a 16 bit multiplier), designing a
higher speed circuit means making sacrifices in power consumption and in silicon
area, or designing a low-power circuit means making a sacrifice in speed. As a result,
3
Figure 1.2: 3D design-space
4
Figure 1.3: Delay vs. area trade-off curve.
5
Figure 1.4: Power vs. area trade-off curve.
6
Figure 1.5: Power vs. delay trade-off curve
7
the designer must constantly make trade-offs between conflicting requirements while
looking for a solution that satisfies the design constraints. The designer will make the
trade-offs depending on the purpose of the design. If it is designed for wireless appli-
cations, power and the portability will be the main focus. On the other hand if it is de-
signed for real-time applications speed may be the main focus. But some cases require
attention for power, speed and portability, such as in tactical missile applications. The
ASIC must be compact enough to fit into a confined space with limited capability to
remove the heat generated during high-speed processing. Thus, low power consump-
tion is of equal concern with processing speed and area so that optimizing the design
for minimum PDA will be the best choice. Two important parameters to measure in-
tegrated circuit efficiency are operations per second per watt (OPS/W) and operations
per second per square micron (OPS/    ). The first parameter is an indication of energy
efficiency of the chip and the second parameter is an indication of the area efficiency
of the chip. Both parameters are closely related to the power, delay and area values of
the design. OPS/W is inversely proportional to the product of delay and power (DPP),
and OPS/    is inversely proportional to the product of area and delay. To increase
operations per second per watt, DPP has to be minimized but at the same time silicon
area also has to be kept in mind. Thus, minimization of the product of power times
delay times area (PDA) can be a more efficient means of increasing OPS/W without
hurting OPS/    efficiency.
Table 1.1, generated experimentally using a 16 bit multiplier, shows that optimizing
the design for PDAP gives the best GOPS/W with an insignificant effect on GOPS/    .
8
Table 1.1: The effect of parameter optimization on GOPS/W and GOPS/   
Optimized parameter GOPS/W GOPS/   
D 200 14.5
P 190 9.05
A 187 12.5
PDA 240 12.7
Figure 1.6, generated experimentally using a 16 bit multiplier, indicates that maximum
GOPS/W occurs when the PDA is minimum.
To implement an efficient chip, the design space has to be searched for minimum
PDAP. However, this process is not an easy task and today’s very sophisticated com-
mercial tools do not provide an automated technique to meet this need. Hence, there
is a need for a technique which can work with commercial tools and guide them to
search the design space for minimum PDAP. Not only should the technique provide
the means to achieve this goal, it should also be integrated into an automated ASIC
design flow to reduce design time.
1.2 Goals and Expected Contributions
The main objective of this dissertation is to develop techniques which make design
space search easier or possible for certain parameters of the design. The emphasized
parameter in this thesis will be the product of power-delay-area or PDAP. Moreover,
9
Figure 1.6: GOPS/W-PDA/1000 vs. delay curve
10
the tools will be capable of handling other parameters like PAP, PDP, DAP etc.. Thus,
the goals of this research can be itemized as follows:
1. Develop a technique to assist or to guide a commercial tool to do a design-space
search.
2. Perform the design-space search with user guidance in terms of:
(a) search time,
(b) number of iterations, and/or
(c) improvement ratio.
3. Integrate the search technique into an automated ASIC design flow.
4. Design a graphical user interface (GUI) for the automated flow.
1.3 Thesis Overview
In Chapter 2, background about ASIC flow, CMOS circuits (power consumption,
delay, etc.), power optimization techniques, delay optimization techniques is given.
The related work section is also given in this chapter. Implementation of the proposed
design space search algorithms (DSSA, PDSSA) and experimental results are given in
Chapter 3. Chapter 4 presents the application of simulated annealing to design space
search and some experimental results of the application. Designing a graphical user
interface (GUI) for the developed algorithms is given in Chapter 5. Finally Chapter 6
contains the conclusions and future work.
11
CHAPTER 2
Background
2.1 Power Consumption in CMOS
When the inverter in Figure 2.1 is functioning, it consumes two types of power:
1. Dynamic power
(a) Power due to charging/discharging of capacitive load and parasitic capaci-
tances (which is also known as switching power).
(b) Short-circuit power
2. Static power
(a) Leakage power
(b) Sub-threshold conductance
Total power consumption can be formulated as following [11]:
 
	



fiffffifl !"#

ffffifl
%$&(')(*($
#
 (2.1)
The following subsections explain each of these classes of power consumption.
12
OUTIN
Ip
In
C
Isc
Vdd
Figure 2.1: Current flow in a CMOS inverter.
13
2.1.1 Dynamic Power
Switching Power
Dynamic dissipation due to capacitive switching often consumes 80-90% of the total
power [17]. Dynamic dissipation is the result of charging and discharging parasitic
capacitances in the circuit. Figure 2.2 represents the equivalent circuit of a CMOS
inverter while charging and discharging a load capacitance in which the load capaci-
tance represents the total lumped parasitic capacitances. When the input to the inverter
goes from 1 to 0 the output will go from 0 to 1. For that phase (Figure 2.2) Equa-
tion 2.2 to Equation 2.6 show the calculations to determine the stored energy in the
load capacitance [25].
Ip
Vdd
OUTIN
Rp
C L
Vdd
OUTIN
C L
Rn
In
Figure 2.2: CMOS inverter while charging and discharging a load.
14
fl 

	



 
	



 

 (2.2)


	






	
 


 

 (2.3)


	

fl
 


fffi (2.4)


	
 




	



 






	
fl



 

ffi

 (2.5)


	


 

 (2.6)
The total stored energy is given by Equation 2.6. The same amount of energy
is dissipated in the PMOS transistor (due to    resistor). If the input goes from 0
to 1 then the output will go from 1 to 0 (Figure 2.2). In this case there is a direct
path to ground for
#
to discharge. The energy stored (Equation 2.6) in
 
will be
dissipated by the NMOS transistor (due to ! ). Fortunately, switching power is only
dissipated if a parasitic capacitor is charged or discharged. Thus, the total switching
power consumption can be written as:
15
 
 
	


##


#
(2.7)
where

is the number of capacitors switched (1 -  0 or 0 -  1, switching activity)
per clock period, and f is the clock frequency. Equation 2.7 illustrates that the switch-
ing power is proportional to switching activity, total lumped parasitic capacitance, and
the square of the supply voltage. Note that dynamic switching power consumption is
independent of the effective resistances of the transistors.
Short-Circuit Power
At some point during the switching transient, both the NMOS and PMOS transistors
are turned on at the same time. This occurs for gate voltages between

  
and

+


 
.
During this time, there is a short circuit between

and ground and a short-circuit
current flows through that path (
fl  !
). Short-circuit power (Equation 2.8) generally
accounts for 10%-20% of the total power dissipated in CMOS circuits [11].
 
 !
	
fl  ! #
 (2.8)
2.1.2 Static Power
Since CMOS transistors are not ideal switches, there is always a leakage current from

to ground through the channel of an off transistor.
16
2.2 Delay in CMOS
Prior to the advent of sub-micron processes, chip delay was dominated by the rise
and fall delays of the CMOS transistors. But in a sub-micron process, interconnect
delays become a more significant part of the total delay. Figure 2.3 shows an inverter
driving a second inverter as a load. If we know the the effective resistance of the
transistor, the delay calculation will be very simple. To calculate a resistor value to
represent the transistor over its entire operating range, the transistor’s resistance is
measured at two operating points (saturation and linear) and averaged [14] as shown
in Equation 2.9.

 
	

 
 
fl 
(
ff


 
 
fl

 
 

 (2.9)
fl 
(	


  







 (2.10)
 
 	

 (2.11)


 
 	







 (2.12)
17
IN OUT
Vdd Vdd
Ip
Vdd
OUT
C LC wire
RIN
Rp
wire
Figure 2.3: Two-stage inverter and its equivalent circuit for delay calculations.
18
fl
 
 	
 
 
   







 (2.13)
After inserting Equation 2.10, Equation 2.11, Equation 2.12, and Equation 2.13,
into Equation 2.9 approximately we get;

 


 












(2.14)

 

	



	





 
ff

 
 
$



 
 
$
ff 

 (2.15)
for a rise time corresponding to a change in output voltage from    

 to   
	



  

$ 	

 



 
ff


 

 
 
$
ff 
 (2.16)

  

$


 




 













&(
ff

 
 
$

&(
 (2.17)
As seen from Equation 2.16, delay is inversely proportional to both

 and tran-
sistor size. Increasing the supply voltage or increasing the transistor size results in a
faster circuit.
19
2.3 Optimization in CMOS Circuits
2.3.1 Power Optimization
As the popularity of portable and wireless devices has increased, power consumed by
these devices has become a critical issue. They have to be designed to consume low
power because of battery life time and reliability issues [18]. As the circuits get hot,
their failure rate gets higher. Proper cooling mechanisms have to be installed in the
system to prevent failures. But the best way of increasing the battery life and decreas-
ing the failure rate is designing low-power devices. At each design level, a designer
can use strategies to lower the power consumption. Figure 2.4 shows the power saving
strategies at each design level and possible percentages of power savings by utilizing
these strategies. Only the ones applicable at the RTL level will be mentioned in this
section, and they are:
1. Clock gating
2. Operand isolation
3. Switching activity back annotation.
Clock Gating
As mentioned earlier, most power consumption occurs when there is switching. If
unnecessary switching is prevented, power consumption can be lowered dramatically.
20
Behavioural Level
Scheduling, Allocation, Resource
      Sharing & Retiming
Gate Lavel         
Technology Mapping, Rewiring,
Phase Assignment, Lowering Glitching
Device Level         Buffering, Transistor Sizing
Physical Level
P&R Interconnect Capacity Reduction
Clock-Tree Synthesis, Floorplaning
System level
Algorithms, Process, Library
Supply Voltage
LEVEL                     OPTIMIZATION METHODS              POWER REDUCTION
20-30%
10-20%
5-10%
50-90%
40-70%
 Precomputation, FSM Encoding
 RTL Level      Clock-Gating, Operand-Isolation
30-50%
Figure 2.4: Power optimization methods [2]
21
For that purpose, clock-gating [9], [26], [4] can be the answer. The circuit in Fig-
ure 2.5(a) will consume switching power even if there is no state or output transition.
But after inserting clock-gating circuitry (Figure 2.5(b)), unnecessary switching will
be prevented. Clock-gating circuitry consists of a clock enable circuit, a latch, and
an AND-gate. The state and input signals are fed into the clock enable circuit which
determines whether to enable the clock. After latching the clock enable signal, the
system clock and the clock enable signal are passed through an AND-gate. The output
of the AND-gate is the GATED-CLOCK signal. Clock gating can reduce the dynamic
power up to 40% depending on the design. The latch is in the clock-gating circuitry to
prevent glitches in the clock.
Operand Isolation
In a design, data-path operators (multipliers, adders, etc.) are implemented in a way
that they are always operational. Even when the output of the operation is not used,
these circuits continue to experience switching activity which results in power dissipa-
tion. A data-path operator is mainly a combinational circuit. A combinational circuit
continues to switch whenever its inputs change value, even if its output is not used. If
the data-path operator output is an input to an unselected multiplexer or if it is an input
to a register that is currently disabled, its output is not used, even though it contin-
ues to switch. This useless switching increases the wasted power consumption. This
waste of power can be prevented if the data-path operators are stopped from switching
when their output is ignored. This can be done by using the operand-isolation tech-
22
FF
IN OUT
CLK
      Logic
Combinational
State
Latch
GCLK
Clock 
Enable
Circuit
IN
State
      Logic
CombinationalFF
OUT
CLK
Figure 2.5: Before and after clock gating.
23
nique [4], [21]. Figure 2.6 shows a design before and after operand-isolation. With
the operand-isolation technique, additional logic (AND or OR gates) is inserted along
with an activation signal to hold the inputs of the data-path operators stable whenever
their output is not used.
Switching Activity Backannotation
It is clear from Equation 2.1 that switching activity has an effect on dynamic power.
More switching increases the dynamic power. Switching activity is used by the syn-
thesis tools, and power estimation tools for accurate power optimization and accurate
power estimation. A switching activity file can be generated either at the RTL or at the
gate-level abstraction. This file contains a list of 0-  1 or 1-  0 transitions of the inter-
nal nodes and the input and output ports for a given time. The switching activity file
is utilized by the optimization tools to pinpoint the hot-spots (more power consuming
regions) of the design [4]. Power optimization tools put more effort on these hot-spots
in order to reduce the power consumption. The switching activity file (Figure 2.7) can
be captured using simulation tools. The input vector set, which is used to generate the
switching activity file, should characterize a typical operation of the design.
2.3.2 Delay Optimization
Buffer Insertion
Delay through a stage increases as the capacitance driven by that stage increases. Usu-
ally each stage drives a load which matches its drive capability. However, there are
24
XX
1
0
SEL
IN1
IN2
IN3
IN4
CLK
FF
X
X
&
&
&
&
1
0
CLK
IN4
IN3
IN2
IN1
SEL
FF
Figure 2.6: Before and after operand isolation.
25
Figure 2.7: Sample switching activity file.
26
several cases in which the load can be much larger [3]:
1. Capacitive load due to long wires.
2. Driving an off-chip component.
3. Driving a global signal (clock, reset, etc.).
The simple answer to this problem is to increase the transistor size of the driver,
which will increase the current available for the load. However, this does not solve the
problem completely since it just pushes the problem one logic level back. Increasing
the transistor size of the driver also increased its gate capacitance. Now it has become
a large load for the stage which drives it. It is obvious that eventually we have to
use stages with large transistor sizes, but we can minimize delay by using a chain of
drivers. Figure 2.8 shows a chain of buffers to drive a large capacitive load. Calcula-
tions to find an optimum number of buffers to drive a large capacitive load are given
below [25], [23]:


	

 



 
 
 

 

 
 
 
ff

 



 
 
 






 
 
 
ff
               
ff

 



 
 
 

 



(2.18)

&(
	

 



 
 
 

 
 
 

 ff  ff
               
ff


 




 
 
 
 (2.19)
At each stage, the ratio of the output capacitance to the input capacitance is

. It
should be also the same for last stage:
27
α
n-1
C
min
α
2
α
n-1
C
min
R
min
C
min
R
minR min
C
min
R
min
α
2
α
n-1
α
2
W/L W/L
W/L W/Lα
α
α
C L
Figure 2.8: Inserting buffer chain.

	


 




 
 
 
 (2.20)

 
	



 
 
 
 (2.21)
After inserting Equation 2.21 into equation Equation 2.19;

&
	
 
#

 
 
 (2.22)

 
 
 	

 



 
 
 

 
 
 (2.23)
28
& 	
 




 
 
 

 



 
 
 (2.24)

 
 
 
is the delay for a minimum size driver (       ) which drives a minimum size
load (

 
 
 ). To find the optimum number of stages which minimizes  &( , we have to
solve the following equation:
fffi
&
ff
 
	
 (2.25)
which gives:
 

 

 
 

 
	
 



 
 
 
 (2.26)
Pipelining
The clock period in a design is determined by the delay of the largest conbinational
block. The large combinational block in Figure 2.9 has a delay of 

. This indicates
that the clock period for this design has to be at least 

(if we ignore set-up and hold
times of the flip-flops for the sake of simplicity). If the designer wants to run at faster
clock, he/she has to reduce the delay of that large combinational block if possible. If it
is not possible, pipelining is a technique which offers a solution. Pipelining partitions
blocks of combinational logic into stages of equal delays, with the stages separated by
banks of pipeline registers. Figure 2.9 also shows the design after pipelining registers
29
T d
Combinational
       Logic
IN OUT
CLK
FF FF
Combinational
       Logic
Combinational
       Logic
IN OUT
T d T d
FF FF
/2 /2
FF
CLK
Pipelining Register
Figure 2.9: Before and after pipelining
30
have been inserted which has reduced the delay by half. If this reduction is not enough,
new pipelining registers can be inserted to partition the combinational blocks further.
This process can be carried on until a desired delay obtained.
Pipelining increases the throughput of the design but it introduces latency between
the input data and the resulting output. Also pipelining increases the gate count be-
cause of the inserted pipelining registers. This technique is most useful for systems
with a very high sampling rate.
2.4 ASIC Design Flow
An ASIC design flow is shown in Figure 2.10. The first step in the design flow is
the development of a hardware description of the design. This can be done by using
any of the hardware description languages (VHDL or Verilog) or by schematic entry.
The second step is the functional verification of the HDL. This can be done by using
RTL level simulators. A well defined testbench can be very helpful for verification
purposes. The third step is the synthesis of the HDL. Synthesis is the process of gener-
ating logic-level representation from the HDL. Output of the synthesis step is a net-list
of standard cells and interconnects. Standard cells are obtained from the vendors tech-
nology library. Technology libraries contains precharacterized standard cells for the
target technology. The fourth step is the verification of the gate-level net-list. This
step is also called gate-level simulation. The same testbench used for RTL verifica-
tion can be used at this step. The fifth step is the placement and routing of the standard
cells. Standard cells in the gate-level net-list are placed together to generate the layout.
31
HDL  
RTL  Simulation
Synthesis
Gate-Level Netlist
Gate-Level Simulation
Gate-Level Netlist
Place&Route
Layout
Extraction
Transistor-Level Netlist
Transistor-Level
Simulation
HDL Code 
Standard-Cell Lib.
Tech. Lib.
011101101
010110111
010101011
111001010
Vector Set
Figure 2.10: ASIC design flow.
32
Physical characteristics of the standard cells (height, width, I/O locations, power and
ground rails, layers, vias, etc.) are already defined in the standard-cell libraries. Then,
placed standard cells are routed. Routing can be timing driven, to give the first priority
to the timing constraints, or power driven, to give the first priority to power constraints.
If there is no violation, a transistor-level net-list is extracted from the layout for final
verification of the functionality, timing and power. If the functionality and constraints
are met, the design is placed in an I/O frame and sent for fabrication.
2.5 Related Work
In a paper closely related to the research, Bruni [10] developed a flow which starts
by synthesizing the behavioral representation of the design with randomly selected
constraints. The resulting RTL net-list is then evaluated for area, delay and power. The
flow (Figure 2.11) utilizes a Monte-Carlo sampling of design space which is repeated
automatically in an unsupervised mode to produce a statistical characterization of the
design space. Extreme value theory is applied to extrapolate achievable bounds from
the sampling points. Thus, this flow is intended as an off-line precursor to collect
statistical information before actually starting a detailed search for an optimal solution.
In contrast, the proposed design space search techniques are intended to identify the
best solution possible given the user constraints.
In [15], RTL design space search is done through projected AT-curves (Area-Time
curves). First the RTL module is synthesized for minimum area (    ,   ) and minimum
delay (    ,   ), and these two delay values are treated as upper and lower bound tim-
33
Loop
Design
Metrics
RTL
Sh
or
tc
ut
Synthesis Script
Gate-lavel netlist
Signal Activity
RTL simulator
R
Sp
ec
ifi
ca
tio
n
R
TL
 L
ib
ra
ry
Te
ch
no
lo
gy
 L
ib
ra
ry
Bounds
Synopsys
Verilog  XL
Synopsys’ DC
Mentor Graphics’
Monet
Constraint Generator
Figure 2.11: Monte-Carlo design space exploration flow
34
ing for the module. Then, the module is synthesized again for (   +   )/2 to generate
third design point (      

,   
 

). Then, (    ,   ),(    ,   ) points are used to generate a
two-point AT projection. For three-point AT projection, points (    ,   ), (      

,   
 

),
(    ,   ) are used. Then, a linear function is used to derive the AT-curves. Optimum
module selection is done using projected AT-curves. This approach has two main dif-
ferences from our approach: (1) Optimum point selection is done from a projected
AT-curve, in our case optimum design is selected among real data points generated
through synthesis. (2) design space is considered 2 dimensional (Area-Time), in our
case design space is considered 3 dimensional (Area-Time-Power).
In [8], a tool called GALOPS [7], a transformational based tool that uses a Ge-
netic Algorithm (GA), has been used for design space search. GALOPS utilizes a GA
to apply high level transformations to DSP algorithms at the behavioral level. Each
high level transformation changes the area, speed, and power characteristics of the de-
sign. GA is used to generate a set of designs with the lowest power value for every
generated area value. After each generation is created, Pareto-optimal points are se-
lected from the generated set of designs. These steps are repeated and selected Pareto
points at each generation are combined to create a global set of Pareto-optimal points.
By using those Pareto-optimal points, Pareto-optimal surfaces are generated to illus-
trate the trade-offs between the conflicting parameters. Our approach combines three
conflicting parameters (area, delay, power) in to a single cost function, and also our
search technique is at the logic synthesis level.
In [13], authors presented a module selection procedure which uses both a com-
35
plex operators library and the voltage scaling in order to optimize the design. The
procedures goal is to determine the optimal supply voltage and the optimal operators
set from a given module library according to a DSP application for design optimiza-
tion. The module selection procedure uses two entries; the application, and the com-
plex module library. The module library contains several parameters for each operator
(module): area, latency time, pipeline stages number, functionality (adder, multiplier,
etc.) and the effective capacitance. To explore the area-power space the module selec-
tion algorithm explores all the different solutions dealing with different operators with
different area-power-time characteristic and different supply voltages. At each supply
voltage value, the complex module library is searched through to identify the modules
which minimize the given cost function (Cost    	   

 
Area(s)+

Power(S)) For
the selected set of operators and the supply voltage the cost function is re-evaluated us-
ing area and power estimates for the selected operators at the selected supply voltage.
This method requires a pre-characterized module library. Whereas our method uses a
standard-cell library which is available through a vendor. Also the cost function used
in this method only accounts for area and power whereas in our approach the cost
function accounts for area, power, and delay.
In [24], authors proposed a method to obtain area and delay estimates from RTL
description. They observed that technology-dependent area and delay optimization
consumes 85% of the total design time. Therefore, they proposed a method to estimate
area and delay on technology- independent design. The estimates are obtained through
fast compiler-type optimizations on the RTL description followed by application of
36
best-fit polynomial area and delay models (models are previously generated from a
technology library) on the resulting technology-independent code. Surely, this will
help the designer to do a design space search at RTL code before mapping it into a
technology. However, this approach does not provide any estimate about power.
37
CHAPTER 3
Design Space Search Algorithm: DSSA
A Design Space Search Algorithm (DSSA) has been developed to explore the de-
sign space for the minimum power-delay-area product (PDAP). Since the design-space
is discrete, DSSA is implemented as heuristic. This algorithm is integrated into an
ASIC design flow to guide the synthesis tool. DSSA has been tested on several macros
and results are presented in section 3.3.
3.1 Implementation of DSSA
Pseudo code for the Design Space Search Algorithm (DSSA) is given in Figure 3.1
and Figure 3.2. DSSA is completely automated so that no human interaction is re-
quired once all the necessary initial parameters and files are provided. The initial
parameters and files required are:
  RTL level description (using either VHDL or Verilog).
  Testbench for RTL-level, gate-level simulations and for the generation of the
switching activity file.
  Initial synthesis script for the default run (Figure 3.3) which does not contain
any constraints. This run is to provide the initial values of power, delay and area
38


 
	 
 
 






 
	

 
	 






 







   
	
 


 








 


 
	 
  





 
	 





 







 
	









 
	






 







 
  	 	

	

  
 ff








 
 


ff






  




 



 

fi 
 
fl

ffi
 
 
	

	



 
 


 


 



 
 
	

	



 
 

 
 

 



 
 
	

	



 
 


 



 



 !

#"
 
	

	



 


	 


#" 
 
 

 




 
 
	

	



 
  







#" 
  




 


	
$!%'&

 
 

 
	


 


fi






)(
 
(
 
(
 
ff



 
  	 	

	


*

ff



 


 
+




 

(
 
(
 
(
 
(
$!%'&


ff




  	 	

	
Figure 3.1: DSSA in pseudo code - Part I
39



 
	


 


fi







$ % &



(
$!%'&

 
 



+

ff



 


 
+




 
 
 
(
 
(
 
(
 
(
$ % &

 
 

(



	ffi


 

 
 
 

	

 





 




 



 
 

 !

#"
(
 
(
 
(
 
(



ff





   
#"



 
 

 

 

 
 
 

 
 
ff ff












 
 






 
 


	 

 
 




ff





 	 
 




 
	


 




*



ffi






	 
	




 
	


 




*



ffi



 
  	
 
 




 
	
 

 




*



ffi+








 
  
 
 

	 

 







 



 
 

    fi






 
(


 
  
 
 



ff





   
#"



 
 

 



 
  
 
 
ffi


"
 


 
 
 



*



 

 
 



 


 


 


 




 
	
 





 







 




 
	

 



	







	 	

 

 
	
 

 
ff







	 	

 

 
	




ff

	


 
ffi







	 	
 
 
 
 
	
 
 
ff

	



 
ffi
 




*



ffi




 

 
 




ffi
Figure 3.2: DSSA in pseudo code - Part II
40
Compile
Read Design
Define Design
Environment
Constraint
Set Design
Select
Strategy
Design
Synthesize
RTL CODE
Scan
Insertion
Check Design
   Design
Save Synthesized
Gate-Level
Verilog Netlist
P&R
Simulate
Cadence SE
ModelSim
Synopsys Design Compiler
Figure 3.3: Default ASIC design flow
41
for the DSSA.
  Optimization synthesis script for the optimization runs (Figure 3.4).
  Control parameters that are set by the user: the desired run time, number of
iterations and improvement ratio.
DSSA starts with the default run (which doesn’t have any kind of optimization) for
the given RTL. Then the initial values for delay, power, area and the path where the
maximum delay occurs are extracted from the report files generated by the synthesis
tool. The maximum delay path is constrained by the value of the maximum delay
minus the iteration step size (choosing an iteration step size very small will increase
the total run time but it will give a more complete picture of the design space.) and the
synthesis script is updated with this new constraint. The DSSA also checks the other
paths for any timing violations (which are then added as constraints).
At each iteration step, the CPU time, the number of iterations and the improvement
ratio of the figure of merit (e.g. PDAP, DPP, power, delay, area, etc.) are checked.
If the value of any of these parameters satisfies the user-defined control parameters
(run time, number of iterations, improvement ratio), the iteration stops and waits for a
response from the designer. Detailed sorted results are presented on the screen and in
a file so the designer can decide whether to continue or stop. If the designer decides to
continue, the DSSA will resume from the point where it stopped.
During the run, the improvement ratio of the figure of merit for the previous it-
eration and current iteration are checked. If the improvement is less than epsilon (a
42
Compile
Define Design
Environment
Constraint
Set Design
Select
Strategy
Design
Synthesize
Scan
Insertion
Check Design
   Design
Save Synthesized
Power Compiler
Register
Retiming
RTL CODE
Gate-Level
Verilog Netlist
Read Design
Operand
Isolation
Gating
Clock
Simulate
P&R
Simulate
Forward SAIF
Back-Annotation
SAIF
ModelSim
Cadence SE
Synopsys Design Compiler
Synopsys VSS
Figure 3.4: ASIC design flow with optimization
43
user-specified parameter), the iteration will stop even if the other user-defined speci-
fications are not met. No change or little change in the improvement ratio means the
iteration process has likely already reached the boundaries of the design space.
3.2 Integrating DSSA into an ASIC Flow
Figure 3.5 shows how DSSA is integrated into an ASIC flow. DSSA integrated
ASIC flow contains 7 commercial tools coordinated by set of Perl and Tcl/Tk scripts.
DSSA itself is written in Perl. RTL-level and gate-level simulations are performed by
ModelSim. ModelSim is also used to generate RTL-level and gate-level switching ac-
tivity files. Synopsys Design Compiler is used for synthesis. Synopsys PowerCompiler
is invoked within the Design Compiler for power optimization and power estimation.
DSSA performs design-space search (characterization) until one of the user defined
parameters is satisfied. The search process is shown as a loop in Figure 3.5. Once the
search is concluded, the design point with minimum FOM is sent for placement and
routing. Placement and routing is performed by Silicon Ensemble. After placement
and routing is completed a parasitic information file called DSPF (Detailed Specific
Parasitic Format) is generated and backannotated to Design Compiler for more ac-
curate power and delay values. Layout generation is performed by Cadence Design
Frame Work. After a successful LVS (Layout versus Schematic) check a transistor-
level net-list with parasitic RC values is extracted from the Layout. A transistor-level
simulation is performed to verify functionality and to obtain more accurate power and
delay values. Transistor-level simulation is performed by NanoSim and PathMill.
44
Synopsys DC
    Synopsys Power Compiler
    ModelSim
Forward Switching activity information fileBack switching activity file
*.v 
RTL
Standard Cell Library
Synthesis script
Modify the Synthesis Script
with new constraints.
Power, area, delay
verilog netlist
RTL
Testbench
Loop
Cadence Design Frame Work
DSSA
Netlist with minimum F.O.M.
iteration step size
number of iterations
run time
improvement ratio
Figure of Merit
(PDA, PD, PA, DA)
Check CPU time
Check iteration number
Check improvement ratio
DSPF ( Detailed Specific Parasitic Format ) file
Silicon Ensemble
Figure 3.5: DSSA integrated into an ASIC flow.
45
Table 3.1: Characteristics of the macros
Design # of I/O’s # of standard cells # of Transistors
16-bit adder 49 19 492
16-bit multiplier 64 722 8552
16-bit complex-multiplier 130 2991 36218
12-bit 32-tap FIR 41 6445 92008
12-bit 4-pole 32-tap PFIR 111 9170 130036
12-bit 64-point FFT 101 8751 155084
3.3 Experimental Results for DSSA
We have tested the DSSA integrated flow on three combinational macros and three
sequential macros. Characteristics of the macros are listed in Table 3.1. Table 3.2
shows the default delay and PDAP for the macros. Default values are obtained by
using the default flow shown in Figure 3.3. Table 3.3 contains the optimum delay and
PDAP. These are the optimum values for the optimization flow (Figure 3.4). Table 3.3
also tabulates the improvement ratio on the FOM.
Figure 3.6, Figure 3.7, Figure 3.8 Figure 3.9, Figure 3.10, and Figure 3.11 are the
plots of data points generated during the design space search. Table 3.4 shows the
number of points generated during the design-space search and how long it took to
complete the search.
46
Table 3.2: Default values of delay and PDAP for the macros
Design
ff

$
 
  % (nsec)       

$
 
  % (nJx    )
16-bit adder 3.35 225.5
16-bit multiplier 4.43 129123
16-bit complex-multiplier 8.67 4543481
12-bit 32-tap FIR 7.06 43659291.6
12-bit 4-pole 32-tap PFIR 6.14 65476076.8
12-bit 64-point FFT 23.29 60758193.4
Table 3.3: Optimum values of delay and PDAP for the macros
Design
ff

 
 (nsec)            (nJx    )       

$
 
    /    
 
 
 

adder 3.43 183.2 1.23
multiplier 3.69 89120 1.45
complex-multiplier 5.07 2383333 1.91
FIR 5.16 35350768.4 1.24
PFIR 3.95 41202846.1 1.59
FFT 16.46 44239530.7 1.37
47
Figure 3.6: Searched space for 16-bit adder
48
Figure 3.7: Searched space for 16-bit multiplier
49
Figure 3.8: Searched space for 16-bit complex-multiplier
50
Figure 3.9: Searched space for 12-bit 32 tab FIR
51
Figure 3.10: Searched space for 12-bit 4 pole 32 tab poly FIR
52
Figure 3.11: Searched space for 12-bit 64-point FFT
53
Table 3.4: Number of generated points and run times
Design Number of points Run time
adder 30 0.8 hrs.
multiplier 32 1.67 hrs.
complex-multiplier 41 8.5 hrs.
FIR 24 41.8 hrs.
PFIR 25 25.6 hrs.
FFT 40 72.35 hrs.
3.4 Parallel Design Space Search Algorithm: PDSSA
As shown in Table 3.4 design-space search times are very long. One way to reduce
the search time is using parallel search algorithms. We have implemented parallel
DSSA, called PDSSA, for that purpose. This section exs implementation of PDSSA,
and presents the experimental results.
3.4.1 Implementation of PDSSA
Pseudo code for PDSSA is given in Figure 3.12. PDSSA starts with two runs to deter-
mine the lower and upper bound of the design-space. The upper limit for the design-
space is the point in which power constrain set to minimum (at this point delay is
not a constraint). The lower bound of the design-space is the point in which timing
54
 
	 






 
 





 
 
	
  


 



 
  






 



 
	









 
	

 
	 



 

+



  

 
 

 


 


	



 
	

	



  
 
 

 




fi
	
  
ff

 
ff



 

  

 
 
ff

fi

 
 


	
  
 
	

	



  
 
 

 



fi 


  
ff

 
ff

fi

 

fi
 






(
 



 




 

 
  

 
  



 




(
 
(



(
  
	



 



 
  

 
  

Figure 3.12: PDSSA in pseudo code
55
Figure 3.13: A sample shell script generated by PDSSA for parallel search
constraint and the power constraint are set to minimum. After determining the lower
and upper bounds of the design-space, it is divided in to the number of the machines
available for the parallel run. Then, PDSSA generates a shell script for the parallel
run. A sample shell script, generated by PDSSA, is given in Figure 3.13. The sample
script was generated to run 10 machines in parallel. As seen from Figure 3.13, on each
machine a DSSA runs in its assigned range. For this purpose DSSA exed in section 1.1
is slightly modified to except lower and upper boundaries for the iteration. Since the
design space is discrete, there is no need for the communication of the parallel running
machines during the search.
3.4.2 Experimental Results for PDSSA
Design-space search for all macros are repeated using PDSSA and run times for DSSA
and PDSSA are tabulated in Table 3.5 for comparison. As seen from the results PDSSA
dramatically reduced the run times. For DSSA runs only one machine (Sun Enterprise
56
Table 3.5: Run times for DSSA and PDSSA
Run Time
Design DSSA PDSSA
adder 0.8 hrs. 4 min.
multiplier 1.67 hrs. 12 min.
complex-multiplier 8.5 hrs. 1.12 hrs.
FIR 41.8 hrs. 4.5 hrs.
PFIR 25.6 hrs. 3.75 hrs.
FFT 72.35 hrs. 7.4 hrs.
220R server, 450 MHz. dual CPU) is used. PDSSA runs used 10 similar machines in
parallel.
3.5 Pareto Points
It is possible to characterize the design-space by the set of optimal tradeoff design
points. For those points there is no design with a smaller PDAP and the same or
smaller delay, and no design with a smaller delay and the same or smaller PDAP. Such
points are called Pareto points. An algorithm (Figure 3.14) is attached to the DSSA and
PDSSA to identify the Pareto points. This algorithm is written in Perl. It goes through
the design-points generated during the design-space search and selects Pareto points
57
Figure 3.14: Perl code to determine the Pareto points.
58
to characterize the design-space. Figure 3.15, Figure 3.16, Figure 3.17, Figure 3.18,
Figure 3.19, Figure 3.20 shows the Pareto curves for the DSP macros.
3.6 Chapter Conclusion
Table 3.6 shows that as we increase the step size, the “optimum” PDAP found by
this technique gets worse. The reason is, because of the large step-size, what is prob-
ably the true optimum point is skipped. To prevent this, we have to use an algorithm
which is not using a fixed step-size.
Table 3.6: Effect of step-size on PDAP (for multiplier)
Step-size(nsec) ff     (nsec)  !         (nJx    )
0.1 3.18 86656
0.2 3.69 89120
0.3 3.61 90577
0.4 3.49 91483
59
Figure 3.15: Pareto curve for 16-bit adder.
60
Figure 3.16: Pareto curve for 16-bit multiplier.
61
Figure 3.17: Pareto curve for 16-bit complex-multiplier.
62
Figure 3.18: Pareto curve for FIR.
63
Figure 3.19: Pareto curve for poly-FIR.
64
Figure 3.20: Pareto curve for 64 point FFT.
65
CHAPTER 4
Searching the Design Space Using Simulated Annealing
Due to the step size dependence of the DSSA and PDSSA, it is possible to skip the
true optimum PDAP point during the design space search. As a result we decided to
investigate an alternate search technique to determine whether it could produce higher
quality results.
4.1 Simulated Annealing
Simulated Annealing (SA) is an iterative improvement algorithm (optimization
technique) based on the principles of thermodynamics. SA is based on an analogy
of the annealing process of solids. It was first introduced in a paper published by
Metropolis et al. in 1953 [20]. In [16] it was first used on a large combinatorial prob-
lem to find an approximate solution. SA accepts better moves (which reduces the cost
function) unconditionally, but unlike some other algorithms it also accepts inferior
moves as the new solution with a probability to provide a possibility of escaping a
local minimum. A general SA algorithm is given in Table 4.1.
  Initialize() is to set initial guess value for the parameter to be optimized.
  HeatUp() is to determine the starting temperature.
66
Table 4.1: A general SA algorithm in pseudo code
 
!

 
$  
 
			

 
T:=HeatUp()
Loop
Loop
 
 $ 
 
		ff
flfiffi	!  "#%$ 
&




 
	



 
 
 $ 





 
 
!

 
$  



&





 or ')(%(
*+
*,.-0/21

(43

 
!

 
$   	
 
 $	 
until Equilibrium()
T:=CoolDown()
until Frozen()
  Perturb() is to disturb the system with the current input to determine the next
input.
  Accept() is to accept the inferior moves (which increases the cost function) with
the probability of

5
76



89;:
 
  Equilibrium() is to check whether the required number of iterations has been
made or not.
  CoolDown() is to reduce the temperature according to a given schedule.
67
  Frozen() is to evaluate the rate of change of the system response. If it is smaller
than a given epsilon, terminate the algorithm.
4.2 Implementation
Simulated Annealing has been implemented in a number of applications in elec-
tronic design automation including routing [6], circuit partitioning [22], schedul-
ing [12]. In our implementation we have used SA to search the design space for
minimum Power-Delay-Area product. The SA code is written in C. The interface
between the SA code and the synthesis tool is provided by a set of scripts written in
Perl. Synopsys Design Compiler [3] is used for logic synthesis and Synopsys Power
Compiler [4] is used for power optimization during the search. A switching activity
file is generated for better power optimization using a random set of input vectors. At
each run, Design Compiler writes out power, delay and area reports. Perl scripts are
used to extract the power, delay and area values from those report files and to update
the current value of the cost function. The SA algorithm makes its decision according
to this new value of the cost function. The cost function in our SA based optimization
is PDA product which is given in Equation 4.1. However, the cost function can easily
be changed to be Power-Delay product, Power-Area product or Area-Delay product
depending on the user’s preference.
68



	  

 

 
ff
 
 
 
  D   d   fl
 (4.1)
As seen from Equation 4.1, it is a cost function with constraints.
ff
 
 
 
is the mini-
mum delay for the design.
ff
 
fl is the delay where power consumption of the design
is minimum. Before starting the main design space search, an automated procedure
finds
ff
 
 
 
and
ff
 
fl
.
ff
 
 
 
is found simply by setting the timing constraint to zero, and
ff
 
 
is found by setting the power constraint to zero during synthesis. Using a con-
straint cost function reduces the search time. We have used Cauchy training as given
in Equation 4.2 and Equation 4.3 (T is the temperature and

is a constant).
&
 	






 
  (4.2)

	







 
ff 


 (4.3)
Equation 4.4 shows how the next delay point is generated. After each new delay
point is generated, the timing constraint in the synthesis script is updated to reflect the
new delay target. At each point power, delay, and area values are recorded and the
technology-dependent net-list is written out.
  $	 	 
!

 
$  
ff
&
 (4.4)
69
Table 4.2: Characteristics of the macros
Design # of Transistors # of I/O’s
16-bit adder 492 49
16-bit multiplier 8552 64
16-bit complex-multiplier 36218 130
The schedule used for cooling is given by Equation 4.5. Where   is the starting
temperature,
 
is a constant, and k is the current iteration step number.




	 


ff
 


(4.5)
4.3 Experimental Results
We have tested the SA implementation using three combinational macros. The
characteristics of the macros are given in Table 4.2. The adder and multiplier were
taken directly from Synopsys Design Ware library. The complex-multiplier was cre-
ated using components from the Synopsys Design Ware library.
Table 4.3 shows the upper ( ff     ) and lower ( ff      ) limits of the search space for the
combinational macros used.
ff

 
 is the delay point where the minimum PDA product
was found. PDA products at
ff
 
 
 
,
ff
 
fl
,
ff

$
 
, and
ff

 

are listed in Table 4.4. Table 4.5
tabulates improvement ratio on PDAP for the three search techniques. Figure 4.1,
Figure 4.2 and Figure 4.3 are the plots of data points generated during the search.
70
Table 4.3: Constraints of the macros
Design
ff
 
fl (nsec) ff      (nsec) ff     (nsec)
adder 4.05 0.77 3.30
multiplier 5.11 2.20 3.17
complex-multiplier 8.67 2.99 4.69
Table 4.4: PDA products
PDA (nJx    )
Design
ff
 
fl
ff
 
 
 
ff

$
 
ff

 

adder 210 1161.3 225.5 179.5
multiplier 115925.5 163371.7 129123 87470.7
comp-mult. 2969216 4186043.6 4543481 2314768.7
Table 4.5: PDAP improvement ratios
PDAP Improvement Ratio
Design DSSA PDSSA SA
adder 1.23 1.23 1.26
multiplier 1.45 1.45 1.48
complex-multiplier 1.91 1.91 1.96
71
Figure 4.1: Searched space for adder
72
Figure 4.2: Searched space for multiplier
73
Figure 4.3: Searched space for complex-multiplier
74
4.4 Chapter Conclusion
Table 4.6 tabulates the number of points generated during each search technique
and also the total run time of each technique. It should be noted that run times for
the macros using the SA technique are from 2x-9x greater than those for the DSSA.
Of course, more design points are generated for the SA technique. As tabulated in
Table 4.5, the improvement ratio for the SA is only slightly better than those obtained
by DSSA and PDSSA. For example, for the case of the complex-multiplier, the SA
produced only a 3% better result than DSSA and PDSSA, but it took SA 9x more than
DSSA and 67x more than PDSSA to find it. As a result we have decided not to run the
design space search using the SA technique for FIR, PFIR, and FFT since the DSSA
and PDSSA achieved comparable results in far less time.
Table 4.6: Run times
Number of Points Generated Run Time
Design DSSA PDSSA SA DSSA PDSSA SA
adder 30 30 324 0.8 hrs. 4 min. 4.93 hrs.
multiplier 32 32 49 1.67 hrs. 12 min. 2.25 hrs.
comp.-mult. 41 41 389 8.5 hrs. 1.12 hrs. 75.62 hrs.
75
CHAPTER 5
Graphical User Interface (GUI) Design
Team Liberator1.0 (Figure 5.1) is a Graphical User Interface (GUI) which is de-
signed to combine the optimization techniques developed by three universities (Uni-
versity of Tennessee, University of Washington, and University of California at Santa
Cruz) for the Phase I of DARPA project. The GUI is written in Tk. Data transfer
between the tools and the GUI is provided by a set of Perl scripts.
5.1 MacroGen
MacroGen is a DSP macro generator tool. It can be initialized within the
Team Liberator1.0 (Figure 5.2) or it can be run as a stand-alone tool. MacroGen con-
tains templates of the generic DSP macros. It generates macros from those templates
according to the values for the generic parameters given by the user. MacroGen is
initialized within the Team Libeartor1.0 by simply clicking on the ”MACROLIST”
button located in the main window (Figure 5.2) The set of available DSP macros is
listed in the MacroGen window (for now the only macros available are adder, multi-
plier, complex-multiplier, FIR, poly-FIR, and FFT). The desired macro can be selected
from the MacroGen window simply by clicking on the macro name. A new window
will then pop-up automatically(Figure 5.3). Values for the generic parameters of the
76
Figure 5.1: Main window of the GUI.
77
Figure 5.2: Initiating MacroGen from the main window.
78
Figure 5.3: Selecting a macro from MacroGen.
79
DSP macro are entered through this window. Specification of the selected DSP macro
can be viewed simply by clicking on the ”Specs” button in the macro window (Fig-
ure 5.4). After entering values for the generic parameters, clicking on the ”Generate”
button (Figure 5.5) will generate the VHDL file for the macro according to the user
specifications. The VHDL file for the macro can be viewed by clicking on the ”VHDL”
icon in the main window (Figure 5.6).
5.2 Initiating Design Space Search
The Synopsys synthesis tools used during the technology mapping phase can gen-
erate, depending on the target technology library and user constraints, many design
points with different power, area, delay values. Finding the best combination of power,
area, delay may require several iterations. This iteration process can be very time-
consuming and tedious. Several algorithms (DSSA, PDSSA, SA) have been devel-
oped at University of Tennessee to assist the synthesis tool to perform an automated
design space search. This design space search can be performed according to user-
defined parameters and is initiated by clicking on the ”Synopsys Design Compiler
Power Compiler” icon. A new dialog window will pop-up (Figure 5.7) to let the user
enter user-defined parameters for the design space search. The target technology can
be selected from the radio button labeled ”Technology”. The desired run time (unit
for the desired run time has to be selected using the radio button labeled as ”Unit”
(Figure 5.7)), number of iterations, improvement ratio and step size can be entered
using this dialog window. The figure of merit for the search can be selected using the
80
Figure 5.4: Viewing specifications for the selected macro.
81
Figure 5.5: Entering values for the generic parameters of the selected macro.
82
Figure 5.6: Viewing VHDL file of the selected macro.
83
Figure 5.7: Initiating a design space search.
84
radio button labeled ”F.O.M” (Figure 5.7). The desired algorithm (DSSA, PDSSA,
SA) for the search is selected using radio button labeled ”Search Algorithm”. Then,
the automated design space search for the selected macro can be initiated by pressing
the ”OK” button in the dialog window. After a few seconds, the dialog window will
disappear and several splash windows will start popping up. The last splash window
(Figure 5.8) will stay there until the search is complete. Depending on the user-defined
parameters, this step may take several minutes or even hours. But since the process is
completely automated, it can be left to run by itself.
5.3 Reporting and Analyzing Run Results
As soon as the design space search is over, a new window will pop up with search
results (Figure 5.9). From this window one can sort the results with respect to the cho-
sen figure of merit (power, delay, area, power-delay product, power-delay-area prod-
uct). It is also possible to plot the results. To initiate a plot an option (D vs P, D vs A,
P vs A, or D vs PDA) has to be selected from the radio button (Figure 5.9) located on
the results window. After selecting the option, clicking on the ”PLOT” button will pop
up a new window with the desired plot (Figure 5.10). Plotting the search results will
help the designer to visualize the design space. A Pareto curve fitting algorithm has
been also integrated into the GUI. Fitting a Pareto curve to the desired plot can be done
simply by clicking on the ”Pareto Curve” (Figure 5.11) button in the plot window. A
statistic about each design point can be obtained using ”Statistic” button in the results
window. This button will pop up a text window with the report (Figure 5.12) which
85
Figure 5.8: Design space search in progress.
86
Figure 5.9: Tabulated design space search results for the selected macro.
87
Figure 5.10: Plotting search results for the selected parameters.
88
Figure 5.11: Generating Pareto curve.
89
Figure 5.12: Viewing the statistics of the standard-cells, number of transistors for the selected design point.
90
will include each standard-cell used in the design and its number of instances, the to-
tal number of standard-cells used, and the total number of transistors for the selected
design point. It is also possible to select a design point from the list and send it for
layout. The desired design point can be selected by clicking on its iteration number
(column ”n” (Figure 5.13)). After selecting the design point, it can be sent to layout
by clicking on the ”Send for Layout” button in the results window.
5.4 Initiating Layout
The layout process can be started by clicking on the ”Cadence PlaceRoute” icon.
This will pop up a layout dialog window (Figure 5.14). After all the necessary files are
entered, one can start Silicon Ensemble from the ”Start SE” button in the layout win-
dow (Figure 5.14). After starting SE, the designer may have to wait several seconds
because SE has to read in the LEF file, the library VERILOG file, and the selected
design point. All of these actions will be completed automatically. When all of the
files have been read, the designer can interact with Silicon Ensemble to complete the
rest of the place and route process. After the place and route is complete (Figure 5.15),
the design has to be exported as a DEF file. The name of the DEF should be the same
as the macro name. To finalize the layout process, the designer should click on the
”Start ICFB” button in the layout dialog window (Figure 5.14). This will initiate the
Cadence Design Framework and the DEF file for the macro will be read in automati-
cally. After importing the DEF file is complete, the final layout for the macro will pop
up (Figure 5.16).
91
Figure 5.13: Selecting a design point to send for layout.
92
Figure 5.14: Initiating place and route procedure.
93
Figure 5.15: Selected design point after placement and routing.
94
Figure 5.16: Final layout for the selected design point.
95
CHAPTER 6
Summary and Future Work
Two important parameters to measure integrated circuit efficiency are operations
per second per watt (OPS/W) and operations per second per square micron (OPS/    ).
The first parameter is an indication of energy efficiency of the chip and the second pa-
rameter is an indication of the area efficiency of the chip. Both parameters are closely
related to the power, delay and area values of the design. OPS/W is inversely propor-
tional to the product of delay and power (DPP), and OPS/    is inversely proportional
to the product of area and delay. To increase operations per second per watt, DPP has
to be minimized but at the same time silicon area also has to be kept in mind. Thus,
minimization of the product of power times delay times area (PDA) can be a more ef-
ficient means of increasing OPS/W without hurting OPS/    efficiency. Experimental
results presented in this dissertation show that optimizing the design for PDA gives
the best GOPS/W with an insignificant effect on GOPS/    . Experimental results also
indicate that maximum GOPS/W occurs when the PDAP is minimum. As a result
for an efficient chip design, designer has to find the design point on the design space
where PDAP is minimum. To do that the designer may have to search the entire de-
sign space. Searching the design space manually can be a very time-consuming and
tedious process. An automated design space search can be a life saver for the designer.
96
Unfortunately today’s sophisticated synthesis tools do not provide any solution for this
problem.
To overcome this obstacle, several algorithms have been developed and imple-
mented at the University of Tennessee at Knoxville and presented in this thesis. The
main contributions of this dissertation include:
  Developed an algorithm (DSSA) to assist or to guide a commercial tool to do a
fully automated design space search.
  The developed algorithm searches the design space with user guidance in terms
of:
1. search time,
2. number of iterations, and/or
3. improvement ratio.
  A parallel version of DSSA (PDSSA) has been implemented to reduce the search
time.
  Several optimization methods at the RTL level have been investigated and three
of them have been applied to the selected designs to conduct design space
searches.
  To minimize the design space which characterizes the design, a Pareto point
selection algorithm has been added to the DSSA and PDSSA.
97
  The developed search techniques are integrated into an automated ASIC design
flow.
  A search technique which utilizes Simulated Annealing has been investigated.
  A graphical user interface (GUI) has been developed for the automated flow.
  All developed algorithms, the automated ASIC flow and the GUI have been
tested and verified using several DSP macros.
In general, the design space search algorithms, the automated ASIC flow, and GUI
introduced in this thesis have created a faster and automated way of searching the de-
sign space for hardware implementation. In some cases, the found PDA product was
1.9 times better than the initial baseline solution as a result of the optimization tech-
niques applied during the design space search. The parallel search algorithm exhibited
9x speed up when executed on 10 machines simultaneously.
One possibility of extending this dissertation is to include the design space search
at the behavioral level and transistor level.
98
BIBLIOGRAPHY
99
BIBLIOGRAPHY
[1] ASIC Design Methology Primer. IBM, 1998.
[2] Low Power Training Course Material. Synopsys, Inc., 1998.
[3] Synopsys Design Compiler Reference Manual: v2001.08. Synopsys, Inc., 2001.
[4] Synopsys Power Compiler User Guide v2001.08. Synopsys, Inc., 2001.
[5] TSMC 0.18   m Process Standard Cell Library Databook. Artisan Components,
2001.
[6] P. Banerjee, M. Jones, and J. Sarjent. Parallel Simulated Annealing Algorithms
for Cell Placement. In IEEE Transaction on Parallel and Distributed Systems,
volume 1, pages 91–105, January 1990.
[7] M. S. Bright and T. Arslan. Multi-Objective Design Strategy for High-Level
Synthesis DSP Systems for Low Power. In ISCAS 99, volume 1, pages 80–83,
1999.
[8] M. S. Bright and T. Arslan. A Genetic Algorithm for the High-Level Low Power
Design of DSP Systems. In Proc. IEE/IEEE Conf. on Genetic Algorithms in
Engineering Systems, pages pages 174–179, Sept. 1997.
[9] D. Brooks and M. Martonosi. Value-Based Clock Gating and Operation Packing:
Dynamic Strategies for Improving Processor Power and Performance. In ACM
Transactions on Computer Systems, volume 18, pages 89–126, May 2000.
[10] D. Bruni, A. Bogliolo, and L. Benini. Statistical Design Space Exploration for
Application-Specific Unit Synthesis. In Design Automation Conference, 2001.
[11] A. P. Chandrakasan and R. Brodersen. Minimizing Power Consumption in
CMOS Circuits. In Proceedings of IEEE, volume 83, pages 498–523, 1995.
[12] S. Devadas and A. R. Newton. Algorithms for allocation in datapath synthesis.
In IEEE Transaction on CAD of Integrated Circuit and Systems, volume 8, pages
210–215, 1991.
[13] S. Gailhard, O. Sentieys, N. Julien, and E. Martin. Area/Time/Power Space Ex-
ploration in Module Selection for DSP High Level Synthesis. In Int. Workshop,
PATMOS’97, pages 35–44, September 1997.
[14] D. A. Hodges and H. G. Jackson. Analysis and Design of Digital Integrated
Circuits. -Hill, 1983.
100
[15] P. C. Kao, C. K. Hsieh, C. F. Su, and C. H. Wu. An RTL Design-Space Ex-
ploration Method for High-Level Applications. In IEICE Trans. Fundamentals,
volume E84-A, pages 2648–2654, Novomber 2001.
[16] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. Optimization by simulated An-
nealing. In Science, volume 220, pages 671–680, 1983.
[17] P. E. Landman. Low-Power Archhitectural Design Methodologies. PhD thesis,
University of California at Berkeley, Berkeley, CA, August 1994.
[18] A. Maheswari, W. Burleson, and R. Tessier. Trading off Reliability and Power-
Consumption in Ultra-Low Power Systems. In International Symposium on
Quality Electronic Design, May 18-21, 2001.
[19] R. Martin and J. Knight. Power-profiler: Optimizing ASICs power consumption
at the behavioral level. In Design Automation Conference, 1995.
[20] N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller.
Equation of State Calculations by Fast Computing Machines. In Journal of
Chem. Phys., volume 21No, pages 1087–1092, 1953.
[21] M. Munch, B. Wurth, R. Mehra, J. Sproch, and N. Wehn. Automating RT-
Level Operand Isolation to Minimize Power Consumption in Datapaths. In IEEE
Design Automation and Test in Europe, pages 624–631, March 2000.
[22] C. P. Ravikumar. Parallel Algorithms for VLSI Physical Design. Ablex Publish-
ing Corporation, 1996.
[23] P. Rezvani, A. H. Ajami, M. Pedram, and H. Savoj. LEOPARD: A Logical
Effort-Based Fanout Optimizer for Area and Delay. In ICCAD, 1999.
[24] A. Sirinivasan, G. D. Huber, and D. P. LaPotin. Accurate Area and Delay Estima-
tion from RTL Descriptions. In IEEE Transactions on VLSI Systems, volume 6,
pages 168–172, March 1998.
[25] W. Wolf. Modern VLSI Design: A systems Approach. Prentice-Hall, Inc.,
Englewood Cliffs, NJ, 1994.
[26] Q. Wu, M. Pedram, and X. Wu. Clock-Gating and Its Application to Low Power
Design of Sequential Circuits. In IEEE Transactions on Circuit and Systems,
volume 47, pages 415–420, March 2000.
101
VITA
Fuat Karakaya was born on January 1st, 1970 in Aksaray, Turkey. He received his
primary and secondary education in Ankara. He studied at Hacettepe University, in
Ankara-Turkey, where he received his Bachelor of Science degree in Electrical Engi-
neering in May 1993. He received his Master of Science degree in Electrical Engi-
neering from Illinois Institute of Technology, Chicago-USA, in May 1998. He then
began working toward his Doctor of Philosophy degree at University of Tennessee,
Knoxville-USA. While pursuing his doctarate degree, he was a research assistant for
the Industrial Plasma Laboratory and the Microelectronic Systems Research Labora-
tory. He received his Doctor of Philosophy degree in Electrical Engineering in May
2004.
102
