Multiple voltage scheme with frequency variation for power minimization of pipelined circuits at high-level synthesis by Radhakrishnan, Bharath
UNLV Retrospective Theses & Dissertations 
1-1-2003 
Multiple voltage scheme with frequency variation for power 
minimization of pipelined circuits at high-level synthesis 
Bharath Radhakrishnan 
University of Nevada, Las Vegas 
Repository Citation 
Radhakrishnan, Bharath, "Multiple voltage scheme with frequency variation for power minimization of 
pipelined circuits at high-level synthesis" (2003). UNLV Retrospective Theses & Dissertations. 1560. 
http://dx.doi.org/10.25669/hzch-kvbq 
MULTIPLE VOLTAGE SCHEME WITH FREQUENCY VARIATION FOR 
POWER MINIMIZATION OF PIPELINED CIRCUITS AT HIGH-LEVEL
SYNTHESIS
by
Bharath Radhakrishnan
Bachelor of Engineering 
University of Madras 
2001
A thesis submitted in partial fulfillment 
of the requirements for the
Master of Science Degree in Electrical Engineering 
Department of Electrical and Computer Engineering 
Howard R. Hughes College of Engineering
Graduate College 
University of Nevada Las Vegas 
December 2003
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
UMI Number: 1417732
UMI Microform 1417732 
Copyright 2004 by ProQuest Information and Learning Company. 
All rights reserved. This microform edition is protected against 
unauthorized copying under Title 17, United States Code.
ProQuest Information and Learning Company 
300 North Zeeb Road 
P.O. Box 1346 
Ann Arbor, MI 48106-1346
UNLV Thesis Approval
The Graduate College
University of Nevada, Las Vegas
October 22, 2003
The Thesis prepared by
Bharath Radhakrishnan
Entitled
Multiple Voltage Scheme With Frequency Variation for Power
Minimization of Pipelined Circuits at High-Level Synthesis
is approved in partial fulfillment of the requirements for the degree of
Masters in Electrical and Computer Engineering
Examination Committee Member
Examination Committee Member
Graduate College Faculty Representative
Examination Committee Chair
Dean of the Graduate College
ABSTRACT
Multiple Voltage Scheme with Frequency Variation for Power Minimization of 
Pipelined Circuits at High-Level Synthesis
by
Bharath Radhakrishnan
Dr. Muthukumar Venkatesan, Examination Committee Chair 
Professor of Electrical and Computer Engineering 
University of Nevada, Las Vegas
High-Level Synthesis (HLS) is the translation of a behavioral description into a
structural description. The high-level synthesis process consists of three
interdependent phases: scheduling, allocation and binding. The order of the three phases
varies depending on the design flow. Three important quality measures support
design decisions: size, performance and power consumption. Recently,
with the increase in portability, power consumption has become a dominant
factor in the design of circuits. The aim of low-power high-level synthesis is to schedule
operations to minimize switching activity and select low-power modules while satisfying
timing constraints. This thesis presents a heuristic that helps minimize power
consumption by operating the functional units at multiple voltages and varied clock
frequencies. The algorithm presented here deals with pipelined operations where multiple
instances of the same operation are carried out. The algorithm was implemented in
C++ on a Linux platform.
TABLE OF CONTENTS
ABSTRACT
LIST OF FIGURES
ACKNOWLEDGEMENTS
CHAPTER 1 INTRODUCTION
1.1 High Level Synthesis
1.2 A look on the design problem Power
1.3 Structure of the thesis
CHAPTER 2 LITERATURE REVIEW OF POWER OPTIMIZATION ALGORITHMS AT HIGH-LEVEL SYNTHESIS
2.1 Static Power vs Dynamic Power
2.2 Review of some existing Power Optimization Algorithms
2.3 Algorithms for Pipelined Circuits
CHAPTER 3 HIGH LEVEL SYNTHESIS - DEFINITIONS
3.1 High Level Synthesis
3.2 Operation Scheduling
3.2.1 Classification of Scheduling Algorithms
3.2.1.1 As Soon As Possible Scheduling (ASAP)
3.2.1.2 As Late As Possible Scheduling (ALAP)
3.2.1.3 Force-Directed Scheduling (FDS)
3.2.1.4 Integer-Linear Programming (ILP)
3.2.1.5 Iterative Schedule (IR)
3.2.1.6 List Scheduling
3.3 Mathematical Formalization of the problem
3.3.1 Basic Notations
3.4 Allocation and Binding
CHAPTER 4 MULTIPLE VOLTAGE SCHEME WITH FREQUENCY VARIATION FOR POWER MINIMIZATION OF NON-PIPELINED OPERATIONS
4.1 Design Object
4.2 Power Estimation
4.3 Techniques for Power Minimization
4.4 Multiple Voltage Scheme with Frequency Variation for Non-Pipelined Circuits
CHAPTER 5 MULTIPLE VOLTAGE SCHEME WITH FREQUENCY VARIATION FOR POWER MINIMIZATION OF PIPELINED CIRCUITS
5.1 Pipelining
5.2 Representation of loops
5.3 Loop Pipelining: A MFDS Approach
CHAPTER 6 SUMMARY AND DIRECTIONS FOR FUTURE RESEARCH
6.1 Summary
6.2 Discussion
6.3 Simulation Results
6.3.1 Non-Pipelined Cases
6.3.1.1 Dot Format Benchmarks Result
6.3.1.2 Pipelined Case
6.4 Open Problems and Road Map for Future Design
6.4.1 Partition and Scheduling for Power Minimization
6.4.2 Control Path Synthesis
BIBLIOGRAPHY
APPENDIX 1 C++ CODES AND DOCUMENTATION
A1.1 Codes developed for DOT and STG Benchmarks
APPENDIX 2 LIST OF ACRONYM USED IN THE DOCUMENTATION
APPENDIX 3 BENCHMARKS - A DISCUSSION
A3.1 DOT format
A3.2 Standard Task Graphs (STG)
VITA
LIST OF FIGURES
Figure 1.1 Levels in VLSI System Design
Figure 1.2 Design Problems - A Classification
Figure 1.3 High-Level Synthesis Process/ Steps involved in High-Level Synthesis
Figure 2.2 Power Algorithms - A Classification
Figure 3.1 Classification of Scheduling Algorithms
Figure 3.2 a: Pseudo code for ASAP algorithm
Figure 3.2 b: Unscheduled Data Flow Graph, c: ASAP Scheduled Graph
Figure 3.3 a: Pseudo code for ALAP algorithm
Figure 3.3 b: Unscheduled Data Flow Graph, c: ALAP scheduled Graph
Figure 3.4 Pseudo code for FDS Algorithm
Figure 3.5 Representation of Operations in Search Space
Figure 4.1 Tree Formation showing the search path to determine the optimal voltage and frequency
Figure 4.2 a: Unscheduled Data Flow Graph b: ASAP scheduled Data Flow Graph
Figure 4.3 a: ALAP Schedule of Operations
Figure 4.4 Grouping of Nodes that may occur in a given c-step
Figure 4.5 Final Scheduled Data Flow Graph
Figure 5.1 Loop Representation by Data Dependence Graph and its equivalent pseudocode
Figure 5.2 Pipelined Data Flow Graph (k=2)
Figure 6.1 Voltage Varied Power
Figure 6.2 Voltage and Frequency Varied Power
Figure 6.3 Voltage Varied Power
Figure 6.4 Voltage and Frequency Varied Power
Figure 6.5 Power Variation for a two-staged Pipelined Operations
Figure A3.1 Dot format representation and its equivalent graph
Figure A3.2 STG Benchmark
ACKNOWLEDGEMENTS
First, I must thank God for his grace and mercy, his loving kindness and his many
blessings too numerous to count. I also extend my gratitude to my parents, who prayed
unceasingly for me while I was here. I am indebted to my advisor and thesis chair, Dr.
Muthukumar Venkatesan, for his valuable guidance and his valuable time. He was always
there to listen and to help me with research and personal problems. His common sense
and uncommon wit have made my years at UNLV enjoyable. It's a privilege to have
worked with him.
I would like to thank Professors Henry Selvaraj, Yingtao Jiang and Laxmi P. Gewali
for being on my thesis committee and for providing me with valuable tips which
made this thesis a success.
It was a blessing for me to work in such a wonderful environment provided by the
Department of Electrical and Computer Engineering at UNLV. I thank the administrative
staff of the Department, in particular Lavinia B. Alldredge, who made my days easier at
UNLV. I would like to extend thanks to my colleagues at the University of Nevada, Las
Vegas for their unconditional support. Finally, I would like to thank all
those who were involved either directly or indirectly in the completion of this thesis. I am
too grateful to say anything else.
CHAPTER 1 
INTRODUCTION
Over the past few decades, the electronics industry has achieved phenomenal
growth, mainly due to rapid advances in integration technologies. It is now
almost impossible to find a place without computers, televisions, microwave
ovens or other basic electronic equipment. All these devices must perform
complex tasks at high speed and be as compact as possible.
Performing complex functions at high speed requires more functional units to be
embedded on a single chip. As semiconductor technologies move towards finer
geometries, the available performance and functionality grow increasingly complex,
forcing a high degree of hardware integration on a single substrate to carry out
complex functions. This impels designers to fabricate Integrated Circuits (ICs) with
different levels of integration, with each level representing the total number of functional
units (FUs) on a single chip. The number of logic gates in a monolithic chip, which
measures the level of integration, has been rising steadily for almost two decades, mainly
due to rapid progress in processing and interconnect technology. Very Large Scale
Integration (VLSI), and more recently Ultra Large Scale Integration (ULSI), is the present
level of computer microchip miniaturization and refers to microchips containing millions
of transistors on a single chip. The other levels of integration include LSI (Large-Scale
Integration), which consists of microchips containing thousands of transistors, MSI
(Medium-Scale Integration), with microchips containing hundreds of transistors, and SSI
(Small-Scale Integration), with tens of transistors packed in a single chip. This process of
assembling millions of FUs on a single chip to perform complex functions is generally
known as design. The design process involves various steps or levels, each
involving many design considerations. The levels involved in a system design are shown
in Figure 1.1.
[Figure: boxes labeled Specification, Functional Design, RTL Design, Logic Design, Circuit Design and Layout Design]
Figure 1.1: Levels in VLSI System Design
The Algorithmic or Behavioral Level, generally associated with High-Level Synthesis
(HLS), is the step where the architecture or function of the hardware being designed is
specified. This level may also be called the Functional Level (FL). The Logic Level
(LL) is the level where the system is specified in terms of Boolean equations. The
Register Transfer Level (RTL) is the level where the system is described using registers
and memories, and the final level of architecture is the Circuit Level (CL), which
describes the system by actual circuits. Each of these design levels has some
design problems associated with it. The common design problems may be categorized
as shown in Figure 1.2.
[Figure: design problems classified into area, power (static, dynamic), cost (RC, NRC), speed and reliability]
Figure 1.2: Design Problems - A Classification.
The next issue to be considered is the reliability of the system under design. It is
possible that the system designed may not work for all cases. This forces the designers to
change their design to meet all the functions and hence to consider reliability issues in
depth. The speed of operation of the circuit must also be high enough to complete the
entire task within the allotted time. A design that results in a very large circuit is not a
good design, which calls for the designers to optimize the area of the chip.
The last important design consideration is the power consumed by the circuit. Of all the
design considerations, the power consumed by the circuit is generally considered the most
important and is given a high priority. This is because the size of devices needs to be
kept to a minimum in most cases. If the power consumed is large, the size of the battery
needed to operate the device increases, which adds to the weight of the device.
With various design problems in hand, the level at which each of these problems is
addressed must be chosen with utmost care. Various studies on the
choice of level have pointed out that the higher the level, the better and
faster the approach. This suggests that the level which specifies
the system or its behavior is a good choice to work at. Combining the design
problem with the synthesis level and choosing the better features of both directed this
research towards low-power design of circuits at high-level synthesis.
1.1 High Level Synthesis 
Synthesis is generally divided into three categories: high-level synthesis, logic
synthesis and layout synthesis [14]. High-Level Synthesis is the process of converting
a behavioral specification of a system into a set of register-transfer level components,
such as ALUs, registers, bus controllers and interface components. A behavioral
specification defines the functionality of the system by algorithms, high-level
languages or Hardware Description Languages (HDL) such as VHDL or Verilog. The
constraints considered in HLS include time delays, area upper bound, pin
count, power consumption, reliability, testability and cost. The goals of synthesis
are to maximize system speed and reliability and to minimize cost,
pin count, power consumption and design time. High Level Synthesis consists of five
major synthesis tasks:
(i) System representation and compilation.
(ii) High-Level transformation.
(iii) Scheduling.
(iv) Allocation and binding.
(v) Output generation.
Figure 1.3 shows an overall high-level synthesis process.
[Figure: flow with boxes labeled HDL, Compilation, Data Flow Graph, Transformation, Scheduling, Allocation/Binding, Output Generation and RTL Description]
Figure 1.3: High-Level Synthesis Process/ Steps involved in High-Level Synthesis.
A system may be represented using Hardware Description Languages such as VHDL
or Verilog. This process is known as Design Entry. Compilation refers to the process of
converting the VHDL code into an intermediate form that may be manipulated easily. A
Control Data Flow Graph (CDFG) or Data Flow Graph (DFG) may be used as the
intermediate format. This thesis operates on the DFG, obtained by compiling a VHDL
description using analyzers like VAN (VHDL Analyzer). VAN parses the VHDL
description into control and data flow graphs, and the DFG is used for all subsequent
operations, including scheduling and allocation.
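As an illustration of how a DFG can drive the later steps, the following is a minimal C++ sketch, not the thesis implementation. The `DfgNode` type and the `asapSchedule` function are hypothetical names, and the sketch assumes that every operation takes one control step and that nodes are listed in topological order. It assigns each operation the earliest control step permitted by its data dependencies.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical DFG node: an operation plus the indices of the operations
// whose results it consumes (its predecessors in the graph).
struct DfgNode {
    char op;                         // operation label, e.g. '+' or '*'
    std::vector<std::size_t> preds;  // indices of predecessor nodes
};

// Earliest control step (c-step) of each node. Assumptions: unit-delay
// operations, c-steps numbered from 1, predecessors listed before successors.
std::vector<int> asapSchedule(const std::vector<DfgNode>& dfg) {
    std::vector<int> cstep(dfg.size(), 1);
    for (std::size_t i = 0; i < dfg.size(); ++i) {
        for (std::size_t p : dfg[i].preds) {
            assert(p < i && "nodes must be in topological order");
            // A node can start only after all of its predecessors finish.
            cstep[i] = std::max(cstep[i], cstep[p] + 1);
        }
    }
    return cstep;
}
```

For a four-node graph in which two multiplications feed an addition that in turn feeds a second addition, `asapSchedule` returns the c-steps 1, 1, 2 and 3.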
1.2 A look on the design problem Power 
Power consumption in VLSI has become an important consideration in circuit design
in recent years. In many application domains, low-power circuits are needed in order
to lower packaging and cooling costs and to extend battery life. A number of
techniques for reducing internal power dissipation in low-power circuits have been
proposed. These techniques often focus on reducing the dominant term in the power
dissipation of CMOS digital circuits. The power consumption in a given CMOS circuit
is given by equation 1.1.
$P = C_{sw} \cdot f \cdot V_{dd}^2$ (1.1)
where $P$ denotes the power dissipated in charging and discharging the output capacitive
load $C_{sw}$, $V_{dd}$ is the supply voltage and $f$ is the frequency. One way to reduce power
consumption is to lower the total load capacitance $C_{sw}$. In general, there are different
types of functional modules which perform the same computation but have different
areas, speeds, capacitive loads and power consumptions. A comparison of various
functional units is shown in Table 1.1 [14]. Consequently, for power-efficient design it is
important to make suitable choices among the different types of functional modules
available. Another way to reduce power consumption is to reduce $f$ by inhibiting
unnecessary circuit switching activity. Recent studies indicate that the clock signal
drivers in digital circuits consume between 15 and 45 percent of the total power
[14]. This is because clock signals have a high activity factor (the number of
transitions from 0 → 1 or 1 → 0). However, the clock signals are not needed all the time.
Some portions of the circuit will not compute anything for a period of time, and the clock
signal will not be used at all during that period.
Table 1.1: Delay and Average Power of Functional Units.
Functional Unit            Delay (ns)    Power (mW)
Ripple Carry Adder         20.0          22.7
Carry Look Ahead Adder     10.0          37.3
Booth Multiplier           160.0         84.0
Array Multiplier           100.0         295.6
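One way to compare the units in Table 1.1 is by energy per operation. The sketch below is illustrative only: it assumes that the energy of one operation can be approximated as average power multiplied by delay, and the names `FunctionalUnit`, `energyPj` and `table11` are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>

// One row of Table 1.1.
struct FunctionalUnit {
    std::string name;
    double delayNs;  // delay in nanoseconds
    double powerMw;  // average power in milliwatts
};

// Approximate energy per operation in picojoules (ns * mW = pJ).
// Assumption: energy is taken as average power times delay.
double energyPj(const FunctionalUnit& fu) {
    assert(fu.delayNs >= 0.0 && fu.powerMw >= 0.0);
    return fu.delayNs * fu.powerMw;
}

// The four functional units listed in Table 1.1.
std::vector<FunctionalUnit> table11() {
    return {
        {"Ripple Carry Adder", 20.0, 22.7},
        {"Carry Look Ahead Adder", 10.0, 37.3},
        {"Booth Multiplier", 160.0, 84.0},
        {"Array Multiplier", 100.0, 295.6},
    };
}
```

Under this approximation the Carry Look Ahead Adder uses about 373 pJ per addition against about 454 pJ for the Ripple Carry Adder, and the Booth Multiplier about 13,440 pJ against about 29,560 pJ for the Array Multiplier. Lower average power therefore does not by itself imply lower energy per operation, which is one reason module selection has to weigh delay and power together.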
If we regenerate the clock signals so that the clock is on only when it is
used, a process known as clock gating, we can save the power consumed in driving
clock signals. In a digital circuit, functional modules are active all the time even if they
are not used for some period of time. Consequently, some functional modules generate
unused output. In a CMOS circuit, clocked functional modules consume power whenever
the clock is on, because some switching activity occurs even though the output will not be
used. So reducing the clock frequency or making suitable modifications to the clock
frequency can achieve a significant amount of power reduction.
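The effect of capacitance and frequency in equation 1.1 can be checked numerically. The following C++ sketch simply evaluates the product of switched capacitance, frequency and the square of the supply voltage; the function name `dynamicPower` and the example values are assumptions made for illustration, not data from the thesis.

```cpp
#include <cassert>

// Equation 1.1: switching power in watts for a switched capacitance in
// farads, a clock frequency in hertz and a supply voltage in volts.
double dynamicPower(double cswFarads, double freqHz, double vddVolts) {
    assert(cswFarads >= 0.0 && freqHz >= 0.0 && vddVolts >= 0.0);
    return cswFarads * freqHz * vddVolts * vddVolts;
}
```

With hypothetical values of 50 pF, 100 MHz and 5 V, `dynamicPower(50e-12, 100e6, 5.0)` evaluates to 0.125 W. Halving either the frequency or the capacitance halves that figure, since the power is linear in both.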
1.3 Structure of the thesis 
This thesis spans six chapters, and the remainder is structured as follows.
Chapter 2 presents a brief survey of some existing power optimization
algorithms in HLS. Chapter 3 presents definitions used in High Level Synthesis along
with a mathematical formulation of the problem under consideration. Chapter 4 deals
with the power minimization of non-pipelined operations using multiple voltages and
frequency variation. In Chapter 5, we present a polynomial time algorithm for
minimizing the dynamic power consumption of pipelined circuits. The algorithm
employs constructive Force-Directed Scheduling to schedule all
operations and uses a modified force for assigning voltages and frequencies to each
operation. Chapter 6 forms the conclusion of this thesis.
CHAPTER 2
LITERATURE REVIEW OF POWER OPTIMIZATION ALGORITHMS AT HIGH-
LEVEL SYNTHESIS 
The focus of this chapter is to present a detailed review of the existing algorithms
in the area of power optimization. Power optimization is an important task, and power is
considered one of the most recent and most dominant factors in VLSI circuit design.
Before presenting the methods available for power optimization, the factors affecting power
consumption and other relevant details on power need to be understood. There are two
main components of power: the static component and the dynamic component.
Static power comes from circuit design techniques that include voltage bias generators
and any DC paths through active devices. Leakage power is primarily due to
sub-threshold leakage currents that result from reduced threshold voltages that prevent
transistors from turning completely off. Dynamic power is dissipated when a device is
switching. Switching power comes from charging and discharging capacitive loads.
Short-circuit power is dissipated due to the current that conducts when both the n-channel and
p-channel transistors are momentarily on at the same time, and is dependent on the
switching frequency, input slew rate and the difference between the operating and
threshold voltages. Standby power is dissipated when a device is not switching. The
complete power consumption in a CMOS circuit is given by equation 2.1.
$P_{total} = P_{static} + P_{short} + C_{sw} \cdot f \cdot V_{dd}^2 + P_{glitching}$ (2.1)
2.1 Static Power vs Dynamic Power 
The static power, $P_{static}$, is the power consumed through leakage currents, and it
occurs even when the circuit does not operate. This power is very small for CMOS
circuits, almost negligible. $P_{short}$ occurs with every gate output switching, when the two
output transistors of a CMOS gate conduct at the same time. With a good design and
technology this power can be kept under 10% of the dynamic component of power. The
third term in equation 2.1 is the switching power, and it depends on the clock
frequency ($f$), the supply voltage ($V_{dd}$) and the switching capacitance ($C_{sw}$). The last term
is the power dissipation due to glitching. Spurious transitions are generally referred to as
glitches, and they are a well-known source of unwanted power dissipation in CMOS
logic. In this thesis the minimization of power due to switching has been given priority.
The entire thesis deals with the $C_{sw} \cdot f \cdot V_{dd}^2$ term. This term states that power consumption is
directly dependent on the frequency of operation ($f$), the switching capacitance ($C_{sw}$), which
depends on the size of the load (wire capacitance, output capacitance of the driver, and input
capacitance of the driven cells), and the square of the operating voltage ($V_{dd}$). It is clear
from the equation that reducing the supply voltage, clock frequency, switching
capacitance or switching activity in the circuit reduces the dynamic component of the
power. A variety of optimization methods targeting each of these four factors have been
explored. Reduction of supply voltage, multiple voltage supplies, reduction of capacitive
loads through gate sizing, and minimization of switching activity by exploiting signal
correlation are just a few. Some of the works available in the literature are discussed in
section 2.2, and a classification of power is shown in Figure 2.1.
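The quadratic dependence on supply voltage can be made explicit with a short worked ratio. The primed symbol and the 5 V and 3.3 V values below are illustrative assumptions, not figures from the thesis.

\[
\frac{P'}{P} = \frac{C_{sw} \cdot f \cdot (V'_{dd})^{2}}{C_{sw} \cdot f \cdot V_{dd}^{2}} = \left(\frac{V'_{dd}}{V_{dd}}\right)^{2},
\qquad
\left(\frac{3.3}{5}\right)^{2} = 0.4356
\]

With the capacitance and frequency unchanged, this reduction of the supply voltage lowers the switching power to roughly 44% of its original value, whereas the same proportional reduction of $f$ or $C_{sw}$ alone would lower it only to 66%.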
[Figure: classification of power into a static component and a dynamic component, including short-circuit power]
Figure 2.1: Classification of Power
2.2 Review of some existing Power Optimization Algorithms 
In [1], Shiue and Chakrabarti present a resource-constrained scheduling scheme and a
latency-constrained scheduling scheme that minimize power consumption for the case
when the resources operate at multiple voltages. The resource-constrained scheduling
scheme reduces power consumption by maximally utilizing resources operating at reduced
voltages while at the same time reducing the latency. The latency-constrained scheduling
scheme reduces the power consumption by assigning as many nodes as possible of the
data flow graph to the resources operating at reduced voltages. Both schemes
consider the effect of switching activity on the power consumption of the Functional
Units (FU) and also the power consumed in the level shifters.
[Figure: power algorithms classified into the non-pipelined case and the pipelined case]
Figure 2.2: Power Algorithms - A Classification.
In [2], Mohanty et al. describe a new data path scheduling algorithm called
Dynamic Frequency Clock Scaling, based on the concept of dynamic frequency clocking.
The algorithm schedules lower frequency operators at earlier c-steps and delays higher
frequency operators to later c-steps. It then regroups some higher frequency operators
with low frequency operators so as to meet the time constraint. In [3], Manzak presents
resource- and latency-constrained scheduling algorithms to minimize power/energy
consumption when resources operate at multiple voltages. The algorithms are based on
efficient distribution of slack among nodes in the data flow graph. The distribution
procedure tries to implement the minimum energy relation derived using the Lagrange
multiplier method in an iterative fashion. It basically consists of two algorithms, a
low-complexity one and a high-complexity one. In [4], Kim presents a transformation technique
called power-conscious loop folding for high-level synthesis of a low-power system. It
focuses on reducing the power consumed by functional units through the decrease of
switching activity in a data-path-dominated circuit containing loops. Kumar et al. present
another algorithm, which performs binding for low power, in [5]. It considers binding of
functional units operating at multiple voltages in a post-scheduling scenario. It minimizes
the power consumption due to switching activities on the physical adders and multipliers. 
This is achieved by gathering the switching activities of the functional units through 
profiling the data flow graph of the design with random input patterns. Cases are 
considered in which all the resources operate at the same voltage, as well as cases in 
which resources operate at possibly different voltages. This is reported as the first work 
on binding in a post-scheduling phase when the resources operate at multiple voltages. 
In [6] Mesman et al. present a constraint-driven approach to loop pipelining and register 
binding. The analysis identifies sequencing constraints between operations in addition to 
the precedence constraints. Without the explicit modeling of these sequencing 
constraints, a scheduler is often not capable of finding a solution that satisfies the timing, 
resource and register constraints. The presented approach results in an efficient method 
of obtaining high quality instruction schedules with low register requirements. In [8] Yoo 
et al. show a scheduling algorithm for pipelined data paths with resource constraints. The 
algorithm first checks, for every operation, whether it can be scheduled in the earliest 
step or in the latest step among its assignable control steps. If it is 
impossible to assign an operation to those steps due to a resource constraint violation, the 
algorithm discards those steps; that is, it reduces the mobility of the operation. 
The scheduling algorithm is iterated until the final schedule is obtained. If the final 
schedule is not obtained even though no operations remain whose mobility can be reduced, 
the current scheduling state is selected. A 16-point FIR filter and a 5th order elliptic 
wave filter were used to illustrate the scheduling algorithm. The algorithms reviewed so 
far dealt with non-pipelined circuit realizations. Synthesis for pipelined circuits was also
accomplished as a part of this thesis. Some of the algorithms that were reviewed are 
presented in section 2.3.
2.3 Algorithms for Pipelined Circuits 
In [7] Marinescu and Rinard present a new approach for automatically pipelining 
sequential circuits. The approach repeatedly extracts a computation from the critical path, 
moves it into a new stage, and then uses speculation to generate a stream of values that 
keep the pipeline full. The newly generated circuits retain enough state to recover from 
incorrect speculations by flushing the incorrect values from the pipeline, restoring the 
correct state, and then restarting the computation. Two extensions to this basic approach 
have also been implemented. The first is stalling, which minimizes circuit area by 
eliminating speculation. The second is forwarding, which increases the throughput of the 
generated circuit by forwarding correct values to preceding pipeline stages. A prototype 
synthesizer based on this approach was also implemented. The experimental results showed 
that, starting with a non-pipelined or insufficiently pipelined specification, this synthesizer 
could effectively reduce the clock cycle time and improve the throughput of the 
generated circuit. Ahmad et al. in [9] show a resource-constrained scheduling algorithm 
based on tabu search for high-level behavioral synthesis of functional pipelines. The 
proposed algorithm, designated as TLS, integrates list scheduling with tabu search. List 
scheduling algorithms have been widely used in high-level synthesis because they are 
very fast and easy to implement, and their low computational complexity 
makes them applicable to large data flow graphs. In list scheduling, the quality of the 
solution is very sensitive to the priority assigned to the operations. The distinct feature of
this technique is that it uses tabu search to find suitable priorities. Tabu search is a 
meta-heuristic that can be superimposed on existing procedures to guide these procedures 
towards searching in more desirable neighborhoods and to prevent them from becoming 
trapped at locally optimal solutions. The proposed algorithm is conceptually very simple, 
easy to implement and very effective in conquering the intractable nature of the 
resource-constrained scheduling problem. The proposed technique can handle multi-cycle 
operations, structural pipelined operations, chained operations and functional pipelining. 
The effectiveness of this algorithm was demonstrated by comparing it against the existing 
scheduling techniques for functional pipelining on a number of benchmark examples 
reported in the literature. In [10] Kim et al. deal with a systematic pipelining method for 
a linear system to minimize power and maximize throughput, given a constraint on the 
number of pipeline stages and a set of resource constraints. The method first retimes 
operations such that as many operations as possible take common operands as their inputs 
and then performs operand sharing based on list scheduling. Experimental results 
show that the proposed approach reduces the power consumption of the functional units 
by more than 20% in the best cases compared to the state-of-the-art pipelining and 
operand sharing techniques. In [11], a pipeline scheme for the efficient realization of a 
complex multiplier is presented. It consists of one conventional multiplier that is 
multiplexed and some small additional circuitry on the boundary. The proposed scheme 
reduces the chip area as well as the interconnections by nearly half compared to a 
conventional complex multiplier. The pipelined multiplier is efficient in terms of chip area 
and interconnects, which reduces the power consumption. In [12] Papaefthymiou presents 
a novel system-level power estimation methodology for electronic designs consisting of 
intellectual property (IP)
components. This methodology relies on analytical output and power macro-models of 
the IP blocks to estimate system dissipation without performing any simulation. For 
circuits without feedback, a sufficient condition is given for the worst-case power 
estimation error to increase only linearly with the length of the IP cascades.
The above section presents some of the important algorithms used in the field of 
power minimization at HLS. Power minimization has also received attention at other levels 
of design. An excellent literature survey on power estimation techniques at the logic and 
lower levels of abstraction can be found in [15, 16].
CHAPTER 3
HIGH LEVEL SYNTHESIS -  DEFINITIONS 
This chapter presents some definitions used in high-level synthesis. Terms such as 
scheduling, allocation and binding are explained in this chapter. A simple overview of 
high-level synthesis is presented in section 3.1.
3.1 High Level Synthesis 
The analysis problem studies the characteristics or behavior of a given circuit 
structure. The synthesis process is the reverse of the analysis process. Synthesis may be 
defined as a process that performs a sequence of transformations to meet a specified 
objective under certain constraints. High-level synthesis starts at the system level and 
proceeds downwards to the Register Transfer Level, Logic Level (LL) and finally Circuit 
Level (CL), each time adding some additional information needed at the next level of 
synthesis. The major functions involved in high-level synthesis are Compilation, Partition, 
Scheduling, Allocation and Control Generation. The first three steps lead to the data-path 
formation and the last two steps lead to the formation of the controller. The main focus 
of this thesis is on the most important step in HLS, operation scheduling.
3.2 Operation Scheduling
Scheduling is the problem of assigning the operations to control steps in order to 
minimize the amount of hardware used. It imposes additional constraints on how the 
operations may be allocated and bound. Scheduling may also be defined as the 
assignment of operations to time frames, usually referred to as control steps (c-steps), 
while satisfying the given constraints and minimizing a given cost function. Scheduling 
algorithms may be classified into various types as shown in the following section.
3.2.1 Classification of Scheduling Algorithms
There are a variety of scheduling algorithms available in the literature. They may be 
broadly classified under the following headings [13].
* Basic Algorithms
o As Soon As Possible Algorithm 
o As Late As Possible Algorithm
* Time Constrained Algorithms
o Force-Directed Scheduling Algorithm 
o Integer-Linear Programming Algorithms 
o Iterative Refinement Algorithms
* Resource Constrained Algorithms
o List-Based Algorithm 
o Static-List Algorithms
* Miscellaneous Algorithms
o Simulated Annealing Algorithms 
o Path-Based Algorithms
o Neural-Net Based Algorithms 
The classification of scheduling algorithms available in the literature is given in 
Figure 3.1.
Figure 3.1: Classification of Scheduling Algorithms.
[The figure groups the scheduling algorithms as Basic (ASAP, ALAP), Time Constrained (FDS, ILP, Iterative), Resource Constrained (List Based, Static List Based) and Miscellaneous (Simulated Annealing, Path Based, Neural Net).]
3.2.1.1 As Soon As Possible Scheduling (ASAP)
The ASAP algorithm [13] starts with the highest nodes (those that have no parents or 
predecessor nodes) in the DFG and assigns time steps in increasing order as it proceeds 
downwards. It follows the simple rule that a successor node can execute only after its 
parents have executed. This algorithm gives the fastest schedule possible: 
it schedules in the least number of control steps, but it never takes into account the 
resource constraints. A pseudocode for the ASAP scheduling algorithm is shown in
Figure 3.2a. Figure 3.2b shows an unscheduled DFG while Figure 3.2c shows the ASAP 
scheduled DFG.
ASAP(Gs(V, E))
{
    Schedule v_0 by setting t_0^S = 1;
    Repeat {
        Select a vertex v_j whose predecessors are all scheduled;
        Schedule v_j by setting t_j^S = max { t_i^S + d_i },  i : (v_i, v_j) ∈ E;
    } until (v_n is scheduled);
    Return (t^S);
}
Figure 3.2a: Pseudo code for ASAP algorithm.
where Gs(V, E) represents a DFG with V = {v_0, v_1, ..., v_n} as its nodes and 
E = {e_1, e_2, ..., e_m} as its edges connecting the nodes. t_0^S = 1 sets the start time of 
node v_0, D = {d_1, d_2, ..., d_n} denotes the respective delays of the nodes, v_j represents 
the current node under consideration, v_n represents the last available node and t_j^S is 
the start time of node v_j.
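The ASAP rule above can be sketched in Python as follows. This is a minimal illustration, not the thesis implementation: the dictionary and edge-list input format, the function name `asap_schedule` and the four-node example graph are all assumptions made here for the example.

```python
from collections import defaultdict

def asap_schedule(delays, edges):
    """Return the ASAP start step of every node of a DFG.

    delays: dict node -> delay in control steps (hypothetical input format)
    edges:  list of (predecessor, successor) pairs
    """
    preds = defaultdict(list)
    for u, v in edges:
        preds[v].append(u)

    start = {}
    remaining = set(delays)
    while remaining:
        # A node is ready once all of its predecessors are scheduled.
        ready = [v for v in remaining if all(p in start for p in preds[v])]
        if not ready:
            raise ValueError("graph contains a cycle")
        for v in ready:
            # Source nodes start at step 1; any other node starts as soon
            # as its slowest predecessor has finished.
            start[v] = max((start[p] + delays[p] for p in preds[v]), default=1)
            remaining.remove(v)
    return start

# Hypothetical DFG: a and b feed c, and c feeds d (all unit delay).
example_delays = {"a": 1, "b": 1, "c": 1, "d": 1}
example_edges = [("a", "c"), ("b", "c"), ("c", "d")]
asap_result = asap_schedule(example_delays, example_edges)
```

In this example the two source nodes land in c-step 1, `c` in c-step 2 and `d` in c-step 3, which is the length of the critical path.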
Figure 3.2 b: Unscheduled Data Flow Graph, c: ASAP Scheduled Graph.
3.2.1.2 As Late As Possible Scheduling (ALAP)
The ALAP algorithm [13] works in exactly the same way as the ASAP algorithm, 
except that it starts at the bottom of the DFG and proceeds upwards. This algorithm gives 
the slowest possible schedule, which takes the maximum number of control steps. However, 
this doesn't necessarily reduce the number of functional units used. The pseudocode for 
the ALAP scheduling algorithm is given in Figure 3.3a. Figure 3.3b shows an unscheduled 
DFG and Figure 3.3c shows the ALAP scheduled DFG.
ALAP(Gs(V, E), λ)
{
    Schedule v_n by setting t_n^L = λ + 1;
    Repeat {
        Select a vertex v_j whose successors are all scheduled;
        Schedule v_j by setting t_j^L = min { t_i^L } - d_j,  i : (v_j, v_i) ∈ E;
    } until (v_0 is scheduled);
    Return (t^L);
}
Figure 3.3a: Pseudo code for ALAP algorithm.
where Gs(V, E) represents a DFG with V = {v_0, v_1, ..., v_n} as its nodes and 
E = {e_1, e_2, ..., e_m} as its edges connecting the nodes. t_n^L denotes the completion 
time of node v_n, D = {d_1, d_2, ..., d_n} denotes the respective delays of the nodes, v_j 
represents the current node under consideration, v_n represents the last available node 
and t_j^L is the completion time of node v_j.
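A corresponding Python sketch of ALAP scheduling is given below. It is an illustration under stated assumptions rather than the thesis code: the function name `alap_schedule`, the input format, the `latency` parameter (the number of available control steps) and the example graph are hypothetical, and the sketch returns latest start steps, with sink nodes required to finish by the last step instead of using an explicit sink node.

```python
from collections import defaultdict

def alap_schedule(delays, edges, latency):
    """Return the latest start step of every node for a given latency bound.

    delays:  dict node -> delay in control steps (hypothetical input format)
    edges:   list of (predecessor, successor) pairs
    latency: number of control steps available for the whole schedule
    """
    succs = defaultdict(list)
    for u, v in edges:
        succs[u].append(v)

    start = {}
    remaining = set(delays)
    while remaining:
        # A node is ready once all of its successors are scheduled.
        ready = [v for v in remaining if all(s in start for s in succs[v])]
        if not ready:
            raise ValueError("graph contains a cycle")
        for v in ready:
            # Sink nodes must finish by the last step; any other node must
            # finish before its earliest successor starts.
            deadline = min((start[s] for s in succs[v]), default=latency + 1)
            start[v] = deadline - delays[v]
            remaining.remove(v)
    return start

# Hypothetical DFG: a and b feed c; c and e feed d (all unit delay).
alap_delays = {"a": 1, "b": 1, "c": 1, "d": 1, "e": 1}
alap_edges = [("a", "c"), ("b", "c"), ("c", "d"), ("e", "d")]
alap_result = alap_schedule(alap_delays, alap_edges, latency=3)
```

Here node `e` could start in c-step 1 under ASAP but is pushed to c-step 2 under ALAP, so it is the only node with non-zero mobility.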
Figure 3.3 b: Unscheduled Data Flow Graph, c: ALAP scheduled Graph.
3.2.1.3 Force-Directed Scheduling (FDS)
Force-directed scheduling (FDS) is a heuristic method [13] and a very popular 
technique for time-constrained scheduling. The main goal of this algorithm is 
to reduce the total number of FUs used. The algorithm achieves its goal by uniformly 
distributing the operations of the same type over the available control steps. The 
pseudocode for FDS is given in Figure 3.4.
FDS
{
    while (unscheduled operations remain)
    {
        Step 1: Evaluate time frames
            Step 1a: Compute ASAP time
            Step 1b: Compute ALAP time
        Step 2: Update distribution graph
        Step 3: Calculate self force (SF)
        Step 4: Add successor and predecessor forces
        Step 5: Schedule operation with least force; set time frame
                to the selected c-step.
    }
    Return ( );
}
Figure 3.4: Pseudo code for FDS Algorithm.
3.2.1.4 Integer-Linear Programming (ILP)
The integer linear programming (ILP) [13] method tries to find an optimal schedule 
using a branch-and-bound search algorithm. It involves some amount of backtracking, 
i.e., decisions made earlier are changed later on. A simplified formulation of the ILP 
method is given below.
The mobility range $E_i \le j \le L_i$ of each operation is calculated, where $E_i$ and
$L_i$ are the ASAP and ALAP values of operation $i$ respectively. The scheduling problem in ILP is
defined by the following equations:
$$\min \sum_{k=1}^{m} (C_k \cdot N_k) \quad \text{while} \quad \sum_{E_i \le j \le L_i} x_{i,j} = 1 \ \text{for} \ 1 \le i \le n,$$
where $n$ is the number of operations, $1 \le k \le m$ operation types are available, $N_k$ is 
the number of FUs of operation type $k$ and $C_k$ is the cost of each FU. Each $x_{i,j}$ is 1 if 
the operation $i$ is assigned to control step $j$ and 0 otherwise. The equations that enforce 
the resource and data dependency constraints are:
$$\sum_{i=1}^{n} x_{i,j} \le N_k \quad \text{and} \quad ((q \cdot x_{j,q}) - (p \cdot x_{i,p})) \le -1, \ p \le q$$
where $p$ and $q$ are the control steps assigned to the operations $x_i$ and $x_j$ respectively. 
The size of the ILP formulation increases rapidly with the number of control steps: for a 
unit increase in the number of control steps there are $n$ additional $x$ variables. 
Therefore the execution time of the algorithm also increases rapidly. In practice the ILP 
approach is applicable only to very small problems. If the backtracking involved in the ILP 
method could be eliminated, a considerable amount of computation time could be saved. 
Heuristic methods do this by scheduling one operation at a time based on some 
criterion.
3.2.1.5 Iterative Rescheduling (IR)
The iterative rescheduling method [13], based on the graph-bisection heuristic 
proposed by Kernighan and Lin, proceeds by rescheduling one operation at a time. Any 
initial schedule may be taken as the starting point. Each operation is then moved into an 
earlier or later step while respecting the data dependency constraints derived from the 
initial schedule.
3.2.1.6 List Scheduling
List-based scheduling [13] is a generalization of the ASAP algorithm with the 
inclusion of resource constraints. A list-based algorithm maintains a priority list of ready 
nodes, i.e., nodes whose predecessors have already been scheduled. The priority list is 
sorted with a priority function that resolves any resource contention. In 
each iteration, operations with higher priority are scheduled first and lower priority 
operations are deferred to later control steps. Scheduling an operation in a control step 
makes its successor operations ready, and these are then added to the priority list.
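The list-based procedure described above can be sketched as follows. This is a simplified illustration, not the algorithm of [13] in full: it assumes unit-delay operations, uses a caller-supplied priority value (for example the ALAP step, smaller meaning more urgent), and the names `list_schedule`, `op_type`, `limits` and `priority` are hypothetical.

```python
from collections import defaultdict

def list_schedule(op_type, edges, limits, priority):
    """Resource-constrained list scheduling for unit-delay operations.

    op_type:  dict operation -> resource type, e.g. "add" or "mul"
    edges:    list of (predecessor, successor) pairs
    limits:   dict resource type -> number of available functional units
    priority: dict operation -> priority value (smaller = more urgent)
    """
    if any(limits.get(kind, 0) < 1 for kind in set(op_type.values())):
        raise ValueError("every used resource type needs at least one unit")
    preds = defaultdict(list)
    for u, v in edges:
        preds[v].append(u)

    step_of = {}
    step = 1
    while len(step_of) < len(op_type):
        # Ready list: unscheduled operations whose predecessors all
        # completed in an earlier control step.
        ready = [v for v in op_type
                 if v not in step_of
                 and all(p in step_of and step_of[p] < step for p in preds[v])]
        if not ready:
            raise ValueError("graph contains a cycle")
        ready.sort(key=lambda v: priority[v])
        used = defaultdict(int)
        for v in ready:
            kind = op_type[v]
            if used[kind] < limits[kind]:
                step_of[v] = step          # a unit of this type is free
                used[kind] += 1
            # otherwise the operation is deferred to a later control step
        step += 1
    return step_of

# Hypothetical example: three multiplications, one addition, two multipliers.
ls_types = {"m1": "mul", "m2": "mul", "m3": "mul", "a1": "add"}
ls_edges = [("m1", "a1"), ("m2", "a1")]
ls_result = list_schedule(ls_types, ls_edges,
                          limits={"mul": 2, "add": 1},
                          priority={"m1": 1, "m2": 1, "m3": 2, "a1": 2})
```

With only two multipliers, `m3` loses the contention in the first control step and is deferred to the second, where it runs alongside the addition.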
The other scheduling algorithms are combinations of the basic algorithms in some form 
and hence are not discussed here. However, each one has its own application. A detailed 
mathematical formalization of the problem under consideration is presented in section 
3.3.
3.3 Mathematical Formalization of the problem 
3.3.1 Basic Notations:
Given: a set of tasks $T = \{T_1, T_2, \ldots, T_m\}$, a set of operations $O = \{O_1, O_2, \ldots, O_n\}$ and a 
resource library $R = \{r_k\}$.
The relation $T_m \rightarrow O_n$ defines the operation $O_n$ capable of executing the task $T_m$, and 
$O_n \in r_k$ defines the functional unit $r_k$ from the library that is capable of executing the 
operation $O_n$. A partial order $t_i < t_k$ on $T$, where $t_i < t_k$ implies that $t_i$ precedes $t_k$ 
(precedence constraints), is defined. Scheduling may then be mathematically defined as a mapping
$$\sigma : T \rightarrow \mathbb{N} \qquad (3.1)$$
such that
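$$|\{t \in T : \sigma(t) = CS,\ t \rightarrow o,\ o \in r_k\}| \le |r_k| \qquad (3.2)$$
and $CS \in \{1, 2, \ldots, D\}$
and $$t_i < t_k \Rightarrow \sigma(t_i) < \sigma(t_k) \Rightarrow \sigma(o_i) < \sigma(o_k) \qquad (3.3)$$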
where $D$ denotes the overall deadline and $|r_k|$ is the number of Functional Units
of type $k$. If unlimited FUs are available, $D$ corresponds to the minimum schedule
length. The path in the DFG that contributes to this value of $D$ is defined 
as the critical path; the critical path therefore dominates the total execution time 
required for the entire system. If $N_r(o)$ is the minimal number of resources of type $r$ 
required to execute the scheduled data flow graph $\sigma(o)$, the goal is to minimize
$$\sum_k W(r_k) \cdot N_r(o) \qquad (3.4)$$
where $W(r_k)$ is the power cost of each FU in library $R$ and $N_r(o)$ is the total resource 
used for the schedule. Now, the goal is to select $r_k(v_i)$ and $r_k(f_i)$ such that the 
following objective function
$$\min[W(r_k)] = \min[r_k(v_i^2) \cdot r_k(f_i)] \qquad (3.5)$$
is satisfied. Here, $v_i$ is the operating voltage of the Functional Unit (FU) under 
consideration and $f_i$ is the operating clock frequency at which that FU is operated. 
$N_r(o)$ is obtained by performing the Modified Force Directed Scheduling (MFDS) and 
$W(r_k)$ is computed from the developed algorithm.
Force Directed Scheduling (FDS)
Let $M_i$ be the mobility of each operation $O_i$.
$$M_i = T_{alap}^{i} - T_{asap}^{i} \qquad (3.6)$$
where $T_{alap}^{i}$ is the ALAP time of the operation $O_i$ and $T_{asap}^{i}$ is the ASAP time of the 
operation $O_i$. Force may be given as
$$SF(\sigma(O_i) = CS) = \Pi_k(CS) - \frac{1}{\Delta T(O_i)} \sum_{CS' = \sigma(O_i)_{asap}}^{\sigma(O_i)_{alap}} \Pi_k(CS') \qquad (3.7)$$
where $O_i$ is the operation of type $k$ at control step $S_j$ and $\Pi_k(S_j)$ is the probability of
occurrence of the operation $O_i$ in the control step $S_j$.
$$\Delta T(O_i) = \sigma(O_i)_{alap} - \sigma(O_i)_{asap} + \Delta(O_i) \qquad (3.8)$$
where $\Delta(O_i) < 1$ denotes FUs with faster operating speed, $\Delta(O_i) = 1$ denotes 
single-cycle operations and, for multi-cycle operations, $\Delta(O_i)$ denotes the execution 
time. The Probability Distribution $PD(CS)$ of a type of operation $O_i$ may be given as:
$$PD(CS) = \sum_{O_i} P_k(CS) \qquad (3.9)$$
where $P_k(CS)$ $(= 1/p_i)$ is the probability of occurrence of the operation $O_i$ in the c-step 
$CS$, $O_{alap}$ is the ALAP time of the operation $O_i$ and $O_{asap}$ is the ASAP time of the 
operation $O_i$. $p_i$ denotes the probability of occurrence of the operation $O_i$ in the c-step $i$.
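The quantities used in force-directed scheduling can be illustrated with a short Python sketch. It follows the standard textbook form for single-cycle operations (uniform probability over the time frame, self force equal to the distribution-graph value at the chosen step minus its average over the frame) and is an assumption-laden illustration, not the MFDS implementation of this thesis. The function names and the three-operation example are hypothetical.

```python
def distribution_graph(frames, op_type, kind, n_steps):
    """Expected number of operations of type `kind` in every c-step.

    frames: dict operation -> (asap_step, alap_step), single-cycle operations
    """
    dg = {cs: 0.0 for cs in range(1, n_steps + 1)}
    for op, (lo, hi) in frames.items():
        if op_type[op] != kind:
            continue
        prob = 1.0 / (hi - lo + 1)      # uniform over the time frame
        for cs in range(lo, hi + 1):
            dg[cs] += prob
    return dg

def self_force(dg, frame, cs):
    """Self force of fixing an operation with time frame `frame` at step `cs`."""
    lo, hi = frame
    width = hi - lo + 1                  # mobility + 1 for single-cycle ops
    average = sum(dg[s] for s in range(lo, hi + 1)) / width
    return dg[cs] - average

# Hypothetical time frames (ASAP, ALAP) for three multiplications.
fds_frames = {"m1": (1, 1), "m2": (1, 2), "m3": (2, 3)}
fds_types = {"m1": "mul", "m2": "mul", "m3": "mul"}
dg_mul = distribution_graph(fds_frames, fds_types, "mul", n_steps=3)
force_m2_step1 = self_force(dg_mul, fds_frames["m2"], 1)
force_m2_step2 = self_force(dg_mul, fds_frames["m2"], 2)
```

The force for placing `m2` in c-step 2 is negative and the force for c-step 1 is positive, so the heuristic prefers c-step 2, which evens out the multiplier usage.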
Formulation for $W(r_k)$
From equations (3.4) and (3.5)
(3.10)
= (3.11)
where T, V and F denote the proportionality constant, operating voltage and clock frequency 
respectively. The aim is to minimize the $r_k(v_i)$ and $r_k(f_i)$ terms in equation (3.5). The 
voltages available for supply are $V_1$ and $V_2$, as $V_3$ is used for 
operations on non-critical paths, and the available frequencies are $F = \{f_1, f_2\}$. The set 
of operations is $\{v_1, v_2, \ldots, v_n\}$.
Each operation in the set may take either of two voltages and either of two 
frequencies. This gives four possible combinations for each node. The nodes with all 
possible combinations of voltage and frequency form the branches of the tree as shown 
in Figure 3.5. Each node in the tree is represented as a tuple $<N_i, V_n, F_m, C_j>$, where 
$N_i$ is the current node, $V_n$ and $F_m$ are the voltage and frequency assigned to the node 
and $C_j$ is the c-step range. The notation explained above is used to construct the tree 
(explained in chapter 4), and optimal voltages and frequencies are assigned to the nodes 
(operations) by determining the best depth first search traversal with backtracking. This 
ensures that $W(r_k)$ is minimal.
Figure 3.5: Representation of Operations in Search Space.
3.4. Allocation and Binding 
Allocation is the task of assigning operations onto available functional unit types 
(available in the target technology). This step usually follows the operation scheduling. 
Hence, allocation may also be defined as the assignment of operations to hardware, 
possibly given a schedule, given constraints and minimizing a cost function. Allocation
though closely intertwined with scheduling, involves partitioning of the intermediate 
representation with respect to space (hardware resources), which is also known as spatial 
mapping. It basically does the following operations:
* Map operations onto functional units
* Map variables and constants onto registers and memories
* Map data transfers onto interconnection units
It may sometimes be split into two parts as follows:
* Unit allocation or unit selection
* Unit binding
The next step, which usually follows the operation allocation, is the operation 
binding. Binding assigns operations to specific instances of unit types in order to 
minimize the interconnection cost. Usually the processes of allocation and binding are
carried out simultaneously and in some cases may be considered as a single step.
CHAPTER 4
MULTIPLE VOLTAGE SCHEME WITH FREQUENCY VARIATION FOR POWER 
MINIMIZATION OF NON-PIPELINED OPERATIONS.
4.1 Design Object
The major VLSI design and research efforts until now have been focused on 
optimizing speed and area. However, with the advent of portable and mobile 
communication and computing services, the minimization of power consumption has 
become a major factor in VLSI design. Clearly, power efficient synthesis techniques need 
to be developed for power minimization. Fortunately, the increasing density of VLSI 
systems, due to sub-micron feature size scaling and high-density packaging such as 
multi-chip modules, has enabled the development of a high-level strategy, which can be 
used to trade off area, speed and power for a fixed throughput.
4.2 Power Estimation 
To evaluate the power dissipation in a data path, the hardware mapping and compilation 
steps that convert a data flow graph into its corresponding layout need to be done. Thus, it 
is impractical to try to evaluate the exact power dissipation at the HLS stage. Rather, a 
power estimation model must be developed. The total power is given by [14]
$$P_{total} = P_{static} + P_{short} + C_{sw} \cdot f \cdot V_{dd} \cdot V_{dd} + P_{glitching} \qquad (4.1)$$
Here, for simplicity it is assumed that all FUs have the same $V_{dd}$. The total capacitance is 
given by:
$$C_{total} = C_{fu} + C_{reg} + C_{ctrl} + C_{inter} \qquad (4.2)$$
where $C_{fu}$ is the estimated capacitance due to the FUs, $C_{reg}$ is the estimated capacitance 
due to registers, $C_{ctrl}$ is the estimated capacitance due to the control logic and $C_{inter}$ is the 
capacitance due to the interconnections. The capacitance due to the FUs is given by:
$$C_{fu} = \sum_{j=1}^{n} f_j \cdot C_j \qquad (4.3)$$
where $n$ is the total number of FU types, $f_j$ is the number of accesses of a type-$j$ FU and $C_j$ 
is the average capacitance of each type-$j$ FU, which is a given constant. The capacitance 
due to the registers is expressed as given in equation 4.4
$$C_{reg} = f_r \cdot C_r \qquad (4.4)$$
where $f_r$ is the number of register accesses (read/write) and $C_r$ is a constant corresponding 
to the register capacitance.
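The capacitance model of equations 4.2 to 4.4 and the switching term of equation 4.1 can be evaluated with a few lines of Python. The sketch below is only an illustration: the function names are hypothetical, all numeric values are made up for the example, units are arbitrary, and using the total capacitance as the switched capacitance is a simplifying assumption made here.

```python
def fu_capacitance(accesses, avg_cap):
    """Equation 4.3: sum over FU types of (access count * average capacitance)."""
    return sum(accesses[j] * avg_cap[j] for j in accesses)

def register_capacitance(reg_accesses, reg_cap):
    """Equation 4.4: register access count times the register capacitance."""
    return reg_accesses * reg_cap

def switching_power(c_sw, freq, vdd):
    """Switching term of equation 4.1: C_sw * f * Vdd * Vdd."""
    return c_sw * freq * vdd * vdd

# Hypothetical numbers in arbitrary units.
c_fu = fu_capacitance({"add": 4, "mul": 2}, {"add": 1.0, "mul": 5.0})
c_reg = register_capacitance(6, 0.5)
c_total = c_fu + c_reg + 2.0 + 1.0      # plus assumed C_ctrl and C_inter
# Assumption for the example: the whole of C_total is switched every cycle.
p_switch = switching_power(c_total, freq=2.0, vdd=3.0)
```

Because the multiplier has the larger average capacitance in this example, its two accesses contribute more to the FU capacitance than the four adder accesses.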
4.3 Techniques for Power Minimization 
From equation 4.1, it is known that the switching power is proportional to the square of 
the supply voltage, to the frequency and to the switching capacitance. Any technique 
developed for power minimization must deal with one of these factors or a combination of 
them. Some of the techniques used in the process of power minimization are: 
transformation, use of multiple voltages and frequencies, increasing parallelism, shutdown 
of components while not in use, etc. The basic approach in transformation is to scan the 
design space by utilizing various flow graph transformations with high-level power 
estimations and to transform data flow graphs into data flow graphs that consume less 
power. In Shutdown of Components, the
massive switching activity in large components such as adders, multipliers and registers, 
which consume power, is minimized by turning off the components when they are not in use. 
This is achieved by disabling the clock signal and forcing the internal nodes to remain at 
static voltage levels, thereby preventing power consumption. In CMOS, power consumption 
decreases quadratically with reduction in voltage while the speed reduction is linear. 
Reducing the supply voltage is therefore attractive. A more commonly employed technique 
is the use of a mixed voltage circuit, or multiple voltages for the same circuit. In a Mixed 
Voltage Circuit, dual or multiple voltages on one IC are used to reduce power 
consumption. Another method is to bring in the concept of pipelining. Slower operations 
can be used on non-time-critical paths, while parallelism can be increased to compensate 
for slower components. The total consumption of the system operating in parallel is lower 
and its total delay is smaller. Extra area may be needed to achieve the parallelism. 
Reducing the Clock Frequency is another widely used option. The clock frequency 
determines the switching of the circuit components. Reducing the clock frequency will 
lower the switching and hence the power dissipated. The clock frequency may be varied 
by using special circuitry. An example is the use of power effective clock gating 
circuits. Other methods such as Choice of Power Efficient Functional Units, Algorithmic 
Modifications, etc. may be used according to the application under consideration.
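As a worked illustration of the quadratic dependence on supply voltage, consider lowering the supply from 5 V to 3.3 V while the switched capacitance and the clock frequency stay the same. The two voltage values are chosen here only for the example and are not taken from the thesis.

```latex
P_{dyn} \propto C_{sw} \, f \, V_{dd}^{2}
\qquad\Longrightarrow\qquad
\frac{P_{dyn}(3.3\,\mathrm{V})}{P_{dyn}(5\,\mathrm{V})}
  = \left(\frac{3.3}{5}\right)^{2} \approx 0.44
```

Under these assumptions the switching power drops by roughly 56%, which is why operations with slack are natural candidates for the lower supply voltage.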
A variety of optimization methods targeting each of the above mentioned factors have 
been explored. Reduction of supply voltage, multiple voltage supplies, reduction of 
capacitive loads through gate sizing, and minimization of switching activity by exploiting 
signal correlation are just a few. However, these factors strongly interact in ways that 
may cancel out power optimization benefits obtained by adjusting only one of them. This
called for a refinement in approach in which more than one of the above factors is 
optimized simultaneously. Hence this thesis presents a constructive Modified Force 
Directed Scheduling at High Level Synthesis, which schedules operations such that each 
operation receives the minimum possible operating voltage and clock frequency. The 
algorithm proposed in this chapter deals with power minimization of non-pipelined 
operations. The proposed algorithm is presented in section 4.4.
4.4. Multiple Voltage Scheme with Frequency Variation for Non-Pipelined Circuits.
The proposed MFDS algorithm consists of two phases: 1) the assignment of voltages to 
the functional units and 2) the assignment of clock frequencies to the functional units. 
A preprocessing phase of node minimization is executed before scheduling. The input is 
an unscheduled DFG $G(V, E)$, where $V = \{v_1, v_2, \ldots, v_n\}$ is the set of vertices and 
$E = \{e_1, e_2, \ldots, e_m\}$ is the set of edges. The available operation types, 
represented as $Op(n) = \{Op_1, Op_2\}$, are given as input to the algorithm. $Op_1$ and $Op_2$ 
denote operations with the characteristics of an adder and a multiplier respectively. It is 
assumed that operations with lesser computational delay have the characteristics of an 
adder and operations with higher computational delay have the characteristics of a 
multiplier. The clock frequencies available are $CF(n) = \{F_1, F_2\}$, where $F_1$ corresponds 
to the clock frequency used to operate the adder and $F_2$ corresponds to the clock 
frequency needed to operate the multiplier. The library of resources is given as 
$R_t = \{R_1, R_2, \ldots, R_t\}$. The set of operating voltages available, $V = \{V_1, V_2, V_3\}$, 
and the maximum allowed time to complete the entire task (timing constraint $C_t$, a 
function of the critical path time delay) are given prior to the start of the algorithm. It is 
assumed that $V_1 > V_2 > V_3$.
Algorithm:
Step1: The data flow graph is scanned in the pre-processing phase and redundant nodes 
are eliminated.
Step2: The mobility ($\mu_i$) of each operation is computed by determining the ASAP and 
ALAP schedules.
Step3: The modified force is computed as follows:
(i) Group all nodes that may occur in a given c-step by using the ASAP and
ALAP c-step range.
(ii) A tree for the data flow graph shown in Figure 3.3b is formed with its
branches developed as shown in Figure 4.1.
Figure 4.1: Tree Formation showing the search path to determine the optimal voltage and
frequency.
Step4: To choose the optimal combination of voltage and frequency, the tree is traversed 
as follows: a node on the critical path is selected and a depth first search (DFS) 
traversal is performed. Each DFS path corresponds to a unique voltage and frequency 
allocation for all operations in the path. This process is repeated till the last node is
reached or till the timing constraint is violated. If the timing constraint is violated, 
backtracking is used to select a different path that satisfies the constraint. This process 
is carried out repeatedly until the best possible path, with optimal voltages and 
frequencies allocated to all nodes in the path, is found. This ensures that the voltage and 
frequency allotted to each node are optimal and hence that $W(r_k)$ is optimized. Figure 4.1 
shows the tree with all possible paths and the optimal path with the optimal voltage and 
frequency satisfying the timing constraints.
Step5: The maximum power consumed by the circuit is now computed. It is 
obtained by running the operations at the maximum possible operating voltage and clock 
frequency.
Step6: The varied power is now computed using the new voltage and frequency allotted 
to each operation in the task.
Step7: The power reduction computed from the varied and unvaried power is reported 
along with the scheduled graph.
Pseudocode:
voltage_frequency_scheduling(G, map<op, Cstep>, τ)
Require: DFG = G(V, E), τ
Ensure: map<op, Cstep>
: map<op, Cstep> ASAP = asap(G, τ)
: map<op, double> MFDS = modified_force(G, ASAP, ALAP)
: map<op, Cstep> result = voltage_frequency_schedule(G, result)
: output = original_power(G, τ) - scheduled_power(G, result)
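Steps 5 to 7 can be illustrated with a minimal C++ sketch. It assumes the familiar dynamic power model P = C * V^2 * f for each operation; the struct `OpAssignment`, its fields and all function names are hypothetical and are not taken from the thesis code.

```cpp
#include <vector>

// Hypothetical per-operation record; names and units are illustrative only.
struct OpAssignment {
    double capacitance;  // switched capacitance of the operation
    double voltage;      // supply voltage allotted to the operation
    double frequency;    // clock frequency allotted to the operation
};

// Dynamic power of one operation under the assumed C * V^2 * f model.
double dynamic_power(double c, double v, double f) { return c * v * v * f; }

// Step 5: power when every operation runs at the maximum voltage and frequency.
double unvaried_power(const std::vector<OpAssignment>& ops, double v_max, double f_max) {
    double total = 0.0;
    for (const auto& op : ops) total += dynamic_power(op.capacitance, v_max, f_max);
    return total;
}

// Step 6: power when every operation runs at its own allotted voltage and frequency.
double varied_power(const std::vector<OpAssignment>& ops) {
    double total = 0.0;
    for (const auto& op : ops) total += dynamic_power(op.capacitance, op.voltage, op.frequency);
    return total;
}

// Step 7: power reduction expressed as a percentage of the unvaried power.
double power_reduction_percent(double unvaried, double varied) {
    return 100.0 * (unvaried - varied) / unvaried;
}
```

With two unit-capacitance operations, one left at the maximum operating point and one moved to half the voltage and half the frequency, the sketch reports the saving relative to running both at the maximum.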
The following section presents an example to illustrate the working of the proposed 
algorithm. Consider an unscheduled data-flow graph shown in Figure 4.2a.
Figure 4.2: (a) Unscheduled Data Flow Graph; (b) ASAP-Scheduled Data Flow Graph.
The ASAP Scheduled graph is shown in Figure 4.2 b. The scheduled data flow graph 
shows the time instance or the c-step in which each operation is executed. It may be seen 
from the ASAP scheduled graph that each operation is scheduled immediately after all its 
predecessors are scheduled. The algorithm then proceeds to compute the ALAP schedule 
for the original DFG and reports the latest time at which an operation may be executed. 
The ALAP schedule for the data flow graph is shown in Figure 4.3. It can be seen from 
the figure that each operation is delayed and executed as late as possible.
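The two schedules described above can be sketched in C++ as follows. This is a simplified illustration, not the thesis implementation: it assumes that every operation takes one c-step and that nodes are numbered in topological order (every edge goes from a lower to a higher index). The names `Dfg`, `asap` and `alap` are hypothetical.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// A data flow graph stored as successor lists.
// Assumption: node indices are already in topological order.
struct Dfg {
    std::vector<std::vector<int>> succ;  // succ[u] lists the successors of node u
};

// ASAP: each node is placed one c-step after its latest predecessor (c-steps start at 1).
std::vector<int> asap(const Dfg& g) {
    std::vector<int> step(g.succ.size(), 1);
    for (std::size_t u = 0; u < g.succ.size(); ++u)
        for (int v : g.succ[u]) step[v] = std::max(step[v], step[u] + 1);
    return step;
}

// ALAP: each node is placed one c-step before its earliest successor,
// with nodes that have no successors placed in last_step.
std::vector<int> alap(const Dfg& g, int last_step) {
    std::vector<int> step(g.succ.size(), last_step);
    for (std::size_t u = g.succ.size(); u-- > 0;)
        for (int v : g.succ[u]) step[u] = std::min(step[u], step[v] - 1);
    return step;
}
```

Passing the largest ASAP c-step as `last_step` gives the ALAP schedule for the same overall schedule length.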
Figure 4.3: ALAP Schedule of Operations.
The mobility (μi) of each operation is computed from the ASAP and ALAP times. The
mobility determines the time range for each operation and is generally considered an
important factor. The mobility computed is used in the calculation of the modified force.
The modified force is now computed as follows:
Figure 4.4: Grouping of Nodes that may occur in a given c-step.
All nodes that may occur in a given c-step are grouped together as shown in Figure
4.4. Nodes Ni, N; and Ng can occur in c-step 1 and hence they are grouped together. 
Nodes N3 , N5 and N? may occur in c-step 2 (ALAP time of Ng is c-step 2 and ASAP time 
of N? is c-step 2). Likewise N4  and N? are grouped together. Ng cannot be grouped with 
any other node and hence is placed separately in c-step 4. Having performed the 
operation of grouping, the algorithm proceeds to form the tree as shown in Figure 4.5. 
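The mobility computation and the grouping step can be sketched as below. The sketch assumes that the ASAP and ALAP c-steps are already available as integer vectors indexed by node; the function names `mobility` and `group_by_cstep` are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Mobility of each node: the difference between its ALAP and ASAP c-steps.
std::vector<int> mobility(const std::vector<int>& asap_step, const std::vector<int>& alap_step) {
    std::vector<int> m(asap_step.size());
    for (std::size_t i = 0; i < asap_step.size(); ++i) m[i] = alap_step[i] - asap_step[i];
    return m;
}

// groups[c] lists every node whose [ASAP, ALAP] range contains c-step c.
// C-steps are 1-based, so groups[0] is left empty.
std::vector<std::vector<int>> group_by_cstep(const std::vector<int>& asap_step,
                                             const std::vector<int>& alap_step,
                                             int last_step) {
    std::vector<std::vector<int>> groups(last_step + 1);
    for (std::size_t i = 0; i < asap_step.size(); ++i)
        for (int c = asap_step[i]; c <= alap_step[i]; ++c)
            groups[c].push_back(static_cast<int>(i));
    return groups;
}
```

A node with non-zero mobility appears in more than one group, which is what allows the later search to try it in different c-steps.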
Figure 4.5 shows the entire search space and the optimal path, with optimal voltages
and frequencies allocated to each operation in that particular path for a given time limit.
The time limit in this case is 1.5 times the critical path time. The algorithm optimizes the
operations for power within the given time limit. The algorithm starts with the first
available path. It keeps propagating through the nodes, choosing any one path. The
process is carried out until a timing constraint is violated or all nodes are scheduled. If a
timing constraint is violated, backtracking is done and the tree is traversed through
another path. It can be seen from Figure 4.5 that once the algorithm reaches c-step 7, it
immediately backtracks as the timing constraint is violated. Another path corresponding to a
different combination of voltage and frequency is now traversed. This process is repeated
until all operations are scheduled. The power reduction achieved here is at the expense of
the execution time delay. The final scheduled data flow graph is shown in Figure 4.5.
The path with the dark arrow is the optimal path. It shows the time at which each
operation is executed, the operating voltage used and the clock frequency allocated to
each node. The voltage and frequency allocated are used to compute the power
consumed by the entire task. It should be noted that not all operations are operated at the
maximum supply voltage and maximum clock frequency. This reduces the power
consumed by the circuit as a whole. The search employed here is a kind of Depth First
Search (DFS). The algorithm terminates by displaying the scheduled data flow graph, the
c-steps to which each operation is allocated, the total number of functional units required to
perform the entire task, the power consumed by running the operations at the maximum
available supply voltage, the power due to the modified force directed schedule and the
timing constraint. The power reduction achieved by running this algorithm on standard
benchmarks is reported in Chapter 6.
Figure 4.5: Final Scheduled Data Flow Graph.
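The search with backtracking described in this chapter can be sketched as follows. This is a deliberately simplified illustration and not the thesis implementation: it assumes the operations on one path execute one after another, that each candidate operating point has a known delay in c-steps and a known power cost, and that the same set of operating points is available to every operation. The names `Level`, `search_levels` and `assign_levels` are hypothetical.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

// One candidate operating point; the numbers used with it are placeholders.
struct Level {
    double voltage;    // supply voltage
    double frequency;  // clock frequency
    int delay;         // c-steps an operation takes at this point (assumed known)
    double power;      // power cost of an operation at this point (assumed known)
};

// Depth-first search over per-operation level choices with backtracking.
void search_levels(const std::vector<Level>& levels, std::size_t num_ops, int time_limit,
                   int elapsed, double power, std::vector<int>& current,
                   std::vector<int>& best, double& best_power) {
    if (elapsed > time_limit) return;  // timing constraint violated: backtrack
    if (current.size() == num_ops) {   // every operation has been assigned
        if (power < best_power) { best_power = power; best = current; }
        return;
    }
    for (std::size_t k = 0; k < levels.size(); ++k) {
        current.push_back(static_cast<int>(k));
        search_levels(levels, num_ops, time_limit, elapsed + levels[k].delay,
                      power + levels[k].power, current, best, best_power);
        current.pop_back();            // undo the choice and try the next level
    }
}

// Returns the chosen level index per operation, or an empty vector if no
// assignment meets the time limit.
std::vector<int> assign_levels(const std::vector<Level>& levels, std::size_t num_ops,
                               int time_limit) {
    std::vector<int> current, best;
    double best_power = std::numeric_limits<double>::infinity();
    search_levels(levels, num_ops, time_limit, 0, 0.0, current, best, best_power);
    return best;
}
```

For a critical path of four operations and a time limit of 1.5 times the critical path time (six c-steps), the sketch slows down as many operations as the limit allows and leaves the rest at the fast operating point. The search is exhaustive, so it is only practical for small paths; any pruning beyond the timing check is left out here.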
CHAPTER 5
MULTIPLE VOLTAGE SCHEME WITH FREQUENCY VARIATION FOR POWER
MINIMIZATION OF PIPELINED CIRCUITS
5.1 Pipelining
Pipelining a data-flow graph is an efficient way of accelerating the design. The goal 
of pipelining is to achieve the minimum possible latency, which is hardware resource or 
data dependent. There are basically two types of pipelining approach which require 
attention:
(i) Structural pipelining
(ii) Functional pipelining.
In functional pipelining, the algorithm description is sub-divided into sequences of 
operation stages that will be performed concurrently. Successive stages are streamed into 
the pipe so that different algorithm instances are executed in an overlapping fashion on a 
single data path. In structural pipelining, temporal parallelism is obtained through the use 
of pipelined functional units, e.g., a two-staged pipelined multiplier. In this case, the 
operation instances are executed in an overlapping fashion. A simple instance of 
pipelining is a loop. One of the most important and effective techniques for exploiting 
parallelism in loops is loop pipelining. Let $(f\,g)^n$ be a loop where 'f' and 'g' denote the
operators in the loop body and 'n' is the iteration count. Pipelining relies on the fact that
this loop is equivalent to $f\,(g\,f)^{n-1}\,g$ and improves performance by overlapping the
execution of different iterations. The operators of the loop executed before the loop body
after the transformation (f) form the loop prolog, the operators executed after the body (g)
are the loop epilog and the interval at which iterations are started is the initiation interval
of the system. Loops monopolize execution times in most cases. In many applications, a
few loops, if not only one, determine the throughput achievable by implementation of a
behavioral description. For example, DSP filters often consist of an infinite loop that 
repeatedly executes for every sample of the input stream. In architectural synthesis, the 
problem of optimizing loop execution under timing and area constraints is crucial to 
obtain high quality architectures. The techniques that address this problem attempt to 
overlap the execution of different loop iterations to reduce the cycle count (initiation 
interval) per iteration. Different methods have been proposed with such a goal: loop 
folding [15], functional pipelining [16], loop winding [17] and rotation scheduling [18]. 
Loops are usually represented by means of a Data Dependence Graph (DDG). Figure 5.1 
shows an example of a loop representation.
for i = 0, n-1
{
A: X[i] = R[i] + S[i];
B: Y[i+1] = Y[i] + 2 * X[i];
C: Z[i] = 4 * Y[i+1] + T[i];
}
Figure 5.1: Loop Representation by Data Dependence Graph and its equivalent
pseudocode.
Unlabeled edges represent intra-loop dependencies (ILDs), e.g. Bi depends on Ai.
Labeled edges (L) represent loop-carried dependencies (LCDs). The loop optimization problem
addressed here comprises a large variety of formulations with different timing and
resource constraints. The two cases of loop pipelining are Resource-Constrained
Loop Pipelining (RCLP) and Time-Constrained Loop Pipelining (TCLP). The following
section deals with some basic ideas about loops. It covers loop representation and
then presents the FDS approach to scheduling the DFG for power minimization.
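The prolog, body and epilog structure of a pipelined loop can be made concrete with a small C++ sketch. It only records the order in which the operators 'f' and 'g' of each iteration are issued; the function names are hypothetical, and the sketch does not model the actual overlap in hardware.

```cpp
#include <string>

// Original loop: n iterations of the body "f then g".
std::string run_original(int n) {
    std::string trace;
    for (int i = 0; i < n; ++i) {
        trace += "f" + std::to_string(i);
        trace += "g" + std::to_string(i);
    }
    return trace;
}

// Pipelined form: prolog f, then n-1 iterations of the new body "g then f",
// then epilog g. The new body pairs g of iteration i with f of iteration i+1,
// which is the pair a pipelined data path can overlap.
std::string run_pipelined(int n) {
    if (n <= 0) return "";
    std::string trace = "f0";                 // prolog
    for (int i = 0; i + 1 < n; ++i) {
        trace += "g" + std::to_string(i);     // g of iteration i
        trace += "f" + std::to_string(i + 1); // f of iteration i+1
    }
    trace += "g" + std::to_string(n - 1);     // epilog
    return trace;
}
```

Both functions issue the operators in the same order, which is the equivalence that loop pipelining relies on.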
5.2 Representation of loops:
A loop is represented by a labeled dependence graph, DG(V, E). Vertices and edges
represent operations and data dependencies respectively. If each iteration of a loop
requires using an FU during 'C' cycles and the architecture has 'N' FUs of that type, then
the initiation interval (II) is bounded by
$$II \geq \frac{C}{N}.$$
Therefore, the FU with the maximum such ratio
determines a lower bound on the initiation interval.
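The lower bound can be computed as in the following sketch. Rounding the ratio C/N up to a whole number of cycles is an assumption made here, on the grounds that an initiation interval is an integer cycle count; the struct `FuUsage` and the function name are hypothetical.

```cpp
#include <algorithm>
#include <vector>

// Usage of one functional-unit type by a single loop iteration.
struct FuUsage {
    int cycles_per_iteration;  // C: cycles one iteration occupies this FU type
    int units_available;       // N: number of FUs of this type (must be > 0)
};

// Resource-based lower bound on the initiation interval: the largest
// ceil(C / N) over all FU types (rounding up is an assumption of this sketch).
int min_initiation_interval(const std::vector<FuUsage>& fus) {
    int bound = 1;  // an iteration cannot start more often than once per cycle
    for (const auto& fu : fus) {
        int ratio = (fu.cycles_per_iteration + fu.units_available - 1) / fu.units_available;
        bound = std::max(bound, ratio);
    }
    return bound;
}
```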
5.3 Loop Pipelining: A MFDS Approach:
The task of scheduling functionally pipelined operations is solved by modifying the
MFDS algorithm proposed in Chapter 4. A pipelined DFG is constructed by arranging k
instances of the MFD-scheduled operations, where k is the number of pipeline stages.
Figure 5.2 shows the construction of the pipelined DFG (k=2). When constructing the
pipelined DFG, corresponding operations in subsequent instances should differ by the
latency of the operation calculated during the initial MFD schedule. The MFDS algorithm
itself is explained in Chapter 4.
Figure 5.2: Pipelined Data Flow Graph (k=2)
The algorithm performs ASAP and ALAP scheduling on the new pipelined data flow graph
obtained from the original data flow graph. The mobility and the modified force are then
computed and the optimal path with optimal voltage and frequency is taken. The power
due to the varied voltage and frequency is computed and the power reduction obtained is
reported in Chapter 6.
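The construction of the pipelined DFG can be sketched as below. This is one possible reading of the description, offered as an assumption: each of the k instances is a copy of the scheduled graph, and instance j is shifted by j times the latency in c-steps. The struct `Graph` and the function `build_pipelined` are hypothetical names.

```cpp
#include <utility>
#include <vector>

// A scheduled DFG: edges between node indices plus the c-step of each node.
struct Graph {
    int num_nodes = 0;
    std::vector<std::pair<int, int>> edges;  // (from, to)
    std::vector<int> start_step;             // c-step from the initial schedule
};

// Builds a DFG holding k copies of g. Copy j keeps the edges of g
// (renumbered by j * num_nodes) and is shifted by j * latency c-steps.
Graph build_pipelined(const Graph& g, int k, int latency) {
    Graph p;
    p.num_nodes = g.num_nodes * k;
    for (int j = 0; j < k; ++j) {
        int offset = j * g.num_nodes;
        for (int s : g.start_step) p.start_step.push_back(s + j * latency);
        for (const auto& e : g.edges) p.edges.emplace_back(e.first + offset, e.second + offset);
    }
    return p;
}
```

The resulting graph can then be passed through the same ASAP, ALAP and voltage and frequency assignment steps as a non-pipelined DFG.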
CHAPTER 6
SUMMARY AND DIRECTIONS FOR FUTURE RESEARCH 
The development of algorithms to support a low-power design methodology has been
an active research area for the past several years. While much of the literature deals with
circuit and gate level techniques, a significant amount has also been published on
high-level power estimation. This thesis discusses the fundamentals of high-level synthesis
and reviews research work on high-level synthesis with special emphasis on power
minimization. It is worth noting that power is not the only quantity that needs to be
minimized; emphasis should also be given to "energy" minimization, because it is
the energy that dominates most digital circuit applications. The algorithm
presented in this thesis is a method that tries to minimize the dynamic
component of the power. The simulation was conducted on standard benchmarks.
6.1 Summary
High-level synthesis is inherently computation intensive. Given an input description, 
and user specified constraints, a vast number of alternative design possibilities exist. In 
fact, each high-level synthesis step is known to be NP-hard or NP-complete. The 
algorithm presented in this thesis is a heuristic that tries to minimize the
dynamic component of the power. The simulation was conducted for various benchmarks.
The performance of the proposed algorithm was compared with the performance when all
functional units are operated at the maximum available supply voltage and maximum
clock frequency. The increase in the performance, i.e. the reduction in power 
consumption was found to be at the expense of the computation time or the execution 
time delay. However, the power reduction achieved in the case of latency or timing is at 
the expense of increased clock frequency. The algorithm also helps in choosing the right 
set of functional units from the library thereby performing allocation and binding. The 
following section presents a small discussion on the thesis as a whole.
6.2 Discussion
This thesis spans over six chapters and talks about power optimization at high-level 
synthesis. Chapter 1 forms the introduction of this thesis and discusses the various design 
levels available in VLSI circuit design, the design problems and presents a discussion on 
the design problem power consumption. Chapter 2 discusses extensively about some of 
the works done previously in this area. The works done for both pipelined and non- 
pipelined cases are reported. Chapter 3 presents the various definitions used in this work. 
It explains the steps involved in high-level synthesis, and explains the methodologies
along with a mathematical explanation for the operation scheduling and the Modified
Force Directed scheduling algorithm. Chapter 4 describes the algorithm developed for the
non-pipelined case. It shows the various steps performed on the Data Flow Graph. The
algorithm first schedules the various operations, using the Modified Force Directed
Scheduling algorithm, and then performs the process of voltage and frequency assignment
for each Functional Unit (FU). Chapter 5 discusses the algorithm developed for pipelined
cases. This chapter presents the results obtained on HLS benchmarks and also presents
some of the open problems and future work.
6.3 Simulation Results 
Throughout the course of development, two main groups of benchmarks were
used. These benchmarks differ in their format. The first one is
the DOT format benchmark and the next one is the Standard Task Graph (STG) format
benchmark [Appendix 3]. This section presents the simulation results obtained for
standard benchmarks. The first part of this section presents the results obtained for
non-pipelined cases. The next part presents the simulation results obtained for the pipelined
case.
6.3.1 Non-Pipelined Cases 
The algorithm was developed for both pipelined and non-pipelined cases and this 
section presents the results obtained for non-pipelined case. The first part shows the result 
for the dot format benchmarks and the next part shows the results for the STG part.
6.3.1.1 Dot Format Benchmarks Result 
The algorithm was implemented in C++ and run on a Linux platform. The algorithm
was tested on standard benchmarks like the HAL filter, FIR filter, and Cosine filter. The total
nodes in each benchmark varied from thirty to one hundred nodes. The performance of
the algorithm was quite satisfactory in all the cases. The results obtained by running the
algorithm on these benchmarks are tabulated in Table 6.1.
Table 6.1: Dot Format Benchmarks Results
Bench  Op   Cmax  Rmax   Pmax (nJ)  P1 (nJ)  G1 (%)  Ct   P2 (nJ)  G2 (%)
HAL    21   6     4,4    735        308.7    42      1.0  423229   42
                                                     1.3  144693   62
                                                     1.5  160157   64
                                                     1.8  164595   65
FIR    40   11    4,2    1400       757.1    46      1.0  757113   46
                                                     1.3  220308   61
                                                     1.5  243504   63
                                                     1.8  256816   64
COS1   66   8     12,14  2310       1321.8   42      1.0  1321850  42
                                                     1.3  411159   60
                                                     1.5  426623   61
                                                     1.8  442087   62
COS2   66   8     12,14  2310       1321.8   42      1.0  1321850  42
                                                     1.3  550384   61
                                                     1.5  426633   60
                                                     1.8  442087   62
DFG3   100  26    7,3    3500       1892.80  46      1.0  3500000  46
                                                     1.3  550384   61
                                                     1.5  601348   63
                                                     1.8  634562   65
In Table 6.1, 'Op' denotes the total operations available in the entire circuit. Cmax
denotes the maximum time (c-step) needed to perform the entire task, assuming no
resource and timing constraint. Rmax gives the total functional units required. The first
sub-column in this column gives the total FUs of adder characteristics and the second
sub-column gives those of multiplier characteristics. Pmax is the unvaried power consumption,
or the power consumed when all operations are operated at the maximum supply voltage
and maximum clock frequency. P1 is the voltage-varied power. G1 is the gain obtained
due to voltage-varied power. Ct is the relaxation factor given for the time. P2 is the
voltage and frequency varied power and G2 is the gain due to voltage and frequency
variation. Figure 6.1 shows the power reduction achieved due to
voltage variation alone and Figure 6.2 shows the power reduction due to
voltage and frequency variation.
Figure 6.1: Voltage Varied Power.
Figure 6.2: Voltage and Frequency Varied Power.
Table 6.2: STG format benchmarks results.
Bench   # Op  Cmax  Ct   P1 (nJ)  P2 (nJ)           % Reduction
ROBOT   88    25    1.0  1129125  418657, 587145    37, 52
                    1.3  1129125  418657, 598436    37, 53
                    1.5  1129125  418657, 609727    37, 54
                    1.8  1129125  418657, 621019    37, 55
SPARSE  96    7     1.0  1274700  351844, 535374    28, 42
                    1.3  1274700  351844, 548121    28, 43
                    1.5  1274700  351844, 548121    28, 43
                    1.8  1274700  351844, 548121    28, 43
STG1    334   16    1.0  4381650  1810340, 2453724  41, 56
                    1.3  4381650  1810340, 2497541  41, 57
                    1.5  4381650  1810340, 2541357  41, 58
                    1.8  4381650  1810340, 2585174  41, 59
RAND1   100   14    1.0  907800   367626, 499290    40, 55
                    1.3  907800   367626, 508368    40, 56
                    1.5  907800   367626, 517446    40, 57
                    1.8  907800   367626, 526524    40, 58
RAND2   50    10    1.0  506550   195101, 268472    38, 53
                    1.3  506550   195101, 105355    38, 54
                    1.5  506550   195101, 107306    38, 55
                    1.8  506550   195101, 109257    38, 56
Figure 6.3: Voltage Varied Power.
Figure 6.4: Voltage and Frequency Varied Power.
6.3.1.2 Pipelined Case 
The algorithm developed for the non-pipelined case was modified for pipelined cases and
tested on the same set of benchmarks used for the non-pipelined case. The results obtained
are tabulated in Table 6.3.
Table 6.3: Two-Staged Pipelined results for STG benchmarks.
Bench   Op   P1 (uJ)  Ct   P2 (uJ)  %R
ROBOT   176  1641     1.0  328      20
                      1.3  328      20
                      1.5  345      21
                      1.8  345      21
SPARSE  192  2129     1.0  681      32
                      1.3  681      32
                      1.5  703      33
                      1.8  703      33
STG1    668  3855     1.0  1349     35
                      1.3  1388     36
                      1.5  1388     36
                      1.8  1388     36
RAND1   200  44       1.0  18       41
                      1.3  18       41
                      1.5  18       42
                      1.8  18       42
RAND2   100  29       1.0  12       43
                      1.3  12       43
                      1.5  12       43
                      1.8  13       44
In Table 6.2, '# Op' denotes the total operations available in the entire circuit. Cmax
denotes the maximum time (c-step) needed to perform the entire task, assuming no
resource and timing constraint. Ct is the relaxation factor given for the time. P1 is the
unvaried power consumption, or the power consumed when all operations are operated at
the maximum supply voltage and maximum clock frequency. P2 is the voltage and
frequency varied power. In Table 6.3, 'Op' denotes the total operations available in the
entire circuit. P1 is the unvaried power consumption, or the power consumed when all
operations are operated at the maximum supply voltage and maximum clock frequency.
Ct is the relaxation factor given for the time. P2 is the voltage and frequency varied
power.
Figure 6.5: Power Variation for Two-Staged Pipelined Operations.
6.4 Open Problems and Road Map for Future Design 
As most theses start with a problem statement, perhaps the best way to end this thesis
is to pose a number of open questions. Research in high-level synthesis has been going
on for more than a decade and the results achieved so far are promising. There are still
topics that need to be explored further in order to make it a more powerful aid in
designing VLSI systems. The work presented in this thesis is a small step in what we
believe is the right direction; however, there is obviously much more to be done. There
are many unsolved or partially addressed issues, some of which are listed below.
6.4.1 Partition and Scheduling for Power Minimization
Most modern digital systems are too large to fit on a single chip. An efficient solution
to this problem is the utilization of Multi-Chip Modules (MCMs). An MCM has several
chips bonded to a single substrate. Interconnections among the chips are provided in the
substrate. The major advantages of MCMs are performance, mixed technologies, size
and reliability. MCMs offer a high-density packaging option to meet the requirements of
many high-performance digital signal-processing applications.
The system-partitioning problem is that of dividing the data flow graphs into several 
parts so that each part can be synthesized on a chip. Inter-chip communication delays 
would lead to several control steps. After the input description is partitioned at the 
behavioral level, several design alternatives can be explored. The tasks of scheduling and 
allocation can be performed together with partitioning or after partitioning. In recent 
years, power requirement has become an important consideration in circuit design. Since 
power consumption is a quadratic function of the supply voltage, it is desirable that
operations be executed at a low voltage. However, lowering the supply will lead to an 
increase in circuit delay and consequently, to a reduction in throughput. A new approach 
in circuit design was explored; one which allows different voltages for different 
operations.
6.4.2 Control Path Synthesis
Most work in high-level synthesis has concentrated on the many problems in data 
path synthesis without considering the quality of the control path. Control path synthesis 
is also important since the control path often occupies a large portion of the total design 
area as well as power consumption. A control path is built according to the specifications
given in the state transition graph. Several general algorithms that take into account
the encoding problems in control path synthesis, such as KISS and MUSTANG, take no account
of the power consumption. Moreover, data path and control path are very closely related
but the interaction between data path and control path is not considered in either data path
synthesis or control path synthesis. Consequently, in future research, control path
synthesis will be an important issue to be studied when a very high quality design is
required. Algorithms must be developed such that both the control and data path are
synthesized and optimized.
The static component of power has not received much attention in high level 
synthesis. Most of the work available in literature deals with the dynamic component of 
power. Also, the other components of power like glitch power have not been attended 
extensively. It is impossible to achieve a greater power reduction unless optimization is 
done at all stages of digital design. No one has a monopoly of good ideas. Hence, it 
would sound imprudent to say that this area has been explored to its maximum limit. As 
the density of ICs increases in accordance with Moore's law, a new problem may
arise which might need a different approach altogether. Also, research at the other levels,
such as the logic level, needs to be carried out, and it could turn out that those levels
produce better results than those produced at the HLS level.
BIBLIOGRAPHY
1. Shiue. T.W and Chakrabarti. C, "Low-Power Scheduling with Resource Operating at
Multiple Voltages", in the proceedings of IEEE International Symposium, 1997, pp.3759.
2. Mohanty. P S, Ranganatha. N and Krishna. V, "Datapath Scheduling using Dynamic
Frequency Clocking", in the proceedings of IEEE Computer Society Annual Symposium
on VLSI, April 25-26, 2002, Pittsburgh.
3. Manzak. A and Chakrabarti. C, "A Low Power Scheme with Resources Operating at
Multiple Voltages", in the proceedings of IEEE Transaction, VLSI Systems, March 2002.
4. Kim. D, "Power Conscious High Level Synthesis using Loop Folding", in the
proceedings of Design Automation Conference, 1997.
5. Kumar. A and Bayoumi. M, "Methodologies For Binding Functional Units For Low
Power in High Level Synthesis", in the proceedings of ICCD 1999: Austin, Texas, USA.
6. Mesman. B, Jess. J. A. G et al., "A Constraint Driven Approach To Loop Pipelining and
Register Binding", in the proceedings of Design Automation and Test in Europe, 1998.
7. Marinescu. M. V and Rinard. M, "High-Level Automatic Pipelining for Sequential
Circuits", in the proceedings of the 14th international symposium on systems synthesis,
2001.
8. Yoo. H and Park. D, "A Scheduling Algorithm for Pipelined Datapath Synthesis with
Gradual Mobility Reduction", in the proceedings of AP-ASIC 1999.
9. Ahmad. I, Dhodhi. K and Ali. M. F, "TLS: A Tabu Search Based Scheduling Algorithm
for Behavioral Synthesis of Functional Pipelines", Computer Journal, Volume 43, Issue
2, pp. 152-166.
10. Kim. D, Shin. D, Choi. K, "Low Power Pipelining of Linear Systems: A Common
Operand Centric Approach", in the proceedings of Design Automation Conference
(DAC), June 2003.
11. Li. W, Zhuang. S and Wanhammar. L, "An Efficient Pipelined Complex Multiplier".
12. Liu. X and Papaefthymiou. C. M, "A Static Power Estimation Methodology for IP-Based
Design", in the proceedings of Design Automation and Test in Europe, 2001, pg. 280-287.
13. Govindarajan. S and Vemuri. R, "Scheduling Algorithms for High Level Synthesis",
Technical report, Electronic version, University of Cincinnati, OH.
14. Park. C, "Task Scheduling in High Level Synthesis", Thesis report, UIUC.
15. Najm. N. F, "A Survey of Power Estimation Techniques in VLSI Circuits", in the
proceedings of IEEE Transactions, VLSI Systems, vol. 2, no. 4, pp. 446-455, January 1995.
16. Hwang. T.C, Hsu. C.Y and Lin. L.Y, "Scheduling for Functional Pipelining", in the
proceedings of the 28th Design Automation Conference, pg. 764-769, May 1987.
17. Girczyc. F. E, "Loop Winding: A Data Flow Approach to Functional Pipelining", in the
proceedings of International Symposium on Circuits and Systems, pg. 382-385, May
1987.
18. Chao. F. L, LaPaugh. A and Sha. M. H. E, "Rotation Scheduling: A Loop Pipelining
Algorithm", in the proceedings of the 30th Design Automation Conference, pg. 566-572,
June 1993.
19. Cortadella. J, Badia. M. R and Sanchez. F, "A Mathematical Formulation of the Loop
Pipelining Problem", Technical Report UPC-DAC- 1995, Department of Computer
Architecture, October 1995.
20. Girczyc. M. E, "Loop Winding - A data flow approach to functional pipelining", in the
proceedings of the International Symposium on Circuits and Systems, pg. 382-385, May
1987.
APPENDIX 1
C++ CODES AND DOCUMENTATION 
In this appendix the C++ codes that were developed for the thesis are described. The
actual code is given in the accompanying disk. The first part of this appendix presents the
code developed for dot format benchmarks and the code developed for STG benchmarks
is presented in the second section.
A1.1 Codes developed for DOT and STG Benchmarks
Two data structures were created. The first one was for the Node and the second
one was for the Interface. The class name, followed by a brief description of the member
functions of the class and structure, is given below:
A1.1.1 Node_Class( );
This class maintains the details of each node used in the process. The various
member functions under this class are as follows:
Add_Nodes( );
This function is responsible for reading the node ids from the file and creating the node 
list.
Display_Node( );
This function when called outputs the entire node list in ascending order.
Add_Interface( );
This function is responsible for creating the interface between each node in the node list. 
This function manages the edges between each node.
Get_Number_Of_Nodes( );
This function gets the total nodes present in the node list. It gives a total node count in the
entire task.
Get_Link_Id( );
This gets the edge number or the link id between each pair of nodes.
Number_Of_Inputs( );
This function is responsible for maintaining the total number of inputs each node has. It
points to NULL if there is no input available for a particular node.
Ret_C_Step( );
This function returns the c-step of each node after performing the process of operation 
scheduling.
Arrow_Detector( );
This function is responsible for detecting the arrowed part or the later part of the input 
file in dot format.
Parse( );
This function reads the input file and parses it into node type and node id. The node could 
be of input, addition type, subtraction type, multiplication type or any other type.
A1.1.2 Interface Class
The member functions in this class are listed below:
Add_Interface( );
This member function creates the interface or the edges between each pair of nodes.
Show_Interface( );
This function is responsible for showing the interface id or the edge id between each pair
of nodes.
Get_Link_Id( );
This function is responsible for reading the input file and parsing it in order to get the link 
id.
Number_Of_Interfaces( );
This function stores all possible interfaces possible from a given node to any other node. 
Get_In_Node_Id( );
This function when called returns all the input nodes available for any node under
consideration. It returns a NULL when no input is available for a node.
A1.1.3 ASAP_CLASS
The member functions in this class are presented below:
Initialize( );
This function performs all initialization needed prior to the start of the other processes. It
sets the initial c-step of all nodes to zero and the availability of each node to zero.
C_Step_Of_A_Node( );
This function computes the ASAP C-Step of each node present in the task. The function 
has various sub-functions, which are:
Maximuml( );
This function implements the formula: schedule $v_i$ by setting $t_i = \max_j \{ t_j + d_j \}$,
where $v_j$ ranges over the predecessors of $v_i$. It basically computes the maximum c-step
of all the predecessor nodes of the node under consideration.
C_Step_Allot( );
This function allots the c-step of each node. It computes the ASAP c-step of each node 
and allots to each node.
Return_The_Maximum_C_Step( );
This function computes the maximum c-step needed for the completion of the entire task. 
The function returns the maximum c-step after performing the ASAP scheduling.
Output( );
This function displays the node id and the c-step of each node obtained after performing 
the ASAP scheduling.
The next class is the ALAP_CLASS and the member functions in this class are listed 
below.
Initialize( );
This function performs the initialization by assigning a maximum number as the c-step of 
each node.
No_Input_List( );
This function gets the total inputs available for each node and displays it.
No_Output_List( );
This function performs the opposite of the above function. It gets the total 
outputs available for each node.
ALAP_Schedule( );
This function performs the ALAP schedule of each node.
DISPLAY( );
This function is responsible for printing out the ALAP c-step of each node. 
Set_Availability( );
This function sets the availability of each node to 5 once it has been allotted an ALAP c-step. 
Check_Availability_5( );
This function checks whether the availability of each node is set to 5. This ensures that 
nodes already scheduled are not revisited.
C_Step_Assign( );
This function assigns the ALAP c-step of each node. It then calls the Set_Availability( ) 
function to set the availability of the nodes that have been allotted an ALAP c-step to 5. 
Out_List_Avail_5_Check( );
This function checks whether the availability of the output list of the node under 
consideration is 5.
C_Step_Allot( );
This function allots the ALAP c-step for each node.
Dump_List( );
This function dumps the results to the screen.
Report( );
This function performs the same task as Dump_List( ), but also displays the ASAP c-steps in 
addition to the ALAP c-steps.
Output_Exit( );
This function performs the operation of exiting the computation once all nodes have been 
assigned the ALAP c-steps.
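The ALAP pass described above can be sketched in the same style. This is a hedged illustration, not the thesis code: the name alap_schedule, the last_step parameter, and the unit default delay are assumptions. Membership in the c_step dictionary plays the role that the availability value 5 plays in the class description, so that scheduled nodes are not revisited.

```python
def alap_schedule(nodes, edges, last_step, delay=None):
    """Return {node: ALAP c-step} for a directed acyclic graph.

    last_step is the latest c-step allowed, for example the maximum
    c-step found by ASAP scheduling. delay maps a node to the number of
    c-steps it takes (assumed to be 1 when not given).
    """
    delay = delay or {}
    preds = {n: [] for n in nodes}
    succs = {n: [] for n in nodes}
    for src, dst in edges:
        preds[dst].append(src)
        succs[src].append(dst)

    # A node becomes ready once all of its successors are scheduled.
    pending = {n: len(succs[n]) for n in nodes}
    ready = [n for n in nodes if pending[n] == 0]
    c_step = {}  # a node present here has already been scheduled
    while ready:
        node = ready.pop()
        # The node must finish one c-step before its earliest successor.
        latest_finish = min(
            (c_step[s] - 1 for s in succs[node]),
            default=last_step,
        )
        c_step[node] = latest_finish - delay.get(node, 1) + 1
        for p in preds[node]:
            pending[p] -= 1
            if pending[p] == 0:
                ready.append(p)

    if len(c_step) != len(nodes):
        raise ValueError("graph contains a cycle")
    return c_step


# Data flow graph of Figure A3.1 with one c-step of slack.
NODES = [1, 2, 3, 4, 5, 6, 7]
EDGES = [(1, 4), (2, 4), (4, 6), (4, 7), (3, 5), (5, 7)]
print(sorted(alap_schedule(NODES, EDGES, last_step=3).items()))
```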
A1.1.4 FDS_CLASS
Mobility( );
This function computes the mobility of each node by finding the difference between the 
ALAP and ASAP c-steps of each node.
Probability( );
This function computes the probability of each node's occurrence in each of its c-steps, 
ranging from its ASAP c-step to its ALAP c-step.
C_Step_For_Zero_Mobility( );
This function performs the assignment of c-steps to those nodes whose mobility is equal 
to zero.
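The three functions above can be illustrated with a short sketch. The function names and the input dictionaries are hypothetical, and the uniform probability of 1/(mobility + 1) over the range from the ASAP to the ALAP c-step is an assumption taken from the usual formulation of force-directed scheduling; the text above does not state the distribution used.

```python
from fractions import Fraction


def mobility(asap, alap):
    """Mobility of each node: ALAP c-step minus ASAP c-step."""
    return {n: alap[n] - asap[n] for n in asap}


def probability(asap, alap):
    """Probability of each node occupying each c-step in its range.

    Assumes a uniform distribution over the ASAP..ALAP range.
    """
    probs = {}
    for n in asap:
        if alap[n] < asap[n]:
            raise ValueError(f"ALAP c-step precedes ASAP c-step for {n!r}")
        steps = range(asap[n], alap[n] + 1)
        share = Fraction(1, len(steps))
        probs[n] = {step: share for step in steps}
    return probs


def c_step_for_zero_mobility(asap, alap):
    """Fix the c-step of every node whose mobility is zero."""
    return {n: asap[n] for n in asap if alap[n] == asap[n]}


# Hypothetical ASAP and ALAP results for three nodes.
asap = {"a": 0, "b": 0, "c": 1}
alap = {"a": 0, "b": 1, "c": 3}
print(mobility(asap, alap))
print(probability(asap, alap)["c"])
print(c_step_for_zero_mobility(asap, alap))
```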
APPENDIX 2
LIST OF ACRONYMS USED IN THE DOCUMENTATION
Acronyms Expansions
IC Integrated Circuits
SSI Small Scale Integration
MSI Medium Scale Integration
LSI Large Scale Integration
VLSI Very Large Scale Integration
ULSI Ultra-Large Scale Integration
FU Functional Units
HLS High-Level Synthesis
FL Functional Level
LL Logic Level
RTL Register Transfer Level
CL Circuit Level
NRE Non-Recurring Engineering Cost
RE Recurring Engineering Cost
CMOS Complementary Metal Oxide Semiconductor
TLS Tabu List Scheduling
IP Intellectual Property
C-Step Control Step
ASAP As Soon As Possible
ALAP As Late As Possible
FDS Force-Directed Scheduling
ILP Integer Linear Programming
IR Iterative Scheduling
LS List Scheduling
DFG Data Flow Graphs
CDFG Control-Data Flow Graphs
STG Standard Task Graphs
DOT Department of Telegraph
DG Distribution Graph
II Initiation Interval
CF Clock Frequency
MFDS Modified Force Directed Scheduling
NP Nondeterministic Polynomial
TG Task Graphs
MCM Multi-Chip Modules
APPENDIX 3
BENCHMARKS - A DISCUSSION 
Throughout the course of development, two main groups of benchmarks were used. 
The two groups differ in their format. The first is the DOT format benchmark and the 
second is the Standard Task Graph (STG) format benchmark.
A3.1 DOT format
The first type of benchmark to be dealt with is the "DOT" format. "DOT" is a 
drawing tool for graphs that was developed at AT&T Bell Laboratories. The tool is part 
of Graphviz and is freely available. A small example of the format and the corresponding 
Data Flow Graph is shown in Figure A3.1. The DOT format benchmark may be explained as 
follows. The first line gives the name of the DFG under consideration. The second 
line describes the general characteristics of the nodes. The third line, 1 [label = imp];, may 
be decoded as follows: the number "1" denotes the node number, and the type of the node is 
denoted by the label keyword. Here, node 1 is of type input. All lines until line 
number 10 may be decoded in a similar way. Line number 10 (1 -> 4 [name = 1];) may 
be decoded as follows: node 1 forms an input of node 4, and the number 1 denotes the 
edge number or the interface number.
digraph test {
node [fontcolor = white, style = filled, color = blue2];
1 [label = imp];
2 [label = imp];
3 [label = imp];
4 [label = add];
5 [label = sub];
6 [label = add];
7 [label = add];
1 -> 4 [name = 1];
2 -> 4 [name = 2];
4 -> 6 [name = 3];
4 -> 7 [name = 4];
3 -> 5 [name = 5];
5 -> 7 [name = 6];
}
Figure A3.1: Dot format representation and its equivalent graph.
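The decoding rules for the DOT format benchmark can be sketched as a small parser. This is a minimal sketch that handles only the restricted subset described in this appendix (node lines with a label and edge lines written with "->" and a name); it is not a general DOT parser, and the function name parse_dot_benchmark and the sample text are assumptions made for illustration.

```python
import re

# Node line, e.g. "4 [label = add];"
NODE_RE = re.compile(r"^\s*(\d+)\s*\[\s*label\s*=\s*(\w+)\s*\]\s*;?\s*$")
# Edge line, e.g. "1 -> 4 [name = 1];"
EDGE_RE = re.compile(
    r"^\s*(\d+)\s*->\s*(\d+)\s*\[\s*name\s*=\s*(\d+)\s*\]\s*;?\s*$"
)


def parse_dot_benchmark(text):
    """Return (node_type, edges) parsed from the benchmark text.

    node_type maps node number -> label (the node type).
    edges maps edge (interface) number -> (source node, destination node).
    Lines that match neither pattern, such as the graph name and the
    general node attributes, are skipped.
    """
    node_type = {}
    edges = {}
    for line in text.splitlines():
        node = NODE_RE.match(line)
        if node:
            node_type[int(node.group(1))] = node.group(2)
            continue
        edge = EDGE_RE.match(line)
        if edge:
            edges[int(edge.group(3))] = (int(edge.group(1)), int(edge.group(2)))
    return node_type, edges


SAMPLE = """digraph test {
node [fontcolor = white, style = filled, color = blue2];
1 [label = imp];
2 [label = imp];
4 [label = add];
1 -> 4 [name = 1];
2 -> 4 [name = 2];
}"""

node_type, edges = parse_dot_benchmark(SAMPLE)
print(node_type)
print(edges)
```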
A3.2 Standard Task Graphs (STG)
The Standard Task Graph Set is a benchmark suite for evaluating multiprocessor 
scheduling algorithms. STG is offered so that researchers can evaluate their algorithms 
under the same conditions, and it covers various task graph (TG) generation methods, 
including task graphs generated from actual application programs. A simple STG is 
shown in Figure A3.2. The STG decoding procedure is given below.
The STG may be divided into two parts. The first part is the task graph part and the 
second part is the information part. In the task graph part, the field for each number is 11 
characters wide. The first line, "88", represents the number of tasks. The second line holds 
the information of the entry dummy node (task 0). The first "0" represents task number 0. 
The second "0" means that the processing time for task 0 is zero. The third "0" means 
that task 0 has no predecessors. The "1" in the third line represents task number 1.
"20" means that the processing time for task 1 is 20. The second "1" means that task 1 
has one predecessor. The "0" means that the predecessor of task 1 is task 0. The eighth 
line may be interpreted as follows. The "6" represents task number six. The "28" means 
that the processing time for task 6 is 28. The "2" means that task 6 has two predecessors. 
The "1 4" means that the predecessors of task 6 are tasks 1 and 4. The other 
lines represent the information for each task in the same manner. The last line holds the 
information of the exit dummy node, which has a processing time of "0". The information 
part is described below.
88
0 0 0
1 20 1 0
2 1 1 0
3 5 1 1
4 5 1 1
5 10 1 1
6 28 2 1 4
88 24 3 3 86
89 0 2 81 88
# Standard Task Graph Set Project
# Application Task Graph robot.stg
# Application Name: robot control program
# Tasks: 88 (+ dummy tasks: 2)
# Edges: 131 (+ dummy edges: 4)
# Max. Predecessors: 3
# Min. Predecessors: 3
# Ave. Predecessors: 1.488636
# Max. Proc. Time: 111
# Min. Proc. Time: 0
# Avg. Proc. Time: 28.215909
# CP Length: 569
# Parallelism: 4.363796
Figure A3.2: STG benchmark.
The information part is composed of four different parts: a common part (task graph 
file name, etc.), precedence constraints form, task processing time, and other information 
such as critical path length and task graph parallelism. The common information part is 
shown in the second line. The other major parts are described in subsequent lines. They 
generally serve the purpose of documentation and hence are not considered here in detail.
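The decoding of the task graph part can be sketched as a short parser. This is an illustrative sketch, not the reader used in the thesis: the function name parse_stg is hypothetical, the sample graph is made up for the example, and fields are split on whitespace rather than read as fixed 11-character columns. Lines beginning with "#" (the information part) are skipped, in line with their documentation role.

```python
def parse_stg(text):
    """Return (num_tasks, tasks) parsed from STG text.

    tasks maps task number -> (processing time, list of predecessors).
    The entry and exit dummy tasks are included in tasks but are not
    counted in num_tasks.
    """
    rows = [
        line.split()
        for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]
    num_tasks = int(rows[0][0])
    tasks = {}
    for fields in rows[1:]:
        task_id, proc_time, num_preds = (int(x) for x in fields[:3])
        preds = [int(x) for x in fields[3:]]
        if len(preds) != num_preds:
            raise ValueError(
                f"task {task_id}: expected {num_preds} predecessors, "
                f"found {len(preds)}"
            )
        tasks[task_id] = (proc_time, preds)
    return num_tasks, tasks


# Hypothetical three-task graph with entry (0) and exit (4) dummy tasks.
SAMPLE = """3
0 0 0
1 20 1 0
2 1 1 0
3 5 2 1 2
4 0 1 3
# hypothetical example, not part of the Standard Task Graph Set
"""

num_tasks, tasks = parse_stg(SAMPLE)
print(num_tasks)
print(tasks[3])  # processing time and predecessors of task 3
```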
VITA
Graduate College 
University of Nevada, Las Vegas
Bharath Radhakrishnan
Local address:
1165 Maryland Circle #1 
Las Vegas, NV 89119
Home Address:
New No: 153, Old No: 35/1,
Ashtalakshmi Apartments 
Mambalam High Road,
T.Nagar, Chennai, Tamil Nadu,
ZIP-600 017.
INDIA
Degrees:
Bachelor of Engineering, Electronics and Communication Engineering, 2001 
Dr. MGR Engineering College, University of Madras, India
Publications:
# "Multiple Voltage with Frequency Variation for power minimization at high 
level synthesis", IEEE 2003 Euro Digital
Bharath Radhakrishnan and Dr. Muthukumar Venkatesan
Thesis Title: Multiple Voltage and Frequency Variation for Power Minimization in High 
Level Synthesis
Thesis Examination Committee:
Chairperson, Dr. Muthukumar Venkatesan, Ph.D.
Committee Member, Dr. Henry Selvaraj, Ph.D.
Committee Member, Dr. Yingtao Jiang, Ph.D.
Graduate Faculty Representative, Dr. Laxmi P. Gewali, Ph.D.
