University of South Florida

Scholar Commons
Graduate Theses and Dissertations

Graduate School

2005

Leakage Power Driven Behavioral Synthesis Of Pipelined ASICs
Ranganath Gopalan
University of South Florida

Follow this and additional works at: https://scholarcommons.usf.edu/etd
Part of the American Studies Commons

Scholar Commons Citation
Gopalan, Ranganath, "Leakage Power Driven Behavioral Synthesis Of Pipelined ASICs" (2005). Graduate
Theses and Dissertations.
https://scholarcommons.usf.edu/etd/2903

This Thesis is brought to you for free and open access by the Graduate School at Scholar Commons. It has been
accepted for inclusion in Graduate Theses and Dissertations by an authorized administrator of Scholar Commons.
For more information, please contact scholarcommons@usf.edu.

Leakage Power Driven Behavioral Synthesis Of Pipelined ASICs

by

Ranganath Gopalan

A thesis submitted in partial fulfillment
of the requirements for the degree of
Master of Science in Computer Engineering
Department of Computer Science and Engineering
College of Engineering
University of South Florida

Major Professor: Srinivas Katkoori, Ph.D.
Nagarajan Ranganathan, Ph.D.
Soontae Kim, Ph.D.

Date of Approval:
March 4, 2005

Keywords: MTCMOS, speedup, simulated annealing, clique-partitioning, data initiation
intervals

© Copyright 2005, Ranganath Gopalan

DEDICATION
To my family and friends.

ACKNOWLEDGEMENTS
I would like to profoundly thank my major professor, Dr. Srinivas Katkoori, for never
being short on encouragement and support. He has been an immense source of sound
advice and judgement, which helped bring this thesis to fruition. I would like to thank Dr.
N. Ranganathan and Dr. Soontae Kim, for being on my committee and providing me with
valuable feedback for my thesis and future work. I would like to acknowledge the team at
Tech Support for their help and assistance. I would also like to acknowledge the help of
my colleagues Vyas, Hariharan, and Narender. Finally, I am deeply grateful to my
very good friends both near and far, for their help and support at all times.

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
ABSTRACT

CHAPTER 1 INTRODUCTION AND RELATED WORK
1.1 Leakage reduction techniques
1.2 High level leakage reduction
1.3 Power reduction during functional pipelining

CHAPTER 2 PRELIMINARIES
2.1 Behavioral synthesis
2.2 Pipelining during behavioral synthesis
2.3 The AUDI system
2.4 AUDI frontend
2.5 MTCMOS binding
2.6 Simulation-based architectural leakage estimation

CHAPTER 3 PROPOSED APPROACH
3.1 Functional resource optimization
3.1.1 Simulated annealing
3.1.2 Generation of neighbour states
3.1.3 Evaluation of neighbour-state cost
3.1.4 Cost function characterization
3.2 Register allocation & optimization

CHAPTER 4 EXPERIMENTAL RESULTS

CHAPTER 5 CONCLUSIONS AND FUTURE WORK

REFERENCES

LIST OF TABLES

Table 2.1 Controller for FIR filter
Table 3.1 Leakage and settling constants table for 16-bit adder activity combinations (δ0 = 3)
Table 4.1 Benchmarks used for analysis
Table 4.2 FIR filter leakage & area analysis
Table 4.3 EWF filter leakage & area analysis
Table 4.4 IIR filter leakage & area analysis
Table 4.5 AR filter leakage & area analysis
Table 4.6 FFT leakage & area analysis
Table 4.7 Running times for SA-algorithm
Table 4.8 Speedup analysis for various benchmarks

LIST OF FIGURES

Figure 1.1 Current components of transistor leakage (reproduced from [1])
Figure 1.2 MTCMOS XOR gate
Figure 1.3 LECTOR technique
Figure 1.4 Pipelines with various initiation rates
Figure 2.1 ASAP schedule of FIR filter
Figure 2.2 Resource allocation of FIR filter
Figure 2.3 Resource bindings for FIR filter
Figure 2.4 RTL datapath of FIR filter
Figure 2.5 Procedure for generic pipeline allocation
Figure 2.6 AUDI high level synthesis flow
Figure 2.7 MTCMOS Macrocell implementation
Figure 3.1 Placement of operations in space-time matrix
Figure 3.2 Pipeline placement of operations (δ0 = 2)
Figure 3.3 Simulated annealing procedure for FU leakage optimization
Figure 3.4 Binding procedure for SA
Figure 3.5 Generation of new states
Figure 3.6 Activity table example
Figure 3.7 Activity combinations for various data-initiation rates
Figure 3.8 Cost Evaluation procedure
Figure 3.9 Clique partitioning for leakage power optimization
Figure 4.1 Total area consumption of datapath (AR filter)
Figure 4.2 Leakage power profile comparison (AR filter)

LEAKAGE POWER DRIVEN BEHAVIORAL SYNTHESIS OF
PIPELINED ASICS
Ranganath Gopalan
ABSTRACT
Traditional approaches for power optimization during high level synthesis have targeted single-cycle designs where only one input is being processed by the datapath at
any given time. Throughput of large single-cycle designs can be improved by means of
pipelining. In this work, we present a framework for the high-level synthesis of pipelined
datapaths with low leakage power dissipation. We explore the effect of pipelining on the
leakage power dissipation of data-flow intensive designs. An algorithm for minimization of
leakage power during behavioral pipelining is presented. The transistor level leakage reduction technique employed here is based on Multi-Threshold CMOS (MTCMOS) technology.
Pipelined allocation of functional units and registers is performed considering fixed data
introduction intervals. Our algorithm uses simulated annealing to perform scheduling, allocation, and binding for obtaining pipelined datapaths that have low leakage dissipation.
We have developed fully pre-characterized RT-level leakage libraries for efficient derivation
of the cost functions and fast accurate simulations of the synthesized designs. Results show
an average leakage power reduction of 38.2% for various benchmarks, and an average area
overhead of 6.2% over unoptimized pipelined designs. However, when a latency of 1, 2, or 3
is added to the schedule length, area optimizations in the range of 3.89-4.6% are observed,
while the total leakage reduction drops by around 2.8-3.4% in these cases.

CHAPTER 1
INTRODUCTION AND RELATED WORK

Power management is increasingly becoming a major driving force behind the methodologies governing the design and development of high performance VLSI systems. Until
recently, speed and reliability formed the key considerations in the pre-Deep Sub Micron
(DSM) era during the conception, development, and tape-out of large ASIC systems. This
was due to a perspective that optimizing large designs for power consequently lowers
their performance, and hence this trade-off did not favour power optimization during the
design flow. However, with rapid advancements in technology coupled with increasingly
large performance requirements, power has become a significant bottleneck to the reliable design of large ASIC systems [2].
Intelligent algorithms for minimizing power dissipation of CMOS circuits, during various
stages of design, are gaining considerable attention as technology advances into the DSM
era.
At the logic and device levels, power dissipation consists of two major components,
namely dynamic dissipation and static dissipation. Dynamic power, also known as
switching power, is caused by the charging and discharging of capacitances in
signal paths and is a major concern for circuits manufactured using transistors with large
feature sizes. For high-speed circuits, the switching power can be very large, as given in
Equation 1.1.

\[ P_{dynamic} = C_L \, V_{DD}^{2} \, f_m \tag{1.1} \]

where CL is the load capacitance, VDD is the supply voltage, and fm is the frequency
of the system. As seen from Equation 1.1, the supply voltage VDD is a major contributor
to the dynamic as well as the overall power dissipation of CMOS circuits. Modern CMOS
processes generally employ very low VDD supply voltages. VDD scaling provides for a substantial reduction in dynamic power, due to the quadratic dependence on VDD. However,
this consequently increases the transistor delay (td) and reduces the overall performance
of the circuit as given in Equation 1.2.
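\[ t_d = K \, \frac{C_L}{V_{DD}} \tag{1.2} \]

where

\[ K = \frac{1}{(1 - x)} \left[ \frac{(x - 0.1)}{(1 - x)} + \frac{1}{2} \ln(19 - 20x) \right] \tag{1.3} \]

\[ x = \frac{V_{tn}}{V_{DD}} \ \text{or} \ \frac{V_{tp}}{V_{DD}} \tag{1.4} \]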

where Vtn is the threshold voltage of the NMOS transistor, and Vtp is the threshold voltage
of the PMOS transistor. The above equation reduces to

\[ T_{delay} \propto \frac{C_L \, V_{DD}}{K \, (V_{DD} - V_{th})^{\alpha}} \tag{1.5} \]

where CL is the load capacitance, K is a process-dependent constant, Vth is the
threshold voltage, and α is a factor dependent on the channel length [3]. The delay of CMOS
gates suffers as the VDD supply voltage is reduced to as low as 1V in the DSM regime. Designers
offset this delay by employing transistors with a lower threshold voltage Vth, which
provide faster switching and improved propagation delays. But Vth scaling is not
without ills either, as it leads to an exponential increase in the sub-threshold leakage
current, as given in Eqn. 1.6.

\[ I_{sub} = A \, e^{\frac{q}{n' k T} \left( V_{gs} - V_T - \gamma' V_s + \eta V_{ds} \right)} \left( 1 - e^{\frac{-q V_{ds}}{k T}} \right) \tag{1.6} \]

where

\[ A = \mu_0 \, C_{ox} \, \frac{W_{eff}}{L_{eff}} \left( \frac{k T}{q} \right)^{2} e^{1.8} \]

Cox is the gate oxide capacitance per unit area, µ0 is the zero bias mobility, n′ is the
sub-threshold swing coefficient of the transistor, VT is the zero bias threshold voltage,
Vs is the source voltage, Vgs is the gate-source voltage, and Vds is the drain-source
voltage. γ′ is the linearized body effect coefficient and η is the Drain
Induced Barrier Lowering (DIBL) coefficient [4].
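The trade-off described by Equations 1.1, 1.5, and 1.6 can be explored numerically. The following is a minimal sketch, not part of the original work: the function names, the values α = 1.3 and n′ = 1.5, and the simplification of Equation 1.6 (Vgs = Vs = 0, DIBL ignored, Vds much larger than kT/q) are all illustrative assumptions.

```python
import math

K_B_OVER_Q = 8.617333262e-5  # Boltzmann constant divided by electron charge, in V/K


def dynamic_power(c_load, v_dd, freq):
    """Equation 1.1: P_dynamic = C_L * V_DD^2 * f_m."""
    return c_load * v_dd ** 2 * freq


def relative_delay(c_load, v_dd, v_th, alpha=1.3, k=1.0):
    """Equation 1.5 up to proportionality: C_L * V_DD / (K * (V_DD - V_th)^alpha).

    alpha and k are illustrative values, not taken from the thesis.
    """
    return c_load * v_dd / (k * (v_dd - v_th) ** alpha)


def relative_subthreshold_leakage(v_th, n=1.5, temp_kelvin=300.0):
    """Simplified Equation 1.6 for an OFF transistor.

    Assumes V_gs = V_s = 0, no DIBL term, and V_ds >> kT/q, so that
    I_sub is proportional to exp(-V_T / (n' * kT/q)).
    """
    thermal_voltage = K_B_OVER_Q * temp_kelvin
    return math.exp(-v_th / (n * thermal_voltage))


if __name__ == "__main__":
    # Halving V_DD cuts dynamic power by roughly a factor of four ...
    print(dynamic_power(10e-15, 2.0, 500e6) / dynamic_power(10e-15, 1.0, 500e6))
    # ... but a lower V_th is then needed to recover speed at V_DD = 1 V ...
    print(relative_delay(10e-15, 1.0, 0.4), relative_delay(10e-15, 1.0, 0.2))
    # ... and that lower V_th raises sub-threshold leakage exponentially.
    print(relative_subthreshold_leakage(0.2) / relative_subthreshold_leakage(0.4))
```

Under these assumptions, the sketch shows the quadratic benefit of VDD scaling on dynamic power alongside the exponential leakage penalty of the Vth scaling used to compensate for the lost speed.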
In this work, we will primarily focus on the specific problem of leakage power reduction
for high performance VLSI systems. For completeness, the various current components of
transistor leakage are illustrated in Fig. 1.1 and summarized below [1]:
1. Reverse-biased PN diode leakage: An effect that occurs due to the reverse biased
pn junctions between the drain/source and the substrate of the CMOS transistor.
Band-to-Band Tunneling also occurs due to the bias across the pn junction causing
electrons to tunnel from p-region to the n-region. This phenomenon usually occurs
in high and unevenly doped silicon junction regions.
2. Sub-threshold leakage: When the gate-source voltage (Vgs) is reduced below the
threshold voltage (VTH), a weak inversion condition is formed in the source-drain
channel of the transistor. This inversion region gives rise to a diffusion current carried
by minority carriers. This weak-inversion current is negligible in devices with
larger VTH; however, as shown in Eqn. 1.6, it becomes significant for modern low
VTH devices. An off-shoot of sub-threshold leakage is Drain Induced Barrier Lowering
(DIBL) [5], which happens when the threshold voltage is further reduced by barrier
lowering caused by the source-drain interaction near the channel surface in short-channel devices.
3. Gate-oxide Tunneling Leakage: When the gate-oxide thickness is reduced, the increased effect of the gate voltage causes electrons to tunnel from the gate to substrate
and vice versa, leading to gate leakage current. Two mechanisms associated with this
phenomenon are Fowler-Nordheim Tunneling and direct tunneling [5].
4. Hot-carrier gate leakage: This occurs from electrons or holes gaining sufficient potential to cross the interface barrier and enter into the oxide layer. This effect is also
known as hot-electron injection.
5. Gate induced drain leakage (GIDL): When the gate-drain junction is reverse-biased,
the high field effect of the gate causes the drain region under the gate to be depleted of
minority charge carriers. This causes an inversion condition in the drain which can
lead to avalanche multiplication and Band-To-Band-Tunnelling (BTBT) [5] effects.
This gives rise to a drain-to-substrate current, which is dependent on the doping
level of the drain.
6. Punchthrough: This occurs when the depletion region around the drain extends
towards the source, causing a current. When the drain is at a high voltage, this tends
to reduce the channel length irrespective of the gate voltage. This causes an
increase in the total sub-threshold current.
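Figure 1.1. Current components of transistor leakage (reproduced from [1])
[Figure: cross-section of a transistor on a p-substrate with gate, source, and drain terminals and n regions; the arrows numbered 1 to 6 correspond to the leakage components listed above.]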

As the technology scales into Deep Sub Micron (DSM) design regimes (<90nm), the
leakage component of power begins to command a greater proportion of total power dissipation than switching power [6]. Large systems exhibiting significant standby periods
will be plagued by this issue, as they tend to dissipate huge amounts of leakage power.
This problem escalates as more and more computing applications move into the wireless
domain. Overlooking these problems can lead to rapid depletion of battery power and
long-term damage to circuitry.

1.1 Leakage reduction techniques

To reduce leakage power in low-voltage systems, specialized techniques have to be
devised that can effectively curb the aforementioned effects. Leakage reduction can be
performed at the process level, which may involve either channel engineering or numerous
techniques involving doping variations [1]. It can also be accomplished at the circuit
level, which involves specialized redesign of digital CMOS circuits or design
methodologies targeted towards leakage management. Circuit level design
techniques lend an appreciable degree of freedom to designers in devising and
developing automated leakage reduction strategies that can be readily integrated in design
tools.
Although there exist a number of leakage reduction techniques in the literature, we
will discuss some of the techniques that have been dealt with more recently. These are
summarized as follows:
1. Multi-Threshold CMOS: In this technique, an extra transistor known as the sleep
transistor is inserted in series between VDD and the rest of the functional circuitry.
Subsystems are put to ‘sleep’ by gating the sleep transistor off, which cuts off their
VDD supply. These subsystems are generally functional units such as full-adders,
multipliers, floating-point ALUs, decode units, etc., which may be idle at certain
points in time during the operation of the entire system. When a functional unit
becomes idle, control signals known as sleep signals turn off the sleep transistors
and isolate the unit from its VDD supply. This results in a large reduction in the
sub-threshold leakage current flowing through the OFF stacks of transistors. This
technique was presented by Mutoh [7] and is illustrated in Figure 1.2.
The disadvantage of this technique is the delay penalty incurred due to frequent
‘sleep’ and ‘wake-up’ transitions of the functional circuitry. In MTCMOS, sleep
transistors generally have a larger VTH than the rest of the transistors in
the circuit. The sleep transistor also has to be sized appropriately, since it has to
supply a large current through the active transistors and the load capacitance of the
circuit. The sizing needed to limit the performance loss drives up the area overhead
associated with the sleep transistor.
Figure 1.2. MTCMOS XOR gate
[Figure: an XOR gate with inputs A and B built from small-Vth transistors, connected to Vdd through a large-Vth transistor controlled by the SLEEP signal.]

2. Dual Threshold CMOS: In a logic circuit, the critical path dictates the maximum
timing constraint and the available slack that can be imposed on other paths that
are not critical [8]. Low VTH transistors allow for faster switching, but dissipate
more leakage power compared to high VTH transistors, which switch more slowly
but have significantly lower leakage dissipation. Thus, transistors (cells) along the
critical path can be low VTH transistors (cells), while the slacks in timing that exist
in non-critical paths can be utilized by employing high VTH transistors (cells). This
technique has the advantage of little to no area overhead, while also providing very
little compromise in the timing of the logic circuit.
3. Transistor stacking: Halter and Najm [9] establish that leakage dissipation is an
input-dependent condition. Different inputs to the same logic can result in different
leakage currents. In fact, the input vectors to a logic circuit can be ordered in terms of
increasing leakage power. By varying the inputs, a phenomenon known as transistor
stacking takes place. In this approach, a number of transistors either in the p-network
or in the n-network are turned off due to the input vector combination. When more
than one transistor in a stack of series-connected transistors is turned off, the
sub-threshold leakage current is seen to reduce substantially [10]. A minimal leakage input
vector is one that maximizes the stack of off transistors. Many techniques have been
explored that attempt to determine leakage-minimizing input vector combinations,
an approach also known in the literature as Input Vector Control (IVC) [11].

Figure 1.3. LECTOR technique
[Figure: a gate with input i/p and output o/p, consisting of a P-network and an N-network.]

4. Self-stacking Dual Transistor method: This technique, also known in the literature by its
acronym LECTOR (for Leakage Controlling Transistors), was proposed by Hanchate
and Ranganathan [12]. A pair of p and n transistors, whose sources are shorted to
their counterpart gates, is placed in series with the p and n network. These form a
pair of self-stacking transistors which, though invisible to the rest of the functionality, serve to greatly increase the resistance to the flow of leakage current. Figure
1.3 illustrates this technique. The authors present a few circuit level methods for
managing the potential area overhead that can be incurred in this method. However,
it is a technique that requires no threshold modifications and incurs no
delay penalties, while providing substantial gains in leakage reduction.

1.2 High level leakage reduction

Although leakage is a transistor-level phenomenon, the characteristics of various leakage
reduction techniques enable optimizations to be performed at various levels of design
abstraction. Powell et al. [13] present a novel approach to reducing leakage power in cache
memories using ‘intelligent’ caches that dynamically identify unused portions
and apply supply voltage gating to reduce leakage in those portions. A popular method
that substantially reduces sub-threshold leakage by turning off stacks of series-connected
transistors through input control is presented in [14]. Chen et al. [15] use genetic
algorithms to determine optimal low-leakage input vectors that minimize leakage in various
components through transistor stacking.
Sundararajan and Parhi [16] present an approach which uses the dual-VTH technique for
combinational logic. Static timing analysis of various logic blocks is performed to determine
critical path and non-critical path timings. Using dual-VTH transistors, they then balance
all non-critical path delays so as to match the critical path delay. The final
delay-balanced design is well optimized for leakage, while also satisfying the user-specified timing
constraint.
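The idea of spending non-critical slack on high-VTH cells can be illustrated with a short sketch. This is not the algorithm of [16]; it is a simple greedy heuristic written for illustration, and the gate names, delay values, and function names are hypothetical.

```python
def critical_delay(fanin, delays):
    """Longest path delay through a combinational DAG.

    fanin maps each gate name to the list of gates driving it.
    """
    arrival = {}

    def arrive(gate):
        if gate not in arrival:
            arrival[gate] = delays[gate] + max(
                (arrive(p) for p in fanin[gate]), default=0.0
            )
        return arrival[gate]

    return max(arrive(gate) for gate in fanin)


def assign_dual_vth(fanin, low_delay, high_delay, constraint):
    """Greedily move gates to high V_TH while the timing constraint holds."""
    delays = dict(low_delay)              # start with every gate at low V_TH
    high_vth = set()
    for gate in sorted(fanin):            # deterministic visiting order
        delays[gate] = high_delay[gate]   # tentatively slow the gate down
        if critical_delay(fanin, delays) <= constraint:
            high_vth.add(gate)
        else:
            delays[gate] = low_delay[gate]  # revert: timing would be violated
    return high_vth


# Hypothetical netlist: a -> b -> c is the critical path, d is off-critical.
fanin = {"a": [], "b": ["a"], "c": ["b"], "d": []}
low = {g: 1.0 for g in fanin}
high = {g: 1.5 for g in fanin}
print(assign_dual_vth(fanin, low, high, constraint=3.0))
```

In this example only the off-critical gate can absorb the slower high-VTH cell, which matches the intuition that leakage savings come from paths with timing slack.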
Most of these techniques perform optimizations at the logic and physical levels. In
our work, we consider the behavioral level of abstraction for our optimization strategy.
Behavioral or high level synthesis proceeds as follows: an
algorithmic behavior of a system, usually specified in a high-level language such as VHDL
or C, is resolved into a Control-Data-Flow Graph (CDFG). The CDFG contains information on the various data and control dependencies within the behavior. This intermediate
description is then analyzed by a synthesis system, which performs the tasks of operation
scheduling, resource allocation, and hardware binding, to obtain a register transfer level
(RTL) description. This RT-level description can then be processed by logic and layout
synthesis tools to produce physical layouts at specified CMOS technologies. Techniques for
leakage reduction when applied during behavioral synthesis, can result in RT-level solutions
with significantly reduced leakage power.

Khouri and Jha [17] presented a technique to reduce leakage power during behavioral
synthesis using the Dual-Vth technique. They provide a gate-level leakage analysis procedure using which they develop a pre-characterized gate library for leakage estimation.
Their optimization strategy uses a prioritization algorithm to identify frequently idle operations in a CDFG, and bind them to resources that will later on become non-critical
path elements capable of being instantiated with high-VTH cells. Critical path operations
are made up of low-VTH cells, and thus the final design does not incur any area or delay
penalty.
The low-level leakage reduction technique employed in this work is based on
Multi-Threshold CMOS (MTCMOS) technology. Gopalakrishnan and Katkoori [18] proposed
an approach for leakage power optimization during behavioral synthesis using MTCMOS.
Functional units and registers are bound in such a way that their idle times are maximized
and contiguous. The functional units with maximal idle time and potentially minimal
area-overhead penalties are then bound to MTCMOS technology.
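To make the notion of maximal and contiguous idle time concrete, the following minimal sketch computes the idle periods of one functional unit from the control steps in which it is busy. It is an illustration only, not the binding procedure of [18]; the function name and the example schedules are hypothetical.

```python
def idle_intervals(busy_steps, schedule_length):
    """Return inclusive (start, end) control-step ranges in which a unit is idle."""
    busy = set(busy_steps)
    intervals, start = [], None
    for step in range(1, schedule_length + 1):
        if step not in busy and start is None:
            start = step                        # an idle period begins
        elif step in busy and start is not None:
            intervals.append((start, step - 1))  # the idle period ends
            start = None
    if start is not None:
        intervals.append((start, schedule_length))
    return intervals


# Two hypothetical bindings that keep a unit busy for three of eight steps.
clustered = idle_intervals([1, 2, 6], 8)
scattered = idle_intervals([1, 4, 7], 8)
print(clustered, scattered)
```

Both bindings leave the unit idle for five steps, but the clustered binding yields fewer and longer idle periods, so a sleep transistor would need fewer sleep and wake-up transitions.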
Traditionally, leakage power reduction strategies have targeted regular single-cycle datapath synthesis. In such cases, leakage optimized designs have shown lower levels of performance than unoptimized designs. To date, we have not seen much work done on
leakage power reduction during pipeline synthesis. We deal with functional
pipelining during behavioral synthesis and its effects on the leakage power dissipation of the
system, and show how pipelining can be easily adapted to synthesize systems that have both
low leakage dissipation and high throughput. This is one of the first works that actively
emphasizes the impact of pipelining during low leakage power synthesis.

1.3 Power reduction during functional pipelining

Functional pipelining is the method of segmenting the data-flow graph of a behavioral
description into several stages, where each stage contains the hardware resources mapped to
execute the operations of its sub-graph. The results of each stage are stored in
registers or latches generally known as stage latches or stage registers. The characteristic
feature of this technique is that successive tasks are initiated before the results of the
previous tasks are obtained. This increases the throughput of the circuit
compared to that of a non-pipelined datapath obtained from behavioral synthesis, where
consecutive results are obtained only after a latency equal to or greater than the
length of the critical path of the circuit. However, there is an inherent increase in the
register and functional resource cost, due to implementation of the pipeline stages.

Figure 1.4. Pipelines with various initiation rates
[Figure: panels labelled DII = 1 input/clock, DII = 1 input / 2 clocks, DII = 1 input / 3 clocks, and flow-based high level synthesis.]

Pipeline synthesis has been widely explored in the literature, and many behavioral
synthesis systems have incorporated functional pipelining algorithms within their synthesis flow. Sehwa [19] is the first synthesis system that generates pipelined datapaths
from behavioral descriptions. It schedules a data-flow graph with feasibility constraints
to determine a minimal-cost maximal-performance pipeline. It performs scheduling and
allocation simultaneously to determine optimal-performance designs, and considers a fixed
pipeline latency for synthesis purposes. HAL [20] uses the force-directed scheduling algorithm to perform loop winding. PLS [21] uses a heuristic-based list scheduling algorithm,
and performs forward and backward scheduling for minimal delay and optimized resource
usage. It considers loops with inter-iteration dependencies. SODAS-DSP [22] uses an iterative/constructive type scheduler and a two-pass allocation approach for the synthesis
of pipelines. While most works synthesize pipelines with fixed data initiations, Jun and
Hwang [23] consider the problem of synthesis of pipelines supporting variable data initiation
intervals.

In pipelined systems, the throughput is dependent on the data introduction rate and
the clock frequency. A fully pipelined system is one that has a data introduction rate of
one input per clock cycle. Even if a pipeline with data introduction rate of two is clocked
at the same frequency as a fully pipelined system, its throughput will be effectively half of
the fully pipelined system. Figure 1.4 shows pipelines with data introduction rates of 1, 2
and 3, and also a regular single-cycle datapath system. Each box depicts the time interval
between successive data initiations known as a data window. It also depicts the resource
sharing possibilities between operations within a data window. Operations executing at
concurrent times cannot share resources amongst each other, whereas they can share with
those that switch at all other non-concurrent times.
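The relationship between data introduction rate, throughput, and resource sharing can be sketched as follows. This is a minimal illustration assuming single-cycle operations and a fixed data introduction interval; the function names are hypothetical, and the modulo test is the standard way of expressing "concurrent" execution in a pipeline rather than a procedure quoted from this work.

```python
def throughput(clock_hz, dii):
    """Results per second for a pipeline accepting one input every `dii` clocks."""
    return clock_hz / dii


def can_share(step_a, step_b, dii):
    """Decide whether two single-cycle operations may share a functional unit.

    With a new input every `dii` clocks, an operation scheduled at control
    step t is active at t, t + dii, t + 2*dii, ... for successive inputs.
    Two operations therefore execute concurrently exactly when their
    control steps are equal modulo dii.
    """
    return step_a % dii != step_b % dii


# At the same clock, a pipeline with DII = 2 has half the throughput of a
# fully pipelined (DII = 1) design.
print(throughput(100e6, 1), throughput(100e6, 2))

# With DII = 2, steps 1 and 3 overlap across successive inputs; steps 1 and 2 do not.
print(can_share(1, 3, dii=2), can_share(1, 2, dii=2))
```

Under this model a fully pipelined design (DII = 1) allows no sharing at all, since every control step is concurrent with every other, which is one reason the resource cost grows as the data introduction rate increases.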
A general notion is that pipelined systems consume less power than non-pipelined
systems. This is because the stage registers allow the
individual components to function at a lower rate [24]. This provides the flexibility to
lower the system supply voltage substantially, reducing the power dissipation by a
factor of about 2-2.5. In the same work, the authors describe a method to reduce power
by having parallel pipelines for a single-flow circuit. Here, consecutive data initiations
are switched onto different parallel paths, so as to reduce power while maintaining the
same throughput. A shortcoming of this approach is the amount
of registers and interconnect required to implement the input switching logic and output
multiplexing logic.
Dynamic power, which is a particularly dominant component of power in ASICs
fabricated at the sub-micron level (0.35µm and above), can be considerably reduced
by these methods. Chang [25] presents a method for minimizing dynamic power in
functionally pipelined datapaths with conditional branches. Resource binding is
formulated as a network flow problem, in which an optimal binding
is determined so as to minimize the total switching activity. Heo et al. [26] explore the
effect of pipelining on power consumption, and attempt to determine a logic depth
within each pipeline stage such that the overall power is reduced due to pipelining. A loop
pipelining algorithm which makes use of rotation scheduling to minimize power while also
reducing schedule length was presented in [27]. Kim et al. [28] minimize power in pipelines
by scheduling the operations such that operations with common inputs share the same functional
units to minimize input transitions.
These works address the issue of dynamic power reduction. However, sub-threshold leakage has become an important consideration due to the shrinking of transistor sizes and
the use of lower supply voltages. Therefore, algorithms are needed that are capable of synthesizing leakage-optimized pipeline designs, giving the designer the advantages of low
supply voltages, increased throughput, and lower leakage power dissipation. In our work,
we attempt to address this particular issue.
The rest of this thesis is organized as follows: Chapter 2 gives a preliminary overview of
basic behavioral synthesis and describes the steps that are to be taken into consideration
for pipelined designs. Leakage power optimization in pipelined designs is a limiting case
of similar problems in regular high level synthesis. Hence the assumptions
that are to be considered when performing optimization during pipelining are also described
in this chapter. Chapter 3 gives a detailed description of our simultaneous scheduling,
allocation, and binding approach. Chapter 4 describes the experimental procedure and
results. Chapter 5 concludes this thesis and gives details of the future work.

CHAPTER 2
PRELIMINARIES

Before beginning the explanation of our approach, we will give a detailed overview
of the various preliminaries involved in behavioral and pipeline synthesis, as well as the
underlying techniques and assumptions being employed in our approach. We will begin by
giving an overview of behavioral synthesis and its various aspects.

2.1 Behavioral synthesis

A behavioral synthesis system converts an algorithm, usually specified in a high-level
language, to a more elaborate structural level description known as a Register Transfer
Level (RTL) description. This process consists of many steps, which may or may
not be independent of each other. In the initial step, a behavioral description is resolved
into an intermediate graph level specification. This graph captures the complete data-flow
and control-flow behavior of the system, and is an intermediate language bridging a
high-level language and a synthesis system. Hence the process of behavioral synthesis (hereafter
referred to as high-level synthesis) can also be referred to as behavioral compilation.
Typically, graph descriptions for high level synthesis are of three types:
1. Data-flow graph (DFG): which describes the data-dependence of various operations
on each other. It stores the predecessor and successor information of operations,
which will be used in the scheduling phase. It contains the information regarding
the number and type of functional units that will be needed during synthesis. Each
operation generally consists of two inputs and one output, which can be varied to
cater to different user-specific requirements.
2. Control-flow graph (CFG): which describes the control-flow of the behavioral description and adheres strictly to the precedence conditions present in the data-flow
graph. Information on advanced sequential constructs such as branching and looping
is captured in the control-flow graph.
3. Control-Data-flow graph (CDFG): which is a more comprehensive representation format, and is a combination of DFG and CFG. This format represents the behavioral
nature of many real-world designs. Many architecture modelling languages such as
VHDL and SystemC are resolved into CDFGs before compiler level optimizations
are made.
In this work, we consider a data-flow graph G(V, E) where V represents the set of
operations in the graph, and E represents the edges between them. Graphs for high-level
synthesis can be either data-flow intensive or control-flow intensive (CFI) or both. In
data-flow intensive designs, there are few control-flow changes, and most of the functionality
is characterized by computationally intensive operations present in the graph. Control-flow
intensive designs contain many advanced sequential constructs such as conditionals and
loops that affect the control-flow, making the design of the controller more complicated. In
this work, we consider data-flow intensive designs over control-flow intensive designs for
the following reasons:
1. Design of the controller is simpler in this case.
2. The controller contributes very little leakage power, and the bulk of the leakage power
dissipated is concentrated in the datapath in this case.
The data-flow graph extracted from the high level design is then analyzed during the
steps of scheduling, allocation, and binding for the realization of the RTL. These steps are
explained below:
1. Scheduling: is the process of assigning operations in a data-flow graph to specific
time-steps, such that data dependencies are satisfied and certain user-specified system constraints are met (such as resources, latency, area, or power). Typical scheduling algorithms available in the literature are As-Soon-As-Possible (ASAP), As-Late-As-Possible (ALAP), Force-directed scheduling (FDS), etc. The output of the scheduling
phase is a time-stamp for each functional operation, and lifetime information for each
edge in the data-flow graph.

Figure 2.1. ASAP schedule of FIR filter
[Figure: multiplications 1, 2, 4, 6, and 8 in time-step T1; additions 3, 5, 7, and 9 in time-steps T2 to T5.]

2. Resource allocation: refers to partitioning the scheduled data-flow graph, such that
operations with non-overlapping execution lifetimes belong to the same partition.
Generic graph partitioning algorithms such as clique-partitioning, Left-edge algorithm (LEA), etc., are used in this phase for resource allocation.
3. Binding: is the phase during which these partitions are then bound to hardware
instances. A typical goal of the binding phase is the maximization of the utilization
of a hardware instance, which is related to the area-consumption by the datapath.
Binding decisions typically affect the various characteristics of the final datapath,
such as area, power, crosstalk, etc. Hence, it is necessary to have an intelligent
binding algorithm that gives an optimal solution matching the needs of the designer.
As an example, we will illustrate the high-level synthesis of a Finite Impulse Response (FIR) filter. Figure 2.1 shows the ASAP schedule of the FIR filter data-flow graph.
Figure 2.2 gives the various partitions formed during the allocation phase. The binding
of operations to functional instances and edges to registers is shown in Figure 2.3. The
final RTL datapath resulting after the binding phase is illustrated in Figure 2.4. The FSM
controller for this datapath is illustrated in Table 2.1.

Figure 2.2. Resource allocation of FIR filter (the scheduled graph of Figure 2.1 annotated with the partitions, such as MULT_1 and ADD_1)

The FIR filter consists of 5 multiplier operations and 4 adder operations. The critical
path has a length of 5, which is also the latency of the ASAP schedule. The number of
allocated multipliers and adders is 5 and 1 respectively. In the binding phase, multiplexers
are generated which form the data-steering logic and are switched by the control signals
generated by the controller. The final RTL datapath containing functional units, registers,
and multiplexer logic is then obtained.
Having given an overview of high-level synthesis and its important aspects, we will
now focus our attention on functional pipelining during behavioral synthesis.

2.2 Pipelining during behavioral synthesis
There are generally two forms of pipelining known in the literature:

1. Structural pipelining: in which operations in the data-flow graph are bound to
pipelined hardware instances. However the execution of the data-flow graph as such
is not pipelined. This is also known as hierarchical pipelining.
2. Functional pipelining: in which there is no hierarchical pipelining or pipelined resources and the data-flow graph as a whole is pipelined. In our work, we consider the
problem of functional pipelining.

Multiplier instance   Ops
M0                    op1
M1                    op2
M2                    op4
M3                    op6
M4                    op8

Adder instance        Ops
A0                    op3 op5 op7 op9

Register instances R0 to R9 hold the edges x0 to x4, c1 to c5, r0 to r7, and yout.

Figure 2.3. Resource bindings for FIR filter

Figure 2.4. RTL datapath of FIR filter (registers R0 to R9, multipliers M0 to M4, and adder A0)

A functionally pipelined datapath is segmented into N linear stages [29], where each
stage contains the required resources to execute the relevant operations of that stage. The
number of stages in a pipelined data-flow graph is dependent on the schedule length or
latency λ of the design, and the data introduction interval δ0 (also known as the pipeline
latency), as follows:

$$\text{pipeline stages } N = \left\lceil \frac{\text{schedule length}}{\text{data initiation interval}} \right\rceil = \left\lceil \frac{\lambda}{\delta_0} \right\rceil$$
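
As a small worked example of the stage-count relation, the following sketch evaluates N for the FIR filter, whose ASAP schedule has latency 5. The function name `pipeline_stages` is an illustrative choice and not part of AUDI.

```python
def pipeline_stages(schedule_length: int, initiation_interval: int) -> int:
    """Number of pipeline stages N = ceil(lambda / delta_0), using integer arithmetic."""
    if schedule_length < 1 or initiation_interval < 1:
        raise ValueError("both arguments must be positive integers")
    return (schedule_length + initiation_interval - 1) // initiation_interval


# FIR example: latency lambda = 5, evaluated for a few data introduction intervals.
for delta0 in (1, 2, 3, 5):
    print(delta0, pipeline_stages(5, delta0))
```

With δ0 equal to the latency, the sketch yields a single stage, which corresponds to the non-pipelined case.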

The data introduction interval is a global constraint representing the time interval
between two consecutive input vectors. It is generally a constant term and smaller than the
latency λ. We shall take the example of a scheduling algorithm such as ASAP scheduling,
where the schedule is obtained by topologically sorting the vertices of the data-flow graph.
This ASAP schedule for an FIR filter is already shown in Figure 2.1.
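
The ASAP schedule, and the ALAP schedule used later to derive mobilities, can be sketched as follows. The edge list is an assumption: it models the FIR data-flow graph as five multiplications feeding a chain of four additions, which is consistent with Figure 2.1 but is not spelled out in the text. Unit-delay operations are also assumed, and the function names are illustrative only.

```python
from collections import defaultdict

# Assumed FIR data-flow graph (operation numbers follow Figure 2.1).
OPS = {1: "mult", 2: "mult", 3: "add", 4: "mult", 5: "add",
       6: "mult", 7: "add", 8: "mult", 9: "add"}
EDGES = [(1, 3), (2, 3), (3, 5), (4, 5), (5, 7), (6, 7), (7, 9), (8, 9)]


def topological_order(ops, edges):
    """Kahn's algorithm: repeatedly emit operations whose predecessors are all emitted."""
    succs = defaultdict(list)
    indegree = {v: 0 for v in ops}
    for u, v in edges:
        succs[u].append(v)
        indegree[v] += 1
    ready = sorted(v for v in ops if indegree[v] == 0)
    order = []
    while ready:
        u = ready.pop(0)
        order.append(u)
        for v in succs[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    return order


def asap(ops, edges):
    """Earliest time-step of each operation, assuming unit-delay operations."""
    preds = defaultdict(list)
    for u, v in edges:
        preds[v].append(u)
    time = {v: 1 for v in ops}
    for v in topological_order(ops, edges):
        for u in preds[v]:
            time[v] = max(time[v], time[u] + 1)
    return time


def alap(ops, edges, latency):
    """Latest time-step of each operation for a given schedule length."""
    succs = defaultdict(list)
    for u, v in edges:
        succs[u].append(v)
    time = {v: latency for v in ops}
    for v in reversed(topological_order(ops, edges)):
        for w in succs[v]:
            time[v] = min(time[v], time[w] - 1)
    return time


asap_times = asap(OPS, EDGES)
latency = max(asap_times.values())
alap_times = alap(OPS, EDGES, latency)
mobility = {v: alap_times[v] - asap_times[v] for v in OPS}
print(asap_times, latency, mobility)
```

Under these assumptions, only the multiplications that feed later additions have non-zero mobility; the operations on the critical path are fixed.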
Table 2.1. Controller for FIR filter

State   Control bus
S0      000000000000000000
S1      111111111101111100
S2      111110000010000000
S3      100000000000000000
S4      100000000000000010
S5      100000000000000001
S6      100000000000000011
S7      000000000000000000

Consider a data-flow graph containing k resource types which may be adder, multiplier,
comparator, etc. In the FIR example, k = 2 where the two resource types are an n-bit
adder and an n-bit multiplier. For each resource type, given a fixed pipeline latency (say
δ0 ∈ {1, 2, 3, 4, ... d}, where d is the maximum allowed data introduction interval), the
operations that execute at time-steps i·δ0 + l with the same offset l (where i is an integer
in the interval {1, 2, ... N} and l is an integer in the interval {1, 2, ... δ0 − 1}) occur
concurrently, and cannot share resources amongst one another. Operations that do not
execute concurrently are thus compatible and typically can share the same functional
resource. From this information, a compatibility graph S is built separately for functional
resources and for registers, and each graph is partitioned into a minimal number of cliques.
Pseudo-code of this allocation procedure is presented in Figure 2.5.
Pipeline allocation: P_Alloc[G(V_n, T_V, δ0)]
for i in {v0, v1, v2, ..., vn} of V
    for j in {v0, v1, v2, ..., vn} of V
        if (vi ≠ vj)
            if (T_vi mod δ0) ≠ (T_vj mod δ0)
                S(i, j, 1)
            else
                S(i, j, 0)
        else
            S(i, j, 1)
Clique_Partition{S}
Figure 2.5. Procedure for generic pipeline allocation
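
The allocation procedure of Figure 2.5 can be sketched in Python for one resource type at a time. The clique partitioner below is a simple greedy stand-in and is an assumption of this sketch; it is not the clique-partitioning algorithm used in AUDI and does not guarantee a minimal partition in general. The function names are illustrative.

```python
def compatibility(times, delta0):
    """S[i][j] = 1 when operations i and j never occupy the same pipeline time slot."""
    ops = sorted(times)
    return {
        i: {j: int(i == j or times[i] % delta0 != times[j] % delta0) for j in ops}
        for i in ops
    }


def greedy_clique_partition(S):
    """Place each operation into the first clique whose members are all compatible with it."""
    cliques = []
    for v in S:
        for clique in cliques:
            if all(S[v][u] for u in clique):
                clique.append(v)
                break
        else:
            cliques.append([v])
    return cliques


# ASAP time-steps of the FIR operations, taken from Figure 2.1.
adder_times = {3: 2, 5: 3, 7: 4, 9: 5}
multiplier_times = {1: 1, 2: 1, 4: 1, 6: 1, 8: 1}

print(greedy_clique_partition(compatibility(adder_times, 2)))
print(greedy_clique_partition(compatibility(multiplier_times, 2)))
```

Each resulting clique corresponds to one hardware instance of that resource type.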

These cliques are then mapped to functional resources and registers. The pipeline
controller differs from the regular controller in that it has only δ0 states. However, the
number of control signals per control state is much larger in pipelined systems
than in regular systems.

2.3 The AUDI system
For this work, we make use of AUDI [18], a behavioral synthesis system, as the framework in which our algorithms are integrated and experiments are carried out. AUDI (also
known as AUtomatic Design Instantiation) is currently being developed at the University
of South Florida, VLSI research group. It is an interconnect-centric behavioral synthesis
system, and can currently produce low-power and leakage-power optimized designs. The
AUDI high-level synthesis flow is shown in Figure 2.6.

Figure 2.6. AUDI high level synthesis flow (behavioral VHDL; VHDL2AIF frontend; AUDI Intermediate Format (AIF); CDFG generation; scheduling by ASAP or force-directed scheduling; resource allocation by clique-partitioning; binding, with power and crosstalk considerations and the MTCMOS component library; structural VHDL generation; register transfer level design consisting of controller and datapath; FASL leakage simulation library)

Behavioral VHDL is used as the input language to the system, which is converted to
a CDFG-like language using a VHDL-to-AIF translator. This CDFG format is referred to
as the AUDI Intermediate Format (AIF), and is the input format for the AUDI system.
The AIF format always consists of: 2-input/1-output operations, control-flow indicators,
such as if and while, and memory read-write operations.
The operations contained in AIF are scheduled using one of the many available scheduling algorithms such as ASAP, Force-directed scheduling etc. This is followed by the allocation phase where these operations are partitioned into a minimal number of maximal-size
mutually-exclusive partitions using a clique partitioner.
Multiplexers, which form the data-steering logic primarily used in AUDI, are
generated during the binding phase. The partitioned cliques are mapped to hardware-level
instances, and a structural netlist of functional units, registers, multiplexers, and their
interconnections is output in structural VHDL.
Many circuit-level considerations make their way into the binding phase to influence
the final design netlist. Designs may be bound so as to minimize power,
leakage, cross-talk, delay, and other such parameters. The type of technology and device-level characteristics can determine how binding is performed during high level synthesis.
Power can be minimized by using intelligent scheduling and binding solutions, and work
on this issue has been done in [30], [31], [32], [33], [34]. Leakage power minimization
during behavioral synthesis has been dealt with in [17], [18].
In AUDI, low leakage binding is performed using Multi-threshold CMOS (MTCMOS)
and an area-efficient Knapsack algorithm [35]. Our work determines an optimal binding
solution for pipelined datapaths and uses the approaches in [18] and [35] to obtain an
optimal MTCMOS datapath.
Before giving an overview of the MTCMOS binding solution that is integrated in
AUDI, we first describe the VHDL-to-AIF AUDI frontend that was developed as part
of this work.

2.4 AUDI frontend
The AUDI system was conceptualized for RTL synthesis, using VHDL as its main target
language. The system takes as input code written in behavioral VHDL and provides as
output a netlist in structural VHDL. The frontend of AUDI (also known as VHDL2AIF)
processes the behavioral VHDL and converts it to the AUDI Intermediate Format (AIF).
VHDL2AIF is a high-level translator and was written using Lex, Yacc, and C. AIF is the
CDFG input format for the schedulers used in AUDI. Some of the important features of
VHDL2AIF are:
1. Full IEEE-754 Standard 32-bit floating point representation.
2. Floating-point intensive operations such as SINE, COSINE, LOG, REAL, FP-add,
FP-mult, FP-divide are fully supported.
3. Support for memory-intensive operations (using op-codes such as MEMREAD and MEMWRITE).
4. Support for signal-indexing with integer and variable indexes.
5. Full support for nested loops and nested conditionals.
6. Supports WAIT statements.
7. Stable support for large ASIC benchmarks (>5K lines).
It consists of two parts: a Lexical analyser, which resolves the input VHDL file into
a stream of discrete units called tokens; and a Yacc parser, which reads in these tokens,
validates them with a pre-specified grammar, and executes an action which is coded in C.
The grammar conforms to the existing VHDL-93 standard language syntax. Currently, the
tool can handle only single-process architectures and single entities. Some of the VHDL
features that are handled in the translator are given below:
1. Package declarations: Constant, Type, and Subtype declarations are handled and
inline expanded wherever they are encountered in the architecture body.


2. Array operations: Arrays are instantiated as memory in AUDI. Operations on arrays are replaced with equivalent MEMREAD {Memory read(array, index)}, and
MEMWRITE {Memory write(array, index)} operations.
3. Real and Floating-point operations: Pre-compiled FP-benchmarks are employed for
performing floating-point operations [36]. Wherever floating point intensive operations are encountered, the translator substitutes them with simple atomic ops such as
FADD, FSUB, FMUL, FDIV; which are later instantiated by the VHDL2AIF floating
point library.
4. Signals and variables: Signals and variables are synthesized as registers in AUDI.
5. Conditionals: Nested conditionals which include if, else, endif, elsif, exit when are
fully supported.
6. Loops: Nested For and While loops with either definite or indefinite number of loop
iterations are handled.
7. Wait statements: While indefinite wait and wait on are not supported, wait for and
wait until are supported. These are translated into simple controller wait cycles.
Currently, the benchmarks that have been compiled using VHDL2AIF are NAVIFIND, a system for a mobile tour guide; RECOG, a coherent GPS signal receiver; and a
cancer detection benchmark performing a deconvolution fast Fourier transform.
We will now give an overview of the MTCMOS binding solution that is integrated in
AUDI.
2.5 MTCMOS binding
At the physical level, the functional modules are selectively bound to MTCMOS instances. The MTCMOS design methodology is described in Figure 2.7. Each MTCMOS
instance consists of a sleep transistor placed in series with its VDD rail and in series with
a macrocell or standard-cell implementation of the module. The delay of the instance is a
function of its corresponding sleep transistor width. As this width increases, the delay of
the instance decreases. However, the leakage power dissipated by the sleep transistor also
increases due to the dependency of the sub-threshold leakage current on the (W/L) ratio.
When a sleep transistor is sized beyond a certain level, the sub-threshold leakage power
that it dissipates when it is turned OFF during idle time becomes large.

Figure 2.7. MTCMOS Macrocell implementation (a sleep transistor in series with an 8-bit multiplier)

In this work, we consider a parameterized MTCMOS component library [18], which
consists of characterized functions to choose the appropriate sleep transistor width for a
required delay margin. The sleep transistor is generally of a higher V_TH than the other transistors present in the module. An active high sleep signal gates the sleep transistor off and
isolates the module from the VDD rail, thus causing a large reduction in the sub-threshold
leakage current flowing through the off stacks of transistors. This technique provides large
reductions during standby mode and during times when a functional unit becomes idle. In
AUDI, modules that have large idle times after scheduling, are allocated and bound to these
MTCMOS functional units. The sleep signals are generated by the controller circuitry and
generally cause a small increase in the dynamic power of the circuit. Thus MTCMOS can
offer a level of leakage power reduction comparable to other techniques though it bears the
additional burdens of area-overhead and delay penalty. However MTCMOS is a simpler
design technique and more readily applicable to the high level synthesis of large circuit
designs, whereas other techniques may need long precomputation times for determination
and application.


2.6 Simulation-based architectural leakage estimation
For large datapaths, leakage estimation through HSpice becomes very slow, due to its
long simulation times. There is a need for fast and accurate leakage estimation algorithms
so as to determine the efficacy and validity of leakage optimization algorithms. For this
purpose, we make use of a tool called FASL or Fast Architectural Simulator for leakage
power which is a register transfer leakage simulation library and is compatible with the
Cadence NCLaunch HDL simulation tools [37]. This library consists of pre-characterized
leaf cell components (namely full-adder, nand, and, or, not, xor, and so on), that are
exhaustively characterized for leakage power dissipation. The simulation model utilized in
FASL is explained as follows. The leakage dissipation of a leaf cell is primarily divided
into two regions: a transient region and a steady-state region. In the transient region, the
leakage power dissipated by the cell is temporally dependent on both the current input
and the previous input to the cell, while in the steady-state region, it depends only on the
current input. Using automated scripts, the threshold time or the time after the inputs
have changed for which this transient condition exists, is determined for each leaf-cell.
Thus, the instantaneous leakage power of a leaf-cell is given as

$$P_{leakage}^{T_i} = \begin{cases} P_{leakage}^{transient}[\text{current input}, \text{previous input}] & T_0 \le T_i \le T_{threshold} \\ P_{leakage}^{steady\ state}[\text{current input}] & T_i > T_{threshold} \end{cases}$$
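
The two-region leakage model can be sketched as a table lookup. The function name, the table layout, and all numeric values below are hypothetical placeholders chosen for illustration; they are not FASL data or the FASL interface. Time is measured from the last input change, so T0 corresponds to zero.

```python
def leaf_cell_leakage(t_since_change, t_threshold, transient, steady, current, previous):
    """Instantaneous leakage of one leaf cell under the two-region model.

    transient maps (current input, previous input) to a leakage value;
    steady maps the current input to a leakage value.
    """
    if t_since_change < 0:
        raise ValueError("time since the last input change cannot be negative")
    if t_since_change <= t_threshold:
        # Transient region: depends on both the current and the previous input.
        return transient[(current, previous)]
    # Steady-state region: depends only on the current input.
    return steady[current]


# Placeholder characterization of a single-input cell, in arbitrary units.
STEADY = {0: 1.0, 1: 2.0}
TRANSIENT = {(0, 0): 1.0, (0, 1): 1.5, (1, 0): 2.5, (1, 1): 2.0}

print(leaf_cell_leakage(0.2, 1.0, TRANSIENT, STEADY, current=1, previous=0))
print(leaf_cell_leakage(3.0, 1.0, TRANSIENT, STEADY, current=1, previous=0))
```

The total leakage of a datapath would then be the sum of this quantity over all of its leaf cells.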

Since all leaf-cell level components are fully characterized, a simulation of the hierarchical description results in the simulation of the leaf cells. Thus the total leakage power
is calculated as a summation of all the individual leaf-cell leakage powers.
The leaf-cells are of two categories: non-MTCMOS cells and MTCMOS cells. For the
MTCMOS cells, the optimum sleep transistor width determined as in section 2.5, is used in
the design and extensive characterization is performed for both ’sleep’ mode and ’wake-up’
mode. Thus a complete simulation library for leakage power is available for simulating the
datapaths obtained from synthesis. The accuracy of the simulation is within 2% of the
results obtained from HSpice, while the run-times are several orders of magnitude shorter.


To summarize, we described the process of behavioral synthesis, which consists of the
steps of scheduling, allocation, and binding. These steps determine the quality of the
final RTL datapath, and intelligent algorithms for these steps can greatly optimize the
user parameters for the RTL such as latency, power, and area. We presented an allocation
procedure for pipelining data-flow graphs, which can be integrated into
existing synthesis systems for generation of pipelined designs. We also described the MTCMOS methodology for leakage power reduction, which is integrated into our behavioral
synthesis system, AUDI. Finally, we described FASL, a leakage estimation
tool that makes physical-level leakage power information available at the RT-level, so as
to drive the behavioral synthesis procedure towards optimizing the datapaths for minimal
leakage power. FASL is used extensively in the experiments related to our
approach, which are detailed in the next chapter.


CHAPTER 3
PROPOSED APPROACH

In pipelined datapaths, the number of functional units executing in any given clock cycle
is much larger than in non-pipelined datapaths. In non-pipelined synthesis, datapath
component types exhibit varying idle-times for each component instance. Due to the
implementation of pipelined stages, the cycle-lifetime of each component instance is greatly
reduced, which also reduces its idle periods. This reduction in idle periods also reduces
the number of times functional units or registers can be put in standby mode. Hence
the potential for leakage power reduction is smaller in pipelined datapaths than in non-pipelined
datapaths. It should be noted that when the data introduction interval δ0 is one clock
cycle (one input per clock cycle), all functional units and registers are active at all times. In this case, standby leakage
power is at its minimum since no functional unit has idle states at any time. Any form of
leakage reduction through MTCMOS becomes useless, due to the absence of idle-states for
this case. The notion of low leakage scheduling, allocation and binding for pipelined circuits
becomes relevant only when the data introduction interval δ0 is greater than one
clock cycle. Based on these observations, the following approaches are proposed. In this
work, leakage optimization for both functional resource allocation and register allocation
is performed.
3.1 Functional resource optimization
In this section, we discuss our leakage optimization algorithm for functional resource
allocation. In traditional techniques, a data-flow graph is scheduled using a generic scheduler such as ASAP or FDS. This gives a set of timestamps for operations, which are then
analysed by the allocation procedure. This timestamp set is then resolved into an operation
compatibility graph and a register compatibility graph, which are then partitioned using generic graph
partitioning algorithms such as clique-partitioning or the left-edge algorithm (LEA).
Assuming, as before, that there are k resource types ki ∈ {k1, k2, k3 ... kn}, the allocation
phase outputs an I × λ resource-allocation table, where I is the number of instances allocated
for that resource type ki and λ is the latency of the schedule. The allocation table for the
ASAP schedule of the FIR filter shown in Figure 2.1 is illustrated below.

Multiplier
      M1    M2    M3    M4    M5
T1    op1   op2   op4   op6   op8
T2    -     -     -     -     -
T3    -     -     -     -     -
T4    -     -     -     -     -
T5    -     -     -     -     -

Adder
      A1
T1    -
T2    op3
T3    op5
T4    op7
T5    op9

Figure 3.1. Placement of operations in space-time matrix

From Figure 3.1, it can be seen that multipliers M1, M2, M3, M4, M5 all have 1
active state (during timestep T1) and 4 standby states (during T2, T3, T4, T5). The
adder instance A1 has only 1 idle-state (T1) while its remaining states are active. From
this, we can intuitively gauge that the 5 multipliers can be bound to MTCMOS, since they
all experience long idle-times. We can also bind the adder A1 to MTCMOS, due to the
presence of 1 idle-state. But this may prove to be counter-productive, as the area-overhead and
delay of the adder would be increased just for the sake of the 1 idle-state in the adder.
Let us now consider the pipelined case, where λ of the allocation table is equal to the
data introduction interval δ0. This case is illustrated in Figure 3.2. Here, we can see
that each of the multipliers {M1, M2, M3, M4, M5} has 1 idle state and 1 active state.
Hence to minimize leakage power, they each have to be bound to MTCMOS instances.
However, we see that none of the adders are in standby; hence by pipelining we have
reduced the area-overhead of adders due to MTCMOS and also the possible dynamic and
leakage power by-product due to the sleep-transistor. While the adder allocations have
been optimized, the multiplier allocations are unoptimized. Such an arrangement is due to
the schedule obtained from ASAP. A good schedule would produce an optimal allocation
for the bulky multipliers with MTCMOS binding. Our work provides an approach
targeting this kind of problem.

Multiplier: instances M1 to M5 each execute one multiplication in T1 and are idle in T2.

Adder
      A1    A2
T1    op5   op9
T2    op3   op7

Figure 3.2. Pipeline placement of operations (δ0 = 2)
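
The effect of folding a schedule into δ0 time slots can be checked with a short sketch. It assumes that two operations of the same type conflict exactly when their time-steps are equal modulo δ0, so that the number of instances needed equals the largest slot occupancy. The function names are illustrative and the time-steps are the ASAP values of Figure 2.1.

```python
from collections import Counter


def fold_schedule(times, delta0):
    """Count the operations that fall into each pipeline time slot (time-step mod delta0)."""
    return Counter(t % delta0 for t in times.values())


def instances_and_idle_slots(times, delta0):
    """Return (instances needed, total idle slots) for one resource type."""
    occupancy = fold_schedule(times, delta0)
    instances = max(occupancy.values())
    idle_slots = instances * delta0 - len(times)
    return instances, idle_slots


multiplier_times = {1: 1, 2: 1, 4: 1, 6: 1, 8: 1}
adder_times = {3: 2, 5: 3, 7: 4, 9: 5}

for delta0 in (5, 2):
    print(delta0,
          instances_and_idle_slots(multiplier_times, delta0),
          instances_and_idle_slots(adder_times, delta0))
```

With δ0 = 5 the counts correspond to the non-pipelined table of Figure 3.1, and with δ0 = 2 to the pipelined placement of Figure 3.2.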

The premise for our optimization approach is framed as follows:
A resource instance (say an adder) may have a number of operations bound to it.
There may also be some finite idle periods that emerge within this instance after all
operations are bound. If there are many such instance bindings, where resources
contain many idle-times as well as operations, the following will hold.
• the number of resource allocations is high
• the number of MTCMOS bound instances is high (increased area overhead and controller
overhead)
• delay and capacitive power dissipation increase (due to frequent sleep and wake-up
transitions)

Existing allocation heuristics tend to be unaware of such problems. It can be considered infeasible to bind resources to MTCMOS if they are not idle for more than a single
clock period in a data introduction interval. Since the sleep transistor itself dissipates
leakage power, the leakage power saved by applying sleep transistors needs to be greater than the leakage power dissipated by the MTCMOS
instance. Our algorithm accounts for this fact, and makes modifications to the schedule
such that fewer modules are considered for MTCMOS binding. We propose a
scheduling, allocation, and binding algorithm that uses simulated annealing for searching
and discovering optimal solutions to this problem.
3.1.1 Simulated annealing

Simulated annealing is a meta-heuristic [38] that mimics the annealing process of metals.
A metal is heated to a very high temperature state, such that its atoms become free to
move about. The temperature is then slowly reduced and the metal is gradually cooled,
until it crystallizes and a thermal equilibrium is reached. When this state is achieved, the
atoms lose their mobility, and further cooling achieves no improvement in the
metal's equilibrium. In the same way, an initial solution is taken as the starting solution
and its cost is taken as the initial cost. Simulated annealing (SA) thereafter generates new
solutions by making either random moves or pre-computed ones. The gain in cost of each new
solution is then evaluated at each iteration. Positive gain moves
are accepted, while negative gain moves are accepted with a probability that depends
on the gain and on the temperature of the system at that time. The number of negative gain moves accepted
reduces as the temperature decreases. Simulated annealing accepts non-improving moves
(known as hill climbing) so as to escape from local minima.
Devadas et al [39] use simulated annealing for synthesizing datapaths with low area
and latency. The scheduling, allocation, and binding problem is modelled as a two-dimensional placement problem. Operations are placed in a two-dimensional matrix of
rows and columns. Scheduling operations in the data-flow graph is equivalent to placing
them in a row, which corresponds to a timestep. The columns correspond to possible
instances of a resource. The binding step is equivalent to placing operations in columns.
Figure 3.1 shows an example of this matrix which is unique for different resource types.
The matrix contains an operation placement after an As-Soon-As-Possible (ASAP) schedule is performed on the FIR filter example. Moves can be either interchanging the position
of two operations in the matrix, or finding a new position for an operation in the matrix or
interchanging the inputs between symmetric operations. The cost function is dependent on
total hardware area and execution time. They also describe a framework for synthesizing
pipelined datapaths using simulated annealing and producing solutions with reduced area
and minimum latency.
Prabhakaran et al [40] present a simultaneous scheduling, allocation and floorplanning
algorithm which uses simulated annealing to discover datapath solutions with minimized
interconnect power dissipation. The cost function is dependent on the transition activity
and the distance between modules on the floorplan for satisfaction of latency constraints.
Pipeline synthesis has a smaller search space than non-pipeline synthesis, and hence is
a limiting case of non-pipeline synthesis. The basic simulated annealing procedure which
is used in our approach is shown as pseudocode in Figure 3.3. The data-flow graph is
first subjected to both ASAP and ALAP scheduling to determine the individual operation
mobilities. We select the initial ASAP schedule as the starting solution from which improvements are made. A user-specified data introduction interval δ0 is considered
for operation binding. The operations are then bound in order of their schedule times and
their mobilities as shown in Figure 3.4. This binding is then cost-evaluated to give an
initial cost for the SA procedure. A move is selected from a pre-computed set of moves
and is then made by an operation irrespective of its resource type.
3.1.2 Generation of neighbour states

At the beginning of every iteration, a set M (i, j) containing all operation moves possible
from the current state, is created. Here i denotes the operation under consideration, and j
is the number of positions that i can move within its mobility range. A move is then made,
and the schedule is altered as directed. The resulting scheduled data-flow graph is then
operated on by the ALLOCATE_BIND procedure, and this binding solution is evaluated
for cost. The gain of this solution is determined, and as per the selection-rejection heuristic
of SA, the current schedule is saved. Our algorithm assumes that there is no change in
latency λ and that λ is equal to the length of the critical path of the data-flow graph.


SA_Optim[G(V, E)]
ASAP_G ← ASAP scheduling performed on G(V, E)
ALAP_G ← ALAP scheduling performed on G(V, E)
µ(G) ← ALAP_G − ASAP_G
Select starting solution x_start ← ASAP_G
B_xstart ← allocate_bind(x_start)
Initial_Cost ← cost(B_xstart)
Current solution S ← B_xstart
Initial temperature T ← T0
while cost is changing
    I ← number of iterations
    while I > 0
        new solution S' ← generate_neighbor(S)
        ∆C ← cost(B_S) − cost(B_S')
        if (∆C > 0) or (random(0, 1) < e^(∆C/T))
            S ← S'
        else
            S ← Best_found_so_far
        endif
        I ← I − 1
    endwhile
    T ← T ∗ cooling_rate
endwhile
return Best_found_so_far
Figure 3.3. Simulated annealing procedure for FU leakage optimization
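
A generic version of the annealing loop of Figure 3.3 can be sketched as follows. It is a minimal illustration and not the AUDI implementation: the stopping rule is a temperature floor rather than "while cost is changing", a rejected move keeps the current solution instead of resetting to the best one, and the parameter names and default values (`t0`, `cooling_rate`, `iterations`, `t_min`) are assumptions. The gain is defined as in the figure, so a positive gain is an improvement.

```python
import math
import random


def simulated_annealing(start, cost, neighbour, t0=10.0, cooling_rate=0.9,
                        iterations=50, t_min=1e-3, rng=None):
    """Minimize cost(solution); neighbour(solution, rng) proposes a new solution."""
    rng = rng or random.Random(0)
    current, current_cost = start, cost(start)
    best, best_cost = current, current_cost
    temperature = t0
    while temperature > t_min:
        for _ in range(iterations):
            candidate = neighbour(current, rng)
            candidate_cost = cost(candidate)
            gain = current_cost - candidate_cost
            # Improving moves are always accepted; worsening moves are accepted
            # with probability exp(gain / T), which shrinks as T decreases.
            if gain > 0 or rng.random() < math.exp(gain / temperature):
                current, current_cost = candidate, candidate_cost
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
        temperature *= cooling_rate
    return best, best_cost


# Toy usage: minimize (x - 7)^2 over the integers with unit steps.
def toy_cost(x):
    return (x - 7) ** 2


def toy_neighbour(x, rng):
    return x + rng.choice((-1, 1))


print(simulated_annealing(0, toy_cost, toy_neighbour))
```

In the synthesis setting, `start` would be the binding of the ASAP schedule, `neighbour` the move generator of Section 3.1.2, and `cost` the leakage cost function of Section 3.1.4.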

3.1.3 Evaluation of neighbour-state cost

A move by an operation in the current schedule produces a new schedule known as
its neighbor state. The operations are then bound based on this neighbor schedule by the
ALLOCATE_BIND procedure. This binding phase produces a set of tables T_ki, one for
each resource type ki. These tables are referred to as activity tables. Such an activity table
is represented in Figure 3.6. The dark squares represent operations executing in that
time-slot. Note that the number of time-slots is dictated by the data initiation rate δ0. A
blank square represents an idle-time for the functional unit. These idle-times are used for
enabling the ‘sleep’ signals which are supplied from the controller.
We identify the following criteria for an optimal pipeline RTL datapath:
1. The pipeline datapath should have minimal area.
2. It should have minimal standby leakage dissipation and consequently minimal overall leakage dissipation.
3. It should have minimal transitions between ‘sleep’ and ‘wake-up’ states to minimise
circuit delay and glitching capacitance.
4. It should have fewer MTCMOS instances, to minimize the area-overhead due to MTCMOS
and the delay of functional instances.
In the next section, we discuss the cost function that we devised to account for the
above criteria.

ALLOCATE_BIND[G(V, E)]
∀ V ∈ G(V, E), determine µ_MAX = max(ALAP_G(V) − ASAP_G(V))
for i ∈ {1, 2, 3 ... µ_MAX}
    for j ∈ {1, 2, 3 ... V}
        for k ∈ R_x
            if µ{j} = i
                B_k{l, (sched(x) mod δ0)} = j
            endif
        endfor
    endfor
endfor
Figure 3.4. Binding procedure for SA

GENERATE_NEIGHBOUR(S)
Create move set M(i, j)
Pick a move m at random from M(i, j)
sched(i) = sched(i) + j
modify_schedule[successor(i)]
Figure 3.5. Generation of new states
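
The neighbour generation step of Figure 3.5 can be sketched as follows, under stated assumptions: operations have unit delay, the move set contains every shift that keeps an operation inside its ASAP-ALAP window and after its predecessors, and successors are pushed later only when a dependency would otherwise be violated. The edge list is the assumed FIR graph (five multiplications feeding a chain of four additions), and the function name is illustrative.

```python
import random


def generate_neighbour(sched, asap, alap, edges, rng):
    """Return a copy of sched in which one randomly chosen operation has been moved."""
    preds, succs = {}, {}
    for u, v in edges:
        succs.setdefault(u, []).append(v)
        preds.setdefault(v, []).append(u)

    # Move set M(i, j): operation i shifted by j time-steps.
    moves = []
    for i in sched:
        earliest = max([asap[i]] + [sched[p] + 1 for p in preds.get(i, [])])
        for target in range(earliest, alap[i] + 1):
            if target != sched[i]:
                moves.append((i, target - sched[i]))
    if not moves:
        return dict(sched)

    i, j = rng.choice(moves)
    new_sched = dict(sched)
    new_sched[i] += j

    # Push successors later wherever the move would break a data dependency.
    stack = [i]
    while stack:
        u = stack.pop()
        for v in succs.get(u, []):
            if new_sched[v] < new_sched[u] + 1:
                new_sched[v] = new_sched[u] + 1
                stack.append(v)
    return new_sched


EDGES = [(1, 3), (2, 3), (3, 5), (4, 5), (5, 7), (6, 7), (7, 9), (8, 9)]
ASAP = {1: 1, 2: 1, 3: 2, 4: 1, 5: 3, 6: 1, 7: 4, 8: 1, 9: 5}
ALAP = {1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3, 7: 4, 8: 4, 9: 5}

print(generate_neighbour(ASAP, ASAP, ALAP, EDGES, random.Random(1)))
```

Because a moved operation never exceeds its ALAP time, a pushed successor cannot exceed its own ALAP time either, so the latency λ stays unchanged, as the algorithm assumes.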
3.1.4 Cost function characterization

The aim of our SA-based approach is to produce optimal RTL datapaths with low
leakage power and minimal area overhead. However, the MTCMOS technique imposes a
significantly large area overhead on the datapath. When there are very few idle-states in
a resource, the gain in leakage power reduction is offset to a good extent by the leakage
dissipated by the sleep transistor. Hence, we consider it necessary to minimize the usage
of sleep transistors to optimize area, and also to optimize the resource binding such that each
instance has either many idle-states or no idle-states.

Figure 3.6. Activity table example (activity tables for an 8-bit multiplier, an 8-bit adder, an 8-bit xor, and an 8-bit subtractor)

Each MTCMOS instance goes through several ‘sleep’ and ‘wake-up’ transitions. An
instance with such frequent activity is always found to be in its transition region of leakage
dissipation. Here leakage power is slightly higher due to the glitching characteristics of the
transistor. The threshold time is the time after which this leakage is seen to settle down.
This requires the instance to have long idle-times so as to have settled leakage dissipations.
Instances with long contiguous idle-times show lower leakage profiles than instances with
frequent transition activity.
These considerations are taken into account for our cost factor derivation experiments.
We use a sample DFG such as an FIR filter for our analysis. We synthesized the RTLs
in such a way as to force the binding of the operations to match with our desired activity
combinations. The RTL is then simulated using FASL, and the leakage power profile of
the functional instance under observation is extracted using a script.

Figure 3.7. Activity combinations for various data-initiation rates (panels for data initiation rates 2, 3, and 4, with combinations numbered 1 to 3, 1 to 7, and 1 to 15 respectively)

We split the leakage cost factors into two tables: a Leakage Constant table and a
Settling Constant table. Initially we normalize all the weights in the tables to 1/2^δ0,
before beginning our iterations. After a set of iterations, we update the leakage dissipation
constants. The cost evaluation procedure is described in Figure 3.8. The number of
distinct combinations N_C for a data initiation rate δ0 is 2^δ0.
If δ0 is 3, there are 8 possible combinations, so the initial weight of every
combination is 0.125. We then perform RTL synthesis and simulation for two combinations
at a time and compare their average leakage powers. If combination A has a better (lower)
average leakage power than combination B, the weightage of combination A is increased by
0.5. This increase is noted, because any later increase or decrease of this combination's
weightage will be (1/8) * 0.5. If combination A is later found to have a higher leakage
dissipation than another combination (say C), its weightage is reduced by (1/8) * 0.5, and
the weightage of combination C is increased by 0.5 or (1/8) * 0.5, depending on whether it
had already been increased. This comparison process continues until all the combinations
have been evaluated against each other, and is repeated for several iterations. The iterations
are repeated for several trials and random input vector sets (in our case, 500 vectors), and
the average weightages are recorded. Table 3.1 was obtained after extensive
characterization on a 16-bit adder. It contains both the Leakage Constant table and the
Settling Constant table. When the table is looked up for an instance, the values
Lc × (1 − Wl) and Sc × (1 − Ws) are added to the total cost, as shown in Figure 3.8.

COST [G(V, E)]
    cost ← 0
    for all instances r ∈ {1, 2, 3, ..., rn} of resource type ki ∈ {k1, k2, k3, ..., kn}
        Determine the Active-Idle combination Comb_r
        Match Comb_r with the Leakage Cost table
        C(Comb_r) ← Lc × (1 − Wl)
        cost ← cost + C(Comb_r)
        Match Comb_r with the Settling Cost table
        S(Comb_r) ← Sc × (1 − Ws)
        cost ← cost + S(Comb_r)
    endfor
    return cost

Figure 3.8. Cost Evaluation procedure

Table 3.1. Leakage and settling constants table for 16-bit adder activity combinations
(δ0 = 3)

Comb.   Lc       Wl     Sc       Ws
1       2.6628   0.73   1.1289   0.69
2       2.7359   0.66   1.8178   0.56
3       2.7528   0.64   1.8256   0.53
4       2.7715   0.62   1.4687   0.46
5       2.7559   0.54   1.4687   0.46
6       2.8100   0.33   1.4812   0.34
7       2.7731   0.52   1.4799   0.45
8       2.3842   0.78   1.2112   0.78
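
The cost evaluation procedure of Figure 3.8 can be sketched in a few lines using the
constants of Table 3.1. This is a minimal illustration rather than the actual
implementation: it assumes every instance is already labelled with its combination number,
and it applies the 16-bit adder table to every instance, whereas the characterization is
performed per component type. The names `TABLE_3_1` and `evaluate_cost` are hypothetical.

```python
# Table 3.1 (16-bit adder, delta_0 = 3): combination -> (Lc, Wl, Sc, Ws)
TABLE_3_1 = {
    1: (2.6628, 0.73, 1.1289, 0.69),
    2: (2.7359, 0.66, 1.8178, 0.56),
    3: (2.7528, 0.64, 1.8256, 0.53),
    4: (2.7715, 0.62, 1.4687, 0.46),
    5: (2.7559, 0.54, 1.4687, 0.46),
    6: (2.8100, 0.33, 1.4812, 0.34),
    7: (2.7731, 0.52, 1.4799, 0.45),
    8: (2.3842, 0.78, 1.2112, 0.78),
}


def evaluate_cost(instance_combinations, table=TABLE_3_1):
    """Sum the leakage and settling cost terms over all instances.

    `instance_combinations` lists the activity combination number of each
    instance (an assumed input format for this sketch).
    """
    cost = 0.0
    for comb in instance_combinations:
        lc, wl, sc, ws = table[comb]
        cost += lc * (1.0 - wl)  # leakage term  Lc x (1 - Wl)
        cost += sc * (1.0 - ws)  # settling term Sc x (1 - Ws)
    return cost


print(evaluate_cost([8]))
print(evaluate_cost([6]))
```

With these constants, combination 8 (the highest weights) contributes a much smaller cost
than combination 6 (the lowest weights), so a binding that produces combination 8 is
preferred by the annealer.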

For the Settling Constant table, we consider RTL components such as the multiplier, adder,
and subtractor individually. We synthesize RTLs in the same way as before and observe
the leakage profile of each RTL component as per its activity table. We then average the
leakage power over the transient region only. Frequently active components are always
in their transient regions, so their leakage is greater than that of components that
experience long idle-times. Based on this, we compare the transient profiles of all
combinations and vary the weightage in the same way as explained before.
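
The pairwise weight adjustment can be sketched as follows. This is one possible reading of
the procedure, stated under several assumptions: each combination starts at 1/N, a
combination's first win adds the full step of 0.5, every later win adds (1/N) * 0.5, and
every loss subtracts (1/N) * 0.5. The text does not say how weights are clamped or
renormalised, so the sketch does neither, and the leakage figures in the example are made up.

```python
from itertools import combinations


def characterize_weights(avg_leakage, step=0.5):
    """One pass of pairwise comparisons over all activity combinations.

    `avg_leakage` maps a combination id to its measured average leakage
    (lower is better). Returns the adjusted weight of each combination.
    """
    n = len(avg_leakage)
    weights = {c: 1.0 / n for c in avg_leakage}
    boosted = set()  # combinations that already received the full step
    for a, b in combinations(sorted(avg_leakage), 2):
        if avg_leakage[a] == avg_leakage[b]:
            continue  # a tie changes nothing (assumption)
        winner, loser = (a, b) if avg_leakage[a] < avg_leakage[b] else (b, a)
        weights[winner] += step if winner not in boosted else step / n
        boosted.add(winner)
        weights[loser] -= step / n
    return weights


# Hypothetical average leakage values for four combinations (delta_0 = 2).
example = {1: 3.0, 2: 2.0, 3: 4.0, 4: 1.0}
print(characterize_weights(example))
```

In this example the resulting weights order the combinations exactly by their leakage, with
the lowest-leakage combination receiving the largest weight. The worst combination ends up
with a negative weight, which suggests that some normalisation or averaging over trials, as
described above, is needed before the weights are used in the cost function.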


3.2 Register allocation & optimization

The schedule obtained from the SA-based algorithm is then analyzed by the register
allocation phase. A lifetime analysis for all the edges is conducted, and the compatibility of
each edge is determined. We note that edges that have a lifetime of more than 1 timestep
are split into sub-edges. This compatibility analysis is the same as the functional resource
compatibility algorithm shown in Figure 2.5, with the vertices replaced by edges.
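
The lifetime splitting and compatibility check described above might look like the sketch
below. It assumes a half-open lifetime interval per edge and an illustrative
compatibility rule in which two sub-edges may share a register when their timesteps differ
modulo the initiation interval (overlapping pipeline iterations reuse the same slots). The
function names and the tuple encoding are hypothetical, and the rule is a stand-in for the
algorithm of Figure 2.5 rather than a transcription of it.

```python
def split_lifetime(edge, birth, death):
    """Split an edge that is alive during timesteps birth..death-1 into
    one sub-edge per timestep, returned as (name, timestep) pairs."""
    return [(f"{edge}@{t}", t) for t in range(birth, death)]


def compatible(sub_a, sub_b, initiation_interval):
    """Assumed rule: two sub-edges may share a register when they fall in
    different pipeline slots, i.e. their timesteps differ modulo the
    initiation interval."""
    return sub_a[1] % initiation_interval != sub_b[1] % initiation_interval


subs = split_lifetime("e1", 1, 4)
print(subs)
print(compatible(("a@1", 1), ("b@3", 3), initiation_interval=2))
print(compatible(("a@1", 1), ("b@2", 2), initiation_interval=2))
```

With an initiation interval of 2, timesteps 1 and 3 fall in the same slot, so those
sub-edges conflict, while timesteps 1 and 2 do not.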
The compatibility graph is then partitioned using a clique-partitioning heuristic proposed
by Tseng and Siewiorek [41]. The objective of this algorithm is to form cliques in
such a way that the idle-times of registers are contiguous and maximized. It aggregates
the edge-bindings to completely utilize a register’s time cycle. Once the aggregation is
completed based on the schedule, the remaining edges are bound such that edges that are
temporally closer to each other are placed within the same register binding.
In our work, we use a modified version of the clique-partitioning based method presented
in [18]. The algorithm, which supports pipelining, is explained in Figure 3.9.
Modified Clique Partitioning Algorithm:
Procedure clique_partition_low_leakage
    while (N is not empty) do
        x ← select_node();
        if x is not connected to any other node
            add it to clique
        else
            form set Y with all neighbours of x
            for each node i in Y
                determine a set of nodes M in Y incompatible with i
            end for
            for each node y in M that excludes minimum nodes in Y
                calculate the effect on sleep time cost from y
                calculate the effect on transition cost from y
                update the set of nodes with maximum cost factor
            end for
            merge nodes with max cost factor y into set x
        end if
    end while
end procedure
Figure 3.9. Clique partitioning for leakage power optimization
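
As a rough illustration of cost-guided clique partitioning, the sketch below grows one
clique at a time and always merges the compatible candidate with the best score. It is a
simplified greedy variant, not the procedure of Figure 3.9: the neighbourhood-exclusion
step is omitted, and the `gain` callback is a hypothetical hook standing in for the sleep
time and transition cost factors. In the example, the score simply prefers edges that are
temporally close to the clique.

```python
def clique_partition(nodes, compatible, gain):
    """Greedy clique partitioning.

    nodes:       iterable of hashable node ids
    compatible:  symmetric predicate compatible(a, b) -> bool
    gain:        gain(clique, candidate) -> float, larger is better
                 (hypothetical cost hook for this sketch)
    """
    remaining = list(nodes)
    cliques = []
    while remaining:
        clique = [remaining.pop(0)]  # seed a new clique
        while True:
            # Only nodes compatible with every current member may join.
            candidates = [n for n in remaining
                          if all(compatible(n, m) for m in clique)]
            if not candidates:
                break
            best = max(candidates, key=lambda n: gain(clique, n))
            clique.append(best)
            remaining.remove(best)
        cliques.append(clique)
    return cliques


# Hypothetical register edges and the timestep in which each holds data.
timestep = {"a": 1, "b": 2, "c": 2, "d": 3, "e": 4}


def differ(x, y):
    return timestep[x] != timestep[y]


def closeness(clique, n):
    return -min(abs(timestep[n] - timestep[m]) for m in clique)


print(clique_partition(timestep, differ, closeness))
```

Each resulting clique corresponds to one register. Here edges a, b, d and e end up in one
register, and c, which conflicts with b, is placed in a second one.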


To summarize, we describe approaches for leakage power optimization of pipelined
datapaths using MTCMOS. Since MTCMOS imposes significant area, delay and glitching
considerations on the synthesis procedure, we develop cost functions that take these
considerations into account so as to maximize leakage power reduction and minimize area
and capacitive effects. We provide approaches for both functional resource optimization
and register optimization.


CHAPTER 4
EXPERIMENTAL RESULTS

We synthesized pipelined datapaths for five linear DSP benchmarks, which are shown in
Table 4.1. The RTL simulations were performed using the FASL architectural simulation
library. The run-times for FASL were short, being typically the run-times
reported by the Cadence NCLaunch VHDL simulator. The simulations were run on a Sun
Ultra Sparc II machine with dual 200 MHz CPUs and 256 MB of memory.
We compare the datapaths generated by our approach to unoptimized datapaths synthesized
using force-directed scheduling and binding using regular clique-partitioning. In
Tables 4.2, 4.3, 4.4, 4.5, and 4.6 we report the leakage power reductions provided by our
approach for various values of the data initiation rate (δ0). In these tables, we also report
the area overheads incurred by our approach. Since our approach uses simulated
annealing, we provide an analysis of the average running time of our approach in Table
4.7.
Table 4.1. Benchmarks used for analysis

Benchmark                                Operations         Latency (λ)
Finite Impulse Response (FIR) filter     5 *, 4 +           5
Elliptic Wave filter (EWF)               7 *, 26 +          14
Infinite Impulse Response (IIR) filter   5 *, 4 +           4
Auto Regression (AR) filter              16 *, 12 +         8
Fast Fourier Transform (FFT)             16 *, 25 +, 7 -    6

For each simulation run, we simulated with 1000 random test vectors to observe the
leakage reduction. The current FASL library has simulation support for the 100 nm Berkeley
Predictive Technology Models, and extensive characterization has been performed for this
generation. The clock period was kept at 50 ns to fully observe the effects of the MTCMOS
leakage transistors. For each benchmark, we ran simulations from δ0 = 2 up to δ0 = 6.
As noted before, it is only from δ0 = 2 that our algorithm performs any optimization
through the usage of a standby leakage technique such as MTCMOS.
Table 4.2. FIR filter leakage & area analysis

        Regular                  SA-based                 Leakage      Area
δ0      Leakage    Area          Leakage    Area          Reduction    Overhead
        (µW)       (µm²)         (µW)       (µm²)         (%)          (%)
2       0.01941    140727.67     0.01364    151928.23     29.72        7.95
3       0.01841    138768.96     0.013784   111322.49     25.12        -19.77
4       0.02407    96923.78      0.01794    110040.66     25.46        13.53
5       0.01489    94654.53      0.01108    106581.12     25.58        12.61

Table 4.3. EWF filter leakage & area analysis

        Regular                  SA-based                 Leakage      Area
δ0      Leakage    Area          Leakage    Area          Reduction    Overhead
        (µW)       (µm²)         (µW)       (µm²)         (%)          (%)
2       0.03781    254996.51     0.03245    272959.93     14.17        7.04
3       0.05430    240736.48     0.03026    227423.48     44.27        -5.53
4       0.06793    233283.65     0.02702    283322.18     60.22        21.44
5       0.06583    187234.59     0.04085    173279.12     37.94        -7.45
6       0.09305    147085.09     0.04134    172924.48     55.57        17.56
7       0.10307    180952.28     0.04249    211773.85     58.77        17.03
8       0.09553    140014.73     0.04200    163622.51     56.03        16.86
9       0.08028    137864.93     0.05388    163764.76     32.88        18.78
10      0.12168    138199.39     0.05586    165383.15     54.07        19.66
11      0.20879    137984.42     0.05995    164570.98     71.28        19.26
12      0.20763    137864.90     0.05853    160621.64     71.81        16.50

From the tables, we notice that our approach obtains, on average, a leakage power
reduction of 38.2%. This ranges from as low as 5% in some cases to as high as 71% in a
few cases. However, we notice that when the schedule latency λ of the design is equal to
its critical path length, our approach incurs a definite area overhead.
Although in some cases we noticed substantial area reductions ranging from 8% to
26%, in most of the other cases the SA-based approach was not able to optimize area.
The overall area overhead of the approach was around 6.2%.
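
The percentage columns in the results tables appear to follow the usual relative-change
definitions, taken with respect to the regular (unoptimized) datapath. The sketch below
states them explicitly; the function names are illustrative, and the formulas are inferred
from the tabulated values rather than quoted from the text.

```python
def leakage_reduction_pct(regular, sa_based):
    """Leakage saved by the SA-based datapath, relative to the regular one."""
    return 100.0 * (regular - sa_based) / regular


def area_overhead_pct(regular, sa_based):
    """Extra area of the SA-based datapath; negative values mean a reduction."""
    return 100.0 * (sa_based - regular) / regular


# FIR filter, delta_0 = 3 (values from Table 4.2)
print(leakage_reduction_pct(0.01841, 0.013784))
print(area_overhead_pct(138768.96, 111322.49))
```

For the rows checked (FIR at δ0 = 2 and 3, IIR at δ0 = 2), the computed values agree with
the tabulated percentages to within about 0.01 percentage points, the small differences
being consistent with truncation in the tables.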
We attribute this to a lack of sufficient operation mobility in the designs we considered.
By increasing the latency of the schedule, we impose a finite mobility on all the operations,
giving the SA-based approach more freedom to optimize area. We notice quite good
improvements in area for the resulting synthesized designs. We provide an analysis for the
AR filter benchmark in Figure 4.1 and Figure 4.2.

Table 4.4. IIR filter leakage & area analysis

        Regular                  SA-based                 Leakage      Area
δ0      Leakage    Area          Leakage    Area          Reduction    Overhead
        (µW)       (µm²)         (µW)       (µm²)         (%)          (%)
2       0.01689    137264.06     0.01390    149942.93     17.70        9.23
3       0.02756    136069.75     0.02141    154207.84     22.31        13.33
4       0.03563    134015.48     0.02609    152118.56     26.77        13.54

Table 4.5. AR filter leakage & area analysis

        Regular                  SA-based                 Leakage      Area
δ0      Leakage    Area          Leakage    Area          Reduction    Overhead
        (µW)       (µm²)         (µW)       (µm²)         (%)          (%)
2       0.05005    462236.62     0.04744    385632.81     5.21         -16.57
3       0.08714    373625.06     0.05334    424518.06     38.78        13.62
4       0.09851    290627.65     0.06460    213779.62     34.42        -26.44
5       0.08660    368872.00     0.04060    420956.72     53.11        14.12
6       0.12258    207223.70     0.05315    238991.09     56.64        15.33
7       0.15931    203903.62     0.06653    236505.41     58.23        15.98
8       0.12515    199795.06     0.05584    232221.79     55.39        16.23

Table 4.6. FFT leakage & area analysis

        Regular                  SA-based                 Leakage      Area
δ0      Leakage    Area          Leakage    Area          Reduction    Overhead
        (µW)       (µm²)         (µW)       (µm²)         (%)          (%)
2       0.07016    453783.03     0.06115    417946.59     12.83        -7.89
3       0.09974    403099.18     0.08181    339320.12     17.96        -15.82
4       0.12847    278065.46     0.07344    245987.60     42.83        -11.53
5       0.06097    274339.40     0.03505    242233.36     42.51        -11.70
6       0.07164    267173.50     0.03442    300249.57     51.95        12.38

We obtain an area reduction in the range of 3.89% to 4.6% when the latency is increased
by 1, 2, and 3 timesteps. However, we also kept a watch on the leakage optimization profiles
during this latency increase. We notice that there is a small decrease in the leakage power
optimization due to the improved area optimizations.

Figure 4.1. Total area consumption of datapath (AR filter)

We notice that as we move towards λ + 3, we obtain smaller leakage reductions, though
area becomes increasingly optimized. Hence it is not feasible to use very high values
of λ for moderately large benchmarks. A λ increase of up to 30% of the schedule length
provides a good area-leakage tradeoff. Any further increase can be counter-productive.

Figure 4.2. Leakage power profile comparison (AR filter)

The average running times of our SA-based algorithm are given in Table 4.7. The
running time depends on the mobility of the operations and the size of the benchmark.
If the mobilities are low, only a finite number of moves can be made at every temperature
iteration. We notice that the running times remain nearly the same for all values of
data-initiation rates.
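
To illustrate why mobility bounds the number of useful moves, the following is a generic
simulated annealing skeleton in the style of Kirkpatrick et al. [38]. It is not the
algorithm of this thesis: data dependencies are ignored, the cost function is a toy (the
largest number of operations sharing a timestep), and all names and parameter values are
assumptions chosen for the sketch. Moves are only generated for operations whose mobility
window contains more than one timestep.

```python
import math
import random
from collections import Counter


def anneal(schedule, mobility, cost, t_start=10.0, t_end=0.1, alpha=0.9,
           moves_per_temp=50, seed=0):
    """schedule: op -> timestep; mobility: op -> (earliest, latest)."""
    rng = random.Random(seed)
    # Operations with zero mobility can never be moved.
    movable = [op for op, (lo, hi) in mobility.items() if hi > lo]
    current = dict(schedule)
    best = dict(current)
    if not movable:
        return best
    temp = t_start
    while temp > t_end:
        for _ in range(moves_per_temp):
            op = rng.choice(movable)
            lo, hi = mobility[op]
            candidate = dict(current)
            candidate[op] = rng.randint(lo, hi)
            delta = cost(candidate) - cost(current)
            # Metropolis criterion: always accept improvements, sometimes
            # accept a worse schedule while the temperature is high.
            if delta <= 0 or rng.random() < math.exp(-delta / temp):
                current = candidate
                if cost(current) < cost(best):
                    best = dict(current)
        temp *= alpha  # geometric cooling
    return best


def peak_usage(sched):
    """Toy cost: largest number of operations in any one timestep."""
    return max(Counter(sched.values()).values())


initial = {"m1": 1, "m2": 1, "m3": 1, "m4": 1}
windows = {"m1": (1, 3), "m2": (1, 3), "m3": (1, 1), "m4": (1, 1)}
result = anneal(initial, windows, peak_usage)
print(result, peak_usage(result))
```

Operations m3 and m4 have no mobility and stay where they are, so only m1 and m2
contribute moves, which mirrors the observation that low mobility limits the moves
available at each temperature.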
In summary, we provide an indication of the typical improvements in throughput that
can be afforded by our approach. We notice that the leakage power dissipated by regular
HLS datapaths remains lower than the leakage power dissipated by pipelined datapaths. The
regular HLS datapaths considered are MTCMOS-optimized datapaths generated using the
approaches presented in [18]. We determine the pipeline latency for which the synthesized
datapath has a leakage profile as near as possible to that of the regular datapath. We then
compute the improvement in throughput, which is given in Table 4.8.


Table 4.7. Running times for SA-algorithm

Benchmark     Running times (s)
FIR filter    73
IIR filter    65
AR filter     133
EWF filter    182
FFT filter    144

Table 4.8. Speedup analysis for various benchmarks

Benchmark   Regular leakage (µW)   SA-based leakage (µW)   δ0   Speedup
FIR         0.008339               0.01364                 2    2.5
IIR         0.008377               0.01390                 2    2
EWF         0.02222                0.02702                 4    3.5
AR          0.02943                0.04060                 5    1.6
FFT         0.02523                0.03505                 5    1.2
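
The speedup values in Table 4.8 are consistent with the ratio of the schedule latency λ
(Table 4.1) to the data initiation rate δ0: a non-pipelined datapath accepts a new input
every λ timesteps, a pipelined one every δ0 timesteps. The short sketch below reproduces
the column under that assumption; the formula is inferred from the tabulated numbers and
the dictionary and function names are illustrative.

```python
# Schedule latency (Table 4.1) and chosen initiation rate (Table 4.8).
LATENCY = {"FIR": 5, "IIR": 4, "EWF": 14, "AR": 8, "FFT": 6}
INITIATION = {"FIR": 2, "IIR": 2, "EWF": 4, "AR": 5, "FFT": 5}


def speedup(latency, initiation_rate):
    """Throughput gain of pipelining: inputs accepted every `initiation_rate`
    timesteps instead of every `latency` timesteps."""
    return latency / initiation_rate


for name in LATENCY:
    print(name, speedup(LATENCY[name], INITIATION[name]))
```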

CHAPTER 5
CONCLUSIONS AND FUTURE WORK

In this work, we have presented an approach for the behavioral synthesis of pipelined
datapaths with low leakage power. Our approach uses simulated annealing and provides
the flexibility to vary the data initiation intervals during synthesis. Our algorithm
intelligently determines a schedule and binding that has lower area and reduced leakage.
The algorithm is also MTCMOS-aware and thus provides a datapath that has a very
selective MTCMOS binding. The area overheads incurred by our approach are also low.
Although our algorithm is fully extensible to regular high level synthesis, we have
chosen to present it for a limiting case, namely pipelining. Any standby leakage reduction
technique other than MTCMOS is also fully supported by this method. Our algorithm attempts
to balance performance and power. Since standby
leakage reduction techniques can incur a performance loss, we provide an approach that
reduces leakage within a performance-enhancing technique such as functional pipelining.
The following are the main contributions of this work:
1. A pipelining controller module for linear data flow graphs, which can support the
low-power and crosstalk optimized designs synthesized in AUDI.
2. A scheduling, allocation and binding algorithm with completely characterized
MTCMOS-based cost functions for the behavioral synthesis of pipelined datapaths.
3. A frontend for the AUDI system (also known as VHDL2AIF) which can support a
vast portion of descriptions written in behavioral VHDL.
This work was conceived to provide a comprehensive framework for the synthesis of
pipelined designs. However, we have identified some key components of this work that
remain as future work.

• Currently, the FASL system handles RTL structures that are synthesized from linear
data flow graphs. Work is under way to realize FASL2 as a complete RTL system and
backend for AUDI, supporting large control-flow intensive designs.
• The pipelining module implemented in AUDI currently fully supports linear behaviors.
We are now integrating the ability to generate controllers for control-flow intensive
designs. An approach to determine an optimal FSM array for handling branching and
looping is currently under study.
• Once the CFI controller is developed, we will perform modification and analysis using
our SA-based method on large CFI designs from the USF VLSI research group that are
concentrated in mobile and wireless areas, since our approach primarily targets this area
for low power designs with increased performance.
• For linear behaviors, controller leakage power forms a small percentage and can
generally be ignored. But for CFI based designs, the controller leakage power becomes
larger and cannot be disregarded. Our algorithm currently does not perform controller
leakage reduction. However a study is underway to determine an activity-driven logic
synthesis mechanism and use this mechanism to drive the high level synthesis procedure
to generate efficient and optimized controllers.
• VHDL2AIF in its current state is a behavioral VHDL to AUDI Intermediate Format
(AIF) converter capable of performing simple compiler level optimizations. When completed, VHDL2AIF will be capable of generating highly optimized CDFGs with improved
control and data-flow for high level synthesis. Loops in the body of a design are a difficult
construct to pipeline as the source vertex of a loop is not only executed for every initiation,
it is also executed for every iteration of the loop in case of any loop-carried dependencies.
Currently underway is a study to determine how to apply transformations to loops so as to
reduce this problem during pipeline synthesis. Previous work regarding this problem has
been done in [42].


REFERENCES

[1] K. Roy, S. Mukhopadhyay, and H. Mahmoodi-Meimand. “Leakage current mechanisms
and leakage reduction techniques in deep-submicrometer CMOS circuits”. Proceedings
of the IEEE, 91(2):305–327, Feb 2003.
[2] D. Melinak. “Power integrity comes home to roost at 90nm”. Technical report,
www.elecdesign.com, February 2005.
[3] J. T. Kao and A. P. Chandrakasan. “Dual-Threshold Voltage Techniques for Low-Power
Digital Circuits”. IEEE Journal on Solid State Circuits, 35(7):1009–1018, July
2000.
[4] K. Roy. “Leakage power reduction in low-voltage CMOS design”. In Proceedings of
the IEEE International Conference on Circuits and Systems, pages 167–173, Sep 1998.
[5] Y. Taur and T. H. Ning. Fundamentals of Modern VLSI Devices. New York:Cambridge
University Press, 1998.
[6] S. Borkar. “Design challenges of technology scaling”. IEEE Micro, 19(4):23–29, Aug
1999.
[7] S. Mutoh, T. Douseki, Y. Matsuya, T. Aoki, S. Shigematsu and J. Yamada. “1V
Power supply High Speed Digital Circuit Technology with Multi-Threshold Voltage
CMOS”. IEEE Journal on Solid State Circuits, 30(8):847–854, Aug 1995.
[8] L. Wei, Z. Chen, M. Johnson, K. Roy, Y. Ye, and V. De. “Design and optimization of
dual-voltage circuits for low voltage low power applications”. IEEE Transactions on
Very Large Scale Integrated Systems, 7(1):16–24, March 1999.
[9] J. Halter and F. Najm. “A gate-level leakage power reduction method for ultra
low-power CMOS circuits”. In Proceedings of the IEEE Custom Integrated Circuits
conference, pages 475–478, 1997.
[10] M. C. Johnson, D. Somasekhar, L. Chiou and K. Roy. “Leakage control with efficient
use of transistor stacks in single threshold CMOS”. IEEE Transactions on Very Large
Scale Integrated Systems, 10(1):1–5, Feb 2002.
[11] A. Abdollahi, F. Fallah, and M. Pedram. “Leakage current reduction in CMOS VLSI
circuits by input vector control”. IEEE Transactions on Very Large Scale Integrated
Systems, 12(2):140–154, Feb 2004.
[12] N. Hanchate and N. Ranganathan. “LECTOR: A Technique for Leakage Reduction in CMOS circuits”. IEEE Transactions on Very Large Scale Integrated Systems,
12(2):196–205, Feb 2004.

[13] M. Powell, S.-H. Yang, B. Falsafi, K. Roy, and T.N. Vijaykumar. “Gated-Vdd: a circuit
technique to reduce leakage in deep sub-micron cache memories”. In Proceedings of
the International Symposium on Low Power Electronics and Design, pages 90–95, July
2000.
[14] Y. Ye, S. Borkar, and V. De. “A new technique for standby leakage reduction in high
performance circuits”. In Digest of Tech. Papers Symposium VLSI Circuits, pages
40–41, June 1998.
[15] Z. Chen, M. Johnson, L. Wei, and K. Roy. “Estimation of standby leakage power in
CMOS circuits considering accurate modelling of transistor stacks”. In Proceedings of
Low Power Electronics and Design, pages 239–244, Aug 1998.
[16] V. Sundararajan and K. K. Parhi. “Low power synthesis of dual threshold voltage
CMOS VLSI circuits”. In Proceedings of the International Symposium of Low Power
Electronics and Design, pages 139–144, Aug 1999.
[17] K. S. Khouri and N. K. Jha. “Leakage power analysis and reduction during behavioral
synthesis”. IEEE Transactions on Very Large Scale Integrated Systems, 10(6):876–885,
Dec 2002.
[18] C. Gopalakrishnan and S. Katkoori. “Resource allocation and binding approach for
low leakage power”. In Proceedings of 16th International Conference on VLSI Design,
pages 297–302, 2003.
[19] N. Park and A. C. Parker. “Sehwa: A software package for synthesis of pipelines
from behavioral specifications”. IEEE Transactions on Computer-Aided Design of
Circuits and Systems, 7(3):356–370, Mar 1988.
[20] P. G. Paulin and J. P. Knight. “Force-directed scheduling for the behavioral synthesis
of ASICs”. IEEE Transactions on Computer-Aided Design of Circuits and Systems,
8(6):661–679, June 1989.
[21] C. Hwang, Y. Hsu, and Y. Lin. “PLS: A scheduler for pipeline synthesis”. IEEE
Transactions on Computer-Aided Design of Circuits and Systems, 12(9):1279–1286,
Sep 1993.
[22] H. Jun and S. Hwang. “Design of a pipelined datapath synthesis system for digital signal processing”. IEEE Transactions on Very Large Scale Integration Systems,
2(3):292–303, Sep 1994.
[23] H. Jun and S. Hwang. “Automatic synthesis of dynamically configured pipelines
supporting variable data initiation intervals”. IEEE Transactions on Very Large Scale
Integration Systems, 4(2):279–285, June 1996.
[24] A. Chandrakasan, S. Sheng, and R. W. Brodersen. “Low-Power CMOS Digital Design”.
IEEE Journal on Solid State Circuits, 27(4):473–484, April 1992.
[25] J. Chang and M. Pedram. “Module assignment for low power”. In European Design
Automation Conference EURO-DAC 96, pages 376–381, Sep 1996.


[26] S. Heo and K. Asanovic. “Power-optimal pipelining in deep submicron technology”.
In Proceedings of the 2004 International Symposium on Low Power Electronics and Design,
pages 218–223, Aug 2004.
[27] T. Z. Yu, F. Chen, and E. H. -M. Sha. “Loop scheduling algorithms for power reduction”. In ICASSP, pages 3073–3076, May 1998.
[28] D. Kim, D. Shin, and K. Choi. “Low power pipelining of Linear Systems: A common
operand centric approach”. In International Symposium on Low Power Electronics
and Design, pages 225–230, 2001.
[29] Giovanni De Micheli. Synthesis and Optimization of Digital Circuits. McGraw Hill
Inc., 1994.
[30] A. Raghunathan and N.K. Jha. “Behavioral Synthesis for Low Power”. In IEEE
International Conference on Computer Design, pages 318–322, Oct 1994.
[31] A. K. Murugavel and N. Ranganathan. “A game theoretic approach for power optimisation during behavioral synthesis”. IEEE Transactions on Very Large Scale Integrated
Systems, 11(6):1031–1043, Dec 2003.
[32] S. P. Mohanty and N. Ranganathan. “A framework for energy and transient power
minimization during behavioral synthesis”. IEEE Transactions on Very Large Scale
Integrated Systems, 12(6):562–572, June 2004.
[33] R. San Martin and J. P. Knight. “Optimising power in ASIC behavioral synthesis”.
IEEE Design & Test of Computers, 13(2):58–70, Summer 1996.
[34] N. Kumar, S. Katkoori, L. Rader, and R. Vemuri. “Profile-driven behavioral synthesis for low power VLSI systems”. IEEE Design & Test of Computers, 12(3):70–83,
Autumn 1995.
[35] C. Gopalakrishnan and S. Katkoori. “Knapbind: an area efficient binding algorithm for
low leakage datapaths”. In Proceedings of 21st International Conference on Computer
Design, pages 430–435, 2003.
[36] P. R. Panda and N. D. Dutt. “1995 high level synthesis design repository”. In Proceedings of the Eighth International Symposium on System Synthesis, pages 170–174,
Sep 1995.
[37] C. Gopalakrishnan and S. Katkoori. “An architectural leakage power simulator for
VHDL structural descriptions”. In Proceedings of the IEEE Computer Society Annual
Symposium on VLSI, pages 211–212, Feb 2003.
[38] S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi. “Optimization by simulated annealing”. Science, 220(4598):671–680, 13 May 1983.
[39] S. Devadas and A. R. Newton. “Algorithms for hardware allocation in data path
synthesis”. IEEE Transactions on Computer-Aided Design of Circuits and Systems,
8(7):768–781, July 1989.


[40] P. Prabhakaran, P. Banerjee, J. Crenshaw, and M. Sarrafzadeh. “Simultaneous
scheduling, allocation and floorplanning for interconnect power optimization”. In
Proceedings of the 12th International Conference on VLSI design, pages 423–427, Jan
1999.
[41] C. Tseng and D. P. Siewiorek. “Automated synthesis of data paths in digital systems”.
IEEE Transactions on Computer-Aided Design of Circuits and Systems, 5(3):379–395,
Mar 1986.
[42] M. Rim and R. Jain. “Valid transformations: a new class of loop transformations
for high level synthesis and pipelined scheduling applications”. IEEE Transactions on
Parallel and Distributed Systems, 7(4):399–410, Apr 1996.


