Timing Analysis and Behavioral Synthesis with Process Variation by Lucas, Gregory M.
© 2009 Gregory M. Lucas
TIMING ANALYSIS AND BEHAVIORAL SYNTHESIS WITH PROCESS
VARIATION
BY
GREGORY M. LUCAS
B.S., Pennsylvania State University, 2007
THESIS
Submitted in partial fulfillment of the requirements
for the degree of Master of Science in Electrical and Computer Engineering
in the Graduate College of the
University of Illinois at Urbana-Champaign, 2009
Urbana, Illinois
Adviser:
Assistant Professor Deming Chen
ABSTRACT
The move to deep submicron processes has brought about new problems that
designers must contend with in order to obtain functional circuits. Process
variation has been recognized as one of the leading issues that must be dealt
with in deep submicron processes. The problems experienced with deep submicron
processes have ushered in a new era of statistical design, in which process
parameters are no longer considered to be deterministic but are modeled as
probability distributions. In order to support statistical design, new algorithms
and methods are needed.
One of the chief problems with process variation is the need for accurate
timing analysis in which process parameters such as gate length and oxide
thickness are now modeled as probability density functions (pdf) instead of
deterministic quantities. Statistical static timing analysis (SSTA) has emerged
to fill that void. Much work has been performed in the realm of SSTA; however,
the majority of it has focused on improving the main SSTA algorithm. The first
piece of work that will be presented in this thesis extends SSTA to be able to
handle complicated timing constraints such as multi-clock domain circuits, false
paths, and multi-cycle paths.
The second piece of work extends the behavioral synthesis task of binding
to be variation-aware through our algorithm FastYield. FastYield offers several
contributions to the field of statistical design. First, it presents the unit correlation
model, a model that can be used to model correlation at the functional unit
level. Second, it offers a bipartite matching formulation for variation-aware
binding in high-level synthesis. Last, it presents a statistical timing-driven
floorplanner that is used to obtain correlation and interconnect information for
more accurate timing analysis.
To my parents and grandparents, for their love, support, and encouragement.
ACKNOWLEDGMENTS
I would like to acknowledge Professor Deming Chen for his guidance and the
many interesting conversations that have led to the research that is presented in
this thesis. Without his guidance and support this thesis would not have been
possible. I would also like to acknowledge Scott Cromar, who has contributed
significantly to Chapter 3 and who has also been a great sounding board for
new ideas.
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
LIST OF ABBREVIATIONS
CHAPTER 1 INTRODUCTION
1.1 Process Variation
1.1.1 Variation types and correlation modeling
1.2 Statistical Static Timing Analysis
1.3 Behavioral Synthesis
1.3.1 Overview
1.3.2 Allocation
1.3.3 Scheduling
1.3.4 Binding
CHAPTER 2 STATISTICAL STATIC TIMING ANALYSIS FOR DESIGNS WITH MULTI-CLOCK DOMAINS
2.1 Introduction
2.2 Problem Formulation and Motivation
2.3 Algorithm Description
2.3.1 Max between different cycle paths
2.3.2 Application to PCA
2.3.3 Multi-cycle graph traversal
2.4 Experimental Results
CHAPTER 3 FASTYIELD: VARIATION-AWARE, LAYOUT-DRIVEN SIMULTANEOUS BINDING AND MODULE SELECTION FOR PERFORMANCE YIELD OPTIMIZATION
3.1 Introduction
3.2 Related Work
3.3 Resource and Correlation Modeling
3.4 FastYield Algorithm Description
3.4.1 Initial binding
3.4.2 Register allocation and binding
3.4.3 Initial functional unit allocation and binding
3.4.4 Multiplexer/connection allocation
3.5 Timing-Driven Floorplanner
3.5.1 Unit correlation model
3.5.2 SSTA algorithm
3.5.3 Floorplanner cost function
3.6 Rebinding
3.6.1 Register and functional unit ranking
3.6.2 Swapping critical functional units
3.6.3 Selection of operations to be rebound
3.6.4 Operation rebinding
3.7 Experimental Results
3.7.1 Spatial correlation in timing analysis
3.7.2 FastYield compared to BindBWM
CHAPTER 4 CONCLUSIONS AND FUTURE WORK
4.1 Multi-Clock SSTA
4.2 FastYield
4.3 Future Work
REFERENCES
LIST OF TABLES
2.1 Process Variation Parameters
2.2 MCSSTA Benchmark Characteristics and Run Times
2.3 Benchmark Results
3.1 HLS Benchmark Characteristics
3.2 Correlation vs. No-Correlation Experimental Results
3.3 FastYield Experimental Results
LIST OF FIGURES
1.1 Grid correlation model
1.2 Quad-tree correlation model
1.3 A sample CDFG
1.4 The effects of allocation
1.5 Different scheduling algorithms
1.6 Register binding example
1.7 Bipartite graph setup
2.1 Motivational circuit from high-level synthesis
2.2 Path decomposed motivational circuit
2.3 Timing graph setup: gate-level multiplexer used in Figure 2.1
3.1 Illustration of the bipartite graph created for the functional unit binding of the first control step
3.2 Sample floorplan showing data connections
3.3 Example FU ranking and operation selection for rebinding
3.4 Chem delay distributions
LIST OF ABBREVIATIONS
ASAP As soon as possible
ALAP As late as possible
CDFG Control data flow graph
DAG Directed acyclic graph
DFG Data flow graph
DSP Digital signal processing
FF Flip-flop
FU Functional unit
HLS High-level synthesis
Mux Multiplexer
PCA Principal component analysis
pdf Probability density function
PY Performance yield
RAM Random access memory
STA Static timing analysis
SSTA Statistical static timing analysis
CHAPTER 1
INTRODUCTION
Moore’s Law states that every 18 months the density of transistors on a chip
will double. To keep up with Moore’s Law, the size of transistors has continued
to shrink at astounding levels. However, with each new process technology,
the ability to control the exact fabricated parameters of each transistor on
a chip has decreased. Data released by IBM and Intel show this effect. Intel
shows that at the 130 nm technology node, the frequency of their fabricated
chips varied up to 30% [1]. IBM then shows that when moving to the 65 nm
node, the variation in frequency increased to 50% [2]. The increased variation
that is present in deep submicron processes has mandated a change from a
deterministic design paradigm to a statistical design paradigm. This thesis
will explore process variation and what can be done to mitigate it. First, I will
introduce some background information on process variation, statistical static
timing analysis, and behavioral synthesis. I will then introduce an extension to
the standard statistical static timing analysis algorithm to consider multi-clock,
multi-cycle, and false path timing constraints. Next, I will put these concepts
together to present a process variation-aware binding algorithm. Last, I will
present some conclusions and future directions for research.
1.1 Process Variation
The move to deep submicron process technologies has caused process variation
to become a major issue that must be dealt with during the design of circuits.
In older process technologies, the issue of process variation was mitigated
through the use of a guardband. Designers would design their circuits under
a tighter timing constraint and a smaller power budget than they were actually
trying to meet. In the course of timing and power analysis, corner analysis
would be used to determine the worst-case power and timing that the circuit
would encounter with the manufacturing process. Therefore, when the circuit
was actually manufactured, it would still function according to the original
specifications. The problem with this approach is that in deep submicron
processes, the size of the guardband is prohibitively large if the circuit is
constructed so that every chip meets the timing and power requirements. In
place of guardbanding, and guaranteeing that every manufactured die functions,
designers have moved toward meeting a performance yield requirement, where
performance yield is defined as the percentage of manufactured die that will
function within the timing and power constraints. This section will introduce
the concept of performance yield and how process variation can be modeled.
Process variation is the deviation of a parameter from its intended value.
Parameters that are normally considered to be affected in deep submicron
processes include gate length, $L_{eff}$; device width, $W$; oxide thickness, $t_{ox}$; and
doping density, $N_a$. These parameters each have an effect on the resulting
device characteristics. Two assumptions are normally made about the effects
that variation has on the delay of the resulting device. The first is that the
variation on the parameter follows a Gaussian distribution. The second is
that the variation affects the characteristics of the device linearly. These
assumptions are not strictly correct, and there has been research into examining
non-Gaussian variation modeling and nonlinear characteristic effects. However,
there is often significant overhead associated with these techniques for a small
gain in accuracy. Therefore, the work presented in this thesis follows the
assumption that process parameters are normally distributed and that the
delay of the associated device is affected linearly. Using these assumptions, we
can express the delay of a device as shown in Equation (1.1), where $\mu_o$ is the
mean delay of the device without any variation and $a_n$ is the sensitivity of the
delay to a change in the variation parameter $x_n$.
$$D = \mu_o + a_1 x_1 + a_2 x_2 + \cdots + a_n x_n \qquad (1.1)$$
Given the deviation of the process parameters from their mean values, the
delay of a device can be found using Equation (1.1). Rarely are the actual
manufactured device parameters known, so the delay of a device is more often
expressed as a probability density function of a normal distribution, N ∼
Gaussian(µ, σ), where µ is the mean delay of the device and σ is the standard
deviation of the delay.
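As a minimal sketch of how Equation (1.1) leads to a Gaussian delay, the following Python fragment evaluates the delay for one known set of deviations and computes the mean and standard deviation of the delay. It assumes that the deviations are independent, zero-mean Gaussians with known standard deviations; the function names and the numbers are hypothetical and are used only for illustration.

```python
import math

def delay_for_sample(mu_o, sensitivities, deviations):
    """Evaluate Equation (1.1) for one known set of parameter deviations."""
    return mu_o + sum(a * x for a, x in zip(sensitivities, deviations))

def delay_distribution(mu_o, sensitivities, sigmas):
    """Return (mean, std) of D = mu_o + sum(a_i * x_i).

    Assumption of this sketch: each x_i is an independent, zero-mean
    Gaussian deviation with standard deviation sigmas[i].
    """
    variance = sum((a * s) ** 2 for a, s in zip(sensitivities, sigmas))
    return mu_o, math.sqrt(variance)

# Hypothetical numbers: 100 ps nominal delay and two process parameters.
mean, std = delay_distribution(100.0, [3.0, 4.0], [1.0, 1.0])
sample = delay_for_sample(100.0, [3.0, 4.0], [1.0, -0.5])
```

Correlated deviations would add cross terms to the variance, which is the topic of the correlation models in Section 1.1.1.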
1.1.1 Variation types and correlation modeling
The underlying cause of the process variation is the imperfections in the
different mechanical processes that are applied to a wafer during its manufacturing.
These different mechanical processes act over different distances, giving rise to
two different types of process variation. These two types of process variation are
called (1) inter-die variation and (2) intra-die variation.
1.1.1.1 Inter-die variation
Inter-die variation is variation that can be seen over large distances such as
between different dies, wafers, or lots. This means that within a single die,
the value for a process parameter is constant and any variation of that process
parameter can be represented as a shift in the mean value for that die. Inter-die
variations can therefore be modeled using a single random variable for all
the devices within a die. An example of this type of variation is gate-length
variations due to lithography exposure time differences.
1.1.1.2 Intra-die variation
Intra-die variation occurs over distances smaller than a die. This means that
within a single die, multiple different values for a process parameter might
occur. To model intra-die variation, a random variable must be assigned to
each device within a circuit. Intra-die variation can be further divided into two
types: spatially correlated variation and spatially uncorrelated variation. The
doping concentration, Na, when doping is performed using ion implantation,
exhibits spatially uncorrelated variation since each ion randomly strikes the
wafer’s surface. To model spatially uncorrelated variation, a random variable is
assigned to each device and considered to be independent.
Spatially correlated variation occurs as a result of processes such as chemical-
mechanical polishing (CMP) of a grown oxide layer. In spatially correlated
variation, devices that are close to each other are more likely to have similar
values for a process parameter than devices that are farther apart. Keeping track
of the spatial correlation between the process parameters of different devices is
not a simple task. Initially, one might think that assigning a random variable
to each device would be sufficient to model the correlation. However, one would
quickly find that this method is too computationally intensive to be feasible. To
rectify this situation, different models have been proposed, including the grid
model and the quad-tree model.
Figure 1.1 shows an example of the grid-based correlation model. For the
grid model, the device area has been divided into a number of different grid
squares, and each grid square has been assigned a random variable to represent
the variation in that region. Grid squares that are close together are assumed
to be more correlated than grid squares that are farther apart. Calculating the
correlation coefficient is accomplished through evaluating a correlation function.
Functions that make for valid correlation functions can be found in [3]. Two
common exponential functions are listed in Equations (1.2), where b is a fitting
parameter and d is the distance between the two grid squares. These functions
are both monotonically decreasing functions, with Equation (1.2b) decreasing
faster than Equation (1.2a). The decision on which correlation function is more
valid and should be used depends on the process parameter being modeled
and on the data from the foundry. In Figure 1.1 the correlation of the grid
squares with respect to the upper left square is shown. The correlation function
used is based on (1.2a) with a fitting parameter of 2 and a distance of 1 µm
between grid squares. The correlation between any two devices can then be
found by examining which square each device is in and finding the correlation
between them. For devices that lie within the same grid square, their variation
is assumed to be perfectly correlated.
$$\rho(d) = \exp(-b d) \qquad (1.2a)$$
$$\rho(d) = \exp(-b^2 d^2) \qquad (1.2b)$$
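The grid model can be sketched in a few lines of Python. The fragment below computes the correlation of every grid square with a reference square using the exponential function of Equation (1.2a). The use of the Euclidean distance between square centers is an assumption of this sketch, as is the function name; the thesis does not specify the distance measure here.

```python
import math

def grid_correlation(rows, cols, pitch, b, ref=(0, 0)):
    """Correlation of each grid square with the reference square.

    Uses rho(d) = exp(-b * d). The distance d is taken as the Euclidean
    distance between square centers (an assumption of this sketch).
    """
    table = []
    for r in range(rows):
        row = []
        for c in range(cols):
            d = pitch * math.hypot(r - ref[0], c - ref[1])
            row.append(math.exp(-b * d))
        table.append(row)
    return table

# Fitting parameter 2 and a 1 um pitch, as in the description of Figure 1.1.
corr = grid_correlation(3, 3, pitch=1.0, b=2.0)
```

Devices in the same square get a correlation of 1, and the value decays monotonically as the squares move apart.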
Figure 1.1: Grid correlation model
Figure 1.2: Quad-tree correlation model
The quad-tree model is shown in Figure 1.2. The quad-tree model takes
the device area and recursively divides it into quadrants, where each quadrant
is assigned its own independent random variable, labeled ∆Xnm in the figure.
The total amount of variation for the circuit is then divided among the different
levels of the quad-tree depending on whether the variation is correlated over a
shorter or longer distance. For short-distance correlation more of the variation
will be assigned to the bottom of the quad-tree, while for longer-distance
correlation more variation will be assigned to the upper levels of the quad-tree.
The total variation for a device can then be found by summing up the random
variables for the quadrants that the device occupies. For example, if a device
were located in the quadrant that contains $\Delta X_6$, then the total variation for the
device would be the sum over the three levels for the quadrants that the device
occupies, making the total variation equal to
$$\Delta X_{total} = \Delta X_6 + \Delta X_2 + \Delta X_1 \qquad (1.3)$$
To find the correlation between two devices, the variation assigned to the
quadrants that both devices share can be summed.
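A minimal sketch of this calculation is given next. It assumes that every region at a given level carries an independent random variable with the same variance, and it identifies regions by a hypothetical (level, column, row) triple; neither detail is taken from the thesis. The correlation is the variance of the shared regions divided by the total variance.

```python
def quadrant_path(x, y, levels, size=1.0):
    """Return the region containing (x, y) at each level, top level first.

    Level 0 is the whole die; each further level splits every region in four.
    """
    path = []
    for level in range(levels):
        cells = 2 ** level  # regions per side at this level
        col = min(int(x / size * cells), cells - 1)
        row = min(int(y / size * cells), cells - 1)
        path.append((level, col, row))
    return path

def quadtree_correlation(p, q, level_variances, size=1.0):
    """Correlation between devices at p and q: shared / total variance."""
    levels = len(level_variances)
    path_p = quadrant_path(p[0], p[1], levels, size)
    path_q = quadrant_path(q[0], q[1], levels, size)
    shared = sum(var for a, b, var in zip(path_p, path_q, level_variances)
                 if a == b)
    return shared / sum(level_variances)

# Three levels with equal variance per level (hypothetical values).
near = quadtree_correlation((0.1, 0.1), (0.4, 0.1), [1.0, 1.0, 1.0])
far = quadtree_correlation((0.1, 0.1), (0.9, 0.9), [1.0, 1.0, 1.0])
```

Shifting variance toward the lower levels makes the correlation fall off faster with distance, matching the short-distance case described above.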
1.2 Statistical Static Timing Analysis
Statistical static timing analysis (SSTA) has emerged as a new timing analysis
tool to deal with the effects of process variation. Like traditional static timing
analysis (STA), the goal of SSTA is to perform a timing analysis of a circuit for
the purpose of deciding whether or not the circuit meets design requirements.
However, the results given by STA and SSTA are different. The result from
an STA is a delay number that is considered to be the maximum delay of any
path within the circuit. The result from SSTA is a probability density function
(pdf) of the circuit delay. This pdf can then be used to find the performance
yield of a circuit, where the performance yield is defined to be the percentage of
manufactured die that will function at a specific clock period. Mathematically,
the performance yield, PY, can be defined as:
$$PY = P(X_1 \le y, X_2 \le y, \ldots, X_n \le y) \qquad (1.4)$$
where $X_i$ is the delay distribution at output $i$ of the circuit, and $y$ is the
chosen clock period. Complicating this equation is the fact that the delay
distributions exhibit correlations that must be considered. As was mentioned
in Section 1.1.1.2, two main models have been proposed to deal with spatial
correlation among devices, with the grid-based model being more popular
because of its simplicity. It has been shown that if correlations between the
process parameters are not considered during the timing analysis, the standard
deviation of the resulting circuit delay pdf can be underestimated by over
100% [4]. Therefore, it is extremely important to consider correlations during
any performance yield calculation and timing analysis.
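To illustrate Equation (1.4) and the effect of correlation on it, the following sketch estimates the performance yield by Monte Carlo sampling. It is an illustration under stated assumptions rather than the method used in this thesis: each output delay is assumed to be a mean plus a linear combination of shared, independent standard normal variables, and all names and numbers are hypothetical.

```python
import math
import random

def single_output_yield(mu, sigma, period):
    """P(X <= period) for one Gaussian output delay."""
    z = (period - mu) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def monte_carlo_yield(outputs, period, samples=20000, seed=0):
    """Estimate P(X_1 <= y, ..., X_n <= y) by sampling.

    Each output is (mean, coeffs) with delay = mean + sum(coeffs[k] * z_k),
    where the z_k are shared independent standard normals. Sharing the z_k
    is what makes the output delays correlated.
    """
    rng = random.Random(seed)
    n_components = len(outputs[0][1])
    passed = 0
    for _ in range(samples):
        z = [rng.gauss(0.0, 1.0) for _ in range(n_components)]
        if all(mean + sum(c * zk for c, zk in zip(coeffs, z)) <= period
               for mean, coeffs in outputs):
            passed += 1
    return passed / samples

# Two outputs with identical marginal distributions, clock period 11.
correlated = monte_carlo_yield([(10.0, [1.0, 0.0]), (10.0, [1.0, 0.0])], 11.0)
independent = monte_carlo_yield([(10.0, [1.0, 0.0]), (10.0, [0.0, 1.0])], 11.0)
```

With these inputs the fully correlated pair behaves like a single output, while the independent pair has a lower yield, which shows why ignoring correlation distorts the yield estimate.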
Overall, SSTA algorithms can be divided into two main groups: block-based
approaches and path-based approaches. In path-based approaches, a number of
paths that are expected to be critical are chosen from the circuit. A statistical
static timing analysis is then performed for the set of selected paths to arrive at
a pdf for the delay of the circuit. Path-based approaches suffer from two main
problems: (1) the need to select the critical paths from within the circuit is not
trivial and (2) combining the timing results from the individual paths into a
global pdf requires knowledge of the correlations of the circuit.
For these reasons, block-based timing approaches have become more popular
recently. Block-based approaches closely follow deterministic static timing
analysis approaches through the use of two operations: a sum operation and
a max operation. [5] proposes an SSTA approach for normally distributed
random variables. The traversal proceeds by first adding the input edges to
all the vertices of the timing graph. Next, a breadth-first search traversal is
performed on the timing graph during which a max operation is performed across
all the vertices in the timing graph. A major disadvantage of this method is the
difficulty in tracking correlation information between the different vertices in
the timing graph.
Chang and Sapatnekar [4] propose a block-based method based on principal
component analysis (PCA), which transforms a set of correlated random
variables into a set of independent random variables, also known as principal
components. The delay distribution of each gate within the circuit is expressed
as a function of the principal components. With the delays represented as
principal components, a PERT-like traversal of the timing graph can be performed
using a sum and max function.¹ The advantage of the PCA method is that all
the correlation information is inherently stored within the principal components
and does not have to be explicitly tracked. All of the block-based methods
depend on two operations, the sum operation and the max operation. These
operations are explored in greater depth in Chapter 2.
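Before the detailed treatment in Chapter 2, the two operations can be sketched for a pair of Gaussian delays. The sum is exact for Gaussians. For the max, this sketch uses the well-known moment-matching approximation due to Clark, which is an assumption made here for illustration and is not necessarily the exact formulation used later in the thesis. Delays are given as hypothetical (mean, standard deviation) pairs with a correlation coefficient rho.

```python
import math

def _pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def _cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ssta_sum(a, b, rho=0.0):
    """Sum of two Gaussians given as (mean, std) with correlation rho."""
    (m1, s1), (m2, s2) = a, b
    return m1 + m2, math.sqrt(s1 * s1 + s2 * s2 + 2.0 * rho * s1 * s2)

def ssta_max(a, b, rho=0.0):
    """Gaussian approximation of max(a, b), matching the first two moments."""
    (m1, s1), (m2, s2) = a, b
    theta_sq = s1 * s1 + s2 * s2 - 2.0 * rho * s1 * s2
    if theta_sq <= 1e-18:  # same spread and fully correlated: max is exact
        return (m1, s1) if m1 >= m2 else (m2, s2)
    theta = math.sqrt(theta_sq)
    alpha = (m1 - m2) / theta
    mean = m1 * _cdf(alpha) + m2 * _cdf(-alpha) + theta * _pdf(alpha)
    second = ((m1 * m1 + s1 * s1) * _cdf(alpha)
              + (m2 * m2 + s2 * s2) * _cdf(-alpha)
              + (m1 + m2) * theta * _pdf(alpha))
    return mean, math.sqrt(max(second - mean * mean, 0.0))

total = ssta_sum((1.0, 3.0), (2.0, 4.0))       # two independent arc delays
arrival = ssta_max((0.0, 1.0), (0.0, 1.0))     # two independent arrival times
```

The max of two Gaussians is not itself Gaussian, so the result of `ssta_max` is an approximation that keeps the representation closed under both operations.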
A common assumption within the SSTA methods mentioned above is that
process parameters can be represented by Gaussian distributions and that
the delay of the gate depends linearly on the process parameter. In general,
these assumptions have merit and provide good accuracy. However, there are
some process parameters where either the delay is not linearly related to the
parameters or where the process parameter cannot be accurately represented
by a Gaussian distribution. [6], [7], and [8] have all proposed SSTA algorithms
that can handle non-Gaussian distributions and parameterized delay models.
However, they are more computationally complex and have not found widespread
adoption.
1.3 Behavioral Synthesis
High-level synthesis (HLS), also known as behavioral synthesis, is a synthesis
technique that allows designers to move up the design chain to a higher level of
abstraction. This means that instead of designing at the register transfer level
(RTL), where a designer must specify all the timing of the circuit, the designer
can work at a behavioral level, where only the data flow of the required circuit
has to be specified. This frees the designer from the burden of many low-level
details of circuit design, allowing for productivity increases of up to 10× and
code reductions of up to 100× [9]. As manufacturing technologies continue to
shrink, HLS is becoming a powerful technique to decrease the amount of time
required to design a chip.
¹ PERT: Program Evaluation and Review Technique, a project management algorithm to
schedule, organize, and coordinate tasks.
Figure 1.3: A sample CDFG
1.3.1 Overview
HLS can be divided into three different tasks (scheduling, binding, and allocation),
which are performed on a control data flow graph (CDFG). A CDFG is a
directed acyclic graph (DAG) that shows how data flows through a circuit.
Figure 1.3 shows a sample CDFG. The operations are represented by the
vertices of the graph, and the flow of the data is represented by the edges.
This example consists of 5 multiply operations and 4 add operations. It will be
used to demonstrate the tasks involved in HLS.
Given a CDFG, the tasks of scheduling, binding, and allocation must be
performed. These tasks are interrelated, since the result
of one affects the results of the others. It has been proven that the problem
of scheduling, binding, and allocation is NP-hard. This means that we must
come up with meaningful heuristics to arrive at a solution that is as close
to optimal as possible. The task of allocation is to decide on the number of
each type of functional unit (FU) that will be allocated for the design. The
task of scheduling consists of dividing the CDFG into control steps followed
by assigning the operations to control steps so that none of the dependency
constraints are violated and there are no control steps where more operations
are assigned to execute than there are functional units allocated. The task
of binding assigns each operation from the schedule to an FU that has been
allocated during allocation. These tasks do not have to be performed in any
specific order. They can each be performed independently, with the results of
one task fed to another, or they can be combined and performed simultaneously.
1.3.2 Allocation
Allocation is an important step in the HLS process. It is a simple concept but
has major implications on the scheduling and binding steps that follow. In
the allocation step, the functional units that will be available for use during
scheduling and binding must be decided on. This step does not just consist
of determining the number of each type of functional unit but also considers
the implementation of each type of functional unit. It is often the case that
more than one implementation of a unit exists, with different delay-area
characteristics. For example, a ripple-carry adder might take up less area than a
carry-select adder, but it will also be slower than the carry-select adder. These
tradeoffs must be evaluated during allocation to decide which allocation of FUs
will be the best to meet the goals of the design.
Looking at the example CDFG from Figure 1.3, we can see that if we want
to minimize the latency of the CDFG, we should allocate enough FUs so that
all the operations on the critical path of the CDFG can execute sequentially.
Using an as-soon-as-possible (ASAP) schedule, in which each operation is
executed as soon as possible, this means that we must allocate 2 multipliers
and 3 adders in order to achieve the minimal latency allowed by the ASAP
schedule. Figure 1.4(a) shows the resulting schedule from this allocation.
However, because of design constraints such as area and power, it is not always
possible to allocate as many units as are necessary to achieve the minimal
latency. In such cases, it might be possible that only 1 multiplier
and 2 adders can be allocated while still meeting an area constraint imposed
upon the design. The resulting ASAP schedule from this allocation is shown in
Figure 1.4(b). It can be seen that the new schedule differs significantly from the
previous schedule in the ordering of operations and that the number of control
steps has had to be increased.
Figure 1.4: The effects of allocation. (a) Unlimited unit allocation. (b) Constrained unit allocation.
The importance of allocation goes further than is demonstrated in Figure
1.4. Allocation also affects the clock period of the design. If the effects of
multiplexers and interconnect delay are ignored, then the slowest FU that
is allocated will determine the clock period of the design. Choosing a slow
implementation for FUs that are on the critical path of the circuit can negate
any benefits that are found from allocating more units or from good scheduling
and binding solutions. If we look back at Figure 1.4 and assume that a slow
multiplier implementation that takes 3 ns to complete is chosen for the unlimited
unit allocation and that a fast multiplier implementation that takes 2 ns to
complete is chosen for the constrained unit allocation, the overall latencies for
the two schedules are as follows:
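$$\text{Latency} = \text{Clock Period} \times \text{Control Steps}$$
$$\text{Latency}_{\text{Unconstrained}} = 3\ \text{ns} \times 4\ \text{Control Steps} = 12\ \text{ns}$$
$$\text{Latency}_{\text{Constrained}} = 2\ \text{ns} \times 5\ \text{Control Steps} = 10\ \text{ns}$$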
As can be seen, the constrained schedule with the larger number of control
steps is actually a better solution, since the total latency is 2 ns less than
that of the unconstrained schedule. This example illustrates the importance
of choosing a good FU implementation for each FU that is allocated. This
problem is also known as the module selection problem. For this example, we
ignored the effects of multiplexer and interconnect delays on the circuit. When
these delays are considered, the module selection problem becomes even more
complex, since the clock period is not necessarily dominated by the slowest FU
implementation that is allocated.
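The comparison above can be reproduced with a short sketch. It keeps the simplification used in the example, namely that the slowest allocated FU sets the clock period and that multiplexer and interconnect delays are ignored. The function name and the 1 ns adder delay are hypothetical; the adder delay is chosen only so that the multiplier remains the slowest unit.

```python
def schedule_latency(fu_delays_ns, control_steps):
    """Latency = clock period x control steps.

    The clock period is set by the slowest allocated FU; multiplexer and
    interconnect delays are ignored, as in the example in the text.
    """
    clock_period = max(fu_delays_ns)
    return clock_period * control_steps

# Multiplier delays and step counts from the text; the 1 ns adder delay
# is a hypothetical value.
unconstrained = schedule_latency([3.0, 1.0], control_steps=4)  # slow multiplier
constrained = schedule_latency([2.0, 1.0], control_steps=5)    # fast multiplier
```

A module selection algorithm would evaluate many such combinations of implementation and schedule length rather than the two shown here.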
1.3.3 Scheduling
As was mentioned, the goal of scheduling is to divide the CDFG into a number
of control steps and to assign each operation one or more control steps during which it
executes while meeting the design constraints and maximizing or minimizing
a design quantity. For the examples that have been shown so far, the goal was
to minimize the latency of the schedule. However, other types of scheduling
algorithms have been proposed that try to minimize or maximize quantities
such as the static power, dynamic power, performance yield, or multiplexer size.
Figure 1.5: Different scheduling algorithms. (a) ASAP schedule. (b) ALAP schedule. (c) List schedule.
Three of the simplest scheduling algorithms, which often form the
basis for other, more advanced, scheduling algorithms, are as-soon-as-possible
(ASAP) scheduling, as-late-as-possible (ALAP) scheduling, and list scheduling.
ASAP scheduling, which has already been introduced, is a top-down approach
where each operation is scheduled as soon as it possibly can be, given the
dependency constraints from the DFG. Figure 1.5(a) shows the original DFG
scheduled using ASAP with an allocation of 1 multiplier and 2 adders. ALAP is
the opposite of ASAP scheduling. Instead of being a top-down approach, ALAP
is a bottom-up approach. Given a latency constraint, ALAP scheduling starts
at the bottom of the DFG and schedules each operation into the latest control
step possible. Figure 1.5(b) shows the DFG scheduled using ALAP.
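The two approaches can be sketched as follows. For brevity, this sketch ignores resource constraints and assumes that every operation takes one control step, so it differs from the resource-constrained schedule of Figure 1.5(a). The DFG used here is hypothetical and is not the graph of Figure 1.3.

```python
def asap(preds):
    """Earliest control step (1-based) for each unit-latency operation."""
    step = {}

    def visit(op):
        if op not in step:
            step[op] = 1 + max((visit(p) for p in preds[op]), default=0)
        return step[op]

    for op in preds:
        visit(op)
    return step

def alap(preds, latency):
    """Latest control step for each operation under a latency constraint."""
    succs = {op: [] for op in preds}
    for op, ps in preds.items():
        for p in ps:
            succs[p].append(op)
    step = {}

    def visit(op):
        if op not in step:
            step[op] = min((visit(s) for s in succs[op]),
                           default=latency + 1) - 1
        return step[op]

    for op in preds:
        visit(op)
    return step

# Hypothetical DFG: operation -> list of predecessor operations.
dfg = {"m1": [], "m2": [], "a1": ["m1", "m2"], "m3": [], "a2": ["a1", "m3"]}
early = asap(dfg)
late = alap(dfg, latency=3)
```

The difference between the ALAP and ASAP step of an operation is its slack; operations with zero slack lie on the critical path.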
List scheduling is a more advanced scheduling technique. In list scheduling,
each control step is scheduled sequentially starting from the first control step.
At each control step, a list of the operations that have all their dependency
constraints satisfied is created. A value is then attached to each operation,
and the operations are scheduled, starting with the operation having the
highest or lowest value, until all operations are scheduled or until the resource
constraint for that type of unit is met. The value that is attached to each
operation can vary depending on the goal that the designer wishes to achieve. For
latency minimization, the value that is normally assigned to each operation
is the total execution time of the downstream operations in the longest path
from the current operation to a primary output. The operation with the longest
downstream execution time is then scheduled first. If two operations have
the same value, then either operation can be chosen. Figure 1.5(c) shows the
schedule for the DFG using list scheduling and the downstream execution time
as the deciding value. A sample list for the first control step is also shown.
We can see that the multiplier for operation 1 has a downstream execution
time of 3 operations. Since it has the longest downstream execution time, it is
scheduled first. Since there is only 1 multiplier allocated, operation 3 cannot
be scheduled in the first control step. Operation 2 has the longest downstream
execution time for an addition, so it is scheduled. Operations 4 and 5 both have
equal downstream execution times, so operation 4 is chosen randomly to be
scheduled in the first control step.
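A minimal sketch of list scheduling for latency minimization follows. It assumes unit-latency operations and at least one allocated unit of every type, breaks ties by input order instead of randomly, and uses a hypothetical DFG rather than the one in Figure 1.5. The priority of an operation is the number of operations on its longest path to an output, counting the operation itself.

```python
def list_schedule(preds, op_type, resources):
    """Resource-constrained list scheduling with unit-latency operations."""
    succs = {op: [] for op in preds}
    for op, ps in preds.items():
        for p in ps:
            succs[p].append(op)

    priority = {}

    def downstream(op):
        if op not in priority:
            priority[op] = 1 + max((downstream(s) for s in succs[op]),
                                   default=0)
        return priority[op]

    for op in preds:
        downstream(op)

    schedule, step = {}, 0
    while len(schedule) < len(preds):
        step += 1
        # Ready: unscheduled, with every predecessor done in an earlier step.
        ready = [op for op in preds if op not in schedule
                 and all(schedule.get(p, step) < step for p in preds[op])]
        ready.sort(key=lambda op: -priority[op])  # stable sort keeps tie order
        free = dict(resources)
        for op in ready:
            if free[op_type[op]] > 0:
                free[op_type[op]] -= 1
                schedule[op] = step
    return schedule

# Hypothetical DFG with 1 multiplier and 1 adder allocated.
dfg = {"m1": [], "m2": [], "a1": ["m1", "m2"], "a2": [], "a3": ["a1", "a2"]}
kinds = {"m1": "mul", "m2": "mul", "a1": "add", "a2": "add", "a3": "add"}
result = list_schedule(dfg, kinds, {"mul": 1, "add": 1})
```

Replacing the priority function changes the optimization goal without changing the rest of the loop, which is why list scheduling is a common basis for more advanced schedulers.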
1.3.4 Binding
There are two parts to binding: register binding and functional unit binding.
Register binding looks at the lifetimes of the variables that are produced
by operations and binds the variables into different registers. Functional
unit binding takes the operations themselves and binds them to an allocated
functional unit. Both register binding and functional unit binding use the
schedule to determine whether a binding operation is legal.
1.3.4.1 Register allocation and binding
We will examine one of the simplest register binding and allocation algorithms.
This is the left edge algorithm [10]. The idea behind the left edge algorithm
is to find the interval, or lifetime, during which a register will need to store a
variable. Once these intervals are identified for all the variables in the DFG,
an interval graph can be created. An interval graph is formally defined as a
graph, G = (V,E), where each vertex, v ∈ V , represents an interval, and each
edge, e ∈ E, represents an overlap between the vertices that it connects. Each
interval can be packed into a register based on its start time (left edge). The
number of registers needed equals the largest number of intervals that overlap
in any single control step. The left edge algorithm is shown in Algorithm
1.1. For interval graphs, it can be proven that the left edge algorithm allocates
a minimal number of registers.
An example of the left edge binding algorithm is shown next. The scheduled
DFG is shown in Figure 1.6(a). The interval graph is shown in Figure 1.6(b),
and the register binding which corresponds to the packing solution is shown in
Figure 1.6(c). From the figure we can see that the interval graph is found by
examining the start and end times for each variable. For variable 1, the result is
generated in control step 1 and then consumed in control step 2, so its interval,
or lifetime, consists of the first control step. More interestingly, variable 3 starts
in control step 1 and is not consumed until control step 4, so the interval is
much longer. In the packed intervals shown in Figure 1.6(c), which correspond
to the final register allocation and binding solution, we can see that 4 registers
are required for the given schedule. Connecting back to the scheduling task,
we can also see that, if we were to make a change in the schedule, the register
allocation and binding solution would change.
Algorithm 1.1 Left Edge Algorithm for Register Binding
Create Interval Graph();
Sort According to Left Edge();
for each interval, i do
    for each packed register, r do
        if start_i > end_r then
            Assign i to r;
            end_r = end_i;
            break;
        end if
    end for
    if i is not assigned then
        Create new register and assign i;
    end if
end for
(a) Scheduled CDFG (b) Interval graph (c) Packed interval graph
Figure 1.6: Register binding example
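A minimal Python sketch of the left edge algorithm follows. The function name, the data layout, and the variable lifetimes in the usage example are assumptions made for illustration; they are not the lifetimes of Figure 1.6. Lifetimes are given as inclusive (start, end) control steps, and an interval is packed into the first register whose last interval ends before the new interval starts.

```python
def left_edge(lifetimes):
    """Pack variable lifetimes into registers with the left edge algorithm.

    lifetimes: {variable: (start, end)} with inclusive control steps.
    Returns a list of registers, each a list of the variables bound to it.
    """
    registers = []  # each entry: [end of last packed interval, [variables]]
    # Sort by left edge (start time); variable name breaks ties.
    ordered = sorted(lifetimes.items(), key=lambda kv: (kv[1][0], kv[0]))
    for var, (start, end) in ordered:
        for reg in registers:
            if start > reg[0]:        # interval begins after the register frees up
                reg[0] = end
                reg[1].append(var)
                break                 # bind each variable to exactly one register
        else:
            registers.append([end, [var]])  # no register fits: allocate a new one
    return [variables for _, variables in registers]

# Hypothetical lifetimes for five variables.
lifetimes = {"v1": (1, 1), "v2": (1, 2), "v3": (1, 3), "v4": (2, 3), "v5": (3, 4)}
print(left_edge(lifetimes))
```

In this example three intervals overlap in every control step from 1 to 3, so three registers is the minimum, and the sketch allocates exactly three.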
1.3.4.2 Functional unit binding
The FU binding solution determines a number of factors in the final design.
A bad FU binding can destroy any optimization that previously occurred, so
it is very important that this step not be overlooked. There have been many
FU binding algorithms proposed that perform FU binding to meet different
objectives. Some of these objectives include minimizing the size of multiplexers,
minimizing the amount or length of interconnect, minimizing the clock period,
and minimizing either the static or switching power of the circuit. To get an
idea of the FU binding problem, we will examine a binding algorithm based on
bipartite matching.
The algorithm that we will examine was proposed by Huang et al. [11].
The goal of the algorithm is to reduce the multiplexer size. By reducing the
multiplexer size, the authors hope to reduce the delay of the critical path,
thereby decreasing the clock period of the bound solution and thus reducing the
overall latency for the CDFG. The functional unit binding part of the algorithm
assumes that both the FU allocation and register binding and allocation have
occurred. To reduce the multiplexer size, the algorithm introduces the concept
of gain. The gain of binding two operations to the same functional unit is found
by examining the registers where the operands for the operations are stored. If
the operands of the two operations are stored in mutually exclusive sets of
registers, then binding the two operations to the same FU will create a
multiplexer. This condition is referred to as negative gain. However, if the
operands of the two operations are stored in the same registers, then no
multiplexer is needed, creating a positive gain situation. The algorithm begins
by creating a gain table for each operation
pair. The gain is defined by Equation (1.5):
\[ \mathrm{GainTable} = \alpha \times (\beta - |IR(op_i) \cap IR(op_j)|) + \gamma \times |OR(op_i) \cap OR(op_j)| \tag{1.5} \]
where α, β, and γ are weighting parameters, IR(opk) refers to the set of input
registers for operation k, and OR(opk) refers to the set of output registers for
operation k.
After the gain table is created, the operations are bound one control step
at a time by solving a bipartite graph for the highest-weight solution. The
bipartite graph is set up with the operations on the left-hand side and the
allocated FUs on the right hand side. For each operation, an edge is created
between the operation and the FU if the operation can be bound to the functional
unit. A weight, wij, is then assigned to the edge according to Equation (1.6)
where OPFU(fuj) represents all the operations that have been assigned to fuj.
Figure 1.7 shows the bipartite graph setup for binding the first control step of
the schedule from Figure 1.6(a) with an allocation of 2 adders and 1 multiplier.
\[ w_{ij} = \sum_{op_k \in OPFU(fu_j)} \mathrm{GainTable}[k][i] \tag{1.6} \]
The bipartite graph can then be solved for the highest-weight solution. This
solution is the bound solution for the control step.
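The per-control-step binding can be sketched as below. This is an illustration under stated assumptions rather than the implementation of [11]: the gain table is taken as a precomputed input (its values here are invented), edge weights follow Equation (1.6), and the highest-weight matching is found by exhaustive search over FU assignments, which is only practical for the handful of operations in one control step. All names (`bind_step`, `edge_weight`, the FU and operation labels) are hypothetical.

```python
from itertools import permutations

def edge_weight(op, fu, bound, gain_table):
    """Equation (1.6): sum of GainTable[k][op] over operations k already on fu."""
    return sum(gain_table.get(k, {}).get(op, 0.0) for k in bound[fu])

def bind_step(step_ops, op_type, fu_type, bound, gain_table):
    """Bind the operations of one control step to distinct, compatible FUs.

    step_ops: operations scheduled in this control step
    op_type:  {operation: unit type}, fu_type: {FU: unit type}
    bound:    {FU: [operations bound in earlier control steps]}
    Returns {operation: FU} with the highest total edge weight, or None.
    """
    best, best_weight = None, float("-inf")
    for chosen in permutations(list(fu_type), len(step_ops)):
        pairs = list(zip(step_ops, chosen))
        # An edge exists only if the operation can be bound to the FU.
        if any(op_type[o] != fu_type[f] for o, f in pairs):
            continue
        weight = sum(edge_weight(o, f, bound, gain_table) for o, f in pairs)
        if weight > best_weight:
            best, best_weight = dict(pairs), weight
    return best

# Hypothetical state: 2 adders and 1 multiplier, two additions already bound.
fu_type = {"add0": "add", "add1": "add", "mul0": "mul"}
bound = {"add0": ["a"], "add1": ["b"], "mul0": []}
op_type = {"c": "add", "d": "mul"}
gain_table = {"a": {"c": 3.0}, "b": {"c": 1.0}}  # invented gain values
print(bind_step(["c", "d"], op_type, fu_type, bound, gain_table))
```

For larger instances a polynomial-time maximum-weight matching algorithm would replace the exhaustive search; the search is used here only to keep the sketch self-contained.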
Figure 1.7: Bipartite graph setup
This section has presented an introduction to the subtasks of behavioral
synthesis. It has presented the basics of allocation, scheduling, and binding and
has offered an example algorithm for each subtask. As has been demonstrated,
the three subtasks are very much intertwined. While these sections have
presented the behavioral synthesis tasks in their normal order, many algorithms
have been proposed that perform the subtasks in different orders, as well as
simultaneously. These algorithms can often offer better results, since they are
able to consider the effects of one subtask on the other. As will be demonstrated
in Chapter 3, the field of behavioral synthesis is still a very active research field,
with new algorithms being proposed constantly.
CHAPTER 2
STATISTICAL STATIC TIMING ANALYSIS
FOR DESIGNS WITH MULTI-CLOCK
DOMAINS
2.1 Introduction
In recent years, a number of SSTA algorithms have been proposed in the
literature [12], [13], [14], [8], [6]. SSTA has proven to be an especially difficult
problem as evidenced by the fact that the majority of work on SSTA has
focused on defining a main SSTA algorithm. Issues such as the need to consider
structural correlations and process parameter correlation have been the focus
of research, while issues such as multi-clock paths and false paths have been
ignored. In order to bring SSTA to maturity and into the mainstream, solutions
to these problems are required.
In this paper we tackle the problem of considering multiple clock domains,
multi-cycle paths, and false paths in SSTA. We propose a method for considering
multi-clock domains, multi-cycle paths, and false paths in block-based timing
analysis frameworks. Our contributions include (1) a multiple clock domain
max equation derivation that considers spatial and structural correlations; (2) a
simple timing graph extension to support multi-clock, multi-cycle path and false
path constraints; (3) a modified PERT-like traversal to enable accurate timing
analysis across multiple clock domains. To the best of our knowledge, this is
the first work to deal with the constraints imposed by multiple clock domains,
multi-cycle paths, and false paths in SSTA. The results show that our method
extends block-based SSTA to consider these complex timing constraints without
additional loss of accuracy over the approximations in the max operation.
The remainder of the paper is organized as follows. In Section 2.2 we
formulate the multi-clock problem and transform it into a multi-cycle path
problem as well as provide a motivational example. In Section 2.3, we describe
our algorithm for considering multi-cycle paths. In Section 2.4, we present our
experimental results.
2.2 Problem Formulation and Motivation
Complex timing constraints, such as multi-clock, multi-cycle paths and false
paths, are a reality in modern industrial designs [15]. The ability to accurately
identify and analyze them in industrial designs is key in attaining timing
closure. In order to further motivate the need for considering multi-clock
during block-based timing analysis, we present the circuit shown in Figure 2.1, a
circuit structure that is often found in high-level and RTL synthesis.
Figure 2.1: Motivational circuit from high-level synthesis
The circuit consists of two functional units (one multiplier and one adder), a
multiplexer, and four flip-flops. If it is assumed that the multiplier and adder
both require one clock cycle to complete their operation, then the timing
analysis can be performed using existing techniques. Using a block-based
approach, the pdf of the circuit delay can be found by Equation (2.1).
\[ \mathrm{pdf}_{FF5} = \max(A + C, B + C) \tag{2.1} \]
where + is the statistical sum operation and is defined by Equation (2.2),
assuming a normal distribution
\[ \mu_{sum} = \mu_a + \mu_b \tag{2.2a} \]
\[ \sigma^2_{sum} = \sigma^2_a + \sigma^2_b \tag{2.2b} \]
and max is the statistical max operation from [16] and is shown in Equation
(2.3) where z = max(x, y).
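\[ E[z] = \mu_x \Phi(\beta) + \mu_y \Phi(-\beta) + \alpha \varphi(\beta) \tag{2.3a} \]
\[ \sigma_z^2 = (\mu_x^2 + \sigma_x^2)\Phi(\beta) + (\mu_y^2 + \sigma_y^2)\Phi(-\beta) + (\mu_x + \mu_y)\alpha\varphi(\beta) - E^2[z] \tag{2.3b} \]
\[ \alpha = \sqrt{\sigma_x^2 + \sigma_y^2 - 2\rho\sigma_x\sigma_y} \]
\[ \beta = \frac{\mu_x - \mu_y}{\alpha} \]
\[ \varphi(x) = \frac{1}{\sqrt{2\pi}} \exp\left(\frac{-x^2}{2}\right) \]
\[ \Phi(x) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{x} \exp\left(\frac{-y^2}{2}\right) dy \]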
where ρ represents the correlation (both spatial and structural) between the
two distributions that are being maxed. For the motivational example the
correlation is defined as:
\[ \rho = \frac{\mathrm{Cov}(A,B) + \mathrm{Cov}(A,C) + \mathrm{Cov}(C,B) + \mathrm{Cov}(C,C)}{\sigma_{A+C}\,\sigma_{B+C}} \tag{2.4} \]
and consists of one structural correlation component, Cov(C,C), and three
spatial correlation components, Cov(A,B), Cov(A,C), and Cov(C,B). Using
these equations, the pdf for the gate delay at the input to FF5 can be found for
the single-cycle case.
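The sum and max operations of Equations (2.2) and (2.3) can be sketched numerically as follows. The function names are assumptions made for this illustration; distributions are represented by a (mean, variance) pair, the sum is written exactly as in Equation (2.2) (that is, without a covariance term), and the special case α = 0 is handled by returning the input with the larger mean, which is a choice of this sketch rather than something specified in the text.

```python
import math

def std_normal_pdf(x):
    """Standard normal density, the function written as phi in Equation (2.3)."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def std_normal_cdf(x):
    """Standard normal cumulative distribution, written as Phi in Equation (2.3)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def stat_sum(mu_a, var_a, mu_b, var_b):
    """Equation (2.2): means add and variances add."""
    return mu_a + mu_b, var_a + var_b

def stat_max(mu_x, var_x, mu_y, var_y, rho=0.0):
    """Equation (2.3): mean and variance of z = max(x, y) for normal x and y."""
    alpha_sq = var_x + var_y - 2.0 * rho * math.sqrt(var_x * var_y)
    if alpha_sq <= 0.0:
        # Degenerate case (sketch assumption): keep the input with the larger mean.
        return (mu_x, var_x) if mu_x >= mu_y else (mu_y, var_y)
    alpha = math.sqrt(alpha_sq)
    beta = (mu_x - mu_y) / alpha
    mean = (mu_x * std_normal_cdf(beta) + mu_y * std_normal_cdf(-beta)
            + alpha * std_normal_pdf(beta))
    second_moment = ((mu_x ** 2 + var_x) * std_normal_cdf(beta)
                     + (mu_y ** 2 + var_y) * std_normal_cdf(-beta)
                     + (mu_x + mu_y) * alpha * std_normal_pdf(beta))
    return mean, second_moment - mean ** 2

# Max of two independent standard normal delays.
print(stat_max(0.0, 1.0, 0.0, 1.0))
```

For two independent standard normal variables the exact mean of the max is 1/√π and the exact variance is 1 − 1/π, which the sketch reproduces.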
However, if the multiplier requires two cycles to complete its operation so
that the path from the inputs of multiplier B, FF3 and FF4, to the output
FF5 requires 2 cycles, then current block-based methods cannot be used
to find the pdf of the performance yield. While this example contains only
multi-cycle paths, we show that the multi-clock problem can be transformed
into a multi-cycle path problem.
Given a set of clock domains, C, select a clock domain to act as the master
clock domain, cm, from C. For all the other clock domains, ci, either assign or
find (if the operating frequencies of the clock domains are known) the ratio for
clock domain i, ni = ci/cm, with which the clock domains operate. For example,
if the master clock domain operates at 10 MHz and another clock domain
operates at 12 MHz, then the ratio between the domains is 1.2. Using the clock
domain ratio, the multi-clock problem is transformed into a multi-cycle path
problem by applying a multi-cycle timing constraint to domain i with cycle ni.
False paths can be included by applying a cycle number of infinity.
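The transformation amounts to a small amount of bookkeeping, sketched here. The function and constant names are hypothetical; the ratio follows the definition n_i = c_i/c_m given above, and a false path is represented by a cycle number of infinity.

```python
import math

FALSE_PATH = math.inf  # a false path is given a cycle number of infinity

def cycle_constraint(domain_freq, master_freq):
    """Ratio n_i = c_i / c_m between a clock domain and the master domain."""
    return domain_freq / master_freq

# Example from the text: master at 10 MHz, second domain at 12 MHz.
print(cycle_constraint(12.0, 10.0))
```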
With this transformation, we formally define the problem as follows:
Definition: Given a statistical timing graph, G(V,E), where each node,
vi ∈ V , and edge, ei ∈ E, contains a random variable for the delay of the
node (edge), find the maximum delay, max(P1, P2, . . . , Pn), over all paths in
the circuit where an arbitrary set of the paths, MC(P ), have a multi-cycle path
constraint.
As was mentioned earlier, spatial correlations between process parameters
have a major impact on the timing analysis of the circuit. In this work we
model intra-die spatial correlations between process parameters using a grid
structure as is shown in [4]. In this correlation model, all the process parameters
of all gates within a grid square are assumed to be perfectly correlated and
the process parameters of gates in different grid squares are assumed to be
correlated with respect to the distance between the grid squares. Specifically,
we model transistor gate length, Lg; transistor width, Wg; gate oxide thickness,
tox; and doping concentration, Na; the details of these parameters can be found
in Section 2.4. A correlation equation that meets the requirements set forth in
[3] is used
so that the correlation matrix is guaranteed to be positive semi-definite, a
requirement of PCA. We follow the assumptions made in [4], [12], [14] by
assuming that all process parameters are normally distributed. While we do
not explicitly model wire delay, the algorithm can be easily extended to include
wire delay using the approach outlined in [4].
2.3 Algorithm Description
In this section we first show how multi-cycle paths can be considered in SSTA
using the block-based method from [5]. Next, we extend these methods so
that they can be used for a PCA-based timing analysis. We then describe the
algorithm that is used to traverse the timing graph. To begin, we return to our
motivational example shown in Figure 2.1 and decompose the circuit into two
paths, as shown in Figure 2.2. The path A–C–FF5 is a single-cycle path, while
the path B–C–FF5 is a two-cycle path. In order to find the pdf for the delay of
this circuit, the max between two delay distributions that are of different cycle
lengths must be found. In the following section we describe how to find the max
between two paths that differ in the number of cycles that they are given to
propagate.
Figure 2.2: Path decomposed motivational circuit
2.3.1 Max between different cycle paths
In order to calculate the max operation between two paths with different cycle
propagation constraints, a normalizing operation is applied to the paths so that
the delay distribution is expressed as a function of a single cycle. Equation (2.5)
shows the normalizing operation.
\[ \mu_{norm} = \frac{\mu_o}{n} \tag{2.5a} \]
\[ \sigma_{norm} = \frac{\sigma_o}{n} \tag{2.5b} \]
where \(\mu_o\) and \(\sigma_o\) are the original mean and standard deviation of
the delay distribution and n is the number of cycles the path has to propagate.
We offer the following proof to justify the normalizing operation:
\[ \Phi\left(\frac{x - \mu}{\sigma}\right) = \Phi\left(\frac{\frac{1}{n}x - \frac{1}{n}\mu}{\frac{1}{n}\sigma}\right) \tag{2.6} \]
Thus, by dividing the mean and standard deviation by the multi-cycle path
cycle constraints, we are able to find the performance yield as a function of
the single-cycle clock period. However, by normalizing the delay distribution,
we change the covariance between the delay distributions of the single and
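multi-cycle paths. Equations (2.7)–(2.10) derive the necessary changes to the
covariance.
\[ \mathrm{Cov}(X, Y) = E[XY] - E[X]E[Y] \tag{2.7} \]
\[ \mathrm{Cov}\left(\frac{1}{n}X, Y\right) = E\left[\frac{1}{n}XY\right] - E\left[\frac{1}{n}X\right]E[Y] \tag{2.8} \]
\[ \mathrm{Cov}\left(\frac{1}{n}X, Y\right) = \frac{1}{n}\left(E[XY] - E[X]E[Y]\right) \tag{2.9} \]
\[ \mathrm{Cov}\left(\frac{1}{n}X, Y\right) = \frac{1}{n}\mathrm{Cov}(X, Y) \tag{2.10} \]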
Therefore, the adjusted covariance can be found by dividing all the multi-cycle
path covariances by the number of cycles they are given to propagate. Using
Equations (2.5) and (2.10) it is possible to find the max between the two paths
A–C–FF5 and B–C–FF5 as follows:
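\[ \mathrm{pdf}_{FF5} = \max\left(\frac{1}{n}(B + C),\; A + C\right) \tag{2.11} \]
\[ \rho = \frac{\mathrm{Cov}\left(A, \frac{1}{n}B\right) + \mathrm{Cov}\left(A, \frac{1}{n}C\right) + \mathrm{Cov}\left(C, \frac{1}{n}B\right) + \mathrm{Cov}\left(\frac{1}{n}C, C\right)}{\frac{1}{n}\sigma_{A+C}\,\sigma_{B+C}} \tag{2.12} \]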
where ρ is the correlation coefficient of the two paths.
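The normalization and the covariance scaling can be checked numerically with a short sketch. The helper names and the sample values are invented for illustration; the covariance is computed from samples exactly in the form E[XY] − E[X]E[Y].

```python
import math

def mean(values):
    return sum(values) / len(values)

def cov(xs, ys):
    """Sample version of Cov(X, Y) = E[XY] - E[X]E[Y]."""
    return mean([x * y for x, y in zip(xs, ys)]) - mean(xs) * mean(ys)

def corr(xs, ys):
    """Correlation coefficient: covariance divided by both standard deviations."""
    return cov(xs, ys) / math.sqrt(cov(xs, xs) * cov(ys, ys))

def normalize(mu, sigma, n):
    """Normalizing operation: divide mean and standard deviation by n."""
    return mu / n, sigma / n

# Invented delay samples for a 2-cycle path (xs) and a single-cycle path (ys).
xs = [1.0, 2.0, 4.0, 7.0]
ys = [2.0, 1.0, 5.0, 6.0]
n = 2
scaled = [x / n for x in xs]
print(cov(scaled, ys), cov(xs, ys) / n)   # the two values agree
print(corr(scaled, ys), corr(xs, ys))     # the two values agree
```

The second printed pair illustrates that the factor 1/n appears in both the numerator and the denominator of Equation (2.12), so the correlation coefficient itself is unchanged by the normalization even though each covariance term is divided by n.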
2.3.2 Application to PCA
In this section, the normalization operation and the correlation change equations
are extended for use in a PCA-based timing analysis. Principal component
analysis, as discussed in Section 1.2, simplifies the traversal of the timing graph
to a PERT-like traversal by expressing each delay distribution as a function of
its principal components:
\[ d = d_0 + k_1 p'_1 + \cdots + k_m p'_m \tag{2.13} \]
The principal components are all independent, which significantly simplifies the
tracking of correlation throughout the circuit. Three properties for expressing
the delay distribution as a function of the principal components are given in [4]:
Property 1: \( \sigma_d^2 = \sum_{i=1}^{m} k_i^2 \)
Property 2: \( \mathrm{cov}(d, p'_i) = k_i \)
Property 3: Let \(d_i\) and \(d_j\) be two random variables:
\[ d_i = d_i^0 + k_{i1} p'_1 + \cdots + k_{im} p'_m \]
\[ d_j = d_j^0 + k_{j1} p'_1 + \cdots + k_{jm} p'_m \]
then
\[ \mathrm{Cov}(d_i, d_j) = \sum_{r=1}^{m} k_{ir} k_{jr} \tag{2.14} \]
Through the use of property 1, the normalization operation can be defined
as:
\[ d_{norm} = \frac{d_0}{n} + \frac{k_1}{n} p'_1 + \cdots + \frac{k_m}{n} p'_m \tag{2.15} \]
The major advantage of PCA comes in the form of the covariance calculation.
Property 3 states that the covariance between two delay distributions can
be easily found by summing the products of the corresponding principal
component coefficients of the two delay distributions. The consequence of this
property is that covariance information is inherently tracked through the
propagation of the principal components during the PERT-like traversal. In the
multi-cycle context, this means that by dividing all the principal component
coefficients by the multi-cycle cycle constraint during the normalization
operation, the covariance information between the two paths is inherently
updated. Therefore, no additional equations are necessary for the calculation of
the new correlation coefficient. The new covariance can be found by applying
property 3 to the principal components of the normalized delay distributions.
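A small sketch of the PCA form and its normalization is given below. The class and method names are hypothetical, and the coefficient values are invented; the methods simply apply Property 1, Property 3, and Equation (2.15).

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CanonicalDelay:
    """Delay expressed as a mean plus coefficients of independent principal components."""
    mean: float
    coeffs: List[float]  # k_1 ... k_m

    def variance(self):
        """Property 1: the variance is the sum of squared coefficients."""
        return sum(k * k for k in self.coeffs)

    def covariance(self, other):
        """Property 3: sum of products of matching coefficients."""
        return sum(a * b for a, b in zip(self.coeffs, other.coeffs))

    def normalized(self, n):
        """Equation (2.15): divide the mean and every coefficient by n."""
        return CanonicalDelay(self.mean / n, [k / n for k in self.coeffs])

# Invented example with two principal components.
d_i = CanonicalDelay(10.0, [1.0, 2.0])   # a 2-cycle path delay
d_j = CanonicalDelay(8.0, [3.0, 4.0])    # a single-cycle path delay
print(d_i.covariance(d_j), d_i.normalized(2).covariance(d_j))
```

Dividing the coefficients of d_i by n = 2 halves the covariance with d_j, in agreement with Equation (2.10), without any separate bookkeeping for the correlation.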
2.3.3 Multi-cycle graph traversal
Until this section, the motivational example has been used to derive the
equations that are necessary to perform a max between two paths with different
cycle constraints. The motivational example was broken into two paths to
aid in deriving the equations; however, path-based SSTA suffers from slow run
time compared to block-based SSTA. In this section, we show how the concepts
shown above can be applied to a general timing graph using a block-based
SSTA traversal.
2.3.3.1 Timing graph setup
A general timing graph consists of two or more vertices connected by edges,
where the vertices represent the logic gates and the edges represent the wires
connecting the logic gates. Also added to the graph is a source node, which
connects to all inputs, and a sink node, to which all outputs connect. Attached
to each vertex (edge) is a delay associated with the vertex (edge). For a single-cycle
SSTA-based timing graph, each vertex (edge) stores a mean delay, as well
as a standard deviation value. In the case of PCA, the standard deviation is
represented by principal components which are stored at the corresponding
vertex (edge).
In the single-cycle timing graph, it is assumed that every path in the
circuit is a single-cycle path. To be able to distinguish between single-cycle
and multi-cycle paths, it is necessary to store extra information at each vertex
and edge. We follow an approach similar to [17] and add a list to each vertex
(edge) that specifies all the multi-cycle constraints in which the vertex (edge)
participates. This approach allows us to handle many types of timing constraint
issues such as thru-x and false paths. For an overview of different types of
timing constraints we refer the reader to [17]. We also store multiple delay
distributions at each vertex (edge), one for each timing constraint that passes
through the vertex (edge). An example timing graph for the output multiplexer
in Figure 2.1 is shown in Figure 2.3. Vertex 4 has a single-cycle path as well as
a 2-cycle path that passes through the gate. Therefore, attached to the vertex
is a list, denoted by C = {1, 2}, containing the cycle constraints for 1-cycle
and 2-cycle paths along with their corresponding delay distributions as shown
in Figure 2.3. Two false paths are also defined by the timing graph. For the
single-cycle timing domain, false paths are defined for all the paths passing
through vertex 2 into vertex 4. The false path is applied through the deletion
of the cycle constraint of 1 from the list of cycle constraints on the input edge
that goes from vertex 2 to vertex 4. The same technique is used to define a
false path in the 2-cycle clock domain by applying a false path for all inputs to
vertex 4 from vertex 3.
Figure 2.3: Timing graph setup: gate-level multiplexer used in Figure 2.1
2.3.3.2 Traversal
We propose a modified PERT-like traversal in order to consider multi-cycle
constraints during block-based SSTA. In a single-cycle PERT-like traversal,
a breadth-first search is performed where the max operation is performed
across all input edges to a vertex. After all the input edges have been maxed,
the delay at the vertex is added to the max delay distribution. To consider
multi-cycle constraints, we propose Algorithm 2.1, which shows the steps that
should be executed at each vertex in the timing graph. The algorithm maintains
the breadth-first traversal of the original PERT-like traversal; however, at each
vertex Algorithm 2.1 is executed.
Algorithm 2.1 Modified PERT-like traversal
1: for each cycle constraint, n, at node v do
2:     pdf_{max,n} = 0
3:     for each input edge e_in do
4:         if n ∈ e_in then
5:             pdf_{max,n} = max(pdf_{max,n}, pdf_{e_in})
6:         end if
7:     end for
8:     pdf_{vertex,n} = pdf_{max,n} + pdf_v
9:     for each output edge e_out do
10:        if n ∈ e_out then
11:            pdf_{e_out final,n} = pdf_{vertex,n} + pdf_{e_out}
12:        end if
13:    end for
14: end for
The algorithm is similar to the single-cycle algorithm with a few key
exceptions. The algorithm begins by selecting one of the cycle constraints
from the list attached to the vertex. It then iterates through all the input
edges, and if an input edge contains the cycle constraint, then the algorithm
performs a max operation between that edge and any previous edges to obtain
pdf_{max,n}. If the input edge does not contain the cycle constraint, then it is
ignored. After all the input edges have been considered, the delay pdf of the
vertex is added to the delay pdf from the max calculation of the input edges
to obtain pdf_{vertex,n}. The output edges are then iterated through, and if they
contain the cycle constraint, then their delay pdf is added to pdf_{vertex,n} to
obtain pdf_{e_out final,n}. The delay pdf_{e_out final,n} is then used as the input
edge delay for the next vertex in the PERT-like traversal. As can be seen from
line 1, this entire procedure is performed for each cycle constraint at the
vertex. This propagates each clock domain independently, and a delay
distribution for each cycle constraint is stored at each vertex (edge).
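The per-vertex procedure of Algorithm 2.1 can be sketched in Python as follows. This is an illustration under stated assumptions, not the MCSSTA implementation: the graph layout and all names are hypothetical, and plain floating-point delays with the built-in `max` and `+` stand in for delay pdfs so that the control flow is easy to follow. A statistical sum and max could be supplied through the `add` and `smax` parameters instead.

```python
def traverse(order, vertices, edges, add=lambda a, b: a + b, smax=max, zero=0.0):
    """Modified PERT-like traversal that keeps one arrival value per cycle constraint.

    order:    vertices in breadth-first (topological) order
    vertices: {v: {"delay": value, "cycles": set of cycle constraints}}
    edges:    list of {"src", "dst", "delay", "cycles"}; the traversal stores
              the propagated values on each edge under the key "arrival"
    Returns   {v: {n: arrival value at v for cycle constraint n}}
    """
    arrival_at = {}
    for v in order:
        ins = [e for e in edges if e["dst"] == v]
        outs = [e for e in edges if e["src"] == v]
        arrival_at[v] = {}
        for n in vertices[v]["cycles"]:                      # line 1
            acc = zero                                       # line 2
            for e in ins:                                    # lines 3-7
                if n in e["cycles"] and n in e.get("arrival", {}):
                    acc = smax(acc, e["arrival"][n])
            at_vertex = add(acc, vertices[v]["delay"])       # line 8
            arrival_at[v][n] = at_vertex
            for e in outs:                                   # lines 9-13
                if n in e["cycles"]:
                    e.setdefault("arrival", {})[n] = add(at_vertex, e["delay"])
    return arrival_at

# Hypothetical fragment: vertices 2 and 3 feed vertex 4. The edge 2->4 carries
# only the 2-cycle constraint and the edge 3->4 only the 1-cycle constraint,
# which is how the false paths are expressed.
vertices = {2: {"delay": 5.0, "cycles": {1, 2}},
            3: {"delay": 3.0, "cycles": {1, 2}},
            4: {"delay": 1.0, "cycles": {1, 2}}}
edges = [{"src": 2, "dst": 4, "delay": 0.5, "cycles": {2}},
         {"src": 3, "dst": 4, "delay": 0.5, "cycles": {1}}]
print(traverse([2, 3, 4], vertices, edges)[4])
```

Passing the operations as parameters keeps the traversal independent of how a delay distribution is represented, which is one way to reuse the same loop for a mean and variance pair or for a PCA form.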
When the modified PERT-like traversal has been completed, the sink node
contains a delay distribution for each timing constraint in the circuit. In the
example of Figure 2.3 the sink node would contain two delay distributions, one
for the 1 cycle constraint and one for the 2 cycle constraint. To complete the
timing analysis, Algorithm 2.2 is performed.
Algorithm 2.2 Sink max algorithm
final_max = 0;
for each cycle constraint, n, at the sink do
    norm_pdf = normalize(pdf_n)
    final_max = max(final_max, norm_pdf)
end for
The algorithm iterates over all the cycle constraints at the sink node,
performing a normalize operation on each delay distribution as was described in
Section 2.3.1 or 2.3.2, depending on the timing analysis algorithm being used.
It then performs the max operation between the normalized distribution and
the running max over all the other cycle constraints at the sink node. This
results in a pdf that can be used to find the performance yield for a given
clock frequency.
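The sink max step can be sketched in the same style. As before, the names are hypothetical and plain floats stand in for delay pdfs; `normalize` divides by the cycle constraint, and a statistical normalize and max could be passed in instead.

```python
import math

def sink_max(sink_delays, normalize=lambda d, n: d / n, smax=max, zero=0.0):
    """Normalize each per-constraint delay at the sink and take the overall max.

    sink_delays: {cycle constraint n: delay value stored at the sink}
    """
    final_max = zero
    for n, delay in sink_delays.items():
        final_max = smax(final_max, normalize(delay, n))
    return final_max

# Sink values for a 1-cycle and a 2-cycle constraint (invented numbers).
print(sink_max({1: 4.5, 2: 6.5}))
# A false path carries a cycle number of infinity and therefore never dominates.
print(sink_max({1: 4.5, math.inf: 100.0}))
```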
2.4 Experimental Results
In this section we test our algorithm on two sets of benchmarks: a set of
high-level synthesis (HLS) benchmarks that have been synthesized to the
gate level, and a number of MCNC benchmarks. We have implemented our
algorithm in a C++ program, MCSSTA, using the PCA-based timing analysis
methodology described in Section 2.3. All the experiments were run on a
computer with an Intel Pentium 4 3.2 GHz processor and 8 GB of RAM
running Red Hat Linux. For the experiments, we synthesize the benchmark
netlists using Synopsys Design Compiler followed by placement and routing
in Cadence Encounter. The placement information is then used by MCSSTA.
MCSSTA is also fed a timing constraints file that consists of the startpoint,
endpoint, and cycle constraint for each multi-cycle and/or multi-clock path in
the benchmark.
To verify our results, we perform Monte Carlo simulations in Matlab. The
correlation information between the different process parameters, the values for
the process parameters, and a netlist containing the timing graph are passed to
Matlab where 10,000 correlated samples are created for each random variable
in the timing graph. We use the 45 nm Nangate standard cell library provided
by Nangate [18] for the experiments. We base our modeling on the predictions
in [19]. Table 2.1 shows the specifics of our process variation modeling.
Table 2.1: Process Variation Parameters
Parameter   µ           3σ % Deviation from Mean   Correlation Distance (µm)
Lg          45 nm       15%                        1.0
Wg          95 nm       12%                        1.0
Na          2 × 10^20   6%                         0.0
tox         1.75 nm     6%                         0.0005
The high-level synthesis benchmarks have been obtained from [20]. They
consist of two different DCT algorithms (pr and wang) and a DSP program
(mcm). The benchmarks have been scheduled and bound using LOPASS [21].
All the multipliers in the circuit have been scheduled to take 2 clock cycles,
while the adders are scheduled for one clock cycle. This setting can be changed
for other configurations, such as 3-cycle multipliers or 2-cycle adders. From the
LOPASS synthesis solution, which is used to generate RTL, the multi-cycle
constraints are generated for use with MCSSTA and during synthesis and
placement and routing. Along with the HLS benchmarks, we also run MCSSTA
on the benchmarks presented in [15]. These benchmarks consist of two MCNC
benchmarks that have been combined to create a single benchmark with
multi-clock domains. We run our experiments with a cycle constraint of 2
for these benchmarks, which corresponds to a 2:1 clock cycle ratio between the
clock domains.
For each benchmark, a single-cycle timing analysis (all paths were set to
a cycle constraint of 1) was performed using MCSSTA (MCSSTA Single), a
corresponding Monte Carlo simulation (Monte Carlo Single) was performed,
and the difference between the two analyses (% Diff Single) was calculated
to establish a baseline for the error in the mean (µ) and standard deviation
(σ). Next, the multi-cycle constraints were applied to the timing graph, and
a timing analysis was performed using MCSSTA (MCSSTA Multi) followed
by a Monte Carlo simulation (Monte Carlo Multi) that was aware of the
multi-cycle constraints. The difference between these values was then calculated
(% Diff Multi). The benchmarks named pr, wang, mcm, and steam belong to
the set of high-level synthesis benchmarks, while the remaining benchmarks
are the multi-clock domain benchmarks from [15]. The characteristics of the
benchmarks, as well as the run times for both the single-cycle and multi-cycle
experiments, are shown in Table 2.2. The column Cells is the number of
standard cells in the synthesized benchmark. The columns PI and PO correspond
to the number of pseudo-primary inputs and pseudo-primary outputs. The next
two columns show the run times of the single-cycle and multi-cycle experiments,
and in the last column the increase in run time is shown. On average, the
run time of the SSTA is reduced by 19% in the multi-cycle case. The reason
for this is explained later. The results from the experiments are shown in
Table 2.3. On average, the single-cycle SSTA overestimates the mean delay
by 0.07% and underestimates the standard deviation by 0.66%. The multi-cycle
case overestimates the mean delay by 0.37% and underestimates the standard
deviation by 0.92%.
Table 2.2: MCSSTA Benchmark Characteristics and Run Times
                                          Run time
Benchmark        Cells   PI     PO     Single   Multi   % Increase
mcm              17463   4200   1980   8845     6179    −30.14
pr               13323   2236   1248   1868     560     −70.02
wang             9573    2164   1216   8091     6256    −22.67
alu4 clma        4115    463    84     49       51      4.08
alu4 diffeq      2680    1226   372    137      92      −32.84
alu4 tseng       2592    1289   476    142      90      −36.61
apex2 s298       2884    370    86     52       42      −19.23
apex2 tseng      5410    3220   1494   1902     967     −49.15
apex4 elliptic   6546    2961   1067   1780     1207    −32.19
des clma         8366    2172   1045   2235     2343    4.83
ex1010 tseng     4982    1255   476    615      537     −12.68
misex3 diffeq    4927    3077   1311   1259     1426    13.26
misex3 tseng     2625    1302   488    153      153     0
pdc tseng        6487    3138   1546   3328     2756    −17.18
seq diffeq       5526    3229   1380   2003     1794    −10.43
seq tseng        3195    1511   557    227      226     −0.44
Average                                                 −19.47
Table 2.3: Benchmark Results
                 Monte Carlo Single   MCSSTA Single        % Diff Single    Monte Carlo Multi    MCSSTA Multi         % Diff Multi
Benchmark Name   µ (ns)    σ (ns)     µ (ns)    σ (ns)     µ       σ        µ (ns)    σ (ns)     µ (ns)    σ (ns)     µ       σ
mcm              1701.02   114.68     1710.93   109.76     0.57    −4.48    1455.53   86.58      1458.25   86.69      0.18    0.12
pr               1765.55   107.87     1768.89   104.18     0.18    −3.53    1050.67   52.4       1052.5    51.4205    0.17    −1.90
wang             1689.78   116.14     1692.17   111.81     0.14    −3.86    1422.44   98.14      1420.87   98.43      −0.11   0.29
alu4 clma        1610.11   111.29     1608.88   111.00     −0.07   −0.26    946.77    55.18      946.69    55.13      0.01    −0.08
alu4 diffeq      1480      102.22     1479.44   102.19     −0.03   −0.03    952.14    51.96      953.27    52.56      0.11    1.14
alu4 tseng       1334.16   97.98      1335.27   98.35      0.08    0.38     977.62    57.87      979.7     57.33      0.21    −0.93
apex2 s298       1898.89   138.46     1899.13   138.82     0.01    0.25     1061.67   66.13      1061.63   65.28      −0.01   −1.30
apex2 tseng      1367.87   90.95      1365.03   91.2136    −0.20   0.28     1036.78   55.98      1041.65   53.34      0.46    −4.94
apex4 elliptic   1633.33   112.16     1634.03   112.093    0.04    −0.05    1092.04   63.35      1094.58   59.81      0.23    −5.91
des clma         1711.41   117.03     1712.36   118.348    0.05    1.11     1082.43   67.91      1093.73   67.62      1.03    −0.42
ex1010 tseng     1300.2    77.54      1299.56   77.18      −0.04   −0.45    1282.76   78.70      1283.81   78.20      0.08    −0.64
misex3 diffeq    1497.15   102.859    1498.17   100.92     0.06    −1.91    1009.27   58.96      1015.57   56.74      0.62    −3.90
misex3 tseng     1248.5    77.68      1247.5    79.2103    −0.08   1.93     932.03    53.07      933.86    53.67      0.19    1.11
pdc tseng        1417.6    87.49      1415.67   87.97      −0.13   0.54     1340.95   80.22      1342.03   82.10      0.08    2.29
seq diffeq       1583.6    111.11     1583.6    111.46     0.00    0.31     998.49    54.1       1012.94   55.1348    1.42    1.87
seq tseng        1266.71   81.289     1266.75   80.73      0.003   −0.68    1070.11   72.82      1083.69   71.75      1.25    −1.49
Average                                                    0.07    −0.66                                              0.37    −0.92
By examining the benchmarks in more detail, it can be seen that MCSSTA
closely matches the Monte Carlo results for the majority of the cases. The
difference in error between the single-cycle case and the multi-cycle case does
increase, but not to levels that are unacceptable. This increase in error is due
to the increased significance of the final max operation. In the single-cycle case,
the max operation is performed at each node between all the input edges. In
the multi-cycle case, on the other hand, only the input edges that contain the
same cycle constraint are maxed at the node. Then, at the sink node, the max
is performed between the different cycle constraints, which contain all the
correlation and delay information for the edges that were not maxed earlier.
Since the calculation that PCA uses to obtain the principal components after
the max operation is an approximation, some of the information is lost after
it is applied successively throughout the traversal. This leads to the increased
error that is seen in the benchmarks. Overall though, the increase in error is
about 0.30% on average.
The overall run time of the benchmarks is also interesting. Going from the
single-cycle to multi-cycle timing analysis, one would expect that the run time
would significantly increase. This does not seem to be the case, as some of
the run times actually decreased. The SSTA algorithm run time is dominated
by the number of max operations. With the multi-cycle path and multi-clock
domain case, it is possible that the number of max operations is decreased if the
circuit structure is such that the input edges to logic gates belong to different
clock domains. In this case a max operation does not need to be performed.
Overall, we find that the run time for MCSSTA is not significantly increased
over the single-cycle case.
CHAPTER 3
FASTYIELD: VARIATION-AWARE,
LAYOUT-DRIVEN SIMULTANEOUS
BINDING AND MODULE SELECTION FOR
PERFORMANCE YIELD OPTIMIZATION1
3.1 Introduction
The shift to probabilistic design methodologies has produced a number of
gate-level variation-aware optimization techniques [22] [23]. While progress at
the gate-level is encouraging, the large productivity gains available in high-level
synthesis (HLS) make it attractive and necessary to address the issue of process
variations at a higher level of abstraction.
In this work, we propose a novel variation-aware simultaneous binding and
module selection algorithm, named FastYield, based on bipartite matching,
that maximizes the performance yield of the resulting circuit. We couple our
synthesis engine closely to the layout, so that layout information can be accurately
back-annotated to synthesis and can guide useful synthesis transformations.
Synthesis and layout are iterated until the performance gain is maximized. The
major contributions of this work are summarized as follows: (1) a simultaneous
binding and module selection algorithm that considers registers, multiplexers,
functional units, interconnects, and spatially correlated process variations; (2) a
timing-driven, simulated annealing–based, statistical floorplanner that considers
interconnect delay and spatial correlation between all units in the design; (3)
an iterative functional unit rebinding based on timing analysis information and
register criticality. The rest of the chapter is organized as follows: Section 3.2
introduces related work on recent high-level synthesis algorithms; Section 3.3
presents statistical functional unit modeling; Section 3.4 presents the details
of the FastYield algorithm; Section 3.5 presents details on the timing-driven
floorplanner and unit correlation model; Section 3.6 presents the rebinding
procedure; Section 3.7 presents experimental results.
1 ©2009 IEEE. Reprinted, with permission, from Proceedings of the 2009 Asia and South
Pacific Design Automation Conference. This material is posted here with permission of the
IEEE. Such permission of the IEEE does not in any way imply IEEE endorsement of any
of the University of Illinois’s products or services. Internal or personal use of this material
is permitted. However, permission to reprint/republish this material for advertising or
promotional purposes or for creating new collective works for resale or redistribution must
be obtained from the IEEE by writing to pubs-permissions@ieee.org. By choosing to view this
material, you agree to all provisions of the copyright laws protecting it.
Scott Cromar contributed significantly to this work, especially Sections 3.4 and 3.6.
3.2 Related Work
HLS is a well-studied topic [24], [25], [26], [27]. Much work has been done in
the areas of scheduling, resource allocation, and binding. Several works, such
as [28], have addressed the topic of simultaneous binding and floorplanning,
but with no consideration of spatial correlation or variability. Huang et al. [11]
presented a binding algorithm based on bipartite weighted matching, a method
we employ in this work. However, their algorithm does not address the critical
issues of resource selection and delay variability. Likewise, most of the work in
HLS has ignored the issue of process variation, as it has not been an important
issue, but that has begun to change in the past few years with the move to deep
submicron processes. We will mainly introduce variation-aware HLS work here,
which is an emerging area of research.
Hung et al. [29] offer a simultaneous scheduling, binding, and allocation
algorithm based on simulated annealing. The simulated annealing algorithm
seeks to reduce the overall latency while meeting a performance yield requirement.
However, the algorithm does not consider multiplexer use or interconnect delay,
both of which can significantly contribute to the clock period of the unit.
Jung et al. [30] propose a timing variation-aware HLS algorithm that
improves resource sharing. While the algorithm is effective, it ignores multiplexers
and interconnects, and it also relies on the assumption that functional units
(FUs) are independent of each other in its yield calculation, given by:
\[
\text{yield} = \prod_{i=1}^{n} P(FU_i \le T_{clk}) \tag{3.1}
\]
where n is the number of functional units, and Tclk is the chosen clock period.
As has been shown in [31] and [3], and as our results show, correlation among
process parameters has an effect on the performance yield.
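To make the effect of the independence assumption in (3.1) concrete, the sketch below compares the product of marginal probabilities with a Monte Carlo estimate of the joint yield for two identical, correlated units. It is an illustration only: the delay values, the correlation of 0.9, and the function names are assumptions and are not taken from [30] or from our experiments.

```python
import math
import random
from statistics import NormalDist

def independent_yield(mus, sigmas, t_clk):
    """Yield under the independence assumption of (3.1)."""
    result = 1.0
    for mu, sigma in zip(mus, sigmas):
        result *= NormalDist(mu, sigma).cdf(t_clk)
    return result

def correlated_yield_two_units(mu, sigma, rho, t_clk, samples=20000, seed=1):
    """Monte Carlo yield of two identical units whose delays have correlation rho."""
    rng = random.Random(seed)
    passed = 0
    for _ in range(samples):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        d1 = mu + sigma * z1
        # Mix z1 into the second delay to obtain the requested correlation.
        d2 = mu + sigma * (rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
        if d1 <= t_clk and d2 <= t_clk:
            passed += 1
    return passed / samples

# Hypothetical units: mean delay 5.0 ns, sigma 0.5 ns, clock period 5.5 ns.
indep = independent_yield([5.0, 5.0], [0.5, 0.5], 5.5)
corr = correlated_yield_two_units(5.0, 0.5, 0.9, 5.5)
```

With positively correlated delays, the units tend to be fast or slow together, so the joint yield is higher than the product of the marginals; a yield estimate based on (3.1) is therefore pessimistic in this setting.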
Lastly, Wang et al. [32] propose a simulated annealing–based method to
consider both power yield and timing yield during HLS. They use a number
of different simulated annealing moves combined with a cost function that
penalizes the design if it exceeds a power or timing yield constraint. Spatial
correlation and interconnect delay are not considered.
3.3 Resource and Correlation Modeling
Modeling resources at a higher level of abstraction is critical to attaining an
accurate HLS solution. We employ a Monte Carlo–based method to precharacterize
the functional units. Two types of variation are considered: random variation
and correlated variation (or systematic variation). The characterization flow for
each unit begins with logic synthesis followed by placement and routing using
Synopsys Design Compiler and Cadence SOC Encounter. The characterization
was performed on a recently released 45 nm standard cell library provided in
the design kit from [33]. From the place and route information, the delay of the
unit and placement of the individual gates are extracted.
Using Monte Carlo analysis, we then characterize the units by specifying
a correlated, θcor, and independent, θind, percentage of delay variation for
each gate in the resource with respect to its nominal delay value. For each
Monte Carlo run, the critical path of the circuit is then found by running a
deterministic timing analysis (we used Synopsys PrimeTime). From the
critical path delays of all Monte Carlo runs, the mean, µFU, and standard deviation,
σFU, of the delay distribution are obtained.
To consider spatial correlation during the binding algorithm, we define
two types of delay variation: inter-unit delay variation and intra-unit delay
variation. Inter-unit delay variation is defined to be correlated across units,
while intra-unit delay variation is defined to be independent across units. The
components of inter- and intra-unit delay variation are calculated as percentages
of the standard deviation that was found from the Monte Carlo analysis of the
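resource. Equation (3.2) shows the calculation of the intra- and inter-unit delay
variances.
\[
\sigma_{intra}^2 = \sigma_{FU}^2 \times \frac{\theta_{ind}}{\theta_{ind} + \theta_{cor}} \tag{3.2a}
\]
\[
\sigma_{inter}^2 = \sigma_{FU}^2 \times \frac{\theta_{cor}}{\theta_{ind} + \theta_{cor}} \tag{3.2b}
\]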
This characterization process produces the components of intra- and inter-unit
delay variation as well as the mean delay for the resource, which are used for
the remainder of the algorithm. We support different structural implementations
of the same arithmetic operation. These implementations provide different delay
and area trade-off characteristics and offer opportunities for better design space
exploration, targeting higher performance yield given a specific resource or area
constraint.
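The variance split of (3.2) can be sketched as a small helper. This is a minimal illustration; the function name and the example numbers are assumptions, and the two components are returned as standard deviations for convenience.

```python
import math

def split_variation(sigma_fu, theta_ind, theta_cor):
    """Split a unit's delay variance into intra-unit (independent) and
    inter-unit (correlated) parts following (3.2).

    Returns (sigma_intra, sigma_inter)."""
    total = theta_ind + theta_cor
    if total <= 0:
        raise ValueError("theta_ind + theta_cor must be positive")
    var_fu = sigma_fu ** 2
    var_intra = var_fu * theta_ind / total
    var_inter = var_fu * theta_cor / total
    return math.sqrt(var_intra), math.sqrt(var_inter)

# Hypothetical unit with sigma_FU = 0.2 ns and equal independent and
# correlated percentages of variation.
sigma_intra, sigma_inter = split_variation(0.2, theta_ind=10.0, theta_cor=10.0)
```

By construction the two variances add up to the characterized variance of the unit, so the split only decides how much of the variation is shared with neighboring units.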
3.4 FastYield Algorithm Description
In this section we will present the FastYield binding/module selection algorithm.
FastYield seeks to improve performance yield through a multiplexer and
interconnect-aware delay reduction strategy. Performance yield evaluated at
a clock period t, PY(t), is defined as:
\[
PY(t) = P(r_1 \le t,\ r_2 \le t,\ \ldots,\ r_n \le t) \tag{3.3}
\]
where PY(t) is the probability that r1, r2, . . . , rn all meet the clock period requirement,
and ri represents the delay distribution at register i. We assume all
delays are jointly Gaussian with an associated covariance matrix; i.e., they are
correlated.
The algorithm has three major components: (1) an initial resource allocation
and binding; (2) a timing-driven floorplanner that performs both a
timing-driven placement and a statistical static timing analysis (SSTA); and (3)
an FU rebinding that incorporates timing analysis information from component
2. FastYield seeks to improve the synthesis solution by iteratively feeding
back accurate, floorplan- and interconnect-aware, statistical timing information
to the rebinding step.
One of the strengths of FastYield lies in its use of a process correlation
model during timing analysis, enabled by floorplan information. As stated
previously, interconnect delay and multiplexer delays are considered during each
SSTA step. Performance yield is calculated at the end of each timing analysis to
evaluate the success of the algorithm, and the algorithm exits when no further
improvement is seen in the binding/module selection solution. Each of the main
components of FastYield is described next.
3.4.1 Initial binding
The inputs to the algorithm include (1) a scheduled control data flow graph
(CDFG), (2) a resource library, and (3) an area constraint. The resource library
contains all the resources, including FUs, multiplexers, and registers, as well as
the precharacterization data for each.
FastYield performs an initial allocation and binding in three steps: First, a
minimal set of registers is allocated and bound to a set of variables (variables
are outputs of operations). Second, a combined FU allocation and binding
takes place. Third, a minimized set of multiplexers is allocated. This initial
allocation and binding step follows a strategy similar to that presented in [11]
but differs significantly in its consideration of delay variation, module selection,
and connection to a spatial correlation–aware timing analysis that leads to
rebinding. We name this section Initial Binding to differentiate from the
Rebinding procedure to be covered in Section 3.6.
3.4.2 Register allocation and binding
Register binding is accomplished in a manner similar to that described in [11],
where a minimal set of registers is allocated, and variables are bound by solving
a weighted bipartite graph.
3.4.3 Initial functional unit allocation and binding
Once the registers are allocated and variables are bound to them, FUs are
allocated and operations assigned to them one control step at a time. First,
control steps are ranked according to the following equation:
\[
\mathit{Rank}_{cstep} = \mathit{diversity} \times \mathit{numOPs} \tag{3.4}
\]
where diversity is the number of different types of operations in the control
step, and numOPs is the number of operations assigned to the control step. The
control steps are then processed from the highest-ranked to the lowest-ranked.
This strategy is similar to the “first fit decreasing” heuristic used in bin packing
problems. The items are put in descending order according to their volumes (in
this case rank) and then packed one at a time in an effort to make the packing
as close to optimal as possible.
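The ranking of (3.4) and the resulting processing order can be sketched as follows. The data layout (a dictionary from control-step index to a list of operation types) and the tie-breaking rule are assumptions made for this example.

```python
def rank_control_steps(schedule):
    """Order control steps by Rank = diversity * numOPs, highest first.

    `schedule` maps a control-step index to the list of operation types
    scheduled in that step, e.g. {0: ["add", "mult"], 1: ["add"]}.
    """
    ranks = {}
    for cstep, op_types in schedule.items():
        diversity = len(set(op_types))   # number of distinct operation types
        num_ops = len(op_types)          # number of operations in the step
        ranks[cstep] = diversity * num_ops
    # Ties are broken by control-step index (an arbitrary choice made here).
    order = sorted(ranks, key=lambda c: (-ranks[c], c))
    return order, ranks

# Hypothetical three-step schedule.
schedule = {
    0: ["add", "add"],
    1: ["add", "mult", "mult"],
    2: ["mult"],
}
order, ranks = rank_control_steps(schedule)
```

Here control step 1 has two operation types and three operations, so it is bound first, followed by steps 0 and 2.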
The cluster of control step operations to be bound is placed into a set,
Ocstep, and the available FUs are put into a set FUav. On the first control step
to be bound, the set of available FUs consists of, for each operation in the
control step, one instance of each FU in the resource library that is compatible
with that operation (see Figure 3.1). This initial allocation ensures that each
operation can bind to any of the compatible FUs in the resource library. In
subsequent control steps, FUav is trimmed of any FUs that, if allocated, would
exceed the area constraint, with the qualification that a sufficient number of
FUs of each type has been allocated to accommodate a successful binding
solution. In this way, FastYield produces a binding solution that meets the area
constraint, while also enabling module selection.
Figure 3.1: Illustration of the bipartite graph created for the functional unit
binding of the first control step

A weighted bipartite graph is constructed in which each vertex represents
either an operation (oi ∈ Ocstep) or an FU (fuj ∈ FUav), and there is a weighted
edge, eij, between each operation, oi, and each FU, fuj, that can perform the
operation. Edge weights account for the multiplexers that a binding would
create, given the previously bound registers. If two operations share the same input
register, it is advantageous to bind the two operations to the same FU, because
no multiplexer is needed (which can reduce the path delay).
Likewise, if two operations that share the same output register are bound to the
same FU, no multiplexer is needed at the register’s input port (again having
a positive effect on the delay reduction). The initial binding weight, wij initial,
corresponding to each edge, eij, is calculated according to:
\[
w_{ij}^{initial} = \frac{1}{estDelay(i, j)} \tag{3.5a}
\]
\[
estDelay(i, j) = \mu_{fu_j} + \mu_{mux_{in}} + \mu_{mux_{out}} + 3 \times \sqrt{\sigma_{fu_j}^2 + \sigma_{mux_{in}}^2 + \sigma_{mux_{out}}^2} \tag{3.5b}
\]
where µ is the mean, σ is the standard deviation, muxin is the multiplexer
that would be created at the input of the FU if the operation were bound to
it, and muxout is the multiplexer that would be created at the input of the
output register if the operation were bound to the FU. This weight calculation
effectively incorporates the statistical behaviors of all the involved components
in the circuit paths, putting a higher weight on the shorter-delay paths. The
maximum-weight solution is then found, and the operations are bound to FUs
for the control step. An example of the bipartite graph setup is shown in Figure
3.1.
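The weight calculation of (3.5) and the selection of a maximum-weight binding for one control step can be sketched as below. The matching is found here by brute-force enumeration, which is only practical for a handful of operations and stands in for a proper weighted bipartite matching algorithm; the function names, the data layout, and the numbers are assumptions for illustration.

```python
import math
from itertools import permutations

def est_delay(fu, mux_in, mux_out):
    """Estimated delay of (3.5b); each argument is a (mean, std) pair."""
    mean = fu[0] + mux_in[0] + mux_out[0]
    std = math.sqrt(fu[1] ** 2 + mux_in[1] ** 2 + mux_out[1] ** 2)
    return mean + 3.0 * std

def max_weight_binding(weights):
    """Brute-force maximum-weight matching of operations to distinct FUs.

    weights[i][j] is the weight of binding operation i to FU j, or None if
    FU j cannot perform operation i. Assumes at least as many FUs as
    operations. Returns (assignment, total_weight), where assignment[i] is
    the FU index chosen for operation i.
    """
    num_ops = len(weights)
    num_fus = len(weights[0])
    best, best_total = None, -math.inf
    for fus in permutations(range(num_fus), num_ops):
        if any(weights[i][j] is None for i, j in enumerate(fus)):
            continue  # skip assignments that use an incompatible FU
        total = sum(weights[i][j] for i, j in enumerate(fus))
        if total > best_total:
            best, best_total = fus, total
    return best, best_total

# Hypothetical weights w_ij = 1 / estDelay(i, j) for two operations and two FUs.
weights = [[0.50, 0.40],
           [0.45, 0.10]]
assignment, total = max_weight_binding(weights)
```

In this example, binding each operation greedily to its best FU would conflict on FU 0; the matching instead assigns operation 0 to FU 1 and operation 1 to FU 0, which has the larger total weight.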
3.4.4 Multiplexer/connection allocation
Connections between FUs and registers are made with multiplexers that are
allocated at the inputs of FUs and at the inputs of registers, according to a
random port assignment routine. Once the connections to ports are assigned, a
minimized set of multiplexers is allocated at each port.
3.5 Timing-Driven Floorplanner
The timing-driven floorplanner is run after each binding iteration is completed
to evaluate the performance yield of the solution. As has been shown in previous
work [3], [4], and as we show in the experimental results section, spatial correlation
of parameters such as gate length variation can have an impact on the variance
of the timing of a circuit. To achieve accurate timing results it is important
that spatial correlation among units is considered during timing analysis.
3.5.1 Unit correlation model
To complement the high-level synthesis resource modeling, we propose a
novel unit-based correlation model. In the unit-based correlation model,
each functional unit, register, or multiplexer is assigned a unit number and
the correlation between each unit is found based on the distance between
the center points of the units using a correlation function that meets the
requirements of [3] so that the correlation matrix for the circuit is positive
semi-definite, a requirement for the statistical static timing analysis approach
that we use. This model is beneficial to high-level synthesis, as it complements
the proposed resource modeling and also takes into account the different sizes
of functional units. For high-level synthesis, a grid-based model, as used
in [4], does not complement the unit characterization, since it is possible for
functional units to be split across different grid regions, which complicates
both the unit characterization process and the correlation calculation. Figure
3.2 shows an example of the unit correlation model. Two multipliers (1 and
2), an adder (3), and a register (4) are labeled in the picture. It can be seen
that when an adder and a register (small area) are placed next to each other,
the correlation is higher than when two multipliers (large area) are placed
next to each other. The unit-based correlation model can also be viewed as
an extension of the grid-based model where each logic gate/functional unit is
its own grid. Based on this view, we expect that the unit-based correlation
model is at least as accurate as a grid-based model. The proposed correlation
model, in conjunction with the inter-unit and intra-unit variation found during
the resource characterization, effectively allows correlated variation to be
represented at a higher level of abstraction.
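A minimal sketch of the unit correlation model is given below. It assumes an exponential correlation function of the Euclidean distance between unit center points, rho(d) = exp(−d / corr_distance), which is one common positive-definite choice; the model itself only requires a function that satisfies the conditions of [3], so the specific function, the names, and the coordinates here are assumptions.

```python
import math

def unit_correlation_matrix(centers, corr_distance):
    """Correlation between units from the distance between their centers.

    Uses rho(d) = exp(-d / corr_distance), an assumed correlation function.
    `centers` is a list of (x, y) center points, one per unit.
    """
    n = len(centers)
    matrix = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(centers[i], centers[j])   # Euclidean distance
            matrix[i][j] = matrix[j][i] = math.exp(-d / corr_distance)
    return matrix

# Hypothetical floorplan (coordinates in mm): two large multipliers whose
# centers are far apart, and a small adder and register placed side by side.
centers = [(0.20, 0.20), (0.60, 0.20), (0.90, 0.05), (0.95, 0.05)]
corr = unit_correlation_matrix(centers, corr_distance=1.0)
```

Because the distance is measured between center points, the adjacent adder and register (units 2 and 3) are more strongly correlated than the adjacent multipliers (units 0 and 1), which matches the behavior described for Figure 3.2.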
3.5.2 SSTA algorithm
To obtain placement information during the timing analysis we use a modified
version of the Parquet floorplanner [34]. The modified floorplanner employs
a simulated annealing approach in which, after a number of unit moves, a
statistical timing analysis is performed to evaluate the solution. Algorithm
3.1 shows the pseudocode for the timing-driven floorplanner.
The statistical timing analysis, which considers spatial correlation,
is based on the work of Chang et al. [4]. This work relies on principal
component analysis (PCA) to transform a set of correlated random variables
into a new set of independent random variables.
Figure 3.2: Sample floorplan showing data connections

Algorithm 3.1 Timing-driven floorplanner pseudocode
while time > timecool do
    PerformMoves(num_moves);
    CalcWireDelay();
    CalcCorrelation();
    PerformPCA();
    PerformTimingAnalysis();
    CalculateCost();
end while

To perform PCA, a correlation matrix for the binding solution is found
using the unit correlation model described above. The interconnect delay
between the units is modeled based on distance. Since no detailed routing
information is available, we model the connection between two functional units
as a two-pin net with the length being the Manhattan distance between the two
connecting terminals of the FUs. The mean Elmore delay with optimal buffer
placement is then found using (3.6), which follows from the results of [35]:
\[
\mu_{wire} = 2.5 \times \sqrt{R_{buff}\, C_{buff}\, R_{length}\, C_{length}\, l^2} \tag{3.6a}
\]
\[
\sigma_{wire} = \alpha \times \mu_{wire} \tag{3.6b}
\]
where µwire is the mean wire delay, Rbuff is the output resistance of the buffer,
Cbuff is the input capacitance of the buffer, Rlength is the resistance per unit
length, Clength is the capacitance per unit length, and l is the net length. The
standard deviation of the wire delay is calculated using (3.6b), where α is the
fraction of wire delay variation. α is found in accordance with the results from [36]
as follows:
\[
\alpha = 0.3836 \times \exp(-0.1537\,h) \tag{3.7}
\]
where h is the optimal buffer size as calculated by [35]. We consider the wire
variation to be independent across wires.
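The wire model of (3.6) and (3.7) can be sketched as a pair of helpers. The function names and the numeric inputs are assumptions for illustration; consistent units for resistance, capacitance, and length are the caller's responsibility.

```python
import math

def manhattan_length(pin_a, pin_b):
    """Net length as the Manhattan distance between two (x, y) terminals."""
    return abs(pin_a[0] - pin_b[0]) + abs(pin_a[1] - pin_b[1])

def wire_delay_stats(r_buff, c_buff, r_len, c_len, length, buffer_size):
    """Mean and standard deviation of a buffered two-pin net, per (3.6)-(3.7)."""
    mu_wire = 2.5 * math.sqrt(r_buff * c_buff * r_len * c_len * length ** 2)
    alpha = 0.3836 * math.exp(-0.1537 * buffer_size)   # fraction of wire variation
    sigma_wire = alpha * mu_wire
    return mu_wire, sigma_wire

# Hypothetical, unit-less inputs chosen so the arithmetic is easy to follow.
length = manhattan_length((0.0, 0.0), (1.0, 1.0))
mu_wire, sigma_wire = wire_delay_stats(1.0, 1.0, 1.0, 1.0, length, buffer_size=0.0)
```

Since the net length appears squared under the square root, the mean delay grows linearly with the Manhattan distance, and a larger optimal buffer size h reduces the relative variation α.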
3.5.3 Floorplanner cost function
The cost function for simulated annealing moves in the floorplanner is given by:
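\[
Z \sim N(\mu_z, \sigma_z) = \max\big(reg_1(\mu_1, \sigma_1), reg_2(\mu_2, \sigma_2), \ldots, reg_n(\mu_n, \sigma_n)\big) \tag{3.8a}
\]
\[
TR = \frac{\mu_z + \sigma_z}{\mu_{best} + \sigma_{best}} \tag{3.8b}
\]
\[
Cost = \alpha \times area + \beta \times TR \tag{3.8c}
\]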
where max(reg1(µ1, σ1), reg2(µ2, σ2), . . . , regn(µn, σn)) represents the statistical
max operation [4] on the timing distributions at the inputs to all output
registers (pseudo-primary outputs), µbest and σbest represent the mean and
standard deviation of the best solution found so far, and α and β are weighting
parameters. The TR cost is the sum of the mean and standard
deviation of the max distribution, normalized by the sum of the mean and standard
deviation of the best solution. In the calculation of TR, we chose to use the
sum of the mean and standard deviation since, for a Gaussian delay, the result
corresponds to the required clock period for approximately an 84% yield.
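The cost of (3.8b) and (3.8c), together with the yield implied by clocking a Gaussian delay at its mean plus one standard deviation, can be sketched as follows. The function names and the numbers are assumptions; in particular, the area term is assumed to be normalized by the caller so that it is comparable to TR.

```python
from statistics import NormalDist

def floorplan_cost(mu_z, sigma_z, mu_best, sigma_best, area, alpha, beta):
    """Cost of (3.8): weighted sum of area and the normalized timing term TR."""
    tr = (mu_z + sigma_z) / (mu_best + sigma_best)
    return alpha * area + beta * tr

def yield_at_mean_plus_k_sigma(k):
    """Yield of a Gaussian delay when clocked at mu + k * sigma."""
    return NormalDist().cdf(k)

# Hypothetical candidate floorplan compared against the best solution so far.
cost = floorplan_cost(mu_z=5.5, sigma_z=0.5, mu_best=5.0, sigma_best=0.5,
                      area=1.0, alpha=0.5, beta=0.5)
one_sigma_yield = yield_at_mean_plus_k_sigma(1.0)
```

The standard normal cdf evaluated at 1 is about 0.841, which is the origin of the approximately 84% figure for a clock period of µz + σz.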
Upon completion of the timing analysis, the delay probability density
function (pdf) for each register is known. The distributions, as well as the
required clock frequency for an 85% performance yield, are then passed back to
FastYield for the criticality analysis of the rebinding step.
Figure 3.2 shows an example floorplan obtained from the timing-driven
floorplanner, with the arrows representing the flow of data through the critical
path. After a specified number of moves, a timing analysis is performed on
all paths in the design as described above. In Figure 3.2 it can be seen that
it might be possible to improve the timing by moving either mux2 83 or
reg 10; however, it should be kept in mind that moving the resources closer
together could have a negative impact on the timing analysis because of spatial
correlation.
3.6 Rebinding
Functional unit rebinding is performed after the initial solution has been
analyzed by the timing-driven floorplanner, and it continues in an iterative
fashion thereafter until the floorplanner reports that no improvement has been
made. The rebinding algorithm works by determining which functional units
along the critical paths are causing the majority of the delay. It then attempts
to reduce the delay in two ways: (1) by swapping slower FUs on critical paths
for faster FUs and (2) by rebinding individual operations on the critical paths.
Algorithm 3.2 shows the pseudocode for the rebinding algorithm, which will be
explained next.
Algorithm 3.2 Rebinding algorithm pseudocode
CalcRegAndFURanks();
if SwapCriticalFUs() then
    Break;
end if
OrderRebindOperations(Orebind);
for all op ∈ Orebind do
    CalcOpToFUWeights();
    BindLargestWeightPair();
    EstimateTiming();
    RecalcRegAndFURanks();
end for
3.6.1 Register and functional unit ranking
The algorithm begins by ranking the output registers in order of their criticality.
The slowest register is identified by finding the worst-case deterministic delay
based on the mean and standard deviation from the floorplanning information.
The rank of register r is then calculated by:
\[
RegRank_r = \frac{\mu_r + 3\sigma_r}{\mu_{slowest} + 3\sigma_{slowest}} \tag{3.9}
\]
where µr and σr are the mean and standard deviation for register r, and µslowest
and σslowest are the mean and standard deviation of the slowest register. The
registers are then ordered according to their criticality, or rank, starting from
the most critical.
With the registers ranked, the algorithm then proceeds to rank each FU that
is connected to each register. The rank of FU k connected to register r is found
by:
\[
FURank_k = RegRank_r \times \left( 0.5 \times \frac{\mu_k}{\mu_r} + 0.5 \times \frac{\sigma_k}{\sigma_r} \right) \tag{3.10}
\]
where µr and σr are the mean and standard deviation for output register r, µk
and σk are the mean and standard deviation of the FU, and RegRankr is the
rank for register r. The RegRankr weight provides an estimate of the global
impact the register has on the overall clock period, while the ratio of the means
and standard deviations considers how much the overall mean or variance of the
functional unit impacts the final timing at the register. The end of Section 3.6.3
will present an example of how this ranking is accomplished.
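The two rankings of (3.9) and (3.10) can be sketched directly. The slowest-register statistics below (µ = 3.1, σ = 0.3) are the ones used in the example of Section 3.6.3; the remaining register and FU values are hypothetical numbers chosen so that the register rank comes out as 0.9, and are not read from Figure 3.3.

```python
def reg_rank(mu_r, sigma_r, mu_slowest, sigma_slowest):
    """Register criticality of (3.9): 3-sigma delay relative to the slowest register."""
    return (mu_r + 3.0 * sigma_r) / (mu_slowest + 3.0 * sigma_slowest)

def fu_rank(rank_r, mu_k, sigma_k, mu_r, sigma_r):
    """FU criticality of (3.10) for FU k feeding output register r."""
    return rank_r * (0.5 * mu_k / mu_r + 0.5 * sigma_k / sigma_r)

# Slowest register from the text: mu = 3.1, sigma = 0.3.
# Hypothetical register r: mu = 2.7, sigma = 0.3.
rank_r = reg_rank(2.7, 0.3, 3.1, 0.3)          # (2.7 + 0.9) / (3.1 + 0.9)
# Hypothetical FU contributing half of the register's mean and sigma.
rank_fu = fu_rank(rank_r, mu_k=1.35, sigma_k=0.15, mu_r=2.7, sigma_r=0.3)
```

The slowest register always receives a rank of 1, and an FU inherits a larger share of its register's rank the more of the register's mean delay and spread it accounts for.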
3.6.2 Swapping critical functional units
The rebinding algorithm examines the set of allocated FUs and, based on their rank,
finds any higher-ranked FUs that are slower than lower-ranked FUs of
the same type, and swaps them. The net effect is to place the fastest FUs on
the most critical paths. If no FUs meet the criteria for swapping, the rebinding
proceeds to the next step. If FUs are swapped, then the timing analysis is rerun
before rebinding proceeds.
Figure 3.3: Example FU ranking and operation selection for rebinding
3.6.3 Selection of operations to be rebound
The rankings of the registers and FUs are used in the selection of particular
operations that will be rebound. Operations are chosen that both contribute
to a critical path delay and have the potential to reduce that delay. Briefly,
this is done as follows: First, a set of output registers is selected for their delay
criticality based on their rank. For each chosen register, the FU connected to it
with the highest rank (denoting its greater contribution to the criticality of the
register) is selected, and an operation, or multiple operations, that are bound to
that FU are selected to be rebound. The criterion for selection of the particular
operations associated with each FU to be rebound is the operation’s potential,
if rebound, to reduce multiplexer size on that critical path. The example in the
next paragraph serves to clarify the process.
An example of the register and FU ranking, and operation selection, is
illustrated in Figure 3.3. The method is presented step-by-step: The slowest
register has a mean µ = 3.1 and a standard deviation σ = 0.3. (1) Based on the
slowest-register information, by equation (3.9) register 5 is found to have a rank
of 0.9. (Register 5 is determined to be critical based on its rank.) (2) The FUs
connected to register 5 are ranked according to equation (3.10). fu2 is found to
have a higher rank than fu1, so it is from fu2 that an operation, or operations,
will be selected for rebinding. (3) The inputs to fu2 are examined, and port 1 is
found to have a larger multiplexer than port 0. (4) The registers connected to
the inputs of the 3-input multiplexer are evaluated. One of the three registers
has two variables bound to it, and the other two have one variable bound to
them. Since a smaller number of variables bound to a register is preferred (more
likely to reduce the multiplexer size if moved), register 3 is randomly chosen
from the two registers that have only one variable bound to them. (Usually,
only one register’s bound variables are selected for rebinding for a given FU.)
The operation corresponding to that register/variable, operation 1 in this case,
is assigned the rank of the associated FU and chosen for rebinding. This same
process is carried out for each critical register, and the selected operations
(along with their ranking) are placed in the set Orebind to be rebound.
3.6.4 Operation rebinding
The rebinding is performed for each operation oi ∈ Orebind, one operation at a
time, starting with the operation with the highest rank. Previous bindings that
have not been selected for rebinding are left untouched. For a given operation,
oi, a rebind weight is calculated for each FU, fuj, in the previously allocated FU
set. The weight, wij rebind, for each operation FU pair is calculated as follows:
\[
w_{ij}^{rebind} = \frac{w_{ij}^{rebindPrevious}}{\max(w_{i}^{rebind})} \times (1 - FURank_j) \tag{3.11}
\]
where wij rebindPrevious is the weight of the operation-to-FU pair in the previous
iteration of rebinding (or the initial binding if this is the first iteration), max(wi rebind)
is the maximum weight from all of the operation-to-FU pairs, and FURankj is
the FU rank as described earlier. The first part of the weighting considers the
likelihood operation oi had of being assigned to fuj in the previous binding. If oi
was close to being assigned to fuj during the previous binding, then rebinding oi
to fuj will be a good choice, if the rank of the FU is low (meaning it currently is
not a part of the critical path). The second part of the equation adds this rank
consideration to the weight.
The operation-to-FU pair with the largest weight is then chosen, and that
operation is bound to the FU. The process repeats for each operation that
belongs to the set Orebind. However, after each operation is rebound, it is
possible that the multiplexer size has changed, which in turn reduces the critical
path of the circuit and changes the ranks of the registers. Therefore, after each
operation is rebound, an estimated timing analysis is performed on the paths
that are affected by the rebinding of the operation and the register and FU
ranks are recalculated. After every operation in Orebind has been processed, one
iteration of rebinding is complete and the solution is sent to the floorplanner for
analysis.
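The rebinding weight of (3.11) for a single operation can be sketched as follows. The normalizing maximum is interpreted here as the largest previous weight among that operation's candidate FUs, which is an assumption about the intended scope of the max; the function names and the numbers are also hypothetical.

```python
def rebind_weights(prev_weights, fu_ranks):
    """Rebinding weights of (3.11) for one operation.

    prev_weights[j] is the operation's weight for FU j in the previous
    binding; fu_ranks[j] is the current rank of FU j.
    """
    w_max = max(prev_weights.values())
    return {fu: (w / w_max) * (1.0 - fu_ranks[fu])
            for fu, w in prev_weights.items()}

def pick_fu(prev_weights, fu_ranks):
    """Return the FU with the largest rebinding weight."""
    weights = rebind_weights(prev_weights, fu_ranks)
    return max(weights, key=weights.get)

# Hypothetical operation: fu1 was the best fit previously but is now critical.
prev_weights = {"fu1": 0.25, "fu2": 0.20}
fu_ranks = {"fu1": 0.8, "fu2": 0.1}
weights = rebind_weights(prev_weights, fu_ranks)
chosen = pick_fu(prev_weights, fu_ranks)
```

Although fu1 had the larger previous weight, its high rank marks it as part of a critical path, so the operation is rebound to fu2.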
3.7 Experimental Results
In this section we present several results that demonstrate the importance
of considering process variation and correlation during high-level synthesis,
and we show the effectiveness of FastYield at accomplishing these tasks.
Table 3.1: HLS Benchmark Characteristics
Benchmark  No. PIs  No. POs  No. Adds  No. Mults  Total No. Edges
chem       20       10       171       176        731
dir        8        8        84        64         314
honda      9        2        45        52         214
mcm        8        8        64        30         252
pr         8        8        26        16         134
steam      5        5        105       115        472
wang       8        8        26        22         134
FastYield reads in a resource library and a benchmark that has been prescheduled
with simple list scheduling, and runs the benchmark through initial binding,
floorplanning with timing analysis, and rebinding. The resource library contains
the precharacterized resources, which include FUs, multiplexers, and registers.
The resources were precharacterized with 10% random variation and 10%
spatially correlated variation with a correlation distance of 1 mm (such assumptions
are compatible with the variation predictions laid out in [19]). The characterization
was performed on a 45 nm library provided in the design kit from [33].
A number of data-intensive benchmarks are used in our experiments with
FastYield. The benchmark control data flow graphs include chem and steam,
several different DCT algorithms including pr, wang, and dir, and a couple of
DSP programs such as mcm and honda [20]. The benchmarks are profiled in
Table 3.1. Each node in the benchmarks is either an addition/subtraction or a
multiplication.
3.7.1 Spatial correlation in timing analysis
Table 3.2: Correlation vs. No-Correlation Experimental Results
            85% Yield Clk (ns)   Corr reduction in Clk   Corr 85% PY Gain
Benchmark   Corr     No Corr     over No Corr (%)        over No Corr (%)
chem        5.91     6.20        4.70                    14.97
dir         4.91     5.14        4.49                    14.98
honda       5.14     5.30        3.03                    14.37
mcm         4.09     4.28        4.35                    10.56
pr          4.45     4.66        4.51                    14.99
steam       5.54     5.80        4.51                    14.89
wang        4.91     5.11        3.98                    14.99
Average                          4.22                    14.25

In order to show the importance of considering spatially correlated process
parameters during the timing analysis, we performed a floorplanning and timing
analysis on the same binding solution twice: with spatial correlation, with θind from
equation (3.2) set to 0 (Corr), and without correlation, with θind = 1 (No Corr).
By equation (3.2), setting θind = 0 makes σ²intra = 0 and σ²inter = σ²FU. This makes it possible for all the
FU variation to be correlated between units; however, the actual correlation
between the FUs is still found based on the distance between them. The results
are shown in Table 3.2. Columns 2 and 3 show the clock period obtained for an
85% yield with Corr and No Corr, respectively. Column 4 shows the reduction
in clock period of the Corr result over the No Corr result, which averages
4.22%. Column 5 reports the performance yield (PY) gain of Corr over No
Corr, which averages about 14.25%. That is, for the No Corr clock period given,
one would expect to achieve an 85% PY based on the No Corr timing analysis
but would achieve an 85% + 14.25% = 99.25% PY based on the Corr timing
analysis. In other words, timing analysis without consideration of correlated
process parameters is conservative compared to correlated timing analysis.
This shows the importance of using spatial correlation information to guide the
floorplanner, as well as performing accurate SSTA.
Figure 3.4: Chem delay distributions
3.7.2 FastYield compared to BindBWM
We compare the results of FastYield after rebinding (FY Rebind) to an enhanced
version of the weighted bipartite graph–based binding (here referred to as
BindBWM) of Huang et al. [11]. The enhancements to [11] include module
selection and the ability to specify an area constraint, making the comparison
demonstrative of the performance yield gains that can be achieved when
considering process variation during binding. The same schedules, area constraints,
and library were used in both algorithms. We also compare FY Rebind performance
to the performance attained by FastYield before rebinding (FY Initial) to
show the effect of timing information on the rebinding solution. In all of the
benchmarks, the same number of adders and multipliers were allocated in the
binding solution for BindBWM, FY Initial, and FY Rebind.
Table 3.3 summarizes the experimental results. Columns 2, 4, and 6 give
the 85% PY clock periods for BindBWM, FY Initial, and FY Rebind, respectively.
Columns 3 and 5 give the PY attained by the BindBWM and FY Initial binding
solutions when clocked at the 85% PY clock period of FY Rebind. Figure 3.4
demonstrates this graphically by plotting the cumulative distribution functions (cdfs) for the different
binding results of chem (pdfs are inset). If clocked at the 85% PY clock period
of FY Rebind, FY Initial and BindBWM have PYs of 67.7% and 12.5%,
respectively.
Table 3.3: FastYield Experimental Results

Column key:
  (1)  Benchmark
  (2)  BindBWM: 85% yield clk (ns)
  (3)  BindBWM: PY at FY Rebind 85% clk (%)
  (4)  FastYield Initial: 85% yield clk (ns)
  (5)  FastYield Initial: PY at FY Rebind 85% clk (%)
  (6)  FastYield Rebind: 85% yield clk (ns)
  (7)  Total FY run time (min)
  (8)  Comparison: FY Rebind reduction in clk over BindBWM (%)
  (9)  Comparison: FY Rebind 85% PY gain over BindBWM (%)
  (10) Comparison: FY Rebind reduction in clk over FY Initial (%)
  (11) Comparison: FY Rebind 85% PY gain over FY Initial (%)

(1)        (2)   (3)    (4)   (5)    (6)   (7)   (8)     (9)    (10)   (11)
chem       6.9   12.5   6.1   67.7   6.0   75    14.17   72.5   2.35   17.3
dir        5.8    1.5   4.9   70.9   4.8   43    16.71   83.5   1.76   14.1
honda      5.7    8.1   4.9   82.6   4.9   28    14.39   76.9   0.32    2.4
mcm        4.9   11.4   4.3   78.0   4.2   40    14.57   73.6   3.34    7.0
pr         5.2    0.1   4.5   70.1   4.3   24    16.47   84.9   3.04   14.9
steam      6.2    7.6   5.5   76.3   5.5   64    11.88   77.4   1.14    8.7
wang       5.3    1.6   4.7   80.8   4.6   16    13.29   83.4   0.95    4.2
Average                                          14.50   78.9   1.84    9.8
In Table 3.3, Column 7 gives the total FY run time in minutes. Columns
8 and 10 give the FY Rebind percentage reduction in clock period when
compared to BindBWM and FY Initial, respectively. Columns 9 and 11 give the
PY gain (in percent) of FY Rebind over BindBWM and FY Initial, respectively.
This means that if BindBWM or FY Initial were clocked at the 85% PY clock
period of FY Rebind, they would have a PY smaller than 85% by the given
amount.
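As a check on how the PY gain columns follow from the PY columns, the short sketch below recomputes columns 9 and 11 as 85% minus the PY reported in columns 3 and 5, respectively. The PY values are those reported in Table 3.3; the variable and function names are illustrative only.

```python
# PY (%) of BindBWM and FY Initial when clocked at the FY Rebind 85% PY
# clock period, as reported in columns 3 and 5 of Table 3.3.
PY_AT_REBIND_CLK = {
    "chem":  (12.5, 67.7),
    "dir":   (1.5, 70.9),
    "honda": (8.1, 82.6),
    "mcm":   (11.4, 78.0),
    "pr":    (0.1, 70.1),
    "steam": (7.6, 76.3),
    "wang":  (1.6, 80.8),
}

TARGET_PY = 85.0  # FY Rebind is clocked at its own 85% PY clock period


def py_gain(py_at_rebind_clk):
    """PY gain of FY Rebind: how far the other binding falls short of 85%."""
    return TARGET_PY - py_at_rebind_clk


gain_vs_bwm = {name: py_gain(bwm) for name, (bwm, _) in PY_AT_REBIND_CLK.items()}
gain_vs_initial = {name: py_gain(init) for name, (_, init) in PY_AT_REBIND_CLK.items()}

avg_gain_vs_bwm = sum(gain_vs_bwm.values()) / len(gain_vs_bwm)
avg_gain_vs_initial = sum(gain_vs_initial.values()) / len(gain_vs_initial)

for name in PY_AT_REBIND_CLK:
    print(f"{name:6s} {gain_vs_bwm[name]:5.1f} {gain_vs_initial[name]:5.1f}")
print(f"avg    {avg_gain_vs_bwm:5.1f} {avg_gain_vs_initial:5.1f}")
```

The averages computed this way round to the 78.9% and 9.8% figures reported for columns 9 and 11.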
By considering process variation and layout, FY Rebind is able to reduce
the clock period of the benchmarks by an average of 14.5% and increase the
performance yield by an average of 78.9%, compared to BindBWM. It also
improves clock period and PY by an average of 1.84% and 9.8%, respectively,
over FY Initial. In some cases the clock period improvement that rebinding
can achieve is limited by how many units of the critical type lie on critical
paths. For example, if there are 4 allocated multipliers, all of which are
found to be critical, then rebinding cannot offer much improvement, because
reassignment of operations from one critical path to another will have a
negligible impact on the overall clock period. However, if only 3 of 4
allocated multipliers are found to be critical, then operations can be moved
to the non-critical multiplier and rebinding can offer more improvement.
Often, though, even if the reassignment of operations has a small effect
on mean clock period, it can have a large impact on the variance of the clock
period, thus improving the PY significantly. This can be seen in Figure 3.4,
where there is a large improvement in the delay cdf between BindBWM
and FY Rebind. This explains the results in column 3 of Table 3.3, where we
see that when BindBWM is clocked at the 85% PY clock value of FY Rebind,
the PY is very small. The difference between FY Rebind and FY Initial is
not as drastic, but there are two key improvements. First, the mean of the
pdf has been shifted to a lower clock value. Second, the variance has been
reduced. Combining these two improvements results in a significant PY jump
for a relatively minor change in mean clock period.
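The combined effect of the two improvements can be sketched numerically, assuming (for illustration only) that the clock-period distribution is Gaussian. The means, standard deviations, and clock period below are hypothetical and are not the chem data plotted in Figure 3.4.

```python
from statistics import NormalDist


def performance_yield(mean, std, clock):
    """Fraction of chips whose delay does not exceed the clock period."""
    return NormalDist(mean, std).cdf(clock)


CLOCK = 6.0  # ns, hypothetical target clock period

# Hypothetical delay statistics before and after rebinding (ns).
MEAN_BEFORE, STD_BEFORE = 5.90, 0.25
MEAN_AFTER, STD_AFTER = 5.80, 0.19

py_before = performance_yield(MEAN_BEFORE, STD_BEFORE, CLOCK)
py_mean_only = performance_yield(MEAN_AFTER, STD_BEFORE, CLOCK)  # mean shift alone
py_var_only = performance_yield(MEAN_BEFORE, STD_AFTER, CLOCK)   # variance reduction alone
py_after = performance_yield(MEAN_AFTER, STD_AFTER, CLOCK)       # both together

mean_shift_pct = 100.0 * (MEAN_BEFORE - MEAN_AFTER) / MEAN_BEFORE

print(f"mean shift:            {mean_shift_pct:.2f}%")
print(f"PY before:             {100 * py_before:.1f}%")
print(f"PY, mean shift only:   {100 * py_mean_only:.1f}%")
print(f"PY, variance only:     {100 * py_var_only:.1f}%")
print(f"PY, both improvements: {100 * py_after:.1f}%")
```

With these assumed numbers, a mean shift of under 2% combined with a narrower distribution moves the PY at the fixed clock period by considerably more than either change does alone.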
CHAPTER 4
CONCLUSIONS AND FUTURE WORK
4.1 Multi-Clock SSTA
MCSSTA is a new method for considering multiple clock domains during SSTA.
The method is general enough to handle many complicated timing constraints
such as false paths, thru-x, and multi-cycle paths and can be incorporated
into different SSTA algorithms. We implement our algorithm using a modified
PERT-like traversal and show that it does not introduce significant additional
error into the timing analysis over a single-clock-domain SSTA algorithm. To
the best of our knowledge, this is the first SSTA algorithm able to handle
complex timing constraints such as multiple clock domains and false paths.
4.2 FastYield
The chapter on FastYield presents a new variation-aware algorithm for simultaneous
binding and module selection. FastYield incorporates many competing factors
into its algorithms that are not found in previous variation-aware algorithms.
It considers register, multiplexer, and functional unit usage as well as spatial
correlation among the resources during SSTA embedded in a floorplanner. The
importance of spatial correlation during SSTA was demonstrated. On average,
FastYield achieves an 85% performance yield clock period that is 14.5% smaller,
and a performance yield gain of 78.9%, compared to a variation-unaware and
layout-unaware algorithm based on [11]. Also, by making use of accurate timing
information, FastYield’s rebinding improves performance yield by an average
of 9.8% over the initial binding, for the same clock period. This result shows
that by performing statistical layout-driven synthesis, substantial gains in
performance yield can be made.
4.3 Future Work
There are several ways in which we wish to expand upon this research. The
first is to combine FastYield with MCSSTA so that FastYield can support
multi-clock domains. In its current form, FastYield assumes that all FUs are
single-cycle units. We feel that this limitation reduces the opportunities that
are available for rebinding because it is often the case that the slowest type
of unit is also the largest. This results in only a few of the slowest FUs being
allocated, so there are not many places to which operations can be rebound.
By supporting multi-cycle units we can make the delay of the slow units, such
as multipliers, about the same as the delay of the adders, allowing us more
opportunities to take advantage of rebinding.
The second avenue that we wish to explore is to make the scheduling
for FastYield variation-aware. In its current form, FastYield uses a simple
list-scheduling algorithm to come up with a schedule. By making the schedule
variation-aware, we should be able to improve the overall performance yield
of the circuit or reduce the clock period of the circuit for a given performance
yield requirement. By making the scheduling variation-aware and supporting
multi-cycle FUs, we should be able to obtain better results.
Overall, the amount of research that has been performed on variation-aware
HLS is minimal. There has been some research performed into the individual
subtasks of HLS and some work that has simultaneously performed the tasks
through simulated annealing, but there still remains ample opportunity for
further research into variation-aware HLS.
REFERENCES
[1] S. Borkar, T. Karnik, S. Narendra, J. Tschanz, A. Keshavarzi, and V. De,
“Parameter variations and impact on circuits and microarchitecture,” in
Design Automation Conference, 2003, pp. 338–342.
[2] S. Nassif, “Statistical design on the verge of maturity: Revisiting the
foundation,” in Tutorials of the Asia South Pacific Design Automation
Conference, 2009, pp. 264–311.
[3] J. Xiong, V. Zolotov, and L. He, “Robust extraction of spatial correlation,”
in Proceedings of the International Symposium on Physical Design, 2006,
pp. 2–9.
[4] H. Chang and S. S. Sapatnekar, “Statistical timing analysis considering
spatial correlations using a single PERT-like traversal,” in Proceedings
of the International Conference on Computer-Aided Design, 2003, pp.
621–625.
[5] S. Tsukiyama, M. Tanaka, and M. Fukui, “A statistical static timing
analysis considering correlations between delays,” in Proceedings of the
Asia South Pacific Design Automation Conference, 2001, pp. 353–358.
[6] Z. Feng, P. Li, and Y. Zhan, “Fast second-order statistical static timing
analysis using parameter dimension reduction,” in Proceedings of the
Design Automation Conference, 2007, pp. 244–249.
[7] L. Zhang, W. Chen, Y. Hu, J. Gubner, and C.-P. Chen,
“Correlation-preserved non-Gaussian statistical timing analysis with
quadratic timing model,” in Proceedings of the Design Automation
Conference, 2005, pp. 83–88.
[8] H. Chang, V. Zolotov, S. Narayan, and C. Visweswariah, “Parameterized
block-based statistical timing analysis with non-Gaussian parameters,
nonlinear delay functions,” in Proceedings of the Design Automation
Conference, 2005, pp. 71–76.
[9] K. Wakabayashi and T. Okamoto, “C-based SoC design flow and EDA
tools: An ASIC and system vendor perspective,” IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems, vol. 19, no. 12,
pp. 1507–1522, Dec. 2000.
[10] A. Hashimoto and J. Stevens, “Wire routing by optimizing channel
assignment within large apertures,” in Design Automation Conference,
1971, pp. 155–169.
[11] C.-Y. Huang, Y.-S. Chen, Y.-L. Lin, and Y.-C. Hsu, “Data path allocation
based on bipartite weighted matching,” in Proceedings of the Design
Automation Conference, 1990, pp. 499–504.
[12] A. Devgan and C. Kashyap, “Block-based static timing analysis
with uncertainty,” in Proceedings of the International Conference on
Computer-Aided Design, 2003, pp. 607–614.
[13] A. Agarwal, V. Zolotov, and D. Blaauw, “Statistical timing analysis using
bounds and selective enumeration,” IEEE Transactions on Computer-Aided
Design of Integrated Circuits and Systems, vol. 22, no. 9, pp. 1243–1260,
Sept. 2003.
[14] J. Le, X. Li, and L. Pileggi, “STAC: Statistical timing analysis with
correlation,” in Proceedings of the Design Automation Conference, 2004,
pp. 343–348.
[15] L. Cheng, D. Chen, M. D. F. Wong, M. Hutton, and J. Govig, “Timing
constraint-driven technology mapping for FPGAs considering false paths
and multi-clock domains,” in Proceedings of the International Conference
on Computer-Aided Design, 2007, pp. 370–375.
[16] C. E. Clark, “The greatest of a finite set of random variables,” Operations
Research, vol. 9, no. 2, pp. 145–162, 1961.
[17] M. Hutton, D. Karchmer, B. Archell, and J. Govig, “Efficient static timing
analysis and applications using edge masks,” in Proceedings of the 13th
International Symposium on Field-Programmable Gate Arrays, 2005, pp.
174–183.
[18] Nangate Incorporated, “Nangate 45 nm open cell library,” April 2009,
[Online]. Available: http://edageek.com/2008/03/03/freepdk45-src-nsf/.
[19] S. Nassif, “Design for variability in DSM technologies [deep submicron
technologies],” in Proceedings of the International Symposium on Quality of
Electronic Design, 2000, pp. 451–454.
[20] M. B. Srivastava and M. Potkonjak, “Optimum and heuristic
transformation techniques for simultaneous optimization of latency and
throughput,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems,
vol. 3, no. 1, pp. 2–19, 1995.
[21] D. Chen, J. Cong, and Y. Fan, “Low-power high-level synthesis for FPGA
architectures,” in Proceedings of the International Symposium on Low
Power Electronics and Design, 2003, pp. 134–139.
[22] M. R. Guthaus, N. Venkateswaran, C. Visweswariah, and V. Zolotov,
“Gate sizing using incremental parameterized statistical timing analysis,”
in Proceedings of the International Conference on Computer-Aided Design,
2005, pp. 1029–1036.
[23] I.-J. Lin, T.-Y. Ling, and Y.-W. Chang, “Statistical circuit optimization
considering device and interconnect process variations,” in Proceedings of
the International Workshop on System Level Interconnect Prediction, 2007,
pp. 47–54.
[24] G. De Micheli, Synthesis and Optimization of Digital Circuits. New York,
NY: McGraw-Hill, 1994.
[25] D. D. Gajski, N. D. Dutt, A. C.-H. Wu, and S. Y.-L. Lin, High-Level
Synthesis: Introduction to Chip and System Design. Norwell, MA: Kluwer
Academic Publishers, 1992.
[26] A. Raghunathan, N. K. Jha, and S. Dey, High-Level Power Analysis and
Optimization. Norwell, MA: Kluwer Academic Publishers, 1998.
[27] R. Camposano and W. Wolf, Eds., High-Level VLSI Synthesis. Norwell, MA: Kluwer
Academic Publishers, 1991.
[28] Y.-M. Fang and D. F. Wong, “Simultaneous functional-unit binding
and floorplanning,” in Proceedings of the International Conference on
Computer-Aided Design, 1994, pp. 317–321.
[29] W.-L. Hung, X. Wu, and Y. Xie, “Guaranteeing performance yield in
high-level synthesis,” in Proceedings of the International Conference on
Computer-Aided Design, 2006, pp. 303–309.
[30] J. Jung and T. Kim, “Timing variation-aware high-level synthesis,” in
Proceedings of the International Conference on Computer-Aided Design,
2007, pp. 424–428.
[31] P. Friedberg, Y. Cao, J. Cain, R. Wang, J. Rabaey, and C. Spanos,
“Modeling within-die spatial correlation effects for process-design
co-optimization,” in Proceedings of the International Symposium on Quality
of Electronic Design, 2005, pp. 516–521.
[32] F. Wang, G. Sun, and Y. Xie, “A variation aware high level synthesis
framework,” in Proceedings of the Conference on Design, Automation and
Test in Europe, 2008, pp. 1063–1068.
[33] Oklahoma State University VLSI Computer Architecture Research,
“OSU design flows for MOSIS SCMOS-subm design flow FreePDK
45 nm variation-aware design flow,” June 2009, [Online]. Available:
http://vcag.ecen.okstate.edu/projects/scells/.
[34] S. N. Adya and I. L. Markov, “Fixed-outline floorplanning: Enabling
hierarchical design,” IEEE Transactions on Very Large Scale Integration
(VLSI) Systems, vol. 11, no. 6, pp. 1120–1135, Dec. 2003.
[35] H. Bakoglu and J. Meindl, “Optimal interconnection circuits for VLSI,”
IEEE Transactions on Electron Devices, vol. 32, no. 5, pp. 903–909, May
1985.
[36] N. Nagaraj, T. Bonifield, A. Singh, C. Bittlestone, U. Narasimha, V. Le,
and A. Hill, “BEOL variability and impact on RC extraction,” in Design
Automation Conference, 2005, pp. 758–759.
