Minflotransit: Min-Cost Flow Based Transistor Sizing Tool by Vijay Sundararajan et al.
MINFLOTRANSIT: MIN-COST FLOW BASED TRANSISTOR SIZING TOOL
Vijay Sundararajan, Sachin S. Sapatnekar, Keshab K. Parhi
Dept. of ECE, University of Minnesota
Minneapolis, MN 55455
E-mail: vijay@ece.umn.edu, sachin@ece.umn.edu, parhi@ece.umn.edu
ABSTRACT
This paper presents MINFLOTRANSIT, a new transistor sizing tool for fast
sizing of combinational circuits with minimal cost. MINFLOTRANSIT is
an iterative relaxation based tool that has two alternating phases. For a
circuit with $|V|$ transistors and $|E|$ wires, the first phase (D-phase) is
based on minimum cost network flow, which in our application has a
worst-case complexity of $O(|V||E|\log(\log(|V|)))$. The second phase
(W-phase) has a worst-case complexity of $O(|V||E|)$. In practice, during
our simulations both the D-phase and W-phase show a near-linear run-time
dependence on the size of the circuit, comparable to TILOS. Simulation
results show excellent run-time behavior for MINFLOTRANSIT on all the
ISCAS85 benchmark circuits. For reasonable delay targets MINFLOTRANSIT
shows up to 16.5% area savings over a circuit sized using a TILOS-like
algorithm.
1. INTRODUCTION
As evidenced by the successive announcement of ever faster com-
puter systems, increasing the speed of VLSI systems is one of the
major requirements for VLSI system designers today. Faster in-
tegrated circuits are making possible newer applications that were
traditionally considered difﬁcult to implement in hardware. In this
scenario of increasing circuit complexity, reduction of circuit delay
in integrated circuits is an important design objective. Transistor
sizing is one such task that has been employed for speeding up cir-
cuits for quite some time now [1]. Given the circuit topology, the
delay of a combinational circuit can be controlled by varying the
sizes of transistors in the circuit. Here, the size of a transistor is
measured in terms of its channel width, since the channel lengths
of MOS transistors in a digital circuit are generally uniform. In any
case, what really matters is the ratio of channel width to channel
length, and if channel lengths are not uniform, this ratio can be con-
sidered as the size. In coarse terms, the circuit delay can usually be
reduced by increasing the sizes of certain transistors in the circuit
from the minimum size. Hence, making the circuit faster usually
entails the penalty of increased circuit area relative to a minimum
sized circuit, and the area-delay trade-off involved here is the problem
of transistor size optimization. A problem related to transistor
sizing is called gate sizing, where a logic gate in a circuit is modeled
as an equivalent inverter and the sizing optimization is carried out
on this modiﬁed circuit with equivalent inverters in place of more
complex gates. There is, therefore, a reduction in the number of
size parameters corresponding to every gate in the circuit. Needless
to say, this is an easier problem to solve than the general transistor
sizing problem.
There has been a large amount of work done on transistor sizing
[1–8] that underlines the importance of this optimization technique.
Starting from a minimum sized circuit, TILOS [1] uses a greedy
strategy for transistor sizing by iteratively sizing transistors in the
critical path. A sensitivity factor is calculated for every transistor in
the critical path to quantify the gain in circuit speed achieved by a
unit upsizing of the transistor. The most sensitive transistor is then
bumped up in size by a small constant factor to speed up the circuit.
This process is repeated iteratively until the timing requirements
are met. The technique is extremely simple to implement and has
run-time behavior proportional to the size of the circuit. (Footnote: This
research has been supported in part by the ARO under grant number
DA/DAAG55-98-1-0315 and by SRC under grant number 99-TJ-692.) Its chief
drawback is that it does not have guaranteed convergence proper-
ties and hence is not an exact optimization technique. Among the
past approaches, only [2] and [8] are exact optimization techniques
and the other techniques do not have proven convergence properties.
While [2] was the ﬁrst ever polynomial time technique reported for
addressing the problem exactly, it does not have very good run-time
behavior. The technique presented in [8] has shown impressive run-
time behavior for sizing large adders. The run-time behavior of this
technique on more complex circuits such as multipliers and con-
trollers was not demonstrated. Moreover, the approach appears to
be amenable only to tackling sizing problems where the gate delay
is expressed using Elmore delays. This paper presents a novel way
of solving the transistor sizing problem exactly and in an extremely
fast manner. The proposed approach has some similarity in form
to [6, 7] which will be subsequently explained, but the similarity
in content is minimal and the details of implementation are vastly
different. In essence, the proposed technique and the techniques
in [6,7] are iterative relaxation approaches that involve a two-step
optimization strategy. The ﬁrst-step involves a delay budgeting step
where optimal delays are computed for transistors/gates. The sec-
ond step involves sizing transistors optimally to achieve these delay
budgets. The two steps are iteratively alternated until the solution
converges, i.e., until the delay budgets calculated in the ﬁrst step
are exactly satisﬁed by the transistor sizes determined by the sec-
ond step. The primary features of the proposed approach are:
1. It is computationally fast and is comparable to TILOS in its
run-time behavior.
2. It can be used for true transistor sizing as well as the relaxed
problem of gate-sizing. Additionally, the approach can easily
incorporate wire sizing, as outlined in section 2.1.
3. It can be adapted for more general delay models than the El-
more delay model.
To elaborate further on the last point, the proposed model only re-
quires the transistor delay to be expressible as a sum of simple
monotonic functionals, deﬁned in section 2.1, of transistor sizes and
hence admits more general delay models than just the Elmore delay
model.
The starting point for the proposed approach is a fast guess so-
lution. This could be obtained, for example, from a circuit that
has been optimized using TILOS to meet the given delay require-
ments. The proposed approach, as outlined earlier, is an iterative
relaxation procedure that involves an alternating two-phase relaxed
optimization sequence that is repeated iteratively until convergence
is achieved. The two phases in the proposed approach are:
• The D-phase, where transistor sizes are assumed fixed and transistor
delays are regarded as variable parameters. Irrespective of the delay
model employed, this phase can be formulated as the dual of a min-cost
network flow problem. Using $|V|$ to denote the number of transistors and
$|E|$ the number of wires in the circuit, this step in our application has
worst-case complexity of $O(|V||E|\log(\log|V|))$ [9].
• The W-phase, where transistor/gate delays are assumed fixed and their
sizes are regarded as variable parameters. As long as the gate delay can
be expressed as a separable function of the transistor sizes, this step
can be solved as a Simple Monotonic Program (SMP) [10]. The complexity of
SMP is similar to that of an all-pairs shortest path algorithm in a
directed graph [10, 11], i.e., $O(|V||E|)$.

Figure 1. The DAG corresponding to a 3-input static CMOS NAND gate, with
NMOS vertices $N_1, N_2, N_3$ and PMOS vertices $P_4, P_5, P_6$. The figure
annotates $\mathrm{Delay}(N_3) = \frac{A}{x_3}(B_p x_4 + B_p x_5 + B_p x_6 + C_L + E)$.
The objective function for the problem is minimization of circuit
area. In the W-phase, this objective is addressed directly, and in the
D-phase the objective is chosen to facilitate a move in the solution
space in a direction that is known to lead to a reduction in the circuit
area.
2. PROPOSED APPROACH
The transistor size optimization problem can be stated as
\[
\text{minimize} \sum_{\text{all transistors } i} x_i, \quad
\text{subject to: } \mathrm{Delay}(\mathrm{Circuit}) < T, \quad
\mathit{minsize} \le x_i \le \mathit{maxsize}, \tag{1}
\]
where $x_i$ refers to the size of transistor $i$, $T$ is the timing
requirement specified as an input to the optimization, and
$\mathit{minsize}$, $\mathit{maxsize}$ are, respectively, minimum and
maximum bounds on the sizes of transistors in the circuit.
2.1. Equivalent DAG for Transistors/Gates
The proposed approach requires transistor delay to be expressible
as a sum of simple monotonic functionals of transistor sizes, which
are defined as follows:
Definition 1 A function $D_i(x_1, \ldots, x_n)$ is a simple monotonic
functional if it can be rewritten as
$D_i = g(x_i)\, q(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)$, where $g(x_i)$
is a monotonic decreasing function of $x_i$ and
$q(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n)$ is a monotonic increasing
function of each $x_j$, $j \in \{1, \ldots, n\}$, $j \neq i$.
Definition 2 A function $D(x_1, \ldots, x_n)$ is termed decomposable into
simple monotonic functionals if $D = \sum_{i \in \{1, \ldots, n\}} D_i$, where
each $D_i$, as in Definition 1, is a simple monotonic functional. $D_i$ is
termed the simple monotonic projection of $D$ on $x_i$.
In order to mathematically model the transistor size optimization
problem, every static CMOS gate in the circuit is ﬁrst converted in
to an equivalent Directed Acyclic Graph (DAG) model, shown in
ﬁgure 1, as follows. There is a vertex in the DAG corresponding
to every transistor. An edge is drawn between an NMOS (PMOS)
transistor and another NMOS (PMOS) transistor provided there is
a discharging (charging) path consisting of the two transistors. The
edge is always directed from the transistor higher up in the discharging
(charging) path to the transistor lower down in the discharging
(charging) path. Every vertex of this DAG has a delay attribute
associated with it. This delay attribute is given by the simple mono-
tonic projection of the worst case discharging (charging) path delay
through the transistor corresponding to the vertex on to the size of
this transistor. Note that, as will be evident soon, the rise and fall
delays are implicitly distinguished due to the fact that the DAG cor-
responding to every gate has separate components for pullup and
pulldown networks.
For ease of exposition we will henceforth consider the commonly
used Elmore delay model, which can be decomposed into simple
monotonic functionals. Assuming $x_i$ to be the size of transistor
$N_i$ ($P_i$) in the pulldown (pullup) network of the 3-input NAND gate
shown in figure 1, it can be shown [1] that the pulldown Elmore
delay can be expressed as
\begin{align*}
\mathit{delay}_{pulldown} ={}& \frac{A}{x_1}(B x_1 + C x_2)
 + \left(\frac{A}{x_1} + \frac{A}{x_2}\right)(B x_2 + C x_3 + D) \\
&+ \left(\frac{A}{x_1} + \frac{A}{x_2} + \frac{A}{x_3}\right)
 (B x_3 + B_p x_4 + B_p x_5 + B_p x_6 + C_L + E), \tag{2}
\end{align*}
where $A$, $B$ and $C$ are constant coefficients: the resistance, drain
capacitance and source capacitance, respectively, of a unit NMOS
transistor. $D$ and $E$ are related to the wire capacitances. Similarly,
$B_p$ is the drain capacitance of a unit-sized PMOS transistor and $C_L$ is
the load capacitance. Under this model, if wire sizes were also considered
to be variables, the form of (2) would remain similar. Rewriting
the expression in (2) we get
\begin{align*}
\mathit{delay}_{pulldown} ={}& \frac{A}{x_1}(B x_2 + B x_3 + C x_2 + C x_3 + D + E + B_p x_4 + B_p x_5 + B_p x_6 + C_L) \\
&+ \frac{A}{x_2}(B x_3 + C x_3 + D + E + B_p x_4 + B_p x_5 + B_p x_6 + C_L) \\
&+ \frac{A}{x_3}(B_p x_4 + B_p x_5 + B_p x_6 + C_L + E) + 3AB. \tag{3}
\end{align*}
Since $A$, $B$, $C$, $D$, $E$, $B_p$, $C_L$ are non-negative quantities, the
Elmore delay model admits a simple monotonic decomposition. This fact is
illustrated in figure 1, where the delay corresponding to NMOS transistor
$N_3$ is explicitly shown. The constant terms used in this expression have
the same connotation as in (3). We claim that
such a DAG model can always be developed for any complex static
CMOS gate consisting of a series/parallel network of transistors as
long as the underlying delay model admits a simple monotonic de-
composition. This is a reasonable requirement since the reduction
in the gate delay with an increase in its size can be modeled by the
function $g$ in Definition 1, and the increase in the gate delay with
increasing fanout gate sizes can be modeled by the function $q$.
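The rewriting of (2) into (3) can be checked numerically. The following Python sketch evaluates both forms for one set of positive sizes and coefficients; the function names and the sample values are illustrative assumptions and do not come from the paper.

```python
def elmore_pulldown_eq2(x, A, B, C, D, E, Bp, CL):
    """Pulldown Elmore delay of the 3-input NAND in the form of (2).

    x maps a transistor index (1..6) to its size; 1-3 are NMOS, 4-6 are PMOS.
    """
    out_cap = B * x[3] + Bp * (x[4] + x[5] + x[6]) + CL + E
    return ((A / x[1]) * (B * x[1] + C * x[2])
            + (A / x[1] + A / x[2]) * (B * x[2] + C * x[3] + D)
            + (A / x[1] + A / x[2] + A / x[3]) * out_cap)


def elmore_pulldown_eq3(x, A, B, C, D, E, Bp, CL):
    """The same delay in the decomposed form of (3): one term per NMOS size."""
    pmos_load = Bp * (x[4] + x[5] + x[6]) + CL
    term1 = (A / x[1]) * (B * x[2] + B * x[3] + C * x[2] + C * x[3]
                          + D + E + pmos_load)
    term2 = (A / x[2]) * (B * x[3] + C * x[3] + D + E + pmos_load)
    term3 = (A / x[3]) * (pmos_load + E)
    return term1 + term2 + term3 + 3 * A * B


# Arbitrary illustrative values (assumed, not measured data).
sizes = {1: 1.0, 2: 2.0, 3: 4.0, 4: 1.5, 5: 2.5, 6: 3.0}
coeffs = dict(A=2.0, B=0.5, C=0.25, D=0.1, E=0.2, Bp=0.75, CL=1.0)
d2 = elmore_pulldown_eq2(sizes, **coeffs)
d3 = elmore_pulldown_eq3(sizes, **coeffs)
```

Each of the three size-dependent terms in `elmore_pulldown_eq3` has the shape required by Definition 1: a factor `A / x[i]` that decreases in the transistor's own size, multiplied by a sum that increases in the other sizes.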
In addition, if wire sizing were also to be performed together with
transistor sizing, then we could model the problem by augmenting
the DAG corresponding to a gate by adding vertices corresponding
to each wire. The edges emanating from and incident on these wires
will be similarly constructed as for transistors. The delay attribute
of a vertex corresponding to any wire can also be similarly deﬁned
as for that of a vertex corresponding to a transistor. We conclude
that modeling the problem of wire sizing along with transistor siz-
ing may use the same framework as transistor sizing alone, and the
approach developed in this paper can simultaneously handle both.
For ease of exposition, wire sizing will not be considered in the
remainder of the paper.
Note that the DAG corresponding to a static CMOS gate has
at least two disjoint connected components, as shown in ﬁgure 1,
corresponding to the pulldown network of NMOS transistors (i.e.,
vertices $N_1, N_2, N_3$) and the pullup network of PMOS transistors
(i.e., vertices $P_4, P_5, P_6$) corresponding to the gate. The portion of
the DAG representing the NMOS pulldown networks corresponds
to falling transitions and the portion of the DAG representing the
PMOS pullup network is related to rising transitions at the output
of the gate. Note that there are several vertices in the DAG of a gate
that only have edges emanating from them and have no edges termi-
nating on them; we refer to these vertices as the root vertices of the
gate DAG. Also, note that there are several vertices in the DAG of
a given gate that have only edges terminating on them and no edges
emanating from them; these vertices constitute the leaf vertices of
the gate DAG.
2.2. A DAG for the Circuit
The entire circuit consisting of static CMOS gates can be represented
with an equivalent DAG, $G = (V, E)$, by connecting the component DAGs of
individual gates. The construction of the circuit DAG is as follows. The
vertex set $V$ of the circuit DAG is simply the union of the vertex sets
of the DAGs corresponding to the gates in the circuit. The edge set $E$ of
the circuit DAG is constructed as follows. For every wire connecting the
output of one gate to the input of another there will be a set of edges in
the circuit DAG that go from the NMOS (PMOS) DAG components of the first
gate to the PMOS (NMOS) DAG components of the second gate. So
corresponding to every wire connecting the output of the first gate to
a given NMOS (PMOS) transistor in the second gate, there will be
edges emanating from all the leaf vertices of the PMOS (NMOS)
DAG of the first gate. These edges terminate on all the root vertices
of the NMOS (PMOS) DAG component of the second gate that are
connected to the given transistor in the second gate. Figure 2
illustrates the construction of the circuit DAG for a circuit consisting
of two 3-input NAND gates in series.

Figure 2. The DAG corresponding to a circuit consisting of
two 3-input static CMOS NAND gates in series.
2.3. Two-Phase Optimization
Note that the delay corresponding to a vertex $i$ in the circuit DAG,
whose corresponding transistor has a size $x_i$, can always be
expressed as
\[
\mathit{delay}(i) = \frac{\sum_{j \in S(V(G))} a_{ij} x_j + b_i}{x_i}, \tag{4}
\]
where $S(V(G))$ denotes some subset of $V(G)$ that is located in
the neighborhood of the vertex $i$. In particular, this subset consists
of the vertices corresponding to all those transistors whose sizes
directly affect the delay of the transistor corresponding to vertex $i$.
Also, note that all the coefficients $a_{ij}$, $b_i$ in (4) are non-negative.
Rearranging (4), we have
\[
\mathit{delay}(i)\, x_i - \sum_{j \in S(V(G))} a_{ij} x_j = b_i. \tag{5}
\]
Denoting a diagonal matrix whose $(i, i)$th entry is $\mathit{delay}(i)$ by
$D$, a matrix whose $(i, j)$th entry is $a_{ij}$ by $A$, a column vector whose
$i$th component is $b_i$ by $B$, and a column vector whose $i$th component
is $x_i$ by $X$, we can rewrite (5) as
\[
(D - A) X = B. \tag{6}
\]
This formulation can be written as long as the delay model admits
a simple monotonic decomposition. It can be shown that for strict
gate sizing the matrix $(D - A)$ can be written as an upper triangular
matrix. This is due to the fact that the adjacency matrix of the circuit
DAG can always be written as an upper triangular matrix [12]. On
the other hand, it can be shown that for transistor sizing the matrix
$(D - A)$ can be written as a block upper triangular matrix (proof
not included due to space restrictions). We will henceforth assume
that $(D - A)$ can always be represented in an upper triangular form
or block upper triangular form.
With this assumption we can state that if $D$ is a constant matrix, then
as long as $(D - A)$ is invertible, a system of equations of the form
$(D - A) X = B$ can be solved for the variables $X$ by a backward
substitution process beginning from the bottom row and proceeding
upwards, progressively solving for all $x_i$. From a circuit point
of view, this process proceeds in a backward breadth-first manner
beginning with the primary outputs and proceeding backwards in
order of decreasing levels of logic of the circuit. This elimination
process has $O(|F| N)$ computational complexity, where $N$ is the
number of components in the vector $X$ and $|F|$ is bounded (in gate
sizing) by the maximum fanout of any gate in the circuit. Note that
as long as all vertices of the circuit DAG have a non-zero delay,
which is always the case, the (block) upper triangular matrix
$(D - A)$ for (transistor) gate sizing will always be invertible.
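The backward substitution described above can be sketched in a few lines of Python for the strictly upper triangular (gate sizing) case. The sparse row representation, the function name `solve_sizes` and the three-vertex example are assumptions made for illustration only.

```python
def solve_sizes(delay, a, b):
    """Solve (D - A) X = B by backward substitution.

    delay[i] : delay of vertex i (the i-th diagonal entry of D), non-zero
    a[i]     : dict {j: a_ij} holding only entries with j > i, so that
               D - A is upper triangular
    b[i]     : constant term b_i
    """
    n = len(delay)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):            # bottom row first
        load = sum(a_ij * x[j] for j, a_ij in a[i].items())
        # Row i of (5): delay(i) * x_i - sum_j a_ij * x_j = b_i
        x[i] = (b[i] + load) / delay[i]
    return x


# Hypothetical three-vertex chain 0 -> 1 -> 2 (values chosen for illustration).
delay = [2.0, 2.0, 1.0]
a = [{1: 1.0}, {2: 0.5}, {}]
b = [1.0, 1.0, 2.0]
x = solve_sizes(delay, a, b)
```

Each row touches only its stored off-diagonal entries, so the work per row is proportional to the number of neighbors of the vertex, in line with the $O(|F| N)$ bound quoted in the text.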
Now assume that we start with some initial sizing solution $X_0$
and some delay matrix $D_0$ satisfying (6). We now resize the
transistors slightly so that the new delay matrix is $D_0 + \Delta D$ and
the new size vector is $X_0 + \Delta X$, so that we then have
\begin{align*}
(D_0 + \Delta D - A)(X_0 + \Delta X) &= B, \\
(D_0 - A)(X_0 + \Delta X) + \Delta D (X_0 + \Delta X) &= B, \\
B + (D_0 - A)\Delta X + \Delta D X_0 + \Delta D \Delta X &= B, \\
(D_0 - A)\Delta X &\approx -\Delta D X_0, \\
\Delta X &\approx -(D_0 - A)^{-1} \Delta D X_0, \tag{7}
\end{align*}
where the term $\Delta D \Delta X$ has been ignored, assuming small
perturbations $\Delta D$ and $\Delta X$. In other words, we make the following
observations:
• For an infinitesimal resizing of the transistors corresponding
to the vertices in the circuit DAG, the infinitesimal changes in
transistor sizes can be represented as a linear function of the
infinitesimal changes in transistor delays.
• As a result of the first observation, we see that the sum of all the
components of $\Delta X$, which represents the sum of the change in sizes of
the transistors corresponding to all the vertices in the circuit, can
be expressed as a linear function of the diagonal entries of the
matrix $\Delta D$.
It can be shown that since all components of $X_0$ are positive, all
components of $-(D_0 - A)^{-1} X_0$ will be negative. Hence, the sum
of all the components of $\Delta X$ can be expressed as a linear function
of the diagonal entries of the matrix $\Delta D$, where the coefficient
corresponding to each diagonal element of $\Delta D$ is negative.
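The linearization in (7) and the sign claim above can be illustrated numerically. The Python sketch below assumes an upper triangular $(D_0 - A)$ stored row by row; the helper name `back_substitute` and all numbers are hypothetical.

```python
def back_substitute(diag, a, rhs):
    """Solve (diag - A) y = rhs, where a[i] = {j: a_ij} with j > i."""
    n = len(diag)
    y = [0.0] * n
    for i in range(n - 1, -1, -1):
        y[i] = (rhs[i] + sum(v * y[j] for j, v in a[i].items())) / diag[i]
    return y


# Hypothetical three-vertex example.
d0 = [2.0, 2.0, 1.0]             # diagonal of D_0
a = [{1: 1.0}, {2: 0.5}, {}]     # strictly upper triangular part of A
b = [1.0, 1.0, 2.0]
x0 = back_substitute(d0, a, b)   # sizes satisfying (D_0 - A) X_0 = B

dd = [0.01, 0.02, 0.01]          # small increase of every delay budget

# Linearized change from (7): (D_0 - A) dX = -dD X_0
dx_lin = back_substitute(d0, a, [-dd[i] * x0[i] for i in range(3)])

# Exact change: re-solve with the perturbed delays and subtract.
x_new = back_substitute([d0[i] + dd[i] for i in range(3)], a, b)
dx_exact = [x_new[i] - x0[i] for i in range(3)]
```

With every delay budget increased, every component of `dx_lin` comes out negative, and it agrees with `dx_exact` up to the neglected second-order term $\Delta D \Delta X$.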
This motivates a two-phase strategy for solving the transistor size
optimization problem. In the D-phase, as above, we assume ﬁxed
transistor sizes and redistribute the delay budgets in such a manner
as to minimize the resultant change in transistor sizes. In the W-
phase, on the other hand, we try to ﬁnd the minimal-sized circuit
that satisﬁes the modiﬁed delay budget obtained after the D-phase.
The two-phases are alternated till convergence is achieved and the
delay budgets output by the D-phase are exactly satisﬁed by the
transistor sizes calculated by the W-phase.
2.3.1. D-phase
First assume that the circuit has been sized to meet delay
requirements using an algorithm such as TILOS. We now define three
attributes for every vertex in the circuit DAG $G$. For a vertex $i$, these
are the arrival time $AT(i)$, the required time $RT(i)$ and the slack
$\mathit{sl}(i)$. Additionally, every edge $e_{ij} \in E$ has the attribute
edge-slack, $\mathit{esl}(e_{ij})$. The entire circuit graph $G$ has an
additional attribute $CP(G)$ that refers to the delay of the critical path
of the corresponding circuit. We will now define all of these attributes
formally.
\begin{align*}
AT(i) &= \begin{cases}
 \text{external time of arrival}, & i \in PI, \\
 \max_{j \in \mathit{fanin}(i)} \bigl(AT(j) + \mathit{delay}(j)\bigr), & \text{else},
 \end{cases} \\
CP(G) &= \max_{i \in V} \bigl(AT(i) + \mathit{delay}(i)\bigr), \\
RT(i) &= \begin{cases}
 CP(G) - \mathit{delay}(i), & i \in PO, \\
 \min_{j \in \mathit{fanout}(i)} RT(j) - \mathit{delay}(i), & \text{else},
 \end{cases} \\
\mathit{sl}(i) &= RT(i) - AT(i), \\
\mathit{esl}(e_{ij}) &= RT(j) - AT(i) - \mathit{delay}(i), \tag{8}
\end{align*}
where $PI$ and $PO$ denote respectively the primary inputs and primary
outputs of the circuit.
We call a circuit safe when all vertices $i \in V$ have $\mathit{sl}(i) \ge 0$
and all edges have $\mathit{esl}(e_{ij}) \ge 0$.
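The attributes in (8) amount to one forward and one backward pass over the DAG. The Python sketch below assumes that vertices are numbered in topological order, that vertices without fanin are primary inputs, and that vertices without fanout are primary outputs; the function name and the diamond-shaped example are illustrative.

```python
def timing_attributes(delay, edges, external_arrival=None):
    """Compute AT, RT, sl, esl and CP as defined in (8).

    delay : list of vertex delays; vertex ids 0..n-1 are assumed to be in
            topological order (every edge (i, j) has i < j)
    edges : list of (i, j) pairs
    external_arrival : optional {primary input: arrival time}, default 0
    """
    n = len(delay)
    fanin = [[] for _ in range(n)]
    fanout = [[] for _ in range(n)]
    for i, j in edges:
        fanout[i].append(j)
        fanin[j].append(i)
    ext = external_arrival or {}

    at = [0.0] * n
    for i in range(n):                        # forward pass
        if fanin[i]:
            at[i] = max(at[j] + delay[j] for j in fanin[i])
        else:
            at[i] = ext.get(i, 0.0)
    cp = max(at[i] + delay[i] for i in range(n))

    rt = [0.0] * n
    for i in range(n - 1, -1, -1):            # backward pass
        if fanout[i]:
            rt[i] = min(rt[j] for j in fanout[i]) - delay[i]
        else:
            rt[i] = cp - delay[i]

    sl = [rt[i] - at[i] for i in range(n)]
    esl = {(i, j): rt[j] - at[i] - delay[i] for i, j in edges}
    return at, rt, sl, esl, cp


# Hypothetical diamond: 0 -> 1 -> 3 and 0 -> 2 -> 3.
delay = [1.0, 3.0, 1.0, 2.0]
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
at, rt, sl, esl, cp = timing_attributes(delay, edges)
safe = all(s >= 0 for s in sl) and all(s >= 0 for s in esl.values())
```

In this example only vertex 2 and the two edges through it carry slack, which is exactly the budget that the D-phase is later allowed to redistribute.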
The D-phase involves minimally altering the delay budgets of
transistors in the circuit to move towards a feasible minimum area
solution. For this to be possible, we need to capture the slack (available
delay budget) for every transistor and also present a strategy to
alter/redistribute these delay budgets. In the next section, we will
first present an approach to capture the slack in a circuit in terms of
ﬁctitious buffer-like entities called Fictitious Speciﬁc Delay Units
(FSDU’s). Next, an approach called FSDU-displacement will be
presented, which redistributes the delay budgets for every transis-
tor in such a manner that a lower area solution (from the present
solution) is achieved that is also timing feasible.
Delay Balancing
A given circuit DAG $G$ can be transformed to a functionally equivalent
circuit DAG $G'$ by introducing dummy units of appropriate
delay on to each edge in the circuit DAG in such a manner that for
every $e_{ij} \in E$, $\mathit{esl}(e_{ij}) = 0$ and $CP(G') = CP(G)$ [13]. This
process is known as delay balancing. For our purposes, we do not
explicitly insert physical delays. Instead, we use the concept of
delay balancing as a tool to capture all the slack in the circuit DAG.
This captured slack is then used for the D-phase optimization. The
delay units used for delay balancing are, therefore, fictitious entities
whose only purpose is to model the slack present in the circuit DAG.
We refer to these fictitious delay units as FSDUs (Fictitious Specific
Delay Units). Figure 3 shows a circuit DAG and figure 4 shows its
delay balanced counterpart; the "square boxes" on the edges of the
circuit in figure 4 represent the FSDUs on that edge.
Figure 3. An example of a circuit DAG with primary inputs $PI_1$ to $PI_5$
and a primary output $PO$. The integer within each vertex represents its
delay, and each vertex $j$ has the triplet RT(j)/SL(j)/AT(j) above it. The
critical path delay is 8.
Figure 4. The circuit DAG in figure 3 after delay balancing.
The square boxed integers on edges represent the FSDUs
added to the edges for delay balancing. The critical path delay remains 8.
Starting with a given circuit DAG there are several possible ways
to produce a delay balanced graph. Any such delay balanced graph
will from now on be referred to as a delay balanced conﬁguration.
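One simple way to obtain a delay balanced configuration is to pad every edge up to the arrival time of its sink vertex. The Python sketch below uses this "as soon as possible" scheme, which is an assumption chosen for brevity and not necessarily the heuristic used by the authors. It also assumes topologically numbered vertices, primary inputs arriving at time 0, and a hypothetical dummy sink named "O" that collects the primary outputs.

```python
def balance(delay, edges):
    """Return (fsdu, cp) for a DAG whose edges (i, j) all satisfy i < j.

    fsdu maps each edge, plus one edge (po, "O") per primary output,
    to the fictitious delay that removes all slack on that edge.
    """
    n = len(delay)
    fanin = [[] for _ in range(n)]
    has_fanout = [False] * n
    for i, j in edges:
        fanin[j].append(i)
        has_fanout[i] = True

    at = [0.0] * n                       # primary inputs arrive at time 0
    for j in range(n):
        if fanin[j]:
            at[j] = max(at[i] + delay[i] for i in fanin[j])
    cp = max(at[i] + delay[i] for i in range(n))

    # Pad each edge so that every fanin of j delivers its signal at AT(j).
    fsdu = {(i, j): at[j] - at[i] - delay[i] for i, j in edges}
    for i in range(n):
        if not has_fanout[i]:            # primary output -> dummy sink
            fsdu[(i, "O")] = cp - at[i] - delay[i]
    return fsdu, cp


def path_delay(delay, fsdu, path):
    """Delay of a path ending at "O": vertex delays plus FSDUs on its edges."""
    total = sum(delay[v] for v in path if v != "O")
    return total + sum(fsdu[(u, v)] for u, v in zip(path, path[1:]))


# Hypothetical diamond: 0 -> 1 -> 3 and 0 -> 2 -> 3.
delay = [1.0, 3.0, 1.0, 2.0]
edges = [(0, 1), (0, 2), (1, 3), (2, 3)]
fsdu, cp = balance(delay, edges)
long_path = path_delay(delay, fsdu, [0, 1, 3, "O"])
short_path = path_delay(delay, fsdu, [0, 2, 3, "O"])
```

After balancing, both input-to-output paths have the same delay as the critical path, so no edge has any slack left and the critical path delay is unchanged.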
FSDU-Displacement
We define FSDU-Displacement, a circuit DAG transformation technique,
as a mapping $r : V \to \mathbb{Z}$ ($\mathbb{Z}$: the set of integers) such
that the delay of the FSDU on the edge $e_{ij}$ after FSDU-Displacement,
$\mathit{FSDU}_r(e_{ij})$, is related to the delay of the FSDU before
FSDU-Displacement, $\mathit{FSDU}(e_{ij})$, by
\[
\mathit{FSDU}_r(e_{ij}) = \mathit{FSDU}(e_{ij}) + r(j) - r(i). \tag{9}
\]
We state the following without proof, due to space limitations.
Theorem 1 All legal delay balanced conﬁgurations for a given
circuit-graph G are FSDU-Displaced versions of each other.
Theorem 2 The net change in the delay of any structural path from
a vertex $i$ to another vertex $j$ after FSDU-Displacement is always
$r(j) - r(i)$.
The above theorem gives rise to the following corollary.
Corollary 1 If we connect all the leaf vertices corresponding to
primary output nodes of a given circuit DAG to a common dummy
vertex $O$ through dummy edges, and if we restrict $r(O)$ to be exactly
0 and also restrict $r(I)$ for every input vertex $I \in PI$ to be exactly 0,
then the critical path of the transformed circuit DAG after
FSDU-displacement remains unaltered.
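The telescoping behind Theorem 2 and Corollary 1 can be seen on a small example. In the Python sketch below, the edge labels, the displacement values `r` and the helper names are all hypothetical.

```python
def displace(fsdu, r):
    """Apply (9): FSDU_r(e_ij) = FSDU(e_ij) + r(j) - r(i)."""
    return {(i, j): w + r[j] - r[i] for (i, j), w in fsdu.items()}


def path_fsdu(fsdu, path):
    """Total FSDU delay along a path given as a list of vertices."""
    return sum(fsdu[(u, v)] for u, v in zip(path, path[1:]))


# A balanced diamond 0 -> 1 -> 3, 0 -> 2 -> 3 with all slack on edge (2, 3).
fsdu = {(0, 1): 0, (0, 2): 0, (1, 3): 0, (2, 3): 2}

# Move the slack from the output side of vertex 2 to its input side.
# r is 0 at the input vertex 0 and at the output vertex 3.
r = {0: 0, 1: 0, 2: 2, 3: 0}
moved = displace(fsdu, r)

change_full = path_fsdu(moved, [0, 2, 3]) - path_fsdu(fsdu, [0, 2, 3])
change_partial = path_fsdu(moved, [0, 2]) - path_fsdu(fsdu, [0, 2])
```

Along any path the intermediate `r` values cancel, so the change depends only on the endpoints: `change_full` equals `r[3] - r[0]`, which is zero here, while `change_partial` equals `r[2] - r[0]`. Vertex delays are not touched by the displacement, so the same holds for complete path delays.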
Before we develop a formal mathematical model, we first modify
the circuit DAG by adding a dummy vertex $\mathit{Dmy}(i)$ of delay
0 units at the output of every vertex $i$ in the circuit DAG. A dummy
edge connects vertex $i$ to its corresponding dummy vertex $\mathit{Dmy}(i)$.
All fanout edges which initially originated from vertex $i$ now
originate from $\mathit{Dmy}(i)$. Figure 5 illustrates this circuit DAG
transformation with an example. Now we can summarize the D-phase as
follows:
D-phase
(1) Produce any valid delay balanced conﬁguration of the
given circuit DAG. We use a depth ﬁrst FSDU insertion heuris-
tic for this purpose, [13].
(2) Now starting from the delay balanced configuration in (1)
above, let $\Delta X = -(D_0 - A)^{-1} \Delta D X = -C^T \mathrm{diag}(\Delta D)$,
where $C^T = (D_0 - A)^{-1} X$, all other symbols are as defined earlier
and $\mathrm{diag}(\Delta D)$ is a column vector consisting of
the diagonal elements of $\Delta D$. Note that minimizing $\sum \Delta X_i$
$\equiv$ minimizing $\sum X_i$. Now, for every vertex $i$ let
$\Delta D_i = r(\mathit{Dmy}(i)) - r(i)$, which means that the delay of the
FSDU at the output of a vertex is the change in its delay after the
D-phase.
(3) To maintain the requirement that $\Delta D$ will be small,
introduce the following constraints for every vertex:
\begin{align*}
\mathit{FSDU}(i \to \mathit{Dmy}(i)) + r(\mathit{Dmy}(i)) - r(i) &\ge \mathit{MIN}\Delta D(i), \\
\mathit{FSDU}(i \to \mathit{Dmy}(i)) + r(\mathit{Dmy}(i)) - r(i) &\le \mathit{MAX}\Delta D(i),
\end{align*}
where $\mathit{MIN}\Delta D(i)$ and $\mathit{MAX}\Delta D(i)$ bound the change in
delay of vertex $i$ from both sides, i.e., decrease or increase of
vertex delay.
(4) For every edge $e(\mathit{Dmy}(i) \to j)$ introduce the causality
constraint that states that the slack for all edges in the original
DAG will be non-negative after the D-phase:
\[
\mathit{FSDU}_r(\mathit{Dmy}(i) \to j) = \mathit{FSDU}(\mathit{Dmy}(i) \to j) + r(j) - r(\mathit{Dmy}(i)) \ge 0.
\]
(5) Now solve the following optimization problem, whose dual
is a min-cost network flow problem [14]:
\begin{align*}
&\text{minimize} \sum_{\text{over vertices } i} X_i \;\equiv\;
 \text{maximize} \sum_{\text{over vertices } i} C_i \cdot \bigl(r(\mathit{Dmy}(i)) - r(i)\bigr) \\
&\text{subject to:} \\
&\mathit{FSDU}(i \to \mathit{Dmy}(i)) + r(\mathit{Dmy}(i)) - r(i) \ge \mathit{MIN}\Delta D(i), \\
&\mathit{FSDU}(i \to \mathit{Dmy}(i)) + r(\mathit{Dmy}(i)) - r(i) \le \mathit{MAX}\Delta D(i), \\
&\text{for all edges } \mathit{Dmy}(i) \to j: \\
&\mathit{FSDU}_r(\mathit{Dmy}(i) \to j) = \mathit{FSDU}(\mathit{Dmy}(i) \to j) + r(j) - r(\mathit{Dmy}(i)) \ge 0. \tag{10}
\end{align*}
Note that the D-phase optimization is in the form of the dual of a
minimum cost network flow problem [9]. Also, the constant terms
on the RHS of the constraints in the D-phase can be integerized by
appropriate scaling, i.e., by multiplying every constant term by some
power of 10 and then rounding off the product. By choosing
appropriate powers of 10, arbitrary accuracy can be maintained with
almost no penalty in computational requirements. In this way, fast
methods devised for integerized minimum cost network flow
approaches [9] can be fruitfully employed in solving the D-phase
optimization problem.
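The integerization step can be sketched as follows. The Python example writes the causality constraints in the difference form $r(u) - r(v) \le c$, with $u = \mathit{Dmy}(i)$, $v = j$ and $c = \mathit{FSDU}(\mathit{Dmy}(i) \to j)$, and scales the right-hand sides by a power of 10. The tuple layout, the vertex labels and the numeric values are assumptions made for illustration.

```python
def integerize(constraints, digits):
    """Scale the right-hand sides of difference constraints to integers.

    constraints : list of (u, v, c) meaning r(u) - r(v) <= c
    digits      : number of decimal digits to keep (scale = 10 ** digits)

    The unknowns r are measured in the same scaled units afterwards, so a
    solution of the scaled problem is divided by the scale to recover r.
    """
    scale = 10 ** digits
    scaled = [(u, v, round(c * scale)) for u, v, c in constraints]
    return scaled, scale


# Hypothetical causality constraints with fractional FSDU values.
constraints = [("Dmy1", "2", 0.125), ("Dmy1", "3", 1.5), ("Dmy2", "4", 2.3333)]
scaled, scale = integerize(constraints, digits=3)
```

Keeping three digits bounds the rounding error of each constant by half a unit in the third decimal place; a larger `digits` value tightens this bound at the cost of larger integers.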
Figure 5. Illustration of the circuit DAG transformation associated with
the addition of a dummy vertex at the output of every vertex. In the
example, vertex 1 drives vertices 2 to 6 through edges carrying
FSDU(1->2), ..., FSDU(1->6); after the transformation these edges
originate from the zero-delay dummy vertex 1D.
2.3.2. W-phase
Once the D-phase has computed new delays (delay budgets) for
all the vertices in the circuit DAG, we need to ﬁnd feasible sizes
for the transistors corresponding to every vertex in the circuit DAG
to satisfy the delay requirements while using up minimal area. In
effect we have to solve the following problem:
\begin{align*}
&\text{minimize} \sum_{\text{over vertices } i} x_i, \\
&\text{subject to:} \quad
 \frac{\sum_{j \in S(V(G))} a_{ij} x_j + b_i}{x_i} \le \mathit{delay}(i), \\
&\equiv \quad
 \frac{\sum_{j \in S(V(G))} a_{ij} x_j + b_i}{\mathit{delay}(i)} \le x_i, \\
&\mathit{minsize} \le x_i \le \mathit{maxsize}. \tag{11}
\end{align*}
It turns out that due to the non-negativity of $a_{ij}$, $\mathit{delay}(i)$
and the coefficients of $x_i$ in the objective function, this optimization
problem can be modeled as a Simple Monotonic Program (SMP) [10].
This kind of problem can be solved by a constraint relaxation
procedure with worst-case complexity of $O(|V||E|)$, where $|E|$ is the
number of constraints and $|V|$ is the number of variables. The
details of this relaxation procedure are omitted for lack of space,
but can be found in [10]. In the W-phase, due to the restrictions
on the magnitude of the change in delay budgets computed in the
D-phase, the magnitude of the change in $x_i$, i.e., $\Delta x_i$, will be small.
To sum up, the W-phase ﬁnds a set of sizes for the transistors in
the circuit that is a minimum area solution for satisfying the delay
requirements calculated by the D-phase.
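To give a feel for why (11) is easy to solve, the Python sketch below uses a plain monotone fixed-point iteration: start every size at its lower bound and raise a size only when its delay constraint is violated. This is an assumed stand-in for illustration, not the relaxation procedure of [10], whose details the paper omits; the function name, tolerance and example data are hypothetical.

```python
def w_phase_relax(delay, a, b, minsize, maxsize, tol=1e-9, max_passes=1000):
    """Find minimal sizes satisfying the constraints of (11), or None.

    delay[i] : delay budget of vertex i (positive)
    a[i]     : dict {j: a_ij} of non-negative coefficients
    b[i]     : non-negative constant term
    """
    n = len(delay)
    x = [minsize] * n                      # start from the lower bound
    for _ in range(max_passes):
        changed = False
        for i in range(n):
            # Smallest x_i allowed by: (sum_j a_ij x_j + b_i) / delay(i) <= x_i
            need = (b[i] + sum(v * x[j] for j, v in a[i].items())) / delay[i]
            if need > x[i] + tol:
                if need > maxsize:
                    return None            # budget cannot be met within bounds
                x[i] = need                # sizes only ever increase
                changed = True
        if not changed:
            return x
    return None                            # did not settle within max_passes


# Hypothetical three-vertex chain (same data shape as equation (5)).
delay = [2.0, 2.0, 1.0]
a = [{1: 1.0}, {2: 0.5}, {}]
b = [1.0, 1.0, 2.0]
x_loose = w_phase_relax(delay, a, b, minsize=0.5, maxsize=10.0)
x_floor = w_phase_relax(delay, a, b, minsize=1.5, maxsize=10.0)
x_fail = w_phase_relax(delay, a, b, minsize=0.5, maxsize=1.5)
```

Because all coefficients are non-negative, raising one size can only raise the requirement on others, so iterating upward from the lower bound stops at the component-wise smallest feasible sizes when such sizes exist.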
2.4. Putting it All Together
Having defined the D-phase and W-phase of the optimization
strategy, we are now in a position to describe MINFLOTRANSIT,
our Min-cost Flow based Transistor sizing Tool.
MINFLOTRANSIT
1. Size the circuit to meet delay requirements using TILOS.
2. Alternately perform the D-phase and W-phase
optimizations, solving the problems formulated in (10) and
(11) respectively.
3. Stop the iterations when the area improvement after the
W-phase is negligible.
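The three steps above form a simple outer loop, sketched below in Python. The callables `d_phase` and `w_phase`, the relative tolerance and the iteration cap are hypothetical placeholders: real implementations would solve (10) and (11), whereas the toy stand-ins here only exercise the control flow.

```python
def minflotransit(initial_sizes, d_phase, w_phase, rel_tol=1e-3, max_iters=100):
    """Alternate D-phase and W-phase until the area stops improving.

    initial_sizes : sizes that already meet the delay target (step 1)
    d_phase(sizes)   -> delay budgets        (placeholder for problem (10))
    w_phase(budgets) -> minimal-area sizes   (placeholder for problem (11))
    """
    sizes = list(initial_sizes)
    area = sum(sizes)
    for _ in range(max_iters):
        budgets = d_phase(sizes)
        candidate = w_phase(budgets)
        improvement = area - sum(candidate)
        if improvement > 0:                  # keep only improving solutions
            sizes, area = candidate, sum(candidate)
        if improvement <= rel_tol * area:    # step 3: negligible improvement
            break
    return sizes


# Toy stand-ins: the "budgets" are just the sizes, and the W-phase
# halves every size down to a floor of 1.0.
result = minflotransit(
    [8.0, 4.0],
    d_phase=lambda sizes: sizes,
    w_phase=lambda budgets: [max(1.0, 0.5 * v) for v in budgets],
)
```

With these stand-ins the loop accepts three improving iterations and stops on the fourth, when the candidate no longer reduces the area.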
We now present an example that qualitatively illustrates the im-
provements provided by MINFLOTRANSIT over TILOS.
Example 1 Figure 6 shows a simple three gate circuit to be sized.
TILOS is a sensitivity based greedy heuristic that proceeds by
bumping up in each pass the size of that transistor/gate that leads
to maximal beneﬁt in speed for a unit increase in area. Such a tran-
sistor/gate is called the transistor/gate with the highest sensitivity.
Assume that in ﬁgure 6, both
B and
C are gates with identical sen-
sitivity and
A has a lower sensitivity. Therefore the paths
A
!
B
A
B
C
Figure 6. An example illustrating the global perspective taken
by MINFLOTRANSIT which TILOS tends to overlook.
Table 1. The area savings in
% of MINFLOTRANSIT over
TILOS is listed. The CPU time required by TILOS and the
extra time required by MINFLOTRANSIT over and above that
required by TILOS are listed. The critical path of a minimum
sized circuit is denoted by
D
m
i
n.
Circuit # Gates Area Delay CPU CPU
savings Specs. (TIME) (TIME)
over (TILOS) (OURS)
# Gates TILOS
adder32 480
￿ 1% 0.5
D
m
i
n 2.2s 5s
adder256 3840
￿ 1% 0.5
D
m
i
n 262s 608s
c432 160 9.4% 0.4
D
m
i
n 0.5s 4.8s
c499 202 7.2% 0.57
D
m
i
n 1.47s 11.26s
c880 383 4%% 0.4
D
m
i
n 2.7s 8.2s
c1355 546 9.5% 0.4
D
m
i
n 29s 76s
c1908 880 4.6% 0.4
D
m
i
n 36s 84s
c2670 1193 9.1% 0.4
D
m
i
n 27s 69s
c3540 1669 7.7% 0.4
D
m
i
n
s
i
z
e 226s 335s
c5315 2307 2% 0.4
D
m
i
n
s
i
z
e 90s 111s
c6288 2416 16.5% 0.4
D
m
i
n
s
i
z
e 1677s 2461s
c7552 3512 3.3% 0.4
D
m
i
n
s
i
z
e 320s 363s
and $A \to C$ are both critical. TILOS, due to its greedy nature, will bump up the sizes of $B$ and $C$ in alternate passes, whereas it should be intuitively clear that sizing up $A$, even though it has lower sensitivity, may be the better option as it speeds up both paths $A \to B$ and $A \to C$ simultaneously. In the D-phase, MINFLOTRANSIT explicitly includes constraints to evaluate the benefits of altering the sizes of gates $A$, $B$ and $C$. It is therefore able to identify whether sizing gate $A$, in spite of its lower sensitivity, will be advantageous.
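The effect described in Example 1 can be reproduced on a toy instance. The sketch below rests on loudly stated assumptions: each gate's delay is modeled as `K[g] / w[g]` with made-up constants and no loading effects, the greedy loop is only TILOS-like (it bumps the most sensitive gate on the current critical path by a factor of 1.1), and an exhaustive search over the same size grid stands in for a global optimizer. It is not the D-phase/W-phase procedure.

```python
from itertools import product

K = {"A": 4.0, "B": 5.0, "C": 5.0}   # hypothetical drive constants (assumption)
PATHS = [("A", "B"), ("A", "C")]     # gate A fans out to B and C
BUMP = 1.1                           # multiplicative size step

def path_delay(path, w):
    return sum(K[g] / w[g] for g in path)

def circuit_delay(w):
    return max(path_delay(p, w) for p in PATHS)

def area(w):
    return sum(w.values())

def greedy_size(target):
    """TILOS-like loop: bump the most sensitive gate on the current critical path."""
    w = {g: 1.0 for g in K}
    while circuit_delay(w) > target:
        critical = max(PATHS, key=lambda p: path_delay(p, w))

        def sensitivity(g):
            bumped = dict(w, **{g: w[g] * BUMP})
            gain = path_delay(critical, w) - path_delay(critical, bumped)
            return gain / (bumped[g] - w[g])   # delay gain per unit area

        best = max(critical, key=sensitivity)
        w[best] *= BUMP
    return w

def exhaustive_size(target, steps=12):
    """Try every combination of sizes BUMP**i; keep the smallest feasible area."""
    grid = [BUMP ** i for i in range(steps)]
    best = None
    for combo in product(grid, repeat=len(K)):
        w = dict(zip(K, combo))
        if circuit_delay(w) <= target and (best is None or area(w) < area(best)):
            best = w
    return best

greedy = greedy_size(target=7.0)
searched = exhaustive_size(target=7.0)
print("greedy  :", greedy, round(area(greedy), 4))
print("searched:", searched, round(area(searched), 4))
```

In this toy model the per-path sensitivity of `A` never accounts for the fact that `A` lies on both paths, so the greedy loop keeps favoring `B` and `C`, while the exhaustive search meets the same delay target with less total area by widening `A` more.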
Theorem 3. MINFLOTRANSIT produces minimum transistor sizing for any delay constraints.
Proof: Let us assume that we are in some intermediate iteration at a non-optimal point. We iteratively apply the D-phase, followed by the W-phase, and it is sufficient to show that the application of each of these steps causes the objective function to reduce, while maintaining feasibility. The D-phase uses a Taylor series approximation to the constraint in Equation (6) to represent the objective function entirely in terms of the delay variables. This approximation is valid within a radius of convergence of $\epsilon$ around the current point, $X$, corresponding to some radius of $\delta$ around the vector of delays. Therefore, a solution to the D-phase is a valid solution to the original problem as long as the allowable delay change is bounded by a quantity that lies within a $\delta$-ball of the current delays. This is achieved by forcing $MAX\Delta D(i)$ and $MIN\Delta D(i)$ to be small for all $i$.
If the current solution is not optimal, due to the convexity of the problem [1, 2], there must be another feasible point in the neighborhood of the current point that has a smaller objective function value, and this point will be found by the D-phase. In the W-phase that follows, the solution found in the D-phase is a feasible solution. Since the W-phase does not limit the change in the delay or the transistor sizes as greatly as the D-phase, its solution must have an objective function value that is no larger than that of the solution of the D-phase.
Therefore, since the objective function value decreases in each
phase, the procedure is guaranteed to ﬁnd an optimal solution to the
problem.
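The descent argument of the proof can be written as a single chain of inequalities. The notation is introduced here only for illustration and is an assumption: $X^{(k)}$ is the sizing at the start of iteration $k$, $X_D^{(k)}$ is the feasible point returned by the D-phase, $X^{(k+1)}$ is the point returned by the W-phase, and $\mathrm{Area}(\cdot)$ is the objective function.

```latex
\[
  \mathrm{Area}\bigl(X^{(k+1)}\bigr)
  \;\le\;
  \mathrm{Area}\bigl(X_D^{(k)}\bigr)
  \;<\;
  \mathrm{Area}\bigl(X^{(k)}\bigr)
  \qquad \text{whenever } X^{(k)} \text{ is not optimal.}
\]
```

The strict inequality is the D-phase step, which relies on convexity; the non-strict one is the W-phase step, which starts from the D-phase solution as a feasible point.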
3. SIMULATION RESULTS
[Figure 7: two panels, c432 and c6288, each plotting (Area of Ckt)/(Area of minimum size Ckt) against (Delay of Ckt)/(Delay of minimum size Ckt) for TILOS and MINFLOTRANSIT.]

Figure 7. Comparative area-delay curves for gate sizing of two ISCAS85 benchmark circuits. The total device area of the circuits after transistor sizing with TILOS and MINFLOTRANSIT is plotted against delay, normalized with respect to the delay of a minimum sized circuit. Even though the curves look close, the area benefits are actually significant. For example, in the case of c6288, for a circuit with 0.5 times the delay of the minimum sized circuit, the area savings of MINFLOTRANSIT over TILOS is 14.2%.

Simulation results were obtained on all the combinational circuits in the ISCAS85 benchmark suite and also on ripple carry
adders of 32 to 256 bits. The results shown in this section are for gate sizing which, as mentioned before, is a special case of true transistor sizing. We implemented the TILOS algorithm as described in [15]. Starting with all transistors at minimum size, a bumpsize = 1.1 was used for the initial sizing. Iterative application of D-phase and W-phase optimization was then carried out for optimal transistor sizing. Figure 7 shows the area delay curve for representative benchmark circuits for both TILOS and MINFLOTRANSIT. The technology parameters used for simulation were obtained from [16] for $0.13\,\mu$m technology. As can be seen, MINFLOTRANSIT gives a clear gain in performance over TILOS.
Table 1 lists the area savings of MINFLOTRANSIT over TILOS and the CPU times required on an Ultrasparc 10 Sun workstation for sizing the ISCAS85 benchmark circuits. The tabulated results are for sizing solutions where the area penalty is within 1.5 to 1.75 times that of a minimum sized circuit, and all these correspond to points where the area penalty of sizing is reasonable. For the adders the improvement in area over TILOS is marginal, thereby suggesting that adders can be easily sized by using greedy heuristics. This is not surprising since ripple carry adders have a single dominant critical path which can, possibly, be sized optimally using a heuristic like TILOS. On the other hand, as can be clearly seen for the ISCAS85 benchmark circuits, the area savings vary from
2% to 16.5%. The overall time required by MINFLOTRANSIT is almost always (except in the small circuits c432, c499) within 2 to 4 times that of TILOS. The circuit c6288 shows an unusually high time requirement for sizing, possibly due to the fact that this circuit (a type of multiplier) has a large number of paths, many of them reconvergent. Therefore, a number of competing paths can become critical at any instance and sizing this circuit is consequently harder. It is also notable that TILOS performs poorly as compared to MINFLOTRANSIT for this particular circuit. In all our simulations only a few tens of iterations were required by MINFLOTRANSIT (except in the steepest portions of the area delay curve, where no more than 100 iterations were required).
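The run-time comparison can be checked against the CPU columns of Table 1. The short sketch below copies those two columns and computes the ratio of the OURS column to the TILOS column. Reading that ratio as the "times of TILOS" figure is an assumption, since the table caption describes the last column as extra time while the text speaks of overall time.

```python
# (circuit, TILOS CPU seconds, OURS CPU seconds), copied from Table 1
CPU_TIMES = [
    ("adder32", 2.2, 5.0),
    ("adder256", 262.0, 608.0),
    ("c432", 0.5, 4.8),
    ("c499", 1.47, 11.26),
    ("c880", 2.7, 8.2),
    ("c1355", 29.0, 76.0),
    ("c1908", 36.0, 84.0),
    ("c2670", 27.0, 69.0),
    ("c3540", 226.0, 335.0),
    ("c5315", 90.0, 111.0),
    ("c6288", 1677.0, 2461.0),
    ("c7552", 320.0, 363.0),
]

# Ratio of the two columns as printed (interpretation is an assumption).
ratios = {name: ours / tilos for name, tilos, ours in CPU_TIMES}
above_4x = sorted(name for name, r in ratios.items() if r > 4.0)

for name, r in ratios.items():
    print(f"{name:9s} {r:5.2f}x")
print("more than 4x TILOS:", above_4x)
```

With the tabulated values, only c432 and c499 exceed a factor of four, which matches the two exceptions named in the text.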
4. CONCLUSIONS
We presented a new transistor sizing tool, MINFLOTRANSIT, that has shown impressive run-time behavior over various benchmarks in the ISCAS85 benchmark suite. MINFLOTRANSIT is a two-phase iterative-relaxation based technique. The first, the D-phase, has a worst case complexity of $O(|V||E|\log(\log|V|))$; the second, the W-phase, has a worst case complexity of $O(|V||E|)$. The run-time behavior of this tool is comparable to TILOS but it is guaranteed to produce optimal transistor sizes for meeting the delay constraints.
Although Elmore delay models were used in this paper for illustra-
tion, MINFLOTRANSIT is valid for a larger class of delay models
characterized by the monotonic decomposition property in Deﬁni-
tion 2.
REFERENCES
[1] J. P. Fishburn and A. E. Dunlop, “TILOS: A Posynomial Pro-
gramming Approach to Transistor Sizing,” in Proceedings of
the 1985 International Conference on Computer-Aided De-
sign, pp. 326–328, November 1985.
[2] S. Sapatnekar, V. Rao, P. Vaidya, and S. Kang, “An Exact Solution to the Transistor Sizing Problem for CMOS Circuits Using Convex Optimization,” IEEE Transactions on Computer-Aided Design, vol. 12, pp. 1621–1634, November 1993.
[3] J.-M. Shyu, A. L. Sangiovanni-Vincentelli, J. Fishburn, and
A. Dunlop, “Optimization-based Transistor Sizing,” IEEE
Journal on Solid State Circuits, vol. 23, no. 2, pp. 400–409,
1988.
[4] D. Marple, “Performance Optimization of Digital VLSI Cir-
cuits,” Technical Report CSL-TR-86-308, Stanford University,
October 1986.
[5] D. Marple, “Transistor Size Optimization in the Tailor Lay-
out System,” in Proceedings of the 26th ACM/IEEE Design
Automation Conference, pp. 43–48, June 1989.
[6] H. Y. Chen and S. M. Kang, “iCoach: A Circuit Optimization Aid for CMOS High-Performance Circuits,” Integration, the VLSI Journal, vol. 10, pp. 185–212, January 1991.
[7] Z. Dai and K. Asada, “MOSIZ: A Two-Step Transistor Sizing
Algorithm based on Optimal Timing Assignment Method for
Multi-Stage Complex Gates,” in Proceedings of the 1989 Cus-
tom Integrated Circuits Conference, pp. 17.3.1–17.3.4, May
1989.
[8] C. Chen, C. N. Chu, and D. F. Wong, “Fast and Exact Simultaneous Gate and Wire Sizing by Lagrangian Relaxation,” in Proceedings of the 1998 IEEE/ACM International Conference on Computer-Aided Design, pp. 617–624, November 1998.
[9] A. V. Goldberg, M. D. Grigoriadis, and R. E. Tarjan, “Use of
Dynamic Trees in a Network Simplex Algorithm for the Max-
imum Flow Problem,” Mathematical Programming, vol. 50,
pp. 277–290, June 1991.
[10] M. C. Papaefthymiou, “Asymptotically Efficient Retiming under Setup and Hold Constraints,” in Proceedings of the IEEE/ACM International Conference on Computer-Aided Design, pp. 288–295, November 1998.
[11] T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction
to algorithms. McGraw-Hill New York, NY, 1990.
[12] G. Strang, Linear Algebra and its Applications. Harcourt
Brace Jovanovich, Publishers, San Diego, CA, 1988.
[13] V. Sundararajan and K. K. Parhi, “Low Power Gate Resizing Using Buffer-Redistribution,” in Proceedings of the Twentieth Anniversary Conference on Advanced Research in VLSI, (Atlanta, GA), pp. 170–184, March 1999.
[14] V. Chvatal, Linear Programming. W. H. Freeman and Com-
pany, New York, NY, 1983.
[15] A. E. Dunlop, J. P. Fishburn, D. D. Hill, and D. D. Shugard,
“Experiments Using Automatic Physical Design Techniques
for Optimizing Circuit Performance,” in Proceedings of the
32nd Midwest Symposium on Circuits and Systems, (Urbana,
IL), pp. 216–220, August 1989.
[16] J. Cong, “Challenges and Opportunities for Design Innova-
tions in Nanometer Technologies,” Technical Report, Semi-
conductor Research Corporation, 1997.