High-Level Optimization by Combining Retiming and Shannon Decomposition by Soviani, Cristian et al.
Cristian Soviani Olivier Tardieu Stephen A. Edwards∗
Department of Computer Science
Columbia University, New York
Abstract
Applying Shannon decomposition can reshape sequential cir-
cuits and improve opportunities for retiming. Both Shannon
decomposition and retiming only rely on limited information
about combinational blocks (timing estimates), so both tech-
niques are suitable for high-level synthesis.
We describe an efficient algorithm to preprocess a circuit
using Shannon decomposition to increase retiming efficacy. It
assembles complex chains of Shannon decompositions while
carefully avoiding parallel ones in order to limit the area over-
head due to logic duplication.
We compare a traditional retiming flow with the same flow
augmented with our algorithm. Although our algorithm pro-
vides no improvement on half of our benchmarks, for the other
half we obtain a 25% speed-up on average (7% to 61%), while
only increasing area by 5% (3% to 12%).
1 Introduction
IC technology has made it possible to build enormous systems
on a chip. High-level synthesis and optimization techniques
are needed to handle not only low-level entities such as gates
but also more complex structures such as arithmetic units.
There are many techniques for low-level optimizations, in-
cluding two-level optimization, algebraic methods, and redun-
dancy elimination. But if we increase the level of abstraction,
we can no longer use most of them. However, Shannon de-
composition and retiming scale very well, as they ignore the
complexity of the combinational blocks. In this work, we con-
centrate on these two techniques.
There are too many ways to combine these transforma-
tions, so heuristics are usually applied to choose good ones.
Even worse, the circuit is usually optimized after each trans-
formation (critical Boolean opportunities might otherwise be
missed), meaning a systematic way of considering combinations
of transformations is difficult. Any high-level approach
must therefore either ignore Boolean properties or account for
them only in a simplified way, because low-level information is lacking.
∗soviani, tardieu, sedwards@cs.columbia.edu. Edwards and his group
are supported by an NSF CAREER award, a grant from Intel corporation,


















Figure 1: Motivating example: four slow combinational blocks





























Figure 2: Shannon decomposition reduces the feedback loop





























Figure 3: Retiming has reduced cycle time to one block’s delay
plus a multiplexer.
In this paper, we propose an efficient algorithm to select and
apply Shannon decompositions that enable effective retiming
while avoiding excessive area increase. In other words, by pre-
processing a circuit using our algorithm just before retiming,
the speed-up obtained by retiming is significantly improved.
The algorithm handles both acyclic and (sequential) cyclic cir-
cuits. It is both efficient and exact, provided our timing esti-
mation function and that of the retiming algorithm agree.
1.1 An example
Consider the circuit in Figure 1, which was extracted from a
Gigabit Ethernet 8b10b encoder. The four identical combina-
tional nodes are already optimized—they were designed man-
ually for speed. We may even have several variants, exploiting
a performance/area trade-off. Therefore, it would be desirable
to speed up the circuit without modifying the nodes.
Retiming is a widely-used transformation. In our example,
the designer put three registers on each input in the hopes
that the increased latency would allow retiming to improve
throughput, otherwise known as pipelining. Unfortunately, re-
timing fails on this circuit because of the tight (single-register)
feedback loop. If $d_{node}$ is the delay of the combinational node,
the minimum period remains $4d_{node}$ after retiming.
Figure 2 shows how the loop can be broken using Shannon
decomposition. The combinational nodes are duplicated and
some muxes added, resulting in a significant area penalty. This
is not so severe in practice, as usually only a small fraction of a
circuit will require Shannon decomposition. The longest path
is now $4d_{node} + d_{mux}$, where $d_{mux}$ is the delay of a mux. But the
loop is much shorter (its delay is $d_{mux}$), enabling retiming to
achieve a period of $d_{node} + d_{mux}$ (Figure 3).
In fact, we can generate multiple points on the period/area
curve (Figure 4). Adding more than three registers at the inputs
makes it possible to further reduce the minimum period.
Although impractical beyond a certain limit, we can theoretically
decrease the minimum period until it reaches the delay
of the loop: $d_{mux}$.
1.2 Related work
Speeding up combinational logic by resynthesizing the minimum
slack paths has a long history. Relevant techniques include
tree-height reduction (THR, Singh et al. [12]), the generalized
select transform (GST, Berman et al. [1]), the generalized
bypass transform (GBT, McGeer et al. [8]), and the exact
sensitization of critical paths (Saldanha et al. [9]). Like ours,
GST is based on Shannon decomposition.
Figure 4: Period/area trade-off.
Speeding up sequential logic has also been the focus of extensive
research, such as the work of Singh [11]. Retiming
(Leiserson and Saxe [5]) can decrease the minimum period
without restructuring the logic. Malik et al. [6] combine retiming
and resynthesis. Their algorithm is efficient for circuits
with little feedback, such as pipelines. In contrast, our combination
of retiming and resynthesis focuses on feedback.
Addressing high-level synthesis, Hassoun et al. [4] propose
architectural retiming, which attempts to optimize high-level
pipelines by mixing retiming with pre-computation and prediction.
Marinescu et al. [7] propose an algorithm to automatically
pipeline a circuit by increasing the pipeline length
with the help of stalling and forwarding. But several transformations
applied successively may interfere with each other. As
a result, most approaches apply various transformations itera-
tively, in a heuristic manner. By contrast, in this work we take
all retiming opportunities into account when choosing Shan-
non decompositions, so only a single pass of our algorithm
followed by a single pass of retiming is needed.
1.3 Paper organization
Section 2 introduces the notation we use and reviews Shannon
decomposition, retiming, and retiming efficiency estimation.
Section 3 describes serial compositions of Shannon decom-
positions. Section 4 presents the exhaustive exploration algo-
rithm for combinational circuits. Section 5 sketches the exten-
sion of this algorithm to sequential circuits. In Section 6, we
describe our implementation and discuss experimental results.
We conclude in Section 7.
2 Basics
2.1 Sequential circuits
A sequential circuit is a directed graph S = (V,E) with vertices
V = PI∪PO∪N∪R∪{spi,spo}. PI & PO are the primary in-
puts & outputs; N are the internal combinational nodes; R are
registers; spi and spo are two supernodes connected to/from all
PI/PO respectively. The edges E ⊂V ×V model the intercon-
nect: fanin(n) = {n′|(n′,n) ∈ E}, fanout(n) = {n′|(n,n′) ∈ E}.
Each combinational node n ∈ N computes a Boolean function
of its p input wires, $f : \mathbb{B}^p \to \mathbb{B}$, which defines the common
value of all its output wires. Each register has exactly one input: ∀r ∈ R : |fanin(r)| = 1.
Combinational cycles are not allowed: the subgraph of S excluding
registers, $D = (V \setminus R,\ E|_{(V \setminus R) \times (V \setminus R)})$, must be acyclic.
2.2 Arrival times




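The delay d(n) of a vertex n is
$$d(n) = \begin{cases}
\text{arrival time (from clock)} & n \in PI \\
\text{delay of logic} & n \in N \\
\text{setup time (to next clock)} & n \in PO \\
0 & n \in R \cup \{s_{pi}, s_{po}\}
\end{cases} \tag{1}$$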
procedure ArrivalTimes(S)
    for each n ∈ R ∪ {spi} do at(n) ← 0
    for each n ∈ V \ (R ∪ {spi}) in topological order in D do
        at(n) ← d(n) + max over n′ ∈ fanin(n) of at(n′)
Figure 5: Computing arrival times.

















Figure 6: Shannon decomposition of f on input xk.
For each node n ∈ V \ (R ∪ {spi}), we have an arrival time:
$$\mathrm{at}(n) = d(n) + \max_{n' \in \mathrm{fanin}(n)} \mathrm{at}(n') \tag{2}$$
Since D is acyclic, computing arrival times is straightforward:
a single pass in topological order suffices.
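The single topological pass behind equation (2) can be sketched in a few lines of Python. This is only an illustration: the function name `arrival_times`, the dictionary-based graph encoding, and the tiny example circuit are assumptions made here, not part of the paper's implementation.

```python
from graphlib import TopologicalSorter


def arrival_times(delay, fanin, sources):
    """Compute at(n) = d(n) + max over fanin(n) of at(n'), as in equation (2).

    delay:   dict mapping each non-source node to its delay d(n)
    fanin:   dict mapping each non-source node to its list of predecessors
    sources: nodes whose arrival time is fixed to 0 (registers and s_pi)
    """
    at = {n: 0 for n in sources}
    # TopologicalSorter expects a node -> predecessors mapping.
    graph = {n: fanin[n] for n in delay if n not in sources}
    for n in TopologicalSorter(graph).static_order():
        if n not in at:  # sources keep their arrival time of 0
            at[n] = delay[n] + max(at[p] for p in fanin[n])
    return at


# Hypothetical circuit: spi feeds a (delay 2) and c (delay 1); both feed b (delay 3).
delay = {"a": 2, "c": 1, "b": 3, "spo": 0}
fanin = {"a": ["spi"], "c": ["spi"], "b": ["a", "c"], "spo": ["b"]}
at = arrival_times(delay, fanin, sources={"spi"})
```

Here the late input of `b` is `a`, so `at["b"]` is 3 + 2.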










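2.3 Shannon decomposition
Let $f : \mathbb{B}^p \to \mathbb{B}$ be the Boolean function of the combinational
node n and 1 ≤ k ≤ p. Then
$$f(x_1, x_2, \ldots, x_p) = x_k f_{x_k} + \bar{x}_k f_{\bar{x}_k}$$
where
$$f_{x_k} = f(x_1, \ldots, x_{k-1}, 1, x_{k+1}, \ldots, x_p)$$
$$f_{\bar{x}_k} = f(x_1, \ldots, x_{k-1}, 0, x_{k+1}, \ldots, x_p)$$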
Such a Shannon decomposition suggests an alternate implementation
of the node (Figure 6) with arrival time
$$\mathrm{at}(n) = \max\{\mathrm{at}(f_{\bar{x}_k}) + d_{mux0},\ \mathrm{at}(f_{x_k}) + d_{mux1},\ \mathrm{at}(x_k) + d_{muxs}\}$$
For simplicity, we assume $d_{mux0} = d_{mux1} = d_{muxs} = d_{mux}$, so
$$\mathrm{at}(n) = \max\{\mathrm{at}(f_{\bar{x}_k}),\ \mathrm{at}(f_{x_k}),\ \mathrm{at}(x_k)\} + d_{mux}$$
Such a transformation, therefore, improves at(n) provided that
$x_k$ arrives later than the other inputs $x_i$ ($i \neq k$). The cost is
an area increase from node duplication.
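As a minimal sketch of the transformation, the following Python builds the two cofactors of a function, recombines them with a multiplexer, and compares the arrival time with and without the decomposition. The helper names (`shannon`, `decomposed`, `at_direct`, `at_shannon`), the 0-based input index, and the assumption that each cofactor has the same delay as the original node are illustrative choices, not taken from the paper.

```python
from itertools import product


def shannon(f, k):
    """Return the cofactors (f with x_k = 1, f with x_k = 0); k is 0-based."""
    def f_pos(*xs):
        return f(*xs[:k], 1, *xs[k + 1:])

    def f_neg(*xs):
        return f(*xs[:k], 0, *xs[k + 1:])

    return f_pos, f_neg


def mux(sel, in0, in1):
    return in1 if sel else in0


def decomposed(f, k):
    """Alternate implementation: both cofactors computed, x_k only selects."""
    f_pos, f_neg = shannon(f, k)
    return lambda *xs: mux(xs[k], f_neg(*xs), f_pos(*xs))


def at_direct(d_node, at_inputs):
    return d_node + max(at_inputs)


def at_shannon(d_node, d_mux, at_inputs, k):
    # Assumption: each cofactor is as slow as the original node.
    others = [t for i, t in enumerate(at_inputs) if i != k]
    at_cofactor = d_node + max(others, default=0)
    return max(at_cofactor, at_inputs[k]) + d_mux


def majority(a, b, c):
    return (a & b) | (b & c) | (a & c)


g = decomposed(majority, 2)
equivalent = all(g(*xs) == majority(*xs) for xs in product((0, 1), repeat=3))
```

With input arrival times (1, 1, 10), a node delay of 4, and a mux delay of 1, decomposing on the late third input lowers the arrival time from 14 to 11; decomposing on an early input would not help.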
2.4 Retiming
Retiming follows from observing that moving registers back
or forth in a sequential circuit preserves its functionality (Fig-
ure 7). Its goal is to move registers to decrease long (critical)
combinational paths at the expense of short (non-critical) ones.
Figure 7: Basic retiming.
Figure 8: Decreasing cycle period through retiming. (a) initial
cycle with period 10, (b) retiming preserving nodes: period 7,
(c) retiming splitting nodes: period 4.
Let ret(S) be the minimum period achievable by retiming a
circuit S. If $d_c$ and $r_c$ are the combinational delay and the number of registers
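of the cycle c in S, then $\mathrm{ret}(S) \ge d_c / r_c$. Similarly, if p is a path from
$s_{pi}$ to $s_{po}$ having $r_p$ registers and combinational delay $d_p$,
then $\mathrm{ret}(S) \ge d_p / (r_p + 1)$. Together, these bounds give the lower bound
$$\mathrm{lb}(S) = \max\Bigl(\max_{c} \frac{d_c}{r_c},\ \max_{p} \frac{d_p}{r_p + 1}\Bigr) \tag{3}$$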













For the example in Figure 8(a), lb(S) = (3+7+2)/3 = 4.
In addition to moving registers past nodes (as in classical re-
timing, Figure 8b), achieving the period lb(S) requires moving
registers inside nodes (Figure 8c). Large combinational blocks
built from small gates, such as those in an FPGA [13], can usu-
ally be modified this way. We therefore assume ret(S) = lb(S).
In the sequel, we focus on transforming S to minimize lb(S).
2.5 Retiming efficiency estimation
The number of cycles can be exponential in the size of the
circuit, so computing lb(S) directly with (3) is not practical.
Assign weight −c to the registers: ∀r ∈ R : d(r) = −c, where
c is the desired period. Every other node keeps its weight d
as defined by (1). Then there exists a retiming for period c
iff at(spo) ≤ c and the graph S has no positive cycles. In our
example, c = 3 gives the cycle weight 3; c = 5 gives −3. Not
surprisingly, for c = lb(S) = 4, the cycle has weight 0.
The Bellman-Ford algorithm [3] (Figure 9) detects positive
cycles. The algorithm terminates after at most |V | − 1 itera-
tions iff there exists no positive cycle. Therefore, lb(S) can be
approximated by binary search on the period c.
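The feasibility test and the binary search can be sketched as follows. This is an illustrative Python version under stated assumptions: the edge-list encoding, the names `feasible` and `min_period`, the tolerance, and the upper bracket (the sum of all delays, which is always feasible when there are no combinational cycles) are choices made here. The example encodes the three-node cycle of Figure 8(a), with hypothetical connections from spi and to spo.

```python
import math


def feasible(delay, registers, edges, c, spi="spi", spo="spo"):
    """True iff there is no positive cycle and at(spo) <= c when each
    register has weight -c (longest-path Bellman-Ford)."""
    nodes = set(delay) | set(registers) | {spi, spo}
    weight = {n: (-c if n in registers else delay.get(n, 0)) for n in nodes}
    at = {n: -math.inf for n in nodes}
    at[spi] = 0
    for _ in range(len(nodes)):
        changed = False
        for src, dst in edges:
            if at[src] + weight[dst] > at[dst]:
                at[dst] = at[src] + weight[dst]
                changed = True
        if not changed:
            return at[spo] <= c
    return False  # still changing after |V| passes: a positive cycle exists


def min_period(delay, registers, edges, tol=1e-3):
    """Approximate lb(S) from above by binary search on the period c."""
    lo, hi = 0.0, float(sum(delay.values()))
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if feasible(delay, registers, edges, mid):
            hi = mid
        else:
            lo = mid
    return hi


# Cycle with delays 3, 7, 2 and one register after each node.
delay = {"a": 3, "b": 7, "c": 2}
registers = {"r1", "r2", "r3"}
edges = [("spi", "a"), ("a", "r1"), ("r1", "b"), ("b", "r2"),
         ("r2", "c"), ("c", "r3"), ("r3", "a"), ("c", "spo")]
```

On this example, a period of 3 leaves a positive cycle, while 4 and 5 are feasible, so the search converges to 4.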
3 Variants
Consider building many variants of a combinational node in
parallel (i.e., with identically-connected inputs). The fanouts
of a node may then choose to connect to any of these variants'
outputs without affecting the circuit's function. However,
if our only goal is the fastest circuit, only the variant with minimum
arrival time is interesting; we may ignore the others.
procedure BellmanFord(S)
    at(spi) ← 0
    for each n ∈ V \ {spi} do at(n) ← −∞
    repeat
        changes ← false
        for each n ∈ V \ {spi} do
Figure 9: The Bellman-Ford algorithm for calculating positive
cycle weights.
However, we may also want to consider more complex variants
that, instead of a single output wire w, (redundantly) encode
it as a series of wires $(v_i)_{i \in I}$, for instance as $w = \bar{v}_s v_0 + v_s v_1$.
We say that $(v_s, v_0, v_1)$ is a virtual wire of type sh, where
$sh : v_s, v_0, v_1 \mapsto \bar{v}_s v_0 + v_s v_1$.
3.1 Variant types
In general, a type is a function $t : \mathbb{B}^p \to \mathbb{B}$ that decodes a virtual
wire to give its true value. The type of a real wire is $id : w \mapsto w$.
Providing a variant with an output of type t for a node of
function f means designing f ′ such that t ◦ f ′ = f . Computing
f ′ instead of f may be seen as a speculative computation that
we will later complete using function t.
Such variants are interesting because the arrival times for
the several components of a virtual wire may be different, thus
giving further opportunities to optimize speed.
To this aim, we need to chain variants. In addition to virtual
output wires, we support virtual input wires. We say that f′ is
a variant of f of type $t_1 \times \cdots \times t_n \to t$ iff
$$t \circ f'(w_1, \ldots, w_n) = f(t_1(w_1), \ldots, t_n(w_n)) \quad (\forall i : w_i \text{ has type } t_i)$$
In other words, f ′ is such a variant of f iff f ′, provided with
virtual wires of types t1, . . . , tn, computes a virtual wire of type
t so as to match the computation of f for the corresponding
real wires (Figure 10).
3.2 Shannon variants
In principle, we can generate arbitrarily complex variants; we
can even consider the collapsed input cone of a node as a variant,
deferring all computation to the type function. In this paper,
we only consider the variants generated by nested Shannon
decompositions.

































Figure 11: Virtual wire types s0, s1, and s2.
First, we restrict ourselves to the set of wire types $\{s_k\}_{k \ge 0}$
(Figure 11), where $s_k : \mathbb{B}^{2k+1} \to \mathbb{B}$ are recursively defined as
$$s_0 : v_0 \mapsto v_0, \qquad \forall k \ge 0,\ s_{k+1} = s_k \circ r_k,$$
where $\{r_k\}_{k \ge 0}$ are the functions $r_k : \mathbb{B}^{2k+3} \to \mathbb{B}^{2k+1}$ such that
$$r_0 : v_0, v_1, v_2 \mapsto \bar{v}_2 v_0 + v_2 v_1$$
$$r_{k+1} : v_0, \ldots, v_{2k+4} \mapsto (\bar{v}_2 v_0 + v_2 v_1,\ \bar{v}_3 v_0 + v_3 v_1,\ v_4, \ldots, v_{2k+4}).$$
Intuitively, the types $\{s_k\}_{k \ge 0}$ describe a series of Shannon
decompositions, each $r_k$ function specifying one such decomposition:
$s_k = s_0 \circ r_0 \circ r_1 \circ \cdots \circ r_{k-1}$ (Figure 12).
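To make the recursive definition concrete, here is a small Python sketch that decodes a virtual wire of type s_k by applying r_{k−1}, ..., r_0 in turn. The function names `r` and `decode` and the tuple encoding are assumptions for illustration. It also assumes the multiplexer convention that a selector value of 1 picks the second data wire (v1), which is the reading consistent with the start variants of Table 1.

```python
def mux(sel, in0, in1):
    return in1 if sel else in0


def r(k, v):
    """Apply r_k: map a tuple of 2k+3 wires to a tuple of 2k+1 wires."""
    assert len(v) == 2 * k + 3
    if k == 0:
        return (mux(v[2], v[0], v[1]),)
    # v[2] and v[3] each select between the same data pair (v[0], v[1]).
    return (mux(v[2], v[0], v[1]), mux(v[3], v[0], v[1])) + tuple(v[4:])


def decode(v):
    """Apply s_k to a virtual wire with 2k+1 components; return its real value."""
    k = (len(v) - 1) // 2
    while k > 0:  # s_k = s_{k-1} o r_{k-1}
        v = r(k - 1, v)
        k -= 1
    return v[0]
```

For instance, `decode((0, 1, 0, 1, 1))` first collapses to `(0, 1, 1)` through r_1 and then to 1 through r_0.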
To limit the number of node copies, we only allow one non-real
input wire on a node. That is, we only consider variant
types in the set $\{s_{in} \times s_0 \times \cdots \times s_0 \to s_{out}\}_{in \ge 0,\ out \ge 0}$ (modulo
permutation of the input wire types).
We only use a few variants of such types. Basically, we restrict
variants to make at most one Shannon decomposition.
As a result, $out \le in + 1$. These variants can be seen as series
combinations. Combining Shannon decompositions in parallel
would require a node to be copied four or more times,
which we consider impractically costly.
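Formally, consider the function $f : \mathbb{B}^p \to \mathbb{B}$ of node n. In Table 1,
we first define the sets of primitive variants $\{\mathrm{start}^f_k\}_{k \ge 0}$
and $\{\mathrm{extend}^f_k\}_{k \ge 1}$, which start and extend Shannon decompositions
(Figure 13).
Starting from these primitive variants, we define the set
Sh(n) of all Shannon variants for the function f of node n:
$$\mathrm{Sh}(n) = \left\{ \begin{array}{ll}
f & \\
\mathrm{start}^f_k & \forall k \ge 0 \\
\mathrm{extend}^f_{k+1} & \forall k \ge 0 \\
r_{k-\ell} \circ \cdots \circ r_{k-1} \circ r_k \circ \mathrm{start}^f_k & \forall k \ge 0,\ \forall \ell \ge 0 \text{ s.t. } \ell \le k \\
r_{k-\ell} \circ \cdots \circ r_{k-1} \circ r_k \circ \mathrm{extend}^f_{k+1} & \forall k \ge 0,\ \forall \ell \ge 0 \text{ s.t. } \ell \le k
\end{array} \right\}$$
Table 1: Primitive variants. For all $k \ge 0$:
$$\mathrm{start}^f_k : s_k \times s_0^{p-1} \to s_{k+1}$$
$$w_1 = (v_0, \ldots, v_{2k}),\ w_2, \ldots, w_p \mapsto \bigl(f(0, w_2, \ldots, w_p),\ f(1, w_2, \ldots, w_p),\ v_0, \ldots, v_{2k}\bigr)$$
$$\mathrm{extend}^f_{k+1} : s_{k+1} \times s_0^{p-1} \to s_{k+1}$$
$$w_1 = (v_0, \ldots, v_{2k+2}),\ w_2, \ldots, w_p \mapsto \bigl(f(v_0, w_2, \ldots, w_p),\ f(v_1, w_2, \ldots, w_p),\ v_2, \ldots, v_{2k+2}\bigr)$$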






































































Figure 13: Examples of primitive Shannon variants.
Non-primitive variants are obtained by appending, to primitive
variants, chains of multiplexers $r_{k-\ell} \circ \cdots \circ r_k$ that partially
recombine the virtual output wires of primitive variants
to complete Shannon decompositions started previously.
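The defining property of a variant, t ∘ f′ = f applied to the decoded inputs, can be checked exhaustively for the primitive variants on small cases. The sketch below is illustrative only: `start`, `extend`, and `decode` are hypothetical Python renderings of the definitions, with the first argument of f taken as the virtual input wire and a selector value of 1 assumed to pick the second data wire.

```python
from itertools import product


def mux(sel, in0, in1):
    return in1 if sel else in0


def decode(v):
    """Decode a virtual wire of type s_k (2k+1 components) to its real value."""
    v = tuple(v)
    while len(v) > 1:
        if len(v) == 3:
            v = (mux(v[2], v[0], v[1]),)
        else:
            v = (mux(v[2], v[0], v[1]), mux(v[3], v[0], v[1])) + v[4:]
    return v[0]


def start(f, w1, *rest):
    """start_k: speculate on both values of the (virtual) first input."""
    return (f(0, *rest), f(1, *rest)) + tuple(w1)


def extend(f, w1, *rest):
    """extend_{k+1}: push f through the two data wires of the first input."""
    return (f(w1[0], *rest), f(w1[1], *rest)) + tuple(w1[2:])


def nand(a, b):
    return 1 - (a & b)


def holds(variant, width):
    """Check decode(variant(f, w1, b)) == f(decode(w1), b) for all inputs."""
    return all(
        decode(variant(nand, w1, b)) == nand(decode(w1), b)
        for w1 in product((0, 1), repeat=width)
        for b in (0, 1)
    )
```

Calling `holds(start, 1)`, `holds(start, 3)`, `holds(extend, 3)`, and `holds(extend, 5)` covers start_0, start_1, extend_1, and extend_2 for a two-input NAND.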
3.3 Circuit variants and arrival times
A variant S′ of the circuit S is obtained by consistently replac-
ing the combinational nodes of S by one or several variants of
these nodes and replicating registers accordingly (to latch vir-
tual wires). The obvious constraint is that two nodes connected
by a wire in S′ must agree on its type.
Circuit variants are still circuits. Node variants contain sev-
eral atomic combinational nodes in the sense of Section 2.1
(copies of the initial combinational node, multiplexers) and
compute 2k + 1 Boolean functions at once, forming a single
virtual output wire of type sk (k≥ 0).
Therefore, arrival times in circuit variants can be computed
as described in Section 2.2, but since we are now interested
in the arrival times of macro nodes rather than atomic nodes,
we choose to denote the arrival times of node variants as tuples.
For instance, consider a node variant n with output type
$s_2$ and virtual output wire $(v_0, v_1, v_2, v_3, v_4)$. For simplicity, we
assume that paired virtual output wires, such as $(v_0, v_1)$ or
$(v_2, v_3)$, have the same arrival time.¹ Therefore, if $at(n_0) = at(n_1) = 3$,
$at(n_2) = at(n_3) = 6$, and $at(n_4) = 10$, where $n_i$ is
the atomic node computing $v_i$, we write the arrival time of this
variant as the 3-tuple at(n) = (3, 6, 10).
4 Combinational circuits
In this section and the next, we describe how to efficiently and
systematically build and analyze circuit variants in order to
design variants tailored for retiming—variants that maximize
retiming efficiency. Ideally, we would like to find a variant S′
of the circuit S such that lb(S′) is minimum (cf. Section 2.4).
As a secondary goal, we want to select variants that minimize
node duplications (area).
To start with, we focus on combinational circuits (R = ∅),
meaning that we are simply looking for a circuit with minimum
cycle period (at(spo) minimum).
4.1 Overview
Since S is acyclic, we can build circuit variants by processing
combinational nodes in topological order. Assume we have
built a circuit variant S′ up to node n − 1. There are many
choices for implementing n. First, we have to decide which
variants of n′ ∈ fanin(n) shall drive n. Second, we have to
choose the variant of n itself. By design, the type constraints
of Section 3.3 limit the number of choices. We denote the set
of possible extensions by $S'_n$.
How to choose among node variants? It is not clear which
variant is better. Each fanout may have some special advantage
in using a specific type of virtual output wire that compensates
for a late arrival time (a well-understood issue in technology
mapping). As a result, we must consider several variants for
each node during this construction, and only select optimal
variant(s) for each node at the end.
This suggests three phases. Through a topological traversal
of the circuit, we first compute the “feasible arrival times” for
the node n in any variant of the initial circuit: fat(n). In particular,
fat(spo) will contain the minimum feasible cycle period for
the whole circuit. Then, with a reverse topological traversal, we
extract from these sets one or several “required arrival times”
for each node n: thin(n) ⊆ fat(n). These express feasible local
requirements (i.e., per-node requirements) which, if locally met
by an appropriate choice of node variant(s), guarantee a minimum
cycle period for the whole circuit. Finally, we choose
and wire node variants accordingly to produce a circuit with
minimum cycle period.
¹ We do not apply constant propagation to Shannon variants.
4.2 Feasible arrival times
For any circuit S, fat(n) can be directly computed from the
delay d(n) of the node n in S and the feasible arrival times of
the nodes in its fanin:
$$\mathrm{fat}(n) = \mathrm{combine}\bigl(d(n),\ \{\mathrm{fat}(n')\}_{n' \in \mathrm{fanin}(n)}\bigr) \tag{4}$$
Intuitively, since the arrival time of a node reveals its type
(tuple size), the feasible arrival times of the nodes in fanin(n)
carry enough information to decide whether a given Shan-
non variant of n can be used and how to wire it. Then, for
each possible choice of a variant of n (including the choice of
its wiring), we can compute its arrival time by applying (2)
to each of its inner atomic combinational nodes. For lack of
space, we do not formally define the combine function here.
As a result, the algorithm for the computation of feasible
arrival times for a combinational circuit (Figure 14) is similar
to the algorithm for computing arrival times (Figure 5). The
key differences are that feasible arrival times uses (4) instead
of (2) and includes a pruning operation, described below.
Although finite, the sets fat(n) can be very large. Therefore,
we prune them (remove irrelevant values) on the fly.
Intuitively, the extension q ∈ S′n can be safely discarded iff,
regardless of the following circuitry, there exists an extension
p ∈ S′n that guarantees a better overall cycle period (our main
goal). As a result, we can remove certain elements from fat(n)
without fear of producing an inferior circuit. For instance, if
p has arrival time (6) and q has arrival time (8) then q can be
safely removed from S′n, thus (8) from fat(n), without putting
our construction at risk.
Let $\preceq$ be a partial order on arrival times such that if $p, q \in S'_n$,
$p \neq q$ and $at(p) \preceq at(q)$, then q can be safely removed from
$S'_n$. We can exhaustively discard non-minimal feasible arrival
times using a pruning algorithm (Figure 14).
A good relation $\preceq$ lets us prune fat(n) aggressively. Furthermore,
a suboptimal relation will not affect the optimality of the
final circuit, only the running time of the algorithm.
We choose a simple but effective relation:
$$(p_0, \ldots, p_i) \preceq (q_0, \ldots, q_j) \iff (i \le j) \wedge (\forall k \le i,\ p_k \le q_k)$$
For instance, $(2) \preceq (4) \preceq (4,5) \preceq (4,6) \preceq (4,6,7)$. Because
we know exactly how virtual wires can be recombined using
chains of multiplexers, we are able to compare variants with
different output types as in $(4,6) \preceq (4,6,7)$.
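The relation and the pruning step are simple enough to sketch directly. In the Python below, arrival times are tuples, `precedes(p, q)` is an assumed name for the pruning relation (a tuple that is no longer and componentwise no later), and `prune` keeps only the minimal elements of a set; neither name comes from the paper.

```python
def precedes(p, q):
    """p precedes q iff p is no longer than q and p_k <= q_k for every k
    within the length of p."""
    return len(p) <= len(q) and all(a <= b for a, b in zip(p, q))


def prune(times):
    """Discard every arrival time that some other element precedes."""
    kept = set(times)
    for q in list(kept):
        if any(p != q and precedes(p, q) for p in kept):
            kept.discard(q)
    return kept


pruned = prune({(6,), (8,), (4, 6), (4, 6, 7)})
```

In this example (8) is dropped in favor of (6), and (4,6,7) in favor of (4,6), while (6) and (4,6) are incomparable and both survive. Because the relation is a partial order, removing a dominated element never leaves another dominated element without a surviving dominator.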
procedure FeasibleArrivalTimes(S)
    fat(spi) ← {(0)}
    for each n ∈ V in topological order do
        fat(n) ← Prune(combine(d(n), {fat(n′)}n′∈fanin(n)))
procedure Prune(X)
    while there exist p, q ∈ X such that p ≠ q and p ⪯ q do
        X ← X \ {q}
Figure 14: Feasible arrival times and pruning.
A pruned set always contains exactly one singleton, which
corresponds to a node variant of n having a real output wire.
We write this arrival time opt(n). By construction,
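$$\mathrm{opt}(s_{po}) = \min_{S'} \mathrm{at}_{S'}(s_{po}),$$
where S′ ranges over the variants of S and $\mathrm{at}_{S'}$ denotes arrival times in S′.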




4.3 Required arrival times
By construction, all arrival times in fat(n) are feasible through
an appropriate choice of variants for the nodes k ≤ n. Intuitively,
however, in order to obtain the minimum cycle period, a circuit
variant only needs to achieve some of the arrival times in fat(n)
for each node n. First, we rely on partial pruning. Second, circuit
fragments we initially considered in the computation of
feasible arrival times may end up not being fast enough to be
part of an optimal circuit. Third, the minimum cycle period
may admit several implementations.
We traverse the circuit again to select required arrival times
thin(n)⊆ fat(n) for each node n. Since several variants of the
same node may be required to produce alternate encodings of a
given node output (needed by subsequent nodes in the circuit),
we may end up with more than one required arrival time per
node, hence the need for sets. But in our experiments we found
a single variant was almost always sufficient.
Required arrival times must comply with two constraints.
First, thin(spo) = {opt(spo)}. Second, if a circuit variant
has been built up to node n − 1 to achieve the arrival times
$\{\mathrm{thin}(k)\}_{k<n}$, then it should be possible to extend it by a proper
choice of variants for the node n so as to provide the arrival
times thin(n) for the node n.
We perform the construction—essentially a pruning
operation—through a reverse traversal of the circuit, starting
from node spo. In general, there are several choices for the
thin(n) sets corresponding to several implementations of the
minimum cycle period. We have a crude heuristic (a partial
order on the pruning) that attempts to minimize node duplica-
tions. We omit its description for lack of space.
4.4 Circuit construction
We can now build a circuit variant with minimum cycle pe-
riod. Starting from node spi, we select and wire one or several
variants for each node n so as to achieve the arrival times in
thin(n). By definition of required arrival times, the resulting
circuit achieves the minimum cycle period.
procedure FATBellmanFord(S)
    fat(spi) ← {(0)}
    for each n ∈ V \ {spi} do fat(n) ← {(−∞)}
    repeat
        changes ← false
        for each n ∈ V \ {spi} do









Figure 15: Bellman-Ford for feasible arrival times.
5 Sequential circuits
Let S be a sequential circuit. While we could restrict the in-
put and output types of register nodes in variants of S to be s0
(real wires), we see no reason to impose such a limitation. As
a result, a register node in a variant consists in general of sev-
eral atomic registers, each of them latching a single element of
a virtual wire. Experimental results show that the number of
atomic registers grows typically as fast as the area.
Since S may contain cycles, we can no longer obtain feasible
arrival times using a one-pass traversal of the circuit
(Figure 14); we have to iterate. As in Section 2.5, we assign
delay −c to registers and use a modified Bellman-Ford algorithm
(Figure 15) to compute the fat sets. We say the computation
succeeds iff it terminates and opt(spo) ≤ c.
We rely on the following result: the computation succeeds
for period c iff there exists a variant S′ of S such that lb(S′)≤ c.
Indeed, by definition of feasible arrival times, if the compu-
tation terminates then there exists a variant S′ of S such that
lb(S′) ≤ max{c,opt(spo)}. The converse is less obvious. A
circuit always admits slow variants (lb(S′) > c). As a result,
pruning becomes mandatory to guarantee termination if a fast
variant exists (lb(S′) ≤ c). In other words, the choice of the
pruning relation is no longer a matter of optimization, but the
correctness of the whole procedure depends on the pruning.
We believe we have designed an appropriate pruning relation,
but we do not have a formal proof of this.
Although we have not yet obtained a theoretical bound on
the size of the fat sets or on the number of iterations required to
converge, the numbers remain small even on large examples.
As in the combinational case, we extract required arrival
times from feasible arrival times to choose wire and node
variants that give a circuit variant S′ such that lb(S′) ≤ c. To
achieve this period, we apply retiming to S′, which may re-

















Figure 16: Delay/area trade-off for a 128-bit adder (delay axis in levels of logic).
In summary, given a candidate period for retiming c, we can
decide whether c can be achieved by a retimed Shannon vari-
ant of the initial circuit and, if so, produce such a circuit. In
addition, we can approximate the minimum period achievable
by a retimed variant of S by a binary search.
6 Experiments
We implemented our algorithm for acyclic and cyclic circuits
in C++, using SIS libraries [10] to handle BLIF files. Our test-
ing platform is a 2.5 GHz P4 with 512MB running Linux.
We estimate delays by simply counting levels of logic.
Though imprecise, this is a widely-used estimate.
6.1 Combinational case: adders
As a quick correctness and performance check, we ran the al-
gorithm on a variety of ripple-carry adders ranging from four
to 1024 bits. The algorithm successfully transforms each sample
into an O(log n)-delay carry-select adder.
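The link between Shannon decomposition and carry-select addition can be illustrated with a short Python sketch: each block is computed for both values of its late carry input, and the real carry only drives multiplexers. The fixed block size and the function names are assumptions for illustration; this hand-written version has linear carry-mux depth and is not the O(log n) structure that the algorithm derives.

```python
def to_bits(x, n):
    """Little-endian list of the n low bits of x."""
    return [(x >> i) & 1 for i in range(n)]


def ripple_add(a_bits, b_bits, carry_in):
    """Ripple-carry addition; returns (sum_bits, carry_out)."""
    out, c = [], carry_in
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ c)
        c = (a & b) | (c & (a ^ b))
    return out, c


def carry_select_add(a_bits, b_bits, carry_in, block=4):
    """Shannon-decompose each block on its carry input."""
    out, c = [], carry_in
    for i in range(0, len(a_bits), block):
        a, b = a_bits[i:i + block], b_bits[i:i + block]
        s0, c0 = ripple_add(a, b, 0)  # cofactor for carry = 0
        s1, c1 = ripple_add(a, b, 1)  # cofactor for carry = 1
        out += s1 if c else s0        # multiplexers driven by the late carry
        c = c1 if c else c0
    return out, c
```

Both cofactors of a block depend only on primary inputs, so they can be evaluated before the incoming carry arrives, which is exactly the situation where the decomposition pays off.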
We have further extended the algorithm for the combina-
tional case to enable a trade-off between delay and area. For
the 128-bit adder, we vary the required delay so as to mea-
sure the efficiency of our area minimization heuristics (Sec-
tion 4.3). The values in Figure 16 were measured after a fi-
nal two-input decomposition in SIS. All 120 points were com-
puted in 88 s, of which our algorithm takes only 24 s.
6.2 ISCAS89 sequential benchmarks
For the sequential case, we first select an approximately optimal
period by binary search. We start from a candidate period
equal to half the delay of the critical path and stop when we
have obtained an infeasible period $c_u$ and a feasible period $c_f$
such that $c_f < c_u + 1/2$. We then build a circuit variant for
period $c_f$.
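A minimal sketch of this search loop follows. The text specifies only the first candidate and the stopping rule, so the rest is assumed here: the initial bracket (period 0 treated as infeasible, the critical-path delay as feasible), plain bisection between probes, and the names `search_period` and `is_feasible`. The oracle `is_feasible` stands in for the feasible-arrival-time computation of Section 5.

```python
def search_period(is_feasible, critical_delay):
    """Bisect until c_f < c_u + 1/2 and return the feasible period c_f."""
    c_u, c_f = 0.0, float(critical_delay)  # assumed initial bracket
    c = critical_delay / 2                 # first candidate: half the critical path
    while c_f >= c_u + 0.5:
        if is_feasible(c):
            c_f = c
        else:
            c_u = c
        c = (c_u + c_f) / 2
    return c_f


# Hypothetical oracle: every period of at least 8.3 levels is achievable.
c_f = search_period(lambda c: c >= 8.3, critical_delay=24)
```

With this oracle the probes are 12, 6, 9, 7.5, 8.25, and 8.625, and the loop stops with a feasible period less than half a level above the true threshold.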
Because they are widely available, we considered mid-sized
ISCAS89 sequential benchmarks. We target an FPGA-like,
three-input lookup-table architecture. Hence, we report delay
and area as levels and numbers of lookup tables.
          Reference       Retimed         Shannon-R       Time    Speed-  Area
Circuit   period  area    period  area    period  area    (s)     up      penalty
s510         8     184       8     184       8     184      0.5
s641        11     115      11     115       9     122      1.1   22%      6%
s713        11     118      11     118      10     121      0.9   10%      3%
s820         7     206       7     206       7     206      0.5
s832         7     217       7     217       7     217      0.4
s838        10     154      10     154       8     162      2.6   25%      5%
s1196        9     365       9     365       9     365      0.6
s1423       24     408      21     408      13     460      3.8   61%     12%
s1488        6     453       6     453       6     453      0.7
s1494        6     456       6     456       6     456      0.8
s9234       11     662       8     656       8     684      6.7
s13207      14    1382      11    1356       9    1416     18.0   22%      4%
s38417      14    7706      14    7652      13    7871    113.0    7%      3%
Table 2: Results on ISCAS89 sequential benchmarks.
Following Saldanha et al. [9], for each sample, we first
run script.rugged and perform a speed-oriented decomposition:
decomp -g; eliminate -1; sweep; speed_up -i. We then reduce
the depth of the circuit while keeping the nodes three-feasible
using reduce_depth -f 3 [14]. We consider the above flow a
classical FPGA delay-oriented one. The results are reported in
Table 2 under “Reference.”
Starting from these optimized circuits, we either directly
execute retiming (retime -n -i, modified to use the unit de-
lay model) as reported in column “Retimed,” or run our al-
gorithm followed by retiming (“Shannon-R”). We verified our
algorithm produced functionally-correct circuits by comparing
them with the originals using VIS [2].
Although we produced no improvement on half of the sam-
ples, we realize a significant speed-up for the other half with
only a 5% area increase on average. The algorithm is very fast.
In particular, if no improvement can be made, its running time
is negligible. Otherwise, it appears linear in the circuit size.
The memory requirement is low (e.g., 70 MB for our largest
circuit, s38417). Hence, our technique seems to scale well.
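The speed-up and area-penalty columns of Table 2 can be recomputed from the period and area columns. The short Python check below assumes, consistently with the reported figures, that both are measured relative to the “Retimed” flow: speed-up as retimed period over Shannon-R period minus one, and area penalty as Shannon-R area over retimed area minus one.

```python
# circuit: (retimed period, retimed area, Shannon-R period, Shannon-R area)
improved = {
    "s641": (11, 115, 9, 122),
    "s713": (11, 118, 10, 121),
    "s838": (10, 154, 8, 162),
    "s1423": (21, 408, 13, 460),
    "s13207": (11, 1356, 9, 1416),
    "s38417": (14, 7652, 13, 7871),
}

speedup = {k: 100 * (rp / sp - 1) for k, (rp, ra, sp, sa) in improved.items()}
penalty = {k: 100 * (sa / ra - 1) for k, (rp, ra, sp, sa) in improved.items()}

mean_speedup = sum(speedup.values()) / len(speedup)
mean_penalty = sum(penalty.values()) / len(penalty)
```

The per-circuit values match the table after truncation to whole percentages, and the means come out close to the 25% speed-up and roughly 5% area increase quoted in the text.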
7 Conclusions
In this paper, we propose an algorithm that applies Shannon
decomposition to enhance retiming opportunities on circuits
with tight sequential feedback loops. Provided with an initial
circuit and a desired period for this circuit, it tries to identify a
series of Shannon decompositions that would make the period
achievable through retiming. We approximate the best feasible
period with a binary search.
A carefully-designed set of Shannon variants bounds the
area penalty. We further reduce area with additional heuristics.
Our technique is sound. If the algorithm produces a circuit,
then the target period can be achieved by retiming, provided
the combinational nodes can be arbitrarily pipelined.
While we have not yet proved completeness (i.e., if the pe-
riod is feasible then the algorithm achieves it), experimental
results show significant improvements.
References
[1] C. L. Berman, D. J. Hathaway, A. S. LaPaugh, and
L. Trevillyan. Efficient techniques for timing correction.
In Proc. ISCAS, pages 415–419, 1990.
[2] R. K. Brayton, G. D. Hachtel, A. L. Sangiovanni-
Vincentelli, F. Somenzi, A. Aziz, S.-T. Cheng, S. Ed-
wards, S. Khatri, Y. Kukimoto, A. Pardo, S. Qadeer, R. K.
Ranjan, S. Sarwary, T. R. Shiple, G. Swamy, and T. Villa.
VIS: a system for verification and synthesis. In Proc.
CAV, pages 428–432, 1996.
[3] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein.
The Bellman-Ford algorithm. In Introduction to Algo-
rithms, pages 588–591. Prentice Hall, 2002.
[4] Soha Hassoun and Carl Ebeling. Architectural retiming:
pipelining latency-constrained circuits. In Proc. DAC,
pages 708–713, 1996.
[5] C. E. Leiserson and J. B. Saxe. Retiming synchronous
circuitry. Algorithmica, 6(1):5–35, 1991.
[6] S. Malik, E. M. Sentovich, R. K. Brayton, and A. L.
Sangiovanni-Vincentelli. Retiming and resynthesis: Op-
timizing sequential networks with combinational tech-
niques. IEEE Transactions on CAD, 10(1):74–84, 1991.
[7] Maria-Cristina V. Marinescu and Martin Rinard. High-
level automatic pipelining for sequential circuits. In
Proc. ISSS, pages 215–220, 2001.
[8] Patrick C. McGeer, Robert K. Brayton, Alberto L.
Sangiovanni-Vincentelli, and Sartaj K. Sahni. Per-
formance enhancement through the generalized bypass
transform. In Proc. ICCAD, pages 184–187, 1991.
[9] Alexander Saldanha, Heather Harkness, Patrick C.
McGeer, Robert K. Brayton, and Alberto L.
Sangiovanni-Vincentelli. Performance optimization
using exact sensitization. In Proc. DAC, pages 425–429,
1994.
[10] E. M. Sentovich, K. J. Singh, L. Lavagno, C. Moon,
R. Murgai, A. Saldanha, H. Savoj, P. R. Stephan, R. K.
Brayton, and A. L. Sangiovanni-Vincentelli. SIS: A sys-
tem for sequential circuit synthesis. Technical report,
UCB/ERL M92/41, 1992.
[11] K. J. Singh. Performance optimization of digital circuits.
PhD thesis, UCB, 1992.
[12] Kanwar J. Singh, Albert R. Wang, Robert K. Brayton,
and Alberto L. Sangiovanni-Vincentelli. Timing opti-
mization of combinational logic. In Proc. ICCAD, pages
282–285, 1988.
[13] H. Touati, N. Shenoy, and A. L. Sangiovanni-Vincentelli.
Retiming for table-lookup field-programmable gate ar-
rays. In Proc. ACM/SIGDA international Workshop on
Field Programmable Gate Arrays, pages 89–93, 1992.
[14] Hervé Touati, Hamid Savoj, and Robert K. Brayton. Delay
optimization of combinational logic circuits by clustering
and partial collapsing. In Proc. ICCAD, pages
188–191, 1991.
