Retiming DAGs by Calland, Pierre-Yves et al.
HAL Id: hal-02101755
https://hal-lara.archives-ouvertes.fr/hal-02101755
Submitted on 17 Apr 2019
Retiming DAGs
Pierre-Yves Calland, Anne Mignotte, Olivier Peyran, Yves Robert, Frédéric
Vivien
To cite this version:
Pierre-Yves Calland, Anne Mignotte, Olivier Peyran, Yves Robert, Frédéric Vivien. Retiming DAGs.
[Research Report] LIP RR-1995-18, Laboratoire de l’informatique du parallélisme. 1995, 2+20p. hal-02101755
Laboratoire de l’Informatique du Parallélisme
Ecole Normale Supérieure de Lyon
Research unit associated with CNRS, no. 1398
September 1995
Research Report No RR-1995-18
E-mail: lip@lip.ens-lyon.fr
Phone: (+33) 72.72.80.00    Fax: (+33) 72.72.80.80
46 Allée d’Italie, 69364 Lyon Cedex 07, France
Abstract
The increasing complexity of digital circuitry makes global design optimization no longer possible: a designer will only consider the critical parts of the circuit. This paper discusses timing optimization problems when these critical parts can be represented by Directed Acyclic Graphs (DAGs). We deal with different, though related, clock period problems under various external constraints. The main algorithm concerns the determination of the optimal clock period of a given circuit. We propose an efficient solution, based on the retiming technique, which improves on current results in the literature. We give three different formulations, depending on which delay model is used for the combinational gates: same delay for every gate, different delays, or nonuniform delays.
Keywords: Directed Acyclic Graph, clock period, optimization, retiming, delay model, registers
Resume
La complexite grandissante des circuits numeriques rend desormais impossible toute
optimisation globale d	un circuit  les concepteurs ne consid
erent plus que les parties cri
tiques de leur circuit Cet article traite des probl
emes d	optimisation temporelle lorsque
ces parties critiques peuvent etre representees par des graphes acycliques orientes Nous
considerons dierents probl
emes concernant les contraintes sur la periode d	horloge qui
interviennent lors de la conception d	un circuit et qui decoulent de la determination
de la periode optimale du circuit Nous proposons un algorithme de faible complex
ite base sur la technique de resynchronisation pour resoudre ce probl
eme essentiel et
nous donnons trois formulations dierentes dependant du mod
ele de delai utilise pour
les portes combinatoires  meme delai pour toutes les portes delais dierents ou delais
nonuniformes
Motscles  Graphe acyclique oriente periode d	horloge optimisation mod
ele de delai registres
P.-Y. Calland, A. Mignotte, O. Peyran, Y. Robert and F. Vivien
LIP, CNRS URA 1398, Ecole Normale Supérieure de Lyon
69364 Lyon Cedex 07, France
e-mail: [Pierre-Yves.Calland, Anne.Mignotte, Olivier.Peyran, Yves.Robert, Frederic.Vivien]@lip.ens-lyon.fr
Contents
1 Introduction
2 Motivation
3 Review of retiming techniques
    3.1 A graph-theoretic framework
    3.2 Review of related problems
    3.3 New results
4 Clock period minimization for a DAG with unit delay operators
    4.1 Circuit period
    4.2 Two pass Algorithm
    4.3 Proof and Complexity
5 Clock period minimization for a DAG with operators of any delay
    5.1 Some bounds on $\Phi_{opt}$
    5.2 From a register distribution $w$ and a clock period $T$
    5.3 Algorithm and complexity
6 Clock period minimization for a DAG with operators of nonuniform delay
    6.1 Some Definitions
    6.2 Equivalence with the previous case
    6.3 Algorithm and complexity
7 Conclusion


1 Introduction
Retiming is a technique used to optimize synchronous VLSI circuits. The basic idea is to relocate registers along the paths of the circuit so as to reduce combinational rippling. The rule of the game is that the functional behavior of the circuit as a whole is preserved. There are different cost criteria to evaluate the efficiency of a retiming, but minimizing the clock period and/or the state (the total number of registers) are the most frequently used. The survey paper of Leiserson and Saxe [?] gives several polynomial-time algorithms to compute the optimal retiming of a given circuit under these two cost criteria.
Pipelining VLSI circuits can be viewed as another instance of the retiming problem. Given a circuit with combinational elements, the problem is to determine the minimum number of registers that should be added, and where to insert them, so as to achieve a fixed clock period when operating in pipelined mode (hence the name pipelining). The rule of the game is still that the functional behavior of the circuit is preserved, but the price to pay here is an increase of the latency.
Section 3 is devoted to a review of known results concerning various instances of the retiming problem. We formulate all these instances via a unified graph-theoretic framework, as introduced in [?].
The main contribution of the paper is a new algorithm for retiming circuits without cycles, i.e. circuits whose network graph is a DAG (Directed Acyclic Graph). In a word, our algorithm for retiming a DAG only requires two passes over the network graph, as opposed to $|V|$ passes in [?], where $|V|$ is the number of nodes in the network graph (see Section 3 below for formal definitions).
The rest of the paper is organized as follows. As already stated, Section 3 is devoted to a review of the existing literature, together with a formal statement of our main results, whose proofs are given in Section 4 for operators with the same delay, in Section 5 for operators with arbitrary delays, and in Section 6 for operators with nonuniform delays. Section 2 emphasizes the practical importance and usefulness of scheduling DAG circuits: a target application is the design optimization of on-line arithmetic operators. We give some final remarks and conclusions in Section 7.
2 Motivation
In on-line arithmetic, the operands circulate through the operators in a digit-serial fashion, most significant digit first. On-line arithmetic was introduced by Ercegovac and Trivedi [?], who proposed algorithms for on-line multiplication and division. The operands enter the design at time $t$, and the first digit of the result is output at time $t + \delta$, where $\delta$ is the latency of the design. The main advantages of on-line arithmetic are the low resulting clock period, even for complex applications, and the simplified accuracy control. Since on-line arithmetic performs computations most significant digit first, each new digit of the result brings additional precision, and the computation can stop as soon as the required accuracy is reached.
This serialization is possible only if all the operands circulate most significant digit first. Unfortunately, the usual serial algorithms for addition and multiplication work least significant digit first. A specific number system is therefore used, namely the signed-digit redundant number system, whose redundancy makes it possible to perform addition and multiplication with no carry propagation. For instance, Figure 1 shows a borrow-save adder for redundant arithmetic.
Several algorithms and corresponding architectures have been proposed to perform basic and complex operators such as addition, multiplication, division, square root and trigonometric functions [?], most significant digit first. These operators are already temporized so as to satisfy a specific behavior. They define a pipelined operator library for on-line arithmetic. For instance, Figure 2 shows a serial borrow-save adder. It corresponds to one stage of the adder of Figure 1. Registers are inserted to make sure that the bits of $c$ and $b$ produced by the previous stage are the ones used to compute the two bits of the output digit $s$.
[Figure 1: schematic of a borrow-save adder built from PPM cells. Each PPM cell has inputs $x$, $y$, $z$ and outputs $t$, $u$ such that $x + y - z = 2t - u$, with $t = \mathrm{maj}(x, y, \bar{z})$ and $u = x \oplus y \oplus z$.]
Figure 1: A borrow-save adder made of PPM (Plus Plus Minus) cells. The digits $a_i$, $b_i$ and $s_i$, whose values are $-1$, $0$ or $1$, are represented by the bits $a_i^+$, $a_i^-$, $b_i^+$, etc., such that $a_i = a_i^+ - a_i^-$, $b_i = b_i^+ - b_i^-$, and so on.
Thus a given application using on line arithmetic is pipelined by construction This determines
a given latency  and a given clock period corresponding to the critical path of the design ie the
longest path in terms of cells with no register Let us illustrate this on the application of Figure a
The critical path of the application is  cells
s n 
s n 
c n 
b n 
c n
b n
a n
a n
b n
Figure   A latency borrowsave serial adder working most signicant digit rst The square
cells are registers

Nevertheless, the resulting clock period may not be satisfactory for a designer. Therefore one may have to transform the design, either by grouping digits to increase the clock period or, on the contrary, by inserting timing barriers between operators to reduce the clock period. We will focus on the second transformation, as the first one only deals with the pipelined operator library. Suppose that the designer inserts a timing barrier as shown in Figure 3(b): the critical path is reduced to [?] cells. Unfortunately, in that case the critical path is not optimized: it can be reduced to [?] cells by moving some registers, as shown in Figure 3(c).
The problem of moving registers to optimize the clock period while preserving the overall behavior is the well-known retiming problem. Moreover, on-line arithmetic has two main properties: the design is only made up of two basic cells (PPM and half adders), and there is no loop in the applications. Thus we can sensibly consider that all the cells of the design have the same delay, and that the design can be described by a DAG. These two properties will be exploited to reduce the algorithmic complexity of retiming, as described in the next sections.
[Figure 3, three panels: (a) a design made up of 4 serial adders; (b) design (a) with one timing barrier; (c) design (b) after retiming. Legend: PPM cell, memory point, borrow-save adder.]
Figure 3: Adding and moving registers to decrease the period
3 Review of retiming techniques
3.1 A graph-theoretic framework
Computational devices (VLSI circuits or VHDL programs) are represented by a finite, connected, vertex-weighted, edge-weighted, directed multigraph $G = (V, E, d, w)$. The vertices of the graph, or nodes, model the computational elements. Each vertex $v \in V$ is weighted with its nonnegative delay $d(v)$. Vertex delays can be rational numbers but, since the graph is finite, we can always change the time unit to have integer delays. The directed edges $E$ of the graph model the interconnections between functional elements. Each edge $e \in E$ is weighted with a register count $w(e)$ (the register count corresponds to the number of wait until clock statements in VHDL programs). Edge weights are nonnegative integers.
We need a few definitions and notations.
For an edge $e : u \to v$, we write $u = t(e)$ for the tail of $e$ and $v = h(e)$ for the head of $e$. We define as input nodes (respectively output nodes) the nodes whose indegree (respectively outdegree) is zero.
We add two special vertices, $v_{from\_host}$ and $v_{to\_host}$, with zero propagation delay, to model the interface of the graph with the external world: $d(v_{from\_host}) = d(v_{to\_host}) = 0$. We let $\bar{V} = V \cup \{v_{from\_host}, v_{to\_host}\}$.
Finally, we define a null-weighted edge from $v_{from\_host}$ to each input node and a null-weighted edge from each output node to $v_{to\_host}$. These edges are called, respectively, input edges and output edges. They form the set of interface edges, which we denote by $I$. We let $\bar{E} = E \cup I$.
We use the same conventions as in [?]. For any simple path $P = v_0 \xrightarrow{e_0} v_1 \xrightarrow{e_1} \cdots \xrightarrow{e_{k-1}} v_k$, we define the path tail $t(P) = v_0$ as the tail of its first edge, the path head $h(P) = v_k$ as the head of its last edge, the path delay $d(P) = \sum_{i=0}^{k} d(v_i)$ as the sum of the delays of the vertices of $P$, and the path weight $w(P) = \sum_{i=0}^{k-1} w(e_i)$ as the sum of the weights of the edges of $P$. We denote by $l(P)$ the length of $P$, i.e. the number of edges in $P$. If $k = 0$, we let $l(P) = 0$, $d(P) = d(v_0)$ and $w(P) = 0$.
We assume that $G$ does not contain any zero-weight cycle: in any directed cycle $C$ of $G$ there is some edge with strictly positive weight, i.e. $w(C) > 0$. This condition ensures that the operation of $G$ is well defined (in the case of VLSI circuits, we say that $G$ is synchronous). The clock period of $G$ is then well defined by the equation $\Phi(G) = \max\{ d(P) : w(P) = 0 \}$. Intuitively, the clock period is the maximum amount of propagation delay through which any signal must ripple between clock ticks. The state of $G$ is defined as the sum of the registers over all edges: $S(G) = \sum_{e \in E} w(e)$.
Retiming is an assignment of an integer lag $r(v)$ to each vertex $v \in V$: it amounts to removing $r(v)$ registers from each edge leaving $v$ (whose tail is $v$) and adding $r(v)$ registers to each edge entering $v$ (whose head is $v$). Formally, a retiming function $r$ is a mapping from $\bar{V}$ to $\mathbb{Z}$ such that $r(v_{from\_host}) = r(v_{to\_host}) = 0$. It leads to a new edge-weighting function $w_r$, defined for an edge $e : u \to v$ by $w_r(e) = w(e) + r(v) - r(u)$. A retiming is legal if the new edge weights $w_r$ are all nonnegative. Obviously, a legal retiming does not change the global behavior of the computational graph $G$, but both the clock period $\Phi_r(G)$ and the total number of registers $S_r(G) = \sum_{e \in E} w_r(e)$ are altered.
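The effect of a retiming on the edge weights is easy to state in code. This is a minimal sketch under our own conventions (edges as `(tail, head, weight)` triples, lags in a dictionary, missing lags taken as zero so that the two host vertices keep lag 0); none of these names come from the report.

```python
def retime(edges, lag):
    """Apply w_r(e) = w(e) + r(head) - r(tail) to every edge."""
    return [(u, v, w + lag.get(v, 0) - lag.get(u, 0)) for u, v, w in edges]

def is_legal(edges):
    """A retiming is legal when every retimed edge weight is nonnegative."""
    return all(w >= 0 for _, _, w in edges)

# Hypothetical two-node chain between the host vertices, one register at the end.
chain = [("v_from_host", "a", 0), ("a", "b", 0), ("b", "v_to_host", 1)]
moved = retime(chain, {"b": 1})      # pulls the register across b
broken = retime(chain, {"a": -1})    # would need a register the input edge does not have
```

Here `moved` carries the register on the edge from `a` to `b`, while `broken` has a negative weight on the input edge and is therefore not legal.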
Several problems can be formulated:
Problem 1: Given a graph $G = (V, E, d, w)$ and a maximum allowable clock period $c$, find a legal retiming $r$ such that $\Phi_r(G) \le c$.
Problem 2: Given a graph $G = (V, E, d, w)$, find a legal retiming $r$ such that the clock period $\Phi_r(G)$ of the retimed circuit $G_r = (V, E, d, w_r)$ is as small as possible.
Problem 3: Given a graph $G = (V, E, d, w)$ and a maximum allowable clock period $c$, find the smallest nonnegative integer $k$ such that, when adding $k$ registers to each input edge, there exists a legal retiming $r$ such that $\Phi_r(G) \le c$.
Problem 1 is the basic retiming problem and can be solved in $O(|V||E|)$ in the most general instance, see [?]. We point out that Problem 1 is sometimes formulated as a register minimization problem: given $c$, determine a legal retiming $r$ such that $\Phi_r(G) \le c$ and $S_r(G)$ is as small as possible. The complexity of this variant is dominated by the solution of a minimum cost flow problem (see [?] and the references therein for several bounds).
Problem 2 is a clock minimization problem and can be solved in $O(|V||E| \log |V|)$ in the most general formulation, see [?]. In fact, it turns out that a way to solve Problem 2 is to repeatedly solve several instances of Problem 1, using a dichotomic search for $c$. A particular case of Problem 2 supposes that the maximal delay of a node, $D = \max\{ d(v) : v \in V \}$, grows at most polynomially with respect to the number of functional elements in the circuit, i.e. that there exists a polynomial function $P$ such that $D(G) \le P(|V|)$. In this case the algorithm complexity becomes $O(|V||E| \log D)$, see [?].
Problem 3 is a circuit scheduling problem that can be solved by solving several instances of Problem 1, using a dichotomic search for $k$, which is bounded above by $|V|$, hence a complexity of $O(|V||E| \log |V|)$. When the graph $G$ is a DAG given with no registers at all ($S(G) = 0$), Problem 3 is a formulation of the pipelining problem. In this particular case, the complexity of Problem 3 is reduced to $O(|E| \log D)$, see [?].
3.2 Review of related problems
Retiming techniques have been applied to several related problems. Retiming was first introduced by Leiserson and Saxe in [?] in order to solve Problem 1 for systolic circuits (unit-delay circuits with at least one register between two functional elements). Their solution, using the Bellman-Ford algorithm, requires $O(|E||V|)$ time.
Wehn et al. linked high-level synthesis and retiming in [?], using the correspondence between wait until statements and registers. Using the conditions defined by Leiserson and Saxe, they stated different problems, like As-Soon-As-Possible scheduling or minimization of registers, as linear programming problems. They expressed these problems with various objective functions. However, they did not provide any method that solves them efficiently.
Some researchers proposed interesting results for simpler problems, like Papaefthymiou in [?], who proposed an algorithm which gives the minimum clock period of a circuit with at most $l$ levels of registers, where $l$ is a given value (this problem is the dual of Problem 3). The algorithm runs in $O(|E| \log |V|)$, but the initial circuit has to be empty of registers. Therefore it is only interesting when one needs to pipeline a combinational circuit.
In the same vein of optimizing a combinational circuit, Munzer and Hemme proposed in [?] an algorithm in $O(|E| |V|)$ to solve the register minimization problem. Starting from a register-empty circuit, they compute As-Soon-As-Possible and As-Late-As-Possible register locations, under the constraint of satisfying a given clock period. The two locations determine the parts of the circuit where registers are likely to be moved. Within these parts, they use a maximal flow algorithm to find the minimal number of registers.
Considering sequential circuits, Wei-Jeng Cheng et al. showed the relationship between retiming and loop folding [?]. They propose a solution to Problem [?] using an As-Soon-As-Possible pipeline algorithm. However, they did not really explain the way they move the registers within the circuit, which is the most important step in terms of complexity.
Finally, Leiserson and Saxe improved their technique [?] and proposed several algorithms to solve Problem 2 by capturing it as a linear-programming problem, the best one running in $O(|E||V| \log |V|)$. They also have a solution to the register minimization problem in the case of sequential circuits. They express this problem as a minimum cost flow problem, after having augmented the graph representing the circuit with virtual vertices and edges in order to obtain a suitable cost function. The complexity is $O(|V| \log |V|)$.
3.3 New results
All these problems can be particularized to the case where $G$ is a DAG (with a nonzero state for Problem 3). Our main results show that the complexity of the three problems can be significantly decreased for DAGs, as stated in the following theorems.
Theorem 1 Given a DAG $G = (V, E, d, w)$ and a maximum allowable clock period $c$, a legal retiming $r$ such that $\Phi_r(G) \le c$ can be found in $O(|E|)$, if it exists.
Theorem 2 Given a DAG $G = (V, E, d, w)$, let $D = \max_{v \in V} d(v)$. A legal retiming $r$ such that the clock period $\Phi_r(G)$ of the retimed circuit $G_r = (V, E, d, w_r)$ is as small as possible can be found in $O(|E| \log |V| \log D)$.
Theorem 3 Given a DAG $G = (V, E, d, w)$ and a maximum allowable clock period $c$, let $D = \max_{v \in V} d(v)$. Let $k$ be the smallest nonnegative integer such that, when adding $k$ registers to each input edge, there exists a legal retiming $r$ such that $\Phi_r(G) \le c$. Then $k$ and the corresponding retimed circuit can be found in $O(|E|)$.
We prove these theorems in the following sections.
4 Clock period minimization for a DAG with unit delay operators
In this section we present an algorithm which minimizes the period of a DAG whose operators all have the same delay. Without loss of generality, we let $d(v) = 1$ for all $v \in \bar{V} \setminus \{v_{from\_host}, v_{to\_host}\}$. In Subsection 4.1 we propose a lower bound on the period. In Subsection 4.2 we describe an algorithm which achieves this lower bound. In Subsection 4.3 we prove the correctness of our algorithm and we give its complexity.
4.1 Circuit period
Period value
Lemma 1 Let $G = (V, E, d, w)$ be a DAG. The minimum clock period $\Phi_{opt}(G)$ of $G$ satisfies:
\[ \Phi_{opt}(G) \ge \max\{ d(v) : v \in V \} \]
\[ \Phi_{opt}(G) \ge \max\left\{ \frac{d(P)}{w(P) + 1} : P \text{ path of } G,\ t(P) = v_{from\_host} \text{ and } h(P) = v_{to\_host} \right\} \]
Proof of Lemma 1: The first bound is true by definition: a path $P$ of length 0 satisfies $w(P) = 0$. Let us prove the second bound.
Suppose that $G$ has been retimed with respect to the period $\Phi_{opt}(G)$. Let $P$ be a path of $G$ from $v_{from\_host}$ to $v_{to\_host}$. Since every operator has unit delay, $d(P)$ is equal to the number of vertices in $P$, except the two extra vertices $v_{from\_host}$ and $v_{to\_host}$. Therefore $P$ contains $l(P) = d(P) + 1$ edges and $w(P)$ registers. Furthermore, between two consecutive registers of $P$ there are at most $\Phi_{opt}(G)$ nodes, i.e. $\Phi_{opt}(G) - 1$ arcs without registers. For the same reason, between $v_{from\_host}$ (resp. $v_{to\_host}$) and the first (resp. last) register, there are at most $\Phi_{opt}(G)$ arcs without registers, counting the input (resp. output) edge. Now let us count the number of edges in $P$, i.e. the length $l(P)$ of $P$, counting separately edges with registers and edges between registers. This leads to:
\[ l(P) \le (\Phi_{opt}(G) - 1)(w(P) - 1) + w(P) + \Phi_{opt}(G) + \Phi_{opt}(G) \]
i.e.
\[ l(P) \le \Phi_{opt}(G)\,(w(P) + 1) + 1 \]
i.e.
\[ \Phi_{opt}(G) \ge \frac{l(P) - 1}{w(P) + 1} = \frac{d(P)}{w(P) + 1} \]
This proves the lemma, since $l(P)$ and $w(P)$ are unchanged by retiming.
In Subsection  we present an algorithm which achieves this lower bound thereby proving the
following theorem 

Theorem  Let G be a DAG The minimal period optG of G is	
optG  max
 
max
vV
dvmax
 
dP 
wP   
  P path of G tP   vfrom host hP   vto host


Period computation
Let us build a graph $G'$ identical to $G$, except that an edge $e_{round} : v_{to\_host} \to v_{from\_host}$ is added. This edge carries a single register: $w(e_{round}) = 1$.
Lemma 2 The value of the minimum cycle mean of $G'$ is the inverse of the clock period of the circuit $G$, if $\Phi_{opt}(G) > \max_{v \in V} d(v)$.
Proof of Lemma 2: The minimum cycle mean of a directed graph is always reached on an elementary cycle.
Let $C$ be an elementary cycle of $G'$. $C$ can be decomposed as $e_{round}, e_1, \ldots, e_k$. We consider the path $P$ of $G$ equal to $e_1, \ldots, e_k$. By construction, $w(C) = w(P) + 1$ and $d(C) = d(P)$. The mean value on $C$ is then equal to $\frac{w(P) + 1}{d(P)}$.
Thus the cycle mean of an elementary cycle of $G'$ is minimal exactly when it is the inverse of the largest value of $\frac{d(P)}{w(P) + 1}$ over the paths $P$ of $G$ from an input node to an output one, i.e. the inverse of the period.
The period of a graph $G$ can thus be computed in $O(|V(G')||E(G')| \log |V(G')|)$, i.e. $O(|V(G)||E(G)| \log |V(G)|)$, by a dichotomic use of the Bellman-Ford algorithm, as the computation of $\max_{v \in V} d(v)$ can be done in $O(|V(G)|)$. Note that we can also use Karp's mean weight cycle algorithm (see [?]) on a graph $G''$ built from $G$: in $G''$ we collapse the input nodes, we collapse the output nodes, we link these new nodes with an edge carrying a single register, and we cut out the two special nodes. The complexity of this second solution is $O(|V(G)| \cdot |E(G)|)$.
We will see in Subsection 4.3 that this precomputation of the period can be more efficiently replaced by a dichotomic search of the optimal period with the help of our two-pass algorithm.
4.2 Two pass Algorithm
The following algorithm operates through two passes over $G$ to determine a legal retiming $r$ such that $\Phi_r(G) \le T$. In the first pass, all registers are moved as close as possible to the node $v_{from\_host}$, with respect to the retiming rules: when processing a node, we can add (remove) the same number of registers on each incoming (outgoing) edge, provided that the weight of each edge remains nonnegative. In the second pass, all registers are moved as far as possible from the node $v_{from\_host}$, with respect to the same rules and also with respect to the target period $T$.
Algorithm
Remember that $V$ is the set of vertices of $G$ with nonzero in- and outdegrees, and that the set of all vertices of $G$ is $\bar{V} = V \cup \{v_{from\_host}, v_{to\_host}\}$.
Consistency test
    if $T < \max_{v \in V} d(v)$ then ERROR endif
First pass
    for each vertex $v \in V$ in reverse topological order
    do
        $n := \min\{ w(e) : e \in \bar{E},\ t(e) = v \}$
        for all $e \in \bar{E}$ such that $t(e) = v$: $w(e) := w(e) - n$
        for all $e \in \bar{E}$ such that $h(e) = v$: $w(e) := w(e) + n$
    enddo
Second pass
    $\Delta(v_{from\_host}) := 0$
    for each vertex $v \in V$ in topological order
    do
        $n := \min\{ w(e) : e \in \bar{E},\ h(e) = v \}$
        $\Delta(v) := d(v) + \max\{ \Delta(t(e)) : e \in \bar{E},\ h(e) = v \text{ and } w(e) = n \}$
        /* $\Delta$ computation statement */
        if $\Delta(v) > T$
        /* Leave a register case */
        then
            if $n = 0$ then ERROR endif
            $n := n - 1$
            $\Delta(v) := d(v)$
        endif
        for all $e \in \bar{E}$ such that $h(e) = v$: $w(e) := w(e) - n$
        for all $e \in \bar{E}$ such that $t(e) = v$: $w(e) := w(e) + n$
    enddo
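The two passes can be transcribed almost line for line. The following Python sketch is our reading of the algorithm above, not code from the report: the data layout (a `delay` dictionary that includes the two host vertices, and `(tail, head, weight)` triples), the function name, and the choice of returning `None` for the ERROR case are all assumptions. The correctness statement proved below only covers unit delays.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def two_pass_retiming(delay, edges, T, src="v_from_host", dst="v_to_host"):
    """Return retimed weights (in the order of `edges`), or None in the ERROR case."""
    w = [weight for _, _, weight in edges]
    ins = {v: [] for v in delay}     # indices of the edges entering each vertex
    outs = {v: [] for v in delay}    # indices of the edges leaving each vertex
    for i, (tail, head, _) in enumerate(edges):
        outs[tail].append(i)
        ins[head].append(i)
    preds = {v: [edges[i][0] for i in ins[v]] for v in delay}
    order = [v for v in TopologicalSorter(preds).static_order()
             if v not in (src, dst)]

    # Consistency test.
    if T < max(delay[v] for v in order):
        return None

    # First pass: pull the registers as close as possible to v_from_host.
    for v in reversed(order):
        n = min(w[i] for i in outs[v])
        for i in outs[v]:
            w[i] -= n
        for i in ins[v]:
            w[i] += n

    # Second pass: push the registers away from v_from_host, subject to T.
    Delta = {src: 0}
    for v in order:
        n = min(w[i] for i in ins[v])
        Delta[v] = delay[v] + max(Delta[edges[i][0]] for i in ins[v] if w[i] == n)
        if Delta[v] > T:             # leave-a-register case
            if n == 0:
                return None          # ERROR: no register left to keep in front of v
            n -= 1
            Delta[v] = delay[v]
        for i in ins[v]:
            w[i] -= n
        for i in outs[v]:
            w[i] += n
    return w

# Hypothetical chain of four unit-delay operators with a single register at the end.
chain_delay = {"v_from_host": 0, "a": 1, "b": 1, "c": 1, "d": 1, "v_to_host": 0}
chain_edges = [("v_from_host", "a", 0), ("a", "b", 0), ("b", "c", 0),
               ("c", "d", 0), ("d", "v_to_host", 1)]
```

With `T = 2` the single register of this chain ends up between `b` and `c`, giving two stages of two operators; with `T = 1` the second pass runs out of registers and the sketch returns `None`.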
Lemma 3 Let $v \in V$ be a node of $G$. After the second pass of the two-pass algorithm, if the ERROR case does not occur, $\Delta(v)$ is the largest delay of a path without register that ends at $v$:
\[ \Delta(v) = \max\{ d(P) : P \text{ path of } G,\ w(P) = 0,\ h(P) = v \} \]
Proof of Lemma 3: The proof is by induction on the length $l$ of the longest path without register that ends at $v$.
If $l = 0$, there is at least one register on each of the incoming edges of $v$, i.e. the leave a register case occurred during the processing of $v$. Thus $\Delta(v) = d(v)$.
Now suppose that the property has been proved for any value of $l$ between $0$ and $l_0$.
Let $v$ be a node of $G$ such that $l = l_0 + 1 \ge 1$. Then the leave a register case did not occur, and the value $\Delta(v)$ was set by the $\Delta$ computation statement. Thus:
\[ \Delta(v) = d(v) + \max\{ \Delta(t(e)) : e \in \bar{E},\ h(e) = v \text{ and } w(e) = 0 \} \]
Then, because of the induction hypothesis,
\[ \Delta(v) = d(v) + \max\{ \max\{ d(P) : P \text{ path of } G,\ w(P) = 0,\ h(P) = t(e) \} : e \in \bar{E},\ h(e) = v,\ w(e) = 0 \} \]
which is equivalent to the equation of the lemma.
We illustrate the execution of the two-pass algorithm on an example. The original graph to be retimed is given in Figure 4 and has a period of [?]. The graph obtained after the first pass is shown in Figure 5. Finally, the graph obtained after the second pass is shown in Figure 6.
[Figure 4: example DAG between $v_{from\_host}$ and $v_{to\_host}$, with the register count written on each edge.]
Figure 4: Initial register distribution, with a period equal to [?]
[Figure 5: the same DAG, with the registers gathered towards $v_{from\_host}$.]
Figure 5: Resulting distribution after the first pass of the algorithm, when operators have the same unit delay
4.3 Proof and Complexity
Proof of the correctness of the algorithm
Theorem 5 Let
\[ M = \max\left\{ \max_{v \in V} d(v),\ \max\left\{ \frac{d(P)}{w(P) + 1} : P \text{ path of } G,\ t(P) = v_{from\_host},\ h(P) = v_{to\_host} \right\} \right\} \]
If the tested period $T$ is greater than or equal to $M$, then the two-pass algorithm on $G$ succeeds in finding a new register distribution with period less than or equal to $T$.
Before establishing the proof of this theorem, we demonstrate two lemmas.
Lemma 4 Let $v$ be a node of $G$, $v \ne v_{from\_host}$. We suppose $T \ge \max_{v \in V} d(v)$. Then, after the first pass of the two-pass algorithm, there is a path without any register from $v$ to $v_{to\_host}$.
Proof of Lemma 4: Each node $v \in V$ is processed by the first pass of the algorithm. Registers are moved so that at least one of its outgoing edges has no register. To build recursively the path announced in the lemma, choose one of these edges as the first edge of the path and do the same operation on the head of this edge.
[Figure 6: the same DAG as in Figure 4, with the registers redistributed by the second pass.]
Figure 6: Resulting distribution after the second pass of the algorithm, when operators have the same unit delay ($T = \Phi_{opt}(G) = $ [?])
Lemma 5 Let $v \in V$. We suppose that no ERROR occurred before or during the processing of $v$ by the second pass of the two-pass algorithm. Then, after this second pass, there is a path from an input node to $v$ which begins by repeating several times the pattern "$T - 1$ edges without any register followed by an edge with a single register", plus $\Delta(v) - 1$ edges with no register.
Proof of Lemma 5: We prove the property by induction, considering all $v \in V$ in the same order as processed by the second pass.
Obviously this is true for any input node $v$: a void pattern plus $\Delta(v) - 1 = 1 - 1 = 0$ edges.
Now let $v \in V$ be a vertex of $G$ which is not an input node, and suppose that Lemma 5 holds for all vertices processed before $v$. Consider the situation during the processing of $v$, just after the $\Delta$ computation statement:
    $n := \min\{ w(e) : e \in \bar{E},\ h(e) = v \}$
    $\Delta(v) := d(v) + \max\{ \Delta(t(e)) : e \in \bar{E},\ h(e) = v \text{ and } w(e) = n \}$
Let $e$ be an edge such that, at this step of the algorithm, $w(e) = n$, $h(e) = v$ and $\Delta(v) = d(v) + \Delta(t(e))$. Let $v' = t(e)$. Since $v'$ has already been processed, by induction hypothesis there is a path $P'$ from an input node to $v'$ consisting of $T - 1$ edges with no register followed by an edge with a single register, this pattern being repeated, say, $r$ times (actually $r = w(P')$), plus $\Delta(v') - 1$ edges with no register.
We now have to consider two cases, depending on whether the leave a register case occurs or not, i.e. whether $\Delta(v) > T$ or not.
Case $\Delta(v) \le T$: In this case $n$ and $\Delta(v)$ are not modified, and $n$ registers are moved to the outgoing edges of $v$. Therefore the weight of $e$ is modified to $0$. Now, by adding $e$ to $P'$, we get a path from an input node to $v$ with the desired pattern repeated $r$ times, followed by $(\Delta(v') - 1) + 1 = \Delta(v) - 1$ edges with no register.
Case $\Delta(v) > T$ (leave a register case): Since $\Delta(v') \le T$ and $\Delta(v) = \Delta(v') + 1 > T$, we have $\Delta(v') = T$ and $\Delta(v) = T + 1$. Then $n$ is changed to $n - 1$ and $\Delta(v)$ to $d(v) = 1$. One register is left on edge $e$. Therefore, by adding $e$ to $P'$, we get a path from an input node to $v$ with the desired pattern repeated $r$ times, plus $\Delta(v') - 1 = T - 1$ edges without registers, plus one edge with a single register, i.e. $r + 1$ times the desired pattern, followed by $\Delta(v) - 1 = 0$ edges.
Proof of theorem   To prove theorem  we must prove rst that the ERROR case cannot
occur Then we show that the register distribution satises the period T 
As T is greater than or equal to M  T is greater than maxvV dv
The rst pass always ends correctly Let us suppose that the second pass does not end correctly
ie that the ERROR case occurs during the treatment of a node v This means that there exists a
node v such that "v  T and an edge e from v to v which carries no register v was treated
by the second pass we conclude from lemma  that there exists a path P from an input node to
v which is composed by wP times the motif T    edges without any register followed by an
edge with a single register plus T    edges without any register
The vertex v was treated by the rst pass but not by the second pass thus the result of lemma 
is still valid none of the successors of v have been touched since the rst pass Thus there exists
a path P from v to vto host which carries no register lP   since v 
 vto host

Now we build a path P from vfrom host to vto host as the concatenation of an input edge P e
and P wP   wP and lP    lP lP  wP T T    lP Now
replacing wP by wP  and lP  by dP    gives
T 
dP   
wP   
which contradicts the fact that 
T M 
dP 
wP   
see lemma 
We have thus proved that the second pass ends without the ERROR case. We just have to check that the period of $G$ is now smaller than or equal to $T$.
For all $v \in V$, either the leave a register case does not occur and $\Delta(v) \le T$, or $\Delta(v) = d(v) \le T$. Furthermore, by the lemma on the second pass, $\Delta(v)$ is the largest delay of a path with no register that ends at $v$. Therefore the period of $G$ is one of the $\Delta(v)$ and is smaller than or equal to $T$.
Corollary  Theorem 

Proof. The theorem gives an upper bound on $\Phi_{opt}$ which is exactly the lower bound of the lemma.
Complexity
The topological sort on $G$ can be done in $O(|E| + |V|) = O(|E|)$ (recall that $G$ is connected). The first and second passes of the algorithm process all nodes of $G$ once. For each node, all its input and output edges are processed. The complexity of these passes is therefore $O(|E|)$. The complexity of the two-pass algorithm alone is then $O(|E|)$.
If we try to determine the minimal possible clock period, we can use a dichotomic search. Indeed, as each node $v$ has unit delay $d(v) = 1$, $\Phi_{opt}(G)$ is bounded above by $|V|$. Thus we can use the two-pass algorithm for a dichotomic search of $\Phi_{opt}(G)$ in the range $[1, |V|]$: for each tested value, we try to process $G$ by the two-pass algorithm; if the process succeeds, then $\Phi_{opt}(G)$ is smaller than or equal to the tested value, else $\Phi_{opt}(G)$ is strictly greater. The complexity of this dichotomic search is therefore $O(|E| \log |V|)$, as $\Phi_{opt}(G)$ is bounded above by $|V|$. Thus precomputing $\Phi_{opt}(G)$ directly is less efficient than this strategy.
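The search strategy described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the names `dichotomic_search` and `feasible` are hypothetical, and `feasible(T)` stands in for one run of the two-pass algorithm, assumed monotone (if it succeeds for a value, it succeeds for every larger one). The toy oracle models a chain of `k` unit-delay operators carrying `n` movable registers, for which a period $T$ is achievable exactly when $(n + 1) T \ge k$.

```python
def dichotomic_search(lo, hi, feasible):
    """Return (smallest T in [lo, hi] with feasible(T), number of oracle calls).

    Assumes that feasible is monotone and that feasible(hi) is True.
    """
    calls = 0
    while lo < hi:
        mid = (lo + hi) // 2
        calls += 1
        if feasible(mid):
            hi = mid        # the optimum is <= mid
        else:
            lo = mid + 1    # the optimum is > mid
    return lo, calls


# Toy oracle (an assumption for illustration, not the two-pass algorithm):
# a chain of k unit-delay operators carrying n movable registers.
k, n = 10, 2
best, calls = dichotomic_search(1, k, lambda T: (n + 1) * T >= k)
print(best, calls)
```

Each oracle call would cost $O(|E|)$ with the two-pass algorithm, and the loop halves the range $[1, |V|]$ at every step, which is where the $\log |V|$ factor comes from.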
Clock period minimization for a DAG with operators of any delay
Let $G = (V, E, d, w)$ be a DAG. In this section the function $d : V \to \mathbb{N}$ is no longer constant. However, we still consider that $d(v_{to\_host}) = d(v_{from\_host}) = 0$ in all cases.
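Some bounds on $\Phi_{opt}$
Let $D = \max_{v \in V} d(v)$ be the largest operator delay and $\min_{v \in V} d(v)$ the smallest one. Let $\underline{G}$ be the graph obtained by replacing $d$ with $\underline{d} : v \in V \mapsto \min_{v \in V} d(v)$, and compute the minimum clock period $\underline{\Phi}(G)$ of this graph as in the previous section. Similarly, let $\overline{G}$ be the graph obtained by replacing $d$ with $\overline{d} : v \in V \mapsto \max_{v \in V} d(v)$, and let $\overline{\Phi}(G)$ be its period. So we have:
\[ \underline{\Phi}(G) \le \Phi_{opt}(G) \le \overline{\Phi}(G). \]
Furthermore, we have the following obvious bounds:
\[ \max_{v \in V} d(v) \le \Phi_{opt}(G) \le \Phi(G). \]
Then:
\[ \max\Bigl(\underline{\Phi}(G),\ \max_{v \in V} d(v)\Bigr) \le \Phi_{opt}(G) \le \min\bigl(\overline{\Phi}(G),\ \Phi(G)\bigr). \]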
From a register distribution $w$ and a clock period $T$
Theorem. Let $G = (V, E, d, w)$ be a DAG and let $T_{test}$ be an integer. If there exists a register distribution $w_r$ on $G$ whose clock period is less than or equal to $T_{test}$, then the two-pass algorithm called with $T$ set to $T_{test}$ succeeds in finding such a register distribution.
Remark. If there exists a register distribution $w_r$ on $G$ whose clock period is less than or equal to $T_{test}$, then $T_{test}$ is certainly greater than or equal to $\max_{v \in V} d(v)$. This property will implicitly be used in the rest of this section.
To establish the proof of this theorem, we first need to prove the two lemmas below.
Lemma. Let $G = (V, E, d, w)$ be a line graph:
\[ G = v_0 \xrightarrow{e_1} \cdots \xrightarrow{e_i} v_i \xrightarrow{e_{i+1}} \cdots \xrightarrow{e_k} v_k. \]
If $G = (V, E, d, w)$ admits at least one valid register distribution whose clock period is less than or equal to $T_{test}$, then the two-pass algorithm succeeds in finding one.
Proof. Let $w'$ be any register distribution on $G$ whose clock period is less than or equal to $T_{test}$. When applying the two-pass algorithm to $(V, E, d, w')$ with the period $T$ set to $T_{test}$, we obtain a new register distribution $w''$. As $G$ is a line graph, a retiming does not change the total number $n$ of registers of $G$. Let $i_1 \le \cdots \le i_n$ (resp. $j_1 \le \cdots \le j_n$) be the indices of the edges of $G' = (V, E, d, w')$ (resp. $G'' = (V, E, d, w'')$) which carry the $n$ registers.
We prove by induction on $l$ that $j_l \ge i_l$ for all $l \in \{1, \ldots, n\}$.
This is true for $l = 1$: indeed, $i_1$ (resp. $j_1$) is the position of the register which is the closest to $v_0$ in $G'$ (resp. $G''$). By definition of the two-pass algorithm, this register is as far as possible from $v_0$ in $G''$. Therefore $j_1 \ge i_1$. Actually, $j_1$ is defined by:
\[ \sum_{m=0}^{j_1 - 1} d(v_m) \le T_{test} < \sum_{m=0}^{j_1 - 1} d(v_m) + d(v_{j_1}). \]
Now we suppose that j  i        jl  il and we prove that jl   il  By denition of
the twopass algorithm  jl  indicates at which vertex the leave a register case occurred for the
l th register We thus have 
jlX
m jl
dvm  Ttest  dv jl 
jlX
m jl
dvm
To show that jl   il  we just have to prove that
Pil
m jl dvm  Ttest But 
Pil
m jl dvm Pil
m il dvm since jl  il by induction hypothesis and
Pil
m il dvm  Ttest asG  VE dw

has a register of clock period smaller than or equal to Ttest by hypothesis

Now it remains to show that after ejn  the ERROR case can not occur Indeed if the process
of G  VE d w by the twopass algorithm fails we have
Pk
mjn
dvm  Ttest but since
jn  in we also have
Pk
min
dvm  Ttest and the register distribution w does not have a clock
period smaller than or equal to Ttest which contradicts the hypothesis Thus the processing of
G  VE dw by the twopass algorithm succeeds
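The greedy behaviour used in this proof can be sketched in Python. This is a minimal illustration, not the authors' code: the function name `push_registers_on_line` is hypothetical, vertices are indexed $0, \ldots, k$ with edge $e_i$ entering $v_i$, and unused registers are assumed to stay on the last edge $e_k$.

```python
def push_registers_on_line(delays, n_regs, t_test):
    """Greedy placement on a line graph v_0 -> v_1 -> ... -> v_k.

    delays[m] is d(v_m); edge e_i enters v_i. Returns the sorted list of
    edge indices carrying the registers, or None in the ERROR case.
    """
    positions = []
    run = 0                      # delay of the register-free run ending at the previous vertex
    remaining = n_regs
    for i, d in enumerate(delays):
        if d > t_test:
            return None          # a single operator already exceeds the period
        if run + d > t_test:     # "leave a register" case on edge e_i
            if remaining == 0:
                return None      # ERROR case: no register left to drop
            positions.append(i)
            remaining -= 1
            run = d
        else:
            run += d
    # Assumption: registers that were never needed stay on the last edge e_k.
    positions.extend([len(delays) - 1] * remaining)
    return positions


delays = [0, 3, 2, 4, 1, 0]      # v_0 and v_5 play the role of the hosts
greedy = push_registers_on_line(delays, 2, 5)
other = [2, 3]                   # another placement that is valid for T_test = 5
print(greedy, all(j >= i for i, j in zip(other, greedy)))
```

On this example the greedy positions dominate the other valid placement index by index, which is the property $j_l \ge i_l$ established by the induction.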
Lemma. Let $G = (V, E, d, w)$ be a DAG and let $T_{test}$ be an integer. Process $G = (V, E, d, w)$ by the two-pass algorithm with the period $T$ set to $T_{test}$. Then, for any node $v$ of $V$ which has been successfully processed by the second pass of the algorithm, there exists a path $P$ starting from $v_{from\_host}$, ending at $v$, and satisfying the following properties:
1. The weight of every edge of $P$ is equal to zero or one.
2. If $r$ is a register of $P$, then in $P$ alone (considered as a line graph) this register cannot be pushed further from the input nodes without violating the period value $T_{test}$.
Proof of the lemma. $P$ is built from the head to the tail by induction: if $v$ is the current tail of $P$ and if $v$ is not an input vertex, we choose the predecessor $u$ of $v$ and the edge $e$ from $u$ to $v$ such that:
\[ w(e) = \min_{f \in E} \{ w(f) \mid x \xrightarrow{f} v \} \quad \text{and} \quad \Delta(u) = \max_{x \in V} \{ \Delta(x) \mid x \xrightarrow{f} v,\ w(f) = w(e) \}. \]
Then $w(e)$ is equal to $0$ or $1$, as the number of registers removed from the incoming edges of a vertex is equal to the minimal number of registers on these edges, or is equal to this minimal number of registers minus $1$.
We prove now that on $P$, considered as a line graph, a register cannot be moved further from the input node.
Let $r$ be a register of $P$, and let $e$ be the edge that carries $r$. Let $u$ and $v$ be respectively the tail and head of $e$: $u \xrightarrow{e} v$. We have $w(e) = 1$, thus by definition of $e$ (see the equation above) every incoming edge of $v$ carries a register. Thus, as by hypothesis the node $v$ has been successfully processed by the second pass of the algorithm, there was during the processing of $v$ at least one predecessor $u'$ of $v$, linked to $v$ by an edge $e'$ such that $w(e')$ was minimal among the input edges of $v$, and such that $\Delta(u') > T_{test} - d(v)$ (conditions of occurrence of the leave a register case in the algorithm). Then, by definition of $u$ (see the equation above), $\Delta(u) \ge \Delta(u') > T_{test} - d(v)$, and the register $r$ cannot be pushed further from the input node of $P$ without violating the period $T_{test}$.
Remark. The result of the lemma can be applied to any node of $G$ which has been successfully processed by the second pass, even if the whole processing of $G$ failed.
Proof of the theorem. If the two-pass algorithm ends correctly, i.e. if the ERROR case does not occur, then the register distribution found satisfies the period $T_{test}$: for each node $v$ of $G$, $\Delta(v) \le T = T_{test}$, and the lemma stating that $\Delta(v)$ is the largest delay of a register-free path ending at $v$ allows us to conclude that the register distribution satisfies $T_{test}$.
We suppose now that the ERROR case occurred during the processing of $G$. As there exists a register distribution valid for $T_{test}$, $T_{test}$ is greater than or equal to $\max_{v \in V} d(v)$. Thus the ERROR case occurred during the processing of a node $v$: there is in $G$ a node $u$ and an edge $e$ such that:
\[ u \xrightarrow{e} v, \quad w(e) = 0 \quad \text{and} \quad \Delta(u) + d(v) > T_{test}. \]

We build for $u$ the path $P_u$ defined in the preceding lemma.
There is a path $P'$ with no register from $v$ to $v_{to\_host}$, as stated in the lemma on the first pass. Let $P$ be the concatenation of $P_u$, $e$ and $P'$. By construction, $w(P)$ is equal to $w(P_u)$. By definition of $P_u$ (see the preceding lemma), no register of the line graph $P_u$ can be moved further from $v_{from\_host}$, and there is no register on $e \cdot P'$. Thus the two-pass algorithm is not able to find a register distribution valid for $T_{test}$ on the line graph $P$. This means that such a register distribution does not exist (see the lemma on line graphs).
However, this contradicts the fact that there exists a retiming $r$ on the whole graph $G$ with clock period less than or equal to $T_{test}$ (theorem hypothesis). Indeed, the restriction of $r$ to the vertices of $P$ (a path from $v_{from\_host}$ to $v_{to\_host}$) gives a distribution of the $w(P)$ registers of $P$, considered as a line graph, with clock period less than or equal to the clock period of $G$.
[Figure: Initial register distribution]
[Figure: Resulting distribution after the first pass of the algorithm. Operators have different delays.]
Algorithm and complexity
Algorithm
The algorithm below finds the minimum achievable period for a DAG, and a valid retiming for this period.
This algorithm uses the two-pass algorithm as a kernel in a dichotomic search of the minimum clock period: if there exists a valid retiming for the tested period value $t$, then the two-pass algorithm succeeds in finding one, else it fails, and in both cases the set of possible period values is refined. $G_r$ is the resulting graph: $G_r = (V, E, d, w_r)$.

[Figure: Resulting distribution after the second pass of the algorithm. Operators have different delays.]
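General algorithm
$t_{\min} \leftarrow \max_{v \in V} d(v)$
$t_{\max} \leftarrow \Phi(G)$
Repeat
    $t \leftarrow \lfloor (t_{\max} + t_{\min}) / 2 \rfloor$
    if the two-pass algorithm succeeds with $T = t$
        then $t_{\max} \leftarrow \Phi(G_r)$
        else $t_{\min} \leftarrow t + 1$
             $t_{\max} \leftarrow \max(t_{\max}, \Phi(G_r))$
    endif
Until $t_{\max} = t_{\min}$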
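The general algorithm can be sketched in Python as follows. This is a hedged illustration, not the authors' implementation: the function names, the edge-list representation `(tail, head, weight)` and the host names are assumptions, the two hosts are left untouched by both passes (so that every host-to-host path keeps its number of registers), and the extra update of $t_{\max}$ in the failure branch is omitted.

```python
from collections import defaultdict


def period(order, delay, edges, w):
    """Clock period: largest delay of a register-free path (hosts have delay 0)."""
    delta = defaultdict(int)
    for v in order:
        preds = [delta[a] for k, (a, b, _w) in enumerate(edges) if b == v and w[k] == 0]
        delta[v] = delay[v] + max(preds, default=0)
    return max(delta[v] for v in order)


def two_pass(order, delay, edges, T):
    """Two-pass kernel; returns the new weights, or None in the ERROR case."""
    if max(delay[v] for v in order) > T:
        return None
    w = [weight for (_a, _b, weight) in edges]
    ins, outs = defaultdict(list), defaultdict(list)
    for k, (a, b, _w) in enumerate(edges):
        outs[a].append(k)
        ins[b].append(k)
    for v in reversed(order):          # first pass: pull registers toward the inputs
        n = min(w[k] for k in outs[v])
        for k in outs[v]:
            w[k] -= n
        for k in ins[v]:
            w[k] += n
    delta = defaultdict(int)           # Delta(v_from_host) = 0
    for v in order:                    # second pass: push registers toward the outputs
        n = min(w[k] for k in ins[v])
        delta[v] = delay[v] + max(delta[edges[k][0]] for k in ins[v] if w[k] == n)
        if delta[v] > T:
            if n == 0:
                return None            # ERROR case
            n -= 1                     # leave a register
            delta[v] = delay[v]
        for k in ins[v]:
            w[k] -= n
        for k in outs[v]:
            w[k] += n
    return w


def minimum_period(order, delay, edges):
    """Dichotomic search driven by the two-pass kernel."""
    best = [weight for (_a, _b, weight) in edges]
    t_min = max(delay[v] for v in order)
    t_max = period(order, delay, edges, best)
    while t_min < t_max:
        t = (t_min + t_max) // 2
        w = two_pass(order, delay, edges, t)
        if w is None:
            t_min = t + 1
        else:
            best, t_max = w, period(order, delay, edges, w)
    return t_max, best


# Hypothetical example: internal vertices a, b, c, d listed in topological order.
delay = {"a": 2, "b": 3, "c": 1, "d": 2}
edges = [("v_from_host", "a", 0), ("a", "b", 0), ("a", "c", 0),
         ("b", "d", 0), ("c", "d", 0), ("d", "v_to_host", 2)]
result = minimum_period(["a", "b", "c", "d"], delay, edges)
print(result)
```

The helper `period` rescans the edge list for every vertex, which keeps the sketch short but is not linear in $|E|$; an implementation aiming at the stated complexity would reuse the adjacency lists built in `two_pass`.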
We illustrate the execution of the algorithm on an example. The original graph to be retimed is given in the figure captioned "Initial register distribution". We assume that the optimal period $\Phi_{opt}(G)$ has been found, and we apply the two passes. The graphs obtained after the first pass and after the second pass of our algorithm are shown in the figures captioned "Resulting distribution after the first pass" and "Resulting distribution after the second pass".
Complexity
The complexity of the two-pass algorithm is $O(|E|)$, thus the complexity of this algorithm is $O(|E| (\log |V| + \log D))$, where $D = \max_{v \in V} d(v)$; indeed, $\Phi_{opt}(G)$ is bounded above by $|V| \cdot D$.
The stated complexities of the two-pass algorithm and of the general algorithm directly prove the two corresponding theorems. As for the remaining theorem, we use a dichotomic search for $k$, using the two-pass algorithm once for each instantiation. The value of $k$ is clearly bounded above by the diameter of $G$, hence by $|V|$ (note that we could slightly refine the given complexity using the diameter). This would lead to a complexity of $O(|E| \log |V|)$ for the corresponding problem. However, a much better solution is to add $|V|$ registers to each input edge and run the two-pass algorithm once. At the end of the second pass, simply suppress the useless registers left on the output edges. Let $k'$ be the minimum number of useless registers on the output edges. Then $k = |V| - k'$ is the desired value, and the process we described gives the retimed circuit. The complexity of the algorithm is then reduced to $O(|E|)$.

Clock period minimization for a DAG with operators of non-uniform delay
During the design of a circuit, the designer knows the different types of operators that will be used, independently of the technology on which the design will be implemented. Symbolic operator delays are available, which can be seen as worst-case delays.
Once the design technology has been chosen, more accurate delays are available. An operator delay is no longer symbolic: in fact, there is a delay for each internal path (or internal edge) between an input and an output. Usually the delay of an internal edge $e_i$ is equal to
\[ d(e_i) = \alpha(e_i) + \beta(e_i) \times f(e_i), \]
where $\alpha(e_i)$ is the intrinsic delay, $\beta(e_i)$ is the extrinsic delay depending on the output capacitance, and $f(e_i)$ is the output fanout.
This section aims at giving a new form of the two-pass algorithm taking into account this more accurate delay model.
Some Definitions
Let $G = (V, E, d, w)$ be a DAG. In this section we consider vertices in which the delays through individual elements are non-uniform. Let $I_v$ and $O_v$ be the inputs and outputs of $v$. For each pair $(i, o) \in I_v \times O_v$ of vertex $v$, we call $d_v(i, o)$ the delay from input $i$ to output $o$ of vertex $v$. We shall now consider the tail $t(e)$ of edge $e$ as the pair $(v, o)$, $e$ coming out of node $v$ from output $o$. In the same way, the head becomes the pair $(i, v)$, where $i$ is an input of node $v$. As in the previous sections, we add two new vertices $v_{from\_host}$ and $v_{to\_host}$ with a single output (resp. input). In the two-pass algorithm we will need to compute a new value called $\Delta_{out}$. The two-pass algorithm becomes:
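Algorithm
Consistency test
    if $T < \max_{v \in V,\ (i, o) \in I_v \times O_v} d_v(i, o)$ then ERROR endif
First pass
    for each vertex $v \in V$ in reverse topological sort order
    do
        $n \leftarrow \min_{e \in E \mid t(e) = v} w(e)$
        $\forall e \in E \mid t(e) = v$: $w(e) \leftarrow w(e) - n$
        $\forall e \in E \mid h(e) = v$: $w(e) \leftarrow w(e) + n$
    enddo
Second pass
    $\Delta_{out}(v_{from\_host}) \leftarrow 0$
    for each vertex $v \in V$ in topological sort order
    do
        $n \leftarrow \min_{e \in E \mid h(e) = v} w(e)$
        $\forall o \in O_v$: $\Delta_{out}(v, o) \leftarrow \max_{i \in I_v} \{ d_v(i, o) + \Delta_{out}(t(e)) \mid e \in E,\ h(e) = (i, v),\ w(e) = n \}$
        $\Delta(v) \leftarrow \max_{s \in O_v} \Delta_{out}(v, s)$
        if $\Delta(v) > T$
        then
            if $n = 0$ then ERROR endif
            $n \leftarrow n - 1$
            $\forall o \in O_v$: $\Delta_{out}(v, o) \leftarrow \max_{i \in I_v} d_v(i, o)$
            $\Delta(v) \leftarrow \max_{o \in O_v} \Delta_{out}(v, o)$
        endif
        $\forall e \in E \mid t(e) = v$: $w(e) \leftarrow w(e) + n$
        $\forall e \in E \mid h(e) = v$: $w(e) \leftarrow w(e) - n$
    enddo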
We denote by $o_m^v$ and $i_m^v$ one of the outputs and one of the inputs for which $\Delta(v)$ is reached:
\[ \Delta(v) = d_v(i_m^v, o_m^v) + \Delta_{out}(t(e)), \quad \text{where } e \in E,\ h(e) = (i_m^v, v),\ w(e) = n. \]
Equivalence with the previous case
We will prove that the theorem and the two lemmas of the previous section still apply to our new graph model and to our new two-pass algorithm.
Definition of $\mathrm{nwp}(v)$. $\Delta_{out}(v, o)$ can be considered as the arrival time of a signal $s$ on the output $o$ of vertex $v$, that is, the longest time it takes for a signal coming out of a register to arrive on the output $o$ of vertex $v$. It means that there is a null weight path, which we will call $\mathrm{nwp}(v)$, ending at $v$ and of delay $\Delta_{out}(v, o)$.
Definition of $\mathrm{pr}(v)$. If every incoming edge of $v$ has at least one register, it means that the first computation of $\Delta(v)$ during the second pass was greater than $T$ (in such a case $\mathrm{nwp}(v)$ has only one vertex, $v$, since a register has been left):
\[ \exists\, o_m^v \in O_v, \quad \Delta(v) = \Delta_{out}(v, o_m^v) > T \]
(we consider here the first computation of $\Delta_{out}$). Besides,
\[ \Delta_{out}(v, o_m^v) = d_v(i_m^v, o_m^v) + \Delta_{out}(t(e)), \quad \text{with } e \in E,\ h(e) = (i_m^v, v),\ w(e) = n. \]
We denote by $(\mathrm{pr}(v), o')$ the tail of edge $e$, and we call $\mathrm{pr}(v)$ the predecessor of $v$. We can then simplify the previous inequality:
\[ \Delta(v) = d_v(i_m^v, o_m^v) + \Delta_{out}(\mathrm{pr}(v), o') > T. \]
Remark that after the second pass there is exactly one register on the edge between v and
its predecessor prv If registers had been pushed from the inputs to the outputs it would have
revealed a null weight path P such that dP   "v  T  The last vertex of P would have been
v and the penultimate one would have been prv
As Lemma  only deals with linegraphs  internal edges have no meaning as there is only one
input and one output Thus Lemma  still can be applied to our new twopass algorithm
New proof of the path lemma. We shall now construct a path $P$ as a list of null weight paths $P_i$, the head of a subpath being the predecessor of the tail of the next subpath:
\[ P = \bigl( P_n = \mathrm{nwp}(\mathrm{pr}(t(P_{n-1}))),\ P_{n-1} = \mathrm{nwp}(\mathrm{pr}(t(P_{n-2}))),\ \ldots,\ P_1 = \mathrm{nwp}(\mathrm{pr}(t(P_0))),\ P_0 = \mathrm{nwp}(v) \bigr). \]
Every $P_i$ is a null weight path by definition of $\mathrm{nwp}$. $P_0 = \mathrm{nwp}(v)$ is the longest null weight path ending at $v$. Let us call its tail $v_t^0$. Let us now call $v_h^1$ the predecessor of $v_t^0$. $P_1$ is then the longest null weight path ending at $v_h^1$. We construct the rest of $P$ by induction until we find $v_{from\_host}$ as a path tail.
There is exactly one register between the path tails and their predecessors, as by definition of $\Delta_{out}$ the edge between the two vertices was of minimal weight.
Assume now that $r$ is a register of $P$ carried by the edge $v' \xrightarrow{e} v$. $v$ is the tail of one of the null weight paths $P_i$, and $v'$ is its predecessor $\mathrm{pr}(v)$. By definition of the predecessor, we have
\[ \Delta(v) = d_v(i_m^v, o_m^v) + \Delta_{out}(v', o') > T. \]
As $\Delta(v') \ge \Delta_{out}(v', o')$, we can conclude that
\[ \Delta(v') > T - d_v(i_m^v, o_m^v), \]
which shows that if $r$ was pushed further from the tail of $P$, it would reveal a null weight path ending at $v$ of delay $\Delta(v') + d_v(i_m^v, o_m^v) > T$.
Finally, the proof of the theorem can be directly applied to our new graph model.
Algorithm and complexity
In conclusion, the two-pass algorithm can be generalized to our new delay model, provided that the computation of $\Delta$ is modified as described. The complexity remains the same.
Conclusion
We have dealt with various instances of the retiming problem. Our initial motivation was dictated by our target application: on-line computer arithmetic circuits are naturally pipelined circuits without cycles, hence the search for more efficient retiming algorithms applicable to DAGs.
We have succeeded in improving the known complexity of clock minimization problems. Further work could be aimed at improving the complexity of the register minimization problem. It is not clear that a more efficient solution can be found for DAGs than for arbitrary graphs with cycles. However, computer arithmetic circuits usually involve regular computational elements, and this characteristic may prove helpful. For instance, the Borrow Save adder shown in an earlier figure only involves identical devices, all sharing one input degree and one output degree. Thus, in this case, our two-pass algorithm also minimizes the total register number, as registers are pushed in the second pass from inputs to outputs.
Finally there remain many interesting open problems in the area For instance computational
devices are usually selected from a cell library and we can have the freedom to select say among
several adders with dierent delays and inputoutput degrees For example if a Borrow Save 
adder has a very interesting delay it is due to the redundant number representation used which
means twice more registers to memorize a number This could be yet another parameter of the
fundamental design optimization problem to be solved  match a clock period constraint while
minimizing the total register number
Acknowledgment
We would like to gratefully thank Alain Darte for his careful reading of this paper and for his
numerous comments and suggestions

References
[1] Tsing-Fa Lee, Allen C.-H. Wu, Wei-Jeng Chen, Wei-Kai Cheng, and Youn-Long Lin. On the relationship between sequential logic retiming and loop folding. In Proceedings of the SASIMI, Nara, Japan, October.
[2] J.-C. Bajard, S. Kla, and J.-M. Muller. BKM: A new hardware algorithm for complex elementary functions. IEEE Transactions on Computers.
[3] J. Biesenack, T. Langmaier, M. Münch, and N. Wehn. Scheduling of behavioural VHDL by retiming techniques. In Proceedings of the EuroDAC, September.
[4] M. D. Ercegovac and K. S. Trivedi. On-line algorithms for division and multiplication. IEEE Transactions on Computers.
[5] A. Münzer and G. Hemme. Converting combinational circuits into pipelined data paths. In Proceedings of the ICCAD, November.
[6] Richard M. Karp. A characterization of the minimum cycle mean in a digraph. Discrete Mathematics.
[7] C. E. Leiserson and J. B. Saxe. Optimizing synchronous systems. Journal of VLSI and Computer Systems.
[8] C. E. Leiserson and J. B. Saxe. Retiming synchronous circuitry. Algorithmica.
[9] M. C. Papaefthymiou. A Timing Analysis and Optimization System for Level-Clocked Circuitry. PhD thesis, Massachusetts Institute of Technology, September.
[10] J. B. Saxe. Decomposable Searching Problems and Circuit Optimization by Retiming: Two Studies in General Transformations of Computational Structures. PhD thesis, Carnegie Mellon University, August.

