Rewired retiming for flip-flop reduction and low power without delay penalty. by Jiang, Mingqi. & Chinese University of Hong Kong Graduate School. Division of Computer Science and Engineering.
Rewired Retiming for Flip-flop
Reduction and Low Power without
Delay Penalty
JIANG Mingqi
A Thesis Submitted in Partial Fulfillment
of the Requirements for the Degree of
Master of Philosophy
in 
Computer Science and Engineering 




Professor XU Qiang (Chair)
Professor WU Yu Liang David (Thesis Supervisor)
Professor YOUNG Fung Yu (Committee Member)
Professor HUANG Shi Yu (External Examiner)
Abstract of the thesis entitled: 
Rewired Retiming for Flip-flop Reduction and Low Power without Delay 
Penalty 
Submitted by JIANG Mingqi 
for the degree of Master of Philosophy 
at The Chinese University of Hong Kong in July 2009
Abstract 
Retiming is an optimization technique for sequential circuits that repositions
flip-flops across the combinational elements of the circuit. It has been applied
to different areas such as logic synthesis, circuit partitioning and power
reduction. However, due to the intrinsic difference between the fan-in and
fan-out counts of a retimed component, the number of flip-flops tends to
increase undesirably in a conventional retiming procedure, which can cause
a significant area/power penalty on the retimed circuit. Moreover, because
interconnect delays are increasingly dominant, the clock period produced by a
retiming scheme will be unrealistic without a mechanism that reflects the real
physical design accurately.
To overcome these two major drawbacks of the conventional retiming
technique, we propose a novel retiming flow combined with rewiring, which is
able to cut down the number of flip-flops (FFs) substantially while leaving the
original retimed clock period uncompromised. For a more accurate delay
estimation, all interconnect delays are formulated and calculated based on
real placements.
Experimental results show that this novel rewired retiming scheme reduces
the number of flip-flops by 18.7% on average compared to the original
retiming without rewiring. This large FF reduction can be considered a free
gain, as the retimed clock period is still kept without compromise.
Meanwhile, due to this FF reduction, about 8.26% of the


















I would like to express my highest gratitude to my supervisor, Prof. David
Yu-Liang Wu, for his consistent support and guidance throughout my two years
of M.Phil. study. His earnest attitude, endless endeavor towards research and
never-give-up spirit have been leading my research as well as my daily life.
I must also thank Prof. Evangeline F.Y. Young for her support and comments
on the work. I also thank my fellow colleagues in the VLSI CAD group, who
have been very helpful and have given me every care and support along the
way, with many inspiring discussions and suggestions.
Table of Contents 
Abstract
Acknowledgement
1 Introduction
2 Rewiring Background
2.1 REWIRE
2.2 GBAW
3 Retiming
3.1 Min-Clock Period Retiming
3.2 Min-Area Retiming
3.3 Retiming for Low Power
3.4 Retiming with Interconnect Delay
4 Rewired Retiming for Flip-flop Reduction
4.1 Motivation and Problem Formulation
4.2 Retiming Indication
4.3 Target Wire Selection
4.4 Incremental Placement Update
4.5 Optimization Flow
4.6 Experimental Results
5 Power Analysis for Rewired Retiming
5.1 Power Model
5.2 Experimental Results
6 Conclusion
Bibliography
List of Figures
2.1 (a) Original Circuit (b) Rewired Circuit
2.2 Example of ATPG-based Rewiring
2.3 Example of GBAW Patterns
3.1 (a) Comparator (b) Adder (c) Original Circuit
3.2 Graph Representation of Original Circuit
3.3 Basic Retiming Operations
3.4 Retime Value r(v)
3.5 Graph Representation of Retimed Circuit
3.6 (a) Switching Activity and Power Dissipation (b) Switching Activity and Power Dissipation of Retimed Circuit
3.7 Example of arrival time [11]
4.1 Flip-flop reduction using rewiring and retiming
4.2 Retiming flip-flops backward and forward
4.3 (a) Condition 1: e(u, v) is selected (b) Condition 2: e(u, v) is selected
4.4 Example of placement estimation for adding new wire
4.5 Overall optimization flow
List of Tables
4.1 Experimental Result of FF Reduction
4.2 Experimental Result of Average Clock Period
4.3 Experimental Result of Best Clock Period
5.4 Power Estimation of Pure Retiming Circuits
5.5 Power Estimation of Rewired Retiming Circuits




Following Moore's Law, Very Large Scale Integration (VLSI) technology has
soared over the last decades, which gives Electronic Design Automation (EDA)
tools an ever more significant and indispensable role in the industry. As
process technology advances, the requirements for EDA techniques become more
and more sophisticated. EDA optimization techniques mainly aim at reducing
circuit area, delay and power at different stages of VLSI design. As
technology advances to deep submicron, concerns have gradually shifted from
area to timing and power dissipation.
Retiming is an EDA optimization technique which was originally designed
for timing optimization [1]. It minimizes the clock period of sequential
circuits by repositioning flip-flops across the combinational elements of the
circuit. Later works in retiming involve optimizing the cycle time [1][2],
reducing the area by minimizing the number of flip-flops [3][4], etc. It has
also been applied to different practical applications, such as logic
synthesis [4][5], circuit partitioning [6][7], power reduction [8][9] and
testability [10].
In most of the existing approaches, the problem of flip-flop placement is
viewed from a purely graph-based perspective, with all logic information
about the circuit being discarded during retiming. Approaches which
incorporate retiming with logic re-synthesis are thus proposed in [4] to try
to exploit the possibility of improvement by utilizing the extra freedom.
CHAPTER 1 INTRODUCTION
In [4], the authors proposed a circuit optimization approach in which all
the flip-flops are temporarily moved to the boundaries of the combinational
network using retiming. Re-synthesis is then performed on the combinational
logic between the flip-flops. This is one of the first attempts to couple the
movement of flip-flops by retiming with combinational re-synthesis
techniques to achieve circuit optimization goals. However, their technique is
primarily targeted at minimizing the number of literals of the circuit. No
real placement information is utilized to handle the interconnect delay
factor that dominates in circuit design nowadays. Clearly, any logic
synthesis flow would be more accurate and effective if physical information
obtained from real place and route were integrated into it.
As VLSI process technology scales down to the deep submicron era, the
traditional retiming algorithm, which ignores interconnect delay, is no
longer accurate enough, because the interconnect delay can dominate and
exceed the logic/gate delay. In [11], the retiming problem is re-formulated
to include both gate and interconnect delays, with the interconnect delay
assumed to be proportional to the wire length.
However, as demonstrated in [12], the optimal clock period gained from
retiming may not be feasible after the circuit is actually placed. As a large
number of flip-flops are relocated and the flip-flop count will usually
increase after retiming, a traditionally retimed optimal clock period might
not be close to reality in a legalized placement. Moreover, a larger amount
of power consumption can be introduced by the increased number of flip-flops.
Therefore, besides delay improvement, it is also important to cut down the
retiming-induced flip-flops for both area and power reduction, which is a
problem not addressed in [11].
In this work, we integrate interconnect-delay-based retiming with rewiring
to achieve better and more accurate circuit optimization. Rewiring [13-16]
is an optimization technique in logic re-synthesis, a powerful tool for
combinational logic transformation and circuit optimization. We demonstrate
that, by applying logic transformation through rewiring, we can further
reduce the number of flip-flops of an interconnect-delay retiming (by 18.7%).
In addition, as a welcome by-product of the flip-flop reduction, the power
consumption estimated by Power Compiler shows a considerable reduction
(8.26%) in the total dynamic power of the circuit.
This thesis is organized as follows. Chapter 2 introduces the rewiring
background and algorithms. Chapter 3 reviews previous work in retiming,
including min-clock period retiming, min-area retiming, retiming for low
power, and interconnect retiming. Chapter 4 presents the rewired retiming
optimization technique for flip-flop reduction. Chapter 5 analyzes the power
reduction due to the use of the optimization scheme. Chapter 6 gives the
final conclusion.
• End of chapter. 
Chapter 2 
Rewiring Background 
Rewiring, originally proposed in [13] and [14], is a powerful technique for
combinational optimization. Rewiring can be viewed as a procedure of logic
transformation on the combinational part of the circuit. It transforms the
circuit by replacing certain wires with extra wires added to the circuit,
while keeping the logic function of the circuit unchanged. The wires being
removed are called target wires (TWs), while the extra wires added are
called alternative wires (AWs). Guided by a suitable cost function, the
appropriate target wires and alternative wires can be selected to achieve
different optimization objectives, including logic minimization [13][15],
post-layout timing optimization [14], technology mapping [16][17], FPGA
routing [18][19] and circuit partitioning [20]. From all the previous works,
we can learn that rewiring algorithms provide a flexible and powerful logic
transformation which can be used to optimize circuit performance for
different goals. This strongly motivates us to apply rewiring to improve
retiming.
An example of rewiring is shown in Fig. 2.1.
Fig. 2.1 (a) Original Circuit (b) Rewired Circuit
The original circuit is shown in Fig. 2.1(a). Suppose we want to remove the
wire g1 -> g5 (red line, TW). The rewiring is done as follows:
1. Add a gate g6 to connect the output of g1 and c to the input of g4 (by
doing this, a wire is effectively added from g1 to g4).
2. Remove the wire g1 -> g5.
3. The gate g5 has only one input left and becomes removable.
Finally, the rewired circuit is shown in Fig. 2.1(b). The function
y = (a + b)c + ab = ac + bc + ab remains the same as in Fig. 2.1(a).
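The claim that rewiring preserves the logic function can be checked by exhaustive simulation. The sketch below compares only the two Boolean expressions quoted above; it does not model the gate-level netlists of Fig. 2.1, and the function names are illustrative.

```python
from itertools import product

def original(a, b, c):
    # y = (a + b)c + ab, the form quoted for the original circuit
    return ((a or b) and c) or (a and b)

def rewired(a, b, c):
    # y = ac + bc + ab, the expanded form quoted in the text
    return (a and c) or (b and c) or (a and b)

def equivalent(f, g, n_inputs):
    """Compare two Boolean functions over every input vector."""
    return all(bool(f(*bits)) == bool(g(*bits))
               for bits in product((False, True), repeat=n_inputs))

print(equivalent(original, rewired, 3))  # True
```

Exhaustive comparison is only practical for a handful of inputs; it is used here purely to illustrate what "maintaining the logic function" means.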
Over the years, a lot of effort has been made in developing rewiring
algorithms. There are three main families of rewiring algorithms, namely the
Automatic Test Pattern Generation (ATPG)-based, graph-based, and Set of
Pairs of Functions to be Distinguished (SPFD)-based algorithms. In our work
the ATPG-based rewiring is adopted, and thus it is described in detail in
the following sections, together with a brief introduction to graph-based
rewiring.
2.1 REWIRE 
The most commonly used rewiring technique is based on Automatic Test Pattern
Generation (ATPG). It converts the problem of finding target-alternative
wire pairs into the problem of seeking undetectable stuck-at faults, to
which ATPG techniques are applied. The basic idea of the ATPG-based rewiring
technique is to add a redundant wire/gate to make other wires/gates
redundant and removable. Redundancy inside a circuit means that the logic
value of a connection or a component has no effect on the circuit outputs.
REWIRE is an ATPG-based rewiring algorithm which utilizes undetectable
stuck-at faults inside the circuit to find alternative wires and to make the
target wire redundant. For a given target wire, the algorithm computes the
Mandatory Assignments (MA) [13] for the test of the target wire. The
Mandatory Assignments are the set of values that must be assigned inside the
circuit so that the fault at the target wire can propagate to the primary
output. If a consistent set of MA for the target wire does not exist, the
stuck-at fault of the target wire is not detectable at the primary output,
which means that the target wire is redundant and removable. If a consistent
set of MA for the target wire originally exists, we can try to add a
redundant wire to the circuit so that the MA become inconsistent and the
fault at the target wire becomes undetectable, making the target wire
removable. Such a redundant wire is called an alternative wire. After adding
an alternative wire to the circuit, we have to check whether it is redundant
as well. This is done by a similar process: the newly added alternative wire
is treated as a target wire and its MA are checked; if a consistent set of
MA does not exist, it is a redundant wire.
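The notion of redundancy used above (an undetectable stuck-at fault) can be illustrated with a brute-force check. This sketch is not the REWIRE algorithm: it swaps the mandatory-assignment analysis for exhaustive simulation, and the three-gate circuit, its wire names and the function `is_redundant` are hypothetical.

```python
from itertools import product

def circuit(a, b, c, fault=None):
    """Hypothetical circuit: g1 = a AND b, g2 = g1 AND c, y = g1 OR g2.

    `fault` is an optional (wire_name, stuck_value) pair.
    """
    def wire(name, value):
        # Override the wire's value if it carries the injected fault.
        if fault is not None and fault[0] == name:
            return fault[1]
        return value

    g1 = a & b
    g2 = wire("g1->g2", g1) & wire("c->g2", c)
    return wire("g1->y", g1) | wire("g2->y", g2)

def is_redundant(name, stuck_value):
    """True if no input vector makes the fault visible at the output."""
    return all(circuit(*v) == circuit(*v, fault=(name, stuck_value))
               for v in product((0, 1), repeat=3))

print(is_redundant("c->g2", 1))  # True: y = g1 either way
print(is_redundant("g1->y", 0))  # False: a = b = 1, c = 0 detects it
```

An ATPG-based tool reaches the same verdict without enumerating all input vectors, by showing that the mandatory assignments for the fault are inconsistent.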
The example in Fig. 2.2 shows how the rewiring works. g3 -> g7 is a
candidate wire to make g1 -> g5 redundant and removable. We test the
stuck-at-1 fault at g1 -> g5. First, we set {a = 0, b = 0} to make g1 = 0. To
propagate the fault to the primary output o1, the side inputs to g5, g6, g7
and g9 should have non-controlling values, i.e. {e = 1, g4 = 0, g3 = 1,
f = 1, g = 0}. g4 = 0 requires {g2 = 0, b = 0}. So g1 has to be 1 to make
g3 = 1, but we have set g1 = 0. The conflict means that there is no test
vector to detect this fault. Hence g1 -> g5 is redundant and removable.
Fig. 2.2 Example of ATPG-based rewiring
2.2 GBAW 
Graph-based rewiring uses graph pattern matching to find target wires and
alternative wires. GBAW is a graph-based rewiring algorithm [22]. It uses a
set of graph configurations called patterns. Patterns are pre-defined graph
representations of sub-circuits which contain alternative wires. Figure 2.3
is an example of a pattern. The target wire to the NOR gate can be replaced
by the alternative wire to the AND or NAND gate. Figure 2.4















Figure 2.3 Example of GBAW Patterns (NOR and AND/NAND gate configurations, each marked with a target wire and an alternative wire)
GBAW transforms the problem of finding target wires and alternative wires
into one of matching patterns. It searches for alternative wires by matching
the circuit against the library of patterns. GBAW is time-efficient in
finding alternative wires: on average it is around 150 times faster than
REWIRE. However, the number of alternative wires found by GBAW is much
smaller than that found by REWIRE.
• End of chapter. 
Chapter 3 
Retiming 
The previous chapter introduced the background of the rewiring technique,
with specific details on ATPG-based rewiring. Before seeing how this
technique can be incorporated into retiming, this chapter reviews the
retiming technique, including its original formulation and later advances,
with particular attention to its application to power optimization and its
drawbacks in today's technology. Retiming involves optimizing the cycle time
(min-clock period retiming) [1][2], reducing the area by minimizing the
number of flip-flops (min-area retiming) [3][4], etc. It has also been
applied to different practical applications, such as logic synthesis [4][5],
circuit partitioning [6][7], power reduction [8][9] and testability [10]. We
are going to see why rewiring is needed to improve retiming in terms of
delay and placement estimation, traditional area reduction, and power
reduction, which has captured much attention in today's technology.
3.1 Min-Clock Period Retiming 
The earliest retiming formulation is given by Leiserson and Saxe [1]. It is
a graph-based optimization technique that obtains the minimal feasible clock
cycle by repositioning the flip-flops in a sequential circuit without
violating the circuit's function and timing constraints.
In the classical retiming formulation, a sequential circuit C is represented
by a directed graph G(V, E, d, w). Each node v corresponds to a combinational
gate and each directed edge e(u, v) represents a connection from the output
of gate u to the input of gate v. For each combinational element v in the
circuit, there is a propagation delay d(v). The number of flip-flops is
modeled as a weight w(u, v) on the edge e(u, v): if there are n flip-flops
on the edge e(u, v), then w(u, v) = n.
An example is shown in Fig. 3.1(c). Consider a circuit composed of two
comparators, one adder and two flip-flops. A comparator has the function
S(x, a) = 1 if x = a,
else S(x, a) = 0
and an adder has the function
adder(x, y) = x + y
A comparator has a delay of 3 ns; an adder has a delay of 7 ns. The primary
input and output of the circuit are represented as a "Host" element with
zero delay. Originally there are two flip-flops at the wire from the "Host" to





Fig. 3.1 (a) Comparator (b) Adder (c) Original circuit
Fig. 3.2 Graph representation of the original circuit
Fig. 3.3 Basic retiming operations (r(v) = -1 and r(v) = +1)
As shown in Fig. 3.2, the circuit of Fig. 3.1(c) is represented by a retime
graph. Each node of the graph represents a combinational element (adder or
comparator). The numbers on the nodes are the delays d(v) of the
combinational elements. The weights w(u, v) on the edges are the numbers of
flip-flops on the edges.
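For a path p from vertex $v_0$ to vertex $v_k$, with edges
$e_0, \ldots, e_{k-1}$, the path weight $w(p)$ is the sum of the weights of
its edges and the path delay $d(p)$ is the sum of the delays of its vertices:
$w(p) = \sum_{i=0}^{k-1} w(e_i), \qquad d(p) = \sum_{i=0}^{k} d(v_i)$
The clock period is defined as:
$c = \max \{\, d(p) : w(p) = 0 \,\}$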
That is, the clock period is the longest delay from one flip-flop to
another. The original circuit has a clock period of 13, calculated as
3 + 3 + 7 = 13, the delay of the longest path that contains no flip-flop.
An integer retime value r(v) is defined for each node v to represent the
flip-flop movements across the node, as shown in Fig. 3.3. A positive value
r(v) = m means that m flip-flops are moved from every output edge of v to
every input edge of v. Similarly, a negative value r(v) = -m stands for the
opposite moving direction. The weight w'(u, v) after retiming is
w'(u, v) = w(u, v) + r(v) - r(u)
To represent the retiming operation on Fig. 3.2, each of the nodes in
Fig. 3.2 is given a retime value r(v), as shown in Fig. 3.4.
From Fig. 3.4 we can see that the two comparators have a value of -1, which
means that one flip-flop is retimed from their inputs to their outputs. The
graph after retiming is shown in Fig. 3.5. In Fig. 3.5, the new clock period
is reduced to 7, as the original critical path of 13 is broken by flip-flops.
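The retiming step can be replayed on a small retime graph. The exact wiring of Fig. 3.1(c) is not recoverable from the text, so the topology below is an assumption chosen to be consistent with the stated delays (3, 3, 7), the two flip-flops after the host, and the clock periods 13 and 7; node names and function names are illustrative.

```python
# Hypothetical retime graph: gate delays d(v) and edge weights w(u, v).
delay = {"host": 0, "cmp1": 3, "cmp2": 3, "add": 7}
weights = {("host", "cmp1"): 2, ("cmp1", "cmp2"): 0, ("cmp1", "add"): 0,
           ("cmp2", "add"): 0, ("add", "host"): 0}

def clock_period(delay, weights):
    """Longest path delay over edges that carry no flip-flop (w = 0).

    Assumes every cycle holds at least one flip-flop, so the zero-weight
    subgraph is acyclic and the recursion terminates.
    """
    preds = {v: [] for v in delay}
    for (u, v), w in weights.items():
        if w == 0:
            preds[v].append(u)
    memo = {}

    def arrival(v):
        if v not in memo:
            memo[v] = delay[v] + max((arrival(u) for u in preds[v]), default=0)
        return memo[v]

    return max(arrival(v) for v in delay)

def retime(weights, r):
    """Apply w'(u, v) = w(u, v) + r(v) - r(u) and reject negative weights."""
    new = {(u, v): w + r[v] - r[u] for (u, v), w in weights.items()}
    if any(w < 0 for w in new.values()):
        raise ValueError("illegal retiming: negative flip-flop count")
    return new

r = {"host": 0, "cmp1": -1, "cmp2": -1, "add": 0}
retimed = retime(weights, r)
print(clock_period(delay, weights), clock_period(delay, retimed))  # 13 7
print(sum(weights.values()), sum(retimed.values()))                # 2 3
```

On this assumed graph the period drops from 13 to 7 while the flip-flop count rises from 2 to 3, because cmp1 has one fan-in edge and two fan-out edges.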
Fig. 3.4 Retime value r(v)
Fig. 3.5 Graph representation of the retimed circuit
Therefore, classical retiming can be viewed as an integer-valued vertex
labeling of the graph with retime values such that, by the specified adding
and removing of flip-flops, the new graph has the minimal achievable clock
period while the structure of the graph is unchanged.
In [2], the algorithm for minimizing the clock period of a circuit is based
on two quantities defined as:
$W(u, v) = \min \{\, w(p) : p \text{ is a path from } u \text{ to } v \,\}$
$D(u, v) = \max \{\, d(p) : p \text{ is a path from } u \text{ to } v \text{ and } w(p) = W(u, v) \,\}$
W(u, v) is the minimum number of flip-flops on any path from vertex u to v,
and D(u, v) is the maximum total propagation delay on any critical path from
u to v.
Based on these two quantities, [2] developed the following lemma, which is
the basic tool needed to solve the min-clock-period retiming problem.
For a given clock period c, the retimed circuit has a clock period
$c' \le c$ if and only if:
(1) $r(u) - r(v) \le w(u, v)$ for every edge e(u, v) of G
(2) $r(u) - r(v) \le W(u, v) - 1$ for all vertices u, v such that $D(u, v) > c$
The constraints on the unknowns r(v) are linear inequalities involving only
differences of the unknowns, and thus they can be regarded as an instance of
a linear programming problem. Using the Bellman-Ford algorithm to test
whether a given clock period c is feasible takes only $O(|V|^3)$ time for
the $O(|V|^2)$ inequalities. The algorithm to find the minimum clock period
is summarized below:
1. Compute W(u, v) and D(u, v) for all $u, v \in V$ such that u is connected
to v.
2. Sort the elements in the range of D.
3. Binary search among the values D(u, v) for the minimum achievable clock
period. Use the Bellman-Ford algorithm to test whether each candidate clock
period is achievable.
4. For the minimum clock period found in step 3, use the values of r(v)
found as the optimal retiming solution.
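The four steps above can be sketched as follows. This is an illustrative sketch, not the implementation of [1] or [2]: it uses Floyd-Warshall on lexicographic pairs (w, -d) for step 1 instead of a faster all-pairs algorithm, and it runs on a hypothetical graph (two 3 ns comparators, one 7 ns adder, a zero-delay host) whose wiring is assumed.

```python
from itertools import product

# Hypothetical retime graph (assumed wiring, not taken from Fig. 3.2).
delay = {"host": 0, "cmp1": 3, "cmp2": 3, "add": 7}
edges = {("host", "cmp1"): 2, ("cmp1", "cmp2"): 0, ("cmp1", "add"): 0,
         ("cmp2", "add"): 0, ("add", "host"): 0}

def wd_matrices(delay, edges):
    """Step 1: W(u, v) and D(u, v) via shortest paths on pairs (w, -d)."""
    inf = float("inf")
    dist = {(u, v): (inf, 0) for u in delay for v in delay}
    for u in delay:
        dist[u, u] = (0, 0)                    # empty path from u to itself
    for (u, v), w in edges.items():
        dist[u, v] = min(dist[u, v], (w, -delay[u]))
    for k, u, v in product(delay, repeat=3):   # k varies slowest
        a, b = dist[u, k], dist[k, v]
        if a[0] < inf and b[0] < inf:
            dist[u, v] = min(dist[u, v], (a[0] + b[0], a[1] + b[1]))
    W = {uv: p[0] for uv, p in dist.items() if p[0] < inf}
    D = {uv: delay[uv[1]] - p[1] for uv, p in dist.items() if p[0] < inf}
    return W, D

def feasible_retiming(delay, edges, W, D, c):
    """Bellman-Ford on the constraints r(u) - r(v) <= bound."""
    cons = [(u, v, w) for (u, v), w in edges.items()]
    cons += [(u, v, W[u, v] - 1) for (u, v) in D if D[u, v] > c]
    r = {v: 0 for v in delay}                  # virtual source at distance 0
    for _ in range(len(delay)):
        changed = False
        for u, v, bound in cons:
            if r[v] + bound < r[u]:
                r[u] = r[v] + bound
                changed = True
        if not changed:
            return r
    return None                                # negative cycle: infeasible

def min_clock_period(delay, edges):
    W, D = wd_matrices(delay, edges)
    candidates = sorted(set(D.values()))       # step 2
    lo, hi, best = 0, len(candidates) - 1, None
    while lo <= hi:                            # step 3: binary search
        mid = (lo + hi) // 2
        r = feasible_retiming(delay, edges, W, D, candidates[mid])
        if r is None:
            lo = mid + 1
        else:
            best, hi = (candidates[mid], r), mid - 1
    return best                                # step 4: (period, r)

period, r = min_clock_period(delay, edges)
print(period)  # 7
```

The binary search is valid because feasibility is monotone in c: any retiming that achieves a period of at most c also achieves every larger period.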
In addition, [2] proposed a more efficient algorithm to determine whether a
given clock period is feasible, by iteratively relaxing the constraints for
each tentative retiming. This algorithm takes only $O(|V||E|)$ time, which
is a significant improvement over the original $O(|V|^3)$. Combined with the
binary search, the authors gave an $O(|V||E| \lg |V|)$ algorithm which can
solve the min-clock-period problem.
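A minimal sketch of such an iterative relaxation test is given below, following the well-known FEAS procedure of Leiserson and Saxe: repeat |V| - 1 times, compute arrival times on the tentatively retimed graph and increment r(v) for every node whose arrival time exceeds c. The example graph is hypothetical and the function names are illustrative.

```python
# Hypothetical retime graph (assumed wiring): d(v) and w(u, v).
delay = {"host": 0, "cmp1": 3, "cmp2": 3, "add": 7}
weights = {("host", "cmp1"): 2, ("cmp1", "cmp2"): 0, ("cmp1", "add"): 0,
           ("cmp2", "add"): 0, ("add", "host"): 0}

def arrival_times(delay, weights):
    """Arrival time of each node along zero-weight (register-free) edges."""
    preds = {v: [] for v in delay}
    for (u, v), w in weights.items():
        if w == 0:
            preds[v].append(u)
    memo = {}

    def arrival(v):
        if v not in memo:
            memo[v] = delay[v] + max((arrival(u) for u in preds[v]), default=0)
        return memo[v]

    return {v: arrival(v) for v in delay}

def feas(delay, weights, c):
    """Return a retiming r with clock period <= c, or None if none is found."""
    r = {v: 0 for v in delay}

    def retimed():
        return {(u, v): w + r[v] - r[u] for (u, v), w in weights.items()}

    for _ in range(len(delay) - 1):
        arr = arrival_times(delay, retimed())
        for v in delay:
            if arr[v] > c:
                r[v] += 1        # pull one flip-flop onto v's input edges
    arr = arrival_times(delay, retimed())
    return r if max(arr.values()) <= c else None

print(feas(delay, weights, 7))  # {'host': 1, 'cmp1': 0, 'cmp2': 0, 'add': 1}
print(feas(delay, weights, 6))  # None: the adder alone needs 7
```

Only differences of r matter, so the result for c = 7 is equivalent to r(cmp1) = r(cmp2) = -1 with r(host) = r(add) = 0.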
The retiming problem can also be formulated as a Mixed-Integer Linear
Programming (MILP) problem.
For a graph G(V, E, d, w), there exists a legal retiming such that
$c' \le c$ if, for every vertex v, there exist a real value s(v) and an
integer value r(v) such that:
$-s(v) \le -d(v)$ for every vertex $v \in V$
$s(v) \le c$ for every vertex $v \in V$
$r(u) - r(v) \le w(u, v)$ for every edge $e(u, v) \in E$
$s(u) - s(v) \le -d(v)$ for every edge $e(u, v) \in E$ wherever $r(u) - r(v) = w(u, v)$
Based on the above constraints, we can solve the retiming problem using a
mathematical programming approach. An algorithm is derived from this MILP
basis. The basic steps of the algorithm are as follows:
1. Compute W(u, v) and D(u, v) for all $u, v \in V$ such that u is connected
to v.
2. Sort the elements in the range of D.
3. Binary search among the values D(u, v) for the minimum achievable clock
period. Use the MILP to test whether each candidate clock period is
achievable.
4. For the minimum clock period found in step 3, use the values of r(v)
found as the optimal retiming solution.
Step 1 runs in $O(|V||E| + |V|^2 \lg |V|)$ time if the Fibonacci heap data
structure by Fredman and Tarjan [23] is used for the all-pairs shortest
paths algorithm [24]. Step 2 runs in $O(|V|^2 \lg |V|)$ time for the
$O(|V|^2)$ elements. Every iteration of the binary search in step 3 requires
solving a MILP with $|V|$ integer variables, $|V|$ real variables and
$2|V| + 2|E|$ inequalities. The total time of step 3 is thus
$O(|V||E| \lg |V| + |V|^2 \lg^2 |V|)$. Therefore, the total runtime of the
algorithm is $O(|V||E| \lg |V| + |V|^2 \lg^2 |V|)$.
Although the theoretical formulation by Leiserson and Saxe solves the
retiming problems with polynomial complexity, the implementation of the
algorithms was not considered, which leads to high complexity for large
circuits with more than 500 combinational cells. Shenoy et al. [25] address
the implementation issues required to exploit the sparsity of circuit
graphs, allowing min-period retiming as well as constrained min-area
retiming to be applied to circuits with as many as 10,000 combinational
cells. Recently, Zhou [26] proposed an efficient incremental algorithm for
min-period retiming which iteratively moves FFs to decrease the clock period
while guaranteeing that the optimal solution is found in a short time.
3.2 Min-Area Retiming 
There can be more than one way to reposition the flip-flops while achieving
the same optimal clock period. Min-area retiming therefore targets
minimizing the number of flip-flops, so that the total area of the circuit
is minimized. Min-area retiming minimizes the FF area under a given clock
period, and thus can be used to minimize the FF area even under the minimum
clock period.
Building on classical min-clock period retiming, min-area retiming further
searches for a solution with minimum total weight. The basic formulation of
the problem is the same: a sequential circuit C is represented by a directed
graph G(V, E), each combinational element is represented by a node v, and
each connection is represented by an edge e(u, v). The flip-flops on the
edges are denoted by the weights on the edges. r(v) again represents the
number of flip-flops that are retimed backward across the node v. The weight
after retiming is w'(u, v) = w(u, v) + r(v) - r(u). The min-area retiming
objective is to minimize the total number of flip-flops, i.e. the sum of all
w'(u, v). Using the expression for w'(u, v), this leads to the following
optimization problem:
$\sum_{v \in V} r(v) \cdot (|FI(v)| - |FO(v)|) \to \text{minimum}$
where FI(v) and FO(v) represent the sets of fanin and fanout gates of gate v.
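The step from the total weight to the objective above follows because each edge e(u, v) contributes +r(v) once per fan-in of v and -r(u) once per fan-out of u. The sketch below checks this identity numerically on a hypothetical graph; the graph, the retiming r and the function names are assumptions for illustration, and fan-out flip-flop sharing is not modeled.

```python
# Hypothetical retime graph and retiming (not taken from the thesis).
weights = {("host", "cmp1"): 2, ("cmp1", "cmp2"): 0, ("cmp1", "add"): 0,
           ("cmp2", "add"): 0, ("add", "host"): 0}
r = {"host": 0, "cmp1": -1, "cmp2": -1, "add": 0}

def ff_count_direct(weights, r):
    """Sum of w'(u, v) = w(u, v) + r(v) - r(u) over all edges."""
    return sum(w + r[v] - r[u] for (u, v), w in weights.items())

def ff_count_objective(weights, r):
    """Original count plus sum over v of r(v) * (|FI(v)| - |FO(v)|)."""
    fanin = {v: 0 for v in r}
    fanout = {v: 0 for v in r}
    for (u, v) in weights:
        fanout[u] += 1
        fanin[v] += 1
    change = sum(r[v] * (fanin[v] - fanout[v]) for v in r)
    return sum(weights.values()) + change

print(ff_count_direct(weights, r), ff_count_objective(weights, r))  # 3 3
```

Since the original count is a constant, minimizing the total number of flip-flops is the same as minimizing the fan-in/fan-out term. Here cmp1 has one fan-in and two fan-outs, so moving a flip-flop forward across it adds one flip-flop.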
Early min-area retiming basically follows the min-clock period retiming idea
of Leiserson and Saxe. Later, a lot of work was done on the min-area
retiming algorithm. Shenoy et al. [25] were among the first to consider a
practical implementation of the min-area retiming algorithm. They proposed
techniques to prune away redundant constraints, which lead to higher
efficiency in time and space usage when solving the problem. Singh et al.
[27] also proposed to move FFs in the circuit incrementally to overcome the
expense of previous approaches to min-area retiming; however, their approach
is a heuristic which only looks for better moves and may end up with a
sub-optimal solution. Recently, Jia Wang et al. [28] proposed an efficient
algorithm, iMinArea, which can solve the min-area retiming problem
incrementally and optimally, with a runtime reported to be much faster than
all existing approaches.
3.3 Retiming for Low Power 
Power dissipation is receiving more and more attention today. Many
optimization techniques have been applied to this area, and retiming is one
of them, because retiming can strongly affect the critical path delay and
the number and position of the flip-flops, all of which are significant
factors in a circuit's power dissipation.
Previous works in this field mainly consider reducing the switching
activities of the circuit, as switching activities depend largely on the
position of the flip-flops. Switching activity is a significant cause of
power dissipation in combinational and sequential circuits. Logic synthesis
has been used to improve the power dissipation of a circuit. In [29], a new
cost function for combinational logic synthesis targeting low power was
presented. A method to speed up a sequential circuit using retiming and then
lower the power dissipation (and increase the delay) by scaling down the
power supply voltage was presented in [30]. There are also methods that
lower power dissipation by restructuring the combinational logic. In [31],
the authors investigated the application of retiming to modify the switching
activities on internal wires of a circuit and demonstrated the impact of
these techniques on average power dissipation.
Fig. 3.6 (a) Switching activity and power dissipation (b) Switching activity and power dissipation after retiming
The power dissipation of a gate g is proportional to its output switching
activity $E_g$ times the load capacitance $C$ of its fanout, i.e. $E_g C$.
The effect of retiming on switching activity is demonstrated in Fig. 3.6,
where L0, L1 and L2 are combinational elements of the circuit. The total
power dissipation of each of (a) and (b) is a sum of such $E \cdot C$
products, and the two sums are in general different: the switching activity
at the output of an FF is generally lower than that at its input, because
the output of an FF changes at most once every clock period. Therefore,
changing the position of an FF can influence the total power dissipation. A
general idea is to retime the FFs to the wires with high switching
activities.
In [31], the nodes are selected by a cost function based on the switching
activity at their output and the number of fanouts:
$Weight(i) = P(i) \cdot (ni(i) + no(i))$
where P(i) is the power estimated from switching activity and load
capacitance, and ni and no are the numbers of fanins and fanouts.
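As a small illustration of how such a cost function ranks candidate nodes, the sketch below evaluates Weight(i) = P(i) * (ni(i) + no(i)) on made-up data. The node names, power values and fan counts are hypothetical, and the sketch covers only the ranking step, not the retiming procedure of [31].

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    power: float    # P(i): estimated from switching activity and load
    n_fanin: int    # ni(i)
    n_fanout: int   # no(i)

def weight(node: Node) -> float:
    """Weight(i) = P(i) * (ni(i) + no(i))."""
    return node.power * (node.n_fanin + node.n_fanout)

# Hypothetical gates; the numbers are for illustration only.
nodes = [Node("g1", 0.40, 2, 1), Node("g2", 0.25, 2, 3), Node("g3", 0.10, 1, 1)]

ranked = sorted(nodes, key=weight, reverse=True)
print([n.name for n in ranked])  # ['g2', 'g1', 'g3']
```

Here g2 outranks g1 despite its lower power estimate, because it has more fanins and fanouts.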
The switching activity is estimated by the probability that a transition
propagates through the transitive fanout of the node. Retiming is then
executed based on the cost function, trying to place FFs at the fanouts of
high-switching gates.
In [32], the authors conduct an empirical study of power dissipation when
different levels of pipelining are added to a circuit. They discovered that
by adding more levels of pipelining to a sequential circuit, more gates are
likely to have balanced paths, and so there is a power reduction.
In [33], the authors propose an algorithm for pipeline insertion in a
sequential circuit. Since this is the only known algorithm for low power
retiming, it warrants close inspection. The first step of their algorithm is
to perform a detailed power estimation of all the gates in the network. Each
gate is then weighted by three factors: (1) the amount of glitching activity
at the gate (the difference between the switching activity under a general
delay model and under the zero delay model); (2) the probability that a
transition at a gate results in transitions in the gate's transitive fanout
(at most two levels ahead); (3) the number of fanouts of the gate. Latches
are then placed in a greedy fashion in front of the gates based upon their
weights. In an attempt to reduce the total latch count, the final step of
their algorithm is to forward retime gates which are not on the same paths
as the gates that were greedily retimed. The approach in [33] has three
limitations: (1) the algorithm requires a costly simulation for each level
of pipelining that is inserted; (2) placing a latch in front of a gate may
prevent the propagation of a glitch at the cost of generating a new glitch;
(3) the algorithm may introduce large latch counts.
As reported in [34], in reality the power consumed by a single latch is
considerably greater than the power consumed by a single transition. It is
very likely that the power dissipation can be cut down further if the number
of FFs can be reduced.
So far, using retiming to reduce power has not proved readily applicable:
simulating the switching activity and incorporating it into retiming is a
tedious task. Yet it is clear that reducing the number of FFs is critical
for reducing the total power dissipation and beneficial for clock tree
generation.
3.4 Retiming with Interconnect Delay 
As VLSI process technology scales down to the deep submicron era, a major
drawback of the traditional retiming approaches (both min-period retiming
and min-area retiming) is that none of the retiming algorithms above
includes the interconnect delay (or wire delay) in the path delay. However,
in today's technology, interconnect delay has become a predominant factor
which can be much larger than the delay of a gate. Hence, not only the gate
delay but, more importantly, the interconnect delay should be considered.
The traditional formulation, which ignores the interconnect delay, is
therefore not accurate enough for delay estimation.
The work of [11] gives a new retiming formulation including both gate
delay and interconnect delay. Based on experiments conducted in [11], the
interconnect delay is nearly proportional to the interconnect wire length;
the formulation therefore assumes that the interconnect delay is proportional
to the wire length. The wire length can be estimated by the Manhattan
distance between the connected cells in the given placement.
Similar to the Min-clock period retiming, a circuit C is represented by a
corresponding graph G(V, E, w, d). Each node v corresponds to a
combinational gate and each directed edge e(u, v) represents a connection
from the output of gate u to the input of gate v. For each combinational
element v in the circuit, there is a propagation delay d(v). The number of
flip-flops on a connection is modeled as the weight w(u, v) on the edge
e(u, v). In addition, the interconnect delay on an edge e(u, v) is represented
by d(u, v), which is estimated from the Manhattan distance between the
corresponding cells in the given placement.
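The graph model above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Edge` class, the `build_edges` helper, the `delay_per_unit` factor and the toy placement are hypothetical names and values, not part of [11] or of the thesis implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    u: str      # driving gate
    v: str      # driven gate
    w: int      # number of flip-flops on the connection, w(u, v)
    d: float    # interconnect delay d(u, v)


def manhattan(p, q):
    """Manhattan distance between two placed cells given as (x, y) tuples."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def build_edges(connections, placement, delay_per_unit=1.0):
    """connections: iterable of (u, v, w); placement: dict gate -> (x, y).

    The interconnect delay is taken as proportional to the Manhattan
    distance between the two connected cells (assumed factor: delay_per_unit).
    """
    return [
        Edge(u, v, w, delay_per_unit * manhattan(placement[u], placement[v]))
        for u, v, w in connections
    ]


# Toy placement with three gates; the coordinates are made up for illustration.
placement = {"g1": (0, 0), "g2": (3, 4), "g3": (5, 4)}
edges = build_edges([("g1", "g2", 0), ("g2", "g3", 1)], placement)
```

With this toy placement, the wire g1 to g2 gets a delay of 7 units and the wire g2 to g3 a delay of 2 units.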
It is assumed that G is strongly connected. If it is not, [11] states that a
source node s can be added to the circuit and connected to all primary inputs,
a target node t can be added with all primary outputs connected to it, and t
can be connected to s. The resulting graph is then strongly connected. The
delays of s, t and all the added edges are set to zero; the number of registers
on the edge from t to s is set to one and that on the other added edges is set
to zero. A retiming solution of the modified graph is then also a valid
retiming solution of the original graph.
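A minimal sketch of this augmentation, assuming edges are stored as (u, v, w, d) tuples and that the node names "s" and "t" are free; the function name is hypothetical:

```python
def make_strongly_connected(edges, primary_inputs, primary_outputs):
    """Return a new edge list with the source s and the target t added.

    edges: list of (u, v, w, d) tuples, where w is the register count and
    d the interconnect delay. All added edges have zero delay; only the
    feedback edge t -> s carries a register. The gate delays d(s) and d(t)
    are zero as well and must be set by the caller.
    """
    extra = [("s", pi, 0, 0.0) for pi in primary_inputs]
    extra += [(po, "t", 0, 0.0) for po in primary_outputs]
    extra.append(("t", "s", 1, 0.0))  # single register on the feedback edge
    return list(edges) + extra


augmented = make_strongly_connected(
    [("in1", "g1", 0, 2.0), ("g1", "out1", 0, 3.0)],
    primary_inputs=["in1"],
    primary_outputs=["out1"],
)
```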
One method [11] for the problem is extended from the Mixed Integer
Linear Programming (MILP) method introduced in Section 3.1. This method
guarantees the optimal solution. The original Min-clock period retiming
formulation considers only gate delay; however, the MILP approach can be
extended to solve the problem with both gate and interconnect delay optimally
by modifying some of the constraints.
A variable a(v) is defined for each node v to represent the maximum
arrival time of v. As shown in Fig. 3.7, a(v) is the time given for a signal to
travel from a flip-flop in the fanin of node v to the output of node v. If there
are multiple flip-flops, a(v) is given for the one with the longest path delay
to v.
Fig. 3.7: Example of the arrival time [11]
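The MILP is formulated as follows:
\[
\begin{aligned}
d(v) &\le a(v), && \forall v \in V && (1)\\
a(v) &\le T, && \forall v \in V && (2)\\
r(v) + w(u,v) - r(u) &\ge 0, && \forall e(u,v) \in E && (3)\\
a(u) + d(u,v) + d(v) - T\,\bigl(r(v) + w(u,v) - r(u)\bigr) &\le a(v), && \forall e(u,v) \in E && (4)
\end{aligned}
\]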
Constraints (1) and (2) are straightforward: the arrival time of a node is at
least its own gate delay and at most the clock period. The left-hand side of
(3) is the number of flip-flops on the edge e(u, v) after retiming; for a legal
retiming, this weight must never be negative. (4) is a new constraint that
accounts for the interconnect delay. It states that the arrival time at a node
v must be larger than or equal to the delay of the path from the flip-flop to
the node v.
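To make constraints (1) to (4) concrete, the following sketch checks whether a given assignment of r(v) and a(v) satisfies them for a clock period T. It only verifies a candidate assignment and is not the MILP solver used in [11]; the function name, the tuple layout and the small two-gate loop are illustrative assumptions.

```python
def is_feasible(T, gate_delay, edges, r, a, eps=1e-9):
    """Check constraints (1)-(4) for clock period T.

    gate_delay: dict v -> d(v); edges: list of (u, v, w, d_uv);
    r: dict v -> integer retime value; a: dict v -> arrival time.
    """
    for v, dv in gate_delay.items():
        if dv > a[v] + eps:          # (1) d(v) <= a(v)
            return False
        if a[v] > T + eps:           # (2) a(v) <= T
            return False
    for u, v, w, d_uv in edges:
        w_r = r[v] + w - r[u]        # flip-flops on e(u, v) after retiming
        if w_r < 0:                  # (3) retimed weight is non-negative
            return False
        if a[u] + d_uv + gate_delay[v] - T * w_r > a[v] + eps:   # (4)
            return False
    return True


# Hypothetical two-gate loop: u -> v has no FF, v -> u carries one FF.
gate_delay = {"u": 1, "v": 1}
edges = [("u", "v", 0, 2), ("v", "u", 1, 1)]
r = {"u": 0, "v": 0}
a = {"u": 1, "v": 4}
ok_at_5 = is_feasible(5, gate_delay, edges, r, a)
ok_at_3 = is_feasible(3, gate_delay, edges, r, a)
```

For this assignment the check passes at T = 5 and fails at T = 3, because a(v) = 4 would exceed the clock period.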
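If we define a real variable R(v) as R(v) = a(v)/T + r(v), the constraints (1)
to (4) can be transformed to:
\[
\begin{aligned}
R(v) - r(v) &\ge \frac{d(v)}{T}, && \forall v \in V && (5)\\
R(v) - r(v) &\le 1, && \forall v \in V && (6)\\
r(u) - r(v) &\le w(u,v), && \forall e(u,v) \in E && (7)\\
R(u) - R(v) &\le w(u,v) - \frac{d(u,v)}{T} - \frac{d(v)}{T}, && \forall e(u,v) \in E && (8)
\end{aligned}
\]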
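As a worked step, the edge constraint on R can be obtained from (4) by dividing by T > 0 and grouping the terms that form R(u) = a(u)/T + r(u) and R(v) = a(v)/T + r(v). This is a short derivation under the definitions given above, not a quotation from [11]:

```latex
\begin{aligned}
a(u) + d(u,v) + d(v) - T\,\bigl(r(v) + w(u,v) - r(u)\bigr) &\le a(v) \\
\frac{a(u)}{T} + r(u) - \frac{a(v)}{T} - r(v) &\le w(u,v) - \frac{d(u,v)}{T} - \frac{d(v)}{T} \\
R(u) - R(v) &\le w(u,v) - \frac{d(u,v)}{T} - \frac{d(v)}{T}
\end{aligned}
```

In the same way, dividing (1) and (2) by T gives d(v)/T <= R(v) - r(v) <= 1, and (3) is simply rearranged into r(u) - r(v) <= w(u, v).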
There are |V| real variables R(v), |V| integer variables r(v), and
2|V| + 2|E| constraints. This problem can be solved in O(|V||E| lg|V| +
|V|^2 lg^2|V|) time if the Fibonacci heap [23] is used as the data structure.
Once this set of constraints is solved, that is, once the clock period T and
the variables r(v) are decided, we can determine the exact positions of the
flip-flops on the wires. In addition, [11] also proposed a fast near-optimal
approach (0.13% more than the optimal clock period) for the interconnect
delay retiming, by first reducing the problem to the single-source longest
path problem.
• End of chapter. 
Chapter 4 
Rewired Retiming for Flip-flop 
Reduction 
The last chapter reviewed previous work on retiming, from the classical
Min-period retiming to newer approaches in Min-area retiming, interconnect
delay retiming and the application to power reduction. Taken together, these
works show that retiming is a topic which closely follows the pace of
technology advancement and has attracted much attention due to its potential
for area, timing and power optimization. However, there is still plenty of
room to explore in this area, as previous retiming-related optimizations
seldom consider interconnect delay, and there is a need to develop
optimization tools for area, timing and power based on this more accurate
delay model. Combining retiming with logic synthesis is a good opportunity to
achieve further improvement on this subject. This chapter describes in detail
the rewired retiming technique proposed by the author.
4.1 Motivation and Problem Formulation 
As demonstrated in the previous chapter, in today's deep submicron
technology, a traditional retiming algorithm which ignores the interconnect
delay is no longer accurate enough, because the interconnect delay can
dominate and be larger than the logic/gate delay. However, most existing
retiming-related applications and optimizations did not consider a more
accurate delay model with interconnect delay until this problem was
addressed in [5], where the retiming problem is re-formulated to include both
gate and interconnect delays, and the interconnect delay is assumed to
be proportional to the wire length.
Nevertheless, as demonstrated in [12], the optimal clock period gained
from retiming may not be feasible after the circuit is actually placed. As a
large number of flip-flops are relocated and the flip-flop count usually
increases after retiming, a traditionally retimed optimal clock period might
not be close to reality in a legalized placement. Moreover, the additional
flip-flops can introduce a larger power consumption. Therefore, besides delay
improvement, it is also important to cut down the retiming-induced flip-flops
for both area and power reduction, which is a problem not addressed in [11].
We therefore integrate the interconnect-based retiming with rewiring to
achieve better and more accurate circuit optimization. Our experiments
demonstrate that, with the application of logic transformation using
rewiring, we can further reduce the number of flip-flops based on the
model introduced in [11].
The example in Fig. 4.1 shows how rewiring helps in reducing the
number of flip-flops through logic transformation. Assume that each logic
gate has a delay of 1 unit, while the interconnect delay is proportional to the
(shortest) Manhattan distance between the placed logic elements. In Fig. 4.1(a),
the initial circuit has a clock period of 33 with two FFs. A conventional
retiming would produce a solution (Fig. 4.1(b)) with a reduced clock period of
22, but with an increased FF count of three. As a new FF is added to the
wire d→g4, we try to see if we can remove this wire. Using the rewiring
technique, we know that d→g3 is an alternative wire for the target wire
d→g4. After the rewiring transformation, the retiming solution uses only two
FFs, with a clock period of 21.5. The rewired circuit after retiming is shown
in Fig. 4.1(c).
(a) The initial circuit: the clock period is 33. An FF is going to move across
g4, guided by retiming. By rewiring implication, d→g3 is an alternative wire
to replace d→g4.
(b) Retiming without rewiring: the clock period is reduced to 22, while the
FFs are increased.
(c) Retiming after rewiring: after replacing d→g4 by d→g3, the number of FFs
is reduced compared to (b); the retimed clock period is 21.5.
Fig. 4.1: Flip-flop reduction using retiming and rewiring
The rewired retiming problem can be formulated as follows:
Given a sequential circuit C and its placement P, based on an interconnect
delay model, we can compute its initial retiming solution: the minimum clock
period T and the number of FFs n after retiming.
Applying several iterations of rewiring transformations, we want to find
a functionally equivalent circuit C' and its corresponding placement P', such
that after retiming C' based on P', the number of FFs n' is considerably lower
than n, while the clock period T' is no worse than T.
4.2 Retiming Indication 
In order to achieve better delay estimation and make our technique more
practical, we adopt the delay model proposed in [11], in which the
interconnect delays are assumed to be proportional to wire lengths, i.e.
interconnect delays are estimated from the shortest Manhattan distance
between the connected cells in the given placement. The gate delay is assumed
to be 1 unit of the wire delay. Though there can be many other delay
models, we assume that a general flow responsive to a reasonable cost
function can also be effective for others. We implement the Mixed Integer
Linear Programming approach to solve the interconnect delay retiming
problem.
As discussed in Chapter 3, a graph G(V, E, w, d) is used to represent the
sequential circuit. Each node v corresponds to a combinational gate and each
directed edge e(u, v) represents a connection from the output of gate u to the
input of gate v. For each v in the circuit, there is a propagation delay d(v).
The number of flip-flops is modeled as the weight w(u, v) on the edge e(u, v);
w(u, v) is a non-negative integer. The interconnect delay of edge e(u, v)
without any FF is represented by d(u, v).
An integer retime value r(v) is defined for each node v to represent
the flip-flop movements across the node, as shown in Fig. 4.2. A positive
value r(v) = m means that m flip-flops are moved from every output edge of v
to every input edge of v. Similarly, a negative value r(v) = -m stands for the
opposite moving direction. Note that in some retiming works involving
placement, multiple fanouts can share the same FF when the circuit is placed,
which can reduce the FFs needed slightly but is highly dependent on the actual
placement tool applied. In order to give a fairer comparison of the FF
reductions produced by our flow, in this work we assume that each fanout takes
one FF, i.e. we disallow FF sharing among fanouts. When the circuit is finally
placed, we place the same number of FFs as produced by the retiming procedure.
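The effect of a retime value on the edge weights, and the FF count under the no-sharing assumption, can be illustrated as follows. The helper names and the three-edge example are hypothetical; the weight update w(u, v) + r(v) - r(u) is the left-hand side of constraint (3).

```python
def retimed_weights(edges, r):
    """edges: list of (u, v, w); r: dict node -> integer retime value.

    Returns {(u, v): w_r} with w_r(u, v) = w(u, v) + r(v) - r(u).
    """
    out = {}
    for u, v, w in edges:
        w_r = w + r[v] - r[u]
        if w_r < 0:
            raise ValueError(f"illegal retiming on edge ({u}, {v})")
        out[(u, v)] = w_r
    return out


def total_flip_flops(weights):
    """FF count when every edge keeps its own FFs (no sharing among fanouts)."""
    return sum(weights.values())


# Gate g has two fanins and one fanout that carries a single FF.
edges = [("a", "g", 0), ("b", "g", 0), ("g", "c", 1)]
r = {"a": 0, "b": 0, "g": 1, "c": 0}   # r(g) = +1: move the FF to g's inputs
after = retimed_weights(edges, r)
ff_before = sum(w for _, _, w in edges)
ff_after = total_flip_flops(after)
```

Here one FF on the single fanout of g becomes two FFs on its two fanins, which is the fanin/fanout imbalance that makes conventional retiming increase the FF count.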
Fig. 4.2: Retiming flip-flops forward (r(v) = -1) and backward (r(v) = +1)
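Finally, based on the mixed integer linear programming (MILP) approach
(constraints (5) to (8) in Section 3.4), by introducing a variable
R(v) = a(v)/T + r(v) for each node v, a set of constraints is formulated as
follows:
\[
\begin{aligned}
R(v) - r(v) &\ge \frac{d(v)}{T}, && \forall v \in V\\
R(v) - r(v) &\le 1, && \forall v \in V\\
r(u) - r(v) &\le w(u,v), && \forall e(u,v) \in E\\
R(u) - R(v) &\le w(u,v) - \frac{d(u,v)}{T} - \frac{d(v)}{T}, && \forall e(u,v) \in E
\end{aligned}
\]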
This problem can be solved in O(|V||E| lg|V| + |V|^2 lg^2|V|) time. By solving
for r(v) at each node, the weight of each edge after retiming is decided.
Therefore, in our iterative optimization process, the variable r(v) is used to
predict the movement of FFs. Once we find a retiming solution with the optimal
clock period T, the value r(v) of each node is produced as a side effect, and
we make use of this value to indicate the movement of FFs. With this
information, heuristics can be developed to guide the rewiring optimization
process.
4.3 Target Wire Selection 
Given a target wire, rewiring is able to find a list of alternative wires
which, when added to the circuit, make the target wire redundant and
thus removable. For a sequential circuit, the logic implication [13] is done
only within the combinational sub-network bounded by FFs. Therefore, the
alternative wires are within the same combinational sub-network as the target
wire. A wire with FFs is treated as a primary input or output of the
sub-network and cannot be a target wire.
By selecting target wires (TWs) which have an effect on the movement of the
FFs, the number of FFs produced by retiming can be changed. Here, a few
heuristics are developed, based on some typical conditions observed, to guide
the selection of TWs.
Consider an edge e(u, v) (u is a fanin of v) as shown in Fig. 4.3, where r(v)
denotes the number of FFs moved from v's fanouts to its fanins, and r(u)
denotes the number of FFs moved from u's fanouts to its fanins. There are
two conditions which are effective in FF reduction:
Fig.4.3 (a) Condition 1: e(u, v) is selected 
Fig.4.3 (b) Condition 2: e(u, v) is selected 
Condition 1: r(v) - r(u) > 0 (r(v) > 0). FFs are moved from the fanout(s) of v
to this edge and the other input edge(s) of v. Replacing e(u, v) by an
alternative wire is therefore likely to reduce the FFs after retiming, so
e(u, v) is selected as a target wire.
Condition 2: r(v) - r(u) > 0 (r(u) < 0). FFs are moved from the fanin edges
of u to the fanout(s) of u. If u has more than one fanout edge, replacing
e(u, v) by an alternative wire is likely to reduce the FFs after retiming, so
e(u, v) is selected as a target wire.
Under the above two conditions, the FFs on this edge will have a net increase
after retiming, as they are moved in from the fanout(s) of v and/or the
fanin(s) of u. Therefore, selecting e(u, v) as a target wire and replacing it
by its alternative wire (before applying retiming) is likely to reduce the
number of FFs in the retiming result.
Consider the other conditions. If r(v) - r(u) < 0, e(u, v) must already have
FFs before retiming (because the weight of an edge can never be negative,
which is guaranteed by the retiming constraints). A wire with FFs is treated
as a primary input or output during logic implication [13] and cannot be a
target wire. If r(v) - r(u) = 0, e(u, v) is not selected as a target wire,
since retiming does not change the number of FFs on this edge.
In this way, retiming gives us a hint to direct the rewiring
transformations. Indicated by the retime value r(v), the wires that have a
direct effect on the number of FFs are identified as target wires. By replacing
them with their alternative wires, the FFs after retiming are very likely to be
reduced. The TW-AW list should be updated whenever a rewiring
transformation is applied.
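The selection rules of this section can be sketched as a small filter over the edges. This is an illustrative reading of Conditions 1 and 2, assuming edges are (u, v, w) tuples and r holds the retime values returned by the MILP; the function name is hypothetical.

```python
from collections import Counter


def select_target_wires(edges, r):
    """Return the edges e(u, v) chosen as target wires.

    edges: list of (u, v, w); r: dict node -> retime value.
    """
    fanout = Counter(u for u, _, _ in edges)
    targets = []
    for u, v, w in edges:
        if w > 0:                 # wires carrying FFs cannot be target wires
            continue
        gain = r[v] - r[u]        # FFs that retiming would add to e(u, v)
        if gain <= 0:
            continue
        cond1 = r[v] > 0                         # FFs pulled back across v
        cond2 = r[u] < 0 and fanout[u] > 1       # FFs pushed forward across u
        if cond1 or cond2:
            targets.append((u, v))
    return targets


edges = [("a", "g", 0), ("b", "g", 0), ("g", "c", 1)]
r = {"a": 0, "b": 0, "g": 1, "c": 0}
targets = select_target_wires(edges, r)
```

In this example both fanin wires of g are selected, while g→c is skipped because it already carries an FF.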




4.4 Incremental Placement Update
After rewiring, we must provide an updated placement to support the next
iteration of retiming evaluation. To avoid the undesirable influence of the
randomness introduced by re-placing the whole circuit, we estimate the
position of the new cells and update the placement incrementally after
rewiring. Our approach is applicable to both 2-input gates and multiple-input
gates; for simplicity we use 2-input gates in our experiments. The update is
done as follows:
When a target wire is removed from the network, a corresponding sink
cell having only one input left is removable. In this case, we drop the cell
from the placement, and the positions of all other cells remain unchanged.
When an alternative wire is added to the circuit, a new gate is added between
the alternative wire's sink node and its fanout(s), as shown in Fig. 4.4.
Fig. 4.4: Example of placement estimation for adding a new wire
In Fig. 4.4, g1 and g2 are the source and sink of the alternative wire AW;
fanout1, fanout2 and fanout3 are the fanouts of g2. In order to add AW, a
new cell New is added to the circuit, connecting g1, g2 and the fanouts of g2
(bold lines). After adding the new cell and the new nets (bold lines), the nets
g2→fanout1, g2→fanout2 and g2→fanout3 are removed.
Assuming that the positions of all other cells remain the same, finding an
optimal position for the new gate, such that the total Manhattan distance
among g1, g2, New, fanout1, fanout2 and fanout3 is minimum, is a
difficult problem (similar to finding a Steiner Minimum Tree, which is
NP-hard). To avoid this complexity, we adopt the arithmetic average position
to estimate the position of the New gate.
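\[
\begin{aligned}
\mathit{new}.x &= (g_1.x + g_2.x + \mathit{fanout}_1.x + \mathit{fanout}_2.x + \mathit{fanout}_3.x) / 5\\
\mathit{new}.y &= (g_1.y + g_2.y + \mathit{fanout}_1.y + \mathit{fanout}_2.y + \mathit{fanout}_3.y) / 5
\end{aligned}
\]
In general, the position (x, y) for the new cell is estimated as:
\[
\begin{aligned}
\mathit{new}.x &= \Bigl(\mathit{AWsrc}.x + \mathit{AWdst}.x + \sum \mathit{fanouts}.x\Bigr) / N\\
\mathit{new}.y &= \Bigl(\mathit{AWsrc}.y + \mathit{AWdst}.y + \sum \mathit{fanouts}.y\Bigr) / N
\end{aligned}
\]
where AWsrc and AWdst are the source and sink of the alternative wire, and
N is the total number of fanouts plus 2 (AWsrc and AWdst).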
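The averaging rule translates directly into code. The sketch below assumes cell positions are (x, y) tuples; the function name and the coordinates are made up for illustration.

```python
def estimate_new_cell_position(aw_src, aw_dst, fanouts):
    """Arithmetic average of the AW source, the AW sink and the sink's fanouts.

    aw_src, aw_dst: (x, y) of the alternative wire's source and sink;
    fanouts: list of (x, y) positions of the sink's fanout cells.
    """
    points = [aw_src, aw_dst, *fanouts]
    n = len(points)               # N = number of fanouts + 2
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)


# g1 at (0, 0), g2 at (4, 2), and three fanouts of g2.
pos = estimate_new_cell_position((0, 0), (4, 2), [(6, 0), (6, 4), (4, 4)])
```

For these five points the estimate is (4.0, 2.0).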
As a rewiring step usually injects only a small perturbation in a local
area, our experiments indicate that this calculated position provides a
reasonable estimate close to the real placement change, and it can thus be
used for the next retiming and rewiring iterations. After all rewiring
iterations, the rewired circuit is retimed and placed again, and the clock
period is calculated from the final placement.
4.5 Optimization Flow 
An optimization scheme combining ATPG-based rewiring and retiming
is developed. As shown in Fig. 4.5, the optimization scheme consists of the
following basic steps:
(1) Initial retiming: select the initial TW-AW list based on the initial
retiming indication.
(2) Perform rewiring using a TW-AW pair from the TW-AW list.
(3) Incrementally update the placement according to the rewiring
transformation.
(4) Retiming evaluation: evaluate whether the rewiring is beneficial; if yes,
go to (5); if not, go to (6).
(5) If the number of accepted transformations is within N (a predefined
number of total iterations), accept the rewiring transformation, update
the TW-AW list, and go to (2) to perform the next iteration; if N is reached,
go to (8).
(6) Discard the change. If there is still a TW-AW pair in the TW-AW list, go
back to (2) with the next TW-AW pair; if the TW-AW list is empty, go to (7).
(7) If the number of perturbations performed is within a predefined M,
perturb the circuit by randomly performing a rewiring transformation,
update the TW-AW list and go back to (2); if M is reached, go to (8).
(8) Final retiming and placement: obtain the clock period.
The initial retiming and the retiming evaluation both solve the MILP for T
and r(v) to predict the FF movements and the resulting number of FFs, but the
FFs are not actually moved.
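The control flow of steps (1) to (8) can be summarized as a loop skeleton. This is a sketch under stated assumptions, not the thesis implementation: every callable passed in (`retime`, `list_pairs`, `rewire`, `beneficial`, `perturb`) is a hypothetical hook, the incremental placement update of step (3) is assumed to happen inside `rewire`, and the best solution is tracked separately from the current (possibly perturbed) circuit.

```python
def rewired_retiming(circuit, retime, list_pairs, rewire, beneficial,
                     perturb, max_accept, max_perturb):
    """Skeleton of the optimization flow.

    retime(c) -> retiming result (e.g. clock period and FF count)
    list_pairs(c, result) -> list of TW-AW pairs
    rewire(c, pair) -> rewired circuit with updated placement
    beneficial(new_result, best_result) -> bool
    perturb(c) -> circuit after one random rewiring transformation
    """
    best_circuit, best = circuit, retime(circuit)          # step (1)
    pairs = list(list_pairs(circuit, best))
    accepted = perturbed = 0
    while accepted < max_accept:                           # N in the text
        if not pairs:                                      # step (7)
            if perturbed >= max_perturb:                   # M in the text
                break
            circuit = perturb(circuit)
            perturbed += 1
            pairs = list(list_pairs(circuit, retime(circuit)))
            continue
        trial = rewire(circuit, pairs.pop(0))              # steps (2)-(3)
        result = retime(trial)                             # step (4)
        if beneficial(result, best):                       # step (5)
            circuit = best_circuit = trial
            best = result
            accepted += 1
            pairs = list(list_pairs(circuit, best))
        # otherwise step (6): discard the trial and try the next pair
    return best_circuit, best                              # input to step (8)
```

The caller would then run the final retiming and placement of step (8) on the returned circuit.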
Fig. 4.5: Overall optimization flow
To make the scheme more efficient, the N iterations are divided into three
stages (each stage has N/3 iterations), each with a different definition of
whether a rewiring is beneficial or not.
Stage 1: the transformation is beneficial if the clock period is less than or
equal to the initial clock period, and the number of FFs is less than that of
the retiming solution with the fewest FFs so far. This is a greedy stage.
Stage 2: the transformation is beneficial if the clock period is within 1.25
times the initial clock period, and the number of FFs is less than that of the
retiming solution with the fewest FFs so far.
Stage 3: the transformation is beneficial if the clock period is less than or
equal to the initial clock period, and the number of FFs is less than the best
result of stage 1.
Finally, the best result, with the fewest FFs and no larger clock period, from
stage 3 or stage 1 is taken. The final circuit is retimed and placed,
and the clock period is obtained from the placement under the same delay
model.
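The three acceptance rules can be written as one predicate. The sketch assumes that "initial clock period" refers to the clock period T of the initial retiming solution; the function and parameter names are hypothetical.

```python
def is_beneficial(stage, clk, ffs, init_clk, best_ffs, stage1_best_ffs=None):
    """Acceptance test for stages 1 to 3.

    clk, ffs: retiming result of the candidate circuit;
    init_clk: clock period of the initial retiming solution (assumption);
    best_ffs: fewest FFs among the retiming solutions seen so far;
    stage1_best_ffs: best FF count reached in stage 1 (used by stage 3).
    """
    if stage == 1:
        return clk <= init_clk and ffs < best_ffs
    if stage == 2:
        return clk <= 1.25 * init_clk and ffs < best_ffs
    if stage == 3:
        return clk <= init_clk and ffs < stage1_best_ffs
    raise ValueError("stage must be 1, 2 or 3")
```

Stage 2 relaxes only the clock period bound, which lets the search pass through slightly slower circuits before stage 3 tightens the bound again.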
4.6 Experimental Results 
The experiments were performed on the ISCAS89 benchmark suite. The
benchmark circuits were first mapped with a library consisting of common
logic gates (inverters, AND gates, OR gates, NAND gates, NOR gates). The
maximum number of fanins for a gate is 2. The initial placements of the
circuits were generated by the Capo 10.5 placer. The program was
implemented in C, and the retiming MILP constraints were solved by
ILOG Solver 6.0. Experiments were run on a Sun Blade 2500 (2 x 1.6GHz
US-IIIi) machine with 2GB RAM running Solaris 8.
The number of iterations for each stage is set to 10 (N = 30 in total). For
stages 1 and 2, the predefined number of perturbations is 5 each; for stage 3,
perturbation is performed at most 15 times (M = 25 in total).
Table 4.1 shows the reduction of FFs of our approach compared to the initial
(original) pure retiming [5]. On average, our optimization scheme achieves
an 18.7% cut in the total number of FFs. Rewiring can be done in polynomial
time, and experiments show that on average 98.2% of the time is taken by
solving the retiming constraints, where the simplex algorithm is used. This
can be improved by a more efficient implementation of retiming; however, it
does not affect our optimization scheme.
The final clock period is calculated on a real placement based on the same
delay estimation model. To examine the effect on the clock period, we use the
average and the best clock period obtained from 10 final placements.
Table 4.2 shows that, over the 10 placement results, pure retiming achieves an
average cut of 7.97% on the initial clock period, while retiming combined with
rewiring achieves a comparable average of 7.96%. Note that the clock
period reduction can be negative in this new flow because the new FF
topology after the rewired retiming can be quite different from that of the
original pure retiming.
Table 4.3 shows the best clock period from the 10 final placements.
Compared to the best reduction of 18.01% for pure retiming, our optimization
scheme achieves a reduction of 17.74%.
Circuit   # of FF           # of FF              Reduction   (header   Retiming
          (pure retiming)   (rewired retiming)   (%)         lost)     time (%)
S298      133               119                  10.5        2         99.1
S344      59                53                   10.2        5         99.3
S349      57                54                   5.26        6         99.5
S382      81                77                   4.94        22        99.2
S444      93                75                   19.4        5         99.4
S510      117               103                  12          4         98.5
S526      181               166                  8.29        107       99.6
S820      207               82                   60.4        53        98.1
S832      67                17                   74.6        76        96.2
S953      96                83                   13.5        9         98.4
S1238     32                31                   3.13        27        97.3
S1488     484               387                  20          19        97.1
S1494     417               415                  0.48        25        95.8
average                                          18.7                  98.2
Table 4.1 Experimental Result of FF Reduction
Circuit   Clk of initial   Clk of pure   Reduction from   Clk of rewired   Reduction from
          circuit          retiming      initial (%)      retiming         initial (%)
S298      41               30.5          25.6             29.1             29.0
S344      55               50.8          7.64             50               9.09
S349      54               49.3          8.7              51.7             4.25
S382      48               39.3          18.1             40               16.6
S444      52               48.4          6.92             47.3             9.03
S510      61               65.1          -6.7             64               -4.9
S526      49               42.4          13.4             45.2             7.75
S820      77               75.1          2.46             71.7             6.88
S832      81               71.7          11.5             76               6.17
S953      88               77.3          12.16            79.2             10
S1238     172              148.3         13.8             150              12.7
S1488     98               97.5          0.51             100.3            -2.34
S1494     95               104.9         -10.4            95.8             -0.84
average                                  7.97                              7.96
Table 4.2 Experimental Result of Average Clock Period
Circuit   Clk of initial   Clk of pure   Reduction from   Clk of rewired   Reduction from
          circuit          retiming      initial (%)      retiming         initial (%)
S298      41               27            34.1             26               36.6
S344      55               42            23.6             44               20
S349      54               42            22.2             43               20.4
S382      48               34            29.2             33               31.2
S444      52               43            17.3             41               21.2
S510      61               58            4.9              60               1.63
S526      49               38            22.4             41               16.3
S820      77               67            12.9             65               15.5
S832      81               68            16.04            71               12.3
S953      88               70            20.4             71               19.3
S1238     172              138           19.7             138              19.7
S1488     98               92            6.12             89               9.18
S1494     95               90            5.26             88               7.36
average                                  18.01                             17.74
Table 4.3 Experimental Result of Best Clock Period
The experimental data demonstrate that, though quite simple, this rewired
retiming flow can consistently cut a further 18.7% of the FFs used in the
retimed circuit without materially compromising the delay improvements
produced by the original pure retiming, a result that opens a new dimension
for retiming applications.
• End of chapter. 
Chapter 5 
Power Analysis for Rewired Retiming 
Low power is a crucial issue nowadays, and we are therefore keen to find out
the impact of our proposed flow on power. Though there are previous
contributions in this field, as discussed in Chapter 3, their major concern is
the reduction of switching power itself rather than cutting power by reducing
flip-flops. Their cost functions and power estimation are relatively
complicated and thus may not be accurate enough for practical use. Hence, in
this chapter, we analyze how the rewired retiming technique, as a relatively
simple power-cutting approach, affects the dynamic power of the circuit.
5.1 Power Model 
After the final placement, we have a physical design with a shorter clock
period and a minimized number of FFs. The circuit is then analyzed by Power
Compiler to estimate its power dissipation, which is compared with that of the
circuit produced by pure retiming. We adopt Power Compiler as it is a
well-acknowledged commercial tool widely used in industry for power analysis
and design optimization. Its power analysis engine provides a detailed
gate-level power report by capturing switching activity, mapping the design to
gates, and annotating the design.
The power dissipated in a circuit falls into two broad categories: static
power and dynamic power. Static power is the power dissipated by a gate
when it is not switching, that is, when it is inactive or static. Static power
is often called leakage power. Dynamic power is the power dissipated when the
circuit is active, and is composed of switching power and internal power.
The switching power of a driving cell is the power dissipated by the
charging and discharging of the load capacitance at the cell outputs. The total
load capacitance at the output of a driving cell is the sum of the net and gate
capacitances on the driving output. Internal power is the power dissipated
within the boundary of a cell. During signal switching, a circuit dissipates
internal power by the charging or discharging of any existing capacitances
internal to the cell. Internal power includes power dissipated by a
momentary short circuit between the P and N transistors of a gate, called
short-circuit power.
Power Compiler uses a zero-delay model for internal simulation and for
propagation of switching activity during power analysis. This zero-delay
model assumes that the signal propagates instantly through a gate with no
elapsed time, and the propagation of switching activity makes certain
statistical assumptions.
In our experiment, we adopt the power model of Power Compiler, as
well as its default switching activity estimates at primary inputs:
P1 = 0.5 (the signal is in the logic 1 state 50 percent of the time), where
P1 is the probability that the input is at logic state 1.
TR = 0.5 fclk (the signal switches once every 2 clock cycles), where fclk is
the frequency of the input's related clock in the design.
5.2 Experimental Results 
In our experiment, the VTVT Standard Cell library (developed by the
VTVT Group, Virginia Tech), which targets the TSMC 0.18um, 1.8V CMOS
process, is adopted.
The leakage power, switching power, internal power and total dynamic
power (sum of switching power and internal power) are estimated by Power
Compiler.
Circuit   Cell Internal   Net Switching   Total Dynamic   Leakage
          (uW)            (uW)            (uW)            (nW)
s298      240.8           56.5            297.3           7.57
s344      189.2           75.7            264.9           5.7
s349      257.3           100.7           357.9           5.7
s444      474             197.1           671.2           10.6
s526      476.3           163.8           640.2           12.9
s820      406             370.1           776.2           10.1
s832      436.5           439.8           876.3           11.3
s1238     1400            1561            2961            17.89
s382      538.5           171.8           710.4           12.1
s386      153.2           134.9           288.1
s953      761.6           497             1258.7          14.4
s1488     506.1           621.6           1128            16.2
s1494     523.6           653.6           1177.3          16.4
Table 5.4 Power Estimation of Pure Retiming Circuits
Circuit   Cell Internal   Net Switching   Total Dynamic   Leakage
          (uW)            (uW)            (uW)            (nW)
s298      201.2           47.5            248.7           6.4
s344      223.7           72.9            296.6           5.4
s349      231.6           76.7            308.3           5.6
s444      509.2           157.9           667.1           11.2
s526      384.9           124.4           509.4           12.4
s820      407.9           395             802.9           10.6
s832      366.2           303.4           669.6           11.12
s1238     1220            1337            2560            16.3
s382      423.2           158.6           581.8           9.7
s386      153.1           134.9           288.1           5
s953      735.8           485.8           1221            14.2
s1488     500             591             1091            16.9
s1494     487.3           571.4           1059            16.4
Table 5.5 Power Estimation of Rewired Retiming Circuits
Circuit   Pure Retiming   Rewired Retiming   Reduction
          (uW)            (uW)               (%)
s298      297.3           248.7              16.35
s344      264.9           296.6              -12
s349      357.9           308.3              13.86
s444      671.2           667.1              0.611
s526      640.2           509.4              20.43
s820      776.2           802.9              -3.44
s832      876.3           669.6              23.59
s1238     2961            2560               13.54
s382      710.4           581.8              18.1
s386      288.1           288.1              0
s953      1258.7          1221               2.995
s1488     1128            1091               3.28
s1494     1177.3          1059               10.05
Average                                      8.26
Table 5.6 Total Dynamic Power Reduction of Rewired Retiming Circuits
Table 5.4 and Table 5.5 show the power estimation results of the circuits
for pure retiming and for rewired retiming respectively. The results
generated by Power Compiler include cell internal power, net switching
power, total dynamic power (sum of cell internal and net switching power)
and the leakage power.
Table 5.6 shows the reduction in total dynamic power achieved by our
optimization scheme. For the benchmarks used in the experiment, our
approach achieves an average power cut of 8.26% compared to the
results produced by pure retiming.
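As a quick cross-check, the reduction column of Table 5.6 can be recomputed from the total dynamic power figures reported in Tables 5.4 and 5.5. The dictionary names below are illustrative and not part of the original tool flow.

```python
# Total dynamic power in uW, copied from Tables 5.4 (pure) and 5.5 (rewired).
pure = {
    "s298": 297.3, "s344": 264.9, "s349": 357.9, "s444": 671.2,
    "s526": 640.2, "s820": 776.2, "s832": 876.3, "s1238": 2961,
    "s382": 710.4, "s386": 288.1, "s953": 1258.7, "s1488": 1128,
    "s1494": 1177.3,
}
rewired = {
    "s298": 248.7, "s344": 296.6, "s349": 308.3, "s444": 667.1,
    "s526": 509.4, "s820": 802.9, "s832": 669.6, "s1238": 2560,
    "s382": 581.8, "s386": 288.1, "s953": 1221, "s1488": 1091,
    "s1494": 1059,
}

# Per-circuit reduction in percent, relative to pure retiming.
reduction = {
    name: 100.0 * (pure[name] - rewired[name]) / pure[name] for name in pure
}
average = sum(reduction.values()) / len(reduction)
print(f"average reduction: {average:.2f}%")
```

The recomputed average agrees with the 8.26% reported in Table 5.6; negative entries (s344, s820) are circuits where the rewired circuit consumes more dynamic power than the purely retimed one.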
Overall, in our experiments the rewired retiming scheme achieves a reduction
of 18.7% in the number of FFs together with a power cut of 8.26%, while the
improved clock period is not sacrificed.
• End of chapter. 
Chapter 6 
Conclusion 
This thesis has studied the retiming and rewiring techniques and
proposed an optimization scheme combining the two techniques to improve
the interconnect delay retiming in terms of flip-flop reduction and power
reduction. Placement, which is closely related to delay estimation,
post-placement delay estimation and power analysis were also studied for this
research.
A simple yet very effective scheme integrating rewiring and retiming
is developed to minimize the number of FFs without sacrificing the
original delay reduction. The whole flow is placement-aware and works
tightly with a real placement, which makes the retiming result more
practically realizable. The real clock period is calculated from the final
placement. Experimental results show a remarkable reduction in the number of
FFs (18.7%), which can be considered a free gain because the retimed clock
period is kept unchanged.
Besides a significant area cut, this extra FF reduction produced by our
approach also leads to a desirable saving in power (8.26%) and eases
clock tree generation because of less clock skew.
This work is a new attempt to use a logic re-synthesis technique to further
improve the retiming technique with real placement-based interconnect
delays. Our approach not only reduces the area but also keeps retiming close
to a real placement, and cuts down the adverse effect of the increased FFs
commonly introduced by a conventional retiming process. In particular, this
remarkable flip-flop and power reduction is achieved without sacrificing any
of the original retimed clock period.
• End of chapter. 

