Partitioning a given circuit targeting multiple Fpgas by Cherussery, Girish Narayanan
UNLV Retrospective Theses & Dissertations 
1-1-2002 
Partitioning a given circuit targeting multiple Fpgas 
Girish Narayanan Cherussery 
University of Nevada, Las Vegas 
Repository Citation 
Cherussery, Girish Narayanan, "Partitioning a given circuit targeting multiple Fpgas" (2002). UNLV 
Retrospective Theses & Dissertations. 1450. 
https://digitalscholarship.unlv.edu/rtds/1450 
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
PARTITIONING A GIVEN CIRCUIT TARGETING MULTIPLE FPGAS
by
Girish Cherussery
Bachelor of Engineering
Madras University
2000
A thesis submitted in partial fulfillment of the
requirements for the
Master of Science Degree 
Department of Electrical and Computer Engineering 
Howard R. Hughes College of Engineering
Graduate College 
University of Nevada Las Vegas 
December 2002
UMI Number: 1413603
UMI Microform 1413603
Copyright 2003 by ProQuest Information and Learning Company.
All rights reserved. This microform edition is protected against
unauthorized copying under Title 17, United States Code.
ProQuest Information and Learning Company
300 North Zeeb Road
P.O. Box 1346
Ann Arbor, MI 48106-1346
UNLV Thesis Approval
The Graduate College
University of Nevada, Las Vegas
November 13, 2002
The Thesis prepared by
Girish Cherussery
Entitled
Partitioning a Given Circuit Targeting Multiple FPGAs
is approved in partial fulfillment of the requirements for the degree of
Masters of Science
Examination Committee Chair
Dean of the Graduate College
Examination Committee Member
Examination Committee Member
Graduate College Faculty Representative
ABSTRACT
Partitioning a Given Circuit Targeting Multiple FPGAs
by
Girish Cherussery
Dr. Henry Selvaraj, Examination Committee Chair
Professor of Electrical and Computer Engineering
University of Nevada, Las Vegas
FPGAs have moved from being a method of implementing random logic on circuit
boards to being a flexible implementation medium for many types of systems. Logic
simulation tasks in which ASIC designs are simulated on FPGA-based structures have
greatly increased simulation speeds. To take full advantage of the fact that
designs implemented in hardware produce simulation results more quickly than designs
implemented in software, FPGAs with a large area and a large number of input-output
ports are required. But as designs grow in size, it is becoming very difficult to
build an FPGA that has enough input-outputs and enough CLBs to handle the
logic. Hence, partitioning the design to fit into multiple FPGAs is considered an
efficient solution. An efficient logic-partitioning tool should minimize the total number of
FPGAs and the interconnections between them and consequently maximize the utilization
of each FPGA.
Our approach to the problem of partitioning the design (represented as a hypergraph)
across multiple FPGAs uses a bi-level approach, initially clustering the design and then
applying the bipartitioning technique iteratively. Each partition generated by the iterative
bipartitioning technique should meet the constraints imposed by the FPGA's input-output count
and number of CLBs. The traditional FM partitioning can be applied to partition the
circuit into multiple FPGAs. FM partitioning aims to minimize the number of
interconnections but fails to group the nodes with maximum interconnections into one
partition. Thus the FM algorithm looks at the partitioning problem from a global viewpoint,
abandoning the details. The proposed algorithm adds another level of optimization to the
partitioning heuristic. By clustering the nodes that are connected very closely in a netlist
before partitioning, a local optimization property is added to the FM algorithm. This
clustered circuit is then partitioned to implement the design in multiple FPGAs.
Bipartitioning using the Fiduccia-Mattheyses algorithm is applied
such that at least one partition satisfies the constraints of one FPGA.
This way at least one of the partitions thus created fits into one FPGA. The algorithm has
been tested on standard benchmarks and the results show an improvement over
existing algorithms.
TABLE OF CONTENTS
ABSTRACT
TABLE OF CONTENTS
LIST OF FIGURES
LIST OF TABLES
ACKNOWLEDGEMENTS
CHAPTER 1 INTRODUCTION
CHAPTER 2 PROBLEM DEFINITION AND PREVIOUS WORK
2.1 Representation of VLSI circuits
CHAPTER 3 BI-LEVEL PARTITIONING
3.1 Clustering
3.2 Bipartitioning
3.2.1 Pseudo code for the F-M algorithm
3.3 Multi-partitioning
3.3.1 Pseudo code for the proposed algorithm
CHAPTER 4 EXPERIMENTAL RESULTS AND TABULATIONS
4.1 Clustering Results
4.2 Comparison with other clustering methodologies
4.3 Bipartitioning Results
4.4 Bipartitioning at different ratios
4.5 Comparison with previous results
4.6 Mapping onto Spartan Device
CHAPTER 5 CONCLUSION AND RECOMMENDATIONS
BIBLIOGRAPHY
APPENDIX
VITA
LIST OF FIGURES
Figure 1.1 Design flow for an FPGA based design
Figure 1.2 Hardware Embedded Simulation environments
Figure 1.3 Hierarchy in the design
Figure 1.4 Simulation of design with 3 modules in hardware and 2 modules in software
Figure 3.1 Circuit with clusters after random partitioning
Figure 3.2 Bi-Partitioned Circuit after applying F-M algorithm
Figure 3.3 Bi-Partitioned circuit after clustering and applying F-M algorithm
Figure 3.4 Nodes A, B, C such that A and B are connected by a net as are B and C
Figure 3.5 The circuit after A and B are clustered
Figure 3.6 Equation describing the connectivity metric
Figure 3.7 New connectivity metric
Figure 3.8 A simple netlist
Figure 3.9 Netlist after nodes C1 and C5 are clustered
Figure 3.10 Bucket list structure
Figure 3.11 Critical nets
Figure 3.12 Before moving the base cell
Figure 3.13 After moving the base cell
Figure 3.14 Initial Circuit
Figure 3.15 Circuit after first bipartitioning
Figure 3.16 Circuit after bipartitioning twice
Figure 3.17 Circuit after final bipartitioning
Figure 4.1 Comparison of different clustering criteria
Figure 4.2 Variation of number of cuts with the size of the circuit
Figure 4.3 Variation of number of cuts with size of the circuit
Figure 4.4 Variation of cutsize with different ratios on F-M algorithm
Figure 4.5 Cutsize for different ratios as applied to modified F-M algorithm
Figure 4.6 Variation of Cutsize on MCNC 99 Benchmarks with respect to different ratios
Figure 4.7 Comparison of cutsize between F-M and modified F-M algorithms at different cut ratio
Figure 4.8 Cutsize of different algorithms
Figure 4.9 Comparison of Mod F-M with existing algorithms
Figure 4.10 Comparison of best results with existing algorithms
Figure 4.11 Comparison of time taken for partitioning
Figure 4.12 CLB Utilization for different partitions
LIST OF TABLES
Table 4.1 Characteristics of the netlist after clustering
Table 4.2 Comparison with other clustering criterion
Table 4.3 Comparison of time taken for clustering for different algorithms
Table 4.4 Clustering on MCNC 99 benchmark circuits
Table 4.5 Results of F-M partitioning
Table 4.6 Comparison of F-M algorithm with our algorithm
Table 4.7 F-M algorithm performed at different ratios
Table 4.8 Variation on cutsize with ratio when applied to modified F-M algorithm
Table 4.9 Cutsize of various MCNC 99 Benchmark circuits with respect to different ratios
Table 4.10 Cutsize for different ratios when clustering is performed using bandwidth criterion
Table 4.11 Previous results
Table 4.12 Comparison with existing algorithms
Table 4.13 Comparison of best results from Mod F-M algorithm with existing algorithms
Table 4.14 Comparison of time taken for partitioning
Table 4.15 Utilization of CLBs
ACKNOWLEDGEMENTS
The comments, reviews and support of many people have helped the development of this
thesis. I would like to thank all of them who have been instrumental in this work.
First and foremost I would like to thank Dr. Henry Selvaraj for his guidance on my
research and study at the University of Nevada, Las Vegas. Without his valuable advice and
encouragement, I could not have reached this stage in my academic pursuits.
I would like to thank Dr. Venkatesan Muthukumar, who has constantly
motivated me to implement new ideas. He gave many valuable suggestions on the
implementation of our algorithm and the research direction of the thesis. It would be
unfair on my part if I did not thank my colleagues Shyam Subramanian,
Balasubramaninan Murugan, Sridhar Veeravalli and Bharath Radhakrishnan for their
help in writing the software.
I would like to thank my parents for all that they have done to help me reach where I
am. Without their hard work and support, I would not have had the opportunity to study at
the University of Nevada, Las Vegas.
CHAPTER 1
INTRODUCTION
Very Large Scale Integrated (VLSI) circuits have made large designs possible at low cost.
Advances in semiconductor technology have increased the performance and
reliability of chips while reducing their cost, power consumption and size. Application
Specific Integrated Circuits (ASICs), however, take more time to design and
manufacture. Also, the design cannot be modified once the chip is fabricated. In the
mid 1980s a new technology for implementing digital logic was introduced, the
field-programmable gate array (FPGA). This meant that designs could now be downloaded
onto an FPGA and used as an application-specific hardware device. The re-programmable
feature of FPGAs makes it easy to add new modules or modify the current design,
thus easing the process of upgrading the current hardware device. FPGAs can be
manufactured at a far cheaper cost than application-specific hardware devices. The
circuits mapped onto FPGAs need not be standard hardware equations. They can vary
from ordinary arithmetic operations to complex discrete cosine transforms. However,
FPGAs cannot match the speed of ASICs. Also, since FPGAs are prefabricated, they
have a limited number of input and output (IO) pins. This restricts the usage of a single FPGA to
a limited number of designs. As designs grow bigger they cannot fit into one FPGA.
They need to be distributed across multiple FPGAs. This means that the design should be
partitioned across multiple FPGA devices such that each part meets the constraints imposed by
its FPGA device.
One of the most important applications of multi-FPGA systems is logic emulation and embedded
simulation. Increased design complexity and size have caused the time required for
simulation to increase drastically. Increased design complexity also requires improved
design automation tools.
The early methods of design were tedious and time consuming and these have given way
to Computer-Aided Design (CAD) for design entry, synthesis and implementation
processes. The use of CAD tools has improved the productivity of the designer and the
designer now relies heavily on software for every aspect of the design. CAD tools are
used in design entry, simulation of the hardware description language, synthesis of the
hardware description language to register transfer level (RTL) code, placement, routing
of the design and also in timing simulation.
The extensive use of CAD tools has created the need for efficient CAD tools that
can synthesize, place and route the design very efficiently. This requires optimization at
every step of the design process. Objectives of optimization include the size of the circuit,
delays, and power dissipation. Figure 1.1 shows the design flow for an FPGA.
[Figure 1.1: flow chart; legible labels include design input (VHDL, Verilog, schematics), RTL code, gate-level netlist, mapping, FPGA, and functional, gate-level and timing simulation]
Figure 1.1 Design flow for an FPGA based design
As the size and complexity of designs increased, the time required for simulation
increased considerably. Weeks and days of simulation time can be reduced to hours and
minutes using FPGA systems for simulation. One among the many applications of
FPGAs is Hardware Embedded Simulation (HES).
The HES environment consists of a hardware board with a single FPGA or multiple
FPGAs on it. It also includes a software simulator, which is linked to the hardware
board through the Peripheral Component Interconnect (PCI) bus. This environment ensures
that parts of the design in the software communicate efficiently with the hardware board.
Figure 1.2 shows the HES environment. Figure 1.3 shows the hierarchy in the design.
[Figure 1.2: software simulator linked to a hardware board with an FPGA on it]
Figure 1.2 Hardware Embedded Simulation environments
[Figure 1.3: design hierarchy below the top level]
Figure 1.3 Hierarchy in the design
The design, which is usually organized in modules, can be downloaded into an FPGA as and
when a module is completed. When a module is to be verified, it is placed in the software
domain and simulated. Those parts of the design that have already been verified go into
the hardware. Hardware-software co-simulation helps in accelerating the design
process by 100 to 1000 times. Figure 1.4 shows the design mapped onto the HES board.
[Figure 1.4: legible labels are Top Level and Software Simulator]
Figure 1.4 Simulation of design with 3 modules in hardware and 2 modules in software
As a design becomes larger it can no longer fit within the constraints of one FPGA. It
needs to be partitioned across multiple FPGAs. Partitioning of the design can be done at
various levels in the design flow. Partitioning at a more abstract level is called high-level
partitioning and partitioning done after synthesis is called logic partitioning. Partitioning can
also be done after implementation, but then various aspects of timing and delays need to
be considered. In this thesis we focus on logic partitioning.
Even though multi-FPGA systems have great potential for high-performance
solutions, several problems hold current systems back from achieving their
full performance. FPGAs tend to have very few IO connections for their logic capacity,
which limits the usage of their logic cells. We partition the given netlist across
multiple FPGAs based on various constraints of the FPGAs. These constraints could be
the number of logic cells, also called configurable logic blocks (CLBs), the number of IOs,
the number of clocks in the design, etc. The advantage of performing logic partitioning is that
it automates the process of partitioning for the user. Partitioning after implementation can
cause delay problems that would then need to be considered.
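The constraint check described above can be illustrated with a short Python sketch. It rests on simplifying assumptions that are not taken from the thesis: every node occupies one CLB, every net that leaves a partition consumes one IO pin, and primary input and output pads are ignored. The names `FpgaLimits`, `io_demand` and `fits`, and the device limits used in the example, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FpgaLimits:
    """Hypothetical device limits (not taken from any real data sheet)."""
    max_clbs: int  # logic-cell (CLB) capacity of the device
    max_ios: int   # number of usable IO pins


def io_demand(part, nets):
    """Count nets with at least one node inside `part` and one outside it."""
    part = set(part)
    return sum(1 for net in nets if part & set(net) and set(net) - part)


def fits(part, nets, limits):
    """True when the partition respects both the CLB and the IO limit.

    Assumption: one node uses one CLB and one crossing net uses one IO pin.
    """
    return (len(part) <= limits.max_clbs
            and io_demand(part, nets) <= limits.max_ios)


# A toy netlist: each tuple is one net listing the nodes it connects.
nets = [("a", "b"), ("b", "c", "d"), ("d", "e"), ("e", "f")]
limits = FpgaLimits(max_clbs=3, max_ios=1)

print(fits({"a", "b", "c"}, nets, limits))  # True: 3 CLBs, 1 crossing net
print(fits({"a", "b", "e"}, nets, limits))  # False: 3 nets cross the boundary
```

The second grouping has the same number of nodes as the first but fails on IO count alone, which is the situation in which logic cells of a device end up underused.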
This work has been motivated by the need to provide designers with tools that allow
them to simulate large designs that cannot fit into a single FPGA. We consider
partitioning a circuit into smaller subcircuits such that each subcircuit satisfies the
constraints of an FPGA. In addition to obtaining a feasible partition, our algorithm aims
to reduce the time required for partitioning and also focuses on incremental prototyping.
Incremental prototyping is a method by which new modules can be added to the design by
synthesizing and implementing the module that is being added.
The thesis begins with an introduction to the underlying problem in partitioning and
the problem definition in Chapter 2, which also includes a survey of current methodologies
used in partitioning. Chapter 3 describes our algorithm in more detail. Chapter 4 provides
the results of this algorithm when tested on various benchmarks. We conclude this thesis
with some overall results and proposed future work in Chapter 5.
CHAPTER 2
PROBLEM DEFINITION AND PREVIOUS WORK
Partitioning is a traditional method for solving large problems.
The large problem is broken down into smaller subproblems. These smaller
problems are then solved one at a time. The solutions to these smaller problems are then
combined to obtain the solution for the larger problem. Partitioning methodology is used
for solving design automation problems that occur at various stages of the Integrated
Circuit (IC) design process. As the size and complexity of Very Large Scale
Integrated (VLSI) circuits increase dramatically, it becomes absolutely necessary to
solve these problems by using partitioning techniques. In this case the VLSI circuit is
broken down into smaller circuits. Circuits can be represented in graph or hypergraph
format. Therefore, hypergraph or graph partitioning solves the problem of dividing large
circuits into smaller subcircuits.
2.1 Representation of VLSI circuits
VLSI circuits are described as netlists. In the past, a netlist was described in graph
format. A graph-based representation of a circuit can be written as $G = (V, E)$, where $V$ is
a set of nodes (vertices) representing the fundamental components such as gates,
flip-flops, input pads and output pads, and $E$ is a set of edges representing the nets within the
circuit or network. These nets connect different nodes in the circuit. Each edge in a graph
representation connects exactly two nodes. If $e_i$ represents an edge that connects exactly
two nodes $v_j$ and $v_k$ such that $e_i \in E$ and $v_j, v_k \in V$, then each edge can be written
as $e_i = (v_j, v_k)$. Partitioning the graph for VLSI design problems separates the nodes into
disjoint sets based on certain constraints. An optimizing function is formed for the nodes
and edges based on the constraints.
One classic partitioning problem is to reduce the number of cuts between partitions.
This is otherwise called the minimum cut (mincut) problem. The entire set of nodes $V$ is
divided into two disjoint parts $P_1$ and $P_2$, such that the number of edges that connect the
two parts is minimized. This can be represented as follows:
Minimize $|e(P_1, P_2)|$, where $e(P_1, P_2) = \{(a, b) \in E \mid a \in P_1 \text{ and } b \in P_2\}$
The set $e(P_1, P_2)$ is referred to as the cut set and the number of edges in the cut set as
the cut value.
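As a small illustration of the definition above, the following Python sketch computes the cut set and cut value of a bipartition. The function name `cut_set` and the example graph are assumptions made for illustration. Edges are treated as undirected, so an edge counts as cut regardless of which endpoint is listed first.

```python
def cut_set(edges, p1, p2):
    """Return the edges with one endpoint in p1 and the other in p2."""
    p1, p2 = set(p1), set(p2)
    return {(a, b) for a, b in edges
            if (a in p1 and b in p2) or (a in p2 and b in p1)}


# A four-node example graph: a square 1-2-3-4 with one diagonal (2, 4).
edges = [(1, 2), (2, 3), (3, 4), (4, 1), (2, 4)]

balanced = cut_set(edges, {1, 2}, {3, 4})
lopsided = cut_set(edges, {1}, {2, 3, 4})

print(len(balanced))  # 3: edges (2, 3), (4, 1) and (2, 4) cross
print(len(lopsided))  # 2: only the two edges at node 1 cross
```

In this example the smaller cut value belongs to the lopsided split, which hints at why a pure mincut objective needs a size constraint to be useful for partitioning.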
Ford and Fulkerson [1] converted the minimum cut problem into a maximum flow
problem. In their paper they proved that computing the maximum
flow gives the minimum cut. In this algorithm the constraints are placed only on the
number of edges crossing the two partitions and there is no limitation on the number of
nodes in each partition. This means that a circuit with 110 nodes could be partitioned
with 100 nodes in one partition and 10 nodes in another partition. The problem of finding
the minimum cut with a restriction on the number of nodes in each partition has been
proved to be NP-complete [20]. Thus it is unlikely that there is an efficient algorithm that
finds an exact solution to this problem. The above mentioned algorithm has a time complexity of
[...], where $|V|$ is the total number of nodes in graph $G$.
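The lack of a size constraint in the flow formulation can be seen on a toy graph. The sketch below is an assumption-laden illustration, not the procedure from [1] as the thesis describes it: it uses the breadth-first (Edmonds-Karp) variant of augmenting-path max flow, gives every edge unit capacity, and requires a source `s` and sink `t` to be chosen by hand. The function name `min_cut` and the example graph are hypothetical.

```python
from collections import deque
from itertools import combinations


def min_cut(n, edges, s, t):
    """Return (cut value, source-side node set) for an undirected graph
    with unit edge capacities, using BFS augmenting paths (Edmonds-Karp)."""
    cap = [[0] * n for _ in range(n)]
    for a, b in edges:
        cap[a][b] += 1
        cap[b][a] += 1
    flow = 0
    while True:
        # Breadth-first search for an augmenting path in the residual graph.
        parent = [-1] * n
        parent[s] = s
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in range(n):
                if cap[u][v] > 0 and parent[v] == -1:
                    parent[v] = u
                    queue.append(v)
        if parent[t] == -1:
            # No path left: the nodes still reachable from s form one side.
            return flow, {v for v in range(n) if parent[v] != -1}
        # Find the bottleneck capacity along the path, then push flow.
        bottleneck, v = float("inf"), t
        while v != s:
            bottleneck = min(bottleneck, cap[parent[v]][v])
            v = parent[v]
        v = t
        while v != s:
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u
        flow += bottleneck


# Five tightly connected nodes (0..4) plus node 5 hanging off node 4.
edges = list(combinations(range(5), 2)) + [(4, 5)]
value, side = min_cut(6, edges, s=0, t=5)
print(value, sorted(side))  # 1 [0, 1, 2, 3, 4]
```

The minimum cut isolates a single node from the other five, which mirrors the 100-versus-10 split mentioned above: the cut is optimal in edge count but useless when each part must fit a device of limited size.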
The partitioning heuristics can be categorized into four major groups, namely
move-based approaches, geometric representations, combinatorial formulations and clustering
approaches. Move-based algorithms iteratively explore the space of feasible solutions
according to a neighborhood operator. These methods include iterative exchange,
simulated annealing and greedy approaches. Geometric algorithms generally embed the circuit
into some kind of geometry, e.g., a 1-dimensional or multi-dimensional vector space.
Combinatorial algorithms transform the problem into an optimization problem like
network flow, which can be solved using mathematical programming. Clustering-based
algorithms group the nodes in the netlist to form smaller subcircuits called clusters.
The approach to the problem of partitioning in this thesis is based on move-based
algorithms. The heuristic algorithm proposed by Kernighan and Lin (K-L) [2] in 1970 for
two-way partitioning was one of the foremost move-based approaches. In this
algorithm nodes are swapped pairwise between the two partitions. The algorithm
proceeds in a series of passes such that in every pass a node moves exactly once, either
from partition $P_1$ to partition $P_2$ or vice versa. After every move the nodes are locked
into their new partition. This is done to ensure that the algorithm does not get caught in
an infinite loop. The nodes to be swapped are chosen on the basis of the gains of each
unlocked node. If $v_i$ and $v_j$ are two nodes in two partitions $P_1$, $P_2$ such that $v_i \in P_1$ and
$v_j \in P_2$, the gain can be defined as:
$\mathrm{gain}(v_i, v_j) = \sum_{v_i \in P_1,\ v_j \in P_2} (\ldots)$
where the two terms of the summand [illegible in the source] are the numbers of connections
between $v_i$ and other nodes in that
partition. The swapping of the nodes is carried out until all nodes are swapped. The algorithm
stops when it reaches a local minimum. The time complexity of the algorithm is
$O(n^2 \log n)$ where $n$ is the number of nodes in the circuit. In a graph representation, it
is assumed that each edge connects only two nodes. But in VLSI circuits each connection
from a node may connect to more than one node. Therefore the graph representation of a
circuit may not be the best method of describing a circuit. One possible solution to this
problem is to assign weights to the edges.
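The pairwise swap gain used by K-L can be sketched as follows. This uses the common textbook form of the gain, $D(a) + D(b) - 2c_{ab}$, where $D(v)$ is the number of external connections of $v$ minus its internal connections and $c_{ab}$ is 1 when $a$ and $b$ are directly connected. The notation may differ from the equation in the thesis, and the helper names and example graph are assumptions for illustration. Edges are unweighted here.

```python
def build_adj(edges):
    """Adjacency sets for an undirected, unweighted graph."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def d_value(node, own_part, adj):
    """External minus internal connection count of `node`."""
    external = sum(1 for nb in adj[node] if nb not in own_part)
    internal = sum(1 for nb in adj[node] if nb in own_part)
    return external - internal


def swap_gain(a, b, p1, p2, adj):
    """Reduction in cut value if a (in p1) and b (in p2) trade places."""
    c_ab = 1 if b in adj[a] else 0
    return d_value(a, p1, adj) + d_value(b, p2, adj) - 2 * c_ab


def cut_value(edges, p1):
    """Number of edges with exactly one endpoint in p1."""
    return sum(1 for a, b in edges if (a in p1) != (b in p1))


edges = [(1, 2), (3, 4), (3, 5), (6, 1), (6, 2), (4, 5)]
adj = build_adj(edges)
p1, p2 = {1, 2, 3}, {4, 5, 6}

gain = swap_gain(3, 6, p1, p2, adj)
before = cut_value(edges, p1)
after = cut_value(edges, (p1 - {3}) | {6})
print(gain, before, after)  # 4 4 0
```

The predicted gain matches the observed drop in cut value, which is the property a K-L pass relies on when it ranks candidate swaps without recomputing the whole cut.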
Representing a circuit as a hypergraph gives a closer description of the actual VLSI
circuit. A hypergraph is usually represented as $H(V, E)$ where $E$ is the set of hyperedges,
each of which is a subset of $V$. As in the case of graph partitioning, the hypergraph-partitioning
problem is also NP-complete.
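A minimal sketch of the difference between the two models, assuming a netlist stored as a dictionary from net name to the set of nodes on that net (the names `hyper_cut` and `clique_cut` and the toy netlist are hypothetical). It compares the hypergraph cut with the cut obtained after expanding each net into a clique of two-pin edges, which is one common way of forcing a netlist into a plain graph.

```python
from itertools import combinations


def hyper_cut(nets, p1):
    """Names of nets (hyperedges) that have pins on both sides of the split."""
    p1 = set(p1)
    return sorted(name for name, pins in nets.items()
                  if pins & p1 and pins - p1)


def clique_cut(nets, p1):
    """Cut value when every net is first expanded into two-pin edges."""
    p1 = set(p1)
    return sum(1 for pins in nets.values()
               for a, b in combinations(sorted(pins), 2)
               if (a in p1) != (b in p1))


nets = {
    "n1": {"a", "b", "c"},
    "n2": {"c", "d"},
    "n3": {"d", "e", "f"},
    "n4": {"a", "f"},
}

print(hyper_cut(nets, {"a"}))   # ['n1', 'n4']: two nets cross the boundary
print(clique_cut(nets, {"a"}))  # 3: net n1 is counted twice as (a, b) and (a, c)
```

The three-pin net `n1` needs only one connection between the two sides, yet the clique expansion charges it twice, which is why the hypergraph model is the closer description of a circuit.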
An extension to the Kernighan and Lin algorithm was published by Fiduccia and
Mattheyses [3]. This algorithm was faster than the K-L algorithm. Their algorithm is
popularly known as the F-M algorithm. The F-M algorithm uses a better data structure, which
maintains the gain calculations. The time complexity of the algorithm is linear, $O(n)$,
where $n$ is the number of nodes in the circuit. Because of its efficiency, linear time
complexity and ease of implementation it is the most widely adopted algorithm.
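Unlike K-L, F-M moves one cell at a time, so the gain is defined per cell. The sketch below computes that gain in the usual way: a net contributes +1 if the cell is the only one of its pins on the cell's side (moving it removes the net from the cut) and -1 if all pins are on the cell's side (moving it adds the net to the cut). The function names and the toy netlist are assumptions for illustration, and the bucket structure that makes gain updates fast is not shown here.

```python
def cell_gain(cell, side, nets):
    """Change in cut size (positive is better) if `cell` switches sides.

    side: dict mapping each cell to 0 or 1; nets: iterable of sets of cells.
    """
    gain = 0
    for pins in nets:
        if cell not in pins:
            continue
        same = sum(1 for p in pins if side[p] == side[cell])
        if same == 1:
            gain += 1  # cell is alone on its side: the move uncuts this net
        if same == len(pins):
            gain -= 1  # net lies entirely on one side: the move cuts it
    return gain


def cut_size(side, nets):
    """Number of nets whose pins are spread over both sides."""
    return sum(1 for pins in nets if len({side[p] for p in pins}) > 1)


nets = [{"a", "b"}, {"a", "c"}, {"a", "d", "e"}]
side = {"a": 0, "b": 1, "c": 1, "d": 0, "e": 0}

g = cell_gain("a", side, nets)
moved = dict(side, a=1)
print(g, cut_size(side, nets), cut_size(moved, nets))  # 1 2 1
```

Moving `a` uncuts two nets and cuts one, so the gain of 1 equals the drop in cut size from 2 to 1.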
Many improvements have been made to the F-M algorithm. Krishnamurthy [4]
reported that the solutions yielded by the F-M algorithm were erratic. He proposed a
modification to the F-M algorithm by adding a look-ahead technique.
This considerably improved the performance of F-M, especially when more
than one node had the same gain value. Sanchis [5] extended
Krishnamurthy's work to partitioning a given circuit into $k$ partitions. Improvements to
the basic F-M algorithm were also made using cell replication [6, 7]. Dutt and Deng [8] in
Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.
11
their paper improved the F-M  algorithm  by proposing a new method to select cells to 
move w ith a view to m oving clusters that straddle the two subsets o f a partition into one 
o f the subsets. Cong et al [9 ] presented a bottom up clustering algorithm  based on 
recursive collapsing o f small cliques in a graph.
The F-M algorithm is one among the many iterative methods of partitioning. Other
iterative algorithms include simulated annealing and tabu search. Simulated annealing
[10] approaches local minima faster than the F-M and K-L algorithms because they can only
make downhill moves. Simulated annealing may lead to an unbalanced partition. Ratio cut
partitioning [11] is another method, which divides the circuit based on a ratio rather
than into equal-sized partitions. In ratio cut partitioning, Wei and Cheng [11] divide
the nodes into two disjoint sets $P_1$ and $P_2$ such that
$$\frac{|e(P_1, P_2)|}{|P_1| \cdot |P_2|}$$
is minimized, where $e(P_1, P_2) = \{(a, b) \in E \mid a \in P_1 \text{ and } b \in P_2\}$.
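The ratio cut objective can be illustrated with a small sketch. This is not Wei and Cheng's algorithm, only an evaluation of the objective for a given split; the function name `ratio_cut` and the six-node path graph are assumptions made for illustration.

```python
def ratio_cut(edges, p1, p2):
    """Return |e(P1, P2)| / (|P1| * |P2|) for a two-way split of a graph."""
    p1, p2 = set(p1), set(p2)
    # Count edges that have one endpoint in each block.
    crossing = sum(
        1 for a, b in edges
        if (a in p1 and b in p2) or (a in p2 and b in p1)
    )
    return crossing / (len(p1) * len(p2))


# Hypothetical example: a path 1-2-3-4-5-6.
edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6)]
balanced = ratio_cut(edges, {1, 2, 3}, {4, 5, 6})   # one crossing edge over 3 * 3
lopsided = ratio_cut(edges, {1}, {2, 3, 4, 5, 6})   # one crossing edge over 1 * 5
```

Both splits cut a single edge, but the denominator rewards the more even split, which is why the ratio can be minimized without forcing equal-sized partitions.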
Among the various methods for partitioning are the methods based on graph spectra
[12]. These use eigenvalues and eigenvectors of matrices obtained from the netlist.
Various clustering methods have been formulated for partitioning [13], [14], [15].
Usually clustering itself is not used for dividing the circuit into different partitions. It is
usually used before a partitioning algorithm such as F-M or simulated annealing.
Clustering ensures that closely connected nodes are not separated into different partitions.
Among the many multi-partition algorithms, the most commonly used for VLSI
circuits is the one proposed by Karypis et al [16]. Hauck et al [17] compared the various
partitioning algorithms and proposed an algorithm that efficiently partitions a circuit
targeting multiple FPGAs. Kuznar et al [6] proposed a multi-partitioning algorithm that
targets multiple FPGAs by iteratively applying the F-M partitioning heuristic with node
replication.
In this thesis a circuit is represented as a hypergraph $H(V, E)$, with $V$ the set of
modules and $E = \{e_1, e_2, \ldots, e_m\}$ the set of nets. Each net is a subset of $V$ containing the
modules that the net connects, and we assume that for each $e \in E$, $|e| \geq 2$. A clustering
$C = \{C_1, C_2, \ldots, C_j\}$ consists of $j$ clusters (subsets of $V$) such that
$C_1 \cup C_2 \cup \ldots \cup C_j = V$. A $k$-way partitioning $\{P_1, P_2, \ldots, P_k\}$ represents the $k$ partitions into
which $C$ is distributed, such that there are minimum interconnections between the
partitions. The set of hyperedges cut by a partition $P$ is given by
$E(P) = \{e \in E \text{ s.t. } 0 < |e \cap P| < |e|\}$, i.e. $e \in E(P)$ if at least one, but not all, of the pins of $e$
are in $P$. The objective is to minimize $|E(P)|$.
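As a minimal sketch of the cut-set definition above, the following computes the nets with at least one, but not all, of their pins inside a block. The net names, module names and the function `cut_nets` are hypothetical and chosen only for illustration.

```python
def cut_nets(nets, block):
    """Return the names of nets e with 0 < |e ∩ P| < |e|.

    nets: dict mapping a net name to the set of modules it connects.
    block: the set of modules P.
    """
    block = set(block)
    return {
        name for name, pins in nets.items()
        if 0 < len(pins & block) < len(pins)
    }


# Hypothetical netlist with four modules and three nets.
nets = {
    "e1": {"a", "b"},
    "e2": {"b", "c", "d"},
    "e3": {"c", "d"},
}
cut = cut_nets(nets, {"a", "b"})
```

Here only `e2` is cut: `e1` lies entirely inside the block and `e3` lies entirely outside it.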
CHAPTER 3
BI-LEVEL PARTITIONING
In this chapter the algorithm used for multi-partitioning is explained. The algorithm
can be divided into two stages and hence it is called bi-level partitioning. As described in
the previous chapter, the objective is to partition the given netlist into multiple partitions
such that the number of cuts between the partitions is minimal. The F-M algorithm is one of
the simplest and most efficient algorithms for bipartitioning. Hence in this thesis recursive
F-M partitioning is used to obtain multiple partitions. Some changes have been made to
this F-M partitioning methodology in order to optimize the results.
One of the common optimizations to the F-M algorithm is clustering. Clustering boosts
the performance of the overall algorithm. After clustering, the clusters are partitioned
using the iterative F-M algorithm. The circuit is partitioned such that one of the partitions
satisfies the constraints of one of the FPGAs. If the remainder of the circuit is larger than
what can be accommodated in one FPGA, we further partition this portion of the circuit
using the F-M algorithm.
3.1 Clustering
Clustering is the process of grouping nodes to form small clusters. Clustering is one
of the most common optimization techniques applied to boost the performance of the F-M
algorithm. When small nodes combine to form clusters, the interconnections between
the nodes become the interconnections between the clusters. Each cluster $C_i$ is comprised
of nodes grouped together on certain heuristic conditions. Clustering before partitioning
improves the quality of partitioning. Since many nodes combine to form a cluster, the
effective number of nodes (clusters) for partitioning decreases sizably. This decreases the
time required for partitioning. Since the number of nodes to be partitioned reduces, the
complexity of the problem reduces considerably.
The F-M algorithm is a global approach to the partitioning problem. It ignores the
microscopic details of the circuit. Ignoring the intrinsic details of the circuit may not lead
to the most efficient partitioning. It has also been shown in [18] that the F-M algorithm
performs better when each node is connected to an average of 6 nets, whereas in a
normal circuit that has not been clustered the average number of nets each node connects to
is in the range of 2.8 to 3.5.
The above-mentioned problem can be clearly explained with an example. Let us
consider a circuit as shown in Figure 3.1. The dotted line represents the cut line. The dark
lines between the nodes represent the nets connecting the different nodes. The F-M algorithm
is explained in detail in the next section. In order to explain the problem in F-M, a brief
description of the algorithm is given below.
The F-M algorithm progresses by moving one node at a time from one partition to the other.
The given circuit is first randomly partitioned. The cell to be moved is decided based on the
gain of each node in the circuit. The node with the highest gain is moved from one partition
to the other. Once a node is moved it is locked to ensure that the same node is not moved again
and again. One pass is completed when all nodes have been moved from their original position.
In the graph shown below in Figure 3.1, the nodes (also called cells) 1, 2, 3 and 4
form a strongly connected subcluster $C_1$. Similarly we have clusters $C_2$, $C_3$ and $C_4$.
Figure 3.1 Circuit with clusters after random partitioning
The circuit that results after random partitioning is shown in Figure 3.1. As can be
seen, the number of nets crossing the cutline is four. Hence the cutsize is four. This is the
circuit before the F-M algorithm is applied.
Figure 3.2 shows the circuit that is obtained after F-M partitioning. As stated above,
the F-M partitioning algorithm progresses by moving one node from one partition to the
other. The movement of a node from one partition to the other is based on the gain, that is,
the reduction in the cutset when the node is moved from one side to the other.
As shown in Figure 3.2, the clusters have been distributed such that the cutsize is
4. This means that there is no net reduction in the cutset. The F-M algorithm was able to see
the close connectivity of nodes within a cluster, but it did not see the closeness of the two
clusters. Let us assume $C_1$ and $C_2$ were in one partition as opposed to being in
different partitions. Similarly, if $C_3$ and $C_4$ were together, the effective cutsize would be
one instead of four. The global approach of the F-M algorithm fails to see this microscopic
detail.
This detail is captured by clustering before partitioning with the F-M algorithm. Clustering
lends this microscopic touch to the otherwise global approach of the F-M algorithm.
Figure 3.2 Bi-partitioned circuit after applying F-M algorithm
Figure 3.3 Bi-partitioned circuit after clustering and applying F-M algorithm
In Figure 3.3 clustering allows clusters $C_1$ and $C_2$ to come into the same partition
and $C_3$ and $C_4$ to come into the same partition, thus allowing the cutsize to reduce by
three.
An intelligent clustering algorithm should focus on local optimization. Several
clustering algorithms have been described. The simplest of all
clustering methods is randomly combining connected nodes. Recursively
combining connected nodes results in forming disjoint pairs [19]. The nodes to
cluster are picked randomly. Random clustering takes a large amount of time.
K-L clustering [14] looks for multiple short paths between nodes, expecting such nodes
to be placed in the same partition. If they were not in the same partition, each of
these paths would have a net in the cutset, which reduces the partition quality. Checking
the K-L connectedness consumes a large amount of time.
In [15] Roy proposed an algorithm called bandwidth clustering. In this method
each net is given a bandwidth between all nodes connected to it. The value of
bandwidth is given by $\frac{1}{K-1}$, where $K$ is the number of nodes on the net. If the
bandwidth between any two nodes is greater than or equal to 1, then they are clustered.
Thus in this case two nodes can be clustered only if a
net directly connects the two nodes. However, transitive clustering is allowed. Transitive
clustering is one where two nodes can be clustered even if they are not directly
connected by a net. For example, let A and B be connected by a net and B and C be
connected by another net. If the bandwidth between A and B and the bandwidth between
B and C are each at least 1, then A and C can be clustered after A and B
are clustered, even though the initial bandwidth between A and C is zero.
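The transitive behaviour can be sketched as follows. This is a simplified illustration, not Roy's implementation: it assumes the bandwidth of a net with K nodes is 1/(K-1), sums it over all nets shared by a pair, and merges every pair whose bandwidth reaches an assumed threshold of 1 using a union-find structure. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def pair_bandwidth(nets):
    """Sum 1/(K-1) over every net shared by each pair of nodes."""
    bw = defaultdict(float)
    for net in nets:
        for u, v in combinations(sorted(net), 2):
            bw[(u, v)] += 1.0 / (len(net) - 1)
    return bw


def bandwidth_clusters(nodes, nets, threshold=1.0):
    """Merge pairs whose bandwidth reaches the threshold (transitively)."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    for (u, v), b in pair_bandwidth(nets).items():
        if b >= threshold:
            parent[find(u)] = find(v)

    groups = defaultdict(set)
    for v in nodes:
        groups[find(v)].add(v)
    return sorted(sorted(g) for g in groups.values())


# Nodes A, B, C: one net joins A and B, another joins B and C.
nets = [{"A", "B"}, {"B", "C"}]
clusters = bandwidth_clusters("ABC", nets)
direct_ac = pair_bandwidth(nets).get(("A", "C"), 0.0)
```

A and C end up in the same cluster through B even though no net connects them directly. Nothing in this sketch bounds the cluster size.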
Figure 3.4 Nodes A, B, C such that A and B are connected by a net, as are B and C
Figure 3.4 shows nodes A and B connected by a net. Similarly a net connects
nodes B and C. If A and B are clustered then the new clustered circuit will look as
shown in Figure 3.5.
Figure 3.5 The circuit after A and B are clustered
After A and B are clustered, the net between B and C will connect the clustered
nodes and C. Now the bandwidth between the clustered nodes and C is at least 1,
and hence they will be clustered. However, there is no restriction on the size of a cluster. If
we have a node connected to clock or reset in a circuit, then it is possible that
this node combines with all other nodes in the circuit. Thus there is a possibility that a
large number of nodes can cluster to form a cluster greater than the size of an FPGA
itself.
In [13] Schuler et al proposed another algorithm that overcomes this problem. This
algorithm is popularly known as connectivity clustering. Here clusters are formed
based on the connectivity between the nodes. Connectivity is described by the
equation shown below:
$$\mathit{connectivity}_{ij} = \frac{\mathit{bandwidth}_{ij}}{\mathit{size}_i \cdot \mathit{size}_j \cdot (\mathit{fanout}_i - \mathit{bandwidth}_{ij}) \cdot (\mathit{fanout}_j - \mathit{bandwidth}_{ij})}$$
Figure 3.6 Equation describing the connectivity metric
The restriction on the size of the clusters is imposed by the size of each
cluster in the denominator. Since we do not want the fanout of any node to be very
high, a factor (fanout - bandwidth) is incorporated in the denominator. Large fanout
results in large cutsize. The reason for the restriction is to prevent a large node from
attracting all its neighbors into a single huge cluster.
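The connectivity metric of Figure 3.6 can be evaluated with a short sketch. The function name and the sample values are assumptions; the guard against a non-positive (fanout - bandwidth) term is also an assumption, since the text does not say how that case is handled.

```python
def connectivity(bandwidth, size_i, size_j, fanout_i, fanout_j):
    """Evaluate the Figure 3.6 connectivity metric for a pair of nodes."""
    slack_i = fanout_i - bandwidth
    slack_j = fanout_j - bandwidth
    if slack_i <= 0 or slack_j <= 0:
        # Assumed guard: the metric is undefined or negative in this case.
        raise ValueError("fanout must exceed bandwidth for both nodes")
    return bandwidth / (size_i * size_j * slack_i * slack_j)


# Hypothetical values showing how each denominator term lowers the metric.
base = connectivity(1.0, 1, 1, 3, 3)          # 1 / (1 * 1 * 2 * 2)
big_cluster = connectivity(1.0, 4, 1, 3, 3)   # larger size_i
high_fanout = connectivity(1.0, 1, 1, 11, 3)  # larger fanout_i
```

Both a larger cluster size and a larger fanout reduce the metric, which is how the denominator discourages huge clusters and high-fanout merges.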
Our methodology of clustering is very similar to connectivity clustering. Small
modifications have been made to the equation shown above. As the cluster size
increased, the fanout seemed to increase. So the fanout of each node was restricted to
10. This number is an arbitrary value chosen keeping in mind the CLBs. This
way the term (fanout - bandwidth) is removed from the equation. The new metric is
shown in the equation below:
$$\mathit{connectivity}_{ij} = \frac{\mathit{bandwidth}_{ij}}{\mathit{size}_i \cdot \mathit{size}_j}$$
Figure 3.7 New connectivity metric
This metric was formed to avoid large cutsize because of large fanout. In the
earlier case, if the bandwidth were high, then the (fanout - bandwidth) factor would be low.
Hence there is a possibility that nodes can still combine. Clustering is performed to
improve the results of the F-M algorithm. Large fanout will increase the cutsize. This is
because when a node with a large fanout is moved between partitions, the net cutsize
will increase or decrease by an amount that is proportional to the fanout. Keeping the
size of the partition in mind, we cannot allow too many nodes to move into one
partition just to decrease the cutsize. This means that the fanout of each cluster has to
be restricted. Hence we make sure that we do not combine nodes with fanout greater
than 10.
The above metric has been tested with many variations. It has been found
that the best results for large circuits are produced with the metric described in Figure
3.7. The results that were obtained for various metrics are tabulated in Chapter 4.
Figure 3.8 shows an example circuit. The clustering process is explained based on
this figure.
Figure 3.8 A simple netlist
The circuit has eight nodes represented as $C_1$, $C_2$, $C_3$, $C_4$, $C_5$, $C_6$, $C_7$ and $C_8$ and
seven nets represented as $n_1$, $n_2$, $n_3$, $n_4$, $n_5$, $n_6$ and $n_7$. More than two nodes can be
connected by one net. $C_2$, $C_5$, $C_6$ and $C_7$ are connected by net $n_3$. The bandwidth for
every net is calculated. The bandwidth for net $n_1$ is given by $\frac{1}{K-1}$, i.e. $\frac{1}{2-1} = 1$,
where $K$ is the number of nodes on the net.
The bandwidth for net $n_3$ is given by $\frac{1}{K-1} = \frac{1}{4-1} = \frac{1}{3}$. Similarly the bandwidths
for all nets are calculated. Initially all nodes have a size equal to one.
For two nodes to cluster, the sum of the bandwidths of all nets connecting the two
nodes should be greater than or equal to 1. As can be seen, nodes $C_1$ and $C_5$ have a net
bandwidth equal to 1 because the bandwidth of net $n_1$ is equal to 1. For two nodes
to cluster, the metric used to evaluate them is the equation given in Figure 3.7. In this case
the two nodes $C_1$ and $C_5$ would cluster because the value of connectivity is equal to 1.
Once clustered, the cell size of the cluster will be equal to the sum of the sizes of nodes
$C_1$ and $C_5$. In this case the size of the cluster is 2. Let us represent this cluster as $U_1$. Once
the nodes are clustered, the nets connecting the nodes get connected to the cluster. Thus nets
$n_3$ and $n_5$ are connected to $U_1$. Since net $n_1$ falls within the cluster, it vanishes as far as the
circuit is concerned. This is equivalent to saying that the net is deleted. The circuit
after clustering $C_1$ and $C_5$ is shown in Figure 3.9.
Figure 3.9 Netlist after nodes $C_1$ and $C_5$ are clustered
$U_1$ and $C_7$ are now considered for clustering. $U_1$ and $C_7$ are connected by nets $n_3$
and $n_5$. The bandwidth of net $n_3$ is $\frac{1}{3}$ and that of net $n_5$ is $\frac{1}{2}$. The net bandwidth
between $U_1$ and $C_7$ is $\frac{1}{3} + \frac{1}{2} = \frac{5}{6}$. The value of connectivity is $\frac{5/6}{1 \times 2} = \frac{5}{12}$. Since this
value is less than 1, the two nodes do not combine to form a cluster. This process
continues until all nodes are clustered once.
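The arithmetic of this worked example can be checked with a short sketch of the Figure 3.7 metric. It takes only the sizes of the nets shared by a pair (a two-node net for the first pair, a four-node and a three-node net for the second), so the rest of the Figure 3.8 netlist is not needed. The function names are hypothetical, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def net_bandwidth(num_nodes):
    """Bandwidth of a net with num_nodes nodes: 1 / (K - 1)."""
    return Fraction(1, num_nodes - 1)


def connectivity(shared_net_sizes, size_i, size_j):
    """Figure 3.7 metric: summed bandwidth over size_i * size_j."""
    bandwidth = sum(net_bandwidth(k) for k in shared_net_sizes)
    return bandwidth / (size_i * size_j)


# First pair: two unit-size nodes sharing one two-node net.
first_pair = connectivity([2], 1, 1)
# Second pair: a size-2 cluster and a unit-size node sharing
# a four-node net and a three-node net.
second_pair = connectivity([4, 3], 2, 1)
```

The first pair reaches the threshold of 1 and clusters; the second evaluates to 5/12 and does not.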
In this thesis, we cluster the nodes only once. Clustering itself could take a large
amount of time if we used recursive clustering. This would not improve the
current clusters because we place a restriction on the fanout of each cluster. Once we
finish clustering, the next stage is bipartitioning. The usual methodology
for partitioning is to partition the clustered nodes, then uncluster them and
partition again. However, our results show that partitioning the
clustered nodes directly gives better results than unclustering and partitioning, if
the correct seed is chosen for partitioning.
3.2 Bipartitioning
After clustering, the stage is now set for bipartitioning. One of the best known and
most widely used bipartitioning algorithms is the K-L algorithm [2]. In this approach
a pair of nodes is swapped between two partitions until no further improvement is obtained. A
modified version of the K-L algorithm was presented by Fiduccia and Mattheyses in [3].
This algorithm has linear time complexity. Another feature of the F-M
algorithm is the efficient data structure that is used.
In this algorithm the circuit is represented as a hypergraph $H(V, E)$. Given a
network which consists of nets and cells, also referred to as nodes, the algorithm divides
them into two blocks A and B such that the number of nets that have cells in both
blocks is minimal. The main idea of the algorithm is to move one cell at a time from
one block to the other block. The cell to be moved is called the base cell. The base cell is
chosen based on a balance criterion and on its effect on the size of the current cutset.
The balance criterion specifies the number of nodes that need to be in one partition.
If a balance criterion is not specified, the partitioning can become such that one
block has a large number of nodes while the other has very few.
The gain of any given node or cell is the number of nets by which the cutset
would decrease if the node or cell were moved from one partition to the other. If the
number of nets incident on a cell c(i) is represented by p(i), the gain g(i) of any cell
can vary only between -p(i) and +p(i). This is because when a cell c(i) is moved from
one partition, the maximum number of nets that can be added to the cutset is
equal to the number of nets connected to this cell c(i). Similarly, the maximum
number of nets that can be removed from the cutset is equal to p(i). Hence the
gain of any cell c(i) can vary only between -p(i) and +p(i).
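The gain definition can be made concrete with a minimal sketch. It assumes every net has at least two cells and that the partition is given as a dictionary from cell to block label; the function name `gain` and the example netlist are hypothetical.

```python
def gain(cell, side, nets):
    """Number of nets leaving the cutset minus nets entering it
    if `cell` is moved to the other block.

    side: dict mapping each cell to "A" or "B".
    nets: list of sets of cells, each with at least two cells.
    """
    g = 0
    for net in nets:
        if cell not in net:
            continue
        same = sum(1 for c in net if side[c] == side[cell])
        other = len(net) - same
        if same == 1:
            g += 1   # cell is alone on its side: the move uncuts this net
        elif other == 0:
            g -= 1   # net lies entirely on cell's side: the move cuts it
    return g


side = {1: "A", 2: "A", 3: "B", 4: "B"}
nets = [{1, 3}, {1, 4}, {1, 2}]
g1 = gain(1, side, nets)   # two nets would be uncut, one would be cut
g2 = gain(2, side, nets)   # its only net would become cut
```

Cell 1 sits on p(1) = 3 nets, so its gain of 1 lies within the range -3 to +3 stated above.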
The base cell is chosen as the cell with the highest gain. If the balance criterion
does not allow this node or cell to be moved, the node or cell with the highest gain in
the complementary block or partition is chosen. There is a possibility that this cell has
a negative gain. This cell is still moved with the expectation that the move will allow
the algorithm to climb out of local minima. All the cells are moved from one partition
to another in this manner. The movement of all cells from one partition to another is
called a pass. The best partition encountered during the pass is chosen as the output of
the pass.
In one pass, after every cell is moved it is locked in its new block. This cell
remains locked for the remainder of the pass. Thus only free cells
are allowed to move. This is done to prevent the algorithm from running into an
infinite loop. The algorithm moves to the best intermediate partition and then unlocks
all nodes and proceeds into the next pass. If a pass fails to find a better cutset than the
previous pass, the algorithm is terminated.
The main feature of this algorithm is the methodology used to find the best cell. It
uses a sorted list called the bucket data structure to find the best cell to move. This is
made possible by the fact that the gain g(i) of any cell or node c(i) is within
the range -pmax to +pmax, where pmax = max{p(i)}. The data structure has an array
of lists, where each list contains cells in the same partition that cause the same change
to the cutset when moved. That is to say, nodes with the same gain are
maintained as a linked list in an array of gains. For example, if nodes 3 and 4 had the
same gain when moved from block A to block B, they are maintained in the same list.
Thus we have an array bucket[-pmax ... +pmax]. Since we have two blocks A and B,
we have two such arrays, one each for blocks A and B. Every element k in the array
contains a linked list of free cells with gains currently equal to k.
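A minimal sketch of one such bucket array follows. It is an illustration under stated assumptions rather than the thesis implementation: Python sets stand in for the linked lists, the class and method names are hypothetical, and one instance would be kept for each of the two blocks.

```python
class GainBuckets:
    """Array of buckets indexed by gain in the range -pmax .. +pmax."""

    def __init__(self, pmax):
        self.pmax = pmax
        self.buckets = [set() for _ in range(2 * pmax + 1)]
        self.gain_of = {}      # cell -> current gain
        self.max_index = -1    # highest non-empty bucket, -1 if all empty

    def insert(self, cell, gain):
        idx = gain + self.pmax
        self.buckets[idx].add(cell)
        self.gain_of[cell] = gain
        self.max_index = max(self.max_index, idx)

    def remove(self, cell):
        idx = self.gain_of.pop(cell) + self.pmax
        self.buckets[idx].discard(cell)
        # Walk the max pointer down past any buckets that are now empty.
        while self.max_index >= 0 and not self.buckets[self.max_index]:
            self.max_index -= 1

    def update(self, cell, delta):
        """Move a cell to the bucket matching its new gain."""
        new_gain = self.gain_of[cell] + delta
        self.remove(cell)
        self.insert(cell, new_gain)

    def best(self):
        """Return some free cell with the highest gain, or None."""
        if self.max_index < 0:
            return None
        return next(iter(self.buckets[self.max_index]))


buckets = GainBuckets(pmax=3)
buckets.insert(3, 1)    # nodes 3 and 4 share a gain, so they share a bucket
buckets.insert(4, 1)
buckets.insert(7, -2)
first_best = buckets.best()
buckets.update(7, +4)   # node 7 now has gain 2
second_best = buckets.best()
```

Because gains are bounded by pmax, the highest-gain cell is found by following the max pointer instead of sorting all cells.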
When a cell is moved, the gains of other cells need to be updated. However, it is
also true that when a cell is moved, only the gains of the cells it is connected to will be
updated. So where a naive algorithm would recompute the gains of all cells, the F-M
algorithm recomputes the gains of only those cells to which the moved cell is connected.
Whenever a cell is moved from one partition to another, it is removed from the bucket
structure and each cell to which it is connected is moved to the appropriate bucket
based on its new gain value. Every cell has a gainbucket entry through which the bucket
structure accesses the cell. Thus the movement from one bucket to another is done in
constant time. This is shown in Figure 3.10.
Figure 3.10 Bucket list structure
A net is said to be critical if moving any node on it causes a change in the cutset.
Consider a net n, and let A(n) represent the number of cells on this net in block A
and B(n) the number of cells on this net that are in block B. A net n is said to
be critical if A(n) or B(n) is 1 or 0. The gain of a cell depends only on the critical nets
it is connected to. So a net which is not critical will not affect the gain of any of its
cells. This is shown in Figure 3.11.
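The criticality test reduces to counting the cells of a net on each side. The sketch below assumes the partition is a dictionary from cell to block label; the function name and the example partition are hypothetical.

```python
def is_critical(net, side):
    """A net is critical if A(n) or B(n) is 0 or 1."""
    a = sum(1 for c in net if side[c] == "A")
    b = len(net) - a
    return a <= 1 or b <= 1


side = {1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B"}
all_in_a = is_critical({1, 2, 3}, side)        # B(n) = 0
one_in_a = is_critical({1, 4, 5}, side)        # A(n) = 1
two_each = is_critical({1, 2, 4, 5}, side)     # A(n) = B(n) = 2
```

Only the last net is non-critical: no single move can change whether it is cut.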
Figure 3.11 Critical nets
As seen in Figure 3.11, if any of the nodes is moved to the other block B in the
first case, the net will be cut. In the second case, one node is placed in block A while
the others are in the complementary block B. If that node is moved from A to B then the
net n is removed from the cutset; otherwise it remains in the cutset. Hence the net n is
critical.
When a cell is moved, all nets connected to it need to be updated. There are four
situations that need to be considered. The first case is when a net was not in the
cutset but was moved into it. The second case is when a net was in the cutset but was
moved out of it. The third case is when a net was firmly in the cutset but is now
removable from the cutset. The fourth case is when a net was removable from the
cutset but is now firmly in the cutset. A net is said to be firmly cut if it has two nodes
in one partition or a locked node in one partition. All other cut nets are said to be
removable because they have only one node in one of the partitions and that node is
unlocked. By moving this node, the net itself can be removed from the cutset.
Updating the gains of the neighbors is the important part of the F-M algorithm. It has
been proved that during one pass of the algorithm not more than four update operations are
performed per net. If a net has cells locked in two different partitions, then the net is
said to be dead and it remains in the cutset. Updating the gains is done by
scanning all the nets connected to the base cell. If a net has no cell in the block to
which the base cell is moved, then the gains of all free cells on the net are incremented.
Figure 3.12 Before moving the base cell
Figure 3.12 shows the status of net n before the base cell is moved into the
complementary block. Once it has been moved, the gains of the other cells on this net
are updated as shown in Figure 3.13.
Figure 3.13 After moving the base cell
Let F be the side from which the base cell is moved and T be the side to which the base
cell is moved. If the number of cells on the net on the T side is 1, then the gain of this cell,
provided the cell is free, is decremented. Once this is done, decrement the cell count F(n) of the
side from which the base cell is moved and increment the cell count T(n) of the side into which
it is moved. Then the net is checked to see if it is still critical. If this net has no nodes in the
partition from which the base cell was moved, then decrement the gains of the free cells on that net.
Else if the number of cells in the partition from which it was moved is equal to 1, then
increment the gain of this cell, provided it is free.
3.2.1 Pseudo code for the F-M algorithm
The pseudo code for the F-M algorithm is shown below.
Get the netlist; Create the initial partition randomly;
while cutsize has decreased
{
    while cells are available for moving and balance criterion is satisfied
    {
        Select the node with the highest gain which also does not cause an
        imbalance in the partition size;
        Move this node to the other partition and lock the node;
        Update the gains of its neighbors ( );
    }
    Find the move where minimum cutsize is obtained;
    Unlock all nodes in the partition where minimum cutsize is obtained;
}
Update the gains of its neighbors ( )
{
    Let F be the partition from which the base cell is moved;
    Let T be the partition into which the base cell is moved;
    For each net n on the base cell
    {
        if (T(n) = 0)    /* if the number of cells on net n in partition T = 0 */
            Increment gains of all free nodes on this net n;
        else if (T(n) = 1)
            Decrement the gain of this cell in T, if it is free;
        Decrement F(n);  /* decrease the number of cells in F */
        Increment T(n);  /* increase the number of cells in T */
        if (F(n) = 0)
            Decrement the gain of all free cells connected to net n;
        else if (F(n) = 1)
            Increment the gain of this cell which is in F, if it is free;
    }
}
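The pseudo code above can be turned into a small runnable sketch. It is a simplified illustration, not the thesis implementation: the bucket structure is replaced by a plain scan for the highest-gain legal cell, the balance criterion is assumed to be a minimum block size `min_size`, and all function names are hypothetical.

```python
def cut_size(nets, side):
    """Number of nets that have cells in both blocks."""
    return sum(1 for net in nets if len({side[c] for c in net}) > 1)


def fm_pass(nets, side, min_size):
    """Run one F-M pass in place and return the best cut size seen.

    nets: list of sets of cells; side: dict cell -> 0 or 1;
    min_size: assumed balance criterion (fewest cells allowed in a block).
    """
    nets_of = {c: [i for i, net in enumerate(nets) if c in net] for c in side}
    # count[i][s] = number of cells of net i currently in block s
    count = [[sum(side[c] == s for c in net) for s in (0, 1)] for net in nets]
    block_size = [sum(1 for c in side if side[c] == s) for s in (0, 1)]
    gain = {}
    for c in side:
        f, t = side[c], 1 - side[c]
        gain[c] = sum((count[i][f] == 1) - (count[i][t] == 0) for i in nets_of[c])

    free = set(side)
    cut = cut_size(nets, side)
    best_cut, best_prefix, moves = cut, 0, []
    while free:
        # Free cells whose move keeps the source block at or above min_size.
        legal = sorted(c for c in free if block_size[side[c]] - 1 >= min_size)
        if not legal:
            break
        base = max(legal, key=lambda c: gain[c])
        f, t = side[base], 1 - side[base]
        free.remove(base)  # lock the base cell for the rest of the pass
        for i in nets_of[base]:
            on_net = [c for c in nets[i] if c in free]
            if count[i][t] == 0:
                for c in on_net:
                    gain[c] += 1
            elif count[i][t] == 1:
                for c in on_net:
                    if side[c] == t:
                        gain[c] -= 1
            count[i][f] -= 1
            count[i][t] += 1
            if count[i][f] == 0:
                for c in on_net:
                    gain[c] -= 1
            elif count[i][f] == 1:
                for c in on_net:
                    if side[c] == f:
                        gain[c] += 1
        side[base] = t
        block_size[f] -= 1
        block_size[t] += 1
        cut -= gain[base]
        moves.append(base)
        if cut < best_cut:
            best_cut, best_prefix = cut, len(moves)
    for c in moves[best_prefix:]:  # roll back to the best partition of the pass
        side[c] = 1 - side[c]
    return best_cut


def fm_bipartition(nets, side, min_size):
    """Repeat passes until a pass fails to reduce the cut size."""
    best = cut_size(nets, side)
    while True:
        new = fm_pass(nets, side, min_size)
        if new >= best:
            return best
        best = new


# Hypothetical chain 1-2-3-4 with a poor initial partition (cut size 3).
nets = [{1, 2}, {3, 4}, {2, 3}]
side = {1: 0, 2: 1, 3: 0, 4: 1}
result = fm_bipartition(nets, side, min_size=1)
```

On this example the first pass reduces the cut size from 3 to 1, and the second pass finds no improvement, so the loop terminates.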
3.3 Multi-partitioning
The aim of the thesis is to partition the circuit into multiple blocks such that there
is minimum interconnection between the blocks. Since these partitions are going to be
fitted into FPGAs, we try to partition such that each partition fits into an FPGA. Given
present trends, the major restriction in fitting a circuit into an
FPGA is the terminal count and not the number of CLBs. Some multi-partitioning
algorithms have been proposed in the past. One among them is the
algorithm proposed by Kuznar et al [6], which performs multi-partitioning
using functional replication. The proposed algorithm uses this basic idea
with some modifications. The problem with Kuznar's algorithm is that there is no
limit on the maximum partition size. Their approach avoids counting primary
inputs connected to both partitions as part of the cutset. In our proposed algorithm
the primary inputs connected to both partitions are a part of the cutset. Another major
drawback is that there is no restriction on the number of replications. This
unfortunately results in one partition being about 60% of the circuit.
In our proposed approach we do not use cell replication. This means that the sum of
the cells in all partitions is equal to the total number of cells in the circuit. This allows
us to place a restriction on the size of the partition. In this algorithm the circuit is
clustered using the methodology explained in Section 3.1. This clustered circuit then
forms our circuit to be partitioned.
The algorithm progresses by iteratively applying F-M partitioning to the circuit.
Clustering reduces the time taken to partition and improves the quality of the cutset.
When the clustered circuit is bipartitioned, it is done such that one partition always
satisfies the balance criterion given by the ratio which the user inputs. If the user wants
30% of the design to be in one partition, then the circuit is partitioned such that 30% of
all the cells are in one partition and 70% in the other. Now the user could decide to
further partition the circuit. In this case the user chooses the percentage of the entire
circuit that needs to be in one partition. Based on this the circuit is again partitioned.
This is continued until all partitions can fit into one FPGA each. If at any point
during the partitioning the number of nets that cross the given partition exceeds
the number of input/output pins, the multi-partitioning is reinitialized with a lower ratio.
A multi-way partitioning is created by recursively applying the bipartitioning
algorithm to the circuit until the required number of partitions is obtained or until all
partitions satisfy the constraints given by the FPGAs. This means that each
partition is mapped into one FPGA. Hauck et al [17] use recursive bipartitioning.
However, their algorithm always divides the circuit only into an even number of
partitions, all of which are equal in size. This might help if all the FPGAs are the
same. However, it is always better to have different FPGAs of different sizes. It has
been shown by Kuznar that it is always better to map a circuit into small FPGAs first
before using the large FPGAs. In line with this, our
results show that the best results are obtained when small ratios are used initially.
The circuit after clustering is first partitioned such that the balance ratio (specified
by the user) * total cells fits into one FPGA. Thus the sizes of the two partitions that
are obtained would be sizepart1 and sizepart2 respectively. The values of sizepart1 and
sizepart2 are as follows:
sizepart1 = totalcells * ratio
sizepart2 = totalcells - sizepart1
The ratio is specified such that one partition satisfies the constraints of one
FPGA. If the number of nets that cross the partition exceeds the number of input/output
pins of the FPGA, the multi-partitioning is redone with a ratio much lower than
that given as input. Since the number of nets connected to a cluster varies between 5
and 10, the number of clusters within a partition decides the number of nets that
cross the partition. Every time a cluster is moved from one partition to the other, the
cutset increases or decreases by this number. So when the number of nodes in the
partition decreases, the cutsize also decreases. This is the reason that we use smaller
FPGAs to fit the circuits initially and then use the larger FPGAs.
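The effect of moving one cluster on the cutset can be illustrated with a small sketch. It is not the thesis implementation: the function names, the toy netlist and the 0/1 side encoding are assumptions made for illustration only.

```python
def cut_size(nets, side):
    """Count the nets that have cells on both sides of the bipartition."""
    return sum(1 for net in nets if len({side[cell] for cell in net}) > 1)


def move_delta(nets, side, cell):
    """Change in cutsize if `cell` switches sides (negative means fewer cuts)."""
    touched = [net for net in nets if cell in net]  # only these nets can change
    before = cut_size(touched, side)
    flipped = dict(side)
    flipped[cell] = 1 - side[cell]
    return cut_size(touched, flipped) - before


# Hypothetical four-cell netlist: each net is a tuple of the cells it connects.
nets = [("a", "b"), ("b", "c"), ("b", "d"), ("c", "d")]
side = {"a": 0, "b": 0, "c": 1, "d": 1}

print(cut_size(nets, side))         # nets crossing the partition
print(move_delta(nets, side, "b"))  # effect of moving cell b to the other side
```

Only the nets attached to the moved cell are re-examined, which is why the change in cutsize is bounded by the number of nets connected to that cell.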
Once one partition satisfies the constraints of one FPGA, the remainder of the
circuit is partitioned in a similar fashion such that the second partition satisfies the set
of constraints imposed for the second partition. This process is continued until all
partitions satisfy the constraints specified.
The recursive bipartitioning strategy, as proposed and illustrated, can also be viewed
as an extreme case of asymmetrical recursive bipartitioning. In this strategy, each
application of the bipartitioning procedure produces one feasible subcircuit and the
remainder. The same strategy is then applied to the remainder until feasible solutions
are obtained.
The main advantage of the recursive asymmetrical bipartitioning method over the
symmetrical bipartitioning method (which in each step generates two balanced
partitions) is the immediate possibility of evaluating the quality of at least one of the
subsets produced in each bipartitioning stage.
The purpose of clustering is to reduce the time taken to run this iterative recursive
bipartitioning algorithm and also to improve the quality of the result. The initial set of
nodes is therefore first clustered to form cells, and these cells are then partitioned. The
pseudocode for the above algorithm is shown below.
3.3.1 Pseudocode for the proposed algorithm
for all cells
{
    check for clustering criterion;
    if criterion is satisfied
        cluster the cells;
}
calculate the partition size based on the ratio;
while Rj does not satisfy the constraints imposed
{
    partition the circuit into Pi and Rj such that Pi satisfies the constraints
    that are imposed;
    Rj <- Rj+1;
}
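The partitioning loop of the pseudocode can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the thesis software: `multi_partition`, `fits_fpga`, `bipartition`, `min_ratio` and the halving back-off are hypothetical names and choices. In the thesis the constraint check also covers the I/O pin count and the split is done by clustering followed by F-M; here both are passed in as callables and replaced by toy stand-ins.

```python
import math


def multi_partition(cells, ratio, fits_fpga, bipartition, min_ratio=0.01):
    """Peel one FPGA-sized part off the remainder until the remainder fits."""
    parts, remainder = [], list(cells)
    while len(remainder) > 1 and not fits_fpga(remainder):
        r = ratio
        while True:
            # Size of the part to split off, kept strictly inside the remainder.
            k = min(len(remainder) - 1, max(1, math.ceil(r * len(remainder))))
            part, rest = bipartition(remainder, k)
            if fits_fpga(part) or r <= min_ratio:
                break  # accepted, or given up (the part may still be infeasible)
            r /= 2  # assumed back-off rule: retry with a lower ratio
        parts.append(part)
        remainder = rest
    parts.append(remainder)
    return parts


# Toy stand-ins (hypothetical): a capacity-only constraint and a positional split.
def fits(part, capacity=30):
    return len(part) <= capacity


def split_first_k(cells, k):
    return cells[:k], cells[k:]


parts = multi_partition(range(100), 0.25, fits, split_first_k)
print([len(p) for p in parts])
```

Each pass produces one feasible part and a remainder, which mirrors the asymmetrical recursive bipartitioning described above; the inner loop models the restart with a lower ratio when a part violates the constraints.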
As seen in the pseudocode, we initially cluster the nodes to form cells. The FM
partitioning algorithm is applied to this new graph, which has cells and nets, such that
the balance ratio is satisfied. Thus the numbers of cells in the two partitions will be
ceil(ratio * number of cells) and number of cells - ceil(ratio * number of cells)
respectively. If R0 is the entire circuit, it is partitioned into P1 and R1
such that P1 can be implemented on one FPGA. R1 is the remainder of the circuit, which
will be partitioned into P2 and R2, where P2 will satisfy the constraints of one FPGA. This
process of recursive partitioning is continued until we have P1, P2, ..., Pk. Figure 3.14,
Figure 3.15, Figure 3.16 and Figure 3.17 show the multi-partitioning as performed
stage by stage.
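As a worked example of the ceiling rule for the two partition sizes, the sketch below uses a hypothetical helper, `partition_sizes`. The use of `Fraction` is an implementation choice made here (not taken from the thesis) so that a ratio such as 0.1 is treated exactly rather than as a binary floating-point value.

```python
import math
from fractions import Fraction


def partition_sizes(total_cells, ratio):
    """Return (ceil(ratio * total_cells), remaining cells)."""
    exact = Fraction(str(ratio)) * total_cells  # exact decimal arithmetic
    size1 = math.ceil(exact)
    return size1, total_cells - size1


# s38417 has 13989 cells after clustering (Table 4.1); ratio 0.1 as an example.
print(partition_sizes(13989, 0.1))
```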
Figure 3.14 Initial circuit (figure label: R0)
Figure 3.15 Circuit after first bipartitioning (figure label: R1)
Figure 3.16 Circuit after bipartitioning twice (figure label: R2)
Figure 3.17 Circuit after final bipartitioning (figure labels: P1, P2, P3, P4)
Thus the given circuit is multi-partitioned so as to reduce the interconnections
while also satisfying the balance.
CHAPTER 4
EXPERIMENTAL RESULTS AND TABULATIONS
In the previous chapter a detailed description of the algorithm was given. In this
chapter the results of the various experiments conducted are tabulated and explained. The
proposed algorithm has been exercised on the MCNC 99 benchmark suite and the MCNC 98
benchmark suite. The input files for the software implementing the algorithm are in .net
or .netD format. The software was also tested by partitioning
a Xilinx Netlist Format (XNF) file and mapping it onto a Xilinx Virtex chip.
First the .net file is clustered using the modified connectivity algorithm, and then it is
bipartitioned using the F-M algorithm. Since most of the earlier algorithms were
evaluated on the MCNC 98 series or the MCNC benchmarks, we shall first tabulate the
results for these files. The netlists of the MCNC 99 benchmarks have a huge number of nodes
and a large number of nets connecting them.
4.1 Clustering Results 
Table 4.1 shows the total number of nodes in each .net file. The third column shows
the number of nets that connect these nodes in each case. Each net has at least two nodes
connected to it. The fourth column shows the number of nodes after clustering. The
clustering criterion used here is that the modified connectivity should be greater than 1.
Modified connectivity is defined as

\[ \text{modified connectivity}_{ij} = \frac{\text{bandwidth}_{ij}}{\text{cellsize}_i \times \text{cellsize}_j} \]
Table 4.1 Characteristics of the netlist after clustering.

Filename     # of nodes   # of nets   # of clusters   # of resulting nets
19ks               2844        3282            1766                  1950
s9234              5866        5844            3340                  3367
biomed             6514        5742            4250                  3530
s13207             8772        8651            5143                  5131
s15850            10470       10383            5964                  5976
industry2         12637       13419            8739                  9703
industry3         15406       21923           10584                 17600
s35932            18148       17828           10264                 10835
s38584            20995       20717           13017                 12855
s38417            23849       23843           13989                 14043
golem3           103048      144949           65801                103568
From the above table it can be seen that the numbers of nodes and nets after
clustering are smaller than those in the original netlists.
4.2 Comparison with other clustering methodologies
In Table 4.2 a comparison of different methods of clustering is shown. The third
column gives the number of clusters formed using the modified connectivity criterion.
The fourth column shows the number of clusters formed using only the bandwidth
criterion. In this criterion the nodes are combined if the bandwidth between the nodes is
greater than 1. The fifth column shows the number of clusters formed when clustered
using connectivity clustering. In the case of connectivity clustering, nodes are
combined if

\[ \frac{\text{bandwidth}_{ij}}{\text{cellsize}_i \times \text{cellsize}_j \times (\text{fanout}_i - \text{bandwidth}_{ij}) \times (\text{fanout}_j - \text{bandwidth}_{ij})} \]

is greater than 1.
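The three clustering criteria compared in this section can be written side by side as a minimal sketch. It assumes that modified connectivity is bandwidth divided by the product of the two cell sizes and that connectivity additionally divides by (fanout - bandwidth) for each node. The function names, and the choice to return infinity when a denominator is zero, are illustrative assumptions and not part of the thesis.

```python
def bandwidth_criterion(bandwidth):
    """Bandwidth clustering: merge if the bandwidth alone exceeds 1."""
    return bandwidth > 1


def modified_connectivity(bandwidth, size_i, size_j):
    """Bandwidth normalised by the sizes of the two cells."""
    return bandwidth / (size_i * size_j)


def connectivity(bandwidth, size_i, size_j, fanout_i, fanout_j):
    """Connectivity: also penalises nodes with many other connections."""
    denom = size_i * size_j * (fanout_i - bandwidth) * (fanout_j - bandwidth)
    # Assumption: treat a zero denominator as "merge unconditionally".
    return float("inf") if denom == 0 else bandwidth / denom


# Hypothetical pair of unit-size nodes with bandwidth 2 and fanouts 4 and 3.
b, si, sj, fi, fj = 2, 1, 1, 4, 3
print(bandwidth_criterion(b))                 # merge under bandwidth clustering?
print(modified_connectivity(b, si, sj) > 1)   # merge under modified connectivity?
print(connectivity(b, si, sj, fi, fj) > 1)    # merge under connectivity?
```

For this pair the first two criteria merge the nodes while connectivity does not, which is in line with connectivity clustering leaving the largest number of clusters.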
Table 4.2 Comparison with other clustering criteria

                                 # of clusters using
Filename         # of nodes   modified connectivity   bandwidth   connectivity
industry2.net         12637                    8739        2529          12405
industry3.net         15406                   10584        2557          15343
s35932.net            18148                   10264        5239          16942
ibm02.net             19601                   10584        6265          19216
s38584.net            20995                   13017        3392          19239
ibm03.net             23136                   16330        4262          23007
ibm04.net             27507                   20103        6266          26660
It is seen from the above table that the best method of clustering in terms of the number
of resulting clusters is bandwidth clustering. However, there are two drawbacks to this
method of clustering. First, the time taken for clustering increases significantly. Second,
the fanout of each cluster becomes larger, so the cutsize would eventually
increase after partitioning. The next best solution as seen from the table is modified
connectivity. The fanout of each cluster will not be high, because our algorithm
ensures that only those nodes which have a fanout less than or equal to 5 are clustered.
A simple comparison chart of the number of clusters formed is shown in Figure 4.1.
Table 4.3 shows the time taken by the modified connectivity algorithm and the time taken by
bandwidth clustering to form clusters. The software was run on a Pentium 4
processor running at 1.8 GHz under the Linux operating system.
[Chart "Different criterion for clustering": number of clusters against filename, with one series each for modified connectivity, bandwidth and connectivity clustering.]
Figure 4.1 Comparison of different clustering criteria.
Table 4.3 Comparison of time taken for clustering for different algorithms

Filename         # of nodes   Time taken for   Time taken for modified
                              bandwidth (s)    connectivity (s)
industry2.net         12637            68.18                      1.11
industry3.net         15406           105.42                      0.25
s35932.net            18148             0.26                      0.19
ibm02.net             19601           384.86                      0.97
s38584.net            20995            14.4                       0.18
ibm03.net             23136          1644.58                      0.72
ibm04.net             27507          2027.42                      0.7
From Table 4.3 it can be seen that the time taken for bandwidth clustering is very
large. For example, the ibm02 netlist takes 0.97 seconds for clustering when the modified
connectivity criterion is applied. When the same file is clustered using the bandwidth
criterion, the time taken is 384.86 seconds. The average time taken using bandwidth clustering
was found to be 114.62 seconds and the average time taken by modified connectivity
clustering is 0.54 seconds. In addition, the fanout of each cluster formed by bandwidth
clustering is large, which results in a large cutsize. Hence bandwidth clustering is not used.
Table 4.4 shows the clustering results for the MCNC 99 benchmark suite circuits. The
first column shows the filename, and the second gives the number of nodes in each circuit.
The third column shows the number of clusters after modified connectivity clustering is
applied. The fourth column shows the time taken for clustering.
Table 4.4 Clustering on MCNC 99 benchmark circuits

Filename   # of nodes   # of clusters   Time taken for clustering (s)
ibm01           12752            9081                            0.32
ibm02           19601           14031                            0.97
ibm03           23136           16330                            0.72
ibm04           27507           20103                            0.7
ibm10           69429           48706                            2.61
ibm11           70558           53191                            1.9
ibm17          185495          138566                           10.52
ibm18          210613          156418                            7.97
It is evident from Table 4.4 that the time taken for clustering generally increases with
the number of nodes being clustered.
4.3 Bipartitioning Results
The next step towards multi-partitioning is to bipartition this circuit. We perform
bipartitioning using the F-M algorithm. In our algorithm we first cluster the circuit before we
bipartition.
Table 4.5 shows the results of the F-M algorithm applied to a few MCNC benchmark
circuits. The first and second columns give the name of the file and the
number of nodes in this file. Here a node could be a D flip-flop, a gate, a small combinational
circuit, etc. The third column shows the number of nets or hyperedges in the circuit, and
the fourth gives the number of nets that have nodes in both partitions. The
bipartitioning is performed by forcing the balance condition to 45-55.
Figure 4.2 shows the variation of the number of cuts with the number of nodes in the
circuit. It is evident from the graph that the F-M algorithm gives good results for
small circuits. But as the size of the circuit becomes large, the cutsize tends to
increase. This shows the inefficiency of the F-M algorithm in grouping smaller clusters into
one partition.
Table 4.5 Results of F-M partitioning

Filename   # of nodes   # of hyperedges   # of cuts in FM
19ks             2844              3282               104
s9234            5866              5844                49
biomed           6514              5742                60
s13207           8772              8651                62
s15850          10470             10383               170
ibm01           12752             14111               220
s35932          18148             17828               107
ibm02           19601             19584                84
s38584          20995             20717               185
ibm03           23136             27401              1315
s38417          23949             23843               138
ibm04           27507             31970              1450
ibm10           69429             75196              3523
ibm11           70558             81454              3577
golem3         103048            144949              2430
ibm17          185495            189581              2558
ibm18          210613            201920              5114
[Chart "F-M Bipartitioning": number of cuts in FM against circuit name.]
Figure 4.2 Variation of number of cuts with the size of the circuit.
This can be overcome by clustering the circuit before partitioning it. This way the
smaller, loosely connected clusters can also enter into one partition.
Table 4.6 shows the comparison of the F-M algorithm with our algorithm. It is seen from
the table that, in the case of our algorithm, there is only a steady increase in the number
of cuts as the size increases, unlike in the case of the F-M algorithm. Figure 4.3
shows the variation of the number of cuts with respect to the number of nodes for the F-M
algorithm and our algorithm.
Table 4.6 Comparison of F-M algorithm with our algorithm

Filename   # of nodes   # of hyperedges   # of cuts in FM   # of cuts after clustering
                                                            and partitioning
19ks             2844              3282               104                           78
s9234            5866              5844                49                            7
biomed           6514              5742                60                           46
s13207           8772              8651                62                           19
s15850          10470             10383               170                           41
ibm01           12752             14111               220                          140
s35932          18148             17828               107                          118
ibm02           19601             19584                84                           96
s38584          20995             20717               185                           73
ibm03           23136             27401              1315                          966
s38417          23949             23843               138                          115
ibm04           27507             31970              1450                          474
ibm10           69429             75196              3523                         1262
ibm11           70558             81454              3577                         1413
golem3         103048            144949              2430                         2272
ibm17          185495            189581              2558                         2565
ibm18          210613            201920              5114                         2145
It can be seen from Figure 4.3 that the F-M algorithm and our algorithm perform almost
similarly for small circuits. But as the size of the circuit increases, our algorithm
performs better than the F-M algorithm.
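The size of the improvement can be read off Table 4.6 with a short calculation. The sketch below uses a hypothetical helper, `cut_reduction_percent`, and a few rows copied from that table; a negative value means that clustering followed by partitioning produced more cuts than F-M alone.

```python
def cut_reduction_percent(cuts_fm, cuts_clustered):
    """Percentage of F-M cuts removed by clustering before partitioning."""
    return 100.0 * (cuts_fm - cuts_clustered) / cuts_fm


# (cuts in FM, cuts after clustering and partitioning), from Table 4.6.
table_4_6 = {
    "s9234": (49, 7),
    "s35932": (107, 118),
    "ibm04": (1450, 474),
    "ibm18": (5114, 2145),
}

for name, (fm, ours) in table_4_6.items():
    print(f"{name}: {cut_reduction_percent(fm, ours):.1f}%")
```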
[Chart "Comparison with FM algorithm": number of cuts for two series, "# of cuts in FM" and "# of cuts after clustering and partitioning".]
Figure 4.3 Variation of number of cuts with size of the circuit.
It is to be noted that these comparisons have been made when the ratio of the
partitions is 0.45 or 0.5. In this thesis the ratio is defined as the number of nodes in one
partition divided by the total number of nodes in the entire circuit. For example, a partition
ratio of 0.1 means that we have 10% of the nodes in one partition and 90% of the nodes in the
other partition. This is also sometimes referred to as the 10-90 balance condition. However,
since we are interested in multi-partitioning, there is no need to maintain this value of
the ratio. This means that the two resulting partitions do not have to be of equal size.
Taking this into account, the number of cuts at different ratios is tabulated.
4.4 Bipartitioning at different ratios
Table 4.7 shows the number of cuts when the circuit is forced to a 10-90 balance
condition, 20-80 balance condition, 25-75 balance condition, 30-70 balance condition and
40-60 balance condition. In the table, the columns 0.1, 0.2, 0.25, 0.3 and 0.4 represent the
ratios with which the circuit is partitioned. Figure 4.4 shows the variation of the number of
cuts with respect to different ratios on different circuits.
Table 4.7 F-M algorithm performed at different ratios.

Filename    # of nodes    0.1    0.2   0.25    0.3     0.4
industry2        12637    150    185    603    395     488
industry3        15406    356    134    179    194     245
s35932           18148     67    107    102    132     205
ibm02            19601    111     84    245    231     388
s38584           20995    215    185    211    254     222
ibm03            23136    922   1315   1450   1125    1648
ibm04            27507    875   1450   1379    659     922
ibm10            69429   2354   3523   2764   2753    1948
ibm17           185495   6948   2558   4090   3073   10660
ibm18           210613   2956   5114   2967   4280    2561
Figure 4.4 shows the variation of cuts for circuits without clustering. It can be
concluded from the graph that for circuits with a smaller number of nodes, the partition
ratio does not really matter: almost all ratios have a cutsize in the same
region. But as the size of the circuit becomes large, the number of cuts increases
drastically at higher ratios. For example, ibm17 has a cutsize of 2558 at a balance
ratio of 0.2, while it has a cutsize of 10660 at a balance ratio of 0.4. This indicates that
when a large circuit is being multi-partitioned it is better to first partition the
circuit with a lower ratio before going to higher ratios. This means it is beneficial
to have FPGAs of different sizes so that the CLB utilization is high.
[Chart "F-M at different ratios": cutsize against # of nodes, with one series for each of the ratios 0.1, 0.2, 0.25, 0.3 and 0.4.]
Figure 4.4 Variation of cutsize with different ratios on F-M algorithm.
The graph shows the results of the simple F-M algorithm as applied to a circuit. Table 4.8
shows the variation of cutsize with different ratios for the modified F-M
algorithm.
Table 4.8 Variation of cutsize with ratio when applied to modified F-M algorithm

Filename   # of nodes    0.1    0.2   0.25    0.3    0.4   0.45    0.5
19ks             2844     41     78     88     93    130    121    139
s9234            5866      4      7     22     20     57     61     59
biomed           6514     34     46     69     90     86     84    161
s13207           8772     14     19     33     76     71    114     88
s15850          10470     21     41     46     52     77    114    152
ind2            12637    137    134    143    129    297    335    524
ind3            15406    149    147    147    147    168    301    384
s35932          18148    106    118     92    100    235    271    271
s38584          20995     65     73     53     54     54     51     68
s38417          23949     84    115    147    172    222    338    465
golem3         103048   1960   2272   2249   2382   3859   3690   4836
In the table, the columns 0.1, 0.2, 0.25, 0.3, 0.4, 0.45 and 0.5 represent the ratios with
which the circuit is partitioned. The results tabulated here are the minimum of twenty runs.
Figure 4.5 shows the variation of the cutset with respect to different ratios for
the modified F-M algorithm. The X-axis gives the names of the files, which are
arranged in increasing order of the number of nodes in them.
As can be seen from the graph, the cutset is almost the same for lower values of the cut
ratio. As the ratio nears 0.4, the cutset starts increasing. This is particularly evident for
large circuits like golem3.
[Chart "Variation of cutsize for different ratios": cutsize against filename, with one series for each of the ratios 0.1, 0.2, 0.25, 0.3 and 0.4.]
Figure 4.5 Cutsize for different ratios as applied to modified F-M algorithm.
The results above are for the modified F-M algorithm tested on MCNC 98 benchmark circuits.
In Table 4.9 the modified F-M algorithm is tested on MCNC 99 benchmark circuits. Figure 4.6
shows the variation of cutsize with respect to the ratio of partitioning on MCNC 99 benchmark
circuits.
Table 4.9 Cutsize of various MCNC 99 benchmark circuits with respect to different ratios

Filename   # of nodes    0.1    0.2   0.25    0.3    0.4   0.45    0.5
ibm01           12752    134    140    125    160    215    356    308
ibm02           19601     96     96    173    171    310    294    403
ibm03           23136    395    966    979   1075   1139   1051   1414
ibm04           27507    771    474    535    420    419    747    879
ibm10           69429    664   1262   1354   1364   1853   2115   2286
ibm11           70558    571   1413   1601   1608   1346   1843   3172
ibm17          185495   1930   2565   3930   4265   3598   3805   5075
ibm18          210613   1016   2145   1734   1921   1883   1733   2741
[Chart "Cut size at different ratios": cutsize against # of nodes, with one series per partition ratio.]
Figure 4.6 Variation of cutsize on MCNC 99 benchmarks with respect to different ratios.
[Chart "Comparison between FM and modified FM algorithm": cutsize against filename, with series 0.1, 0.2 and 0.25 for the modified F-M algorithm and FM 0.1, FM 0.2 and FM 0.25 for the F-M algorithm.]
Figure 4.7 Comparison of cutsize between F-M and modified F-M algorithms at different cut ratios.
Figure 4.7 shows the variation of cutsize with different ratios for both the F-M and
modified F-M algorithms. The solid lines represent the cutset of the modified F-M algorithm
at balance ratios of 0.1, 0.2 and 0.25. The dotted lines represent the cutset of the F-M
algorithm for the same balance ratios. The cutsize obtained from the modified
F-M algorithm is lower than that obtained from the simple F-M algorithm for almost all
ratios. Clustering followed by partitioning groups loosely connected clusters into one
partition.
We have tried different methods for clustering. Among the methods tried,
bandwidth clustering seemed to produce better results. Unfortunately, clustering using
the bandwidth criterion took a large amount of time, as was already shown in Table 4.3.
Table 4.10 shows the number of cuts obtained during partitioning when the clustering
was performed using the bandwidth alone.
Table 4.10 Cutsize for different ratios when clustering is performed using bandwidth criterion.

Filename    # of nodes    0.1    0.2   0.25    0.3    0.4   0.45    0.5
19ks              2844     52     91    113    115    135    148    156
s9234             5866      9     22     21     36     43     47     94
biomed            6514     47     46    116     83    106    109    117
s13207            8772      9     18     21     46     74     74    111
s15850           10470     12     42     51     59     82     70    155
industry2        12637    165    206    202    219    300    686    736
s35932           18148     35     35     35     35     43     45     55
s38584           20995     46     92     65     63     81     92    172
s38417           23949     93     91    104     93    188    204    426
Although bandwidth clustering followed by F-M partitioning gives good results for
some circuits, the time taken for clustering is far greater than for the modified F-M
algorithm. Apart from reducing the cutsize, we also try to reduce the time taken for
partitioning. Since the time taken is very large, this method of clustering is not considered.
4.5 Comparison with previous results
Table 4.11 shows the results obtained by various existing algorithms. The results
obtained from our algorithm shall be compared with these results. All these results have
been obtained by forcing the balance criterion to be 45-55% of the circuit.
In the table, the first column shows the circuit name. The column named PROP
gives the cutsize obtained from the algorithm proposed by Kuznar et al. [6]. The
column named Opt.KLFM gives the cutsize obtained from the algorithm proposed
by Scott Hauck et al. [17]. The column named CLIPPROP gives the cutsize obtained
from the algorithm proposed by Shantanu Dutt et al. [8]. The column named PARABOLI
gives the cutsize when the PARABOLI algorithm is applied to the circuit for
partitioning. The column named hMETIS gives the cutsize obtained when the
partitioning algorithm proposed by Karypis et al. [16] is applied to the circuit.
In the table, a '-' means that information regarding the number of cuts for
that particular circuit is unavailable. Figure 4.8 shows the variation of the cutsize with
different algorithms.
Table 4.11 Previous results

Benchmark   PROP   Opt.KLFM   CLIPPROP   PARABOLI   GMetis   hMETIS
19ks         105          -        104          -      106      107
p2           143          -        152        146      142      148
s9234         41         45         42         74       43       40
biomed        83          -         84        135      102       83
s13207        75         62         71         91       74       55
s15850        65         46         56         91       53       42
industry2    220          -        192        193      177      174
industry3      -          -        243        267      243      255
s35932         -         46         42         62       57       42
s38584         -         52         51         55       53       47
s38417         -          -         65         49       69       52
[Chart "Previous results": cutsize against filename, with one series each for PROP, Opt.KLFM, CLIPPROP, PARABOLI, GMetis and hMETIS.]
Figure 4.8 Cutsize of different algorithms.
From Figure 4.8 it is observed that hMETIS gives the best cutsize for most of the circuits.
Table 4.12 shows the results of the modified F-M algorithm in comparison with the above
results. In Table 4.12 the column Mod.FM gives the cuts obtained using the modified F-M
algorithm. The results tabulated are obtained by forcing the balance ratio to be 45-55.
Table 4.12 Comparison with existing algorithms

Benchmark   PROP   Opt.KLFM   CLIPPROP   PARABOLI   GMetis   hMETIS   Mod.FM
19ks         105          -        104          -      106      107      121
p2           143          -        152        146      142      148        -
s9234         41         45         42         74       43       40       61
biomed        83          -         84        135      102       83       84
s13207        75         62         71         91       74       55      114
s15850        65         46         56         91       53       42      114
industry2    220          -        192        193      177      174      335
s35932         -         46         42         62       57       42      271
s38584         -         52         51         55       53       47       51
s38417         -          -         65         49       69       52      338
Figure 4.9 shows the comparison on a chart. There is no need for the partition
ratio to be 0.45 or 0.5. As shown earlier, better results can be obtained at lower ratios
as the size of the circuits increases. Table 4.13 shows the best results obtained among the
different ratios. These results are comparable with those obtained from other existing
algorithms. Figure 4.10 shows the chart that compares these results.
It can be seen from the chart that when the best results are taken, our algorithm
performs better than most existing algorithms. For example, while hMETIS produces 174
cuts when partitioning industry2, the modified F-M algorithm produces only 129 cuts.
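Selecting the best result among the different ratios amounts to taking the minimum of each row of Table 4.8. The sketch below does this for three rows copied from that table; `best_cut`, `RATIOS` and `TABLE_4_8` are hypothetical names used only for illustration.

```python
RATIOS = (0.1, 0.2, 0.25, 0.3, 0.4, 0.45, 0.5)

# Cutsizes of the modified F-M algorithm per ratio, copied from Table 4.8.
TABLE_4_8 = {
    "19ks": (41, 78, 88, 93, 130, 121, 139),
    "industry2": (137, 134, 143, 129, 297, 335, 524),
    "s38417": (84, 115, 147, 172, 222, 338, 465),
}


def best_cut(cuts):
    """Return (ratio, cutsize) for the smallest cutsize in one table row."""
    best = min(cuts)
    return RATIOS[cuts.index(best)], best


for name, cuts in TABLE_4_8.items():
    print(name, best_cut(cuts))
```

For industry2 the minimum is the 129 cuts quoted above, obtained at a ratio of 0.3, so the best-case figures are not measured under the 45-55 balance condition used for the other algorithms.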
[Chart "Comparison with existing algorithms": cutsize against algorithm, with one series for each of the circuits 19ks, p2, s9234, biomed, s13207, s15850, industry2, s35932, s38584 and s38417.]
Figure 4.9 Comparison of Mod F-M with existing algorithms.
Table 4.13 Comparison of best results from Mod F-M algorithm with existing algorithms.

Benchmark   PROP   Opt.KLFM   CLIPPROPf   PARABOLI   GMetis   hMETIS   BEST Mod.FM
19ks         105          -         104          -      106      107            41
p2           143          -         152        146      142      148             -
s9234         41         45          42         74       43       40             4
biomed        83          -          84        135      102       83            34
s13207        75         62          71         91       74       55            14
s15850        65         46          56         91       53       42            21
industry2    220          -         192        193      177      174           129
s35932         -         46          42         62       57       42            92
s38584         -         52          51         55       53       47            51
s38417         -          -          65         49       69       52            84
[Chart "Comparison with existing algorithms": cutsize against algorithm, with one series for each of the circuits 19ks, p2, s9234, biomed, s13207, s15850, industry2, s35932, s38584 and s38417.]
Figure 4.10 Comparison of best results with existing algorithms.
The next important aspect to be considered while partitioning is the time taken for
partitioning. Table 4.14 shows a comparison of the time taken for partitioning a circuit using
the F-M algorithm and the modified F-M algorithm. Figure 4.11 shows the comparison on a chart.
It can be seen from this chart that the time taken for partitioning a circuit after
clustering is comparable with the time taken for partitioning the circuit using the F-M
algorithm alone. This is because the number of nodes to be partitioned is reduced
by clustering. Also, since there are fewer nodes, the number of passes of the F-M algorithm
is reduced. However, clustering itself consumes a certain amount of time.
Table 4.14 Comparison of time taken for partitioning

Filename   # of nodes   # of nodes after   Time for clustering   Time for partitioning
                        clustering         and partitioning      without clustering
balu              801                618                  0.03                    0.01
19ks             2844               1766                  0.27                    0.07
s9234            5866               3340                  0.12                    0.11
biomed           6514               4250                  2.88                    0.2
s13207           8772               5143                  0.15                    0.22
s15850          10470               5964                  0.2                     0.29
ibm01           12752               9081                  1.02                    1.01
s35932          18148              10264                  0.69                    0.6
ibm02           19601              14031                  2.29                    0.95
s38584          20995              13017                  0.62                    0.75
ibm03           23136              16330                  2.01                    1.87
s38417          23949              13989                  0.75                    0.82
ibm04           27507              20103                  2.38                    2.67
ibm10           69429              48706                  6.92                    4.18
ibm11           70558              53191                  6.19                    4.72
golem3         103048              65801                  6.97                    6.86
ibm17          185495             138566                 25.42                   19.26
ibm18          210613             156418                 22.82                   34.93
[Chart "Analysis of time": time against # of nodes, with two series, "Time for clustering and partitioning" and "Time for partitioning without clustering".]
Figure 4.11 Comparison of time taken for partitioning
4.6 Mapping onto Spartan Device 
In an effort to calculate the number of FPGAs needed, the netlist was mapped onto a
Xilinx SPARTAN XCS30 BG256 with a speed grade of -3. The circuit is represented in
the XNF format. The algorithm was tested on the MCNC benchmarks. The circuits were
tested with various ratios. We shall tabulate one of the results thus obtained.
The circuit under consideration is s38417. This circuit was partitioned into multiple
subcircuits such that each subcircuit can be mapped onto one FPGA. The FPGA being
considered here is a Xilinx SPARTAN XCS30 BG256. XCS represents the family of
SPARTAN FPGAs. BG256 represents the package. The circuit is mapped using
Xilinx ISE 4.2i. In this software, the Design Manager allows us to map the XNF files onto
a target FPGA. The Design Manager translates the XNF file into an .ngd file and maps this
.ngd file onto the FPGA.
We partition the circuit with the ratios of balance being 0.1, 0.2, 0.5, 0.5, 0.4, 0.45,
0.45 and 0.45. As can be seen, the initial ratios of balance are low. As the number of
nodes reduces, the partition ratio is increased. The CLB utilization was calculated using
the Xilinx ISE. The number of CLBs in the above-mentioned device is 576. Since the
number of cuts is always defined with respect to two partitions, the total number of results
to be tabulated would be equal to n!, where n is the number of partitions.
The number of partitions obtained after partitioning the circuit with the ratios
mentioned above is 9. We shall refer to each partition by the name part followed by the
partition number. The partitions are arranged in the order in which they are formed. This
means that part1 is obtained on partitioning the circuit s38417. Then the remainder of the
circuit is partitioned to get part2. This goes on until part8 and part9 are obtained, both of
which satisfy the constraints of the given FPGA.
Table 4.15 shows the percentage utilization of CLBs by each partition. The first
row lists the partitions in the order in which they are obtained.
The second row gives the percentage utilization of the CLBs when mapped onto the
Xilinx SPARTAN XCS30. In the above partitioning the input/output buffers are also
considered as one node. Also, all nets connected to the external pins are discarded.
According to the definition of a net in the hypergraph format, any net should have at least
two nodes connected to it. Figure 4.12 shows the utilization of CLBs for each partition
obtained by partitioning s38417.
Table 4.15 Utilization of CLBs
Partition            part1  part2  part3  part4  part5  part6  part7  part8  part9
CLB utilization (%)     96     96     32     77     99     96     83     67     41
Figure 4.12 CLB Utilization for different partitions (horizontal axis: part1 to part9; vertical axis: CLB utilization in percent)
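The percentages in Table 4.15 can be turned into approximate CLB counts using the 576 CLBs of the target device. The sketch below is only arithmetic on the tabulated values; the variable names are illustrative, and because the table reports rounded percentages, the CLB counts are approximate.

```python
TOTAL_CLBS = 576  # CLBs in the target device, as stated in the text

# Percentage utilization per partition, copied from Table 4.15
utilization_pct = {
    "part1": 96, "part2": 96, "part3": 32,
    "part4": 77, "part5": 99, "part6": 96,
    "part7": 83, "part8": 67, "part9": 41,
}

# Approximate number of CLBs occupied by each partition
clbs_used = {name: round(pct / 100 * TOTAL_CLBS)
             for name, pct in utilization_pct.items()}

# Mean utilization over the nine devices
average_pct = sum(utilization_pct.values()) / len(utilization_pct)

for name, used in clbs_used.items():
    print(f"{name}: about {used} of {TOTAL_CLBS} CLBs")
print(f"average utilization: {average_pct:.1f}%")
```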
CHAPTER 5
CONCLUSION AND RECOMMENDATIONS 
The main idea of the proposed algorithm is to partition a given circuit into multiple 
circuits such that they can be mapped onto multiple FPGAs. The focus here is to facilitate 
the simulation of large designs using the Hardware Embedded Simulation technology.
The algorithm presented in this thesis partitions a given circuit, which is in the form of 
a netlist, into multiple circuits such that there is minimal interconnection between the 
different circuits. The proposed algorithm treats the problem of multi-partitioning as 
an iterative bipartitioning problem.
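Treating multi-partitioning as iterative bipartitioning can be sketched as a simple driver loop. This is a hedged outline rather than the thesis's implementation: `multi_partition`, `toy_bipartition` and `toy_fits` are hypothetical names, and the toy stand-ins use a plain size limit in place of a real bipartitioner and real FPGA constraints.

```python
def multi_partition(nodes, bipartition, fits_fpga):
    """Split `nodes` into parts by repeated bipartitioning.

    bipartition(nodes) -> (part, rest), where `part` is meant to fit a device.
    fits_fpga(nodes)   -> True if the node set meets the device constraints.
    Both callables are placeholders supplied by the caller.
    """
    parts = []
    remaining = list(nodes)
    while not fits_fpga(remaining):
        part, rest = bipartition(remaining)
        if not part or not rest:
            raise ValueError("bipartition made no progress")
        parts.append(part)        # this part is mapped onto one FPGA
        remaining = rest          # the remainder is bipartitioned again
    parts.append(remaining)       # the last remainder also fits a device
    return parts


# Toy stand-ins: a device holds at most 4 nodes, and the bipartitioner
# simply cuts off the first 4 nodes.
def toy_bipartition(nodes):
    return nodes[:4], nodes[4:]


def toy_fits(nodes):
    return len(nodes) <= 4


parts = multi_partition(list(range(10)), toy_bipartition, toy_fits)
print(parts)
```

Keeping the bipartitioner and the constraint check as parameters makes it possible to change the balance ratio or the target device between iterations without touching the loop.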
In this thesis, the given circuit is first clustered based on the connectivity among 
different nodes in the circuit. The circuit is represented as a hypergraph. The given 
hypergraph is clustered such that each cluster consists of closely connected nodes. The 
clustered circuit is bipartitioned using the F-M algorithm. The bipartitioning is done such that 
one partition always satisfies the constraints of a given FPGA. The major constraint is the 
number of IO ports available in an FPGA. This iterative bipartitioning is continued until 
both the circuits satisfy the constraints of the given FPGA.
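The quantity that the F-M algorithm uses to choose its moves is the gain of a cell, that is, the reduction in the number of cut nets obtained by moving that cell to the other block. The sketch below computes this gain in the standard way on a made-up four-cell hypergraph; the data, the function names and the dictionary layout are illustrative assumptions, and the bucket structure and locking of the full algorithm are omitted.

```python
def cut_size(side, cells_of):
    """Number of nets with cells on both sides of the bipartition."""
    return sum(1 for cells in cells_of.values()
               if len({side[c] for c in cells}) > 1)


def fm_gain(cell, side, nets_of, cells_of):
    """Reduction in cut size if `cell` is moved to the other block."""
    gain = 0
    here = side[cell]
    for net in nets_of[cell]:
        same = sum(1 for c in cells_of[net] if side[c] == here)
        other = len(cells_of[net]) - same
        if same == 1:
            gain += 1     # cell is alone on its side: the move uncuts the net
        if other == 0:
            gain -= 1     # net lies entirely on this side: the move cuts it
    return gain


# Illustrative hypergraph: net name -> cells on that net
cells_of = {"n1": ["a", "b"], "n2": ["b", "c"],
            "n3": ["c", "d"], "n4": ["a", "c", "d"]}

# Inverse map: cell -> nets it belongs to
nets_of = {}
for net, cells in cells_of.items():
    for c in cells:
        nets_of.setdefault(c, []).append(net)

side = {"a": 0, "b": 0, "c": 1, "d": 1}   # current bipartition
gains = {c: fm_gain(c, side, nets_of, cells_of) for c in side}
print(cut_size(side, cells_of), gains)
```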
It has been found that the bandwidth clustering, proposed earlier, consumes a lot of 
time for clustering the nodes. This is because there is no restriction on the number of 
nodes that can go into one cluster. In order to avoid this problem, a factor was introduced 
in the denominator of the criterion which is equal to the product of the cell sizes of 
the clusters under consideration.
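The effect of the size factor in the denominator can be seen in a short sketch. The connectivity measure is an assumption here: it is taken to be a plain count of shared nets, whereas the bandwidth measure used in the thesis is defined in an earlier chapter. `merge_score` is a hypothetical name.

```python
def merge_score(conn, size_a, size_b):
    """Clustering criterion with a size penalty in the denominator.

    conn   : connectivity between the two clusters (assumed here to be the
             number of nets they share)
    size_* : cell sizes of the two clusters under consideration
    """
    return conn / (size_a * size_b)


# Two candidate merges with the same connectivity but different sizes
small = merge_score(conn=3, size_a=2, size_b=2)
large = merge_score(conn=3, size_a=10, size_b=10)
print(small, large)
```

With equal connectivity, the pair of small clusters scores higher, so large clusters are discouraged from absorbing more nodes.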
It was found that the best cutset in the case of the F-M algorithm occurs for a lower balance 
ratio when the circuit size is very large. Since the resulting partitions are unequal in size, it is better to use 
FPGAs of different sizes instead of the same kind of FPGAs. The other factor that supports this approach 
is the utilization of CLBs. It is found that the CLB utilization can be manipulated using 
this approach by changing the balance ratio.
There is no algorithm or formula that gives us the best ratio cut that can be 
implemented. This is still an open problem. It was seen that the best cuts were closer to 
the 20-80 balance condition. However, there is no mathematical proof attributed to this. This 
can be considered as one item of future work.
Another possible direction for future work is clustering based on delay minimization. It is 
possible to get the timing parameters from the xnf files. If this parameter is introduced 
into the clustering criterion such that there is minimal delay within the cluster, then the delay 
between different partitions is minimal.
Partitioning based on power consumption is also an area to be looked at for future 
work. Here the major constraint will be to minimize the power consumed by each 
partition. By power consumed we mean the power consumed when switching circuits. By 
reducing the power consumed by each partition, each FPGA would consume the minimum 
amount of power.
Thus it was found that the proposed algorithm partitioned the given circuit into 
multiple circuits such that there is minimal interconnection between the circuits. It was 
also noted that the time taken for clustering and partitioning was almost comparable 
with the time taken for partitioning the circuit directly.
BIBLIOGRAPHY
[1] L. Ford and D. Fulkerson, "Maximal flow through a network", Canadian Journal of 
Mathematics, 1956.
[2] B. W. Kernighan and S. Lin, "An efficient heuristic procedure for partitioning graphs", 
Bell System Technical Journal, 1970.
[3] C. M. Fiduccia and R. M. Mattheyses, "A linear-time heuristic for improving network 
partitions", 19th IEEE Design Automation Conference, 1982.
[4] B. Krishnamurthy, "An improved min-cut algorithm for partitioning VLSI networks", 
IEEE Transactions on Computers, 1984.
[5] L. A. Sanchis, "Multiple-way network partitioning", IEEE Transactions on Computers, 
1989.
[6] R. Kuznar, F. Brglez and B. Zajc, "A Unified Cost Model for Min-Cut Partitioning with 
Replication Applied to Optimization of Large Heterogeneous FPGA Partitions", 
EURO-DAC '94, 1994.
[7] C. Kring and A. R. Newton, "A Cell-Replicating Approach to Mincut-Based Circuit 
Partitioning", IEEE ICCAD-91, 1991.
[8] S. Dutt and W. Deng, "VLSI Circuit Partitioning by Cluster Removal Using Iterative 
Improvement Techniques", IEEE ICCAD-96, 1996.
[9] J. Cong and M. Smith, "A Parallel Bottom-up Clustering Algorithm with Applications 
to Circuit Partitioning in VLSI Designs", Proc. ACM/IEEE 30th Design Automation 
Conference, 1993.
[10] S. Kirkpatrick, C. D. Gelatt Jr. and M. P. Vecchi, "Optimization by Simulated 
Annealing", Science, 1983.
[11] Y. Wei and C. Cheng, "Ratio cut partitioning for hierarchical design", IEEE 
Transactions on Computer-Aided Design, 1991.
[12] C. J. Alpert and So-Zen Yao, "Spectral Partitioning: The More Eigenvectors, The 
Better", 32nd DAC, ACM/IEEE, 1995.
[13] D. M. Schuler and E. G. Ulrich, "Clustering and Linear Placement", Design 
Automation Conference, 1972.
[14] J. Garbers, H. J. Promel and A. Steger, "Finding Clusters in VLSI Circuits", 
International Conference on Computer-Aided Design, 1990.
[15] K. Roy and C. Sechen, "A Timing Driven N-Way Chip and Multi-Chip Partitioner", 
IEEE/ACM International Conference on Computer Aided Design, 1993.
[16] G. Karypis, R. Aggarwal, V. Kumar, and S. Shekhar, "Multilevel Hypergraph 
Partitioning: Application in VLSI Domain", Design Automation Conference, 1997.
[17] S. Hauck, G. Borriello and C. Ebeling, "Logic Partition Orderings for Multi-FPGA 
Systems", International Symposium on Field-Programmable Gate Arrays, 1995.
[18] Goldberg and M. Burstein, "Heuristic Improvement Technique for Bisection of 
VLSI Networks", ICCAD, 1983.
[19] T. Bui, C. Heigham, C. Jones and T. Leighton, "Improving the performance of the 
Kernighan-Lin and Simulated Annealing Graph Bisection Algorithms", DAC, 1989.
[20] M. Garey and D. Johnson, "Computers and Intractability: A Guide to the Theory of 
NP Completeness", W. H. Freeman & Company, 1979.
APPENDIX
BENCHMARK FILE FORMAT
Net Format
Each netlist header has five entries, which are
ignored
#Pins
#Nets
#Modules
pad offset
Figure 1 Simple netlist
The list of nets follows. Each net is simply a subset of modules which are either cells 
or pads. Cells are numbered from 0 to pad offset (inclusive). Pads are numbered from 1 to 
(#Modules - pad offset - 1). (Please do not blame me for the unintuitive numbering 
scheme). Cells are prefaced by an "a", pads by a "p". The beginning of each net is 
denoted by an "s". For the above example with 4 cells, 3 pads, 5 nets and 13 pins, the net 
file is given by 
0
13
5
7
3
p1 s 1
a0 l
a1 l
a0 s 1
a2 l
a3 l
a1 s 1
a2 l
a3 l
a2 s 1
p2 l
a3 s 1
p3 l
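A reader for the .net layout described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: `parse_net` is a hypothetical function, every line after the five header entries is assumed to start with a module name, and any line whose second token is not "s" is treated as a continuation of the current net.

```python
def parse_net(text):
    """Parse a .net benchmark into (header, nets).

    header is a dict of the five header entries; nets is a list of lists
    of module names, one inner list per net.
    """
    lines = [ln.split() for ln in text.splitlines() if ln.strip()]
    keys = ("ignored", "pins", "nets", "modules", "pad_offset")
    header = {k: int(tokens[0]) for k, tokens in zip(keys, lines[:5])}
    nets = []
    for tokens in lines[5:]:
        module = tokens[0]
        marker = tokens[1] if len(tokens) > 1 else "l"
        if marker == "s":
            nets.append([module])      # "s" starts a new net
        else:
            nets[-1].append(module)    # otherwise the module joins the current net
    return header, nets


# The example netlist: 4 cells, 3 pads, 5 nets and 13 pins
EXAMPLE = """0
13
5
7
3
p1 s 1
a0 l
a1 l
a0 s 1
a2 l
a3 l
a1 s 1
a2 l
a3 l
a2 s 1
p2 l
a3 s 1
p3 l
"""

header, nets = parse_net(EXAMPLE)
print(header)
print(nets)
```

Comparing the number of parsed nets and pins with the header entries is a cheap consistency check on a benchmark file.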
NetD Format
.netD is just like .net except that each module in a net is labeled as an input (I), output 
(O) or bidirectional (B) pin for that net. In other words, if a module is labeled with an I, it 
is a sink for the net; if it is labeled with an O, it is a source for the net. This can enable 
one to deduce signal directions over the circuit. The .netD file for the example is given by 
0
13
5
7
3
p1 s O
a0 l I
a1 l I
a0 s O
a2 l I
a3 l I
a1 s O
a2 l I
a3 l I
a2 s O
p2 l I
a3 s O
p3 l I
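Deducing signal directions from the .netD labels can be sketched as follows. The function name `parse_netd_body` is hypothetical, and the sketch assumes that the direction label (I, O or B) is the last token on each line and that the second token is "s" on the first line of a net; the five header entries are left out.

```python
def parse_netd_body(lines):
    """Group the modules of each net by direction label.

    Returns one dict per net with keys "O" (sources), "I" (sinks)
    and "B" (bidirectional pins).
    """
    nets = []
    for ln in lines:
        tokens = ln.split()
        if not tokens:
            continue
        module, marker, direction = tokens[0], tokens[1], tokens[-1]
        if marker == "s":
            nets.append({"O": [], "I": [], "B": []})   # start of a new net
        nets[-1][direction].append(module)
    return nets


# Net list of the example, without the five header entries
BODY = """p1 s O
a0 l I
a1 l I
a0 s O
a2 l I
a3 l I
a1 s O
a2 l I
a3 l I
a2 s O
p2 l I
a3 s O
p3 l I
"""

netd = parse_netd_body(BODY.splitlines())
drivers = [net["O"][0] for net in netd]   # the source module of each net
print(drivers)
```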
Xnf Format
In this format each cell is defined very clearly. Each node here has a name and 
type. This document is a description of a netlist file format for describing Logic Cell 
Array (LCA) designs.
PIN, I1, I, 7gat
PIN, I0, I, 110
END
SYM, INSI22, NOR
PIN, O, O, 17
PIN, I1, I, 118_1
PIN, I0, I, 113
END
SYM, INSI24, NAND
PIN, O, O, 18
PIN, I1, I, 3gat
PIN, I0, I, 1gat
END
SYM, INSI26, NAND
PIN, O, O, 110
PIN, I1, I, 6gat
PIN, I0, I, 3gat
END
SYM, INSI28, NAND
PIN, O, O, 112
PIN, I1, I, 110
PIN, I0, I, 2gat
END
SYM, INSI30, INV
PIN, O, O, 23gat
PIN, I0, I, 17
END
SYM, INSI32, INV
PIN, O, O, 113
PIN, I0, I, 112
END
SYM, PINI33, IBUF
PIN, O, O, 1gat
PIN, I, I, 1gat_EXTERN
END
SYM, PINI34, OBUF
PIN, O, O, 22gat_EXTERN
PIN, I, I, 22gat
END
SYM, PINI35, OBUF
PIN, O, O, 23gat_EXTERN
PIN, I, I, 23gat
END
SYM, PINI36, IBUF
PIN, O, O, 2gat
PIN, I, I, 2gat_EXTERN
END
SYM, PINI37, IBUF
PIN, O, O, 3gat
PIN, I, I, 3gat_EXTERN
END
SYM, PINI38, IBUF
PIN, O, O, 6gat
PIN, I, I, 6gat_EXTERN
END
SYM, PINI39, IBUF
PIN, O, O, 7gat
PIN, I, I, 7gat_EXTERN
END
EOF
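Records of this kind can be read into a hypergraph by collecting, for each net name, the symbols whose pins attach to it. The sketch below is an illustration only: `xnf_to_nets` is a hypothetical function, it handles just the SYM, PIN and END records shown in the listing, it assumes the net name is the fourth comma-separated field of a PIN record, and the sample records are adapted from the listing.

```python
from collections import defaultdict


def xnf_to_nets(text):
    """Map each net name to the set of symbols connected to it."""
    nets = defaultdict(set)
    current = None                       # symbol whose pins are being read
    for raw in text.splitlines():
        fields = [f.strip() for f in raw.split(",")]
        if fields[0] == "SYM":
            current = fields[1]          # SYM, <name>, <type>
        elif fields[0] == "PIN" and current is not None:
            nets[fields[3]].add(current) # PIN, <pin>, <direction>, <net>
        elif fields[0] == "END":
            current = None
    return dict(nets)


SAMPLE = """SYM, INSI30, INV
PIN, O, O, 23gat
PIN, I0, I, 17
END
SYM, PINI35, OBUF
PIN, O, O, 23gat_EXTERN
PIN, I, I, 23gat
END
EOF
"""

xnf_nets = xnf_to_nets(SAMPLE)
print(xnf_nets)
```

In this sample only the net 23gat connects two symbols; nets attached to a single symbol would not qualify as hyperedges for partitioning.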
VITA
Graduate College 
University of Nevada, Las Vegas
Girish Cherussery
Local address:
1165 Maryland Circle #1 
Las Vegas, NV 89119
Home Address:
59-1 Bharathidasan Colony,
K.K.Nagar,
Chennai -  600 078 
India
Degrees:
Bachelor of Engineering, Electrical Engineering, 2000.
University o f Madras, India
Thesis Title:
Partitioning a Given Circuit into Multiple FPGAs
Publications:
Girish Cherussery, Henry Selvaraj and Venkatesan Muthukumar, "A Bi-level 
Partitioning of a Circuit into Multi-FPGAs", EUROMICRO 2002.
Thesis Examination Committee:
Chairperson, Dr. Henry Selvaraj, Ph.D.
Committee Member, Dr. Shahram Latifi, Ph.D.
Committee Member, Dr. Venkatesan Muthukumar, Ph.D. 
Graduate Faculty Representative, Dr. Wolfgang Bein, Ph.D.
