University of Windsor

Scholarship at UWindsor
Electronic Theses and Dissertations

Theses, Dissertations, and Major Papers

2005

Efficient quadratic placement for FPGAs.
Yonghong Xu
University of Windsor

Recommended Citation
Xu, Yonghong, "Efficient quadratic placement for FPGAs." (2005). Electronic Theses and Dissertations. 1882.
https://scholar.uwindsor.ca/etd/1882

This online database contains the full-text of PhD dissertations and Masters’ theses of University of Windsor
students from 1954 forward. These documents are made available for personal study and research purposes only,
in accordance with the Canadian Copyright Act and the Creative Commons license—CC BY-NC-ND (Attribution,
Non-Commercial, No Derivative Works). Under this license, works must always be attributed to the copyright holder
(original author), cannot be used for any commercial purposes, and may not be altered. Any other use would
require the permission of the copyright holder. Students may inquire about withdrawing their dissertation and/or
thesis from this database. For additional inquiries, please contact the repository administrator via email
(scholarship@uwindsor.ca) or by telephone at 519-253-3000 ext. 3208.

Efficient Quadratic Placement for FPGAs
By

Yonghong Xu

A Thesis
Submitted to the Faculty of Graduate Studies and Research through the
Department of Electrical and Computer Engineering in Partial Fulfillment
of the Requirements for the Degree of Master of Applied Science at
The University of Windsor

Windsor, Ontario, Canada
2005

© 2005 Yonghong Xu

Reproduced with permission of the copyright owner. Further reproduction prohibited without permission.


Library and Archives Canada
Published Heritage Branch
395 Wellington Street
Ottawa ON K1A 0N4
Canada

ISBN: 0-494-11520-3

NOTICE:
The author has granted a non-exclusive license allowing Library and Archives Canada to
reproduce, publish, archive, preserve, conserve, communicate to the public by
telecommunication or on the Internet, loan, distribute and sell theses worldwide, for
commercial or non-commercial purposes, in microform, paper, electronic and/or any other
formats.

The author retains copyright ownership and moral rights in this thesis. Neither the thesis
nor substantial extracts from it may be printed or otherwise reproduced without the
author's permission.

In compliance with the Canadian Privacy Act some supporting forms may have been removed
from this thesis. While these forms may be included in the document page count, their
removal does not represent any loss of content from the thesis.

Abstract

Field Programmable Gate Arrays (FPGAs) are widely used in industry because they
can implement any digital circuit on site simply by specifying programmable logic and
their interconnections. However, this rapid prototyping advantage may be adversely
affected because of the long compile time, which is dominated by placement and routing.
This issue is of great importance, especially as the logic capacities of FPGAs continue to
grow.
This thesis focuses on the placement phase of FPGA Computer Aided Design (CAD)
flow and presents a fast, high quality, wirelength-driven placement algorithm for FPGAs
that is based on the quadratic placement approach.
In this thesis, multiple iterations of equation solving process together with a linear
wirelength reduction technique are introduced. The proposed algorithm efficiently handles
the main problems with the quadratic placement algorithm and produces a fast and high
quality placement. Experimental results, using twenty benchmark circuits, show that this
algorithm can achieve comparable total wirelength and, on average, 5X faster run time
when compared to an existing, state-of-the-art placement tool.
This thesis also shows that the proposed algorithm delivers promising preliminary
results in minimizing the critical path delay while maintaining high placement quality.

Acknowledgements

I would like to express my sincere appreciation to my supervisor Dr. Mohammed A. S.
Khalid for his guidance and encouragement. He introduced me to this interesting research
area and guided me throughout my thesis with great patience. I would like to thank the
members of my thesis committee, Dr. C. Chen and Dr. W. Altenhof for their advice
regarding the research process and their assistance in the preparation of this thesis. Here, I
would also like to thank Dr. S. Erfani and Dr. K. Tepe for their help during my Masters
program at the University of Windsor.
I am grateful to my wife who accompanied me all these years. Her understanding and
support made things easier for me. I am indebted to my parents. They gave me the
opportunity to pursue my own life. I would not have reached this milestone in my life
without their continuous encouragement and support.
The financial support provided by the Ontario Graduate Scholarship Program (OGS) is
gratefully acknowledged.

Table of Contents

Abstract
Acknowledgements
List of Figures
List of Tables
Abbreviations
Chapter 1: Introduction
1.1 VLSI Design Styles
1.2 Motivation
1.3 Research Approach
1.4 Thesis Organization
Chapter 2: Background and Previous Work
2.1 FPGA Architecture
2.2 FPGA Design Flow
2.3 Definition of FPGA Placement Problem
2.3.1 Half-Perimeter Bounding Box Wirelength Model
2.4 Placement Algorithms for FPGAs
2.4.1 VPR Placement Algorithm
2.4.2 PPFF: Partitioning-based Placement for FPGAs
2.4.3 Ultra-Fast Placement
2.5 Quadratic Placement Techniques
2.5.1 Essentials of Quadratic Placement
2.5.2 Linear Equation Solver
2.6 Quadratic Placement Algorithm Examples
2.6.1 GORDIAN
2.6.2 GORDIAN-L
2.6.3 FastPlace
2.7 Summary
Chapter 3: A Quadratic Placement Algorithm for FPGAs
3.1 Two Main Problems in Quadratic Placement
3.2 Overview of the QPF Algorithm
3.3 The Node Mapping Process
3.4 The Expansion Process
3.5 Linear Adjustment
3.6 Low Temperature Simulated Annealing
3.7 Computational Complexity
3.8 Summary
Chapter 4: Experimental Results and Analysis
4.1 Experimental Evaluation Environment
4.2 Effects of Key Parameters and Techniques on Placement Quality
4.2.1 Compensation Factor
4.2.2 Expansion
4.2.3 Linear Adjustment
4.2.4 Starting Temperature
4.3 Comparison Between QPF and VPR
4.4 Quality and Time Tradeoff
4.5 Critical Path Delay Comparison with VPR
4.6 Summary
Chapter 5: Conclusions and Future Work
5.1 Conclusions and Contributions
5.2 Future Work
Reference
Appendix A: Basic Data Structures
VITA AUCTORIS

List of Figures

Figure 1.1 VLSI Design Styles
Figure 2.1 FPGA Architecture
Figure 2.2 Programmable Connection Box and Switch Box
Figure 2.3 Structures of (a) Basic Logic Element and (b) Cluster Logic Block
Figure 2.4 FPGA Design Flow
Figure 2.5 Half-Perimeter WireLength Model
Figure 2.6 Pseudo-codes for Basic Simulated Annealing Placement-based Algorithm
Figure 2.7 Illustration of the Terminal Alignment Technique
Figure 2.8 Abstract View of Multi Level Clustering
Figure 2.9 Pseudo-codes of Preconditioned Conjugated Gradient Method
Figure 3.1 Illustration of Overlap Problem for Quadratic Placement
Figure 3.2 Linear vs. Squared Wirelength: a) Squared Objective Model, b) Linear Objective Model
Figure 3.3 Overview of QPF Algorithm
Figure 3.4 Overview of Mapping
Figure 3.5 Balance Sequence
Figure 3.6 Pseudo-codes of Node Mapping Process(P)
Figure 3.7 Overview of Expansion Process
Figure 3.8 Add a Dummy Node a) Origin Node Position & Dummy Node b) Reference Node Position After Mapping c) Result of One Iteration of Expansion
Figure 3.9 Wirelength Contribution of One Node Connected to Three Nets
Figure 3.10 Linear Adjustment Used in a) Stage 1, b) Stage 2
Figure 4.1 CPU Time Comparison vs. Number of Blocks
Figure 4.2 The Relationship between Normalized Wirelength Penalty and Time Consumption in QPF Algorithm
Figure 4.3 Quality vs. Time Tradeoff Comparisons between QPF and VPR

List of Tables

Table 2.1 Compensation Factors for Net with Less Than 50 Terminals
Table 2.2 VPR Temperature Update Schedule
Table 4.1 The Wirelength Comparison of Original Quadratic Placement Before and After the Compensation Factor is Applied
Table 4.2 The Effect of One Iteration of Expansion Process on Placement Quality
Table 4.3 Effect of One Linear Adjustment Step on Placement Quality
Table 4.4 Effect of Different Starting Temperatures on Placement Quality
Table 4.5 Comparison of Placement Results Obtained by QPF and VPR
Table 4.6 Comparison between QPF with Timing-driven Refinement and Timing-driven VPR

Abbreviations

ASIC: Application-Specific Integrated Circuit
BLE: Basic Logic Element
CAD: Computer-Aided Design
CG: Conjugated Gradient
CLB: Configurable Logic Block
CPLD: Complex Programmable Logic Device
FPGA: Field-Programmable Gate Arrays
HDL: Hardware Description Language
HPWL: Half-Perimeter WireLength
LUT: Look Up Table
MCNC: Microelectronics Centre of North Carolina
MPGA: Mask-Programmed Gate Arrays
NP: Non-deterministic Polynomial-time
PLD: Programmable Logic Device
QPF: Quadratic Placement for FPGAs
VLSI: Very Large Scale Integration
VPR: Versatile Placement and Routing tool for FPGAs

Chapter 1
Introduction

Advances in microelectronics and VLSI (Very Large Scale Integration) circuits have
contributed to the tremendous growth and pervasiveness of the global electronics industry
in the past few decades. One rapidly growing area of microelectronics is Field
Programmable Gate Arrays (FPGAs). FPGAs are widely used for implementing digital
circuits because they offer moderately high levels of integration and the ability to program
and reprogram the chip in the field by the end user. A Computer Aided Design (CAD) tool
suite is needed for mapping user designs onto the FPGA chip. This mapping CAD flow
consists of a series of steps: logic synthesis, technology mapping, placement and routing.
The subject of this thesis is efficient quadratic placement for FPGAs.

1.1 VLSI Design Styles
Usually there are several design styles that can be considered for implementing a
digital system. As illustrated in Figure 1.1, each design style has its own merits and
shortcomings [1].
In Full Custom design, major parts of the chip layout are designed from scratch without
utilizing any cell libraries. This can create compact and power/area efficient chips.
However, the development cost of such a design style is becoming prohibitively high, so
it is usually used only for the design of high-volume products such as memory chips,
high-performance microprocessors and FPGA masters.

Standard Cell and Mask-Programmed Gate Arrays (MPGA) are semi-custom
designs. Users work on pre-developed cells to implement their circuits. In Standard
Cell design, all of the commonly used logic cells are developed, characterized, and stored
in a standard cell library. Once the circuit is mapped, these cells are arranged in horizontal
rows within the chip boundary. The spaces between these rows are used to implement the
interconnections between the cells. An MPGA device consists of an array of uncommitted
elements (usually rows of transistors), and most of the mask layers are pre-defined by the
manufacturer. Users define the final metal layers to connect the transistors in the array, so
as to implement their designs.

[Figure 1.1: tree diagram. VLSI Design divides into Full Custom, Semi-Custom (Standard Cell, MPGA), and Programmable Logic Device (PLD) (CPLD, FPGA).]

Figure 1.1 VLSI Design Styles

The Programmable Logic Device (PLD) design style provides users with a pre-fabricated
array of programmable logic and interconnections [2]. Users can configure the final logic
structure of the device by themselves, so no fabrication step is needed in this design style.
Complex Programmable Logic Devices (CPLDs) and Field Programmable Gate Arrays
(FPGAs) differ in their structure and logic resources. CPLDs mainly
consist of two levels of programmable logic, an AND plane and an OR plane, and a wide
number of inputs, while FPGAs have a more general structure that allows high logic
capacity and flexibility (details of FPGA architectures are presented in Chapter 2).
Since no physical manufacturing step is necessary for customizing the FPGA chip, a
functional sample can be obtained almost as soon as the design is mapped into a specific
FPGA chip. This gives FPGAs significant advantages over the custom designs.

1.2 Motivation
To implement a digital circuit with FPGAs, a set of CAD tools is needed to map a
user design into the bitstream file that is required to configure the FPGA chip. These tools
first transform the circuit description (expressed using a hardware description language such
as VHDL or Verilog) into a netlist of technology-mapped logic blocks and their
connections. Then placement and routing steps are done so that these logic blocks are
assigned to physical locations on the FPGA and interconnected correctly. Finally this
location and connection information is transformed into a bitstream file that can be
downloaded to the FPGA. This thesis focuses on the placement part of this process.
Quality and speed are two key metrics for evaluating the goodness of a CAD tool. Now
that recently announced FPGAs contain the equivalent of 40 million gates, the compile
time is more important than ever. For some large circuits, this compile time, which is
dominated by placement and routing, can be on the order of tens of CPU hours.
For a placement tool, high quality means that we can implement more digital logic in
an FPGA chip and may also achieve better circuit performance by utilizing the placement
tool. Fast speed means we can get the desired placement within a shorter time, so that the
turnaround time of our products will be reduced as well. Unfortunately, the current placement
tools that provide high quality solutions require a large amount of CPU time, while other
tools with fast speed give poor quality. There is a great need for placement tools that run in a
reasonable amount of CPU time while still generating high quality solutions.

1.3 Research Approach
Our research goal was to find a fast algorithm for placement that produces high quality
results. We create a fast placement tool based on the quadratic technique and integrate it into
the Versatile Placement and Routing tool for FPGAs (VPR) [3][4], which is a well known,
high quality placement and routing tool for FPGAs. We evaluate our placement tool with
respect to VPlace (the placement part of VPR) based on both runtime and placement
quality by running both algorithms on the same computation platform with the same suite
of benchmark circuits and the same FPGA architecture.
VPlace is based on the simulated annealing algorithm that is widely used in
academia and industry. It starts with a random initial placement and improves it by a large
number of swaps and moves of nodes. This algorithm can achieve the best placement
quality, but only with a large amount of time. It spends a lot of time on examining the poor
initial placement, which does not contribute significantly to the final result.
Our efficient placement algorithm uses the quadratic technique to create a reasonably
good placement within a very short time, and only uses the simulated annealing algorithm to
refine the final placement. It combines the fast speed of the quadratic based placement
algorithm and the high quality of the simulated annealing based algorithm.
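
To make the refinement idea concrete, the following is a minimal sketch of annealing that starts from an existing placement and a low temperature instead of a random placement and a high temperature. It is not the VPlace or QPF implementation: the function names (`anneal_refine`, `chain_cost`), the swap-only move set, the geometric cooling schedule, and all parameter values are assumptions chosen for illustration.

```python
import math
import random


def anneal_refine(placement, cost, t_start=0.5, t_end=0.01, alpha=0.8,
                  moves_per_temp=100, seed=0):
    """Refine an existing placement by swapping pairs of blocks.

    placement: dict mapping block name -> (x, y) location (needs >= 2 blocks).
    cost: callable taking such a dict and returning a number.
    Returns the best placement seen and its cost. All defaults are
    illustrative assumptions, not values taken from the thesis.
    """
    rng = random.Random(seed)
    current = dict(placement)          # work on a copy, leave the input intact
    blocks = sorted(current)
    current_cost = cost(current)
    best, best_cost = dict(current), current_cost
    t = t_start
    while t > t_end:
        for _ in range(moves_per_temp):
            a, b = rng.sample(blocks, 2)
            current[a], current[b] = current[b], current[a]      # trial swap
            new_cost = cost(current)
            delta = new_cost - current_cost
            # Accept improvements always, worsening moves with probability
            # exp(-delta / t), which is small when t is low.
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                current_cost = new_cost
                if current_cost < best_cost:
                    best, best_cost = dict(current), current_cost
            else:
                current[a], current[b] = current[b], current[a]  # undo swap
        t *= alpha                     # geometric cooling (assumed schedule)
    return best, best_cost


def chain_cost(order):
    """Toy cost: total Manhattan length of the chain order[0]-order[1]-..."""
    def cost(p):
        return sum(abs(p[u][0] - p[v][0]) + abs(p[u][1] - p[v][1])
                   for u, v in zip(order, order[1:]))
    return cost


start = {"a": (0, 0), "b": (3, 3), "c": (0, 3), "d": (3, 0)}
refined, refined_cost = anneal_refine(start, chain_cost(["a", "b", "c", "d"]))
print(refined_cost)
```

Because the search starts from a placement that is already reasonable, a low starting temperature is enough: most worsening swaps are rejected and the loop mainly performs local clean-up.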

1.4 Thesis Organization
The thesis is organized as follows:
Chapter 2 contains an introduction to FPGA architectures and the generic CAD procedure for
FPGA-based designs. Then we give the definition of the FPGA placement problem. In this
chapter we also describe the previous work on FPGA placement, as well as some related
topics of the quadratic technique.
In Chapter 3, we first describe the common problems with the quadratic placement
method. Then we introduce our proposed placement algorithm and explain in detail how we
deal with these problems. We also give a time complexity analysis of our proposed
algorithm.

Chapter 4 presents the results obtained from running our tool on a suite of large
benchmark circuits, using a simple and general FPGA architecture. We show the effects of
key techniques in our proposed algorithm on placement quality, by presenting and
analyzing relevant experimental data. We also provide a comparison between our tool and
VPR, which is a well-known, high-quality placement and routing tool for FPGAs.
Chapter 5 highlights the key results from our research, and proposes possible
directions for future research in this area.

Chapter 2
Background and Previous Work

This chapter presents the background of placement for FPGAs and the previous work
done in this area. The FPGA architecture is described in Section 2.1. The general FPGA
design flow is discussed in Section 2.2 and the definition of the FPGA placement problem is
given in Section 2.3. Some placement algorithms developed for FPGAs are summarized in
Section 2.4. Essentials of quadratic placement and related techniques are discussed in
Section 2.5. Finally, Section 2.6 gives an overview of well-known quadratic algorithms used
in ASIC placement.

2.1 FPGA Architecture
An FPGA is a completely re-configurable logic chip. Similar to traditional hardwired
gate arrays, the chip consists of an array of logic elements. In traditional gate arrays,
these gates are specified and interconnected at the manufacturing stage. The FPGA differs
in that it can be programmed and re-programmed by the users. Although there is a wide
variety of architectures for commercial FPGAs, all FPGAs are composed of three
fundamental components: logic blocks, I/O blocks and programmable routing resources. A
circuit is implemented in an FPGA by programming each of the logic blocks to implement
a small part of the logic of the circuit, and the I/O blocks serve as the input and output pads
of the circuit. The programmable routing resources are used to connect these logic blocks
and I/O blocks together as required by the circuit. Figure 2.1 shows the FPGA architecture
model used in this thesis. This FPGA architecture is also used by VPR, and many
researchers and CAD tools also employ this model as their prototype [3][5].

[Figure 2.1: FPGA architecture diagram with labels I/O Pad, Switch Box (S), Connection Box (C), and Routing Channel.]

Figure 2.1 FPGA Architecture
As illustrated in Figure 2.1, the I/O blocks are placed around the chip edges, and the
logic blocks are scattered over the chip region like islands. Between the logic blocks are the
routing resources, which include switch boxes, connection boxes and routing segments. A
connection box can be programmed to connect a CLB to routing channels. A switch box
is a switch matrix that is used to connect wires in one channel to the wires in another
channel. Figure 2.2 shows how the connection box and switch box are used to connect the
logic blocks as required by the circuit.
Logic blocks are constructed by using Look Up Tables (LUTs) and D flip-flops.
Usually, we refer to a LUT and a D flip-flop combination as a Basic Logic Element (BLE)
and a logic block can contain one or more such BLEs. Figure 2.3 shows the structure of
logic blocks. In this thesis, we will use logic blocks that contain one BLE.
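
As an illustration of what a BLE computes, here is a minimal behavioral sketch of a K-input LUT paired with a D flip-flop. The class name `BLE`, the bit ordering of the truth table, and the `registered` configuration flag (which selects between the LUT output and the flip-flop output) are assumptions made for this example, not definitions from the thesis.

```python
from dataclasses import dataclass


@dataclass
class BLE:
    """Toy basic logic element: one K-input LUT plus one D flip-flop."""
    k: int                    # number of LUT inputs
    truth_table: list         # 2**k output bits, indexed by the input pattern
    registered: bool = False  # assumed config bit: True -> output from flip-flop
    state: int = 0            # value currently held by the flip-flop

    def __post_init__(self):
        if len(self.truth_table) != 2 ** self.k:
            raise ValueError("truth table must have 2**k entries")

    def lut(self, inputs):
        """Combinational LUT output for a sequence of k input bits."""
        if len(inputs) != self.k:
            raise ValueError("expected k inputs")
        index = 0
        for bit in inputs:    # first input is the most significant index bit
            index = (index << 1) | (bit & 1)
        return self.truth_table[index]

    def clock(self, inputs):
        """Clock edge: the flip-flop captures the current LUT output."""
        self.state = self.lut(inputs)

    def output(self, inputs):
        """BLE output: registered or combinational, chosen by configuration."""
        return self.state if self.registered else self.lut(inputs)


and_gate = BLE(k=2, truth_table=[0, 0, 0, 1])   # 2-input AND, unregistered
print(and_gate.output((1, 1)))
```

Programming a logic block then amounts to choosing the truth table and the configuration bit; the placement problem treats each such block as a single movable node.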

[Figure 2.2: diagram of a programmable switch block and connection block.]

Figure 2.2 Programmable Connection Box and Switch Box

[Figure 2.3: (a) Basic logic element: inputs feed a K-input LUT followed by a D flip-flop with a clock input, producing Out. (b) Cluster logic block: several BLEs sharing inputs and a clock, producing the outputs.]

Figure 2.3 Structures of (a) Basic Logic Element and (b) Cluster Logic Block

2.2 FPGA Design Flow
To implement a circuit in a modem FPGA chip, a sequence o f mapping steps are
needed. Typically users o f FPGAs describe the circuit using a hardware description
language or schematic input. The CAD tools take this description and compile it into a
bitstream file programs the FPGA chip to implement the desired circuit. The FPGA design
flow is shown in Figure 2.4.

[Figure 2.4: flow chart. Design Entry -> Synthesis & Logic Optimization (SIS) -> Technology Map to LUT (FlowMap) -> Logic Block Packing -> Placement -> Routing -> FPGA bitstream file.]

Figure 2.4 FPGA Design Flow

The design entry is the input circuit that the users developed by using a hardware
description language (HDL) such as VHDL or Verilog. Alternatively, a state
machine description language or a schematic can be used.
The first stage of the design flow converts the circuit description into a netlist of
basic gates. Technology-independent logic optimization removes the redundant logic
wherever possible [6]. Then the optimized netlist of basic gates is mapped to look-up
tables.

Logic block packing combines the LUTs and registers into Configurable Logic Blocks
(CLBs). In this phase, connected LUTs are packed together so that the number of signals to
be routed between CLBs is minimized. A CLB might contain one or more BLEs depending
on the architecture of the FPGA [3].
The placement process determines the physical location that every logic block of the
circuit should be assigned to. The optimization goal of placement is to place
connected logic blocks close together so that the required wire length is minimized.
Sometimes other objectives such as wiring density [7] and circuit delay [8][4] are also
considered.
The routing process determines how the routing resources are programmed to connect all
the logic blocks as required by the circuit. FPGA routers can be divided into combined
global-detailed routers [9], which determine a complete routing path in one step, and
two-step routers, which first perform global routing and then detailed routing.
Once the routing process is completed, a CAD tool will create a bitstream file
according to the target FPGA architecture. When this file is downloaded to the target
FPGA, it configures the logic blocks and routing resources of the target FPGA to
implement the desired digital circuit.

2.3 Definition of FPGA Placement Problem
Placement problems can be formulated as optimization and constraint satisfaction
problems. The input netlist that represents the nodes and their interconnections in a logic
circuit is described by a hyper-graph. The vertices of the hyper-graph represent circuit
elements (logic blocks for FPGAs). Placement can be defined as follows [10]:
Given an electrical circuit consisting of modules (logic blocks) with predefined input
and output terminals and interconnected in a predefined way, construct a layout (physical
location) indicating the position of the modules so that the estimated wire length and layout
area are minimized or other constraints are met.
More formally, the placement problem can be expressed as [11]:

• Given: A hyper-graph $G = (V, E)$ representing the circuit, where $V$ is the set of
vertices (logic blocks) and $E$ is the set of edges (nets), with edge weight
$w(e) \in (0, +\infty)$ for each $e \in E$; an FPGA grid of size $r \times s$, where $r, s \in \mathbb{N}$
and $r \times s \ge n$, with $n$ the number of nodes.
• Find: A placement mapping $p : V \to [1, r] \times [1, s]$ of blocks to physical locations
on the FPGA grid.
• Minimize: A cost function $c(p)$.
The most commonly used cost function for FPGA placement is the total wire length
required to complete the routing. The cost of a device is proportional to the amount of
silicon required to implement it, so if we can minimize the total wire length used to
route the circuit, the area required for the circuit is also minimized. Hence we can use a
smaller (and cheaper) FPGA to implement the circuit. A placement that strives to minimize
the total wire length is referred to as wirelength-driven placement. Other objectives can
also be added to the cost function. For example, placement can minimize the length of a
critical path to meet timing constraints, referred to as timing-driven placement [8][4],
or balance the wire density across the FPGA device, referred to as routability-driven
placement [7].
In this thesis, we use wirelength minimization as our cost function. This is the first
step in FPGA placement research, since timing-driven and routability-driven placement
algorithms also strive to minimize the total wirelength.

2.3.1 Half-Perimeter Bounding Box Wirelength Model
As described earlier, the placement stage only determines the physical location of every
logic element of the circuit. We do not know how these elements will be connected in the
routing stage, so we do not know the actual total wirelength of the current placement. We
have to use an approximation to estimate the total wirelength and use it as a metric to
evaluate any given placement. Various approximation techniques are available
[12][13][14][15], and the Half-Perimeter Wire Length (HPWL) model is the most widely used
method [10].
Figure 2.5 shows how the half-perimeter wirelength model works. In Figure 2.5, Net(i)
has 8 terminals and is placed on an FPGA chip. The bounding box is defined as the smallest
rectangle that covers the net.

Figure 2.5 Half-Perimeter WireLength Model (labels: bounding rectangle; Net(i), 8 terminals)

The half-perimeter wire length is defined as:
$HPWL_{net(i)} = (\max(x_b) - \min(x_b) + 1) + (\max(y_b) - \min(y_b) + 1), \quad b \in net(i)$
In this case, a 5x4 bounding box covers the 8-terminal Net(i), so HPWL = 5 + 4 = 9.
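The definition above can be sketched in a few lines of Python. This is only an illustration: the function name `hpwl` and the terminal coordinates are hypothetical, chosen so that the bounding box spans 5 columns and 4 rows as in the example.

```python
def hpwl(terminals):
    """Half-perimeter wirelength of one net, with the +1 convention used above.

    `terminals` is a list of (x, y) grid coordinates of the blocks on the net.
    """
    xs = [x for x, _ in terminals]
    ys = [y for _, y in terminals]
    return (max(xs) - min(xs) + 1) + (max(ys) - min(ys) + 1)


# Hypothetical 8-terminal net: columns 1..5 and rows 1..4 are occupied.
net_i = [(1, 1), (5, 4), (2, 3), (3, 1), (4, 2), (1, 4), (5, 2), (3, 3)]
print(hpwl(net_i))  # 5 + 4 = 9
```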
When the number of terminals of a net is less than or equal to three, HPWL is an
accurate estimate of the actual wire length. Otherwise, HPWL underestimates the wire
length required to connect all terminals of the net. A factor q(i) [16][3] is introduced
to compensate for this underestimation when the number of net terminals is greater than
three. For nets with three or fewer terminals, q(i) is 1. It slowly increases to 2.79 for
nets with 50 terminals, as shown in Table 2.1. For nets that have more than 50 terminals,
the value of q(i) increases linearly as [3]:
$q(i) = 2.7933 + 0.02616 \times (NumOfTerminals - 50)$    (2.1)

Table 2.1 Compensation Factors for Nets with 50 or Fewer Terminals

#Term  q(i)      #Term  q(i)      #Term  q(i)      #Term  q(i)      #Term  q(i)
1      1.0       11     1.4974    21     1.9288    31     2.2646    41     2.5610
2      1.0       12     1.5455    22     1.9652    32     2.2958    42     2.5864
3      1.0       13     1.5937    23     2.0015    33     2.3271    43     2.6117
4      1.0828    14     1.6418    24     2.0379    34     2.3583    44     2.6371
5      1.1536    15     1.6899    25     2.0743    35     2.3895    45     2.6625
6      1.2206    16     1.7304    26     2.1061    36     2.4187    46     2.6887
7      1.2823    17     1.7709    27     2.1379    37     2.4479    47     2.7148
8      1.3385    18     1.8114    28     2.1698    38     2.4772    48     2.7410
9      1.3991    19     1.8519    29     2.2016    39     2.5064    49     2.7671
10     1.4493    20     1.8924    30     2.2334    40     2.5356    50     2.7933

In this thesis, we use the HPWL model to estimate the wirelength, and the total
wirelength is computed as
$Total\_Wirelength = \sum_{i=1}^{N} q(i) \times HPWL_{net(i)}$    (2.2)
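A minimal sketch of equation (2.2) follows. The values in `Q_TABLE` are copied from Table 2.1; the function names and the two example nets are hypothetical. The sketch assumes that the linear formula (2.1) takes over above 50 terminals, which is where Table 2.1 ends.

```python
# q(i) for nets with 1..50 terminals, copied from Table 2.1 (index 0 is unused).
Q_TABLE = [
    None,
    1.0, 1.0, 1.0, 1.0828, 1.1536, 1.2206, 1.2823, 1.3385, 1.3991, 1.4493,
    1.4974, 1.5455, 1.5937, 1.6418, 1.6899, 1.7304, 1.7709, 1.8114, 1.8519, 1.8924,
    1.9288, 1.9652, 2.0015, 2.0379, 2.0743, 2.1061, 2.1379, 2.1698, 2.2016, 2.2334,
    2.2646, 2.2958, 2.3271, 2.3583, 2.3895, 2.4187, 2.4479, 2.4772, 2.5064, 2.5356,
    2.5610, 2.5864, 2.6117, 2.6371, 2.6625, 2.6887, 2.7148, 2.7410, 2.7671, 2.7933,
]


def q_factor(num_terminals):
    """Compensation factor: table lookup up to 50 terminals, equation (2.1) beyond."""
    if num_terminals <= 50:
        return Q_TABLE[num_terminals]
    return 2.7933 + 0.02616 * (num_terminals - 50)


def hpwl(terminals):
    """Half-perimeter wirelength of one net given its (x, y) terminal positions."""
    xs = [x for x, _ in terminals]
    ys = [y for _, y in terminals]
    return (max(xs) - min(xs) + 1) + (max(ys) - min(ys) + 1)


def total_wirelength(nets):
    """Equation (2.2): sum of q(i) * HPWL over all nets."""
    return sum(q_factor(len(net)) * hpwl(net) for net in nets)


# Hypothetical example: one 2-terminal net and one 4-terminal net.
example_nets = [[(1, 1), (2, 1)], [(1, 1), (3, 2), (2, 4), (1, 3)]]
total = total_wirelength(example_nets)  # 1.0 * 3 + 1.0828 * 7
```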

2.4 Placement Algorithms for FPGAs
Placement is a Non-deterministic Polynomial-time (NP) complete [38] optimization
problem: no polynomial-time algorithm is known for computing an exact solution. The time
complexity of obtaining an optimal solution for the placement of n modules by exhaustive
enumeration is O(n!). Except for very small circuits, there is no efficient way to compute
an exact solution using a computer program. Approximation methods, also referred to as
heuristics, are used to obtain a good solution in reasonable time. Currently popular
placement algorithms can be divided into three major classes: partitioning-based
placement, quadratic placement, and simulated annealing based placement.
Partitioning-based algorithms [17][18] repeatedly divide the given circuit into densely
connected sub-circuits by applying partitioning techniques, such as the Kernighan-Lin (KL)
[19] and Fiduccia-Mattheyses (FM) [20] partitioning algorithms. These algorithms partition
the given circuit into sub-circuits and minimize the interconnection between the different
partitions. Since the interconnection usually means the wiring in the circuit, the
partitioning algorithms minimize the total wirelength as well. Recently, more
partitioning-based placement algorithms were introduced to solve the placement problem
[21][22][23].
In quadratic placement algorithms, we build linear equations from the interconnectivity
of the input circuit and minimize the objective function by solving these equations. This
technique reduces the placement problem to the solution of a system of linear equations
and is widely used for ASIC placement.
Simulated annealing based placement algorithms start with an initial (legal) placement
and repeatedly modify it in search of a cost reduction. If a modification reduces the
cost, it is accepted; otherwise it is accepted or rejected based on a number of factors.
Initially, most of these algorithms were developed for ASIC placement. After the
emergence of FPGAs in the mid-1980s, some of these algorithms were modified for FPGA
placement. In the following sections, we introduce three of these algorithms: the VPR
placement algorithm [3][4], which is based on simulated annealing; the PPFF placement
algorithm [23], which is based on partitioning; and the Ultra-Fast Placement algorithm
[5], which uses a combination of simulated annealing and clustering techniques.

2.4.1 VPR Placement Algorithm
As mentioned above, the VPR placement algorithm [3][4] is based on the simulated
annealing algorithm. Simulated annealing is a well-developed and widely used algorithm
for solving combinatorial optimization problems, including those arising in CAD for VLSI
physical design automation. As the name suggests, this algorithm mimics the annealing
process used to gradually cool molten metal to produce high quality metal structures. An
ideally annealed crystal should be in the lowest energy state, which corresponds to an
optimum configuration in a combinatorial optimization problem.

X = Initial_Random_Placement();
T = Set_Initial_Temperature();                   /* T = T0 */
Dlimit = Set_Initial_Range_Limit();              /* Dlimit = whole chip */
while (Exit_Criterion() == false) {              /* annealing not done yet */
    while (Inner_Loop_Criterion() == false) {    /* work per temperature not done yet */
        Xnew = Generate_Move(X, Dlimit);
        /* return a new configuration generated incrementally from previous one */
        /* by random pairwise exchange or translation within range limit */
        ΔC = Cost(Xnew) - Cost(X);               /* calculate change in cost */
        r = Get_Random_Number(0, 1);
        /* r = random number uniformly distributed between 0 and 1 */
        if (r < exp(-ΔC/T))
            X = Xnew;                            /* update current placement */
        /* always accept move (p = 1) if it improves placement (ΔC < 0) */
        /* accept "bad" moves (ΔC > 0) with probability p = exp(-ΔC/T) */
        /* when T is large, all bad moves likely to be accepted, */
        /* when T is small, only bad moves with small ΔC likely to be accepted */
    }   /* end inner loop */
    /* exploration at current temperature complete */
    T = Update_Temperature(α, T);                /* T = αT */
    Dlimit = Update_Range_Limit(Dlimit);
}   /* end outer loop */
/* annealing complete, X = final placement solution */
Figure 2.6 Pseudo-code for the Basic Simulated Annealing Based Placement Algorithm

Figure 2.6 shows the pseudo-code for the simulated annealing algorithm. Reference [24]
contains a detailed description of the basic algorithm and the various cost functions used
for the different types of placement problems.
A simulated annealing based placement algorithm initially places the logic blocks and
I/Os of the circuit randomly on the FPGA chip. The temperature (T) determines the
probability of accepting configurations that reduce the quality of the placement.
Parameter D controls the distance that a logic block may be moved. Initially D is set such
that a logic block can be moved to any location in the entire chip area. This means that
initially any logic block can be selected and moved to any other place, even if the move
deteriorates the quality of the placement. Given the random initial placement, a source
logic block is chosen randomly. Then, a target location is chosen at random for this logic
block within the displacement range specified by D. If the target location is occupied,
the target logic block is swapped with the source logic block and the cost of the swap is
evaluated. If the target location is empty, the source logic block is placed in the target
location and the new placement is evaluated. If the new cost is less than the cost of the
previous placement, the move or swap is accepted. Otherwise, the move or swap is accepted
only with probability $e^{-\Delta C/T}$, where $\Delta C$ is the change in placement cost
due to the move or swap, and T is the current temperature. As the placement quality
improves, the temperature is gradually reduced, making it less likely that moves and swaps
that degrade the placement will be accepted. Eventually, T is reduced to such a low value
that only moves and swaps that improve the placement quality are accepted, making the
heuristic greedy at that point. The parameter T is what permits probabilistic hill
climbing to take place and helps the placement solution avoid being trapped in local
minima.
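The move, swap and acceptance rules described above can be sketched as a small Python program. This is a simplified illustration and not VPR itself: it assumes a fixed cooling factor `alpha`, a fixed number of moves per temperature, no range limit, and a full cost recomputation after every move. All names and default values are hypothetical.

```python
import math
import random


def hpwl(net, pos):
    """Half-perimeter wirelength of one net; `pos` maps block name to (x, y)."""
    xs = [pos[b][0] for b in net]
    ys = [pos[b][1] for b in net]
    return (max(xs) - min(xs) + 1) + (max(ys) - min(ys) + 1)


def cost(nets, pos):
    return sum(hpwl(net, pos) for net in nets)


def anneal(blocks, nets, width, height, t0=10.0, alpha=0.9,
           moves_per_t=200, t_min=0.01, seed=0):
    """Return (placement, cost). Assumes len(blocks) <= width * height."""
    rng = random.Random(seed)
    sites = [(x, y) for x in range(width) for y in range(height)]
    rng.shuffle(sites)
    pos = dict(zip(blocks, sites))                 # random initial placement
    occupant = {site: b for b, site in pos.items()}
    current = cost(nets, pos)
    t = t0
    while t > t_min:                               # simplified exit criterion
        for _ in range(moves_per_t):               # simplified inner loop
            src = rng.choice(blocks)
            target = (rng.randrange(width), rng.randrange(height))
            old = pos[src]
            if target == old:
                continue
            other = occupant.get(target)           # None if the site is empty
            pos[src] = target                      # tentative move or swap
            if other is not None:
                pos[other] = old
            delta = cost(nets, pos) - current
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                occupant[target] = src             # accept
                if other is not None:
                    occupant[old] = other
                else:
                    del occupant[old]
                current += delta
            else:                                  # reject: undo the change
                pos[src] = old
                if other is not None:
                    pos[other] = target
        t *= alpha                                 # T = alpha * T
    return pos, current


# Hypothetical chain of six blocks on a 3x3 grid.
blocks = list("abcdef")
nets = [["a", "b"], ["b", "c"], ["c", "d"], ["d", "e"], ["e", "f"]]
placement, final_cost = anneal(blocks, nets, 3, 3)
```

Recomputing the whole cost for every move keeps the sketch short; a practical placer would update only the nets touched by the move.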
As we can see from Figure 2.6, besides the cost function that defines the basic way to
evaluate a placement, some crucial details also affect the annealing process: the rate at
which the temperature is reduced (called the temperature update factor), the number of
configurations to explore at each temperature (known as the inner loop criterion, or
InnerNum), the exit criterion by which the annealing algorithm terminates, and the
behavior of the range limiting mechanism. These parameters, together with the cost
function, determine the quality of the simulated annealing algorithm. When VPR employs the
simulated annealing algorithm for FPGA placement, a new cost function and a dynamic
adaptive annealing schedule are introduced [3]. The schedule includes some of the features
from the work on annealing schedules by Huang et al. [25], Lam and Delosme [26], and
Swartz and Sechen [27]. It also implements a novel temperature update scheme and stopping
criterion, together with a bounding box wirelength cost function.
In the VPR placement algorithm, the initial temperature $T_0$ is set to 20 times the
standard deviation in cost after a set of $N_{blocks}$ pair-wise moves have been
attempted, where $N_{blocks}$ is the total number of logic blocks and I/O pads in the
circuit. This temperature is high enough to ensure that almost all early moves and swaps
are accepted. The number of new configurations evaluated at each temperature T is set to:
$MovesPerT = InnerNum \times (N_{blocks})^{4/3}$    (2.3)
where the scaling factor InnerNum is set to 10 by default, which gives the best quality
at reasonable CPU run time.
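The two starting parameters can be computed as in the following sketch. The function names are hypothetical, and the use of the population standard deviation (rather than the sample standard deviation) is an assumption, since the text does not say which one is meant.

```python
import statistics


def initial_temperature(cost_samples):
    """T0 = 20 x standard deviation of the costs observed during the trial moves."""
    # Assumption: population standard deviation.
    return 20.0 * statistics.pstdev(cost_samples)


def moves_per_temperature(n_blocks, inner_num=10):
    """Equation (2.3): InnerNum * N_blocks^(4/3), rounded to a whole number of moves."""
    return round(inner_num * n_blocks ** (4.0 / 3.0))
```

For example, a circuit with 1000 blocks and pads would evaluate about 100,000 moves at each temperature with the default InnerNum of 10.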
The VPR placement algorithm uses a dynamic adaptive annealing schedule. Most of the
annealing parameters are updated according to the acceptance rate of the moves and swaps
at the current temperature. Table 2.2 shows how the temperature update factor, α, is
automatically determined according to the acceptance rate at the current temperature
stage, with $T_{new} = \alpha \cdot T_{old}$. The temperature is reduced in such a way
that if there is little change in cost, the temperature is reduced by a larger fraction.
Initially, when the temperature is very high and almost all moves are accepted, these
moves and swaps do not make much difference. The temperature update factor, α, is set to
a very low value, which makes the temperature drop very fast in this stage. As the
annealing process goes on, some moves are accepted and some are not. In this case, the
temperature update factor, α, is set to a larger value so that the placement can be
thoroughly explored at the temperature. In the final stage, since most moves are rejected
and the placement does not change much, the temperature again drops quickly.

Table 2.2 VPR Temperature Update Schedule

Fraction of Moves Accepted (R_accept)    Temperature Update Factor (α)
R_accept > 0.96                          0.5
0.8 < R_accept <= 0.96                   0.9
0.15 < R_accept <= 0.8                   0.95
R_accept <= 0.15                         0.8
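Table 2.2 translates directly into a lookup function. The sketch below reads the third row as 0.15 < R_accept <= 0.8, which is an assumption made so that the four ranges cover the whole interval without overlap; the function names are hypothetical.

```python
def temperature_update_factor(r_accept):
    """Return alpha for the given fraction of accepted moves (Table 2.2)."""
    if r_accept > 0.96:
        return 0.5      # almost everything accepted: cool quickly
    if r_accept > 0.8:
        return 0.9
    if r_accept > 0.15:
        return 0.95     # productive range: cool slowly
    return 0.8          # almost everything rejected: cool quickly again


def update_temperature(t_old, r_accept):
    """T_new = alpha * T_old."""
    return temperature_update_factor(r_accept) * t_old
```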

It is shown in [27][26] that the most desirable annealing schedule is one that keeps the
acceptance rate of moves near 0.44 for as long as possible. The VPR placement algorithm
achieves this by using the acceptance rate to control the range limiter as:
$D_{limit}^{new} = D_{limit}^{old} \times (1 - 0.44 + R_{accept})$    (2.4)
and $D_{limit} \in [1, max\_FPGA\_width]$.
If the acceptance rate is less than 44%, the range is shrunk; otherwise it is expanded.
Typically, this range limit covers the whole chip at the beginning, and gradually shrinks
to 1 at the end of the annealing.
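The range limiter update of equation (2.4), including the clamp to the interval from 1 to the FPGA width, can be sketched as follows. The function and parameter names are hypothetical.

```python
def update_range_limit(d_old, r_accept, max_fpga_width):
    """Scale the range limit by (1 - 0.44 + R_accept) and clamp it to [1, width]."""
    d_new = d_old * (1.0 - 0.44 + r_accept)
    return max(1.0, min(d_new, float(max_fpga_width)))
```

With an acceptance rate of exactly 0.44 the limit is unchanged; lower rates shrink it and higher rates expand it, up to the chip width.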
The VPR placement simulated annealing schedule exits when the temperature falls below a
certain fraction of the average cost per net:
$T < \frac{0.005 \times Cost}{N_{nets}}$    (2.5)

Even with a good annealing schedule, a large number of potential swaps are evaluated
during placement, and the most computationally expensive part of evaluating a swap is
computing the change in cost $\Delta C$. VPR therefore uses an incremental bounding box
evaluation technique to speed up this computation. For each net, a data structure stores
not only the coordinates of the four sides of the net bounding box, but also the number of
logic blocks that lie on each side. This information is used to determine the new net
bounding box after a swap by examining only the logic blocks that moved. The net cost is
recomputed from scratch only when the terminal moved is the only net terminal on a side of
the bounding box and it is moved toward the bounding box center. This technique, on
average, yields a five-fold speedup over ten large MCNC benchmark circuits.
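The incremental update can be illustrated in one dimension. The sketch below tracks only the left and right sides of a net bounding box together with the number of terminals on each side; the y direction would be handled the same way. It is written from the description above, not taken from the VPR source, and all names are hypothetical.

```python
import random


def full_extent(xs):
    """Brute-force scan: (xmin, xmax, terminals on xmin, terminals on xmax)."""
    lo, hi = min(xs), max(xs)
    return lo, hi, xs.count(lo), xs.count(hi)


def move_terminal(extent, xs, k, x_new):
    """Move terminal k to x_new (updating xs in place) and return the new extent.

    A full rescan is needed only when the moved terminal was alone on a side
    of the bounding box and moved inward.
    """
    lo, hi, n_lo, n_hi = extent
    x_old = xs[k]
    xs[k] = x_new
    if x_new < x_old:                       # terminal moved left
        if x_old == hi:                     # it was on the right side
            if n_hi == 1:
                return full_extent(xs)      # right side must be found again
            n_hi -= 1
        if x_new < lo:
            lo, n_lo = x_new, 1             # new left side
        elif x_new == lo:
            n_lo += 1
    elif x_new > x_old:                     # terminal moved right
        if x_old == lo:                     # it was on the left side
            if n_lo == 1:
                return full_extent(xs)
            n_lo -= 1
        if x_new > hi:
            hi, n_hi = x_new, 1             # new right side
        elif x_new == hi:
            n_hi += 1
    return lo, hi, n_lo, n_hi


# Self-check on a hypothetical 6-terminal net: compare against the brute-force scan.
rng = random.Random(1)
xs = [rng.randrange(10) for _ in range(6)]
extent = full_extent(xs)
for _ in range(500):
    extent = move_terminal(extent, xs, rng.randrange(len(xs)), rng.randrange(10))
    assert extent == full_extent(xs)
```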

2.4.2 PPFF: Partitioning-based Placement for FPGAs
Partitioning-based placement algorithms are based on graph partitioning algorithms such
as FM and KL. An FPGA is divided into two or more sub-regions, and a circuit partitioning
algorithm is applied to determine which logic block should go to which sub-region, so as
to minimize the number of cuts in the nets that connect blocks in different partitions and
to keep highly connected blocks in one partition. This process is repeated recursively
until each partition is small enough to be finally placed. Partitioning-based placement
algorithms are referred to as top-down algorithms. By partitioning the problem into
sub-regions, they drastically reduce the search space and can therefore achieve very high
speed. However, because cut size is only an indirect measure of wirelength, the placement
quality is not guaranteed. Other techniques are required to further improve the quality.
PPFF [23] is a partitioning-based placement algorithm that employs both the hMetis [28]
partitioning program and VPR simulated annealing techniques to achieve a final placement
for FPGAs.

PPFF placement is done by recursively partitioning the circuit with hMetis while
maintaining a tight connection between the circuit graph and the placement. Recursive
partitioning stops when each leaf in the hierarchical partition tree contains fewer than a
constant number of logic blocks (e.g. six). If some partitions hold more logic blocks than
they can accommodate after the partitioning, the overlaps are removed by a greedy
technique that moves logic blocks to the closest available partition. Finally, simulated
annealing is applied to refine the placement.
The key idea of PPFF is to apply a net terminal alignment technique during the
partitioning process. Partitioning algorithms divide the circuit into sub-circuits only
according to the interconnection of the circuit and do not consider the actual positions
of the logic blocks. As a result, after the placement is done, the cut size at each level
of partitioning is minimized but the total wirelength is not. The net terminal alignment
technique addresses this problem. When one part is partitioned into sub-parts, the logic
blocks in other partitions are taken into account so that a logic block tends to stay in
the partition that minimizes the wirelength of the nets to which it is connected. Figure
2.7 illustrates this technique.

Figure 2.7 Illustration of the Terminal Alignment Technique (labels: partitions A and B)
In Figure 2.7, the logic block X is in the current partition that is to be further
partitioned into parts A and B. Y is another logic block that is connected to X and was
previously placed in the upper part (aligned with A). In this case, the partitioning
algorithm alone does not care which part X goes to, as long as the cut size between A and
B is minimized. The net terminal alignment technique tells the partitioning algorithm to
place logic block X in partition A to minimize the wirelength.

2.4.3 Ultra-Fast Placement
Unlike partitioning-based algorithms, ultra-fast placement [5] is a bottom-up algorithm.
It reduces the complexity of the placement problem by clustering closely connected logic
blocks into multiple levels of clusters.
Figure 2.8 illustrates the abstract view of multi-level clustering. At level 0, the logic
blocks in the input circuit are grouped into clusters according to their interconnections
in the circuit. At each upper level, clusters from the previous level are grouped together
into larger clusters.
Figure 2.8 Abstract View of Multi-Level Clustering (labels: Level 0, Level 1, Level 2; logic block, I/O pad; multi-level clustering; coarse placement of clusters)
The number of clustering levels and the size of the clusters at each level can be varied
to trade off compile time against quality. As the size of the clusters increases, the
placement problem becomes simpler because more detail is hidden, but the netlist is
represented less accurately and lower quality may therefore result.
The ultra-fast placement algorithm is relatively simple. It begins with a multi-level,
bottom-up clustering of logic blocks based on their connectivity. Once all the required
clustering is done, a two-step placement is performed at each level of the hierarchy: an
initial constructive placement followed by an iterative improvement step using simulated
annealing. After the placement finishes at the current level, the clusters at that level
are decomposed. This process is repeated for the next level until all levels are placed.

2.5 Quadratic Placement Techniques
As discussed in Section 2.4, quadratic placement is one of the main algorithms used for
ASIC placement. Unlike the other algorithms, the quadratic algorithm has not been
investigated for FPGA placement.
Quadratic placement algorithms use squared wire length as the objective function and
minimize it by solving linear equations. Although quadratic wire length is only an
indirect measure of the linear wire length, quadratic placement can minimize the quadratic
wire length efficiently. As a result, it is widely used in ASIC placement
[29][30][31][32]. The use of quadratic assignment in many applied areas can be traced to
the paper [33]. The main concern with quadratic placement is that the node positions
obtained from the linear equation solver tend to concentrate in the center of the
placement area, with a large amount of overlap among nodes. To deal with this overlap
problem, [30] uses a bisection technique to recursively divide the circuit and adds linear
constraints that pull the nodes toward the centers of the corresponding partitions. In
[34], repelling forces are added for nodes sharing a net to maintain a target distance
between them, and attractive forces from fixed dummy nodes pull nodes from the center to
the sparse regions. In [35], spreading forces are added to pull the nodes out of the dense
regions. Another concern with quadratic placement is its quality. Because the objective
function of quadratic placement is squared wire length, not linear wire length, the linear
wire length is not optimized. To alleviate this problem, [32] adds some linear aspects
when building the quadratic model, and [35] uses half-perimeter adjustment to improve its
result. But since the main concern of [35] is to speed up the equation solving process of
quadratic placement, this technique is not discussed in detail. All the quadratic
algorithms discussed above target standard cell ASIC placement.

2.5.1 Essentials of Quadratic Placement
A quadratic placer takes a hyper-graph netlist as its input and produces a placement of
nodes on the target chip such that the total squared wire length is minimized. Quadratic
placement uses the following objective function:
$\Phi(x, y) = \frac{1}{2} \sum_{(i,j)} W_{ij} \left[ (x_i - x_j)^2 + (y_i - y_j)^2 \right]$    (2.6)
where $x$, $y$ are the coordinates of the logic blocks of the netlist and $W_{ij}$ is the
weight of the edge that connects node $(x_i, y_i)$ and node $(x_j, y_j)$. Since the input
of the quadratic placer is usually represented by a hypergraph, and two nodes can be
connected by more than one net, we have to convert the hypergraph into a weighted graph
first. Two models are used for this conversion. The clique model introduces an edge of
weight 2/p between every pair of nodes incident to a p-pin net, while the star model
creates a new node at the center of each net. It is shown in [35] that the total squared
wire length is the same for any placement under either the star model or the clique model.
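The clique conversion can be sketched as follows. The function name is hypothetical, and summing the weights of parallel edges that come from different nets is an assumption about how multiply connected node pairs are merged.

```python
from collections import defaultdict
from itertools import combinations


def clique_edges(nets):
    """Expand every p-pin net into a clique whose edges each have weight 2/p.

    Returns a dict mapping a sorted node pair (a, b) to its accumulated weight.
    """
    weights = defaultdict(float)
    for net in nets:
        p = len(net)
        if p < 2:
            continue                      # a single-pin net creates no edge
        for a, b in combinations(sorted(net), 2):
            weights[(a, b)] += 2.0 / p    # assumption: parallel edges add up
    return dict(weights)


# Hypothetical netlist: a 3-pin net and a 2-pin net sharing nodes a and b.
edges = clique_edges([["a", "b", "c"], ["a", "b"]])
```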
The objective function can be rewritten in matrix notation as:
$\Phi(x, y) = \frac{1}{2} x^T Q x + d_x^T x + \frac{1}{2} y^T Q y + d_y^T y + const$    (2.7)
where $Q$ is an $n \times n$ symmetric matrix and $d_x$, $d_y$ are n-dimensional vectors.
Since the objective function can be separated into its x and y dimensions, only one
dimension needs to be considered. Then we get
$\Phi(x) = \frac{1}{2} x^T Q x + d_x^T x + const$    (2.8)
To find the minimum of this objective function, we set $\nabla \Phi(x) = 0$ and get the
matrix equation:
$Q x + d_x = 0$    (2.9)
This is the linear system whose solution minimizes the total squared wire length of the
placement. To obtain the matrix $Q$ and vector $d_x$, let $q_{ij}$ be the entry in row i
and column j of matrix $Q$, and $d_i$ be the ith element of vector $d_x$. For two movable
nodes i and j, the cost can be rewritten as $\frac{1}{2} W_{ij} [x_i^2 + x_j^2 - 2 x_i x_j]$.
The first and second terms contribute $W_{ij}$ to $q_{ii}$ and $q_{jj}$ respectively. The
third term contributes $-W_{ij}$ to $q_{ij}$ and $q_{ji}$. For a connection between a
movable node i and a fixed node f, the cost is $\frac{1}{2} W_{if} [x_i^2 + x_f^2 - 2 x_i x_f]$.
The first term contributes $W_{if}$ to $q_{ii}$, the third term contributes $-W_{if} x_f$
to the vector $d_x$ at row i, and the second term is part of the constant term of the
objective function, which vanishes after differentiation.
The matrix Q is positive definite if all movable nodes are connected to fixed nodes, i.e.
I/O (Input/Output) pads, either directly or indirectly. This condition holds for all real
circuits, since each node of the circuit should be accessible from the outside of the
circuit. The matrix equation can therefore be solved by non-stationary iterative methods
[36].
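The assembly rules above can be sketched for one dimension as follows. The function name, the dense list-of-lists matrix and the small chain example are hypothetical; a real placer would use a sparse matrix format.

```python
def build_system(movable, fixed_pos, edges):
    """Assemble Q and d_x so that the minimum satisfies Q x + d_x = 0.

    movable   -- list of movable node names (defines the row/column order)
    fixed_pos -- dict: fixed node name -> x coordinate (e.g. I/O pads)
    edges     -- list of (u, v, weight) two-pin connections
    """
    index = {name: i for i, name in enumerate(movable)}
    n = len(movable)
    Q = [[0.0] * n for _ in range(n)]
    d = [0.0] * n
    for u, v, w in edges:
        if u in index and v in index:           # movable-movable connection
            i, j = index[u], index[v]
            Q[i][i] += w
            Q[j][j] += w
            Q[i][j] -= w
            Q[j][i] -= w
        elif u in index or v in index:          # movable-fixed connection
            m, f = (u, v) if u in index else (v, u)
            i = index[m]
            Q[i][i] += w
            d[i] -= w * fixed_pos[f]
        # fixed-fixed connections only add to the constant term
    return Q, d


# Hypothetical chain: pad at x=0, node a, node b, pad at x=3, unit weights.
Q, d = build_system(["a", "b"], {"p0": 0.0, "p1": 3.0},
                    [("p0", "a", 1.0), ("a", "b", 1.0), ("b", "p1", 1.0)])
# The evenly spaced placement a=1, b=2 satisfies Q x + d = 0.
x = [1.0, 2.0]
residual = [sum(Q[i][j] * x[j] for j in range(2)) + d[i] for i in range(2)]
```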

2.5.2 Linear Equation Solver
Quadratic placement produces large sparse systems of linear equations [29][46]. These
systems can be solved by iterative methods, which use successive approximations to obtain
more accurate solutions to the linear system at each step. There are two main types of
iterative methods: stationary methods and non-stationary methods. The stationary methods
can be expressed in the simple form $x^{(k)} = B x^{(k-1)} + c$, where neither $B$ nor
$c$ depends on the iteration count k. The stationary methods, such as the Jacobi method,
the Gauss-Seidel method, the Successive Overrelaxation (SOR) method and the Symmetric
Successive Overrelaxation (SSOR) method, are older and simpler to understand and
implement, but usually not as effective [36].
Non-stationary methods differ from stationary methods in that the computations involve
information that changes at each iteration. Typically, coefficients are computed by taking
inner products of residuals or other vectors arising from the iterative method.
Non-stationary methods, such as the Conjugate Gradient method, the Generalized Minimal
Residual (GMRES) method and the Quasi-Minimal Residual (QMR) method, are a relatively
recent development; their analysis is usually harder to understand, but they can be highly
effective [36].
For a real digital circuit, all the nodes of the circuit are connected, directly or
indirectly, to I/O pads, and the matrix Q generated through the quadratic model is
positive definite [30]. The linear equations can therefore be solved by non-stationary
methods. In this thesis, we apply the Conjugate Gradient method, which is also used by
many other quadratic placement tools [30][29][45].
The Conjugate Gradient method is an effective method for symmetric positive definite
systems. It is one of the oldest and best known of the non-stationary methods. The method
proceeds by generating vector sequences of iterates (i.e., successive approximations to
the solution), residuals corresponding to the iterates, and search directions used in
updating the iterates and residuals. Although the length of these sequences can become
large, only a small number of vectors need to be kept in memory. In every iteration of the
method, two inner products are performed in order to compute update scalars that are
defined to make the sequences satisfy certain orthogonality conditions. For a symmetric
positive definite linear system, these conditions imply that the distance to the true
solution is minimized in some norm.
The pseudo-code of the Conjugate Gradient method is shown in Figure 2.9. The iterates
$x^{(i)}$ are updated in each iteration by a multiple ($\alpha_i$) of the search direction
vector $p^{(i)}$:
$x^{(i)} = x^{(i-1)} + \alpha_i p^{(i)}$    (2.10)
Correspondingly, the residuals $r^{(i)} = b - A x^{(i)}$ are updated as
$r^{(i)} = r^{(i-1)} - \alpha q^{(i)}$, where $q^{(i)} = A p^{(i)}$    (2.11)
The choice $\alpha = \alpha_i = (r^{(i-1)})^T r^{(i-1)} / (p^{(i)})^T A p^{(i)}$ minimizes
$(r^{(i)})^T A^{-1} r^{(i)}$ over all possible choices for $\alpha$ in equation (2.11).
The search directions are updated using the residuals
$p^{(i)} = r^{(i)} + \beta_{i-1} p^{(i-1)}$    (2.12)
where the choice $\beta_i = (r^{(i)})^T r^{(i)} / (r^{(i-1)})^T r^{(i-1)}$ ensures that
$p^{(i)}$ and $A p^{(i-1)}$, or equivalently $r^{(i)}$ and $r^{(i-1)}$, are orthogonal. In
fact, one can show that this choice of $\beta_i$ makes $p^{(i)}$ and $r^{(i)}$ orthogonal
to all previous $A p^{(j)}$ and $r^{(j)}$ respectively.

Compute r^(0) = b - A x^(0) for some initial guess x^(0)
for i = 1, 2, ...
    solve M z^(i-1) = r^(i-1)
    ρ_(i-1) = (r^(i-1))^T z^(i-1)
    if i = 1
        p^(1) = z^(0)
    else
        β_(i-1) = ρ_(i-1) / ρ_(i-2)
        p^(i) = z^(i-1) + β_(i-1) p^(i-1)
    endif
    q^(i) = A p^(i)
    α_i = ρ_(i-1) / ((p^(i))^T q^(i))
    x^(i) = x^(i-1) + α_i p^(i)
    r^(i) = r^(i-1) - α_i q^(i)
    check convergence; continue if necessary
end
Figure 2.9 Pseudo-code of the Preconditioned Conjugate Gradient Method
In Figure 2.9, a preconditioner M is used. A preconditioner is a matrix that transforms
the linear system into an equivalent one with a better convergence rate. Since a
preconditioner causes some additional computation, there is a trade-off in whether to use
a preconditioner and which one to use. For M = I in Figure 2.9, we obtain the
unpreconditioned version of the Conjugate Gradient algorithm. In that case, we can further
simplify the algorithm by skipping the "solve" line and replacing $z^{(i-1)}$ by
$r^{(i-1)}$ (and $z^{(0)}$ by $r^{(0)}$). In this thesis, we use the Jacobi preconditioner
[36], the simplest one, since the choice of preconditioner is not our primary concern.
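The preconditioned iteration with a Jacobi preconditioner (M = diag(A)) can be sketched in plain Python as follows. This is an illustrative dense-matrix version, not the implementation used in this thesis; the function names, the zero initial guess and the stopping test on the largest residual entry are assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def matvec(A, v):
    return [dot(row, v) for row in A]


def pcg_jacobi(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for symmetric positive definite A with Jacobi-preconditioned CG."""
    n = len(b)
    x = [0.0] * n                                    # initial guess x^(0) = 0
    r = list(b)                                      # r^(0) = b - A x^(0)
    p = [0.0] * n
    rho_prev = None
    for _ in range(max_iter):
        if max(abs(v) for v in r) < tol:             # convergence check
            break
        z = [r[i] / A[i][i] for i in range(n)]       # solve M z = r, M = diag(A)
        rho = dot(r, z)
        if rho_prev is None:
            p = z[:]
        else:
            beta = rho / rho_prev
            p = [z[i] + beta * p[i] for i in range(n)]
        q = matvec(A, p)
        alpha = rho / dot(p, q)
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * q[i] for i in range(n)]
        rho_prev = rho
    return x


# Hypothetical 1-D placement system Q x = -d_x with Q = [[2,-1],[-1,2]], d_x = [0,-3].
solution = pcg_jacobi([[2.0, -1.0], [-1.0, 2.0]], [0.0, 3.0])
```

Because the matrix equation of quadratic placement is Q x + d_x = 0, the right-hand side passed to the solver is -d_x.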

2.6 Quadratic Placement Algorithm Examples
To our knowledge, the quadratic placement algorithm has not been used for FPGAs.
The quadratic technique has been used in ASIC placement algorithms. In this section we
will present some well-known ASIC placement algorithms based on the quadratic
technique.

2.6.1 GORDIAN
Kleinhans et al. [30] describe a placement algorithm named GORDIAN that combines
quadratic placement and partitioning to handle the ASIC placement problem. The acronym
GORDIAN comes from the two main parts of the method: global optimization and rectangle
dissection, which is based on improved partitioning schemes. With GORDIAN, the placement
problem is formulated as a sequence of quadratic programming problems derived from the
entire connectivity information of the circuit. An increasing number of constraints
restricting the freedom of movement of the modules are imposed, reflecting the results of
successively refined partitioning.
The GORDIAN algorithm iterates global optimization and partitioning steps. The global
optimization starts with an initial region that comprises the whole chip and contains all
modules to be placed. One constraint is added to fix all these modules to the center of
the chip. In each partitioning step, the modules are divided and the placement regions are
dissected into sub-regions accordingly. New constraints are established to fix the modules
in each sub-region to the center of the corresponding sub-region. This loop of global
optimization and partitioning steps is repeated until each region contains fewer than a
predefined number k (e.g. k=10) of modules.
GORDIAN introduced a global placement that refines all modules in the circuit
simultaneously, which avoids any dependence on a processing sequence.

2.6.2 GORDIAN-L
GORDIAN-L [32] is an improved version of GORDIAN that optimizes the linear
wirelength objective. Since the linear wirelength objective cannot be addressed directly by
quadratic methods, GORDIAN-L approximates the linear objective by a quadratic
objective. It then executes the following loop: first minimize the current objective to yield
an approximate solution, and then use this solution to find a better approximation of the
linear objective.
GORDIAN-L is based on the observation that the linear objective can be rewritten
as:

$$\sum_{i,j} |x_i - x_j| = \sum_{i,j} \frac{(x_i - x_j)^2}{g_{i,j}}, \qquad g_{i,j} = |x_i - x_j|$$

If an approximation of $g_{i,j}$ is fixed, the linear objective is converted to a squared
objective and the problem can be solved. If $g_{i,j}$ is constant, GORDIAN-L reduces to
GORDIAN. Based on this observation, GORDIAN-L first solves the quadratic problem
to get a reasonable approximation for each $g_{i,j}$. It then performs successive iterations
to improve it, each time using the coordinates from the previous iteration to approximate
$g_{i,j}$, until no more improvement can be achieved.
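The reweighting loop can be illustrated on a deliberately tiny case. The sketch below is not GORDIAN-L itself: it assumes a single movable node on a line, connected with unit weight to fixed nodes, and the function names and the safeguard `eps` are hypothetical. With $g$ held constant the first iteration gives the plain quadratic solution, and later iterations move toward the position that minimizes the linear length.

```python
def reweighted_position(fixed, iterations=40, eps=1e-9):
    """Place one movable node on a line by iteratively reweighted least squares.

    fixed: coordinates of the fixed nodes connected to the movable node.
    Returns the list of positions produced by the successive iterations.
    """
    g = [1.0] * len(fixed)          # constant g: the plain quadratic objective
    history = []
    for _ in range(iterations):
        # Minimise sum((x - p)^2 / g) in closed form: a weighted mean.
        weights = [1.0 / g_k for g_k in g]
        x = sum(w * p for w, p in zip(weights, fixed)) / sum(weights)
        history.append(x)
        # Re-estimate g from the new coordinates; eps avoids division by zero.
        g = [max(abs(x - p), eps) for p in fixed]
    return history


def linear_length(x, fixed):
    """Sum of the linear distances from x to the fixed nodes."""
    return sum(abs(x - p) for p in fixed)


fixed_nodes = [0.0, 0.0, 3.0]       # illustrative coordinates
history = reweighted_position(fixed_nodes)
```

In this example the first iterate is the quadratic optimum at 1.0, while the later iterates approach 0, where the linear length is smallest.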
GORDIAN-L can achieve up to 20 percent reduction in area compared with GORDIAN, at the
price of a significant increase in CPU execution time.

2.6.3 FastPlace
FastPlace [35] is a quadratic-based standard cell placement algorithm that uses cell
shifting, iterative refinement and a hybrid net model. The basic idea of FastPlace is cell
shifting and the addition of spreading forces. The chip region is divided into an array of
bins. The size of these bins is recomputed in both the x and y directions so that the
utilization of adjacent bins is averaged. By this technique, the cells in a dense area are
gradually pushed to the bins around it. Additional spreading forces are added
accordingly so that the spread cells do not collapse back to their previous positions during
the next step. Half-perimeter wire length is also employed to refine the placement based on
the bin structure.
The FastPlace algorithm is divided into a global placement stage, a wirelength
improvement stage and a detailed placement stage. The first stage mainly deals with the cell
spreading problem based on quadratic programming. The second stage moves the cells
between bins to reduce the wire length. The third stage legalizes the placement
by assigning cells to pre-defined rows in the placement region and removing overlap
among them; it also further reduces the wirelength with a greedy heuristic.
The FastPlace algorithm uses a hybrid net model that combines the clique model and the
star model when converting the input circuit for quadratic programming, which makes
the linear system sparser. It also applies an Incomplete Cholesky Factorization
preconditioner to help solve the matrix equations. As a result, this algorithm turns out to be
faster than other algorithms.

2.7 Summary
In this chapter, we provided the background information related to our
research work. We first introduced the general FPGA design flow together with the FPGA
architecture that is used in this thesis. Then we gave a clear definition of the placement
problem for FPGAs. Wire length estimation techniques were discussed since our work
focuses on the wire length minimization objective. Previous work in the FPGA
placement area was presented next. Finally, we presented the essentials of the quadratic
technique and related algorithms in ASIC placement. Our quadratic-based FPGA
placement algorithm will be described in the next chapter.

Chapter 3
A Quadratic Placement Algorithm for FPGAs

In this chapter, we present the Quadratic Placement algorithm for FPGAs (QPF),
which was developed during this thesis work. We start with a discussion of two main
problems of the quadratic placement technique in Section 3.1. Then we present our
proposed QPF algorithm and explain how we overcome these problems. In subsequent
sections, the heuristics used are described in detail, including the pseudo code and
parameter settings.

3.1 Two Main Problems in Quadratic Placement
As we discussed before, a squared wirelength objective is used in quadratic placement.
This squared wirelength objective is applied only because it allows the one-dimensional
placement problem to be reduced to the solution of a system of linear equations. It reduces
the computation, but at the same time it also causes some problems. There are two main
problems: one is the overlap problem, in which modules tend to overlap in a dense area; the
other is the quality problem, in which the squared wirelength objective does not accurately
represent the quality of a placement.
The first problem comes from the lack of legal position information in the quadratic
placement model. In an FPGA chip (recall from Section 2.1) the legal positions that can
accommodate circuit elements are fixed. But the quadratic placement model does not
consider this information when building up linear equations. This results in an overlapped
mathematical solution, as illustrated in Figure 3.1.

Figure 3.1 Illustration of Overlap Problem for Quadratic Placement
Figure 3.1 shows a 5x5 array of logic block locations in an FPGA chip. The gray squares are
the legal positions in which the logic blocks can be placed. The tiny circles are the positions
that the solution of the linear equations actually produces. A good quadratic-based placer
must handle this problem efficiently so that the overlap is resolved with a minimum
deterioration in placement quality.
The second problem comes from the squared wirelength objective in the quadratic
model. To convert the placement problem into a system of linear equations, the model uses a
squared wirelength objective function, not a linear wirelength function. The authors of [31]
compared the linear and squared wirelength objectives and concluded that the linear
wirelength is superior. Quadratic wire length is an indirect measure of linear wire length.
Usually a reduction in quadratic wire length leads to a reduction in linear wire length. But in
some cases minimizing the quadratic wire length does not minimize the linear wire length.
Figure 3.2 illustrates the difference between the linear and squared wirelength
objectives. A circuit consists of four nodes A, B, C and D. Nodes A, B, C are fixed, and node
D is movable. The distance between A (or B) and C is L. When the quadratic wire length
model is applied, the optimized position of node D will be at d = (1/3) L. When the linear
wire length model
is applied, the position will be d = 0, which means node D will be placed as close to A and B
as possible.

Figure 3.2 Linear vs. Squared Wirelength: a) Squared Objective Model, b) Linear
Objective Model
Furthermore, placement with the minimum squared wirelength objective has a unique
solution that can be found by solving the corresponding linear equations. But placement
with the linear wirelength objective can have multiple optimal solutions. For example, a single
movable logic block connected to two fixed I/O pads by edges of equal weight can be
optimally placed anywhere between the two I/O pads. A good quadratic-based placer must
also handle this problem efficiently so as to achieve the minimum linear wirelength.
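The two positions quoted for Figure 3.2 can be checked directly. The short derivation below assumes a one-dimensional layout in which A and B sit at coordinate 0, C sits at coordinate L, all three connections have unit weight, and d is the coordinate of D; the function names f and g are introduced here only for this check.

```latex
\begin{align*}
\text{squared:}\quad & f(d) = d^2 + d^2 + (L-d)^2, \qquad
  f'(d) = 4d - 2(L-d) = 0 \;\Rightarrow\; d = \tfrac{L}{3},\\
\text{linear:}\quad & g(d) = |d| + |d| + |L-d| = L + d \quad (0 \le d \le L),
  \qquad g(d) = L - 3d > L \quad (d < 0) \;\Rightarrow\; d = 0 .
\end{align*}
```

Under these assumptions the squared objective pulls D one third of the way toward C, while the linear objective leaves D at the position of A and B.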

3.2 Overview of the QPF Algorithm
An overview of the QPF algorithm (using pseudo code) is given in Figure 3.3. It
consists of three stages, where the input to the first stage is a technology-mapped and packed
circuit to be placed [6] [3].
The goal of the first stage is to obtain a good initial placement. As we discussed before,
the solution of the linear equations tends to overlap in a dense area, as shown in Figure 3.1. In
this stage, we try to expand the solution to the entire chip area while attempting to minimize
the wirelength. This is achieved by repeatedly building up, modifying and solving linear
equations to get a progressively better placement. Each iteration contains four steps. First,
we build up the linear equations and solve them to get the coordinates of every node of the
circuit. In the first iteration, we build our equations according to the connectivity of the
input circuit. In subsequent iterations, we modify the linear equations according to the new
dummy nodes added in the previous iteration. Second, we map the nodes into the entire
chip area according to their current coordinates. This mapping process is performed every
time we get new coordinates for the nodes in the circuit. After the mapping process, we
have a legal placement and can evaluate it using the half-perimeter wire length
estimate. Third, we use these new positions as a reference and add extra dummy nodes to the
circuit. These new nodes are used to modify the coefficient matrix in the next iteration.

QPF Algorithm
Stage 1
- Build and solve linear equations
- Map the circuit to the FPGA chip
- Add dummy nodes and expand the placement
- Refinement for minimizing linear wire length
- Repeat above steps until there is no significant improvement

Stage 2
- Refinement for minimizing linear wire length based on legal placement until
  there is no more improvement
- Re-map the circuit to the FPGA chip

Stage 3
- Low temperature Simulated Annealing to refine the final placement
Figure 3.3 Overview of QPF Algorithm
The first three steps in stage one are used for expansion of the logic blocks. Fourth,
during each iteration, we also perform linear adjustment to minimize the linear wire length
as well as the squared wire length. This step is relatively independent of the previous three
steps and has nothing to do with the expansion purpose. It is inserted here to ensure that
linear wirelength improvement is also considered during the expansion process. This is also
achieved by modifying the coefficient matrix of the quadratic equations. The difference is that
in this step, we modify the coefficient matrix to reduce wirelength, not to expand the nodes
in a dense area. Stage one is performed until no significant improvement can be achieved.
When we leave stage one, we already have a reasonably good legal placement. In stage
two, we try to further improve this placement. In this stage, since the nodes are already
evenly distributed over the chip area, we concentrate on improving the linear wirelength.
This is achieved by using a linear wire length reduction technique similar to the one
described in step four of stage one. The difference is that we do not build and solve the linear
equations in this stage. Instead we move the nodes directly (by changing the coordinates of
the nodes) to reduce the total linear wire length. Since no equation solver is needed in this
stage, this process is much faster than the linear adjustment in stage one, and we can perform
more iterations and get a better refinement.
Finally, in stage three the placement is refined by a low temperature simulated annealing
algorithm to further minimize the wire length.
The QPF algorithm uses a heuristic approach that achieves the final placement
gradually. Different techniques are used in different stages. In the following sections, we
will discuss these techniques and explain how we get a good placement by utilizing them.

3.3 The Node Mapping Process
The node mapping process maps the nodes of the input circuit to the target FPGA chip
area based on the current node coordinates. The solution of the linear equations
provides the coordinates of all nodes in the circuit. These coordinates do not give a legal
placement; therefore, they are mapped to physical locations on the FPGA chip using the
mapping process. The mapping process is used extensively in both stage one and stage two.
Whenever the coordinates of the nodes in the circuit are updated, we perform this mapping
process on the circuit and a new placement is obtained.
The mapping process is done in a partitioning-like way. But unlike typical partitioning
algorithms that partition the circuit, we partition the chip area, not the circuit, into four
parts. Figure 3.4 illustrates how the mapping process works. For clarity, we omit the
gray squares shown in Figure 3.1.

Figure 3.4 Overview of Mapping

The basic idea of mapping is as follows. Firstly, we divide the whole chip into four
parts, each with an integer number of rows and columns. For example, in Figure 3.4 a
5x5 FPGA is divided into four parts of 2x2, 2x3, 3x2 and 3x3. Secondly, we check all
four parts for overflow. A part is said to overflow when the number of nodes in that part is
greater than the number of physical positions in the part. In Figure 3.4, the bottom-left part
overflows since it can accommodate only 6 nodes, far fewer than the number of nodes
in this part. All the overflowed nodes must be moved to other parts. Thirdly, we try to
balance the chip so that extra nodes on the edges of the overflowed parts are moved from
the overflowed parts into other parts. Since the FPGA chip selected is large enough
to hold all the nodes, there can be 1-3 overflowed parts. Depending on the number of
overflowed parts, the balance sequence is shown in Figure 3.5.
The overflowed parts are shown in gray. The balance sequence in a) and c) of Figure
3.5 is obvious.
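The first two mapping steps, partitioning the chip and checking each part for overflow, can be sketched as follows. This is only an illustration: the cut position (`rows // 2`, `cols // 2`), the part names, the coordinate convention (x along columns, y along rows, both starting at 0) and the sample node coordinates are assumptions, not details taken from the QPF implementation.

```python
def split_chip(rows, cols):
    """Split a rows x cols array into four rectangular parts.

    Each part is (row_start, row_end, col_start, col_end) with exclusive ends.
    """
    r_cut, c_cut = rows // 2, cols // 2
    return {
        "P1": (0, r_cut, 0, c_cut),
        "P2": (0, r_cut, c_cut, cols),
        "P3": (r_cut, rows, 0, c_cut),
        "P4": (r_cut, rows, c_cut, cols),
    }


def capacity(part):
    """Number of physical positions inside a part."""
    r0, r1, c0, c1 = part
    return (r1 - r0) * (c1 - c0)


def find_overflow(nodes, parts):
    """Return {part name: number of excess nodes} for every overflowed part."""
    counts = {name: 0 for name in parts}
    for x, y in nodes:                      # continuous solver coordinates
        for name, (r0, r1, c0, c1) in parts.items():
            if r0 <= y < r1 and c0 <= x < c1:
                counts[name] += 1
                break
    return {name: counts[name] - capacity(parts[name])
            for name in parts if counts[name] > capacity(parts[name])}


parts = split_chip(5, 5)
# Nine nodes clustered inside P1 (capacity 4) plus one node inside P4.
nodes = [(0.2 * i, 0.2 * i) for i in range(9)] + [(4.0, 4.0)]
overflow = find_overflow(nodes, parts)
```

For the 5x5 array this yields parts with 4, 6, 6 and 9 positions, matching the 2x2, 2x3, 3x2 and 3x3 split described above, and reports that P1 holds five nodes too many.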

[Figure 3.5: four panels, a) to d), each showing the chip divided into parts P1, P2, P3 and P4, with arrows giving the order in which extra nodes are pushed out of the overflowed parts]
Figure 3.5 Balance Sequence
We move the extra nodes in an overflowed part to its adjacent parts. For example,
the balance sequence for a) is: 1. P1->P2&P3, 2. P2->P4, 3. P3->P4. For b), we start from
the denser of the two overflowed parts, and for d) we always start from the overflowed
part that is in the middle and push the extra nodes to the other parts, as the arrows show in
Figure 3.5. When nodes in one dense part are to be moved to both adjacent parts (e.g., the
overflowed nodes in P1 are to be moved to both P2 and P3 in Figure 3.5 a)), the number of
nodes moved to each adjacent part should be chosen carefully so that the two
adjacent parts become equally overflowed.
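The text does not give a formula for this split, so the following is only one possible reading of the "equally overflowed" rule. It assumes that overflow is measured as load minus capacity (negative when a part still has free positions); the function name and the example loads are hypothetical.

```python
def split_excess(excess, load_a, cap_a, load_b, cap_b):
    """Split `excess` nodes between two adjacent parts a and b.

    The split is chosen so that, after the move, load - capacity is as
    equal as possible in the two receiving parts.
    Returns (moved_to_a, moved_to_b).
    """
    # Equal overflow means load_a + to_a - cap_a == load_b + to_b - cap_b
    # with to_a + to_b == excess; solve that pair of equations for to_a.
    overflow_gap = (load_b - cap_b) - (load_a - cap_a)
    to_a = (excess + overflow_gap) // 2
    to_a = max(0, min(excess, to_a))      # cannot move fewer than 0 or more than excess
    return to_a, excess - to_a


# P1 has 5 extra nodes; P2 holds 1 of 6 positions and P3 holds 2 of 6.
to_p2, to_p3 = split_excess(5, 1, 6, 2, 6)
```

In this example three nodes go to P2 and two to P3, which leaves both parts with two free positions.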
P = n x n FPGA CLB array;
Repeat:
    Partition P into four parts P1, P2, P3, P4;
    Check overflow for all parts P1-P4;
    If (overflow) {
        Find all overflowed parts;
        Find balance sequence;
        Move all overflowed nodes to appropriate parts;
    }
    mapping(P1);
    mapping(P2);
    mapping(P3);
    mapping(P4);
Until all parts are small enough
Figure 3.6 Pseudo code for Node Mapping Process(P)
The pseudo code of the mapping process is shown in Figure 3.6. By recursively
partitioning and mapping, all the nodes are fixed to a small local area, and within that
small area we can find the exact physical location for them with ease. The mapping process is
the basic operation in our placement algorithm: both the expansion and the linear adjustment
use the mapping process to legalize and evaluate their placement.

3.4 The Expansion Process
As discussed in Section 3.1, one of the main problems with quadratic placement is that
the nodes of the input circuit tend to overlap in some dense area of the chip. One of the
challenges for quadratic placement is how to solve this problem efficiently. In the QPF
algorithm, we developed an iterative expansion technique that expands the nodes
gradually, while satisfying the wirelength minimization objective as well.

[Figure 3.7: flowchart with the boxes "Input circuit", "Build/modify equations", "Equation solver", "Mapping", "Add dummy nodes" and "Linear adjustment", and the exit condition "Exit if no significant improvement"]
Figure 3.7 Overview of Expansion Process
The overview of our expansion process is shown in Figure 3.7. The input to the
expansion process is the circuit to be placed, which is used to build up the original linear
equations. The equation solver solves the linear equations by the Conjugate Gradient method
as discussed in Section 2.5. The mapping process maps the nodes to the entire chip area
according to the current node coordinates as discussed in the previous section. Then one
dummy node is added to every node in the circuit according to the mapping result. These
dummy nodes help to modify the coefficients of the linear equations and get a better result in
the next iteration. The linear adjustment stage here is used to reduce the linear wirelength
during the expansion process, which will be discussed in the next section. In every
iteration, we check the placement quality after the mapping process. We exit the loop when
no significant reduction in placement cost is obtained in that iteration.
The basic intuition behind this expansion process is as follows: according to the input
circuit, we can build up a matrix equation, in the x dimension for example, $Ax = b$.
Suppose we already had the ideal placement $x' = [x'_1, x'_2, \ldots, x'_i, \ldots, x'_n]$; this
ideal placement vector would be the solution of a similar matrix equation $A'x' = b'$. So if
we can modify our original matrix $A$ and vector $b$ to make them approach the ideal matrix
$A'$ and vector $b'$, the solution will approach the ideal solution as well.

In short,

The main challenge of this expansion process is how we set the dummy nodes in
Figure 3.7. The dummy-node-setting process must accomplish two goals. One is that the
dummy nodes must be set so that they help to pull the nodes in the dense area apart, which
is the main goal of the expansion process. The other is that these dummy nodes must not
greatly change the relative position of the nodes obtained by solving the linear equations. In
our algorithm, we use the mapping result in the current iteration as a reference placement to
set the dummy nodes, as shown in Figure 3.8.
Figure 3.8 a) shows the original node positions that we get from the previous iteration of the
expansion process. Figure 3.8 b) shows the node positions after the mapping process. One
node in the circuit is marked as a square, and we call it node A. Node A is in the center near
position (3, 3) in Figure 3.8 a) and is mapped to the right side at (4, 2) in Figure 3.8 b). This
tells us that, based on the current information, node A should be placed at (4, 2). So we add a
dummy node at this place and connect node A (and only node A) to the dummy node. The
weight between node A and the dummy node is set according to the average weight
connected to node A.

Figure 3.8 Add a Dummy Node: a) Original Node Position & Dummy Node, b)
Reference Node Position After Mapping, c) Result of One Iteration of Expansion

The information about the dummy node added for node A will be used to modify the
coefficients of the linear equations and will pull node A toward the dummy node in the next
iteration of the expansion process. Through node A, it will also affect the other nodes in the
circuit that are connected to node A directly or indirectly. After we add a dummy node for
every node in the circuit and modify the linear equations accordingly, we will get a better
placement by solving the equations in the next iteration. As we can see in Figure 3.8 c), the
nodes in the dense area in Figure 3.8 a) are expanded to a larger space. Note that the dense
area in Figure 3.8 c) is not just a direct expansion of that in Figure 3.8 a). The relative
position among the circuit nodes has been changed during the convergence of the equation
solver. The equation solver also guarantees the minimization of squared wirelength.
The weights between circuit nodes and the dummy nodes are set from their average
weight. Because the weight needed for expansion increases significantly as the
expansion process goes on, we start with a small portion of the average weight, such as 1/10
of the average weight of each node, and gradually increase the weight of the dummy nodes in
later iterations of the expansion process.
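The effect of a dummy node on the linear system can be sketched on a small one-dimensional example. In the quadratic model, a connection of weight w from a movable node to a fixed point p adds w to that node's diagonal entry and w * p to its right-hand side. Everything else below is assumed for illustration: the two-node circuit, the pad positions, the target slots supposedly returned by the mapping step, and the helper names.

```python
def solve_2x2(a, b):
    """Solve a 2x2 linear system by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    x0 = (b[0] * a[1][1] - a[0][1] * b[1]) / det
    x1 = (a[0][0] * b[1] - a[1][0] * b[0]) / det
    return [x0, x1]


def add_dummy_node(a, b, node, target, weight):
    """Connect `node` to a fixed dummy node at coordinate `target`."""
    a[node][node] += weight        # extra connection on the diagonal
    b[node] += weight * target     # pull toward the target coordinate


# Two movable nodes, each tied to fixed pads at 0 and 4 and to each other,
# all with weight 1. Both nodes solve to the same point: an overlap.
A = [[3.0, -1.0], [-1.0, 3.0]]
b = [4.0, 4.0]
before = solve_2x2(A, b)

# Assume the mapping step assigned the nodes to legal slots 1 and 3.
targets = [1.0, 3.0]
avg_weight = 1.0                   # every connection here has weight 1
for node, target in enumerate(targets):
    add_dummy_node(A, b, node, target, 0.1 * avg_weight)
after = solve_2x2(A, b)
```

With one tenth of the average weight the two nodes move only slightly apart around the original solution, which reflects the gradual expansion described above; larger dummy weights in later iterations would pull them further toward their slots.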

Although dummy nodes are necessary for expansion, they also have side effects.
Dummy nodes with small weights work well without causing much interference to the
original circuit. But as the expansion process goes on, the effect of those dummy nodes
accumulates. With enough iterations the weight added by dummy nodes will
become dominant. At that point, although we can still expand our circuit, the squared
wirelength minimization objective of the quadratic model will no longer hold. Usually, we
terminate this expansion process after several iterations, when the improvement from one
iteration is not significant (e.g. less than 10%).

3.5 Linear Adjustment
Besides the overlap problem we discussed above, another big concern about the quadratic
placement algorithm is quality. As we mentioned in Section 3.1, the quadratic placement
algorithm is based on the squared wirelength model. Squared wirelength is only an indirect
measure and is not always proportional to linear wirelength. Further reduction of linear
wirelength is a key problem for all algorithms based on the quadratic model.
Recall from Section 2.3.1 that Half-Perimeter Wire Length (HPWL) is an efficient way to
estimate the wirelength needed to connect the circuit nodes. Our linear adjustment
technique is based on the difference between linear wirelength and squared wirelength
under the HPWL model.
Figure 3.2 shows the basic difference between linear and squared wirelength. In real
circuits, it is more complicated. A general case is presented in Figure 3.9, taken from a
result we get from the quadratic placement model. Node A is connected to net 1, net 2, and
net 3. The nodes in these three nets are represented by rectangles, circles, and triangles
respectively. In this example node A is on the right edge of net 2, on the bottom edge of
net 3 and in the middle of net 1.

Figure 3.9 Wirelength Contribution of One Node Connected to Three Nets

Under the Half-Perimeter Wire Length model, only the nodes on the edges of each net
contribute to the total wirelength. In Figure 3.9, node A contributes to the total
wirelength through net 2 and net 3. So if we move node A to the left, it will reduce the
wirelength of net 2 while not increasing the wirelength of the other two nets until it reaches
the edge of net 3. The same holds if it is moved upwards.
In real circuits, a node might be on different edges of two or more nets, e.g. a node
can be on the left edge of one net and the right edge of another net at the same time. In these
cases, the weights of all involved nets are considered so that the node is moved in the
direction that reduces the total wirelength.
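The situation of Figure 3.9 can be reproduced numerically. The sketch below computes HPWL for three nets sharing node A and compares the total before and after A is moved left. The coordinates and node names are invented for illustration and are not read from the figure; all nets are assumed to have unit weight.

```python
def hpwl(points):
    """Half-perimeter of the bounding box of a list of (x, y) points."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def total_hpwl(nets, positions):
    """Sum of HPWL over all nets; nets maps a net name to its node names."""
    return sum(hpwl([positions[n] for n in members]) for members in nets.values())


nets = {
    "net1": ["A", "r1", "r2"],   # A lies inside the bounding box
    "net2": ["A", "c1", "c2"],   # A is the rightmost node
    "net3": ["A", "t1", "t2"],   # A is the bottom node
}
positions = {
    "A": (5, 1),
    "r1": (2, 0), "r2": (8, 4),
    "c1": (1, 0), "c2": (3, 3),
    "t1": (4, 3), "t2": (7, 5),
}
before = total_hpwl(nets, positions)

# Move A left until it reaches the left edge of net 3 (x = 4).
moved = dict(positions)
moved["A"] = (4, 1)
after = total_hpwl(nets, moved)
```

Here the move shortens net 2 by one unit and leaves nets 1 and 3 unchanged; moving A further left would start to stretch net 3 by as much as net 2 shrinks.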
Every time we perform linear adjustment, we first check the circuit to find all such
movable nodes and their target positions. When this technique is used in stage one of our
algorithm, we add extra dummy nodes at the target positions, and the technique takes
effect by modifying the coefficient matrix and re-solving the equations. When it is used in
stage two, we modify the coordinates of the node directly so that the relative position of the
nodes is changed, and it takes effect in the next mapping process, as shown in Figure
3.10.
[Figure 3.10: two flowcharts. Stage 1 use: find all movable nodes and their target position; add dummy nodes at the target position; modify equations and re-solve them; mapping in stage one. Stage 2 use: find all movable nodes and their target position; change the coordinates to the target position; mapping all nodes; "Improvement?" decision; Stage 3.]

Figure 3.10 Linear Adjustment Used in a) Stage 1, b) Stage 2

3.6 Low Temperature Simulated Annealing
In stage three, we finally improve the placement quality using low temperature
simulated annealing-based [37] [24] iterative improvement. Refer to Chapter 2 for a
description of the basic simulated annealing method as it is applied to placement. We have
adapted the annealing implementation in VPR described in [3].
The key parameters that control the quality and time for simulated annealing are: the
starting temperature $T_0$, the rate at which the temperature is reduced, represented as “α”,
the number of configurations to explore at each temperature, called “InnerNum”, the
behavior of the range limiting mechanism, $D_{limit}$, and the exit criterion by which the
annealing algorithm terminates.
1. The starting temperature $T_0$
The starting temperature $T_0$ is a crucial parameter for our refinement because it
determines the initial probability that a bad move or swap is accepted. If the starting
temperature is set too high, the subsequent annealing will destroy the placement we
have developed in the previous stages. If it is set too low, insufficient
optimization will be performed, the placement might get trapped in a local
minimum and the quality will not be good enough.
In our program we set the starting temperature so that a bad move or swap that
causes 0.1% degradation in quality still has a 0.1% probability of being accepted.
According to the simulated annealing schedule, a bad move or swap is accepted only
when $r < e^{-\Delta C / T}$, where $r$ is a random value in the range [0, 1] and $\Delta C$ is
the quality difference caused by the move. We get

$$T_0 = -\frac{\Delta C}{\ln(r)} = -C \cdot \frac{\Delta C / C}{\ln(r)} = -C \cdot \frac{0.001}{\ln(0.001)} \approx \frac{C}{6907}$$

where $C$ is the cost of the current placement.
2. The number of moves per temperature
The basic annealing algorithm of VPR makes $InnerNum \cdot N_{blocks}^{4/3}$ moves at each
temperature, where $N_{blocks}$ is the total number of logic blocks and I/O pads and
InnerNum is set to 10 by default. Since we already have a good placement from the
first two stages of our algorithm, we find by experiment that it does not make much
difference for InnerNum greater than 3. In our program we set it to 4.

3. The move limit
In VPR, the move limit $D_{limit}$ is the distance that a node is free to move during the
annealing process. It always starts with the largest value, so that $D_{limit}$ covers the
whole chip. In our algorithm, the placement we get from the previous stages is good enough
that it is unlikely that a node should be moved such a long distance to improve the quality.
By observing the annealing procedure, we set this value to one-fifth of the whole chip
range.
4. Other parameters
Modifying other parameters, such as the exit criterion and the temperature update factor,
also slightly affects the annealing process. We did not change these parameters.
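The starting temperature rule in item 1 can be checked with a few lines of code. The sketch below is not taken from VPR or from the QPF program; the function names and the sample cost are hypothetical, and only the acceptance rule and the 0.1% figures come from the text.

```python
import math


def starting_temperature(cost, rel_degradation=0.001, accept_prob=0.001):
    """Temperature at which a move that worsens the cost by
    rel_degradation * cost is accepted with probability accept_prob."""
    delta_c = rel_degradation * cost
    return -delta_c / math.log(accept_prob)


def acceptance_probability(delta_c, temperature):
    """Probability that a random r in [0, 1] satisfies r < exp(-delta_c / T)."""
    return math.exp(-delta_c / temperature)


cost = 1000.0                      # illustrative placement cost
t0 = starting_temperature(cost)
p_accept = acceptance_probability(0.001 * cost, t0)
```

The ratio `cost / t0` evaluates to about 6907.76, which agrees with the constant quoted above.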

3.7 Computational Complexity
To evaluate the computational complexity of our QPF algorithm, we have to consider
all the basic operations used by our algorithm. They are: the mapping process, the linear
adjustment process, the building and solving of the linear equations, and the low
temperature simulated annealing process.
The time complexity of simulated annealing is straightforward. Since the number of
moves per temperature is $InnerNum \cdot N_{blocks}^{4/3}$, the complexity of stage three is
$O(N_{blocks}^{4/3})$ [3].
In the mapping process, we map the circuit by recursively partitioning the chip area into
four parts. So the computational complexity of the mapping process is

$$O(N_{CLB}) + 4 \times O\left(\frac{N_{CLB}}{4}\right) + 4^2 \times O\left(\frac{N_{CLB}}{4^2}\right) + \ldots + 4^n \times O\left(\frac{N_{CLB}}{4^n}\right)$$

and the total number of partitioning levels is $n = \log_2 \sqrt{N_{CLB}}$, where
$\sqrt{N_{CLB}}$ is the width of the CLB array of the FPGA. So the computational
complexity of the mapping process becomes

$$O(N_{CLB}) + 4 \times O\left(\frac{N_{CLB}}{4}\right) + 4^2 \times O\left(\frac{N_{CLB}}{4^2}\right) + \ldots = O(N_{CLB} \cdot \log_2 \sqrt{N_{CLB}}) = O(N_{CLB} \cdot \ln N_{CLB})$$

For linear adjustment, we check every node for a better position. Since in real circuits the
average fanin and fanout of a node are limited, we only need to search a limited number $k$
of other nodes to determine the new position of the current node. This operation is about
$O(k \cdot N_{CLB}) = O(N_{CLB})$.
To determine the computational complexity of the quadratic placement, we must
consider the computational work of both building and solving the linear equations.
Building the linear equations is simple. We check all nodes in the circuit and compute
the weight for the nodes connected to them. The complexity is $O(nnz)$, where $nnz$ is
the number of nonzero elements in the matrix. When we modify the coefficient matrix, we
only add one dummy node to each node, so the complexity is $O(N_{CLB})$.
The complexity of the equation solver is more complicated. Basically, one iteration of
the Conjugate Gradient method contains 3 $N_{CLB}$-dimension vector additions, 2
$N_{CLB}$-dimension vector multiplications and 1 matrix-vector multiplication. The
complexity of one iteration is $O(N_{CLB}) + O(N_{CLB}) + O(nnz) = O(nnz)$. The difficulty
is that the convergence rate of the Conjugate Gradient method is not easy to determine. In
[36] the convergence rate of the Conjugate Gradient method without preconditioning is given
as $O(h^{-1})$, where $h$ is the discretization mesh width [39], and it depends on the
spectrum of the coefficient matrix involved [40]. So the total complexity of the quadratic
equation solver is $O(nnz) \cdot O(h^{-1})$.
For a real circuit, $O(N_{CLB}) = O(N_{blocks}) = O(nnz) \equiv O(N)$, since the number of
I/O pads and the average fanout are usually limited and independent of the circuit size. In
[41] it is said that the computational work of unpreconditioned Conjugate Gradients ranges
from $O(N^{4/3})$ to $O(N^{3/2})$, depending on the uniform grid used. With the help of a
preconditioner, an additional matrix that transforms the original system into an equivalent
one with an improved spectrum, the convergence of the Conjugate Gradient method can be
accelerated. So we believe that the complexity of our algorithm is somewhat lower than the
range $[O(N^{4/3}), O(N^{3/2})]$. It is comparable to VPR, whose complexity is $O(N^{4/3})$.
Based on the discussion above, the linear equation solver is the most computationally
expensive operation. The overall computational complexity of our placement algorithm is
the same as that of the equation solver, whose computational complexity is comparable to
VPR.
Since the computational complexity only shows how the running time of the algorithm
grows as the input circuit gets larger, it does not directly determine the speed of the
algorithm. In this case, VPR performs swaps and moves at a large number of temperatures,
and each temperature is an $O(N^{4/3})$ operation, while the proposed algorithm only
needs several iterations of such an operation. So the proposed algorithm can run much
faster than VPR although both algorithms have similar computational complexity.
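The per-temperature cost mentioned above can be illustrated with a short sketch. This is a hedged illustration, not VPR source code: it assumes that the number of moves attempted at each temperature is InnerNum times $N^{4/3}$, and the function name and the default value of 10 are assumptions made for the example.

```python
def moves_per_temperature(num_blocks, inner_num=10.0):
    """Moves attempted at one annealing temperature, assumed to be InnerNum * N^(4/3)."""
    return int(inner_num * num_blocks ** (4.0 / 3.0))


# Doubling the circuit size multiplies the work per temperature by about 2^(4/3), i.e. ~2.5.
small = moves_per_temperature(2000)
large = moves_per_temperature(4000)
growth = large / small
```

The total annealing time is this count multiplied by the number of temperatures visited, which is why a method that needs only a few operations of similar cost can be faster in practice.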

3.8 Summary
In this chapter, we first examined the problems with the general quadratic placement
algorithm. Then we presented QPF, our proposed placement algorithm for FPGAs, and
explained how we handled these problems. The QPF placement algorithm consists of three
stages. It begins in stage 1 with a normal quadratic placement method and gradually
expands the initial placement through a loop of equation modification and solving. During
the expansion process, special care is taken to ensure that linear wirelength is minimized
while the squared wirelength objective is satisfied as well. Then a linear wirelength
reduction technique is applied in stage 2 to improve the quality of the previous result. Low
temperature simulated annealing is used to finally refine the placement in stage 3. Details
of all three stages are discussed in sections 3.3 to 3.6. Finally, the computational
complexity of the QPF algorithm is discussed in section 3.7. The QPF algorithm was
evaluated and compared to the state-of-the-art VPR placement tool using 20 benchmark
circuits. The experimental results are presented in the next chapter.

Chapter 4
Experimental Results and Analysis

In this chapter, we present the experimental results that demonstrate the quality and
efficiency of the QPF placement algorithm. We first present experimental results showing
the effects of the different techniques employed in our algorithm on the quality of the
placement obtained. These techniques were discussed in detail in Chapter 3. Then a
comparison is made between QPF and VPR, a well-known high-quality FPGA placement tool.
The quality versus run time tradeoff for QPF is presented later in this chapter. We also
describe our attempt at enhancing QPF to make it a timing-driven placement tool and present
encouraging preliminary results. Before any of these, we describe the experimental
evaluation environment that was used to obtain the results.

4.1 Experimental Evaluation Environment
All the experiments were performed under an identical hardware environment. We tested
both our QPF placement tool and VPR on a Pentium 2.5 GHz based computer with no other
application program running.
The benchmark circuits we used in our experiments were obtained from two sources:
fifteen of the circuits originate from the Microelectronics Centre of North Carolina
(MCNC) suite [42], and five circuits were generated by GEN [43], a synthetic
benchmark circuit generation tool that has been used in FPGA research. All circuits were
originally in the blif format. They were optimized by SIS [44] and technology mapped into
4-LUTs using FlowMap [6]. Then we used VPACK [3] to pack the netlist of 4-LUTs and
flip-flops into basic logic blocks. The sizes of benchmark circuits range from 2000 to
20,000 logic blocks.
We have implemented our QPF placement tool within the framework of VPR, using the
most recent version, 4.30, of the VPR code. We also use exactly the same function to estimate
the bounding box wirelength. Run time is measured using the CPU clock and then
converted to seconds. The run time of both tools is measured from the very
beginning to the end of placement, including the time to read the initial input files. Since
our algorithm is written within the framework of VPR, extra time is used to convert the data
structures from VPR’s to ours. If only the time for placement is considered, the QPF
placement tool runs slightly faster than what is reported in this chapter.

4.2 Effects of Key Parameters and Techniques on Placement Quality
Recall that in Chapter 3 we explained the techniques used in QPF to handle the
overlap problem and how wirelength minimization was achieved by using the
expansion and linear adjustment techniques. In this section, we present experimental
results obtained using these techniques and show how they affect the placement quality.

4.2.1 Compensation Factor
As we discussed in Chapter 2, the compensation factor q(i) is used to compensate for the
linear wirelength that the half perimeter wirelength model under-estimates for large nets.
Table 4.1 shows the improvement in wirelength when q(i) is applied in the quadratic model
over the original quadratic placement. To prevent the extreme situation in which most nodes
overlap in a very small region, we do not increase the compensation factor for
nets larger than 50 terminals when building linear equations. This is done by setting
q(i) = q(49) for all i >= 50.
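The capping rule can be sketched as follows. This is a minimal illustration under stated assumptions: the table `Q_TABLE` holds placeholder values, not the q(i) values used in the thesis, and the convention that a net with k terminals uses index i = k - 1 is inferred from the q(49) cap rather than stated in the text. All names are hypothetical.

```python
# Placeholder compensation table: illustrative values only, NOT the thesis's q(i).
Q_TABLE = [1.0 + 0.03 * max(0, i - 2) for i in range(50)]


def compensation_factor(i):
    """Return q(i), holding the factor at q(49) for every i >= 50."""
    return Q_TABLE[min(i, 49)]


def compensated_hpwl(terminals):
    """Half perimeter wirelength of one net, scaled by its compensation factor.

    `terminals` is a list of (x, y) positions. Using i = len(terminals) - 1 as
    the table index is an assumed convention.
    """
    xs = [x for x, _ in terminals]
    ys = [y for _, y in terminals]
    half_perimeter = (max(xs) - min(xs)) + (max(ys) - min(ys))
    return compensation_factor(len(terminals) - 1) * half_perimeter
```

With the cap in place, a 200-terminal net is weighted exactly like a 50-terminal net, which limits how strongly very large nets pull their nodes together.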

The first column shows the circuit name. The second and third columns show the
number of blocks and the number of CLBs in the circuit, respectively. The bounding box
wirelength without and with the compensation factor is shown in columns 4 and 5. Finally,
the wirelength reduction is shown in column 6.
We can observe from Table 4.1 that after the compensation factor is applied to the
quadratic model, we have a better estimate of the wirelength for all nets and we get
about 20% reduction in total linear wirelength on average.
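The reduction percentages in Table 4.1 appear to be computed relative to the improved wirelength, that is, (before - after) / after. This reading is inferred from the tabulated values rather than stated in the text; the short sketch below checks it against the alu4 row (the function name is illustrative).

```python
def wirelength_reduction(before, after):
    """Reduction in percent, measured relative to the improved wirelength."""
    return 100.0 * (before - after) / after


# alu4 row of Table 4.1: 405.86 without and 278.00 with the compensation factor.
alu4_reduction = round(wirelength_reduction(405.86, 278.00), 1)
```

Measured this way, a reduction of 46.0% means the uncompensated wirelength is 46.0% longer than the compensated one.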

4.2.2 Expansion
In the QPF algorithm, we deal with the overlap problem of quadratic placement by
performing an expansion process in stage 1, as shown in Figure 3.3. We apply the iterative
process discussed in section 3.4 so that all the nodes are spread out gradually, while at the
same time the wirelength minimization objective is achieved as well.
To show the exact effect of the expansion technique, we perform one iteration of the
expansion process on the basic quadratic placement modified by the
compensation factor. The result is shown in Table 4.2. We achieve about 21%
improvement in quality from the first iteration of the expansion process. When we perform
more such iterations, the rate of improvement falls off.

4.2.3 Linear Adjustment
Similarly, we show the effect of the linear adjustment by applying one linear
adjustment step to the previous result.
The experimental results are shown in Table 4.3. We can see that the first linear
adjustment step achieves about 15% improvement over the previous result. Since
this linear adjustment is performed on a legal placement and does not need to modify or
solve the matrix equation, it runs very fast. We can repeat it as long as more improvement
is possible.

4.2.4 Starting Temperature
We perform low temperature simulated annealing to finally refine the placement in the
QPF algorithm. As we discussed in Chapter 3, the starting temperature should be carefully
selected. We only want to accept a reasonable number of bad moves during the annealing
process, so that enough optimization is tried while making sure that we do not destroy the
placement obtained in the previous stages.
A high starting temperature can produce a high quality placement, but it also
takes a long time to finish the annealing process, because at a high temperature the
annealing process accepts so many bad moves that it almost breaks up the existing placement
and places it again. In extreme cases, the high starting temperature destroys everything
and treats the placement obtained from the previous stages as a random one.
In this section we compare our adaptive starting temperature with a fixed but lower
starting temperature. We run our placement tool with $T_0 = C/6907$, as discussed in
Chapter 3, and with $T_0 = 0.01$, a fixed and lower starting temperature. The results are
presented in Table 4.4 and show that the fixed starting temperature works well for some
circuits. On the whole, the adaptive starting temperature produces about 7% wirelength
reduction across all 20 benchmark circuits.
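The two starting temperatures compared in this section can be written down directly, together with the usual simulated annealing acceptance test. This is a minimal sketch: the Metropolis rule shown is the textbook rule, assumed here rather than taken from the QPF code, and the function names are hypothetical.

```python
import math
import random


def starting_temperature(cost, adaptive=True):
    """Adaptive T0 = C / 6907, or the fixed T0 = 0.01 used for comparison."""
    return cost / 6907.0 if adaptive else 0.01


def accept_move(delta_cost, temperature, rng=random):
    """Textbook Metropolis test: always accept improvements, sometimes accept bad moves."""
    if delta_cost <= 0:
        return True
    return rng.random() < math.exp(-delta_cost / temperature)
```

Because the adaptive temperature is proportional to the placement cost C, a given cost increase is accepted more readily on a circuit with a larger C; the fixed value 0.01 ignores circuit size.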

Table 4.1. The Wirelength Comparison of Original Quadratic Placement
Before and After the Compensation Factor is Applied

                                 Bounding box wirelength
Circuit    Number of  Number of  Without         With            Wirelength
Name       blocks     CLBs       compensation    compensation    reduction
                                 factor          factor
alu4       1544       1522       405.86          278.00          46.0%
Apex2      1919       1878       557.42          504.11          10.6%
Apex4      1290       1262       340.98          312.06          9.3%
Bigkey     2133       1707       458.95          370.18          24.0%
Clma       8527       8383       3560.04         3183.64         11.8%
Des        2092       1591       562.82          563.25          -0.1%
Elliptic   3849       3735       1106.72         875.60          26.4%
Ex1010     4618       4598       2201.81         1420.31         55.0%
Ex5p       1135       1064       280.77          259.44          8.2%
Frisc      3692       3556       1235.92         1138.92         8.5%
Misex3     1425       1397       375.26          321.53          16.7%
Pdc        4631       4575       1833.91         1646.49         11.4%
S38417     6541       6406       2127.30         1258.72         69.0%
S38584     6789       6447       2155.01         1449.24         48.7%
Spla       3752       3690       1376.16         1219.26         12.9%
Fcom1      10307      9986       6445.90         6020.18         7.1%
Fcom2      11426      10984      6479.99         5396.21         20.1%
Fsm5       12104      11852      6793.37         6790.22         0%
Fcom3      13063      12962      5065.75         4446.27         13.9%
Fsm6       20051      19903      14779.78        13117.56        12.7%
Average                                                          20.6%

Table 4.2 The Effect of One Iteration of Expansion Process on Placement Quality

                                 Bounding box wirelength
Circuit    Number of  Number of  Original        After one       Wirelength
Name       blocks     CLBs       quadratic       iteration of    reduction
                                 placement       expansion
Alu4       1544       1522       278.24          249.43          11.6%
Apex2      1919       1878       504.22          445.82          13.1%
Apex4      1290       1262       312.06          268.90          16.1%
Bigkey     2133       1707       370.18          368.67          0.4%
Clma       8527       8383       3183.64         2374.73         34.1%
Des        2092       1591       563.25          543.83          3.6%
Elliptic   3849       3735       875.60          819.62          6.8%
Ex1010     4618       4598       1420.31         976.48          45.5%
Ex5p       1135       1064       259.44          238.80          8.6%
Frisc      3692       3556       1138.92         1004.54         13.4%
Misex3     1425       1397       321.53          283.83          13.3%
Pdc        4631       4575       1646.49         1425.79         15.5%
S38417     6541       6406       1258.72         993.94          26.6%
S38584     6789       6447       1449.24         1166.55         24.2%
Spla       3752       3690       1219.26         1021.25         19.4%
Fcom1      10307      9986       6020.18         4288.48         40.4%
Fcom2      11426      10984      5396.21         4023.05         34.1%
Fsm5       12104      11852      6790.22         4707.32         44.2%
Fcom3      13063      12962      4446.27         3498.13         27.1%
Fsm6       20051      19903      13117.56        10285.94        27.5%
Average                                                          21.3%

Table 4.3 Effect of One Linear Adjustment Step on Placement Quality

                                 Bounding box wirelength
Circuit    Number of  Number of  Without linear  With linear     Wirelength
Name       blocks     CLBs       wirelength      wirelength      reduction
                                 adjustment      adjustment
Alu4       1544       1522       249.43          227.64          9.6%
Apex2      1919       1878       445.82          397.57          12.1%
Apex4      1290       1262       268.90          238.01          13.0%
Bigkey     2133       1707       368.67          351.91          4.8%
Clma       8527       8383       2374.73         2027.23         17.1%
Des        2092       1591       543.83          466.62          16.5%
Elliptic   3849       3735       819.62          717.72          14.2%
Ex1010     4618       4598       976.48          855.84          14.1%
Ex5p       1135       1064       238.80          212.38          12.4%
Frisc      3692       3556       1004.54         815.03          23.3%
Misex3     1425       1397       283.83          248.22          14.3%
Pdc        4631       4575       1425.79         1212.01         17.6%
S38417     6541       6406       993.94          889.13          11.8%
S38584     6789       6447       1166.55         1018.39         14.5%
Spla       3752       3690       1021.25         837.14          22.0%
Fcom1      10307      9986       4288.48         3499.59         22.5%
Fcom2      11426      10984      4023.05         3371.54         19.3%
Fsm5       12104      11852      4707.32         3921.53         20.0%
Fcom3      13063      12962      3498.13         2940.22         19.0%
Fsm6       20051      19903      10285.94        8837.39         16.4%
Average                                                          15.7%

Table 4.4 Effect of Different Starting Temperatures on Placement Quality

                                 Bounding box wirelength
Circuit    Number of  Number of                                  Wirelength
Name       blocks     CLBs       T0=0.01         T0=C/6907       reduction
Alu4       1544       1522       198.16          195.72          1.2%
Apex2      1919       1878       315.43          277.22          13.8%
Apex4      1290       1262       195.84          186.96          4.7%
Bigkey     2133       1707       324.09          317.75          2.0%
Clma       8527       8383       1629.29         1491.26         9.3%
Des        2092       1591       401.43          390.33          2.8%
Elliptic   3849       3735       596.40          551.65          8.1%
Ex1010     4618       4598       702.65          665.03          5.7%
Ex5p       1135       1064       178.31          173.87          2.6%
Frisc      3692       3556       628.39          564.03          11.4%
Misex3     1425       1397       206.51          196.72          5.0%
Pdc        4631       4575       975.73          914.00          6.8%
S38417     6541       6406       755.07          699.87          7.9%
S38584     6789       6447       867.37          823.58          5.3%
Spla       3752       3690       698.25          641.88          8.8%
Fcom1      10307      9986       2770.53         2491.26         11.2%
Fcom2      11426      10984      2671.69         2440.23         9.5%
Fsm5       12104      11852      3171.18         2850.71         11.2%
Fcom3      13063      12962      2392.17         2113.42         13.2%
Fsm6       20051      19903      7317.3          6799.44         7.6%
Average                                                          7.4%

4.3 Comparison Between QPF and VPR
In this section we present the placement results obtained using QPF and compare them
to placement results obtained using VPR. Since quadratic placement uses fixed I/O pads
when building the linear equations, we first place the I/O pads of the benchmark
circuits randomly and then feed these physical I/O pad locations to both QPF and VPR.
We ran the VPR placer in its default mode, which is well tuned to give the best
result. Since we only compare wirelength, the VPR placer was set to run in
wirelength-driven mode so that it achieves better wirelength quality. We ran QPF with the
default parameter settings that were discussed in Chapter 3. The placement wirelength and
the CPU run time are compared between these two tools. A comparison of results obtained
using QPF and VPR is shown in Table 4.5.
The first column shows the circuit name. The second and third columns show the
bounding box wirelength obtained for each circuit by QPF and VPR, and the fourth column
shows the QPF/VPR wirelength ratio. Similarly, columns five and six show the run time
(given in seconds) of QPF and VPR, and column seven shows the VPR/QPF run time ratio.
From Table 4.5 we can observe that, for each circuit, the wirelength obtained by QPF
is very close to that obtained by VPR. On average, across the 20 benchmark circuits, the
wirelength obtained by QPF is only 0.8% longer than that of VPR. QPF gives a better run
time, however: on average, across the 20 benchmark circuits, QPF runs about 5X
faster than VPR. Thus we were able to achieve the goal that motivated our work, i.e. to
achieve comparable placement quality while improving the placement run time.
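The two ratio columns of Table 4.5 can be reproduced from the raw wirelength and run time values. The sketch below uses three rows copied from the table; the data layout and function name are illustrative only.

```python
# (circuit, QPF wirelength, VPR wirelength, QPF time in s, VPR time in s), from Table 4.5
ROWS = [
    ("alu4", 195.72, 193.41, 6.53, 33.32),
    ("Clma", 1491.26, 1465.18, 255.45, 1113.07),
    ("Fsm6", 6799.44, 6763.13, 1959.39, 6922.34),
]


def ratios(row):
    """Return (QPF/VPR wirelength ratio, VPR/QPF run time ratio) for one circuit."""
    _, wl_qpf, wl_vpr, t_qpf, t_vpr = row
    return wl_qpf / wl_vpr, t_vpr / t_qpf


for row in ROWS:
    wl_ratio, speedup = ratios(row)
    print(f"{row[0]:8s} wirelength ratio {wl_ratio:.3f}  speedup {speedup:.2f}")
```

A wirelength ratio above 1 means QPF produced the longer placement, while a run time ratio above 1 means QPF was faster, so the two columns point in opposite directions by construction.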

Table 4.5 Comparison of Placement Results Obtained by QPF and VPR

           Bounding box wirelength          Time Consumption (s)
Circuit    QPF        VPR        Ratio      QPF        VPR        Ratio
Name                             QPF/VPR                          VPR/QPF
alu4       195.72     193.41     1.012      6.53       33.32      5.10
Apex2      277.22     275.99     1.004      7.87       50.57      6.43
Apex4      186.96     183.47     1.019      4.12       27.01      6.56
Bigkey     317.75     311.14     1.021      9.90       49.35      4.98
Clma       1491.26    1465.18    1.018      255.45     1113.07    4.36
Des        390.33     385.31     1.013      7.96       49.12      6.17
Elliptic   551.65     547.41     1.008      63.40      164.89     2.60
Ex1010     665.03     661.88     1.005      55.67      281.21     5.05
Ex5p       173.87     173.96     0.999      3.21       23.32      7.26
Frisc      564.03     552.89     1.020      37.53      162.78     4.34
Misex3     196.72     193.13     1.019      4.50       29.53      6.56
Pdc        914.00     905.91     1.009      60.65      292.85     4.83
S38417     699.87     722.71     0.968      158.92     680.84     4.28
S38584     823.58     802.43     1.026      218.29     716.60     3.28
Spla       641.88     635.32     1.010      33.25      167.42     5.04
Fcom1      2491.26    2481.15    1.004      433.34     1946.9     4.49
Fcom2      2440.23    2425.86    1.006      481.81     2209.25    4.59
Fsm5       2850.71    2871.94    0.993      629.90     2736.96    4.35
Fcom3      2113.42    2111.55    1.001      642.23     3101.96    4.83
Fsm6       6799.44    6763.13    1.005      1959.39    6922.34    3.53
Average                          1.0081                           4.932

[Plot "Speed Comparison": vertical axis scaled from 0 to 8; horizontal axis: number of
blocks of benchmark circuits, from 0 to 25000]

Figure 4.1 CPU Time Comparison vs. Number of Blocks

To further analyze the performance of our placement tool, we plotted the CPU time,
normalized with respect to VPR, against circuit size, given by the number of blocks in the
circuit. The plot is shown in Figure 4.1. We notice that the speedup decreases as the size of
the input circuit increases. Nevertheless, we get a significant speedup even for the largest
circuit (more than 3X). In order to know what exactly affects speed, we first analyze the
time and quality relationship of the three stages in our QPF placement tool. The comparison
of placement quality versus run time of QPF is shown in Figure 4.2. The values for this plot
were obtained by averaging across the 20 benchmark circuits. We select four typical points
in our algorithm: the first legal placement obtained from the quadratic model, the output of
stage 1, the output placement of stage 2, and the final placement. In Figure
4.2, we also show the random placement for comparison. The random placement takes
almost no time, but its wirelength is about 300% worse than that of the final placement. It
takes about 8% of the total time to finish the initial placement based on the modified
quadratic model, and the wirelength penalty is reduced to about 65%. Then the expansion
technique reduces the penalty to about 45% with another 9% of the total time. The linear
adjustment of stage 2 spends another 11% of the total time to further improve the wirelength
to about 18% longer than the final placement. Finally, the remaining 72% of the total time
is used by the low temperature simulated annealing process to remove the last 18% of
wirelength.
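The time shares quoted above can be accumulated to check that they account for the whole run and match the 8%, 17% and 28% marks on the time axis of Figure 4.2. This is a small arithmetic sketch; the stage labels are paraphrased and the variable names are illustrative only.

```python
from itertools import accumulate

# Share of total QPF run time per step, in percent, as quoted in the text.
STAGE_TIME = [
    ("Stage 1, basic quadratic placement", 8),
    ("Stage 1, iterative expansion", 9),
    ("Stage 2, linear adjustment", 11),
    ("Stage 3, low temperature simulated annealing", 72),
]

# Running total of elapsed time after each step.
cumulative = list(accumulate(share for _, share in STAGE_TIME))

for (label, _), elapsed in zip(STAGE_TIME, cumulative):
    print(f"{elapsed:3d}% of run time elapsed after: {label}")
```

The first three steps together use only 28% of the run time, which is why the tradeoff discussion concentrates on the final annealing stage.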

[Plot "Quality Vs Time Consumption in QPF": vertical axis scaled from 50% to 350%;
horizontal axis: time consumption, marked at 8%, 17%, 28%, 50% and 100%; labelled points:
Stage 1 Basic QP, Stage 1 Iterate Improvement, Stage 2 Linear Adjustment,
Stage 3 Low Temperature SA]

Figure 4.2. The Relationship between Normalized Wirelength Penalty and Time
Consumption in QPF Algorithm
From Figure 4.2, we find that most of the run time of our placement tool is used by
the last refinement stage, which is the low temperature simulated annealing process. We
believe the last refinement step is the reason that the speedup of our placement tool is a
little smaller for larger circuits in Figure 4.1. As discussed earlier, the last refinement step
uses the simulated annealing technique, and its most important parameter is the starting
temperature $T_0$. Recall that we set it to $T_0 = C/6907$ in Chapter 3. For larger circuits,
the cost $C$ is usually larger, and so is the starting temperature. This results in more time
spent in the last step. Although experiments suggest that we could add a factor to our
algorithm to make it work better for larger circuits, this factor should be carefully selected
by testing a large number of circuits. In our algorithm we still use $T_0 = C/6907$ as the
starting temperature, and all the experimental data is based on it.

4.4 Quality and Time Tradeoff
In this section, we present the quality versus run time tradeoff for QPF and compare
it with that of the VPR placement tool.
It is straightforward to perform the tradeoff test for VPR because both the time and the
quality are determined by InnerNum, which sets the number of moves per temperature. For
our placement tool, it is different. As discussed in Section 4.3, the performance of our
placement tool is determined by three stages, and each stage uses a different technique. It
is not likely that there is an exact quality and time tradeoff over all these stages. Since the
last stage of our algorithm also requires most of the run time, as shown in Figure 4.2, we
test the quality versus time tradeoff based on the last stage. By setting different InnerNum
values for the simulated annealing part of our placement algorithm, we obtained a similar
quality versus run time tradeoff for QPF and compared it with VPR, as shown in Figure 4.3.
Both algorithms can achieve faster speed at the cost of a few percent of quality penalty,
but VPR has a better tradeoff curve than QPF. When we focus on better quality, QPF
achieves the same quality at a much faster speed than VPR. When both algorithms are
about 10 times faster, the two tools have almost the same performance. If we want to run
even faster, VPR is a better choice.
The reason is that we perform the tradeoff only in the last stage of our algorithm, and
the maximum speedup we can achieve is about 15 times, which means that we do not use
stage 3 at all. The deeper reason behind this lies in the quadratic model. For any
placement tool based on the quadratic technique, the basic operation is the equation solver.
We have to modify and solve the linear equations to get a better placement, and this basic
operation takes a lot of time. For simulated annealing based algorithms, the basic
operation is one move of a node, which can be done in negligible time.

[Plot "Quality Time Tradeoff Comparison": two curves, VPR and QPF; vertical axis scaled
from 0.000 to 12.000; horizontal axis: Quality Penalty, from 0.01 to 0.05]

Figure 4.3 Quality vs. Time Tradeoff Comparisons between QPF and VPR

4.5 Critical Path Delay Comparison with VPR
Although we mainly focus on the wirelength minimization objective in this
thesis, we also wanted to get an idea of the timing performance of our placement tool.
Through experiments, we found that the maximum critical path delay obtained by
QPF is almost the same as that of VPR in wirelength-driven mode. Both are about 25% longer
than that of timing-driven VPR (TVPR). We used 20 MCNC circuits for the comparison.
We enhanced QPF by applying a timing-driven refinement in stage 3 of our placement tool
(we call it TQPF) and compared the critical path delay with timing-driven VPR. The results
for wirelength, critical path delay and run time are shown in Table 4.6. We find that with a
timing-driven refinement in stage 3, our placement tool achieves about 5%
improvement in critical path delay over TVPR at the penalty of only 1.4% longer
wirelength. TQPF still gives a significant run time speedup over TVPR (about 3X).

Table 4.6 Comparisons between QPF with Timing-driven Refinement and Timing-driven
VPR

           Critical Path Delay (ns)      Total wire length             Time
Circuit    TQPF      TVPR      Ratio     TQPF      TVPR      Ratio     TQPF      TVPR      Ratio
                               TQPF/                         TQPF/                         TVPR/
                               TVPR                          TVPR                          TQPF
Alu4       101.67    94.75     1.073     202.00    201.41    1.003     24        68        2.83
Apex2      105.98    127.00    0.835     285.52    281.13    1.016     39        125       3.21
Apex4      100.24    105.70    0.948     197.51    194.13    1.017     16        47        2.94
Bigkey     82.17     83.88     0.980     317.58    318.31    0.998     48        132       2.75
Clma       199.75    238.04    0.839     1563.27   1544.2    1.012     952       2697      2.83
Des        119.45    115.35    1.035     400.01    395.66    1.011     44        124       2.82
Diffeq     62.44     70.40     0.887     180.83    180.40    1.002     29        82        2.83
Dsip       70.98     81.56     0.870     312.25    310.61    1.005     33        89        2.70
Elliptic   116.57    116.21    1.003     617.49    614.85    1.004     215       555       2.58
Ex1010     199.21    201.08    0.991     681.21    678.30    1.004     288       868       3.01
Ex5p       103.68    100.31    1.034     189.24    186.66    1.014     13        39        3.00
Frisc      134.85    136.89    0.985     654.05    627.49    1.042     205       585       2.85
Misex3     98.08     100.96    0.971     201.91    199.47    1.012     19        56        2.95
Pdc        183.36    217.89    0.842     986.73    955.37    1.033     323       975       3.02
S298       132.51    142.73    0.928     233.32    229.07    1.019     43        107       2.49
S38417     100.63    99.46     1.012     763.44    725.79    1.052     591       1540      2.61
S38584.1   183.92    193.19    0.952     856.33    852.81    1.004     607       1716      2.83
Seq        105.19    116.54    0.903     281.19    274.22    1.025     37        103       2.78
Spla       158.65    179.39    0.884     671.72    681.63    0.985     187       615       3.29
Tseng      59.09     56.73     1.042     126.95    124.94    1.016     13        40        3.08
Average                        0.9506                        1.014                         2.87

The reason for this also comes from the quadratic model. As we discussed in Chapter
3, the quadratic model tends to suppress very long and very short wires, as in Figure 3.2.
This property gives a good timing characteristic to the coarse placement that is obtained in
stage 1 of QPF. When we use the timing-driven simulated annealing refinement in stage 3,
this characteristic is preserved. More discussion about timing-driven quadratic placement is
presented in the next chapter.

4.6 Summary
In this chapter, we presented the experimental results obtained from our QPF
placement algorithm. We first described the experimental evaluation environment in
Section 4.1. Then, in Section 4.2, we presented and analyzed experimental data showing the
effects of the different techniques used in QPF on placement quality. A comparison of QPF
with VPR and related analysis was presented in Section 4.3. The tradeoff of quality versus
run time was discussed and compared with VPR in Section 4.4. This was followed by the
critical path delay comparison between TQPF and TVPR in Section 4.5. The experimental
results show that QPF runs faster for wirelength-driven placement while providing
comparable quality, and that TQPF gives better critical path delay and faster run time
than TVPR.
In the next chapter, we will summarize the main contributions of this thesis work and
discuss open problems that need to be explored in future research on this topic.

Chapter 5
Conclusions and Future Work

5.1 Conclusions and Contributions
In this thesis we presented QPF, an efficient placement algorithm for FPGAs. It
combines the quadratic placement algorithm and the simulated annealing placement
algorithm to achieve both high quality and fast speed. We incorporated multiple iterations
of the equation solving process together with linear wirelength reduction techniques in our
algorithm, which helps to alleviate the problems caused by the quadratic model. We refine
our placement with a low temperature simulated annealing technique, which helps to
produce a high quality result. By applying these techniques intelligently, our placement
algorithm takes advantage of both quadratic and simulated annealing placement algorithms
and produces a fast and high quality placement for FPGAs.
The first contribution of this work is the exploration of the quadratic technique in the
FPGA placement area. The quadratic placement technique has been used in the
ASIC area for a long time, but because of the different architectures, ASIC placement tools
cannot be directly used for FPGA placement. To our knowledge, our work is the first
investigation of quadratic placement in the area of FPGAs.
The second contribution of this work is the use of the compensation factor in the quadratic
model. Half perimeter wirelength is an efficient model for estimating the wirelength of a
placement, but it usually underestimates large nets. The old quadratic model does not
consider this situation. By applying the compensation factor when building the linear
equations, the solution of the linear equations carries the information of the bounding box
wirelength and is closer to the ideal placement.
The third contribution of this work is the iterative expansion technique. One of the
main problems with quadratic placement is that nodes overlap in the solution of the linear
equations. We handle this problem with an iterative approach. Each iteration removes a
little of the overlap according to the reference placement obtained in the previous iteration.
This technique avoids adding artificial information to the circuit and preserves the original
relationships between nodes. In each iteration, since we get new coordinate information by
re-building and re-solving the linear equations, the original squared wirelength objective is
always minimized as well.
We also give a linear adjustment technique to reduce the wirelength based on the half
perimeter wirelength model. The minimization of total linear wirelength is the final
objective of a wirelength-driven placement algorithm. Even for timing-driven placement,
linear wirelength is an important factor. We studied the node distribution in practical
placements and proposed the linear wirelength reduction technique. It is very fast, and
since it only uses the position information of the circuit nodes, it can be applied to any
existing placement.
Finally, an attempt was made to extend our wirelength-driven placement algorithm to a
timing-driven placement algorithm. Preliminary results are very promising.

5.2 Future Work
Our work is the first attempt to apply the quadratic placement technique in the FPGA
placement area. There are many things within this topic that need to be explored
thoroughly.
The most interesting topic that should be studied is timing-driven placement based
on the quadratic technique, because the circuit delay is another important objective for a
placement algorithm, and much work remains to be done in this area. A direct approach that
can be built on our work is to add timing information to our linear adjustment
technique. In our work, we further reduce the wirelength by searching all nodes of the input
circuit, finding the movable nodes and moving them to better positions. In the current
algorithm we do this focusing only on wirelength reduction. If we perform timing
analysis before this and take the timing information into account, it is possible to reduce the
wirelength as well as the circuit delay during this process. More research is needed to find
out how to incorporate the timing information into the quadratic model.
Another interesting area to pursue is to further improve the quality of the quadratic
placement technique. Since the quadratic model uses an indirect approach to minimize the
wirelength objective, how to improve the quality of the placement is always the main
problem. Although researchers have proposed different techniques, the quality obtained is
not as good as that of simulated annealing based algorithms. If we can achieve similar
quality without the simulated annealing process, the run time can be further reduced.
Finally, since the quadratic placement technique involves a linear equation solver,
mathematical techniques [29][36] can be employed to further reduce the program run time.

References

[1]

Steven M. Rubin, “Computer Aids for VLSI D esign”, Addison-W esley
Publishing Company, 1994.

[2]

Daniel Mlynek, “Design of VLSI System”, Integrated Systems Center, Swiss
Federal Institute of Technology, 1998.

[3]

V. Betz and J. Rose, “VPR: A New Packing, Placement and Routing Tool for
FPGAs”, FPL 1997, pp. 213 - 222.

[4]

A. Marquardt, V. Betz, and J. Rose, “Timing-Driven Placement for FPGAs”,
Intl. Symposium on FPGAs, February 2000, pp. 203-213.

[5]

Y. Sankar and J. Rose, “Trading Quality for Compile Time: Ultra-Fast
Placement for FPGAs”, Proc. of 7th ACM/SIGDA Intl. Symposium on
FPGAs, 1999, pp. 157-166.

[6]

J. Cong and Y. Ding, “FlowMap: An Optimal Technology Mapping
Algorithm for Delay Optimization in Lookup-Based FPGA Designs”, IEEE
Trans on CAD, January 1994, pp. 1-13.

[7]

G. Parthasarathy, M. Marek-Sadowska, A. Mukherjee, and A. Singh,
“Interconnect Complexity-aware FPGA Placement Using Rent’s Rule”, Proc.
of System Level Interconnect Prediction, March 2001.

[8]

W. Swartz and C. Sechen, “Timing-Driven Placement for Large Standard Cell
Circuits”, DAC 1995, pp. 211-215.

[9]

C. Ebeling, L. McMurchie, S. A. Hauck and S. Burns, “Placement and
Routing Tools for the Triptych FPGA,” IEEE Trans. on VLSI, 1995, pp.
473-482.

[10]

K. Shahookar and P. Mazumder, “VLSI Cell Placement Techniques,” ACM
Computing Surveys, vol. 23, no. 2, 1991, pp. 143-220.

[11]

T. Lengauer, “Combinatorial Algorithms for Integrated Circuit Layout”,
Chichester: John Wiley & Sons, 1990.

[12]

S. M. Sait and H. Youssef, “VLSI Physical Design Automation”, IEEE Press,
New Jersey, 1995.

[13]

S. Areibi, M. Xie, and A. Vannelli, “An Efficient Rectilinear Steiner Tree
Algorithm for VLSI Global Routing”, Canadian Conference on Electrical and
Computer Engineering, May 2001.

[14]

J. B. Kruskal, “On the Shortest Spanning Tree of a Graph and The Traveling
Salesman Problem”, Proc. of The American Mathematical Society 7, 1956,
pp. 48-50.

[15]

R. Prim, “Shortest connection networks and some generalizations”, Bell
System Technical Journal, vol. 36, pp. 1389-1401, 1957.

[16]

C. Cheng, “RISA: Accurate and Efficient Placement Routability Modeling”,
Proc. of ICCAD, 1994, pp. 690-695.

[17]

A. E. Dunlop and B. W. Kernighan, “A Procedure for Placement of
Standard-Cell VLSI Circuits”, IEEE Trans. on CAD, vol. 4, no. 1, 1985, pp.
92-98.

[18]

A. Caldwell, A. Kahng and I. Markov, “Can Recursive Bisection Produce
Routable Placement?”, Proc. of DAC ’00, 2000, pp. 477-482.

[19]

B. W. Kernighan and S. Lin, “An Efficient Heuristic Procedure for
Partitioning Graphs,” Bell System Technical Journal, vol. 49, Feb. 1970, pp.
291-307.

[20]

C. M. Fiduccia and R. M. Mattheyses, “A Linear Time Heuristic for Improving
Network Partitions”, Proc. of 19th IEEE/DAC, pp. 175-181.

[21]

D. J-H. Huang and A.B. Kahng, “Partitioning-based Standard-cell Global
Placement with an Exact Objective”, Proc. of ACM/IEEE ISPD, 1997, pp.
18-25.

[22]

M. Wang, X. Yang and M. Sarrafzadeh, “DRAGON2000: Standard-Cell
Placement Tool for Large Industry Circuits”, Proc. of ACM/IEEE ICCAD,
2000, pp. 260-263.

[23]

P. Maidee, C. Ababei and K. Bazargan, “Fast Timing-driven
Partitioning-based Placement for Island Style FPGAs”, Proc. of DAC 2003,
pp. 598-603.

[24]

C. Sechen and A. Sangiovanni-Vincentelli, “The TimberWolf Placement and
Routing Package,” IEEE Journal of Solid-State Circuits, vol. 20, no. 2, Apr.
1985, pp. 510-522.

[25]

M. Huang, F. Romeo, and A. Sangiovanni-Vincentelli, “An Efficient General
Cooling Schedule for Simulated Annealing,” Proc. of ICCAD, 1986, pp.
381-384.

[26]

J. Lam and J. Delosme, “Performance of a New Annealing Schedule,” Proc.
ACM/IEEE DAC, 1988, pp. 306-311.

[27]

W. Swartz and C. Sechen, “New Algorithms for Placement and Routing of
Macro Cells,” Proc. Intl. Conference on Computer-Aided Design, 1990, pp.
336-339.

[28]

G. Karypis, R. Aggarwal, V. Kumar and S. Shekhar, “Multilevel Hypergraph
Partitioning: Application in VLSI domain”, Proc. of ACM/IEEE DAC, 1997,
pp. 526-529.

[29]

C. J. Alpert, T. Chan, D. J.-H. Huang, I. Markov, and K. Yan, “Quadratic
Placement Revisited”, Proc. of ACM/IEEE DAC 1997, pp. 752-757.

[30]

J. M. Kleinhans, G. Sigl, F. M. Johannes, “Gordian: A New Global
Optimization/Rectangle Dissection Method for Cell Placement”, Proc. of
ICCAD, 1988, pp. 506-509.

[31]

I.I. Mahmoud, K. Asakura, T. Nishibu and T. Ohtsuki, “Experimental
Appraisal of Linear and Quadratic Objective Functions Effect on Force
Directed Method for Analog Placement”, IEICE Trans. on Fundamentals of
Electronics, Communications and Computer Sciences, E77-A(4), April 1994,
pp. 719-725.

[32]

G. Sigl, K. Doll, F. M. Johannes, “Analytical Placement: A Linear or a
Quadratic Objective Function?”, ACM/IEEE DAC 1991, pp. 427-432.

[33]

K. M. Hall, “An r-dimensional Quadratic Placement Algorithm,”
Management Science 17 (1970), pp. 219-229.

[34]

H. Etawil, S. Areibi, and A. Vannelli, “Attractor-repeller Approach for Global
Placement”, Proc. IEEE/ACM ICCAD, 1999, pp. 20-24.

[35]

N. Viswanathan, C. Chu, “FastPlace: Efficient Analytical Placement Using
Cell Shifting, Iterative Local Refinement and a Hybrid Net Model”, Proc. of
ISPD 2004, pp. 26-33.

[36]

R. Barrett, M. Berry, et al., “Templates for the Solution of Linear
Systems: Building Blocks for Iterative Methods”, SIAM, 1994.

[37]

S. Kirkpatrick, C. D. Gelatt, and M. P. Vecchi, “Optimization by Simulated
Annealing”, Science, vol. 220, no. 4598, May 13, 1983, pp. 671-680.

[38]

M. R. Garey and D. S. Johnson, “Computers and Intractability: A Guide to the
Theory of NP-Completeness”, New York: W. H. Freeman, 1983.

[39]

O. Axelsson, A. Barker, “Finite Element Solution of Boundary Value
Problems: Theory and Computation”, Academic Press, Orlando, FL, 1984.

[40]

W. Hackbusch, “Iterative Solution of Large Sparse Systems”, Springer Verlag,
1994.

[41]

A. Ramage, “An Introduction to Iterative Solvers and Preconditioning
Techniques”, PIMS Workshop 2003.

[42]

S. Yang, “Logic Synthesis and Optimization Benchmarks, Version 3.0”,
Tech. Report, Microelectronics Centre of North Carolina, 1991.

[43]

M. Hutton, J. Rose, and D. Corneil, “Generation of Synthetic Sequential
Benchmark Circuits”, Proc. of 5th ACM/SIGDA Intl. Symposium on FPGAs,
1997, pp. 149-155.

[44]

E. M. Sentovich et al., “SIS: A System for Sequential Circuit Analysis”, Tech.
Report No. UCB/ERL M92/41, University of California, Berkeley, 1992.

[45]

C. J. Alpert, T. F. Chan, D. J.-H. Huang, A. B. Kahng, I. L. Markov, P. Mulet
and K. Yan, “Faster Minimization of Linear Wirelength for Global
Placement”, Proc. o f ISPD 1997, pp. 4-11.

[46]

R. Tsay, E. Kuh and C. Hsu, “Module Placement for Large Chips Based on
Sparse Linear Equations”, Int. J. Circuit Theory Appl. 16, pp. 411-423.

Appendix A
Basic Data Structures

The structures s_net and s_block are the main data structures used by VPR. Together they
hold all the connectivity information of the input circuit. VPR parses the input circuit and
stores the connectivity information in these two data structures using a hash table.

struct s_net {char *name; int num_pins; int *blocks; int *blk_pin;};
/* name: ASCII net name for informative annotations in the output.          */
/* num_pins: Number of pins on this net.                                     */
/* blocks: [0..num_pins-1]. Contains the blocks to which the pins of this   */
/*         net connect. Output in blocks[0], inputs in other entries.       */
/* blk_pin: [0..num_pins-1]. Contains the number of the pin (on a block) to */
/*          which each net terminal connects. Since I/O pads have only one  */
/*          pin, I set blk_pin to OPEN for them (it should only be used for */
/*          clb pins). For clbs, it is the block pin number, as expected.   */

struct s_block {char *name; enum e_block_types type; int *nets; int x; int y;};

/* name: Taken from the net which it drives.                        */
/* type: CLB, INPAD or OUTPAD                                       */
/* nets[]: List of nets connected to this block. If nets[i] = OPEN */
/*         no net is connected to pin i.                            */
/* x,y: physical location of the placed block.                      */

The structure q_block is the main data structure used by QPF to store the connectivity
information of the circuit. We initialize it by searching through the two VPR data
structures and converting the hypergraph into a graph using the clique net model.

struct q_block {char* name; int orig_blk_num; int x, y; int edge_num; int clb_edge_num;
                int io_edge_num;
                int* q_blk_list; float* q_blk_weight; float av_weight;
};

/* name: Block name.                                                */
/* orig_blk_num: the original block number in VPR                   */
/* x,y: physical location of the placed block.                      */
/* edge_num: number of edges of this node                           */
/* clb_edge_num: number of edges connected to CLBs                  */
/* io_edge_num: number of edges connected to IO pads                */
/* q_blk_list[]: nodes connected to this node                       */
/* q_blk_weight[]: weights of the nodes connected to this node      */
/* av_weight: average weight of this node                           */
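
The clique net model conversion mentioned above can be sketched as follows. This is not the QPF code: demo_net, add_clique and build_clique_graph are hypothetical names, the dense weight matrix stands in for the per-node q_blk_list and q_blk_weight arrays, and the edge weight 1/(k-1) for a k-pin net is one common convention assumed here, since the weighting is not given in this appendix.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-in for s_net: only the pin count and the
 * blocks the net connects are needed for the clique expansion. */
struct demo_net {
    int num_pins;
    const int *blocks;   /* [0..num_pins-1] block indices */
};

/* Adds the clique edges of one net to the symmetric weight matrix w
 * (row-major, num_blocks x num_blocks). Every pair of distinct blocks on a
 * k-pin net receives weight 1/(k-1); this weighting is an assumption. */
void add_clique(const struct demo_net *net, float *w, int num_blocks)
{
    int k = net->num_pins;
    if (k < 2)
        return;                      /* a single-pin net adds no edges */

    float edge_w = 1.0f / (float)(k - 1);
    for (int i = 0; i < k; i++) {
        for (int j = i + 1; j < k; j++) {
            int a = net->blocks[i];
            int b = net->blocks[j];
            assert(a >= 0 && a < num_blocks);
            assert(b >= 0 && b < num_blocks);
            if (a == b)
                continue;            /* skip self-loops */
            w[a * num_blocks + b] += edge_w;
            w[b * num_blocks + a] += edge_w;
        }
    }
}

/* Expands all nets into one weighted graph. Returns a zero-initialized
 * matrix with accumulated edge weights, or NULL on allocation failure.
 * The caller frees the result. */
float *build_clique_graph(const struct demo_net *nets, int num_nets,
                          int num_blocks)
{
    float *w = calloc((size_t)num_blocks * (size_t)num_blocks, sizeof *w);
    if (w == NULL)
        return NULL;
    for (int n = 0; n < num_nets; n++)
        add_clique(&nets[n], w, num_blocks);
    return w;
}
```

With a three-pin net on blocks {0, 1, 2} and a two-pin net on blocks {0, 1}, the pair (0, 1) accumulates weight 0.5 + 1.0 = 1.5, while the pairs (0, 2) and (1, 2) each receive 0.5. A sparse adjacency list per node, as in q_block, would replace the dense matrix for circuits of realistic size.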

VITA AUCTORIS

Yonghong Xu graduated from Zhejiang University in P.R. China, where he obtained a B.Sc.
in June 1995 and an M.Sc. in March 1998 in Electrical Engineering. Afterward he worked
for Alcatel Shanghai Bell until 2002. He is currently a candidate for the Master’s degree in
the Department of Electrical and Computer Engineering at the University of Windsor and
hopes to graduate in June 2005.


