Brigham Young University

BYU ScholarsArchive
Theses and Dissertations
2012-01-22

Using Hard Macros to Accelerate FPGA Compilation for Xilinx
FPGAs
Christopher Michael Lavin
Brigham Young University - Provo

Follow this and additional works at: https://scholarsarchive.byu.edu/etd
Part of the Electrical and Computer Engineering Commons

BYU ScholarsArchive Citation
Lavin, Christopher Michael, "Using Hard Macros to Accelerate FPGA Compilation for Xilinx FPGAs" (2012).
Theses and Dissertations. 2933.
https://scholarsarchive.byu.edu/etd/2933

This Dissertation is brought to you for free and open access by BYU ScholarsArchive. It has been accepted for
inclusion in Theses and Dissertations by an authorized administrator of BYU ScholarsArchive. For more
information, please contact scholarsarchive@byu.edu, ellen_amatangelo@byu.edu.

Using Hard Macros to Accelerate FPGA Compilation for Xilinx FPGAs

Christopher Michael Lavin

A dissertation submitted to the faculty of
Brigham Young University
in partial fulfillment of the requirements for the degree of
Doctor of Philosophy

Brent E. Nelson, Chair
Brad L. Hutchings
David A. Penry
Michael D. Rice
Michael J. Wirthlin

Department of Electrical and Computer Engineering
Brigham Young University
April 2012

Copyright © 2012 Christopher Michael Lavin
All Rights Reserved

ABSTRACT
Using Hard Macros to Accelerate FPGA Compilation for Xilinx FPGAs
Christopher Michael Lavin
Department of Electrical and Computer Engineering, BYU
Doctor of Philosophy
Field programmable gate arrays (FPGAs) offer an attractive compute platform because of their highly parallel and customizable nature in addition to the potential of being
reconfigurable to almost any desired circuit. However, compilation time (the time it
takes to convert user design input into a functional implementation on the FPGA) has been
a growing problem and is stifling designer productivity.
This dissertation presents a new approach to FPGA compilation that more closely
follows the software compilation model than that of the application specific integrated circuit (ASIC). Instead of re-compiling every module in the design for each invocation of the
compilation flow, this approach uses pre-compiled modules that can be “linked” in the final stage
of compilation. These pre-compiled modules are called hard macros and contain
the necessary physical information to ultimately implement a module or building block of a
design. By assembling hard macros together, a complete and fully functional implementation
can be created within seconds.
This dissertation describes the process of creating a rapid compilation flow based on
hard macros for Xilinx FPGAs. First, RapidSmith, an open source framework that enabled
the creation of custom CAD tools for this work, is presented. Second, HMFlow, the hard
macro-based rapid compilation flow, is described and presented as tuned to compile Xilinx
FPGA designs as fast as possible. Finally, several modifications to HMFlow are made such
that it produces circuits with clock rates that run at more than 75% of Xilinx-produced
implementations while compiling more than 30× faster than the Xilinx tools.

Keywords: FPGA, rapid prototyping, design flow, hard macros, Xilinx, XDL, RapidSmith,
HMFlow, open source, placer, router

ACKNOWLEDGMENTS

I would first like to thank my wife Ashley and daughter Katelyn for their patience
and support in allowing me to finish this work. Several long days, nights and Saturdays were
necessary for this dissertation to meet completion and their long suffering and love provided
me a great motivation to finish. I am also grateful for my parents and their love and support
by providing me the great start in life to allow me to reach this achievement.
I would like to thank Dr. Brent Nelson for taking me under his wing five and a half
years ago. I am grateful for the patience he had to allow me to find out on my own what I
should research for this dissertation. The extra time Dr. Nelson made for me to discuss my
ideas or challenges and his example helped shape my talents and skills to help me become
the engineer I am today.
I would also like to thank Dr. Brad Hutchings for his significant contributions to this
work and my development as an engineer. Dr. Hutchings went out of his way to provide
time and support as an unofficial co-advisor to this work. His insights were invaluable and
added significantly to this dissertation and helped me grow as a graduate student.
Thanks to Dr. Michael Rice for the opportunity to work with him and the Telemetry
Lab on the Space-Time Coding project. That opportunity turned out to be a rich experience
that laid the groundwork for several of the other accomplishments I have made as a graduate
student.
Thanks also to Dr. Michael Wirthlin for all the support and time he took to help
me in various endeavors. Thanks to Dr. David Penry and the entire committee for their
valuable insight on this dissertation.
There were also several students whose example and work helped me significantly as
a graduate student that I would like to thank: Joseph Palmer, Nathan Rollins, Jon-Paul
Anderson, Brian Pratt, Marc Padilla, Jaren Lamprecht, Philip Lundrigan, Subhra Ghosh,
Brad White, Jonathon Taylor, Josh Monson and all the other students in the Configurable
Computing Lab.
Special thanks also to Neil Steiner and Matt French at USC-ISI East, Dr. Peter
Athanas and the Virginia Tech Configurable Computing Lab as well as the entire Gremlin
project for their insight and ideas that helped me more fully understand FPGAs and their
architecture.
This research was supported by the I/UCRC Program of the National Science Foundation under Grant No. 0801876 through the NSF Center for High-Performance Reconfigurable
Computing (CHREC).

TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES

Chapter 1 Introduction
    1.1 Motivation
    1.2 Preview of Approach
    1.3 Contributions of this Work

Chapter 2 Background and Related Work
    2.1 FPGA Architecture
        2.1.1 FPGA Primitives
        2.1.2 Chip Layout and Routing Interconnect
    2.2 Conventional FPGA Compilation Flow
    2.3 Related Work in Accelerating FPGA Compilation
        2.3.1 Using Pre-compiled Cores to Accelerate FPGA Compilation
        2.3.2 Accelerating Placement Techniques
        2.3.3 Routability-driven Routing
        2.3.4 Summary and Overview of this Work

Chapter 3 RapidSmith: An Open Source Platform for Creating FPGA CAD Tools for Xilinx FPGAs
    3.1 Introduction
    3.2 Related Work
        3.2.1 Torc
    3.3 XDL: The Xilinx Design Language
        3.3.1 Detailed FPGA Descriptions in XDLRC Reports
        3.3.2 Designs in XDL
    3.4 RapidSmith: A Framework to Leverage XDL and Provide a Platform to Create FPGA CAD Tools
        3.4.1 Xilinx FPGA Database Files in RapidSmith
        3.4.2 Augmented XDLRC Information in RapidSmith
        3.4.3 XDL Design Representation in RapidSmith
        3.4.4 Impact of RapidSmith

Chapter 4 HMFlow 2010: Accelerating FPGA Compilation with Hard Macros for Rapid Prototyping
    4.1 Preliminary Work
        4.1.1 Selection of a Compiled Circuit Representation
        4.1.2 Experiments Validating Hard Macro Potential
        4.1.3 Hard Macros and Quality of Results
        4.1.4 Hard Macros and Placement Time
        4.1.5 Conclusions on Preliminary Hard Macro Experiments
    4.2 HMFlow 2010: A Rapid Prototyping Compilation Flow
        4.2.1 Xilinx System Generator
        4.2.2 Simulink Design Parsing
        4.2.3 Hard Macro Cache and Mapping
        4.2.4 Hard Macro Creation
        4.2.5 XDL Design Stitcher
        4.2.6 Hard Macro Placer
        4.2.7 Detailed Design Router
    4.3 Results
        4.3.1 Benchmark Designs
        4.3.2 RapidSmith Router Performance
        4.3.3 Hard Macro Placer Algorithms
        4.3.4 HMFlow Performance
    4.4 Conclusion

Chapter 5 HMFlow 2011: Accelerating FPGA Compilation and Maintaining High Performance Implementations
    5.1 Preliminary Work
        5.1.1 Experiments
        5.1.2 Conclusions on Preliminary Work
    5.2 Comparison of HMFlow Using Large Hard Macros vs. Small Hard Macros
        5.2.1 Upgrading HMFlow to Support Large Hard Macros
        5.2.2 Upgrading HMFlow to Support Virtex 5 FPGAs
        5.2.3 Large Hard Macro Benchmark Designs for HMFlow
        5.2.4 Comparisons of Large and Small Hard Macro-based Designs
        5.2.5 Comparisons of Large Hard Macros with HMFlow vs. Xilinx
    5.3 Modification to HMFlow for High Quality Implementations
        5.3.1 Hard Macro Simulated Annealing Placer
        5.3.2 Register Re-placement
        5.3.3 Router Improvements
    5.4 Performance Analysis
        5.4.1 Result Measurement Fairness
        5.4.2 Results of Three HMFlow 2011 Improvements
        5.4.3 Results of Optimizing HMFlow 2011 Improvements
    5.5 Techniques for Reducing Variance
        5.5.1 T1: Move Acceptance a Function of Hard Macro Port Count
        5.5.2 T2: Small Hard Macro Re-placement
        5.5.3 T3: Cost Function Includes Longest Wire
        5.5.4 Configuration Comparison of Techniques
    5.6 Runtime Results
    5.7 Conclusions

Chapter 6 The Big Picture: The Compilation Time vs. Circuit Quality Tradeoff
    6.1 Motivation
    6.2 Implications for FPGA Designers
    6.3 Contributions
    6.4 Conclusions
    6.5 Future Work

REFERENCES

LIST OF TABLES

2.1 Virtex 4 Routing Interconnect Types
2.2 Virtex 5 Routing Interconnect Types

3.1 RapidSmith Device Files Performance

4.1 Baseline Runtimes for each Test Design
4.2 Performance of each Test Design Using Hard Macros
4.3 Comparison of Baseline vs. Hard Macro Designs
4.4 Benchmark Design Characteristics
4.5 Fine-grained Hard Macro Compile Times
4.6 Router Performance Comparison: Xilinx vs. RapidSmith
4.7 Hard Macro Placer Algorithm Comparison
4.8 Runtime Performance of HMFlow and Comparison to Xilinx Flow

5.1 Width/Height Area Group Aspect Ratio Configurations
5.2 Slice Counts for Large Hard Macro Benchmark Virtex 5 Designs
5.3 Coarse-grained Hard Macro Compile Times
5.4 Runtime Comparison of HMFlow with Large Hard Macros vs. Xilinx
5.5 Clock Rate Comparison of HMFlow with Large Hard Macros vs. Xilinx
5.6 All HMFlow 2011 Improvement Configurations Tested
5.7 HMFlow 2011 Benchmark Clock Rates of Single (Default) Run (in MHz)
5.8 HMFlow 2011 Benchmark Clock Rates of Average of 100 Runs (in MHz)
5.9 HMFlow 2011 Benchmark Clock Rates of Best of 100 Runs (in MHz)
5.10 Variance of HMFlow 2011 (C7) and Xilinx in 100 Compilation Runs
5.11 Average Variance and Frequency using Variance-reducing Techniques
5.12 Compilation Runtime for Several HMFlow 2011 Configurations vs. Xilinx
5.13 Clock Rate Summary (in MHz) for HMFlow 2011 Configurations vs. Xilinx

LIST OF FIGURES

2.1 General Logic Abstractions in a Xilinx Virtex 5 FPGA
2.2 Common Xilinx FPGA (Virtex 5) Architecture Layout
2.3 Conventional FPGA Compilation Flow (Xilinx)

3.1 RapidSmith and XDL Interacting at Different Points within the Xilinx Tool Flow
3.2 RapidSmith Abstractions for (a) Devices and (b) Designs
3.3 Screenshots of Graphical Tools Provided with RapidSmith to Browse (a) Devices or (b) Designs

4.1 Hard Macro Creation Flow
4.2 Compilation Flow for Experiment #1 (Conventional Xilinx Flow)
4.3 Compilation Flow for Experiment #2 (Modified Xilinx Flow)
4.4 Block Diagram of Multiplier Tree Design
4.5 Block Diagram of HMFlow
4.6 Screenshot of an Example System Generator Design
4.7 Front-end Flow for an HMFlow Hard Macro
4.8 (a) A FIR filter Design Compiled with the Xilinx Tools (b) The Same Filter Design Compiled with an Area Constraint to Create a Hard Macro
4.9 A Histogram of Hard Macro Sizes of All Hard Macros in the Benchmarks
4.10 Graph Showing Percentage of Routed Connections in Each Benchmark Before the Design is Sent to the Router
4.11 (a) Average Runtime Distribution of HMFlow (b) HMFlow Runtime as a Percentage of Total Time to Run HMFlow and Create an NCD File
4.12 A Comparison Plot of the Benchmark Circuits Maximum Clock Rates When Implemented with HMFlow and the Xilinx Tools

5.1 General Pattern of Hard Macro Placement on FPGA Fabric
5.2 Delay of a Path Within a 21×21 Bit LUT-multiplier Hard Macro Placed in a Grid of 400 Locations on a Virtex 4 SX35 FPGA
5.3 A More Severely Impacted Path Caused by a Hard Macro Straddling the Center Clock Tree Spine of the FPGA
5.4 A PicoBlaze Hard Macro Placed at 3700 Locations on a Virtex 5 FPGA
5.5 (a) Illustrates a FIR Filter Implemented in System Generator (b) Shows the FIR Filter Converted to a Subsystem to be Turned into a Hard Macro
5.6 (a) Comparison of Runtime for Large and Small Hard Macro Versions of 3 Benchmarks on HMFlow 2010a (b) Comparison of Clock Rate for Large and Small Hard Macro Versions of 3 Benchmarks on HMFlow 2010a
5.7 The Number of Existing Routed Connections in the Large Hard Macro Benchmark as a Percentage of Total Connections
5.8 Block Diagrams of HMFlow 2010a and HMFlow 2011
5.9 (a) Representation of a Set of Hard Macros Drawn with a Tight Bounding Box (b) An Approximated Bounding Box for the Same Hard Macros for Accelerating the Simulated Annealing Hard Macro Placer
5.10 An Illustration of the Problem of an Approximating Bounding Box Where the Box Changes Size Based on Location
5.11 A Simple Example of a Register (A) Being Re-placed at the Centroid of the Original Site of Register A and Sinks B, C and D
5.12 Representation of Virtex 4 and Virtex 5 Long Line Routing Resources
5.13 Plot of Benchmark Brik1 Uphill Accepted Moves for Each Hard Macro
5.14 T1 Variance of Scaling Factor Q Parameter Sweep
5.15 T1 Variance of Scaling Factor Q Parameter Sweep (Zoomed)
5.16 T2 Variance of Hard Macro Size Re-placement Parameter Sweep
5.17 T2 Variance of Hard Macro Size Re-placement Parameter Sweep (Zoomed)
5.18 T3 Variance of Longest Wire Scaling Factor Parameter Sweep
5.19 T3 Variance of Longest Wire Scaling Factor Parameter Sweep (Zoomed)
5.20 Percentage of Total Wire Length to Overall Cost Function Output on Brik1 Benchmark Using T3

6.1 Generalized Representation of Current Tool Solution Offerings for Compilation Runtime vs. Quality of Result
6.2 Quality vs. Runtime Tradeoff for HMFlow 2011 and Xilinx Tools

CHAPTER 1. INTRODUCTION

1.1 Motivation
Field programmable gate arrays (FPGAs) have steadily gained traction as a computational platform for several years. With each fabrication process shrink, FPGAs have
been able to steadily grow in capacity from simple chips used as glue logic to massively
parallel processing platforms capable of hundreds of GFLOPS (billions of single-precision
floating point operations per second). They have also expanded to include a variety of dedicated cores such as block memories, processors, Ethernet MACs, PCI Express cores and
multi-gigabit transceivers (MGTs) to provide more robust solutions to a greater number of
design projects. As FPGA compute capabilities have increased, they have become a popular choice to replace other computational platforms such as application specific integrated
circuits (ASICs) and general purpose processors (CPUs).
FPGAs are becoming a more attractive alternative to ASICs in many applications for
two main reasons. First, FPGAs are re-programmable in that their circuits can essentially
be changed as many times as the user desires. This feature alone helps drive down costs in
development projects as changing an ASIC after photolithography masks have been created
is difficult and often will require a new set of the very expensive masks to be created in order
to change the circuit behavior.
Second, FPGAs have little to no non-recurring engineering (NRE) costs when compared with the costs of producing leading-edge technology ASICs. An FPGA can be purchased, designed and programmed to be fully functional with almost no up-front cost. ASICs,
however, will require a lengthy development, verification and expensive fabrication process
that can cost several million dollars before a single functional die is produced. In addition,
with each fabrication technology process shrink, ASIC NRE costs rise significantly—driving
more projects and engineers to choose FPGAs over ASICs. It is expected that FPGAs will
continue to consume the ASIC project market for all but the highest volume production
runs.
FPGAs are also an attractive alternative to CPUs in embedded applications as FPGAs can offer much higher performance and also better performance per Watt. However,
one aspect of FPGAs that has continually plagued their adoption for many typical CPU
applications has been lengthy compilation times. Software compilation for CPUs is often
completed within seconds whereas FPGA compilation can take hours or even days. Such
a large disparity in compile time puts FPGA designers at a significant design disadvantage
by limiting the number of edit-compile-debug turns that can be completed per day. Poor
designer productivity due to lengthy FPGA development times can often negate the performance advantages offered by FPGAs. This drawback in productivity can also drive engineers to
use CPUs as design solutions when computational power is not critical.
Long compilation time has been a major roadblock for widespread FPGA adoption
and is mostly due to the priorities that drive the CAD tools used to compile FPGA designs.
Those projects that do leverage FPGAs (mostly as ASIC alternatives) are concerned with
two main metrics: area and clock rate. Area and clock rate directly correlate with the cost of
the project and FPGA vendors must optimize their CAD tools for these two metrics above
all else to ensure viability in the FPGA market arena. In addition, since many engineers
leveraging FPGAs are already familiar with and accustomed to the lengthy development cycle
presented by ASICs, the use of the similar FPGA development and compilation process has
not produced much motivation or effort to reduce compilation times.
However, as FPGAs have increased in size and capacity, their usage in new kinds
of applications that do not demand the highest clock rate or minimal area utilization has
become more prevalent. One such example is the National Instruments LabVIEW FPGA design
tool, which only targets an implementation clock rate of 40 MHz, substantially lower than
modern FPGA capabilities. The largest complaint about the LabVIEW FPGA product from its
consumers has been the compilation time of their designs [1]. With the past few decades
of FPGA CAD being narrowly focused on optimizing for area and clock rate, no effective
commercial solutions are available to solve the problem of lengthy FPGA design compilation
times.

1.2 Preview of Approach
The goal of this work is to create and explore FPGA CAD tool techniques that
can provide acceptable FPGA implementation results while reducing compilation time for
FPGAs by an order of magnitude or more. More broadly, the goal is to give FPGA design
engineers greater flexibility in trading circuit quality for faster compilation time and to
shed further light on this largely unexplored tradeoff.
The particular approach of this work is to emulate the largely successful software
compilation model. In software, lengthy compilations are avoided by using pre-compiled
libraries which are linked in at the final stage of executable generation. Conventional FPGA
compilation is in stark contrast to this model in that each time a design is compiled, each
and every component is re-compiled from scratch regardless of whether it has been changed
since the previous compilation or not.
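The contrast between the two models can be sketched in a few lines of Python. This is only an illustration of the software-style idea of reusing pre-compiled results, not the flow developed in this dissertation; `ModuleCache`, `build` and `fake_compile` are hypothetical names, and the "compile" step is a stand-in for an expensive per-module compilation.

```python
import hashlib


class ModuleCache:
    """Caches compiled artifacts keyed by a hash of each module's source."""

    def __init__(self, compile_fn):
        self._compile_fn = compile_fn  # hypothetical, expensive per-module compile step
        self._artifacts = {}           # source hash -> compiled artifact
        self.compile_count = 0         # number of real compilations performed

    def get(self, source: str):
        key = hashlib.sha256(source.encode("utf-8")).hexdigest()
        if key not in self._artifacts:
            # Only modules whose source has not been seen before are compiled.
            self._artifacts[key] = self._compile_fn(source)
            self.compile_count += 1
        return self._artifacts[key]


def build(modules: dict, cache: ModuleCache) -> list:
    """'Link' a design by collecting one compiled artifact per module."""
    return [cache.get(source) for _, source in sorted(modules.items())]


def fake_compile(source: str) -> str:
    # Stand-in for a slow compilation step (assumption for illustration only).
    return "compiled(" + source + ")"
```

With such a cache, rebuilding a design after editing one module triggers a single new compilation, whereas the conventional FPGA flow described above would repeat the work for every module.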
To avoid excessive re-compilation, this work heavily utilizes a construct in hardware
that largely parallels the pre-compiled libraries in software. This hardware construct is
called a hard macro.¹ A hard macro is a pre-synthesized, pre-mapped, pre-placed and pre-routed hardware block that has been designed with re-use in mind and can potentially be
instantiated at several different locations on the FPGA fabric. Hard macros are device-family specific and are therefore tied to a particular architecture; however, the same is true of
pre-compiled libraries in software. By creating designs purely assembled from hard macros,
compilation time can be reduced significantly as shown by the techniques outlined in this
dissertation.

¹ It should be noted that the term hard macro referred to here and throughout this dissertation is strictly
applied to circuits realized in programmable FPGA fabric. It does not refer to hard macros in an ASIC.

1.3 Contributions of this Work
Several attempts have been made to accelerate FPGA compilation, often focusing on
certain steps of the compilation process [2] [3], with some techniques even using hard macros
[4]. However, this work is unique in that it combines algorithmic
improvement with the use of pre-compiled information to accelerate the compilation flow
used by FPGAs. Additionally, this approach is flexible enough to accommodate designs
from all application domains.
The main contributions of this dissertation are listed below:
• Though some research efforts to perform CAD tool experimentation on commercial
FPGAs have been made in the past (such as [5], [4], [6] and [7]), the task has been
quite difficult due to the lack of a unified framework and tools. In order to implement
the several algorithms, ideas and techniques of this work, RapidSmith, a framework for
FPGA CAD tool creation, was developed and released as open source to the research
community (Chapter 3).
• HMFlow 2010 - Demonstration of a complete, custom FPGA compilation flow leveraging hard macros for rapid prototyping purposes (Chapter 4).
• HMFlow 2011 - Demonstration of several techniques to improve on HMFlow 2010 that
allow for higher quality implementation with minimal runtime increases (Chapter 5).
• Additional insight into the tradeoff between compilation time and circuit quality
using the several techniques in HMFlow 2010 and HMFlow 2011 (Chapter 6).
The use of hard macros to directly accelerate commercial FPGA compilation of general purpose designs is a little-studied area of research. In addition, providing insight into
the potentially valuable tradeoff of compile time vs. circuit quality has also received little
attention in the research literature. An overview of FPGA compilation and related work
is given in Chapter 2, with Chapters 3, 4 and 5 describing in detail the accomplishments of this dissertation. In Chapter 6 I conclude and reflect on the potential impact of
the contributions made in this work.


CHAPTER 2. BACKGROUND AND RELATED WORK

This chapter will describe the basics of FPGA architecture, the conventional FPGA
compilation flow and related work in accelerating FPGA compilation.

2.1 FPGA Architecture
In order to implement a wide variety of circuits, FPGAs use configurable logic and
routing interconnect to provide a rich framework on which to realize a desired circuit. The
logic components of an FPGA contain facilities to perform computations such as arithmetic,
memory storage, I/O communications and other processing. The routing interconnect is
a massive network of programmable wires that allows connections to be made between
logic elements. This section aims to provide the reader with a rudimentary understanding
of FPGA architecture (specifically Xilinx FPGAs) to better understand the contributions
found later in this dissertation.

2.1.1 FPGA Primitives
FPGAs have a set of primitives that perform either logic, arithmetic, storage or I/O
communications. The devices within each family of Xilinx FPGAs share the same set of logic primitives, and each primitive is
characterized by a set of input and output pins with zero or more configurable options.
The most popular primitives are slices, IOBs, block RAMs and multipliers or DSPs.

Slice
At the heart of configurable logic is the programmable look up table (LUT). LUTs
can be programmed to implement any Boolean function of complexity only limited by the
number of inputs. Modern FPGAs have LUT input sizes of between 3 and 6 inputs and
correspondingly have between 8 and 64 ($2^3$ and $2^6$) bits of configuration (memory cells).

Figure 2.1: General Logic Abstractions in a Xilinx Virtex 5 FPGA

LUTs are organized hierarchically into different units of abstraction to aid the CAD tools in
mapping and implementing a design onto an FPGA. For example, as shown in Figure 2.1,
Virtex 5 Xilinx FPGAs use the terms slice and configurable logic block (CLB) to denote
groups of LUTs and groups of slices respectively. Each LUT is also accompanied by a D flip-flop, often used to store the LUT output and to provide a method for maintaining
state in the FPGA. A slice also contains other useful elements such as multiplexers and carry
chains (used for faster addition and subtraction), and some kinds of LUTs can be configured
as small RAMs.
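As a rough behavioral illustration of the LUT and flip-flop pairing described above, the following Python sketch models a k-input LUT as a table of 2^k configuration bits addressed by its inputs. The class names `LUT` and `LutWithFlipFlop` are hypothetical and chosen for this example only; the model ignores carry chains, multiplexers and the LUT-as-RAM modes, and it assumes the flip-flop resets to 0.

```python
class LUT:
    """A k-input look-up table holding 2**k configuration bits."""

    def __init__(self, num_inputs: int, config_bits: list):
        if len(config_bits) != 2 ** num_inputs:
            raise ValueError("a k-input LUT needs exactly 2**k configuration bits")
        self.num_inputs = num_inputs
        self.config_bits = list(config_bits)

    @classmethod
    def from_function(cls, num_inputs, fn):
        """Program the LUT by evaluating fn on every input combination."""
        bits = []
        for address in range(2 ** num_inputs):
            inputs = [(address >> i) & 1 for i in range(num_inputs)]
            bits.append(int(bool(fn(*inputs))))
        return cls(num_inputs, bits)

    def evaluate(self, *inputs) -> int:
        """The inputs form an address that selects one configuration bit."""
        if len(inputs) != self.num_inputs:
            raise ValueError("wrong number of inputs")
        address = sum(int(bool(bit)) << i for i, bit in enumerate(inputs))
        return self.config_bits[address]


class LutWithFlipFlop:
    """A LUT whose output is captured by a D flip-flop on each clock edge."""

    def __init__(self, lut: LUT):
        self.lut = lut
        self.q = 0  # flip-flop state, assumed to reset to 0

    def clock(self, *inputs) -> int:
        self.q = self.lut.evaluate(*inputs)
        return self.q
```

For example, `LUT.from_function(3, lambda a, b, c: a ^ b ^ c)` programs a 3-input LUT as a parity function using 8 configuration bits, while a 6-input LUT built the same way holds 64 bits.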

Input Output Buffer (IOB)
In order to interface an FPGA design with off-chip signals, input/output buffers
(IOBs) are needed to provide the necessary circuitry and signaling. IOBs can be configured
to interface with a number of different voltage standards and can be configured as input,
output or tri-state buffers.

Block RAMs
When FPGAs were first created they consisted of a purely homogeneous array of
CLBs and slices. However, as FPGAs grew in capacity and capability, FPGA vendors saw
the utility in adding block memories on chip to aid in computation. Xilinx added block
RAMs to its FPGAs. The capacities and quantity of block RAMs differ from one device
to the next. As one example, each Virtex 5 block RAM contains 36 kilobits of memory and
different Virtex 5 parts can have anywhere from a few dozen to several hundred of
these memories on a single chip.

Multipliers and DSP
In recent years, FPGAs have been quite successful at accelerating computations in
digital signal processing. Since a multiplier implemented out of several LUTs consumes a
significant portion of FPGA resources, FPGA vendors began to introduce dedicated multipliers into the FPGA fabric. These multipliers were improved in later families to become
a more fully-featured block called a DSP which is capable of several operations, the most
popular of which is a multiply-accumulate.

Other Primitives
Several other primitive cores exist on an FPGA for a variety of different applications.
Xilinx includes a primitive called the Internal Reconfiguration Access Port (ICAP) that
allows the FPGA to access its own configuration circuitry. Ethernet MACs and PCI Express
cores have also become popular primitives to allow FPGAs to more easily communicate over
common I/O standards without consuming large portions of FPGA resources. Xilinx has
also included hard processors within the FPGA fabric. The Virtex-II Pro, Virtex 4 and Virtex 5
families contained parts which had one or two PowerPC processors on chip to enable software
to interact with the FPGA fabric and build system-on-chip designs.

[Figure 2.2 diagram: array of CLB, block RAM and DSP blocks, each accessed through adjacent switch boxes]
Figure 2.2: Common Xilinx FPGA (Virtex 5) Architecture Layout

2.1.2 Chip Layout and Routing Interconnect
One of the major challenges of FPGA architecture is not only providing programmable
logic functionality but doing so in a way that allows actual implementation of real circuits.
If logic in different parts of the chip cannot communicate, the FPGA is worthless. Therefore,
the layout of the chip and its routing interconnect architecture are important and must be
accomplished in an intelligent fashion.
Virtually all modern FPGAs follow a layout style called an island-style architecture.
An island-style FPGA consists of a regular array of islands of logic with a sea of
interconnect resources between them. Generally, each island of logic is accompanied
by a switch box or switch matrix which acts as the entry point to access the vast sea of
interconnect resources. A high level representation of a typical Xilinx FPGA layout is given
in Figure 2.2.

The particular layout in Figure 2.2 most closely resembles the layout of the Virtex
5 family. The black boxes represent logic and the white boxes are switch boxes. Each logic
block accesses the main routing interconnect through the switch box to its left. Each CLB is
paired with one switch box, whereas larger logic blocks such as DSPs and block RAMs are
paired with 2.5 and 5 switch boxes respectively. Xilinx FPGAs typically arrange logic so
that all logic blocks in a column are of the same type. If the example in Figure 2.2 were
extended, further CLBs, block RAMs, and DSPs would repeat above
and below those shown until the top and bottom of the chip were reached.
It should also be noted that logic blocks such as CLBs and DSPs have direct interconnect to the adjacent blocks above and below. These direct connections are carry chains
which allow for faster arithmetic computation by chaining the critical computations closer
together.
The routing found in Figure 2.2 is simplified for ease of understanding. The actual
routing interconnect found in the Virtex 5 is more complex with several different types of
routing interconnect. Most general purpose routing starts and terminates in a switch box,
and Xilinx provides several types of routing interconnect to form a robust routing structure. The most common routing resources provided in the Virtex 4 and
Virtex 5 architectures are listed in Tables 2.1 and 2.2.

Table 2.1: Virtex 4 Routing Interconnect Types
Virtex 4 Routing Resource   Switch Box Hops   XDL Name Regular Expression Pattern
OMUX                        1, 2              ^OMUX[0-9]+
DOUBLE                      1, 2              ^[NSEW]2BEG[0-9]+
HEX                         3, 6              ^[NSEW]6BEG[0-9]+
LONG LINE                   6, 12, 18, 24     ^L[HV][(0)|(6)|(12)|(18)|(24)]

Some wire names start with one of four letters, N, S, E, or W (North, South, East
or West) that indicate the direction of the wire. For example, a Virtex 4 DOUBLE called
N2BEG0 is a wire that connects to the first and second switch boxes directly above the
source switch box. Note that the DOUBLE TURN and PENT TURN routing resources
in Virtex 5 change direction after 1 and 3 switch box hops, respectively. Long lines are
bi-directional, so they use only two letters, H and V, to denote horizontal and vertical wires.
Also note that the regular expressions in Tables 2.1 and 2.2 do not cover all wire names in
XDL, simply some of the more common ones.

Table 2.2: Virtex 5 Routing Interconnect Types
Virtex 5 Routing Resource   Switch Box Hops   XDL Regular Expression Pattern
DOUBLE                      1, 2              ^[NSEW][LR]2BEG[0-9]+
DOUBLE (TURN)               1, 2              ^[NSEW][NSEW]2BEG[0-9]+
PENT                        3, 5              ^[NSEW][LR]2BEG[0-9]+
PENT (TURN)                 3, 5              ^[NSEW][NSEW]2BEG[0-9]+
LONG LINE                   6, 12, 18         ^L[HV][(0)|(6)|(12)|(18)]

By providing a variety of wire interconnect lengths, Xilinx FPGAs are able to accommodate routing demands much better than if a single standard length were used. A
variety of wire lengths lets each connection use the routing resource that best matches its
distance, which minimizes routing interconnect delay.
Long lines are the obvious choice for long distance connections as they are much faster than
chaining several shorter wires together. Conversely, DOUBLE wires are an efficient
way of reaching switch boxes 1 and 2 hops away.
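
The naming conventions in Table 2.1 can be applied mechanically to classify a wire name. The following Python sketch is only an illustration: the function name `classify_wire` is hypothetical, and the LONG LINE pattern is rewritten here as an explicit alternation, which is an assumption about the intent of the pattern in the table.

```python
import re

# Patterns adapted from Table 2.1 (Virtex 4). The LONG LINE pattern is
# written as an explicit alternation; that rewrite is an assumption.
VIRTEX4_WIRE_PATTERNS = [
    ("OMUX", re.compile(r"^OMUX[0-9]+")),
    ("DOUBLE", re.compile(r"^[NSEW]2BEG[0-9]+")),
    ("HEX", re.compile(r"^[NSEW]6BEG[0-9]+")),
    ("LONG LINE", re.compile(r"^L[HV](0|6|12|18|24)$")),
]

DIRECTIONS = {"N": "north", "S": "south", "E": "east", "W": "west"}


def classify_wire(name):
    """Return (resource type, direction) for a wire name, or (None, None)."""
    for resource, pattern in VIRTEX4_WIRE_PATTERNS:
        if pattern.match(name):
            if resource in ("DOUBLE", "HEX"):
                # The leading letter gives the direction of the wire.
                return resource, DIRECTIONS[name[0]]
            if resource == "LONG LINE":
                # Long lines are bi-directional: only H or V is encoded.
                return resource, "horizontal" if name[1] == "H" else "vertical"
            return resource, None
    return None, None
```

With these patterns, the DOUBLE wire N2BEG0 mentioned above is classified as a northbound DOUBLE, while names outside the table are reported as unknown rather than guessed.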

2.2 Conventional FPGA Compilation Flow
A diagram showing the steps involved in the compilation of a typical Xilinx FPGA
design is given in Figure 2.3. FPGA designers will often describe the circuit to be implemented in a register transfer level (RTL) based language such as VHDL or Verilog. These
files are parsed and then synthesized into a netlist which describes a list of circuit elements
such as gates and flip flops with an accompanying list of their interconnections. The Xilinx
RTL synthesizer is called xst and outputs a Xilinx netlist called an NGC file. The synthesizer will often perform logic optimization and infer common circuit constructs from the
RTL provided. That is, it will find ways in which to reduce the circuit size by exploring more
efficient implementations of the logic and also match RTL patterns for common constructs
used in FPGAs such as hard multipliers and block RAMs.

[Figure 2.3 diagram: HDL (.VHD, .V) and .UCF inputs; XST -> .NGC -> NGDBuild -> .NGD -> MAP -> .NCD -> PAR -> .NCD -> BITGEN -> .BIT]
Figure 2.3: Conventional FPGA Compilation Flow (Xilinx)

The netlist produced by the synthesizer is then technology mapped into the specific
components offered by FPGAs such as LUTs, registers, carry chains and so forth. The
Xilinx tool which performs technology mapping is called ngdbuild and creates a technology
mapped netlist called an NGD file.
After the netlist has been technology mapped, it must be packed into the FPGA
primitives such as slices and DSP48s. The packing process is performed by the Xilinx tool
called map. The output of map is a Xilinx-specific netlist that contains a list of FPGA
primitives and nets describing their interconnections. The proprietary netlist format that
Xilinx uses is called the NCD (Native Circuit Description) file. As will be discussed in the
next chapter, Xilinx also offers an open ASCII-based version of the NCD file called XDL.
The final steps in implementation are the placement and routing steps. Placement
involves assigning FPGA primitives to a specific location on the FPGA fabric on which to
reside. The routing phase then decides which programmable switches are turned on and
off to create the desired connections between primitives. Placement and routing are often
performed by Xilinx par; however, newer architectures such as Virtex 5 and later have
their placement performed in the map process. NCD files are also produced by par, as the
NCD/XDL format can represent designs that are unplaced, placed, or routed.
The final step for FPGA implementation is generating a bitstream or BIT file. A
bitstream is a large binary file whose bits correspond to resources or memories in an
FPGA. Generating a bitstream involves changing bit values in the BIT file to reflect the
desired behavior of the FPGA as specified by the primitive netlist. Xilinx uses a program
called bitgen to convert physical netlist files (NCD) to BIT files. After a BIT file is created,
it can be downloaded directly to the chip or programmed into a PROM to be loaded into
the FPGA on reset or power up.
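
The chain of tools and intermediate files described in this section can be summarized as a small table. The Python sketch below is only an illustration of the flow's structure: the `FLOW` table and the `stages_to_rerun` helper are hypothetical names, and listing the .ucf constraint file as an input to ngdbuild is an assumption not stated in the text above.

```python
# Each stage: (tool, input file types, output file type), as named in the text.
# Treating .ucf as an ngdbuild input is an assumption made for this sketch.
FLOW = [
    ("xst", [".vhd", ".v"], ".ngc"),
    ("ngdbuild", [".ngc", ".ucf"], ".ngd"),
    ("map", [".ngd"], ".ncd"),
    ("par", [".ncd"], ".ncd"),
    ("bitgen", [".ncd"], ".bit"),
]


def stages_to_rerun(changed_ext):
    """Return the tools that must run again when a file of this type changes."""
    for index, (_tool, inputs, _output) in enumerate(FLOW):
        if changed_ext in inputs:
            # Every stage downstream of the first affected tool must rerun.
            return [tool for tool, _, _ in FLOW[index:]]
    return []
```

A change to any HDL source file returns every stage, which reflects why the conventional flow recompiles the whole design on each invocation.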

2.3 Related Work in Accelerating FPGA Compilation
Two major approaches to accelerating FPGA compilation are (1) preserving intermediate
implementation data such as pre-compiled cores and (2) modifying or
developing new algorithms that accomplish tasks such as placement and routing faster. As
these are the two major thrusts of this work, this section will discuss other relevant work in
those respective research areas.

2.3.1 Using Pre-compiled Cores to Accelerate FPGA Compilation
As described in the introduction of this dissertation, software compilation speeds
benefit significantly from using pre-compiled libraries to store commonly used routines that
do not need to be recompiled for each compilation. By linking these libraries into the final
executable, compilation time is reduced significantly. However, the traditional approach
of FPGA compilation has often been to re-compile everything for each invocation of the
compilation flow. This is often wasteful as many of the design files do not change from run
to run. There have been several attempts in the past to leverage this fact to accelerate
FPGA compilation. The approach taken in this work is to use pre-compiled blocks called
hard macros; however, other techniques such as bitstream cores, macroblocks, virtual fabrics
and a tool called ReCoBus also demonstrate ways in which intermediate design information
can be reused to reduce compilation time.

Bitstream Cores
As the bitstream is the final implementation format of the design, it presents an
attractive opportunity to attempt to assemble a design from cores stored as portions of
a bitstream. Horta and Lockwood [8] demonstrated the creation of bitstream-based relocatable cores that are quite similar in nature to hard macros. Similar efforts are reported
in [9] where bitstream hard cores were used in a network-on-chip to provide accelerated logic
emulation and prototyping. Unfortunately, bitstream hard cores are less flexible than hard
macros because they must reside between restrictive configuration boundaries and require
matching bus macro interfaces to be present in both the cores as well as the existing FPGA
configuration. Bitstream-based cores are also more difficult to construct as most of the
bitstream formats used by FPGA vendors are proprietary.

ReCoBus
ReCoBus [4] is a tool chain that attempts to generate complete systems on-the-fly.
It leverages partially reconfigurable modules that can be loaded into an FPGA at runtime.
By reusing modules that can be swapped in and out of the FPGA, it can avoid lengthy
compilation time when attempting to reconfigure the overall system. To accomplish this
task, a system-level ReCoBus hard macro is generated based on specifications provided by
the user. Each core requires an interface template to be wrapped around its interface to
provide compatibility with the overall ReCoBus system. These cores are then converted to
special partial bitstreams that will allow the cores to be swapped into and out of the FPGA
at runtime.
The ReCoBus system does provide a way to reduce compilation time; however, each
time the functionality of a core changes, the core must be completely re-compiled. The
bus-based nature of the approach also limits its flexibility. Nevertheless, by leveraging a bitstream linker,
ReCoBus can rapidly reconfigure systems with cores that have already been compiled.

Intermediate Virtual Fabrics
Intermediate virtual fabrics are another strategy that exploits reuse of pre-compiled
structures [10]. This technique abstracts the FPGA to a domain-specific fabric created
off-line and customized for a particular application domain. These fabrics accommodate
macroblocks which are placed and routed quickly onto the fabric. This technique is effective
if the intermediate fabric has already been built for a particular application; Coole et al.
claim an average place and route speedup of 554×. However, the speed comes at a cost:
each newly created intermediate fabric still requires a lengthy place and route, and the area
overhead of building an intermediate fabric can consume up to 40% of the FPGA’s resources.

2.3.2 Accelerating Placement Techniques
Placement of FPGA primitives can largely determine the overall quality of the resulting
implementation and is therefore a critical step in producing high quality circuits.
However, placement (like its counterpart, routing) is well known for long runtimes
when high quality results are required. Some work has been done in exploring different
techniques to alleviate the significant time requirement by creating macroblocks from design
elements or clustering blocks together and modifying the optimization algorithms involved.

Macroblocks
One way to reduce the placement problem size is to reduce the number of objects
that need to be placed. Tessier [5] demonstrated Frontier, a placement tool that utilized
macroblocks and floorplanning to accelerate placement. The macroblocks used by Frontier
were similar to hard macros in that some (not all) contained relative placement information
and accordingly could be placed as a complete unit directly onto the FPGA fabric. These
macroblocks were compiled before the Frontier placement process and are different from
the hard macros used in the approach of this dissertation for two reasons. First, not all
macroblocks contained relative placement information and thus, the inner primitives were
allowed to float and be manipulated by the placer. This could ultimately reduce the overall
effectiveness of a macro-based approach. Second, the macroblocks did not contain relative
routing information.
The first step of Frontier decomposes the FPGA into a set of placement bins of
equal size. Next, the macroblocks are grouped into clusters such that each cluster can
fit into a placement bin. If not all clusters can fit into the bins, the algorithm is restarted
with a larger placement bin size. After all clusters have been assigned to a placement bin,
they are subsequently swapped to minimize bin placement costs (which are determined by
inter-bin connectivity). After bin placement, the macroblocks within the clusters are each
placed within their respective placement bin based on neighboring connectivity. At this
point, Frontier can terminate, or, if the placement is calculated to be difficult to route, a low
temperature simulated annealing process can be performed to further improve the quality of
the placement.
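
The restart-with-larger-bins step can be sketched in a few lines of Python. This is not Tessier's implementation: the first-fit decreasing grouping, the doubling of the bin size, and all names (`cluster_macroblocks`, `bin_clusters`, abstract area units) are assumptions chosen only to illustrate the loop described above. Connectivity-driven swapping and the optional annealing pass are omitted.

```python
def cluster_macroblocks(sizes, bin_capacity):
    """Group macroblock sizes into clusters of at most bin_capacity.

    Uses first-fit decreasing, a stand-in for Frontier's actual clustering.
    Returns None if a single macroblock is larger than a bin.
    """
    clusters = []
    for size in sorted(sizes, reverse=True):
        if size > bin_capacity:
            return None
        for cluster in clusters:
            if sum(cluster) + size <= bin_capacity:
                cluster.append(size)
                break
        else:
            clusters.append([size])
    return clusters


def bin_clusters(sizes, fabric_capacity, bin_capacity):
    """Restart with a larger bin size until every cluster has a bin."""
    while bin_capacity <= fabric_capacity:
        num_bins = fabric_capacity // bin_capacity
        clusters = cluster_macroblocks(sizes, bin_capacity)
        if clusters is not None and len(clusters) <= num_bins:
            return bin_capacity, clusters
        bin_capacity *= 2  # assumed growth policy for this sketch
    raise ValueError("design does not fit on the fabric")
```

For example, macroblocks of sizes 6, 5, 4, 3 and 2 on a fabric of 32 units fail with bins of 4 units (one macroblock is too large) and succeed after the bin size is doubled to 8.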
Tessier achieved a 17× speedup in placement and a 2.6× speedup
in overall place and route when compared to Xilinx PAR-M2.1 in macro-based mode
(which is no longer available in current Xilinx tools). When compared to the flattened
design mode of the PAR-M2.1 software, placement achieved a speedup of 50× and an overall
speedup of 8× for both placement and routing.
Although macroblocks can be effective at accelerating placement, hard macros have an
additional advantage over macroblocks. Hard macros have the potential to contain routing
information which means a hard macro contains a placed and routed implementation of a
circuit and thus, no further effort is necessary to map it to the FPGA fabric. The presence of
finished routes in the hard macro can have a significant impact on routing time as several nets
in a design can already be routed before the final routing process. Nevertheless, macroblocks
can be an effective technique to reduce overall placement time without impacting design
implementation quality.

Clustering and Hierarchical Simulated Annealing Placement
Sankar and Rose [2] pursued an objective similar to that of this work: serving the
growing set of FPGA designers who are willing to trade circuit quality for
shorter compile times. Sankar and Rose were able to demonstrate a 52.4× speedup in the
placement process if users are willing to accept a 30% increase in total routing wire length.
To accomplish this, Sankar and Rose used a multi-level clustering technique to group the
primitives of a design into clusters based on their connectivity. This step was followed by a
custom simulated annealing algorithm to iteratively improve the placement.

2.3.3 Routability-driven Routing
Swartz et al. [3], as with [2], also proposed trading quality for faster compile times.
However, instead of optimizing placement, Swartz et al. focus on accelerating routing.
The method used is a basic PathFinder algorithm that attempts to reduce the total number
of iterations required to resolve congestion. It also has mechanisms for identifying unroutable scenarios quickly up front to avoid spending a lot of time on an impossible routing
task.
Swartz et al. use two significant methods. The first is a directed depth-first
search expansion when routing a net. The cost function is modified to direct the search
towards the desired sink much more quickly than the conventional breadth-first search. The
second method is to filter the routing resources considered for expansion during the
search. By being more selective of the routing resources that are followed during the routing
process, much time can be saved by avoiding traversal of fruitless paths.
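
The benefit of directing the search toward the sink can be illustrated on an idealized grid of switch boxes. The Python sketch below is not the router of Swartz et al.: it uses a greedy best-first expansion ordered by Manhattan distance as a stand-in for their directed search, ignores congestion and resource filtering, and all function names are hypothetical. It only counts how many nodes each strategy expands before reaching the sink.

```python
import heapq
from collections import deque


def neighbors(node, width, height):
    """Yield the adjacent switch boxes of a node on a width x height grid."""
    x, y = node
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < width and 0 <= ny < height:
            yield (nx, ny)


def breadth_first_expansions(source, sink, width, height):
    """Count nodes expanded by a plain breadth-first search."""
    seen, queue, expanded = {source}, deque([source]), 0
    while queue:
        node = queue.popleft()
        expanded += 1
        if node == sink:
            return expanded
        for nxt in neighbors(node, width, height):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return None


def directed_expansions(source, sink, width, height):
    """Count nodes expanded when the cost steers the search toward the sink."""
    def distance(node):
        return abs(node[0] - sink[0]) + abs(node[1] - sink[1])

    seen, heap, expanded = {source}, [(distance(source), source)], 0
    while heap:
        _, node = heapq.heappop(heap)
        expanded += 1
        if node == sink:
            return expanded
        for nxt in neighbors(node, width, height):
            if nxt not in seen:
                seen.add(nxt)
                heapq.heappush(heap, (distance(nxt), nxt))
    return None
```

On an empty 10 by 10 grid with the source and sink in opposite corners, the breadth-first search expands every node while the directed search expands only the nodes along one shortest path.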
Overall the techniques used are quite effective; however, they were implemented using
VPR [11], which makes comparison to commercial tools difficult. Swartz et al. state that
the routability-driven router can route a 20,000 logic block circuit in 23 seconds on a 300
MHz SPARCstation.

2.3.4 Summary and Overview of this Work
Although many of the techniques mentioned in this section have proven to be effective,
they each focus on a single corner of the larger compilation time problem. The goal of this
dissertation is to demonstrate a comprehensive approach to accelerating FPGA compilation
that combines the usefulness of hard macros for design re-usability with a complete end-to-end compilation flow tuned at each step for rapid compilation. The approach taken here
also focuses on demonstrating the techniques on commercial FPGAs as such work is proof of
concept in its own right. Using these different techniques, this dissertation will demonstrate
a new way to compile FPGA designs that is significantly different from the conventional
approach.

CHAPTER 3.
RAPIDSMITH: AN OPEN SOURCE PLATFORM FOR
CREATING FPGA CAD TOOLS FOR XILINX FPGAS

At the onset of this work, there was no organized way to build CAD tools for commercial FPGAs, much less a way to implement a rapid prototyping compilation flow.
This chapter motivates the need for a new FPGA CAD tool framework called RapidSmith,
a Java-based Xilinx CAD tool framework written specifically to support the endeavors of
this work.

3.1 Introduction
One of the biggest challenges of performing FPGA CAD experiments on commercial
FPGAs is the difficulty in gaining access to a suitable database containing all of the native
resources on an FPGA. This challenge has caused many FPGA researchers to gravitate
toward hypothetical FPGA architectures found in popular tools such as VPR [11] to perform
CAD experiments.
At the onset of this work, it was decided that in order to properly address the
long-standing problem of lengthy compilation times for commercial FPGAs, it would
be necessary to create custom CAD tools for commercial devices. However, as already
mentioned, the FPGA vendor must make available a means to obtain a database of FPGA
resources and also import/export methods to allow low-level access to FPGA designs.
Altera provides a research-based interface to their tools called the Quartus University
Interface Program (QUIP) [12]. QUIP provides enough detail about Altera devices to enable
researchers to create tools such as front-end synthesis, technology mappers, placers
and global routers. However, it does not allow its users to create detailed routers,
which would be an essential part of accelerating the FPGA compilation flow. Ultimately,
after examining the offerings of Altera, it was concluded that the provided interface lacked
certain features needed to implement this work’s experiments and meet its goals.
Xilinx provides a more complete interface called the Xilinx Design Language (XDL)
and an accompanying tool xdl [13] that provides enough information to create all the tools
enabled by QUIP and, additionally, a detailed router. It also allows
proprietary Xilinx netlist formats (NCD and NMC) to be converted to and from the human
readable ASCII XDL netlist format. This conversion functionality enables the user to make
very specific and detailed changes to the netlist of a design. One of the disadvantages of
XDL, however, is that it can be quite cumbersome to use. For example, the text-based
database report files generated by the xdl tool can reach over 70 gigabytes. This makes
direct access of such files infeasible for tools such as placers and routers. In addition, there
are some minor gaps in the information that is provided by XDL report files (also called
XDLRC files) that must be acquired through other means. Despite these shortcomings,
Xilinx FPGAs (using XDL as their interface) were chosen to demonstrate this work.
Given the usefulness but unfortunate shortcomings of XDL, it was clear that a CAD
tool framework to leverage XDL would be necessary to carry out the experiments described
later in this work. The XDL framework created for this purpose is called RapidSmith. The
remainder of this chapter summarizes related work, describes XDL and RapidSmith, and
explains how RapidSmith is able to overcome the shortcomings of XDL and provide an effective
rapid prototyping framework for FPGA CAD tools.

3.2 Related Work
Some researchers have used XDL before to perform certain kinds of experiments such
as a bus macro generator [7], half latch detection and removal [14], C-slow re-timing [15],
power estimation [16], floor planning tools [4], routing constraints [17][18] and run-time
reconfiguration [19]. Despite these and several other efforts which leverage XDL, a unified
open source solution to facilitate the use of XDL did not materialize until recently. Steiner
[6] and Kepa et al. [20] both demonstrate functional XDLRC-derived device databases;
however, they have not (at the time of writing) been openly released.

3.2.1 Torc
Torc [21] is a set of tools and a framework that is very similar to RapidSmith and comparably capable.
Developed by the University of Southern California’s Information Sciences Institute
(USC-ISI), Torc aims to provide a more comprehensive tool set than RapidSmith, including
front end tools such as technology mapping and packing. It also proposes support for other
FPGA vendors if the information can be obtained, which contrasts with the goals of RapidSmith,
which was written specifically for Xilinx devices. Torc leverages XDL in the same ways as
RapidSmith; however, it is written in C++ whereas RapidSmith is in Java. Ultimately Torc
was not used because it was not available in time for this work and because the ideas and concepts
of this work could be prototyped more quickly in Java than in C++.

3.3 XDL: The Xilinx Design Language
RapidSmith is based on the interface provided by Xilinx called the Xilinx Design
Language (XDL). XDL is provided mainly in the form of an executable, xdl. The Xilinx
tool xdl provides three important capabilities (options) that enable the creation of an FPGA
CAD tool framework for Xilinx FPGAs:
1. -report: Generates complete FPGA device reports detailing placement sites and an
exhaustive routing graph (without timing data)
2. -ncd2xdl: Performs NCD to XDL conversion (allowing conventional Xilinx designs
to be converted to the open XDL format)
3. -xdl2ncd: Performs XDL to NCD conversion (allowing XDL-manipulated designs to
be re-injected into several locations within the Xilinx design flow)
With these options, the xdl tool provides detailed device information and can import
and export designs to and from the Xilinx flow. This section serves as a short reference for
XDL as it is no longer fully documented by Xilinx and a correct understanding of XDL is
necessary in order to fully understand the value and capabilities of RapidSmith.
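
As a rough illustration of how these three options might be driven from a script, the Python sketch below builds the corresponding command lines without running them. The helper names are hypothetical, and the assumption that the conversion options take an input file followed by an output file should be checked against the installed xdl tool.

```python
def xdl_report_cmd(part):
    """Command line for generating an XDLRC device report for a part name."""
    return ["xdl", "-report", part]


def xdl_convert_cmd(src, dst):
    """Command line for NCD <-> XDL conversion, chosen from file extensions.

    The positional order (input file, then output file) is an assumption.
    """
    if src.endswith(".ncd") and dst.endswith(".xdl"):
        flag = "-ncd2xdl"
    elif src.endswith(".xdl") and dst.endswith(".ncd"):
        flag = "-xdl2ncd"
    else:
        raise ValueError("expected one .ncd file and one .xdl file")
    return ["xdl", flag, src, dst]
```

The returned lists are in the form accepted by Python's `subprocess.run`, so a tool could export a design, modify the XDL text, and convert it back with two such calls.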

3.3.1 Detailed FPGA Descriptions in XDLRC Reports
The first option supplied by xdl is the -report option. This option takes a Xilinx
part name as input and outputs a file with the extension .XDLRC that is a report file
detailing all of the tiles, primitive sites and routing resources (without timing data) that are
available in the given part. To obtain the most detailed information possible, the options
-pips -all_conns are also included in the command. These XDLRC report files are
generally several gigabytes in size; some XDLRC file sizes are shown later in the chapter
in Table 3.1.
XDLRC report files have a very predictable and regular structure, making them easy
to parse in order to harvest valuable FPGA architectural information. These reports have two main
sections: tiles and primitive definitions. An example of some XDLRC report file syntax is
shown in Listing 3.1. The tile section, as seen in lines 10–15, describes a complete floor plan
layout of a device and details all of the separate resources found in the tiles such as primitive
sites and routing resources such as wires and PIPs. Line 10 gives the dimensions (height and
width) in tiles of the FPGA layout. The primitive definition section, as shown in lines 16–20,
is family specific (Virtex 4, Virtex 5, etc.) and is included in every XDLRC file generated by
xdl. It contains a listing of all primitives supported by the device’s family, giving detailed
information about each such as inputs, outputs, internal connections and resources.
Listing 3.1: Basic Structure of an XDLRC Report File
 1  # =======================================================
 2  # XDL REPORT MODE $Revision: 1.6 $
 3  # time: Thu Apr 16 13:58:39 2009
 4  # cmd: c:\Xilinx\10.1\ISE\bin\nt\xdl.exe -report -pips -all_conns xc4vfx12
 5  # =======================================================
 6  (xdl_resource_report v0.2 xc4vfx12sf363-12 virtex4
 7  # **************************************************************************
 8  # * Tile Resources *
 9  # **************************************************************************
10  (tiles 73 66
11  (tile 0 0 EMPTY_CORNER_X0Y72 EMPTY_CORNER 0
12  (tile_summary EMPTY_CORNER_X0Y72 EMPTY_CORNER 0 0 0)
13  )
14  ...
15  )
16  (primitive_defs 41
17  (primitive_def BSCAN 8 10
18  )
19  ...
20  )
21  # **************************************************************************
22  # * Summary *
23  # **************************************************************************
24  (summary tiles=4818 sites=8594 sitedefs=41 numpins=213992 numpips=6362157)
25  )
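
Because the report is line-oriented and regular, a single streaming pass is enough to pull out its top-level facts. The Python sketch below is only an illustration built from the syntax visible in Listing 3.1; the function name `scan_xdlrc` and the layout of the returned dictionary are assumptions, and a real report contains many more statement types that this sketch simply skips.

```python
def scan_xdlrc(lines):
    """Collect part, grid size, tile names and summary counts from a report."""
    info = {"tile_names": []}
    for line in lines:
        text = line.strip()
        if not text or text.startswith("#"):
            continue  # blank lines and comment banners carry no data
        fields = text.strip("()").split()
        if not fields:
            continue  # a lone closing parenthesis
        keyword = fields[0]
        if keyword == "xdl_resource_report":
            info["part"], info["family"] = fields[2], fields[3]
        elif keyword == "tiles":
            # Listing 3.1, line 10: height and width of the tile grid
            info["rows"], info["cols"] = int(fields[1]), int(fields[2])
        elif keyword == "tile":
            info["tile_names"].append(fields[3])
        elif keyword == "summary":
            pairs = (field.split("=") for field in fields[1:])
            info["summary"] = {key: int(value) for key, value in pairs}
    return info


SAMPLE = """\
(xdl_resource_report v0.2 xc4vfx12sf363-12 virtex4
(tiles 73 66
(tile 0 0 EMPTY_CORNER_X0Y72 EMPTY_CORNER 0
(tile_summary EMPTY_CORNER_X0Y72 EMPTY_CORNER 0 0 0)
)
)
(summary tiles=4818 sites=8594 sitedefs=41 numpins=213992 numpips=6362157)
)
"""

report = scan_xdlrc(SAMPLE.splitlines())
```

Reading the file line by line rather than loading it whole matters here, since the reports are gigabytes in size. For the sample, the grid dimensions (73 by 66) multiply to the 4818 tiles reported in the summary line.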

Tiles
XDLRC report files abstract the FPGA fabric into a two dimensional array of sections
called “tiles” which are conceptually laid out edge to edge in a checkerboard fashion. Each
tile is specified with a unique name and a type. The tile name often includes an X and Y
coordinate as a suffix and the beginning of the name often indicates its type. For example,
a tile named CLB_X2Y3 is of type CLB and is the 3rd CLB to the right of the origin and
4th above the origin (the origin being the bottom left corner of the FPGA). An interesting
aspect of the X and Y coordinate pairs is that they are unique to the tile type rather than to every tile
declared in the grid. For example, a tile called CLB_X2Y3 will have a switch box tile directly
to its left which has the same X and Y coordinates and is called INT_X2Y3.
Tiles are mainly used as a reference to specify locations of certain resources. Tiles
contain declarations of FPGA resources such as primitive sites, wires, and PIPs.
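
The naming convention can be decoded with a short regular expression. The Python sketch below assumes names of the form TYPE_X<column>Y<row>, as in the examples above; since the text notes that names only often follow this pattern, the hypothetical helper `parse_tile_name` returns None for names that do not.

```python
import re

# Assumed form: <type prefix>_X<x>Y<y>, e.g. CLB_X2Y3 or EMPTY_CORNER_X0Y72.
TILE_NAME = re.compile(r"^(?P<prefix>.+)_X(?P<x>\d+)Y(?P<y>\d+)$")


def parse_tile_name(name):
    """Split a tile name into (type prefix, x, y), or return None."""
    match = TILE_NAME.match(name)
    if match is None:
        return None
    return match.group("prefix"), int(match.group("x")), int(match.group("y"))
```

Note that CLB_X2Y3 and INT_X2Y3 decode to the same coordinates with different prefixes, so the coordinates alone do not identify a position in the overall tile grid.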

Primitive Sites
A primitive site is a location where an instance of an FPGA primitive can reside.
Each primitive site has a type that is compatible with one or more primitive instance types.
For example, a SLICEL is a type of primitive available in Virtex 4 parts. Half of all slice
primitive sites in Virtex 4 parts are type SLICEL and the other half are type SLICEM. A
SLICEL primitive can be placed on either a SLICEL or SLICEM site, however, the converse
is not true. As mentioned earlier, a list of all primitive types is included in every XDLRC
report file.
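
The one-way compatibility between SLICEL and SLICEM can be captured in a small lookup table. The Python sketch below encodes only the Virtex 4 rule stated above; the table name, the function name, and the fallback that any other primitive fits only a site of its own type are assumptions made for illustration.

```python
# Primitive type -> site types it may occupy (Virtex 4 slice rule from the text).
COMPATIBLE_SITES = {
    "SLICEL": {"SLICEL", "SLICEM"},
    "SLICEM": {"SLICEM"},
}


def can_place(primitive_type, site_type):
    """Return True if an instance of primitive_type may sit on site_type."""
    # Assumed default: unlisted primitives only fit sites of the same type.
    allowed = COMPATIBLE_SITES.get(primitive_type, {primitive_type})
    return site_type in allowed
```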

Wires
Wires declared in XDLRC reports refer to wires which cross tile boundaries. A wire is
simply un-programmable metal that exists on the FPGA fabric as part of the routing graph.
A wire always begins in a tile and can span one or more tiles. One of the peculiar attributes
of XDLRC is that even though a wire represents a continuous piece of metal in the FPGA
fabric, a separate name is given to each segment of the wire for each tile it occupies. Thus,
a wire which spans 3 tiles will have three separate names attached to it depending on which
tile is being referenced.

PIPs
Programmable interconnect points, or PIPs, are the configurable part of the FPGA
routing graph. PIPs are completely contained within a tile (they do not straddle tile boundaries)
and define a potential connection that can exist between two wire segments. In most cases,
thousands of PIPs are defined in the switch matrix tiles of an FPGA, most of which allow
a certain wire segment to connect to one of many different wire segments. Some PIPs are
used simply to denote a one-to-one connection between two wire segments within a tile and
are sometimes referred to as “fake PIPs” as they do not affect the bitstream. These fake
PIPs exist simply to complete the connectivity of the routing graph as their absence would
make routing impossible with XDLRC reports alone.
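
Wires and PIPs together form a directed routing graph: wire segments are always connected, while a PIP contributes an edge only when it is turned on. The Python sketch below is a minimal model of that idea and is not the RapidSmith data structure; the class and the specific wire names used in the example (OMUX0, N2MID0, N2END0) are hypothetical illustrations.

```python
from collections import namedtuple

# A routing node is a wire segment name as seen from one particular tile.
Node = namedtuple("Node", ["tile", "wire"])


class RoutingGraph:
    def __init__(self):
        self.wires = {}  # Node -> set of Nodes: fixed metal crossing tiles
        self.pips = {}   # Node -> set of Nodes: programmable, within one tile

    def add_wire(self, src, dst):
        self.wires.setdefault(src, set()).add(dst)

    def add_pip(self, tile, src_wire, dst_wire):
        self.pips.setdefault(Node(tile, src_wire), set()).add(Node(tile, dst_wire))

    def reachable(self, start, enabled_pips):
        """Return all nodes driven from start given the set of enabled PIPs."""
        seen, stack = {start}, [start]
        while stack:
            node = stack.pop()
            nexts = set(self.wires.get(node, ()))
            nexts |= {dst for dst in self.pips.get(node, ())
                      if (node, dst) in enabled_pips}
            for nxt in nexts - seen:
                seen.add(nxt)
                stack.append(nxt)
        return seen


# A DOUBLE wire spanning three tiles carries a different name in each tile.
graph = RoutingGraph()
graph.add_pip("INT_X2Y3", "OMUX0", "N2BEG0")
graph.add_wire(Node("INT_X2Y3", "N2BEG0"), Node("INT_X2Y4", "N2MID0"))
graph.add_wire(Node("INT_X2Y4", "N2MID0"), Node("INT_X2Y5", "N2END0"))
```

With the single PIP disabled nothing beyond the source is reachable; enabling it connects the source to all three segments of the wire.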

3.3.2 Designs in XDL
The second and third options provided by xdl are the XDL/NCD conversion functions
which enable proprietary Xilinx NCD files (native netlists for Xilinx FPGAs) to be
converted to and from the open XDL format. A design in XDL is equivalent in most respects to a Xilinx NCD file except that it is a human readable ASCII file format and
is not directly compatible with other Xilinx tools in the flow. However, by converting
XDL to NCD files, XDL designs can be re-injected into several different steps of the Xilinx flow such
as after packing (map), after placement (par -r), and after routing (par -p) as shown in
Figure 3.1.

[Figure 3.1 diagram: Xilinx flow map -> .NCD -> par -r (place only) -> .NCD -> par -p (route only) -> .NCD -> bitgen -> .BIT; the xdl tool converts each .NCD to and from .XDL for use by the BYU RapidSmith tools]
Figure 3.1: RapidSmith and XDL Interacting at Different Points within the Xilinx Tool Flow

XDL design files contain four types of declarations: design, module, instance and
net. The first statement found in XDL designs is the design statement and occurs only
once in each design file. It specifies the design name and part that the design is targeting.
The module declaration is not a commonly used construct in XDL but is powerful in that
it enables a level of hierarchy in designs and is capable of representing a hard macro (a
pre-placed and routed design block). The majority of the contents of an XDL design exist
as instances and nets.

Instances and Placement
All logic elements of a design are contained within the instances declared in the XDL
file. An instance is an instantiation of a particular primitive type (SLICEL, SLICEM, etc.)
that has a unique name, a type, an optional primitive site assigned to it and a list of attributes
and values which configure the instance. Since assignment to a primitive site is optional,
XDL files can represent designs that are unplaced, partially placed or fully placed. Placement
of a design takes place when instances are assigned specific primitive site locations in the
final implementation. Therefore, creating a tool to automatically place a design has very
little overhead when using XDL as it simply amounts to creating a mapping between instances
and primitive sites.

Nets and Routing
In order to specify connections between instances, the net declaration is used. A net
has a unique name, a type, a list of input/output pins and optionally, a list of PIPs. Each
net has one outpin (source) and one or more inpins (sinks). Pins are declared by using the
unique instance name and pin on the instance. PIPs define how a net is routed, and the
presence of PIPs in a net indicates that the net is at least partially routed, if not fully routed.
Therefore, the task of a router is simply to assign a list of PIPs to a net. Nets cannot be
routed if a design is not placed (PIPs require placed instances to make connections); however,
XDL can represent designs that are un-routed, partially routed or fully routed.
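
The instance and net declarations described above can be sketched as a small data model. This is only an illustration in Python, not the XDL grammar or the RapidSmith API: the class names, fields, site names and PIP tuples are all assumptions chosen for the example. It shows the two points made in the text, that placement is a mapping from instances to primitive sites and that a net carries routing once it has PIPs.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Instance:
    """One XDL-style instance: name, primitive type, optional site, attributes."""
    name: str
    primitive_type: str
    site: Optional[str] = None          # None means the instance is unplaced
    attributes: dict = field(default_factory=dict)

    def is_placed(self) -> bool:
        return self.site is not None


@dataclass
class Net:
    """One XDL-style net: a single outpin (source), inpins (sinks), and PIPs."""
    name: str
    outpin: tuple                       # (instance name, pin name)
    inpins: list
    pips: list = field(default_factory=list)

    def has_routing(self) -> bool:
        # Any PIP means the net is at least partially routed.
        return len(self.pips) > 0


def place(instances, assignment):
    """Apply a placement: a plain mapping from instance name to primitive site."""
    for inst in instances:
        if inst.name in assignment:
            inst.site = assignment[inst.name]


def placement_status(instances):
    """Classify a design as unplaced, partially placed or fully placed."""
    placed = sum(1 for inst in instances if inst.is_placed())
    if placed == 0:
        return "unplaced"
    return "fully placed" if placed == len(instances) else "partially placed"
```

A placer built on such a model only has to produce the `assignment` dictionary, and a router only has to fill in each net's `pips` list.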

3.4 RapidSmith: A Framework to Leverage XDL and Provide a Platform to Create FPGA CAD Tools
There are several challenges that exist to successfully use XDL to build new CAD tools
and leverage it as a design exchange method. Although XDLRC report files are extensive and
provide a complete view of an FPGA architecture, their representation is extremely inefficient
as the files can exceed 70 gigabytes in size. This makes direct use of XDLRC report files
infeasible for creating design tools. Another challenge is that the XDLRC report files lack
information that is necessary for efficient placement and routing. Finally, no open source
XDL parser or data structure is provided with the xdl tool, so XDL manipulation is
often delegated to Perl scripts created for each specific task.
RapidSmith provides two core packages to deal with the challenges of using XDL.
First, RapidSmith provides a compact XDLRC-equivalent database file format and corresponding data structure (pictured in Figure 3.2a) that is fast loading and suitable for all
kinds of applications. RapidSmith also includes the missing information that can aid in
making more effective placement and routing tools.
Second, RapidSmith provides a complete XDL parser and corresponding data structure (pictured in Figure 3.2b) to facilitate manipulating XDL designs and enable
researchers to build highly customized CAD tools for Xilinx FPGAs.

[Figure 3.2: RapidSmith Abstractions for (a) Devices and (b) Designs. Classes shown include Device, Tile, TileType, PrimitiveSite, PrimitiveType, WireHashMap, Design, Instance, Net, NetType, Module, ModuleInstance, Attribute, Pin, PIP and Port.]

3.4.1 Xilinx FPGA Database Files in RapidSmith
A very important attribute of an FPGA CAD database is its size and, consequently,
its load time from disk. RapidSmith employs three major strategies to control its
device database size and load time: (1) aggressive wire and object reuse, (2) careful pruning
of unnecessary wires, and (3) customized serialization with compression.

Wire and Object Reuse
Wires are the major reason for the gigantic size of the XDLRC report files. FPGAs
are generally very regular replicated structures and the same wires can appear identically
all over the chip. Efficiently accommodating the infrequent irregularities that occur in the
routing structure is key to creating an efficient and much smaller device database. Therefore,
rather than define a wire object for every wire in every tile, RapidSmith employs a unique
set of wires that are specified by a name and a set of tile offsets that remove reference to
any particular tile in the chip, making them reusable. This reusability factor is able to
dramatically reduce the size required to represent wire connections on the device.
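
The reuse idea can be illustrated with a short sketch. It is an assumption-laden toy in Python and not RapidSmith's actual data structure: connections are given with absolute tile coordinates, rewritten as (wire name, x offset, y offset) entries, and identical relative tables are stored once and shared by every tile that has them. The function name and the wire names in the usage note are hypothetical.

```python
def build_shared_tables(absolute_connections):
    """Convert absolute wire connections into shared, tile-relative tables.

    absolute_connections maps ((x, y), wire) to a list of
    ((dest_x, dest_y), dest_wire) pairs.
    Returns (per_tile, pool): per_tile maps tile -> {wire: relative table},
    pool holds each distinct relative table exactly once.
    """
    pool = {}
    per_tile = {}
    for ((x, y), wire), dests in absolute_connections.items():
        # Express every destination as an offset from the source tile,
        # which removes any reference to a particular tile.
        relative = tuple(sorted((dest_wire, tx - x, ty - y)
                                for (tx, ty), dest_wire in dests))
        # Reuse an identical table if one was already stored.
        shared = pool.setdefault(relative, relative)
        per_tile.setdefault((x, y), {})[wire] = shared
    return per_tile, pool
```

For example, a wire that reaches two tiles east from tile (0, 0) and the same wire reaching two tiles east from tile (5, 3) both reduce to one shared entry with offset (2, 0), so the pool grows with the number of distinct patterns rather than with the number of tiles.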

Wire Graph Pruning for Efficiency
The XDLRC report contains wires that are present simply for the sake of completeness but do not contribute useful information to the actual routing structure. For
example, in a HEX wire (a wire which spans 7 switch matrices), connections to the wire can
only be made in 3 of the 7 tiles which it spans. The 4 wire segments in tiles that do not connect
to other wires are implied by the overall structure and are thus removed by RapidSmith. By
removing these wires, significant memory savings are obtained as well as faster routing times
as the removed wires are no longer examined when expanding connection searches.
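
A minimal sketch of this pruning rule, under the assumption that a tile's wires and PIPs are available as simple Python lists (the function name and the segment names are hypothetical, and this is not the RapidSmith implementation): a wire segment is kept only if at least one PIP in its tile connects to it.

```python
def prune_tile_wires(tile_wires, tile_pips):
    """Keep only the wires in a tile that take part in at least one PIP.

    tile_wires: list of wire names present in the tile.
    tile_pips:  list of (source wire, sink wire) pairs in the same tile.
    """
    connected = {wire for pip in tile_pips for wire in pip}
    return [wire for wire in tile_wires if wire in connected]


def prune_device(wires_by_tile, pips_by_tile):
    """Apply the per-tile rule to every tile of a device description."""
    return {tile: prune_tile_wires(wires, pips_by_tile.get(tile, []))
            for tile, wires in wires_by_tile.items()}
```

Applied to a wire with one segment in each of 7 tiles and PIPs in only 3 of them, `prune_device` leaves 3 segments, matching the HEX example in the text.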

Serialization and Compression
RapidSmith uses custom serialization to store only the essential data needed to reconstruct the device databases. The default Java serialization routines were considered, but
were later abandoned as they turned out to be inefficient and caused files to load more slowly.
RapidSmith uses the Hessian 2.0 serialization compression protocol [22] to further reduce
the size of the device file. The Hessian protocol provides low-level serialization APIs and
also compresses/decompresses the data found in the device files with minimal impact on load
time.
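
To make the idea of custom serialization followed by compression concrete, the following sketch packs a table of wire entries into a fixed binary layout and compresses it. It deliberately swaps in Python's standard `struct` and `zlib` modules in place of Java and Hessian, and the record layout (a 16-bit wire name index plus two signed 16-bit tile offsets) is an assumption made for the example, not the RapidSmith file format.

```python
import struct
import zlib

RECORD = "<Hhh"                      # name index, x offset, y offset (6 bytes)
RECORD_SIZE = struct.calcsize(RECORD)


def save_wire_table(wires):
    """Serialize (name_index, dx, dy) records compactly, then compress them."""
    raw = struct.pack("<I", len(wires))
    raw += b"".join(struct.pack(RECORD, *wire) for wire in wires)
    return zlib.compress(raw)


def load_wire_table(blob):
    """Decompress and rebuild the list of (name_index, dx, dy) records."""
    raw = zlib.decompress(blob)
    (count,) = struct.unpack_from("<I", raw, 0)
    return [struct.unpack_from(RECORD, raw, 4 + RECORD_SIZE * i)
            for i in range(count)]
```

Storing only the fields needed to rebuild the structure keeps the uncompressed payload small, and the regularity of the data means a general-purpose compressor shrinks it further.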

Overall Performance
Using all of the techniques mentioned, RapidSmith database files are able to achieve
a file size compression of over 10,000× when compared to the original XDLRC report files
and all but the largest files can be loaded in less than a few seconds. A summary of large
parts and their corresponding device file statistics is shown in Table 3.1. Measurements
shown in Table 3.1 were recorded on a Windows 7 64-bit workstation with a Core i7-860
processor, 8GB of DDR3 RAM and 1TB 7200RPM SATA hard disk. The 32-bit Oracle JVM
ver. 1.6.0_22 was used for Java bytecode execution.

Table 3.1: RapidSmith Device Files Performance
Xilinx Part   XDLRC         Compressed   Java Heap   Load Time
Name          Report Size   File Size    Usage       From Disk
V4 SX55       3.5GB         539KB        34MB        0.299s
V4 FX140      8.0GB         1546KB       70MB        0.616s
V4 LX200      10.0GB        1010KB       61MB        0.602s
V5 FX200T     9.4GB         1227KB       60MB        0.585s
V5 TX240T     10.0GB        1111KB       56MB        0.620s
V5 SX240T     11.9GB        1135KB       61MB        0.630s
V5 LX330      12.5GB        1069KB       69MB        0.622s
V6 CX240T     8.5GB         937KB        35MB        0.460s
V6 SX475T     17.7GB        1506KB       61MB        0.814s
V6 LX760      22.8GB        1758KB       77MB        1.068s
V7 855T       32.0GB        2634KB       115MB       1.408s
V7 1500T      53.0GB        4985KB       263MB       2.653s
V7 2000T      73.6GB        5956KB       301MB       3.339s

3.4.2 Augmented XDLRC Information in RapidSmith
The XDLRC report neglects to inform the user about primitive sites that can support
more than one primitive type. The lack of this information creates two inefficiencies or
challenges: (1) placements produced without this information may not be complete or may
result in inefficient placements and (2) certain primitives do not have native sites of their
own and therefore must reside on sites of a different type. When a primitive resides on a
site of a different type, there can be pin name mismatches that can cause problems for a
router.
Using RapidSmith, we have built special programs to automatically compile a complete listing of primitive site compatibilities and pin name mappings. This information is
included with the RapidSmith distribution and integrated into the APIs to provide a seamless solution. Without the help and supplementation provided in RapidSmith, XDL users
would have to recompile this information manually or re-implement several of the programs
already included with RapidSmith.
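
The kind of information described here can be pictured as two lookup tables. The sketch below is illustrative only: the table contents, the `PRIM_A` and `SITE_B` names, the pin names and the function names are assumptions made up for the example and are not taken from the RapidSmith database or its APIs.

```python
# Which site types can host each primitive type (illustrative entries only).
COMPATIBLE_SITE_TYPES = {
    "SLICEL": ["SLICEL", "SLICEM"],
    "PRIM_A": ["SITE_B"],            # hypothetical primitive with no native site
}

# Pin renaming needed when a primitive sits on a foreign site type.
PIN_NAME_MAP = {
    ("PRIM_A", "SITE_B"): {"IN": "I0", "OUT": "O"},
}


def candidate_site_types(primitive_type):
    """Return every site type a placer may consider for this primitive type."""
    return COMPATIBLE_SITE_TYPES.get(primitive_type, [primitive_type])


def site_pin_name(primitive_type, site_type, pin):
    """Translate a primitive pin name to the pin name used on the host site."""
    mapping = PIN_NAME_MAP.get((primitive_type, site_type), {})
    return mapping.get(pin, pin)     # unchanged when no renaming is recorded
```

A placer would consult `candidate_site_types` to enlarge its search space, and a router would call `site_pin_name` before looking up the wire attached to a pin.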

3.4.3 XDL Design Representation in RapidSmith
As already shown, Figure 3.2 illustrates how RapidSmith data structures represent
the data contained in XDL. Most classes in RapidSmith have a one-to-one
relationship with XDL/XDLRC constructs, which makes RapidSmith intuitive to use
with XDL designs and XDLRC reports. A fully optimized XDL parser and export method
are included with RapidSmith to populate its data structures and produce Xilinx-compatible
XDL for re-injection into the Xilinx flow. RapidSmith also includes graphical tools to browse
and explore devices and designs as shown in Figure 3.3. The Device Browser, shown in
Figure 3.3a, illustrates how the user can explore the resources and wires available in a device
in RapidSmith and demonstrates the framework which can be used by researchers to create
custom graphical CAD tools that interact with the FPGA fabric. The Device Browser
leverages the classes outlined in Figure 3.2a. The Design Explorer application, shown in
Figure 3.3b, allows the user to load XDL designs and, optionally, timing reports to rapidly
search through the different constructs and leverages the classes found in Figure 3.2b.
The APIs available in the design package of RapidSmith are particularly rich with over
700 documented methods for design manipulation. These methods include functionality such
as placement/un-placement of instances, creating new instances from scratch, determining
the type of nets (wire, VCC, GND, CLK, etc), changing the source of a net, un-routing a net,
manually routing a net (adding PIPs), and hundreds of other useful design transformations.

3.4.4 Impact of RapidSmith
RapidSmith has been released as an open source project under the GPL license and
is available at http://rapidsmith.sourceforge.net. In the first 12 months since its release, it
has been used and/or cited in 12 published works. In addition, RapidSmith has been used
as project material in graduate-level courses taught at both Brigham Young University and
Virginia Tech.

[Figure 3.3: Screenshots of Graphical Tools Provided with RapidSmith to Browse (a) Devices or (b) Designs]

CHAPTER 4.
HMFLOW 2010: ACCELERATING FPGA COMPILATION
WITH HARD MACROS FOR RAPID PROTOTYPING

With the creation of RapidSmith as described in Chapter 3, the necessary prerequisites were in place to create a custom, end-to-end FPGA compilation flow based on
hard macros. The primary goal of this custom CAD flow is to provide extremely fast compilation that would be useful in rapid prototyping scenarios where achieving high clock rates
is not necessarily a priority. That is, in creating such a compilation flow, its optimization
and implementation were dictated by trying to answer the following question:

“Using hard macros and completely ignoring circuit quality, how fast can FPGA compilation
be made to run?”

This chapter describes the preliminary work, the various components and results of
such an endeavor to create a fast, rapid prototyping FPGA compilation flow.

4.1 Preliminary Work
This section describes the benefits of using hard macros and why they were chosen as
the mechanism for storing intermediate compiled circuit information. It also describes a series
of preliminary experiments that prove hard macros are capable of accelerating compilation
and that they are a feasible approach to creating a complete FPGA compilation flow.

4.1.1 Selection of a Compiled Circuit Representation
Storing pre-compiled circuit information can be done in a number of ways. The
most common method used by Xilinx and other IP vendors is as a synthesized netlist. A
netlist describes the primitive components that compose a design as well as a list of nets
describing their interconnections. Compiling designs strictly from netlists would provide
some compilation time acceleration, but it would be limited at best since it only allows the
user to skip the synthesis step.
Another technique to store a compiled circuit representation would be a relationally placed macro (RPM). Xilinx has a specific RPM format that allows FPGA primitives to be
described in relation to each other in terms of relative placement on the FPGA fabric. As
RPMs exist at the primitive netlist level (NCD/XDL), RPMs allow the user to skip synthesis,
mapping, packing and placement.
The most comprehensive method for storing pre-compiled FPGA circuit descriptions,
however, is a hard macro. Hard macros have all the benefits of RPMs with the additional
feature of being able to store relative routing information. Hard macros are able to provide
a more complete implementation of a circuit, yet still maintain a level of
placement flexibility similar to that of RPMs by leveraging the regularity of FPGA routing structures.
Ultimately, hard macros can skip all of the major steps in FPGA compilation: synthesis,
mapping, packing, placement and routing.
Although hard macros contain pre-compiled information, they must be assembled
into a design, placed at some location on the FPGA fabric, and connections in between
hard macros must be routed. However, these steps are much less complex than conventional
FPGA design compilation. For example, one of the advantageous side-effects of using hard
macros as the common design element of FPGA design is that the number of objects requiring
placement in the final placement phase is significantly reduced. Typical FPGA designs often
contain tens of thousands of primitives that must be placed, whereas hard macro-based designs only
contain a few dozen to a few hundred hard macros. For this and the reasons mentioned above,
hard macros were chosen as the compiled circuit representation to obtain the maximum
savings in compilation time.
It should be mentioned that placing hard macros onto an FPGA and placing
macroblocks in an ASIC design are different problems. Automating the placement
of hard macros onto FPGA fabric is harder in that a hard macro is subject to several more
constraints than a macroblock in a conventional ASIC design. An FPGA does not
have a perfectly uniform fabric, and modern FPGAs have varying patterns of resources such
as block RAM and DSPs that can make placement challenging for a hard macro that contains
such components. Also, hard macros can contain routing, and the routing resources at a
candidate location must align exactly with those specified within the hard macro or it cannot be placed there.
These two constraints make the granularity at which large hard macros can be placed very
coarse and non-linear. Even though research on macro-based ASIC placement has been
performed [23], it cannot be directly applied in the case of hard macros on reconfigurable
logic.
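
The resource-pattern constraint can be illustrated with a deliberately simplified, one-dimensional model. In this sketch the fabric is a list of column types and a hard macro's footprint is the sequence of column types it occupies; the function name, the column layout and the reduction to one dimension are all assumptions for illustration, and routing-resource alignment is not modeled.

```python
def valid_anchor_columns(fabric_columns, footprint):
    """Return the leftmost column of every location where the footprint fits.

    fabric_columns: column types across the device, e.g. ["CLB", "BRAM", ...].
    footprint:      column types the hard macro needs, in order.
    A location is valid only if every column type matches exactly.
    """
    width = len(footprint)
    return [x for x in range(len(fabric_columns) - width + 1)
            if fabric_columns[x:x + width] == footprint]


# A hypothetical device with irregular BRAM and DSP columns.
FABRIC = ["CLB", "CLB", "BRAM", "CLB", "CLB", "DSP", "CLB", "CLB", "BRAM", "CLB"]
```

On this example fabric, a macro that needs a `["CLB", "BRAM", "CLB"]` footprint has only two legal anchor columns, while a logic-only `["CLB", "CLB"]` macro has three, which shows how macros containing block RAM or DSPs are placeable on a much coarser grid.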

4.1.2 Experiments Validating Hard Macro Potential
At the outset of this work, it was necessary to prove the feasibility of using hard macros
as a basic design element and their potential to accelerate FPGA compilation. In order to
accomplish this task, it was crucial to understand how well such a hard macro-based tool
flow is currently supported by the Xilinx tools. To determine this, a few experiments were
conducted. The first experiment aimed to find out how well hard macros were supported
by the Xilinx tools. The second experiment then tried to determine how much potential
speedup a hard macro-based design flow could offer.

Hard Macro Creation
The preliminary task of these experiments was to first build a set of hard macros
which could then be assembled into a design. Although the Xilinx tools do support hard
macros, our experiments have found that there are some significant limitations to using them
with the conventional flow. The first limitation is the hard macro creation process. Xilinx
documents a method for creating hard macros [24] using FPGA Editor. Although this is a
tedious manual process, FPGA Editor can be scripted to some extent [25] for creating hard
macros.

[Figure 4.1: Hard Macro Creation Flow. RTL sources (.vhd, .v) pass through the conventional Xilinx flow (XST, NGDBuild, MAP, PAR) to produce an .ncd file; XDL -ncd2xdl converts it to .xdl, HMG converts that into a hard macro .xdl, and XDL -xdl2ncd produces the .nmc file.]

To create a set of hard macros for use in our experiments, we developed a Hard
Macro Generator tool (HMG) and associated flow as shown in Figure 4.1. Specific hard
macro generation programs have been created before, as in [7] and [25]; however, at the
time, no tool for creating general hard macros from arbitrary RTL had been created.
All circuit manipulations in the experiments were performed in XDL using RapidSmith as
described in Chapter 3. The hard macro creation flow leverages the first four steps of the
conventional Xilinx design flow as shown in Figure 4.1. Designs were prepared by connecting
ports intended to become terminals on the hard macros to IOBs in the FPGA. In addition,
reasonably tight area constraints were placed on the design to make the hard macros more
compact and increase their place-ability (or increase the number of valid locations on which
they could reside).

Once a design was adequately prepared, it was implemented using the conventional
Xilinx tool flow which produced a completely placed and routed FPGA implementation with
only the circuitry bound for inclusion in the hard macro. After converting the designs into
XDL, the HMG tool would take the XDL as input and convert it into a stand-alone hard
macro. The steps taken to create a hard macro from the original XDL are listed below:
1. All IOBs were removed and hard macro port objects were created and attached to
corresponding XDL Pins.
2. Measures were taken to ensure that each net had only one hard macro port as more
would cause an error.
3. All GND and VCC sources (TIEOFFs and SLICEs) were removed as they are an illegal
construct in a hard macro. Separate ports were created in their place with a special
naming convention to indicate a GND or VCC input should be supplied.
4. All clock nets and circuitry were removed (our hard macros only used a single clock
domain) and in their place a clock port was added to the external port list of the hard
macro.
The result was an XDL representation of the hard macro which was then converted
to the Xilinx NMC format (the Xilinx hard macro counterpart format to NCD).
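
The four steps above can be sketched on a toy design model. This is a simplified illustration and not the HMG tool itself: the dictionary layout, the `GND_`/`VCC_` port naming, the `clk` port name and the set of stripped primitive types are assumptions, and GND or VCC sources implemented in SLICEs are not handled.

```python
# Primitive types stripped from the design (IOBs, tie-offs, clock buffers).
STRIPPED_TYPES = {"IOB", "TIEOFF", "BUFG"}


def to_hard_macro(instances, nets):
    """Turn a placed and routed toy design into a hard macro description.

    instances: {instance name: primitive type}
    nets:      {net name: {"type": "wire" | "gnd" | "vcc" | "clock",
                           "pins": [(instance name, pin name), ...]}}
    Returns (macro instances, macro nets, port names).
    """
    removed = {name for name, ptype in instances.items()
               if ptype in STRIPPED_TYPES}
    macro_instances = {name: ptype for name, ptype in instances.items()
                       if name not in removed}
    macro_nets, ports, has_clock = {}, [], False
    for name, net in nets.items():
        if net["type"] == "clock":
            has_clock = True                     # step 4: drop clock nets
            continue
        pins = [(inst, pin) for inst, pin in net["pins"] if inst not in removed]
        if net["type"] in ("gnd", "vcc"):
            # Step 3: replace the removed source with a specially named port.
            ports.append(net["type"].upper() + "_" + name)
        elif len(pins) < len(net["pins"]):
            # Steps 1 and 2: the net touched an IOB, so it gets exactly one port.
            ports.append(name)
        macro_nets[name] = pins
    if has_clock:
        ports.append("clk")                      # step 4: single clock port
    return macro_instances, macro_nets, ports
```

Creating at most one port per net inside the loop is what enforces the one-port-per-net rule in this sketch.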

Experiment #1: Find Out Support Level for Hard Macros in Xilinx Tools
With a small set of hard macros to use for hard macro design, experiment #1 set
out to determine how robust the Xilinx tools are when operating on hard macro-based
designs. To create complete design implementations for this experiment, structural VHDL
was written which instantiated the hard macros and specified their inter-connections. This
VHDL was then passed through the Xilinx tool flow starting with synthesis (xst) through
place and route (par). The compilation flow for experiment #1 is simply the conventional
Xilinx flow (xst, ngdbuild, map and par) as shown in Figure 4.2. Since the design was
an interconnected set of black boxes in the VHDL, no real synthesis was performed nor
was there any mapping or packing done. That is, although the hard macro designs were
processed by xst, ngdbuild and map, no real computation or translation was performed
as the hard macros already existed at the native netlist level. Therefore, the bulk of the
runtime was spent in PAR¹.

[Figure 4.2: Compilation Flow for Experiment #1 (Conventional Xilinx Flow). Hard macros instanced as black boxes in structural VHDL (.VHD) pass through XST (.NGC), NGDBuild (.NGD), MAP (.NCD) and PAR (.NCD).]

Results of Experiment #1: Mixed Support for Hard Macros
Several compilations of different hard macro-based designs were completed using the
flow shown in Figure 4.2 with varying results. The sample designs used in these tests included
a QPSK demodulator as well as various designs created by inter-connecting various open
source cores found on the web which are listed later in the section describing Experiment
#2. The results of these tests can be grouped into three categories:
1. For one collection of tests the design successfully placed and routed, but the resulting
tool chain time was significantly longer than an equivalent process starting from a
standard HDL-based design. As indicated in [24], it is known that using hard macros
actually causes Xilinx PAR to run more slowly.
2. For another portion of the tests, no valid placement for the hard macros in the design
could be found and the placement phase would fail. At times PAR would fail with
an error message and at other times placement would simply never terminate within a
reasonable amount of time.
3. Finally, for some tests, placement would complete successfully but the router would
fail with an error message that it could not route all the nets in the design.
These results suggest that, in spite of the intuitive advantages that a hard macro-based
design flow would seem to offer, Xilinx PAR does not work well with such a flow. The
results further suggest that placement was the problem: either it failed to place the designs,
or it seemed to create an un-routable placement.

¹ A final hard macro-based design flow would not use xst, ngdbuild, map, or par of the Xilinx flow
but would instead directly combine, place, and route the hard macros at the XDL level. The Xilinx tools were
employed for the experiments only to make it possible to quickly obtain results.

Experiment #2: Hard Macros in an Augmented Xilinx Flow
The second experiment was designed to bypass the Xilinx placement step to understand the router’s handling of hard macros. To do so, the original VHDL-based structural
design was processed as before but only up to mapping and packing. The result was an
unplaced NCD file (the overall flow for experiment #2 is shown in Figure 4.3). The hard
macros were then placed by hand using a custom created XDL macro placer tool. That is,
placement of each block was chosen manually and then a custom created XDL-based tool
was used to modify the un-placed XDL design file to reflect that placement. The resulting
placed design was then passed to the Xilinx router (PAR) for completion.
To make an accurate measurement of runtime improvement, a baseline (non-hard
macro) design was created for each test. The baseline tool runtime was determined by the
total runtime of the tools from the designer’s original RTL implementation (VHDL or Verilog
files) to a fully placed and routed design in NCD format. The baseline measurement omits
bitstream generation as this process cannot be accelerated due to the proprietary nature of
the bitstream (it consumes identical runtime for both the experiment and baseline circuits).

[Figure 4.3: Compilation Flow for Experiment #2 (Modified Xilinx Flow). Structural VHDL (.VHD) with hard macros instanced as black boxes passes through XST (.NGC), NGDBuild (.NGD) and MAP (.NCD); NCD2XDL produces an .XDL file, the custom XDL placer applies the manual placements (.TXT), XDL2NCD converts the placed .XDL back to .NCD, and PAR -p routes the design (.NCD).]

The command line parameters to the Xilinx tools were set to obtain the fastest runtime
(ignoring timing constraints, standard effort levels, and avoiding timing-based MAP).

[Figure 4.4: Block Diagram of Multiplier Tree Design. Fifteen multipliers, Mult0 through Mult14, arranged as a binary tree.]

The first design is a multiplier tree built from a 20 × 20 bit LUT-based multiplier with a
pipeline depth of 5. The multiplier tree has 15 identical instances of the multiplier
arranged as a binary tree, as shown in Figure 4.4. This design
was chosen to estimate the speedup that a hard macro-based flow would produce when the
synthesis work load is small (the multiplier was synthesized only once).
The second design is a collection of five different cores: CORDIC, AES decryptor,
Twofish (a symmetric key block cipher), FM Receiver, and a Hilbert Analytic Filter. All
cores were placed in the design and connected together to form a data-path with inputs
and outputs connected to external IO pins. Although this design has a smaller hardware
footprint than the multiplier tree, its heterogeneous nature implies that more time will be
spent during synthesis and will potentially lead to increased speedup for the hard macro
based flow.

Results of Experiment #2: Reductions in Runtime

Table 4.1: Baseline Runtimes for each Test Design
Design          XST     NGDBuild   MAP     PAR     Total Runtime
Mult-Tree       46.7s   9.0s       17.7s   64.2s   137.5s
Heterogeneous   80.2s   4.9s       10.4s   35.5s   131.0s

The baseline results for each design are shown in Table 4.1. All runtimes were obtained
on a desktop workstation with an Intel Core 2 Duo 3.0GHz (E6850) processor with 4GB of
RAM, running Windows XP Pro SP3 and Xilinx ISE 11.4. All designs targeted a Virtex 4
SX35 (xc4vsx35ff668-10). Comparisons were made primarily on runtime; however, clock rate
and hardware utilization are included as well to demonstrate the trade-offs between design
runtime and QOR.

Table 4.2: Performance of each Test Design Using Hard Macros
Design          Custom   XDL to   PAR -p         Total     Speedup
                Placer   NCD      (route only)   Runtime   Over Baseline
Mult-Tree       4.3s     14.3s    25.7s          44.3s     3.1×
Heterogeneous   4.3s     12.1s    25.5s          41.9s     3.1×

As can be seen in Table 4.2, the hard macro designs were both built 3.1× faster
than their corresponding baseline builds. These results show promise for a design flow created
entirely from hard macros, which would enable significantly faster build times. What is also
notable about these results is that they are two very different designs but both achieve the
same speedup by using hard macros. This could be caused by a floor on the performance of
PAR (actually, just the router since in these experiments we did our own placement). That
is, both hard macro implementations took almost an identical amount of time to run PAR.
It is possible that for designs of a certain size, PAR may not be able to execute much faster
than about 25 seconds. If this were the case, it may be beneficial to create a faster router
for these kinds of situations.
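
The speedup figures can be re-derived from the reported runtimes. The short script below is only a worked check of the arithmetic, using the values from Tables 4.1 and 4.2; the variable and function names are chosen for the example.

```python
# Total baseline runtimes in seconds, from Table 4.1.
baseline_total = {"Mult-Tree": 137.5, "Heterogeneous": 131.0}

# Per-step hard macro flow runtimes in seconds, from Table 4.2.
hard_macro_steps = {
    "Mult-Tree": {"custom placer": 4.3, "XDL to NCD": 14.3, "PAR -p": 25.7},
    "Heterogeneous": {"custom placer": 4.3, "XDL to NCD": 12.1, "PAR -p": 25.5},
}


def speedup(design):
    """Return (hard macro total runtime, speedup over the baseline)."""
    total = sum(hard_macro_steps[design].values())
    return total, baseline_total[design] / total


for design in baseline_total:
    total, ratio = speedup(design)
    print(f"{design}: {total:.1f}s total, {ratio:.1f}x speedup")
```

In both cases the route-only PAR step is the largest single contributor, at roughly 58% and 61% of the hard macro flow's total runtime, which is consistent with the observation that the router sets a floor on the achievable speedup.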

4.1.3 Hard Macros and Quality of Results
The QOR of each circuit implementation (as seen in Table 4.3) is somewhat mixed.
The multiplier tree baseline and hard macro implementation have surprisingly similar hardware utilization and maximum clock rates. This is an excellent result as it shows (in some
cases) that using hard macros will not be much worse than using the Xilinx tools (when
optimized for runtime). However, the heterogeneous design baseline produced a higher clock
rate and significantly smaller hardware footprint. This is probably due to the fact that the
synthesizer was able to optimize some of the logic across the different block boundaries, leading to lower hardware utilization and a higher clock rate. No such optimization was possible
in the hard macro implementation. It must be noted that the Xilinx tools were configured
for the fastest² runtime in all of the experiments and it is likely that if configured for best
quality of result, clock rates of the baselines would be higher.

² This meant running the Xilinx tools without a timing constraint (or providing a very loose one when
necessary), setting all effort level parameters to their lowest setting (-std) and disabling timing-based
placement where applicable.

4.1.4 Hard Macros and Placement Time
Note that the runtimes measured in the augmented flow included the time it took
to route the hard macros together, but not the amount of time it took to place them.
Placement times were assumed to be negligible for these experiments based on the following
analysis. Placement time for the hard macro experiments can be estimated by assuming
that placement runtime behavior is approximately linear in the number of design elements
to be placed. The baseline designs consisted of 4597 and 1757 slices and consumed 35.1s
and 25.5s of placement time, respectively (placement time was calculated by running par
with the -r option, which only runs the placer). The hard macro designs consisted of 15
and 5 blocks. Extrapolating using the linear-behavior assumption, placement times can
be estimated to be 0.11s and 0.07s, respectively, for the hard macro experiments. Adding
these values back into the measured runtimes does not change the result from that shown in
Table 4.2. This substantial reduction occurs because the hard macro placer only places the
top-level hard macros—placement time for their constituent cells is eliminated as they were
previously placed and routed. Note that the presumption of linear behavior is conservative.
If the placer’s behavior is super-linear, the hard macro approach will perform better than
this analysis indicates, relative to the baseline. It will be shown later in Section 4.3.3 with
the implementation of a hard macro placer that this approximation proved accurate.
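
The linear extrapolation can be written out explicitly. The snippet below is a worked check that reuses the slice counts, block counts and baseline placement times quoted in the paragraph above; the function name is chosen for the example.

```python
def estimate_placement_time(baseline_seconds, baseline_elements, macro_elements):
    """Scale a measured placement time linearly with the number of elements."""
    return baseline_seconds * macro_elements / baseline_elements


# Multiplier tree: 35.1s to place 4597 slices, 15 hard macros to place.
mult_tree = estimate_placement_time(35.1, 4597, 15)

# Heterogeneous design: 25.5s to place 1757 slices, 5 hard macros to place.
heterogeneous = estimate_placement_time(25.5, 1757, 5)

print(f"{mult_tree:.2f}s, {heterogeneous:.2f}s")
```

Both estimates are on the order of a tenth of a second, which is why adding them to the totals in Table 4.2 leaves the reported 3.1× speedup unchanged.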
Table 4.3: Comparison of Baseline vs. Hard Macro Designs
Design                     Slices   Clock Rate
Mult-Tree (baseline)       4592     148 MHz
Mult-Tree                  4830     150 MHz
Heterogeneous (baseline)   1746     82 MHz
Heterogeneous              2741     61 MHz

4.1.5 Conclusions on Preliminary Hard Macro Experiments
With concrete evidence that hard macros have potential to accelerate FPGA compilation, the implementation of a complete, automated hard macro design flow ensued. This
hard macro flow, called HMFlow for short, operates completely on hard macros and designs
in XDL. The remainder of this chapter is devoted to HMFlow’s components and performance.

4.2 HMFlow 2010: A Rapid Prototyping Compilation Flow
HMFlow 2010 is the vehicle by which the effectiveness of hard macros is demonstrated
for accelerating FPGA compilation. The first implementation of HMFlow was created in 2010
for accelerating FPGA compilation. The first implementation of HMFlow was created in 2010
and is referred to as HMFlow 2010 to differentiate it from an improved version of the flow
developed in 2011 (aptly named HMFlow 2011). Throughout the remainder of this chapter,
any reference to HMFlow is referring to HMFlow 2010 (unless otherwise specified).
HMFlow begins by importing designs created using Xilinx System Generator as seen
in Figure 4.5. Design data are stored in Simulink Model Files (.MDL) that are parsed by
the design parser and mapper tool. The mapper also identifies each block in a design and
its corresponding hard macro. If the hard macro does not exist in the hard macro cache,
the mapper invokes the hard macro generator to create one for the mapper and to store it
in the cache.

[Figure 4.5: Block Diagram of HMFlow. A System Generator input design (.mdl) passes through the design parser and mapper, design stitcher, hard macro placer and design router to produce a placed and routed implementation (.xdl); the hard macro generator builds hard macros from the hard macro sources and stores them in the hard macro cache.]

Once all of the hard macros have been created or retrieved from the cache, they
are given to the design stitcher, which will “stitch” all of the hard macros together. The
design stitcher also inserts appropriate I/O buffers (IOBs) and clock generation circuitry.
The design is then passed to the hard macro placer and router to be exported as a final
placed and routed implementation in XDL.

The remainder of this section describes in detail the design entry techniques, algorithms and steps that have been constructed to realize and implement HMFlow.

4.2.1 Xilinx System Generator
Although HMFlow can potentially be used with any design entry tool, System Generator was chosen for this work because it provided several benefits out-of-the-box. System
Generator is well-suited to hard macros because it is a block-based design tool. HMFlow
automatically converts each of the System Generator blocks into a hard macro as shown
in Figure 4.6. This alleviates the need for the designer to specify where hard macro
boundaries occur and for the tools to discover hard macro
boundaries dynamically. Also, since HMFlow uses System Generator designs as input, existing System
Generator simulation tools can be leveraged for HMFlow-bound designs.
Another advantage of using System Generator is its design compilation path. HMFlow
provides a rapid prototyping path so designs can be quickly tested on an FPGA. However,
once the design is functionally correct, a high quality implementation is needed for design
deployment. Since HMFlow accepts designs that are System Generator compatible, a high
QOR is easily obtained (albeit with much longer runtime) by simply using the conventional
Xilinx System Generator tools.

[Figure 4.6: Screenshot of an Example System Generator Design. Each block is converted to a hard macro.]

4.2.2 Simulink Design Parsing
The first step of HMFlow is to parse the System Generator design stored in the
Simulink Model file (MDL). HMFlow has a custom-built JavaCC-based parser to perform
this task and populates a custom-made Simulink-compatible data structure.
One drawback of using System Generator is that information about design blocks,
such as bit widths of ports, is not stored in the MDL file. To obtain this missing data, the
parser has to recalculate these widths on-the-fly during parsing and it must be done on a
block-by-block basis. Despite this technical challenge, HMFlow is able to support over 75%
of the most commonly used System Generator blocks in HMFlow designs.

4.2.3 Hard Macro Cache and Mapping

The mapper receives the design in the form of a populated data structure from the
parser and iterates over all of the blocks in the data structure to identify the System
Generator blocks. For each block it finds, an MD5 hash is generated from its type and parameters
to uniquely identify the block and its corresponding hard macro. The mapper then uses the
hash to query the hard macro cache to find out if the hard macro has already been built. If
the hard macro has already been built, it is loaded from the cache and given to the mapper.
If the hard macro is not in the cache, the mapper invokes the hard macro generator.
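The cache lookup described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the names `block_key` and `HardMacroCache`, the in-memory dictionary, and the way type and parameters are serialized before hashing are hypothetical and are not taken from the HMFlow sources (which are written in Java).

```python
import hashlib


def block_key(block_type, params):
    """MD5 digest of a block's type and parameters (serialization is an assumption)."""
    # Sort the parameter names so that dictionary ordering does not change the hash.
    canonical = block_type + "|" + "|".join(f"{k}={params[k]}" for k in sorted(params))
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


class HardMacroCache:
    """Hypothetical in-memory cache keyed by the MD5 digest of a block."""

    def __init__(self, generator):
        self._store = {}
        self._generator = generator  # called only on a cache miss
        self.builds = 0

    def fetch(self, block_type, params):
        key = block_key(block_type, params)
        if key not in self._store:
            # Cache miss: invoke the (slow) hard macro generator once.
            self._store[key] = self._generator(block_type, params)
            self.builds += 1
        return self._store[key]


# Stand-in generator; a real one would run the vendor tools.
cache = HardMacroCache(lambda block_type, params: (block_type, dict(params)))
first = cache.fetch("Delay", {"latency": 3, "width": 16})
second = cache.fetch("Delay", {"width": 16, "latency": 3})
```

In this sketch the second `fetch` is a cache hit because both calls describe the same block type and parameter values, so the generator runs only once.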

4.2.4 Hard Macro Creation

Xilinx does not provide any method for automated hard macro creation of arbitrary
designs. The only method provided is a tedious, manual process using FPGA Editor.
Therefore, an automated hard macro generator was created using RapidSmith. This hard macro
generator is an improved version of that used in preliminary experiments as mentioned in
Section 4.1.2.

[Diagram: a System Generator block to become a hard macro, with port hard macros (black
boxes) attached; XST (synthesis) produces the .NGC netlist and the .SYR synthesis report;
the UCF area constraint generator produces the .UCF constraint file; the conventional
Xilinx tools (NGDBuild, MAP, PAR) produce the .NCD placed and routed design.]
Figure 4.7: Front-end Flow for an HMFlow Hard Macro

To create a hard macro in HMFlow, a design block (NGC netlist from System Generator) is implemented with the conventional Xilinx tools. Designs are prepared by inserting
special port-identifying signal pass-through hard macros which are attached to each of the
inputs and outputs of a design as shown at the top of Figure 4.7. These special port hard
macros remain untouched as the design passes through the Xilinx tools and specify the ports
to create in the final hard macro.
After synthesis of the hard macro-bound design, resource utilization counts are extracted
from the synthesis report to generate slice, DSP48 and BRAM area placement constraints
as shown in Figure 4.7. Constraint generation aims to generate the hard macros
in a fairly regular region of the chip and to provide at least 50% more resources inside of the
constraint than what was calculated and reported by the synthesis tool, XST. To illustrate
the need for area constraints, consider the implemented filter design in Figure 4.8a. The
conventional vendor tools cause a design to be sparsely spread out over the entire chip, making
its reuse nearly impossible. However, with a proper area constraint applied during the
compilation process, the implementation can be nicely compacted into a rectangular form
as shown in Figure 4.8b.

Figure 4.8: (a) A FIR filter Design Compiled with the Xilinx Tools (b) The Same Filter
Design Compiled with an Area Constraint to Create a Hard Macro

Listing 4.1: UCF Area Constraint Example
# All hard macro instances are declared as part of area group block
INST "sysgen_fifo_x0/fifo_13312/percent_full(4)1" AREA_GROUP = "block";
INST "sysgen_fifo_x0/fifo_13312/percent_full(3)1" AREA_GROUP = "block";
INST "sysgen_fifo_x0/fifo_13312/percent_full(2)1" AREA_GROUP = "block";
INST "sysgen_fifo_x0/fifo_13312/percent_full(1)1" AREA_GROUP = "block";
INST "sysgen_fifo_x0/fifo_13312/percent_full(0)1" AREA_GROUP = "block";
INST "sysgen_fifo_x0/fifo_13312/comp0.core_instance0" AREA_GROUP = "block";

# Area ranges are given in slices, dsp48s and ramb36s
AREA_GROUP "block" RANGE=SLICE_X32Y108:SLICE_X39Y115;
AREA_GROUP "block" RANGE=DSP48_X0Y0:DSP48_X0Y0;
AREA_GROUP "block" RANGE=RAMB36_X1Y17:RAMB36_X3Y27;

These constraints are applied through a typical Xilinx user constraint file (UCF)
following typical Xilinx conventions. An example of the automatically generated UCF for
hard macro creation can be seen in Listing 4.1. The constraints also ensure that hard macros
are compact and maximize the number of valid locations on which they can be placed on
the FPGA. It should be noted, however, that area constraints can only restrict placement
of FPGA primitive instances; they only suggest (rather than require) that the router keep
routes within the area. When area constraints are too tight, it is common for routing to
“spill out” of the area constraint, and the amount of spill-out varied with the tightness
of the constraint. When the area constraint contained 50% of area overhead or more, it was
rare to have any routing spill over in the designs tested. However, when the area constraint
overhead was in the range of 0-25%, it was common to have routes leave the area constraint
by as much as 2 to 4 tiles before returning inside the area constraint. Routing spill-out
is therefore avoided where possible, as it can increase the overall size of the hard macro
and ultimately reduce its place-ability.
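As a rough illustration of how a slice area constraint with 50% overhead might be derived from a synthesis slice count, consider the following Python sketch. The function names, the near-square rectangle heuristic and the fixed origin are assumptions made for this example; the actual HMFlow constraint generator also handles DSP48 and BRAM ranges and is not reproduced here.

```python
import math


def area_range(prefix, count, origin=(0, 0), overhead=0.5):
    """Return a UCF RANGE string covering `count` sites plus the given overhead.

    The near-square rectangle shape is an assumption made for this sketch.
    """
    if count == 0:
        return None
    needed = math.ceil(count * (1 + overhead))  # e.g. 40 slices -> 60 sites
    cols = math.ceil(math.sqrt(needed))         # aim for a compact, regular region
    rows = math.ceil(needed / cols)
    x0, y0 = origin
    return f"{prefix}_X{x0}Y{y0}:{prefix}_X{x0 + cols - 1}Y{y0 + rows - 1}"


def make_ucf(instances, slices, group="block", origin=(0, 0)):
    """Emit INST and AREA_GROUP lines in the style of Listing 4.1."""
    lines = [f'INST "{name}" AREA_GROUP = "{group}";' for name in instances]
    slice_range = area_range("SLICE", slices, origin)
    if slice_range is not None:
        lines.append(f'AREA_GROUP "{group}" RANGE={slice_range};')
    return "\n".join(lines)


print(make_ucf(["u0/reg"], 40, origin=(32, 108)))
```

With these assumptions, a block reported at 40 slices is given an 8 by 8 slice rectangle (64 sites), which satisfies the 50% overhead target while staying compact.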
Once the placed and routed design (NCD file) is generated with the Xilinx tools, it is
converted to XDL for manipulation by the hard macro generator in the RapidSmith framework. The hard macro generator then performs the same transformations (see Section 4.1.2)
on the design to convert it to a Xilinx compatible hard macro. The hard macro is then
stored in the hard macro cache using a similar compact format to that used in RapidSmith
device files as described in Section 3.4.

4.2.5 XDL Design Stitcher

After the mapper has gathered all of the hard macros corresponding to a design, the
design stitcher must perform the “stitching” of all the hard macros together into a single
implementation. Creating a complete implementation involves the steps listed below:
1. Create module instances (instances of the hard macros) in the design. In some cases,
hard macros can describe logic that is simply a routing connection and a special block
is instantiated in those situations.
2. Create IOB instances for external input/output.
3. Create clock circuitry (DCMs, BUFGs, etc.).
4. Create nets and populate connections to mirror original design.
5. Clean up un-needed nets and pins.
The first step of the stitcher is to create instantiations of all the hard macros in the
design. This is easily done with just a few RapidSmith API calls. However, some of the
System Generator blocks do not always generate a hard macro because, by their own nature,
they do not contain any memory or logic. These kinds of blocks are called routing blocks
as they only contain routing connection information. For example, System Generator has a
concatenate block which will convert two signals into a single bus signal. This block only
aids conceptually in the design process but does not produce additional hardware. The
stitcher has facilities to create an empty hard macro object that would contain all of the
wire mappings that occur later when the nets are instantiated. The stitcher automatically
identifies these routing blocks and processes them accordingly.
Another special case of a design block is the System Generator gateway block, used
to describe signals leaving and entering the FPGA. From these blocks, IOBs are inferred and
created in XDL by a few routines found in the stitcher. IOBs are device family specific,
and predetermined parameters and attributes are set by default for each family. However,
most of the parameters that are user customizable in the gateway block can also change how
IOBs are created.

HMFlow only supports single clock domain designs and thus, clock circuitry is quite
predictable based on System Generator designs. The stitcher instantiates the appropriate
digital clock manager (DCM) based on the family architecture specified in the design and
also a BUFG primitive to drive the clock network. In the Virtex 4 series, an additional step
is required to avoid the NBTI effect [26] which requires the connection of all unused DCMs
to an on-chip oscillator, the PMV primitive. The stitcher handles all of the instantiations
and adds the new primitive instances to the design.
Once all of the instances and module instances exist in the design, the nets that
interconnect them must be created and connected to mirror the connections in the original
design. In order to accomplish this process, each hard macro contains a special mapping
object that tells the stitcher how to connect the outputs of each bit of each signal to other
blocks in the design. This mechanism is especially handy for routing blocks or any blocks
that contain signals that pass through without interacting with any logic. Nets and pins are
created and connected to hard macro ports based on the specified mappings. This process
completes once all hard macro outputs have been connected.
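One way to picture how per-bit mappings let nets pass through routing blocks is the following Python sketch. The data layout (pins as `(block, port, bit)` tuples, a `connections` dictionary and a `passthrough` dictionary) is hypothetical and chosen only for illustration; it is not the RapidSmith or HMFlow data model, and it assumes that routing blocks never form a loop among themselves.

```python
def resolve_sinks(pin, connections, passthrough):
    """Follow a source pin to its real sinks, looking through routing blocks."""
    sinks = []
    for sink in connections.get(pin, []):
        if sink in passthrough:
            # The sink is an input of a routing block: continue from the
            # routing block output that this input bit maps to.
            sinks.extend(resolve_sinks(passthrough[sink], connections, passthrough))
        else:
            sinks.append(sink)
    return sinks


def build_nets(connections, passthrough):
    """Create one net per real source; routing block outputs are not sources."""
    routing_outputs = set(passthrough.values())
    return {
        src: resolve_sinks(src, connections, passthrough)
        for src in connections
        if src not in routing_outputs
    }


# An adder output bit feeds a concatenate (routing) block, which feeds a register.
connections = {
    ("adder", "sum", 0): [("concat", "in0", 0)],
    ("concat", "out", 0): [("reg", "d", 0)],
}
passthrough = {("concat", "in0", 0): ("concat", "out", 0)}
nets = build_nets(connections, passthrough)
```

In this example the concatenate block contributes no hardware: the resulting net connects the adder output bit directly to the register input bit.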
The final stage iterates through the design and removes any stray pins or dangling
nets that are no longer connected due to the net merging process. At this point, the design
is completely mapped to the FPGA and could be output to XDL. However, to avoid slow
hard disk access times, the design is passed directly to the hard macro placer.

4.2.6 Hard Macro Placer

Since HMFlow uses hard macros instead of FPGA primitives as the objects that
are placed onto the FPGA fabric, there is a significant reduction in problem size. Even
though the hard macros produced from System Generator design blocks are relatively
fine-grained, they still result in a 10-20× reduction of objects requiring placement. However, the
algorithm used to perform placement also affects how fast placement will occur. Throughout
the development of HMFlow, three different placers were created: a recursive bi-partitioning
placer, a random placer and finally a greedy heuristic placer.
The most commonly used FPGA placement algorithm is simulated annealing [27].
This is due to its ability to obtain high quality implementations (which are often
characterized by minimized wire length) at the expense of longer runtimes. In most cases, this is
quite desirable because of the need to obtain a high QOR. However, with rapid compilation
as a goal for HMFlow, different algorithms were tested for their ability to perform placement
quickly.

Recursive Bi-partitioning Placer
Our first attempt at a fast placer followed the approach of Maidee et al. [28], which
used a recursive bi-partitioning algorithm. This involves partitioning the hard macros of a
design into two separate groups such that the number of connections between the two groups
is minimized. The two groups were created and optimized using the heuristic developed by
Kernighan and Lin [29]. Recursively dividing each subgroup in this manner is intended to
reduce the total wire length of the nets in the design. Our attempts using this
implementation did show significant acceleration of the placement process. However, most of the
resulting implementations were of low quality, with clock rates averaging approximately 67
MHz on Virtex 4 parts. This result may be due to our choice of partitioning parameters or
the lack of a final simulated annealing step as used in [28].

Random Placer
Through various experiments, random placements of several hard macro designs were
attempted. The results were surprising in that most of the designs were still routable and
produced similar implementation quality to that of the recursive bi-partitioning placer. The
random placer was extremely fast, placing most designs in a fraction of a second. This was
due to the fact that each hard macro was placed only once rather than being swapped several
times as in the partitioning placer.

A Greedy Heuristic Placer
The conclusion drawn from the surprising performance of the random placer is that
rapid placement can be achieved by placing each hard macro once, but doing so in a manner
that attempts to optimize the placement. This would improve on the quality of
the random placement while retaining the fast execution time of the random placer.
A greedy heuristic was developed to take advantage of the random placer findings.
The heuristic was simple in that it depended mostly on the connectivity between hard macros.
The heuristic dictates that hard macros are placed one-by-one starting with those containing
DSP48 or BRAM primitives as those are the most scarce primitives on the FPGA fabric.
Following the DSP/BRAM hard macros, placement priority is given to the largest (greatest
number of occupied tiles) hard macros first, as they may be the next most difficult to place.
When a hard macro is placed, the 1st, 2nd, or 3rd most highly connected hard macro
to the current hard macro is queried to see if it has already been placed. If so, the current
hard macro is placed as close as possible to its highly connected companion. If none of
the three hard macros are already placed, the hard macro is placed arbitrarily into one of
nine bins designated on the FPGA fabric. This process proceeds until all hard macros are
placed. This technique has proven successful on all benchmark designs used when resource
utilization is 50% or less of the FPGA. Results comparing the three placement algorithms
are deferred to Section 4.3 where the benchmark designs used are described in detail.
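The greedy heuristic described above can be sketched as follows. This is a simplified model and not the HMFlow placer: every hard macro is assumed to occupy a single site on a uniform grid, "as close as possible" is taken to mean smallest Manhattan distance, and the nine bins are represented by the centers of a 3 by 3 partition of the grid. The names `Macro`, `nearest_free` and `greedy_place` are hypothetical.

```python
import random
from dataclasses import dataclass, field


@dataclass
class Macro:
    name: str
    tiles: int                                  # footprint, used only for ordering
    scarce: bool = False                        # contains DSP48 or BRAM primitives
    links: dict = field(default_factory=dict)   # neighbor name -> connection count


def nearest_free(free, target):
    """Free site with the smallest Manhattan distance to target (ties by coordinate)."""
    return min(free, key=lambda s: (abs(s[0] - target[0]) + abs(s[1] - target[1]), s))


def greedy_place(macros, width, height, seed=0):
    rng = random.Random(seed)
    free = {(x, y) for x in range(width) for y in range(height)}
    # Centers of nine bins laid out as a 3 by 3 partition of the grid.
    bins = [((2 * i + 1) * width // 6, (2 * j + 1) * height // 6)
            for i in range(3) for j in range(3)]
    placed = {}
    # DSP/BRAM macros first, then the largest macros; each macro is placed once.
    for m in sorted(macros, key=lambda m: (not m.scarce, -m.tiles)):
        top3 = sorted(m.links, key=lambda n: -m.links[n])[:3]
        anchor = next((placed[n] for n in top3 if n in placed), None)
        if anchor is None:
            anchor = rng.choice(bins)  # no placed companion: pick an arbitrary bin
        site = nearest_free(free, anchor)
        free.remove(site)
        placed[m.name] = site
    return placed


macros = [
    Macro("delay", 1, links={"fir": 2, "mult": 1}),
    Macro("fir", 6, links={"mult": 8, "delay": 2}),
    Macro("mult", 4, scarce=True, links={"fir": 8, "delay": 1}),
]
placement = greedy_place(macros, 9, 9)
```

Because each macro is visited exactly once, the runtime of this sketch grows with the number of macros times the cost of the nearest-site search, with no iterative swapping.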

4.2.7 Detailed Design Router

The Xilinx router, par, has an option (-p) to only perform the routing of a design.
However, if a design is partially routed, those routes are not guaranteed to remain intact.
This is a problem if HMFlow were to output a placed hard macro design in
XDL that is then converted to NCD for par to route, because the hard macros are flattened
in the XDL-to-NCD conversion process. This would negate any savings provided by hard
macros which contained routed nets. To leverage pre-routed routes from hard macros, a
custom, full design router was created using RapidSmith for HMFlow.
Traditionally, the most popular FPGA router algorithm has been PathFinder [30]
because of its ability to negotiate routing conflicts in an effective way. However, PathFinder
requires several iterations over the nets of a design in order to complete successfully. It is
this iterative nature that optimizes circuit speed at the expense of longer runtime. For this
reason, a different algorithmic approach was taken for HMFlow.
A simple FPGA routing technique is that of a maze router, which is the principal
algorithm used in this work. A maze router is fast because it only makes a single pass through
all the nets in a design. However, unlike PathFinder, FPGA routing resources are assigned
on a first-come, first-served basis, which can cause significant routing conflicts in certain
scenarios.
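The single-pass, first-come, first-served behavior of a maze router can be illustrated with a small Python sketch. It uses a breadth-first (Lee-style) search on a plain grid of routing nodes, which is a deliberate simplification: a real FPGA routing graph consists of wires and programmable interconnect points, and the function names here are hypothetical rather than part of RapidSmith.

```python
from collections import deque


def route_net(width, height, used, src, dst):
    """Breadth-first maze search from src to dst that avoids already used nodes."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:  # walk back to the source
                path.append(node)
                node = prev[node]
            return path[::-1]
        x, y = node
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            if (0 <= nxt[0] < width and 0 <= nxt[1] < height
                    and nxt not in used and nxt not in prev):
                prev[nxt] = node
                queue.append(nxt)
    return None  # no path: resources were claimed by earlier nets


def route_all(width, height, nets):
    """Single pass over the nets; resources go to whichever net asks first."""
    used, routes = set(), {}
    for name, (src, dst) in nets.items():
        path = route_net(width, height, used, src, dst)
        routes[name] = path
        if path is not None:
            used.update(path)
    return routes


ok = route_all(4, 2, {"a": ((0, 0), (3, 0)), "b": ((0, 1), (3, 1))})
conflict = route_all(3, 2, {"a": ((0, 0), (2, 0)), "b": ((1, 1), (1, 0))})
```

In the second call, net `a` claims the node that net `b` needs as its sink, so `b` fails to route. This is the kind of conflict that an iterative negotiation scheme resolves over several passes and that a single-pass router must avoid by other means.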
To overcome these potential conflicts, the router in HMFlow uses a congestion
avoidance technique. When a design is first loaded in the router, it is analyzed for potential
conflicts (which are mostly architecture-specific) by looking for hot spot routing switch boxes
or certain input pins that have a unique input path. In these cases, the routing resources
are reserved for the net which will require them most to complete the route. This technique
has been successful in all designs tested when FPGA slice utilization has been 50% or less.
Not all designs have successfully routed when utilization exceeds half of the FPGA. Despite
this limitation, the router used in HMFlow has proven to be 3-10× faster than the fastest
configuration of the Xilinx router (par -p) execution³ and allows for preservation of
internal hard macro routes. This router is also capable of routing arbitrary designs and can be
used for general purposes.

³ Because of the limitation by which par can be timed for execution, the performance
comparison represents the time it takes to load a design from disk, route the design and then
save the routed design to disk for both par and the router used in HMFlow. It is expected that
if file access times were removed, the HMFlow router would be faster than this estimate.

4.3 Results

This section describes the efforts to ensure adequate benchmark designs that properly
illustrate HMFlow’s capability to scale to large designs and large commercial FPGAs. Using
these benchmarks, comparison data of the three placement algorithms developed for HMFlow
is put forth. Finally, comparison data of the benchmarks implemented with HMFlow and
conventional Xilinx tools is presented and analyzed.
All data presented in this section was measured on an HP workstation running Windows
XP SP3 32-bit, with an Intel Core 2 Duo (E6850) 3.0 GHz processor and 4 GB RAM.
All FPGAs tested are from the Xilinx Virtex 4 series with a speed grade of -10, and Xilinx
ISE ver. 12.3 was used for all experiments. RapidSmith and HMFlow Java implementations
used the Oracle JVM ver. 1.6.0_21.

4.3.1 Benchmark Designs

In order to adequately demonstrate the effectiveness of hard macros on FPGA
compilation time and their potential for design size scalability, a wide variety of design types and
sizes were compiled and used. These designs, outlined in Table 4.4, contain a variety of
algorithms, processing cores and data paths. Some of the smaller designs include circuits such
as a PicoBlaze, FIR filters, polyphase filters, state machines and 1024-point FFTs/IFFTs.
Most of the larger blocks were taken from a very large experimental telemetry receiver design
[31] which required three large FPGAs for its implementation. The telemetry receiver design
was an excellent source of benchmark circuits on account of its large size and because the
design was originally implemented and optimized in System Generator.
As shown in Table 4.4, three different Virtex 4 devices were targeted: the SX35,
SX55 and LX200, which is the largest part in the series. The table also shows that the 13
benchmark designs range in size from 150 slices to over 23,000 slices. BRAM and DSP48
primitive counts are representative of the typical circuits used and are present to illustrate
that HMFlow and hard macros can be used with heterogeneous types of primitives without
issue. The final column of Table 4.4 shows the number of unique hard macros the design
contains and also the total number of instances of hard macros in the design. The latter
number illustrates the number of objects the placer must place whereas the former illustrates
the number of hard macros that exist in the cache and must be created. In most cases, hard
macros are reused on average between 2 and 10 times, illustrating another reuse benefit of
hard macros.

Table 4.4: Benchmark Design Characteristics
Benchmark Name    V4 Part Name   Slices Used   BRAMs Used   DSP48s Used   HMs* / HMIs**
pd control        SX35           150           1            0             12/21
polyphaseFilter   SX35           680           8            4             30/79
aliasingDDC       SX35           806           1            3             25/78
dualDivider       SX35           1832          0            6             39/542
computeMetric     SX35           2551          56           40            64/332
fft1024           SX35           2553          8            12            48/313
filtersAndFFT     SX35           5203          25           31            92/588
frequencyEst      SX55           6988          31           72            249/757
dualFilter        SX55           11173         33           26            93/901
trellisDecoder    LX200          16973         61           53            196/1328
filterFFTCM       LX200          18883         81           12            149/920
multibandCorr     LX200          19732         52           23            90/1472
signalEst         LX200          23841         126          47            390/1448
*HMs = Unique hard macros in the design
**HMIs = Total hard macro instances in the design
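The reuse figure quoted for Table 4.4 can be checked by dividing the number of hard macro instances by the number of unique hard macros for each benchmark. The short Python sketch below does this using the values from the final column of the table; the variable names are chosen for this example only.

```python
# (unique hard macros, hard macro instances) from the last column of Table 4.4
benchmarks = {
    "pd control": (12, 21),
    "polyphaseFilter": (30, 79),
    "aliasingDDC": (25, 78),
    "dualDivider": (39, 542),
    "computeMetric": (64, 332),
    "fft1024": (48, 313),
    "filtersAndFFT": (92, 588),
    "frequencyEst": (249, 757),
    "dualFilter": (93, 901),
    "trellisDecoder": (196, 1328),
    "filterFFTCM": (149, 920),
    "multibandCorr": (90, 1472),
    "signalEst": (390, 1448),
}

# Average number of times each unique hard macro is instantiated.
reuse = {name: hmis / hms for name, (hms, hmis) in benchmarks.items()}
in_range = [name for name, ratio in reuse.items() if 2 <= ratio <= 10]

for name, ratio in reuse.items():
    print(f"{name:16s} {ratio:5.2f}")
print(f"{len(in_range)} of {len(reuse)} designs reuse each macro 2 to 10 times on average")
```

By this calculation, 10 of the 13 benchmarks fall in the 2 to 10 range; pd control falls slightly below it, while dualDivider and multibandCorr exceed it.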

Hard Macro Granularity
One of the determining factors that measure how well hard macros can accelerate
FPGA compilation is their size or granularity and also the percentage of routes that are
preserved inside of hard macros. The hard macros used to demonstrate HMFlow 2010 are
of a very small granularity, having an average footprint of 1 to 6 tiles on the FPGA fabric.
A histogram of the hard macros used in all of the benchmarks is shown in Figure 4.9.

Figure 4.9: A Histogram of Hard Macro Sizes of All Hard Macros in the Benchmarks
[Axes: Hard Macro Size (in Tiles), 0 to 400, versus Hard Macro Count, 0 to 3000]

Figure 4.10: Graph Showing Percentage of Routed Connections in Each Benchmark Before
the Design is Sent to the Router
[Vertical axis: Percentage of Routed Connections After Placement (before Routing), 0% to 35%]

With smaller hard macros, it is less likely that they contain a significant percentage
of routed connections. Figure 4.10 shows the percentage of connections (source to sink pairs)
in the benchmarks that are already routed due to routes being included in the hard macros.
Note that the percentages do not include connections such as clock, ground or power as these
cannot be included in hard macros and are relatively fast to route. Since the granularity
of the hard macros in the benchmarks is small, no one design has more than 25% of the
connections already routed. As will be seen later in this section, this causes extra work for
the router.

Hard Macro Compilation Time
This work investigates how fast FPGA compilation can be made to run if all hard
macros are readily available at compile time. This is the case with HMFlow when certain
kinds of design modifications are made such as re-wiring connections in between hard macros,
changing constant values or having compiled the user set of hard macros beforehand. However, it is obvious that a designer is likely to change the design such that one or a few hard
macros need to be compiled from one execution of HMFlow to the next and would thus make
hard macro compilation time part of the total compilation time of the FPGA design.
As this work leveraged the Xilinx flow to compile the hard macros, it is far from an
efficient realization of how fast hard macro compilation could potentially be made to run.
The average compile time for a fine-grained hard macro in HMFlow is about 2 minutes. A
list of some hard macro compilation times is given in Table 4.5. These hard macro compile
times could be avoided by using XDL generators that generate the hard macro on-the-fly.
Such generators were demonstrated by Ghosh [32] and illustrate that hard macro generators
can produce circuits of similar quality to that produced by the Xilinx tools and do so in a
fraction of a second. The issue of hard macro compilation time could thus be eliminated by
using such hard macro XDL generators.

Table 4.5: Fine-grained Hard Macro Compile Times
Hard Macro Name   Tiles Occupied   Compile Time
Delay23           4                116 secs
Delay14           4                116 secs
Delay1            4                120 secs
Negate            10               122 secs
Inverter          1                136 secs
Logical3          1                136 secs
Mux4              1                136 secs
Delay25           1                137 secs
Delay33           2                137 secs
Delay             1                147 secs
Mux3              4                148 secs

4.3.2 RapidSmith Router Performance

The RapidSmith router used in HMFlow is a full design router, capable of routing
arbitrary designs in XDL, and is not specific to designs containing hard macros. This
makes it reasonable to compare the router against the Xilinx router
in par. To measure the performance of the RapidSmith router, the benchmark circuits
were placed and then flattened (all references to hard macros were removed with circuitry
remaining intact) and exported to an XDL file. The designs were flattened for a fairer
comparison between the RapidSmith router and Xilinx router because it takes a significant
amount of additional time for the Xilinx tools to load designs that contain several hard
macros. The execution times in Table 4.6 represent the time it took each router to load the
design implementation, route it, and then save it to a file.
As can be seen from Table 4.6, the RapidSmith router runs from over 3× to almost 10×
faster than the Xilinx router in par. The smaller designs ran the fastest because of the
lightweight RapidSmith device files mentioned in Section 3.4.1 and the execution penalty
mentioned earlier in this section for par. The execution parameters for par were set to
route only, with all effort levels set to standard (fastest run time) and with the -x option,
which ignores any given timing constraints and generates new ones suitable for the circuit.

Table 4.6: Router Performance Comparison: Xilinx vs. RapidSmith
Design Name       Xilinx Router Runtime   RapidSmith Router Runtime   Speedup over Xilinx
pd control        14.1 secs               1.4 secs                    9.9x
polyphaseFilter   16.6 secs               2.6 secs                    6.5x
aliasingDDC       16.6 secs               2.7 secs                    6.3x
dualDivider       22.9 secs               4.0 secs                    5.8x
computeMetric     34.0 secs               8.0 secs                    4.2x
fft1024           29.7 secs               6.5 secs                    4.5x
filtersAndFFT     47.2 secs               14.8 secs                   3.2x
frequencyEst      89.5 secs               20.3 secs                   4.4x
dualFilter        196.2 secs              38.8 secs                   5.1x
trellisDecoder    210.3 secs              59.2 secs                   3.5x
filterFFTCM       452.0 secs              76.7 secs                   5.9x
multibandCorr     312.3 secs              82.2 secs                   3.8x
signalEst         516.4 secs              107.9 secs                  4.8x

Table 4.7: Hard Macro Placer Algorithm Comparison
                  Runtime (seconds)           Clock Rate (MHz)
Benchmark Name    REC       RAND     FAST     REC     RAND    FAST
pd control        0.266     0.047    0.016    92      65      129
polyphaseFilter   3.281     0.063    0.015    111     65      108
aliasingDDC       3.234     0.047    0.016    97      67      107
computeMetric     15.86     0.157    0.047    69      54      57
fft1024           11.703    0.094    0.047    67      47      74
frequencyEst      29.156    0.391    0.219    50      30      60
filterFFTCM       203.9     2.765    0.984    43      27      37
Averages          38.2      0.51     0.192    75.6    50.7    81.7
REC: Recursive bi-partitioning placer, RAND: Random placer, FAST: Fast heuristic placer

4.3.3 Hard Macro Placer Algorithms

As mentioned in Section 4.2.6, three placers were created during the development
of HMFlow: a recursive bi-partitioning placer (REC), a random placer (RAND) and a fast
heuristic placer (FAST). Only a limited set of benchmarks had been completed when the REC and
RAND placers were developed and tested; therefore, comparisons between the three placers
are made only on that subset of benchmark designs. Each benchmark design was placed
three times, once for each placer as listed in Table 4.7. Clock rates were obtained after each
placement was routed with the HMFlow router and measured with the Xilinx tool, trce.
As can be seen from the listed averages at the bottom of Table 4.7, the FAST placer
had the fastest execution of the three placers and produced, on average, the highest quality
circuit⁴. These results show that the custom heuristic designed for hard macro placement
is capable of producing higher quality results than the RAND and REC placers while still
executing with speed similar to that of the RAND placer.

4.3.4 HMFlow Performance

After the complete realization of HMFlow, all 13 benchmark circuits were
implemented with both the Xilinx tools and HMFlow. Since the benchmark circuits were created
as Xilinx System Generator designs, the Xilinx tools runtime included the time elapsed
during the following steps:
• System Generator NGC generation (outputs netlist)
• NGDBuild (outputs NGD)
• map (outputs mapped circuit in NCD format)
• par (outputs placed and routed circuit)

Some steps of the Xilinx flow allow execution to be optimized to reduce overall runtime. Therefore, options such as overall-effort-level in map and par were set to their lowest
setting to ensure that the comparison is made with the fastest execution of the Xilinx tools.
Although theoretically possible, bitstream creation (.NCD to .BIT) could not be legally accelerated by HMFlow due to licensing issues, thus, bitgen execution time is omitted as it
would be identical for either of the design flows.
⁴ It should be noted that after the prototype RAND placer was implemented, the FAST placer benefited
significantly from optimization and is shown to be the fastest. It is very likely that the RAND placer could
have benefited from some optimization and could have run as fast as, if not faster than, the FAST placer.

Table 4.8: Runtime Performance of HMFlow and Comparison to Xilinx Flow
Benchmark         Simulink   Mapper /   Design     Hard Macro   Design      XDL       HMFlow    Xilinx    HMFlow     XDL to
Name              Parser     HM Cache   Stitcher   Placer       Router      Export    Runtime   Runtime   Speedup¹   NCD
pd control        0.093s     0.735s     0.187s     0.016s       0.219s      0.062s    1.3s      65.6s     50×        2.8s
polyphaseFilter   0.094s     0.75s      0.219s     0.015s       1.406s      0.11s     2.6s      60.3s     23.2×      4s
aliasingDDC       0.11s      0.765s     0.219s     0.016s       1.453s      0.125s    2.7s      62.2s     23.1×      7.4s
dualDivider       0.313s     0.89s      0.203s     0.047s       2.407s      0.218s    4.1s      96.6s     23.7×      6.3s
computeMetric     0.281s     0.891s     0.641s     0.047s       6.359s      0.609s    8.8s      160.8s    18.2×      17.1s
fft1024           0.235s     0.937s     0.297s     0.047s       4.953s      0.375s    6.8s      119.3s    17.4×      10.3s
filtersAndFFT     0.328s     0.984s     0.797s     0.188s       12.312s     0.75s     15.4s     254.1s    16.5×      20.3s
frequencyEst      0.437s     1.5s       0.578s     0.219s       18.11s      1.171s    22s       373.5s    17×        107.3s
dualFilter        0.469s     1.313s     1.203s     0.437s       34.672s     1.656s    39.8s     469s      11.8×      140.4s
trellisDecoder    0.656s     1.719s     1.422s     0.547s       54.015s     2.5s      60.9s     824.6s    13.5×      115.1s
filterFFTCM       0.516s     1.937s     1.641s     0.984s       69.938s     3.046s    78.1s     1021.3s   13.1×      541.2s
multibandCorr     0.828s     1.797s     1.844s     1.859s       73.297s     5.781s    85.4s     786.2s    9.2×       506.7s
signalEst         0.843s     2.328s     2.157s     1.531s       107.547s    15.375s   129.8s    1508.7s   11.6×      869.2s
¹ We define HMFlow speedup as the time it takes Xilinx to create a placed and routed implementation divided by the
time it takes HMFlow to create a placed and routed implementation from the same System Generator design.

The total runtime of HMFlow is measured as the sum of time elapsed during the
following steps:
• Simulink Parser (reads/parses System Generator design)
• Mapper/Hard Macro Cache (maps/retrieves hard macros)
• Design Stitcher (combines hard macros together)
• Hard Macro Placer (places all hard macros)
• Design Router (routes all un-routed nets)
• XDL Export (outputs placed and routed XDL design file)

Table 4.8 contains the runtimes for the individual steps in HMFlow, the total runtimes
of HMFlow and Xilinx tools and the speedup of HMFlow over the Xilinx tools. A preliminary
observation is that the speedup obtained by HMFlow is much greater (23-50×) for the
smaller designs than the larger designs (9.2-13.5×). There are a few possible causes for
this phenomenon. First, the Xilinx tools write out intermediate design files between each
step of the design process and must also load device database files for each step. This
is in contrast to HMFlow which loads design information once (during Simulink parsing
and hard macro cache accesses) and loads the RapidSmith device files once for the entire
execution of HMFlow. Therefore, the Xilinx tools spend a disproportionate amount of time
reading/writing files for the smaller designs. A second possible cause for the faster speedup
of smaller designs could be attributed to the algorithms used in the Xilinx tools. As Xilinx
must accommodate very large designs on very large parts, their algorithms might be tuned
for larger designs and thus, perform better. However, our suspicions lie with the former
explanation rather than the latter.
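As a worked check of how the totals and speedups in Table 4.8 relate to the per-step runtimes, the following Python sketch sums the six HMFlow steps for two benchmarks and divides the Xilinx runtime by that sum, following the speedup definition given with the table. The function and variable names are chosen for this example only.

```python
# Per-step runtimes in seconds, in the order: Simulink parser, mapper/HM cache,
# design stitcher, hard macro placer, design router, XDL export (Table 4.8).
pd_control = (0.093, 0.735, 0.187, 0.016, 0.219, 0.062)
dual_divider = (0.313, 0.89, 0.203, 0.047, 2.407, 0.218)


def hmflow_runtime(step_times):
    """Total HMFlow runtime is the sum of the six step runtimes."""
    return sum(step_times)


def speedup(xilinx_runtime, step_times):
    """Xilinx runtime divided by HMFlow runtime for the same design."""
    return xilinx_runtime / hmflow_runtime(step_times)


print(round(hmflow_runtime(pd_control), 1), round(speedup(65.6, pd_control), 1))
print(round(hmflow_runtime(dual_divider), 1), round(speedup(96.6, dual_divider), 1))
```

Note that dividing the rounded totals gives slightly different values (for example, 96.6 / 4.1 is about 23.6), so the speedups listed in the table appear to be computed from the unrounded step sums.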
If the runtimes of all benchmarks are averaged, the resulting runtime distribution of
HMFlow is shown as a pie chart in Figure 4.11a. As can be seen from Figure 4.11a, the majority
of the time is spent in the router. This illustrates the speed at which all the other steps in
HMFlow operate. A completely placed implementation can be obtained from HMFlow in a
matter of seconds for even the largest benchmarks. Although the RapidSmith router runs
3-10× faster than the Xilinx router, it is still the most time consuming process in HMFlow.
This is likely because of the granularity of hard macros used in the benchmark designs. The
hard macros do not contain a significant percentage of all nets in the final design; most of
the nets exist between the external ports of the hard macros. For this reason, the router
must route almost all of the nets in the design in the final step of HMFlow. Chapter 5 details
an improved version of HMFlow and will concentrate on creating larger, more routing-dense
hard macros to help alleviate this problem.

Figure 4.11: (a) Average Runtime Distribution of HMFlow (b) HMFlow Runtime as a
Percentage of Total Time to Run HMFlow and Create an NCD File
[(a) Router 85%, XDL Export 7%, Mapper/HM Cache 4%, Stitcher 2%, Placer 1%, Simulink
Parser 1%; (b) XDL to NCD 84%, HMFlow 16%]

One challenging issue with HMFlow and the Xilinx tools is the time it takes to
create an NCD file from an XDL file. The NCD and XDL file formats can represent all
of the necessary details to describe an implementation; however, XDL must be converted
to NCD in order to create a bitstream, as bitgen only accepts NCD files as input. The
far right column of Table 4.8 lists the runtime to generate an NCD file for the resulting
HMFlow benchmark implementation. As designs get larger, the conversion time escalates.
Figure 4.11b shows the average runtime distribution if the runtime of HMFlow and XDL
conversion are combined. This challenge is specifically associated with the Xilinx tools, not
the use of hard macros or HMFlow. The long conversion times could easily be overcome if
appropriate Xilinx proprietary knowledge were obtained to build a more efficient conversion
tool, or if a different FPGA vendor that provides the necessary native netlist access were used.

[Figure 4.12 plot: Max Design Clock Speed (MHz), axis 0 to 300; series: Xilinx Implementation, HMFlow Implementation]

Figure 4.12: A Comparison Plot of the Benchmark Circuits Maximum Clock Rates When
Implemented with HMFlow and the Xilinx Tools

In order to achieve the dramatic reduction in runtime, HMFlow trades implementation
clock rates for faster runtime as shown in Figure 4.12. On average, HMFlow produces
implementations that are 2-4× slower than those created by the Xilinx tools⁵. For rapid
prototyping purposes, this tradeoff may be acceptable because even though the design will
run 2-4× slower, overall, it will be executing tens of thousands of times faster than simulation.

⁵ The reader is reminded that the Xilinx tools are likely to have produced a higher quality circuit; however,
the results shown in Figure 4.12 represent the fastest runtime configuration of the Xilinx tools.

4.4 Conclusion

HMFlow has demonstrated that rapid compilation of FPGA designs using hard macros is
possible. It has been shown that pre-compiled blocks can be used to construct
functional and correct designs an order of magnitude faster than conventional vendor tools.
Although QOR is traded for faster compilation time, this technique paves the way for a
completely new approach to FPGA compilation that will have the potential to change the
way engineers design and debug projects on FPGAs.

CHAPTER 5.
HMFLOW 2011: ACCELERATING FPGA COMPILATION
AND MAINTAINING HIGH PERFORMANCE IMPLEMENTATIONS

After the success of HMFlow 2010 in achieving rapid compilation for FPGAs, it was
time to take a step back and consider alternative approaches. Of particular interest was
exploring ways of maintaining rapid compilation, but also improving HMFlow implementation clock rates to have speeds comparable to those produced by vendor tools. The major
approach taken in HMFlow 2011 was to (as hinted at in Section 4.3.4) increase the size of
the hard macros used in the flow. That is, rather than use fine-grained hard macros such as
adders, registers, multiplexers, etc., the approach would use system-level sized blocks such
as FIR filters, FFTs, DDS/mixers, micro-controllers and so forth.
These larger hard macros benefit HMFlow in two major ways. First, larger hard
macros contain a much higher percentage of the total routes in the design. This allows timing
closure information (the timing-sensitive routing information) to be stored in the form of
routed nets that can be re-used with each instantiation of the hard macro. In addition,
it significantly reduces routing time. As an example, the frequency estimator benchmark
used to demonstrate HMFlow 2010 had 11,226 un-routed nets (66% of total nets) before the
design reached the router. However, a large hard macro version of the frequency estimator
benchmark had only 854 un-routed nets (5% of total nets) before the design reached the
router. This resulted in a significant reduction in total runtime for the HMFlow 2011 router
as the router only had to route the last 5% of nets (rather than 66%) in the design to
complete the implementation.
The second main advantage of larger hard macros is that the larger hard macro-based
designs will typically absorb many of the smaller hard macros prevalent in the benchmarks
used in HMFlow 2010. This substantially reduces the total number of objects requiring
placement during the placement process, ultimately reducing the problem size. For instance,
the frequency estimator benchmark used to demonstrate HMFlow 2010 requires 757 hard
macro instances to be placed. However, a large hard macro version of the benchmark only
has 31 hard macro instances to be placed, a 24× reduction in problem size.
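The two reductions quoted above can be checked with a few lines of arithmetic. The following is a minimal sketch in Python (not part of HMFlow); the dictionary and function names are illustrative only, and the counts are the ones quoted in the text for the frequency estimator benchmark.

```python
# Counts quoted in the text for the frequency estimator benchmark.
small_macro = {"unrouted_nets": 11_226, "instances": 757}   # HMFlow 2010 version
large_macro = {"unrouted_nets": 854, "instances": 31}       # large hard macro version


def reduction(before: int, after: int) -> float:
    """Factor by which a problem size shrinks going from `before` to `after`."""
    return before / after


# Hypothetical variable names; only the four input counts come from the text.
net_reduction = reduction(small_macro["unrouted_nets"], large_macro["unrouted_nets"])
instance_reduction = reduction(small_macro["instances"], large_macro["instances"])

print(f"nets left for the router shrink by about {net_reduction:.1f}x")
print(f"hard macro instances to place shrink by about {instance_reduction:.1f}x")
```

The instance ratio rounds to the 24× figure given above, and the same calculation applied to the un-routed net counts gives a reduction of roughly 13× in the work left for the router.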
This chapter describes the preliminary work performed to validate the ideas of using
large hard macros and their potential to be the major building block of a high performance,
rapid compilation flow. Once large hard macros for high performance are validated, a modified HMFlow 2010 (HMFlow 2010a) is augmented to support large hard macros and a
newer Virtex 5 architecture. This sets the stage for HMFlow 2011 with a number of new
improvements and techniques that are described and implemented. Finally, several further
improvements are made on the new placer in HMFlow 2011 to help increase robustness and
finalize HMFlow 2011.

5.1 Preliminary Work

Two major questions needed to be answered to determine whether large hard macros could be
feasible for both creation and design implementation. They are:
1. If a large hard macro is created and timing verified to run at clock rate X at one
location, will it be able to run at the same or similar rate at another location on the
FPGA fabric?
2. What is the best way to create a hard macro in terms of size, shape and density?
Question #1 is simply a question to verify if the FPGA fabric can be relied upon to
provide consistent timing behavior of a hard macro regardless of where it might be placed.
This is important because when a design is implemented with HMFlow 2011, its clock rate
is determined not only by the critical path in between hard macros, but also by the longest
critical path within the hard macros themselves. As each hard macro is created, its longest
critical path is calculated for the location on which it was created. Question #1 aims to
discover how accurate the critical path measurement is when the hard macro is placed at
other locations on the FPGA.
Question #2 aims to reveal information that ultimately improves the performance
of HMFlow, specifically, how to increase the place-ability (maximizing the number of valid
locations a hard macro can be placed on the FPGA) of hard macros and how to reduce their
critical paths. Increasing the place-ability of a hard macro depends largely upon its shape
and size. Smaller, more uniformly sized hard macros have many locations on which they
can be placed. However, large hard macros, especially oddly-shaped ones, have fewer valid
locations on which they can be placed. Also, at this point in the research, it was not well
understood which parameters affected the actual performance or critical path of a hard
macro.

5.1.1 Experiments

To answer the questions presented in the previous section, a series of experiments were
conducted.

Experiment #1: Timing Variability of Hard Macros on FPGA Fabric
Experiment #1 is designed to answer Question #1 as presented in the previous section. That is, it aims to measure the timing variability of hard macros as they are placed at
different locations across the FPGA fabric. To do this, two different hard macros containing
a reasonable set of internal routes were created. The first hard macro created was a 21×21
bit LUT-based multiplier on a Virtex 4 SX35 FPGA. Relative to the FPGA fabric size, the
hard macro was small, but did contain several timing sensitive routing paths that could be
accurately measured and variations in its timing behavior could easily be detected.
To measure the variability of the FPGA fabric, a separate implementation containing
a unique placement for each valid location of the hard macro was created. In this case, the
multiplier could be validly placed at 400 different locations on the Virtex 4 SX35. These
implementations were produced by writing a custom program in RapidSmith that would
iteratively find each valid location at which the multiplier hard macro could be placed,
create a unique implementation with the hard macro placed at that location and then save
it to an XDL file. At the end, there were 400 XDL files, each with a single instance of the
multiplier hard macro placed at a unique location.

Figure 5.1: General Pattern of Hard Macro Placement on FPGA Fabric

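The search for valid locations can be illustrated with a small sketch. This is not the RapidSmith program used in this work (RapidSmith is a Java library and its API is not shown here); it is a Python model under simplifying assumptions: the fabric is a hypothetical grid of one-letter tile types, and a hard macro is a list of (dx, dy, type) offsets from its anchor. The function name `valid_anchors` and the toy fabric are invented for illustration.

```python
from typing import List, Tuple


def valid_anchors(fabric: List[str],
                  footprint: List[Tuple[int, int, str]]) -> List[Tuple[int, int]]:
    """Return every (x, y) anchor at which each footprint tile lands inside the
    fabric on a tile of the same type. fabric[y][x] is a one-letter tile type."""
    height, width = len(fabric), len(fabric[0])
    anchors = []
    for y in range(height):
        for x in range(width):
            if all(
                0 <= x + dx < width
                and 0 <= y + dy < height
                and fabric[y + dy][x + dx] == kind
                for dx, dy, kind in footprint
            ):
                anchors.append((x, y))
    return anchors


# Toy fabric (hypothetical): 'S' = slice column, 'B' = block RAM column.
fabric = ["SSBSS",
          "SSBSS",
          "SSBSS"]
# A 2x2 hard macro made only of slices, described relative to its anchor.
footprint = [(0, 0, "S"), (1, 0, "S"), (0, 1, "S"), (1, 1, "S")]

anchors = valid_anchors(fabric, footprint)
for x, y in anchors:
    # In the real experiment each valid location became its own XDL file.
    print(f"valid placement with anchor at X={x}, Y={y}")
```

In the experiment, each location found this way produced a separate implementation, so the number of valid anchors equals the number of XDL files generated.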
The placement locations formed a grid pattern on the FPGA fabric as demonstrated
by Figure 5.1. Each shaded rectangle in the grid represents a single implementation of the
hard macro placed there with every rectangle being populated in a separate implementation.
The timing results of each unique placement of the hard macro populated the corresponding
measurement in the grid. For each of the 400 implementations, the XDL file was
converted to NCD and then provided as input to Xilinx Trace (trce) to calculate the
minimum clock period at which the multiplier could be clocked and delay values for all of
the paths within the multiplier. All of the 400 trce report files (.TWR) were analyzed
and two types of deviations in timing were found. The first is represented in Figure 5.2, where
a particular path’s delay in the multiplier hard macro is plotted in a grid with the Z axis
representing the amount of delay (in nanoseconds).
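Once a path delay has been extracted from each report, the size of a timing deviation is simply the spread of that delay across all anchor locations. The sketch below assumes the delays have already been collected into a dictionary keyed by anchor coordinate; the parsing of the .TWR files is not shown, and the four sample values are made up for illustration rather than taken from the measured data.

```python
from typing import Dict, Tuple

Anchor = Tuple[int, int]


def delay_spread_ps(delays_ns: Dict[Anchor, float]) -> float:
    """Difference between the slowest and fastest placement, in picoseconds."""
    values = list(delays_ns.values())
    return (max(values) - min(values)) * 1000.0


def worst_anchor(delays_ns: Dict[Anchor, float]) -> Anchor:
    """Anchor location at which the path delay is largest."""
    return max(delays_ns, key=delays_ns.get)


# Hypothetical sample: delay (ns) of one path at four anchor locations.
sample = {(20, 20): 4.580, (20, 40): 4.588, (40, 20): 4.581, (40, 40): 4.590}

spread = delay_spread_ps(sample)
slowest = worst_anchor(sample)
print(f"spread across placements: {spread:.1f} ps, slowest at anchor {slowest}")
```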
Note the regular sharp hills that span the Y direction; these correlate very well with
the clock branches found at those locations on the FPGA (the graph is rotated about 100°
counter-clockwise from a normal X/Y perspective). These deviations, however, are quite
minimal (∼10 ps) and are caused by the hard macro straddling a clock distribution branch
in the FPGA fabric. This deviation is quite small and very acceptable for the purposes of
HMFlow 2011.

[Figure 5.2 plot: "Example Path Deviation Across Chip"; Delay (ns) plotted against Hard Macro Anchor X coordinate and Hard Macro Anchor Y coordinate]

Figure 5.2: Delay of a Path Within a 21×21 Bit LUT-multiplier Hard Macro Placed in a
Grid of 400 Locations on a Virtex 4 SX35 FPGA

However, a second type of timing deviation was found among the trce reports and
is shown in Figure 5.3 (rotated about 45° counter-clockwise). Here the timing deviation
is much more significant (∼ 250 ps) and is due to the multiplier hard macro straddling
the center column or clock distribution spine of the FPGA. If examined closely, the same
ridges present in Figure 5.2 can be seen, although they are dwarfed by the major ridge
down the center column of the FPGA. The delay measured in this grouping of paths is more
problematic but can easily be avoided by simply disallowing any hard macro to be placed
such that it straddles the center column of the chip.
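The restriction described above amounts to a simple test on the columns a hard macro would occupy. The following sketch assumes a hard macro covers the columns from its anchor X coordinate up to (but not including) anchor X plus its width, and that the clock spine lies on the boundary just to the left of column `center_x`. The function name, the column numbering and the example values are assumptions made for illustration.

```python
def straddles_column(anchor_x: int, width: int, center_x: int) -> bool:
    """True if a macro covering columns [anchor_x, anchor_x + width) has columns
    on both sides of the boundary immediately left of column center_x."""
    return anchor_x < center_x < anchor_x + width


# Hypothetical example: spine boundary at column 40, hard macro 6 columns wide.
center_x, macro_width = 40, 6
candidate_xs = range(30, 50, 2)

allowed = [x for x in candidate_xs
           if not straddles_column(x, macro_width, center_x)]
print("anchor X coordinates that avoid the center column:", allowed)
```

A placer only needs to apply this filter to its list of candidate locations, so avoiding the larger timing deviation costs a small number of valid sites and no additional timing analysis.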

[Figure 5.3 plot: "Example Path Deviation Across Chip"; Delay (ns) plotted against Hard Macro Anchor X coordinate and Hard Macro Anchor Y coordinate]

Figure 5.3: A More Severely Impacted Path Caused by a Hard Macro Straddling the Center
Clock Tree Spine of the FPGA

An additional experiment was performed to verify that this behavior holds true for the
Virtex 5 series of FPGAs. A Xilinx PicoBlaze (Ken Chapman’s programmable state machine)
hard macro was created for the Virtex 5 SX240T FPGA and placed in 3700 locations. Each
implementation was created as with the multiplier and all 3700 trce reports were analyzed.
The critical path delay for each placement is plotted in Figure 5.4. In this scenario, the hard
macro was not allowed to straddle the center column of the FPGA to avoid the significant
timing variation experienced with the multiplier.
The PicoBlaze had the same behavior as the path shown in Figure 5.2 (the Z axis is
critical path delay in nanoseconds). That is, the timing deviation was minimal (∼ 10 ps)
and was caused by the clock distribution branches found at regular intervals in the FPGA
fabric. Note that as with Figure 5.2, the plot has been rotated about 100° counter-clockwise
from a normal X/Y perspective.

[Figure 5.4 plot: Delay (ns) plotted against Hard Macro X Coordinate and Hard Macro Y Coordinate]

Figure 5.4: A PicoBlaze Hard Macro Placed at 3700 Locations on a Virtex 5 FPGA

Experiment #2: Optimal Hard Macro Creation Parameters
The most important aspect of hard macro creation is the area constraint applied to
it during its placement phase. In fact, there are actually multiple area constraints that are
often applied when creating hard macros as the logic will often contain slices, DSP48s and/or
block RAMs. Each primitive type requires its own area group constraint and calculating their
size and shape can affect the hard macro’s place-ability.
Experiment #2 is designed to answer Question #2 as described in the previous section.
It aims to find the optimal parameters for creating hard macro area
constraints such that place-ability and clock rate are maximized. In order to do this, large
hard macros were selected and several thousand runs of the Xilinx tools were executed,
each with a slightly different configuration of area constraint and timing constraint.
Specifically, there were 4 different hard macro designs: a 1024 point FFT, an 18×18 bit LUT-based
multiplier, a PicoBlaze and a MicroBlaze.
To vary the area group constraints, a width/height metric was used as area group
constraints are generally rectangular. Width and height test values were calculated in terms
of slice coordinates ranging from 1 to 5 in both dimensions. Table 5.1 shows all of the test
configurations for the area group aspect ratio parameter (duplicate ratios are denoted by
‘-’).

Table 5.1: Width/Height Area Group Aspect Ratio Configurations

Height \ Width    1     2     3     4     5
1                 1/1   2/1   3/1   4/1   5/1
2                 1/2   -     3/2   -     5/2
3                 1/3   2/3   -     4/3   5/3
4                 1/4   -     3/4   -     5/4
5                 1/5   2/5   3/5   4/5   -

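The configurations in Table 5.1 can be regenerated by enumerating every width/height pair and discarding ratios that reduce to one already seen. The sketch below is an illustration in Python, not the script used for the experiments; the function name `aspect_ratio_table` is hypothetical.

```python
from fractions import Fraction
from typing import List, Set, Tuple


def aspect_ratio_table(max_dim: int = 5) -> Tuple[List[List[str]], Set[Fraction]]:
    """Build the rows of a width/height table, marking duplicate ratios with '-'.
    Rows are indexed by height and columns by width, as in Table 5.1."""
    seen: Set[Fraction] = set()
    rows: List[List[str]] = []
    for height in range(1, max_dim + 1):
        row = []
        for width in range(1, max_dim + 1):
            ratio = Fraction(width, height)   # reduced automatically, e.g. 2/2 -> 1
            if ratio in seen:
                row.append("-")
            else:
                seen.add(ratio)
                row.append(f"{width}/{height}")
        rows.append(row)
    return rows, seen


rows, unique_ratios = aspect_ratio_table()
for height, row in enumerate(rows, start=1):
    print(height, " ".join(f"{cell:>4}" for cell in row))
print("unique aspect ratios:", len(unique_ratios))
```

Because `Fraction` reduces each ratio to lowest terms, entries such as 2/2 or 4/2 are recognized as duplicates of 1/1 and 2/1 without any special-case code.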
In addition, a density metric was used to determine how loose or tight the area constraint
should be. This was calculated as a function of the xst synthesis report utilization estimates
for the hard macro design of interest. This metric only applied to slices as DSP48 and block
RAM counts are always estimated accurately for the hard macros tested. For example, if xst
reported that a design required 100 slices, a density metric of 50% overhead would produce an
area constraint that contained at least 150 slices. A density metric of 0% overhead produced
an area constraint that contained at least 100 slices. In some situations, the shape of the
area constraint would produce an area constraint slightly larger (by just a few slices)
than what was specified by the density metric. For this parameter, overhead values ranged
from 0 to 150% overhead, in increments of 10%.
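The relationship between the slice estimate, the density metric and the resulting rectangle can be sketched as follows. The way the rectangle is derived from the target slice count here is an assumption made for illustration (the text does not give the equation used), and the sketch ignores architectural details such as how slices are grouped into CLBs. Overhead is expressed as an integer percentage so that the arithmetic stays exact.

```python
import math
from typing import Tuple


def area_constraint(slice_estimate: int, overhead_pct: int,
                    aspect_w: int, aspect_h: int) -> Tuple[int, int]:
    """Return (width, height) in slices of a rectangle close to the aspect ratio
    aspect_w/aspect_h that holds at least the estimate plus the overhead."""
    # Round the target slice count up using integer arithmetic.
    target = (slice_estimate * (100 + overhead_pct) + 99) // 100
    width = math.ceil(math.sqrt(target * aspect_w / aspect_h))
    height = -(-target // width)              # ceiling division
    return width, height


# Example from the text: xst estimates 100 slices, 50% overhead, 1/1 aspect ratio.
width, height = area_constraint(100, 50, 1, 1)
print(f"{width} x {height} = {width * height} slices (at least 150 required)")

# Overhead values swept in the experiment: 0% to 150% in increments of 10%.
overheads = list(range(0, 151, 10))
```

As noted in the text, rounding to whole slice coordinates means the rectangle usually ends up slightly larger than the density metric strictly requires.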
One additional parameter was varied in this large set of Xilinx tool runs: the minimum
clock period constraint. All of the 4 hard macros tested were implemented in a single clock
domain and required a clock period timing constraint. In each unique parameter instance,
10 different timing constraints were run 100 ps apart, starting with a constraint where the
tools could not achieve the constraint and then lengthening the constraint 10 times. The
constraints were in the range of 4 to 5 ns but were design dependent.
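The sweep of timing constraints can be written down directly. The sketch below works in integer picoseconds to avoid floating point drift; the starting value of 4000 ps and the function names are hypothetical, since the actual starting constraint was design dependent.

```python
from typing import Dict, List, Optional


def period_constraints_ps(start_ps: int, count: int = 10,
                          step_ps: int = 100) -> List[int]:
    """Clock period constraints, tightest first, spaced step_ps apart."""
    return [start_ps + i * step_ps for i in range(count)]


def tightest_met(results: Dict[int, bool]) -> Optional[int]:
    """Smallest period constraint (ps) that the tools met, or None if none was."""
    met = [period for period, ok in results.items() if ok]
    return min(met) if met else None


constraints = period_constraints_ps(4000)     # hypothetical starting point
# Made-up outcome: the tools fail the three tightest constraints.
results = {period: period >= 4300 for period in constraints}
print("constraints (ps):", constraints)
print("tightest constraint met (ps):", tightest_met(results))
```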
Ultimately, with the 4 designs and three parameters, several thousand runs of the
Xilinx tools (ngdbuild, map, par, and trce) were executed to obtain the timing performance
of each implementation. The results of these runs will be presented in a future thesis
by Jaren Lamprecht at Brigham Young University in 2012¹. No correlation was found between
the area group aspect ratio or density metric and the ability of the Xilinx tools
to achieve timing closure. Therefore, hard macro area constraints can be generated to
maximize their place-ability without regard for trying to meet timing closure. This outcome
provided significant flexibility to the hard macro creation process as area constraints could
be optimized for a single parameter rather than two opposing parameters.

5.1.2 Conclusions on Preliminary Work

After completing the experiments, the questions put forth at the beginning of this section
can now be answered. That is, will hard macros be able to run at the same or similar
clock rates if placed at different locations? The answer is yes. Timing variation is quite
minimal (∼ 10 ps) on both Virtex 4 and Virtex 5 FPGAs as long as the hard macro does
not straddle the center column clock spine of the chip.
Question #2 asked what the best way is to create a hard macro in terms of size,
shape and density. From the experiments conducted, it was learned that the aspect ratio
and density had little to no effect on the timing closure capability of a hard macro-bound
design. Therefore, the area aspect ratio and density can be chosen for optimal place-ability.
Ultimately, the density constraint found to be best was 50% and the best aspect ratio was 1/1
(width=height).

¹ The reader is directed to Jaren Lamprecht’s thesis for a full detailed description and analysis of this
work. Only summarizing statements of the outcomes are presented here.

5.2 Comparison of HMFlow Using Large Hard Macros vs. Small Hard Macros

After determining that large hard macros are a feasible and effective approach to
minimizing compile time while increasing circuit quality, a number of benchmark designs
containing several large hard macros were needed. In addition, minor changes would be
required to HMFlow in order to support larger hard macros in designs and the methods of
indicating how they should be created. Virtex 5 would be the target FPGA platform for
2011 and thus, additional upgrades to HMFlow would be required to handle the additional
FPGA architecture.
Ultimately, six large designs were chosen and re-fashioned from the benchmarks used
to demonstrate HMFlow 2010. Three of these designs had the same logic as their small hard
macro counterparts but were implemented with large hard macros rather than small ones.
These benchmarks were tested to measure the impact of using large hard macros against
small ones in the areas of runtime and quality of result.

5.2.1 Upgrading HMFlow to Support Large Hard Macros

The design theory behind implementing large hard macros is that it would enable a
designer to encapsulate a building block of his or her design into a reasonably finished
module. That is, the designer could use the small hard macro paradigm to develop, debug
and verify the module. Then, once the block was satisfactorily correct and verified, it could
be tagged to become a large hard macro so that it could be compiled for high quality.
In order to allow the designer this capability, the Subsystem construct found in the
MathWorks Simulink environment was leveraged. Simulink is the environment used to implement Xilinx System Generator designs and recognizes subsystems as a valid design construct.
For example, consider Figure 5.5a, which shows a FIR filter constructed in System Generator. In HMFlow 2010, each block is converted into a hard macro; in this case, there would be
22 hard macros created (the 4 constant blocks actually just become routing). However, the
entire FIR filter can be encapsulated inside of a subsystem as shown in Figure 5.5b, which
is bound to become a single large hard macro.
The changes in HMFlow needed to support subsystems were actually quite minimal.
The first change occurred in the Simulink parser, which had to identify top-level subsystems
so that each could be converted to a single hard macro. Only top-level subsystems (subsystems
at the highest level of hierarchy) were recognized for hard macro creation because
Simulink allows for multiple levels of subsystem hierarchy. This change in the parser required
slight changes to the Simulink data structure and hard macro creation process, but these were
still fairly minimal. All of these changes were implemented, tested and verified before any
other changes were made to HMFlow.

Figure 5.5: (a) Illustrates a FIR Filter Implemented in System Generator (b) Shows the FIR
Filter Converted to a Subsystem to be Turned into a Hard Macro

5.2.2 Upgrading HMFlow to Support Virtex 5 FPGAs

As with all research, the technology on which it was performed quickly becomes obsolete.
To try and minimize the obsolescence of the technology used to demonstrate
HMFlow, it was upgraded to support Virtex 5 FPGAs in addition to Virtex 4. This required
a number of changes to HMFlow because of the differences in primitive types such as IOBs
and slices. Some of the changes that were made are summarized as follows:

• Hard Macro Creation
  – Port hard macro identification had to change to accommodate Virtex 5 slice differences
  – Synthesis reports output statistics in a different format for Virtex 5 so parsing
    facilities had to change
  – The layout and LUT counts of Virtex 5 slices compared to Virtex 4 slices are
    different and require a separate equation for area constraint calculation
• Stitcher
  – The IOB creation had to be modified to accommodate changes to IOBs in Virtex
    5 primitives
  – Clock generation circuitry was modified to support Virtex 5 architectures
  – Virtex 5 did not require instantiation of the PMV primitive as with Virtex 4
    designs
• Router
  – The handling of static sourced nets (VCC and GND) had to be treated differently
    due to the significant change in routing architecture
  – RapidSmith had to include missing information for pin name mapping on some
    primitives (see Section 3.4.2)

With the additional features of (1) large hard macro support through subsystems
and (2) Virtex 5 support, HMFlow has changed significantly enough to differentiate it
from HMFlow 2010 as described in Chapter 4. The modified HMFlow 2010 with these two
additional features will be referred to as HMFlow 2010a throughout the remainder of this
dissertation. It should be emphasized that all of the steps and algorithms used in HMFlow
2010 are the same as those used in HMFlow 2010a.

5.2.3 Large Hard Macro Benchmark Designs for HMFlow

Of the six benchmark designs selected for demonstrating HMFlow for use with large hard
macros, three were logically equivalent to three of the original benchmarks used in
demonstrating HMFlow 2010, namely frequency estimator, multiband correlator and trellis decoder. The other three benchmarks (brik1, brik2 and brik3) each represented an entire
design that occupied an entire FPGA in the original telemetry receiver [31] selected as a
major source for benchmark circuits.
Large hard macros were created in each benchmark by encapsulating major components into
subsystems, often by also inserting pipeline registers at their boundaries. After
all the subsystems were finalized, all six benchmarks were built using HMFlow 2010a to
support large hard macros and targeted the Virtex 5 SX240T FPGA. The slice count of the
six benchmark designs is shown in Table 5.2. It should also be noted that the three non-brik
benchmarks actually existed as a single large hard macro inside one of the brik benchmarks.
The frequency estimator became a single hard macro in brik2, the trellis decoder became
a single hard macro in brik3 and the multiband correlator became a single hard macro in
brik1.

Table 5.2: Slice Counts for Large Hard Macro Benchmark Virtex 5 Designs

Design Name             Virtex 5 Slices
frequency estimator     3970
trellis decoder         5929
brik3                   7505
brik2                   7552
multiband correlator    8258
brik1                   9598

5.2.4 Comparisons of Large and Small Hard Macro-based Designs

To measure the impact of larger hard macros on HMFlow, a comparison was made
between the small hard macro benchmarks and the large hard macro benchmarks.
Both sets of benchmarks were built with the newly created HMFlow 2010a and targeted the
Virtex 5 SX240T FPGA. This would provide a fair comparison of runtime and quality of
result when using large hard macros vs. small hard macros. The runtimes for the benchmarks
and their maximum achievable clock rates are graphed in Figure 5.6.
As can be seen from Figure 5.6a, using larger hard macros over smaller hard macros
can reduce compilation times significantly, by 2.5–4×. This is a very significant
reduction given that the smaller hard macro versions already compile at least 10× faster
than the fastest Xilinx compilations. This shows very positive implications for using larger
hard macros as it validates the previous conjectures on such reductions.
Figure 5.6b shows that using larger hard macros will improve the maximum achievable
clock rate by 60–70%. This is an excellent result as no algorithms were changed in HMFlow
2010a to achieve this result. The fact that more routes were being packed into the hard
macros and being routed with the higher quality Xilinx router made a significant difference
in overall clock rate.

[Figure 5.6 plots: (a) Runtime (seconds) and (b) Max. Clock Rate (MHz) for the Large Hard Macros and Small Hard Macros versions of the frequency_estimator, trellis_decoder and multiband_correlator benchmarks]

Figure 5.6: (a) Comparison of Runtime for Large and Small Hard Macro Versions of 3
Benchmarks on HMFlow 2010a (b) Comparison of Clock Rate for Large and Small Hard
Macro Versions of 3 Benchmarks on HMFlow 2010a

Some of the benefit attributed to the larger hard macros is the increased number of
routed nets encapsulated within the hard macros. The large hard macro benchmarks include
over 3.7× as many routed connections as are found on average in the smaller hard macro benchmarks
as described in Section 4.3.1. This removes a significant burden from the router as
the large hard macro benchmarks have over 60% of their connections already routed (not
including clock, power and ground nets) before arriving at the routing stage. The percentage
of routed connections in each benchmark is shown in Figure 5.7.

Large Hard Macro Compilation Time
As mentioned in Section 4.3.1, this work aims to find out how fast FPGA compilation
can be made to run assuming that hard macros are already built. Large hard macros
generally take longer to compile than small hard macros and, as large hard macros contain
more routing, the hard macros perform better if they are created to run at faster clock rates,
meaning compilation time will increase. Some compile times for larger hard macros are found
in Table 5.3.

[Figure 5.7 plot: Percentage of Routed Connections After Placement (before Routing), axis 56% to 65%, for frequency_estimator, trellis_decoder, brik3, brik2, multiband_correlator and brik1]

Figure 5.7: The Number of Existing Routed Connections in the Large Hard Macro Benchmarks as a Percentage of Total Connections

Although large hard macro compile times are longer than small hard macro compile
times and cannot be solved as easily with XDL generators as described in [32], they do offer
the capability to capture timing closure information. By capturing in a hard macro the routing
configuration that meets a difficult timing constraint, which might take the Xilinx router
several hours or days to generate, that timing closure process does not need to be repeated.
Each subsequent bug fix made in other parts of the design will not necessitate the long and
difficult compilation process that would be needed to re-achieve timing closure because such
information is already captured in the hard macro (obviously if a bug exists in the hard
macro in question, it must be recompiled).

Table 5.3: Coarse-grained Hard Macro Compile Times

Hard Macro Name           Tiles Occupied   Compile Time
digital gain amplifier    62               160s
new correlator fifo       110              168s
ad dynamic range calc     34               177s
Polyphase filter          610              200s
pd control                117              230s
Reindexer                 645              260s
Aliasing DDC              491              320s
multi band correlator1    12672            1324s

Table 5.4: Runtime Comparison of HMFlow with Large Hard Macros vs. Xilinx

           frequency est.   trellis dec.   brik3    brik2    multiband cor.   brik1
Xilinx     392s             461s           605s     852s     498s             849s
HMFlow     7.0s             10.9s          12.9s    13.1s    11.8s            14.2s
Speedup    56×              42×            47×      65×      42×              60×

One could envision a development process that takes a hybrid approach, using
both small and large hard macros to design a block. A designer could start by using
small hard macros to quickly converge on a functional circuit. Then, once the block is
debugged and verified, it could be collapsed into a single large hard macro and compiled
to meet the difficult timing constraint. Compiling the design as a large
hard macro will take longer, but will avoid subsequent compilations where timing closure is
difficult to reach.

5.2.5 Comparisons of Large Hard Macros with HMFlow vs. Xilinx

With large hard macros making a positive impact in performance and quality of HMFlow
implementations, the question arises: how close is this combination to the implementations
produced by Xilinx? To answer this question, the remaining three benchmark
designs (brik1, brik2 and brik3) were also compiled using HMFlow 2010a and then all six
benchmarks were compiled with the Xilinx tools using the same configuration as described
in Section 4.3.4. Runtimes were measured on all six benchmarks for both compilation flows
and are shown in Table 5.4. The average speedup for larger hard macro designs compiled
with HMFlow 2010a over conventional Xilinx tools is about 52×; this is substantially more
speedup than what was achieved using smaller hard macros as reported in Section 4.3.4.

Table 5.5: Clock Rate Comparison of HMFlow with Large Hard Macros vs. Xilinx

            frequency est.   trellis dec.   brik3     brik2     multiband cor.   brik1
Xilinx      237MHz           225MHz         225MHz    199MHz    243MHz           207MHz
HMFlow      82MHz            64MHz          64MHz     80MHz     80MHz            81MHz
Slowdown    2.9×             3.5×           3.5×      2.5×      3.0×             2.5×

Although runtimes were significantly reduced for the HMFlow-generated implementations, clock rates remained relatively low when compared with the Xilinx tools. Clock rates
obtained for the six large hard macro benchmark circuits for HMFlow and Xilinx are shown
in Table 5.5. The HMFlow implementations run 2.5–3.5× slower than the Xilinx
implementations. These clock rates are 60–70% faster than those obtained when using small
hard macro versions of the benchmarks, but there is still plenty of room for improvement.
Because of this large disparity in clock rate, a number of techniques were devised to improve
clock rate which are discussed in the following section.
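The summary figures quoted in this section can be recomputed from the entries of Tables 5.4 and 5.5. The sketch below is one way to do so, assuming the average speedup is the arithmetic mean of the six per-benchmark ratios; the variable names are illustrative and the input values are copied from the two tables.

```python
# Runtimes in seconds (Table 5.4) and clock rates in MHz (Table 5.5).
benchmarks = ["frequency est.", "trellis dec.", "brik3",
              "brik2", "multiband cor.", "brik1"]
xilinx_runtime = [392, 461, 605, 852, 498, 849]
hmflow_runtime = [7.0, 10.9, 12.9, 13.1, 11.8, 14.2]
xilinx_clock = [237, 225, 225, 199, 243, 207]
hmflow_clock = [82, 64, 64, 80, 80, 81]

# Per-benchmark ratios: how much faster HMFlow compiles, how much slower it clocks.
speedups = [x / h for x, h in zip(xilinx_runtime, hmflow_runtime)]
slowdowns = [x / h for x, h in zip(xilinx_clock, hmflow_clock)]
mean_speedup = sum(speedups) / len(speedups)

for name, s, d in zip(benchmarks, speedups, slowdowns):
    print(f"{name:15s} compile speedup {s:5.1f}x   clock slowdown {d:4.2f}x")
print(f"mean compile speedup: {mean_speedup:.1f}x")
```

The mean of the per-benchmark speedups comes out at about 52×, and the clock rate ratios fall in roughly the 2.5–3.5× range quoted above.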

5.3 Modification to HMFlow for High Quality Implementations

In order to improve clock rates of large hard macro-based benchmark designs compiled with
HMFlow 2010a, several new techniques and modifications were required. The result is
a new version of HMFlow, called HMFlow 2011. HMFlow 2011 builds on HMFlow 2010a
and additionally contains a completely new placer, a register re-placement step in between
placement and routing and an optimization tweak to the router. The major differences
between the two versions of the flow are illustrated in Figure 5.8. Each of these performance
optimizations is described in the remainder of this section.

[Figure 5.8 diagram: the HMFlow 2010a flow contains the blocks Design Parser & Mapper, Design Stitcher, XDL Hard Macro Placer and XDL Router, with .mdl input designs, Generic HMG and HM Cache as hard macro sources, and placed & routed XDL as output. The HMFlow 2011 flow contains the blocks Design Parser & Mapper, Design Stitcher, Simulated Annealing HM Placer, Register Replacement and XDL Router v. 2.0, with the same inputs, hard macro sources and output; a legend marks blocks as New or Improved]

Figure 5.8: Block Diagrams of HMFlow 2010a and HMFlow 2011.

5.3.1 Hard Macro Simulated Annealing Placer

As described in Section 4.2.6, a common algorithm used for FPGA placement is simulated
annealing [27], as it is effective at producing good results, although at the expense
of longer runtimes. Regardless of the characteristic runtimes for the algorithm, a simulated
annealing placer for hard macros was created out of a need to determine the potential
quality of hard macro-placed designs. It was important to understand how much circuit
quality was being given up by the heuristic placer used in HMFlow 2010/HMFlow 2010a.
A very inefficient and slow placer was implemented that gave much better results than the
placer used in HMFlow 2010, but at the cost of significantly longer runtimes (a few minutes

85

versus a few seconds). Performance and runtime comparisons of both placers are deferred
to Section 5.4 where a more careful analysis can be presented.
It was found later that accelerating the schedule of the simulated annealing placer
so it finished much sooner had very little impact on the overall quality of the resulting
implementation. By using a bounding box optimization (to detect overlap) for the hard
macros and a faster cooling schedule, the slow, high quality simulated annealing placer
became a significant improvement over the HMFlow 2010 placer with a reasonable increase
in runtime.

Basic Algorithm
The basic idea behind simulated annealing is to mimic the heat treatment process
used in metallurgy to improve the organization of the structure of metals, making
them more ductile and workable. Annealing occurs by the diffusion of atoms within the
metal so that the metal progresses towards an equilibrium state. By heating the metal to a
very high temperature and then causing it to cool at a controlled rate, the atoms have the
needed energy to break bonds and move toward an equilibrium.
In much the same way, the FPGA placement problem can be solved where the logic
primitives can be “heated” and then slowly cooled so that they move toward an equilibrium, or high quality placement. The basic algorithm for simulated annealing is given in
Algorithm 1.
The process begins by creating some initial state or initial placement, S, as a starting
point for the algorithm. A starting temperature, t, is chosen or calculated; generally it is a
large value to allow several uphill moves (moves which increase the total cost of the system) to
be accepted at the beginning of the process. Then, several moves are made and evaluated
at each temperature step. When a move decreases the total system cost (calculated by the
function Cost()), the move is always accepted. If the move increases the total system cost, it
is accepted with some probability which is a function of the magnitude of the change in cost
and also the temperature (see line 8 of Algorithm 1). When the temperature is high, uphill
moves are more likely to be accepted. However, when the temperature is low, uphill moves
are less frequently accepted.

Algorithm 1 Basic Simulated Annealing
 1: S ← createInitialPlacedSolution()
 2: t ← initialStartingTemp
 3: while not reachedStoppingCriteria() do
 4:   for i = 0 to movesPerTemperatureStep do
 5:     S_i ← generateRandomMove()
 6:     if Cost(S) ≥ Cost(S_i) then
 7:       S ← S_i
 8:     else if e^((Cost(S) − Cost(S_i))/t) ≥ random[0, 1) then
 9:       S ← S_i
10:     end if
11:   end for
12:   t ← t ∗ temperatureReduceRate
13: end while
14: return S

After movesPerTemperatureStep moves have been made, the temperature is decreased by
some percentage, temperatureReduceRate. This process repeats until some stopping
criterion is met, which is often when the system cost does not change after several moves.
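The loop in Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration only, not the HMFlow implementation: the function name, the parameter defaults, and the temperature-floor stopping test are assumptions made for the example, and `cost` and `random_move` are caller-supplied stand-ins for Cost() and generateRandomMove().

```python
import math
import random


def simulated_annealing(initial, cost, random_move, start_temp,
                        reduce_rate=0.9, moves_per_step=100,
                        min_temp=1e-3, seed=0):
    """Annealing loop with the same shape as Algorithm 1 (illustrative)."""
    rng = random.Random(seed)
    state, t = initial, start_temp
    while t > min_temp:                       # assumed stopping criterion
        for _ in range(moves_per_step):
            candidate = random_move(state, rng)
            delta = cost(candidate) - cost(state)
            # Downhill or equal-cost moves are always taken; uphill moves
            # are taken with probability exp(-delta / t), as on line 8.
            if delta <= 0 or math.exp(-delta / t) >= rng.random():
                state = candidate
        t *= reduce_rate                      # cool by a fixed ratio
    return state
```

Any state representation works as long as `random_move(state, rng)` returns a new candidate state and `cost(state)` returns a number; a placer would pass a placement and a wire length estimate.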

Customization of Simulated Annealing for Hard Macros
The implementation of the simulated annealing placer for hard macros created for
HMFlow 2011 had a number of customizations which changed or deviated from the basic
algorithm shown in Algorithm 1. The customizations are listed below:
1. The number of moves per temperature step varies based on the acceptance rate of the
moves being generated. The number of moves per temperature step increases when the
acceptance rate is near a more productive acceptance rate, which was determined to be
approximately 44% [33].
2. The inner for loop was replaced with a while loop, as moves were counted toward the
moves per temperature step only when they were accepted, rather than counting all moves.
3. The initial starting temperature was set to 1.5 times the initial system cost.
4. The process stopped if either the move acceptance rate fell below 2% or the temperature
dropped below 0.01.
5. The cost function was the sum total measure of Manhattan distances between all port
connections in between hard macros.
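Customizations 3 to 5 above are simple enough to express directly. The following sketch assumes a hypothetical data layout (port names mapped to (x, y) tile coordinates and connections given as source/sink name pairs); the function names are illustrative and are not taken from HMFlow.

```python
def manhattan(a, b):
    """Manhattan distance between two (x, y) tile coordinates."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def placement_cost(connections, port_xy):
    """Customization 5: sum of Manhattan distances over port connections.

    connections: iterable of (source_port, sink_port) names (hypothetical).
    port_xy:     dict mapping port name -> (x, y) tile coordinate.
    """
    return sum(manhattan(port_xy[s], port_xy[d]) for s, d in connections)


def initial_temperature(initial_cost, factor=1.5):
    """Customization 3: start at 1.5 times the initial system cost."""
    return factor * initial_cost


def should_stop(accepted, attempted, temperature,
                min_accept_rate=0.02, min_temp=0.01):
    """Customization 4: stop on a low acceptance rate or low temperature."""
    rate = accepted / attempted if attempted else 1.0
    return rate < min_accept_rate or temperature < min_temp
```

For example, two connections of lengths 7 and 2 tiles give a cost of 9 and therefore a starting temperature of 13.5.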
Most of these customizations were chosen through many iterative placement attempts and
trial and error. The number of possible modifications and performance tweaks to the algorithm
is limitless, and exploring the effects of each parameter is beyond the scope of this
dissertation. However, several optimizations are described later in this chapter that do
explore and evaluate the performance of particular changes.

Challenges of Hard Macros in a Simulated Annealing Placer
One of the main challenges of performing simulated annealing on a design which is
composed mostly of hard macros is that it takes much more effort to validate a move
than in conventional designs. This is due to the way in which a move must
be validated for a hard macro versus a single primitive instance. Regular FPGA designs
are composed primarily of primitive instances. Instances are placed on compatible primitive
sites, and checking whether a placement is valid simply requires confirming that the
primitive site is compatible and unoccupied. However, when a hard
macro is moved, it may contain dozens or even hundreds of primitive instances in addition
to several routed nets. In a potential move, each instance must be checked for a compatible,
unoccupied site, and each PIP within each routed net must also be verified
to make sure the move is valid.

Figure 5.9: (a) Representation of a Set of Hard Macros Drawn with a Tight Bounding Box (b)
An Approximated Bounding Box for the Same Hard Macros for Accelerating the Simulated
Annealing Hard Macro Placer
Because of the significant bottleneck of checking hard macros for valid moves, trying
millions of moves (a typical quantity for regular simulated annealing placers) takes an
extremely long time and is quite unreasonable for a rapid compilation approach. To help
increase the number of moves the placer could perform each second, a few optimizations were
employed. First, during the creation of each hard macro, all of its valid placement locations
were pre-computed off-line and stored with the hard macro in the hard macro cache. This
saved several seconds of runtime when initializing the placer and allowed the move generator
to choose from a set of valid placement locations rather than randomly choosing locations
that often would not support the hard macro.
Second, rather than using a tight bounding box around each hard macro as shown
in Figure 5.9a, each hard macro had an approximating bounding box calculated around all
of its logic and routing as shown in Figure 5.9b. The bounding boxes accelerated move
validation in that only the bounding box needed to be moved and tested for overlap with
other bounding boxes in the design, rather than each primitive instance and PIP in the
hard macro. The bounding box approximation did restrict to some degree the flexibility of
how hard macros could be placed (the bounding boxes contained some empty space, making
for some inefficiencies), but the trade-off reduced runtime enough to make simulated
annealing a feasible placement technique.

Figure 5.10: An Illustration of the Problem of an Approximating Bounding Box Where the
Box Changes Size Based on Location
Another challenge that arose from using the approximating bounding box was that
the Xilinx FPGA fabric is not always uniform, so the size of a bounding box can depend
on where it was originally calculated. To illustrate this concept, consider Figure 5.10. On
the left of the figure is the outline of a hard macro that consumes two horizontally adjacent
CLB tiles. This would create a bounding box 1 tile high and 3 tiles wide. However,
now consider the same hard macro on the right side of Figure 5.10. An extra column of
configuration tiles is found in between the two columns of CLBs. This creates a bounding
box 1 tile high and 4 tiles wide.
The problem occurs when a hard macro's bounding box is calculated where it has the smaller
1x3 size and the hard macro is then moved to a spot where it should have a 1x4 bounding box.
When hard macros are big enough, they can span multiple columns and rows of tiles, which can
ultimately cause the bounding box to be too small for a specific placement. This problem, on
rare occasions, can lead to overlapping hard macro placements, which is a placement failure.
In trying to remedy this issue, it was determined that trying to compensate for the
placement-varying bounding box would eliminate much of the runtime savings it created;
thus an alternative solution was chosen. The solution implemented was that all hard macros
would be checked for valid placements at the end of the simulated annealing process.
If any hard macros were part of an invalid placement (overlap), the smaller of the two hard
macros was re-placed at the closest valid location to its final placer-decided location. This
situation occurred quite infrequently (once or twice in approximately 10 placement runs) and
did not significantly impact placement quality, but it did allow the bounding box runtime
optimization to be preserved.
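The bounding box move validation described in this subsection reduces to an axis-aligned rectangle overlap test. The sketch below is illustrative only: `BBox` and `move_is_valid` are hypothetical names, coordinates are assumed to be inclusive tile indices, and the fabric non-uniformity discussed above is deliberately ignored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    """Axis-aligned box in tile coordinates; all four corners inclusive."""
    x0: int
    y0: int
    x1: int
    y1: int

    def moved_to(self, x, y):
        """Same-sized box with its lower-left corner at (x, y)."""
        return BBox(x, y, x + (self.x1 - self.x0), y + (self.y1 - self.y0))

    def overlaps(self, other):
        """True when the two boxes share at least one tile."""
        return not (self.x1 < other.x0 or other.x1 < self.x0 or
                    self.y1 < other.y0 or other.y1 < self.y0)


def move_is_valid(box, x, y, placed):
    """True if the box, moved to (x, y), overlaps none of the placed boxes."""
    candidate = box.moved_to(x, y)
    return not any(candidate.overlaps(p) for p in placed)
```

Each check costs a handful of integer comparisons per already-placed hard macro, independent of how many primitive instances and PIPs the hard macro contains, which is the source of the speedup the text describes.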

5.3.2 Register Re-placement
One of the biggest problems of large hard macro placement is the inherent presence
of long distance connections between hard macros that often become critical paths in design
implementation. The placer can reduce these paths to some extent; however, because of the
rigid nature of hard macros, circuitry is not nearly as fluid and malleable as in conventional
design compilation flows.
To try to reduce the effects of this problem, a new technique called register re-placement
is introduced. Register re-placement occurs after the placement phase but before routing
in HMFlow 2011. This technique leverages the common occurrence of registers
being present at hard macro ports. The basic process for register re-placement is given in
Algorithm 2.
The process begins by identifying all external hard macro nets (those nets connected
exclusively to hard macro ports). From the nets, all connections (a connection being a single
source pin to a single sink pin) can be extracted, and all connections that are longer
than a certain length (a Manhattan distance from source pin to sink pin of 10 tiles) are
considered for register re-placement. If a connection has a register at its source pin,
the register is moved to the centroid (arithmetic mean) of all the net's source and sink pins.
Otherwise, if it has a register at its sink pin, that register is moved instead (see Algorithm 2).
The centroid is the center of the smallest circle which encloses all of the points of interest. In
this case, it would be the original location of the register and all of the net's sink locations.

Algorithm 2 Optimal Register Re-placement
 1: C ← identifyHardMacroPortConnections()
 2: T ← 10
 3: S ← {}
 4: for each Connection c in C do
 5:   if c.getTotalLength() ≥ T then
 6:     r_s ← c.getSourceRegister()
 7:     r_d ← c.getDestinationRegister()
 8:     if r_s ≠ null and r_s ∉ S then
 9:       P ← getAllPinLocations(c)
10:       MoveRegister(r_s, findCentroid(P))
11:       S.add(r_s)
12:     else if r_d ≠ null and r_d ∉ S then
13:       P ← getAllPinLocations(c)
14:       MoveRegister(r_d, findCentroid(P))
15:       S.add(r_d)
16:     end if
17:   end if
18: end for

Figure 5.11: A Simple Example of a Register (A) Being Re-placed at the Centroid of the
Original Site of Register A and Sinks B, C and D
To illustrate the centroid of a net and how it is used to move a register, consider
Figure 5.11, where the top picture represents a portion of the FPGA fabric after a hard
macro design has been placed. The output of register A in the large hard macro on the
left is driving three separate sinks B, C and D in three other hard macros. Due to the
placement, the connections are very long, especially from register A to sink B. The
bottom picture of Figure 5.11 shows register A moved to the centroid (the center of the
smallest circle that still includes the 4 points). This re-placement of the register significantly
decreases the longest path the router would have to route in this particular case, ultimately
paving the way for a higher quality implementation.
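The register re-placement pass can be sketched as follows. Several things here are assumptions made for illustration: the centroid is computed as the arithmetic mean of the pin coordinates (the parenthetical definition in the text; the center of the smallest enclosing circle is in general a different point), the result is rounded to whole tiles, and the data layout and all names are hypothetical. A real flow would also have to snap the result to a free, compatible register site.

```python
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def centroid(points):
    """Arithmetic mean of (x, y) points, rounded to whole tiles."""
    n = len(points)
    return (round(sum(x for x, _ in points) / n),
            round(sum(y for _, y in points) / n))


def replace_registers(connections, net_pins, location, registers,
                      threshold=10):
    """Move at most one port register per long connection to the net centroid.

    connections: list of (source_pin, sink_pin) names.
    net_pins:    dict source_pin -> all pin names on that net (source + sinks).
    location:    dict pin name -> (x, y) tile; updated in place.
    registers:   set of pin names that are registers at hard macro ports.
    """
    moved = set()
    for src, dst in connections:
        if manhattan(location[src], location[dst]) < threshold:
            continue
        for pin in (src, dst):                 # source register is tried first
            if pin in registers and pin not in moved:
                pins = [location[p] for p in net_pins[src]]
                location[pin] = centroid(pins)
                moved.add(pin)
                break
    return moved
```

With register A at (0, 4) driving sinks at (24, 4), (12, 10) and (12, 2), the sketch moves A to (12, 5), which shortens the longest connection from 24 tiles to 13.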


5.3.3 Router Improvements
As seen in Chapter 4, the router used in HMFlow 2010 consumes the largest fraction
of the total runtime of any step in the flow. This is largely due to the higher percentage of
routes it has to complete when compiling smaller hard macro-based designs. Since measuring
the results of HMFlow 2010, small incremental improvements have been made to reduce
the runtime of the router by profiling and testing different code transformations but no
algorithmic changes. With the goal of improving clock rates with HMFlow 2011, the router
was re-visited to find additional improvements with minimal impact on increasing runtime.
One of the major inefficiencies found in the HMFlow 2010 router was the fact that
it often did not use long line routing resources efficiently when connections had to be made
over a long distance. This resulted in long distance connections being routed with several
shorter hops of less efficient resources that ultimately added up to a significant amount of
propagation delay for the net. This behavior was mostly a side effect of the driving force
behind the router’s original implementation which was “route as fast as possible.”

Long Lines
A long line in Xilinx FPGAs is the longest routing wire available. In the Virtex 4
architecture, long lines spanned 24 switch boxes; however, they were reduced in length to
18 switch boxes in the Virtex 5 architecture. The long line is also the only bi-directional
routing wire available in modern Xilinx devices. Long lines run both horizontally and
vertically. A long line is also capable of connecting to every 6th switch box in its
path, as illustrated in Figure 5.12.

[Figure 5.12 panels: Virtex 4 Horizontal Long Line; Virtex 5 Horizontal Long Line.]
Figure 5.12: Representation of Virtex 4 and Virtex 5 Long Line Routing Resources
By using long lines, routing connections that are very far apart can be connected
with relatively few long lines. In Virtex 4, the next longest wire is the hex line providing a
connection distance of 6 switch boxes. In Virtex 5, the next longest routing resource is the
pent line which provides a connection distance of 5 switch boxes.
Unfortunately, the timing information of the wires found in Xilinx FPGAs is not
publicly available and therefore wire delay cannot be used for comparison. However, it
is clear that the number of wire segments used in routing a connection can often have a
bigger impact on delay than the connection's length, so if the use of a long line can
significantly reduce the total number of wire segments in the routed connection, it can also
significantly reduce delay.
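The segment-count argument can be made concrete with a small calculation. The sketch below uses only the wire spans quoted in the text and assumes a straight, unobstructed connection; it ignores the wiring needed to enter and leave a long line and says nothing about actual delay, which is not publicly available. The names are illustrative.

```python
import math

# Wire spans in switch boxes, as quoted in the text.
WIRE_SPAN = {
    "virtex4": {"long": 24, "hex": 6},
    "virtex5": {"long": 18, "pent": 5},
}


def segments_needed(distance, span):
    """Fewest end-to-end wires of a given span that cover the distance."""
    return math.ceil(distance / span)
```

For a straight 36 switch box connection in Virtex 5, `segments_needed(36, 18)` gives 2 long lines while `segments_needed(36, 5)` gives 8 pent lines.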

Long Line Router
To overcome the long distance routing inefficiency of the router, a specialized long
line router was developed to specifically route long distance connections during the routing
process. This long line router finds very good routes using as many long lines as possible
to get closer to the sink of the connection. Once the route has used as many long lines as
possible, the routing is then finished by the main routing algorithm.
The long line router is essentially a maze router that only uses long line resources. It
is invoked by the main router when it encounters a connection with a Manhattan distance
of 12 switch boxes or more between source and sink. Preliminary testing on a handful of
designs and routes showed that 12 switch boxes was the minimum distance at which
the long line router provided benefit. This is because there is some extra distance involved
in connecting the source to the starting point of the long line as well as the terminal point
of the long line to the sink. These shorter connections cannot be routed as efficiently as
with the Xilinx tools for lack of Xilinx proprietary timing information. Therefore, in order
to avoid situations where the long line router produces routes that have a longer delay than
the conventional HMFlow router (see footnote 2), this threshold is imposed.
In addition, long lines can only be used in discrete spans of 6, 12 or 18 hops, as
previously shown in Figure 5.12, and a route that uses only 6 hops of a long line still
incurs the delay of all 18 hops. Therefore, setting a threshold of 12 reduces the likelihood
of the long line router utilizing a long line for a distance of only 6 hops. The goal of the
threshold was simply to invoke the long line router when it would be likely to provide benefit
and avoid situations where it could introduce unnecessary delay into a route.
Once the long line router is invoked, the first task is to find the closest and most
efficient entry point to a nearby long line. This is a higher quality search than is performed
by the main router and takes a bit longer to complete when compared with equivalent
workloads on the main router. If no long line resource is available, the long line router
fails and returns the routing task to the main router to complete the net without long line
optimization. This occurs quite infrequently, but it does happen when congestion is high
around the source of the connection.
Once an efficient path to a long line has been found, the long line router will change
modes to only search out a path using long line resources. Again, the algorithm is based on
a maze router and will end prematurely if too much congestion is encountered. However, if a
partial route using long lines was found, the long line router will provide the partial route to
the main router as a starting point. This happens more frequently, but still provides some
propagation delay improvements.
Footnote 2: It may be interpreted by the reader that the conventional HMFlow router has
access to proprietary Xilinx timing information. This is not true. The only reason it may
produce routes with lower delays for short routes than the long line router is the fact that
it is not being constrained to use a long line and has greater flexibility to explore for a
better solution.

When the long line router exhausts the options to find the closest long line exit to the
sink, it returns the partially routed path back to the main router to finish the final connection.
Regardless of whether the long line router succeeds in finding a partial or complete long line
path, the main router's parameters change to try to obtain a higher quality path, as the
longest paths are more likely to become a critical path in the final implementation. These
changes increase router runtime by 15–20% on average.
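The hand-off between the two routers described above can be summarized as a small dispatch function. This is a sketch of the control flow only: `long_line_route` and `main_route` are hypothetical callables standing in for the two routers, and applying the higher-effort settings even when no long line is found is one reading of the text rather than a confirmed detail.

```python
def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def route_connection(src, sink, long_line_route, main_route, threshold=12):
    """Route one connection, trying long lines first when it is far enough.

    long_line_route(src, sink) -> list of wires (possibly partial) or None
    main_route(src, sink, prefix, high_effort) -> complete list of wires
    """
    if manhattan(src, sink) < threshold:
        # Short connection: the main router handles it with normal settings.
        return main_route(src, sink, [], False)
    # None means no long line entry was found (for example, congestion).
    prefix = long_line_route(src, sink) or []
    # Long connections are likely critical paths, so the main router
    # finishes them with its higher-quality settings (assumed here to
    # apply even when the long line router found nothing).
    return main_route(src, sink, prefix, True)
```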

5.4 Performance Analysis
Once all three of the HMFlow 2011 improvements were complete, it was necessary
to analyze their impact on quality of result. The ultimate goal of these improvements
was to improve implementation clock rate while limiting runtime increase. In
order to accurately and fairly measure the quality of result of each improvement, a variety of
configurations were created and run with all 6 of the large hard macro benchmark designs.

5.4.1 Result Measurement Fairness
By introducing the simulated annealing algorithm into the hard macro placer, the
quality of result and variability become a function of the random number generator seed used
in the algorithm. If the seed is always the same, then the results will always be repeatable
but potentially limited in quality. However, it is common practice to run the Xilinx compilation
process with different seeds (Xilinx calls them table cost entry values). In fact, Xilinx has
provided users with a specific tool called SmartXplorer that allows several compilation jobs
to be run either in series or in parallel on a cluster where each job uses a different seed or
configuration. This technique can be quite effective in obtaining higher quality results as it
is very likely that the default seed may not provide the best possible implementation.
Due to the variability of some configurations of HMFlow, a single result does not
accurately portray the true performance of the tools, and additional metrics are given to
supplement the results. For example, the Xilinx tools have a default mode which will produce
a repeatable result of a particular quality. However, they can also be configured to run with
99 other table cost placement entries, which will vary the performance obtained. Better
circuits will often be found by running all 100 configurations than by a single run.
Therefore, three results are provided for each configuration and each benchmark: the first is
a single run with a default configuration, the second is the average (arithmetic mean) of 100
runs with different seeds, and the third is the best result obtained from the same 100 runs.
One of the major reasons for providing all three metrics is that some designers may
have a cluster of computers suitable for farming out the 100 jobs to run in parallel. In this
scenario, the best of 100 results could be obtained rather quickly and would be the metric
of interest. However, other designers may not have such a luxury, could only run 2 or 4
jobs in parallel, and would be more interested in the average (or expected) result. To aid
decision making for both groups of designers, both metrics are included in addition to the
single run result when relevant.
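The three reported metrics are straightforward to compute from a list of per-seed clock rates. The helper below is a hypothetical illustration, and it assumes that the first entry of the list is the default-configuration run.

```python
from statistics import fmean


def summarize_runs(clock_rates_mhz):
    """Single-run, average, and best-of-N clock rates for one benchmark.

    clock_rates_mhz[0] is assumed to be the default-configuration run.
    """
    return {
        "single": clock_rates_mhz[0],
        "average": fmean(clock_rates_mhz),
        "best": max(clock_rates_mhz),
    }
```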

5.4.2 Results of Three HMFlow 2011 Improvements
To get a sense of the impact of each improvement added to HMFlow 2011, all 8
possible configurations of the three improvements within the flow were tested separately
on all 6 benchmarks. The first configuration (Configuration 0) is the absence of all three
improvements and is the same as HMFlow 2010a. This configuration is considered the
baseline against which all 7 other configurations are compared. All of the configurations are
summarized in Table 5.6.
As can be seen from Table 5.6, there are two placers (the heuristic hard macro placer
from HMFlow 2010 and the new simulated annealing hard macro placer added in HMFlow
2011), an optional register re-placement step, and two routers (the router used in HMFlow
2010 and the new long line-enabled router in HMFlow 2011). All 8 configurations were tested
with the 6 large hard macro benchmark circuits. Those configurations that did not use the
simulated annealing hard macro placer (C0, C2, C3, and C5) were not run 100 times, as the
heuristic placer does not depend on a random seed as its input.

Table 5.6: All HMFlow 2011 Improvement Configurations Tested
                 Placer                           Register       Router
Configuration    Heuristic   Simulated            Re-placement   Design    Long Line
Name             (2010)      Annealer (2011)                     (2010)    enabled (2011)
C0 (Baseline)    X                                               X
C1                           X                                   X
C2               X                                X              X
C3               X                                                         X
C4                           X                    *              *         *
C5               X                                X                        X
C6                           X                    *              *         *
C7 (Best)                    X                    X                        X
* C4 and C6 each combine the simulated annealing placer with one of the two remaining
improvements (register re-placement or the long line-enabled router); which configuration
uses which is not recoverable from this copy of the table.

5.4.3 Results of Optimizing HMFlow 2011 Improvements
The results of the eight configurations of HMFlow 2011 improvements are shown in
Tables 5.7, 5.8, and 5.9. Each table includes results from the Xilinx tools compiling the
benchmarks to provide a comparison. The Xilinx tools were run in 'Performance Evaluation
Mode' that attempts to obtain good results in a reasonable amount of time without a timing
constraint. This mode was chosen because it most closely matched the goals of HMFlow
2011. Also, all units in the three tables are in MHz unless otherwise specified.
Table 5.7: HMFlow 2011 Benchmark Clock Rates of Single (Default) Run (in MHz)
Benchmark          C0     C1      C2      C3      C4      C5      C6      C7      Xilinx
frequency est.     116    146     156     128     189     165     177     194     227
trellis decoder    62     88      88      74      111     114     123     128     251
brik3              66     118     77      84      115     81      132     122     241
brik2              82     99      122     101     139     141     113     153     203
multiband corr.    67     83      87      94      127     109     114     146     250
brik1              65     153     127     80      167     144     171     189     200
Average            76     115     109     93      141     126     138     155     229
Improvement               1.5×    1.43×   1.22×   1.85×   1.65×   1.81×   2.04×   2.99×

Table 5.8: HMFlow 2011 Benchmark Clock Rates of Average of 100 Runs (in MHz)
Benchmark          C0     C1      C2      C3      C4      C5      C6      C7      Xilinx
frequency est.     116    144     156     128     170     165     167     186     220
trellis decoder    62     100     88      74      124     114     123     139     247
brik3              66     108     77      84      116     81      119     123     238
brik2              82     91      122     101     131     141     115     153     198
multiband corr.    67     101     87      94      151     109     124     177     253
brik1              65     144     127     80      169     144     169     168     203
Average            76     115     109     93      143     126     136     158     226
Improvement               1.5×    1.43×   1.22×   1.88×   1.65×   1.78×   2.06×   2.97×

Table 5.7 shows the clock rates obtained for each of the benchmarks when executing
a single default compilation run of the tools. Table 5.8 shows the average clock rates
obtained from compiling each benchmark 100 times, each compilation using a different seed or
table cost entry. For those configurations where a seed does not have an effect (heuristic
placer configurations C0, C2, C3, and C5), the values are the same as those in Table 5.7.
Interestingly, the single run and average of 100 run results are very similar in their average
clock rates and average improvement in clock rates over the baseline, C0. This supports the
notion of a single run being a good representation of a tool's performance. This observation
also holds true for the Xilinx tools' results.
Table 5.9: HMFlow 2011 Benchmark Clock Rates of Best of 100 Runs (in MHz)
Benchmark          C0     C1      C2      C3      C4      C5      C6      C7      Xilinx
frequency est.     116    194     156     128     208     165     217     228     264
trellis decoder    62     124     88      74      157     114     157     176     280
brik3              66     139     77      84      146     81      146     157     260
brik2              82     134     122     101     167     141     160     180     205
multiband corr.    67     148     87      94      207     109     172     223     270
brik1              65     191     127     80      214     144     205     212     214
Average            76     155     109     93      183     126     176     196     249
Improvement               2.03×   1.43×   1.22×   2.4×    1.65×   2.31×   2.57×   3.26×

Table 5.9 shows the best clock rate obtained from each set of 100 compilation
runs. These results could be obtained very quickly if all 100 compilation runs could be run
in parallel, such as on a supercomputer or cluster. When compared with the single run and
average of 100 runs results, the best of 100 runs results improve on the C0 baseline by
approximately 50 percentage points more (for C7, 2.57× versus 2.04× and 2.06×).
This provides a significant advantage to those with easy access to parallel computing power.
However, the advantage is smaller for the Xilinx tools, whose best of 100 results are only
approximately 25–30 percentage points better by the same measure (3.26× versus 2.99×
and 2.97×).

Impact of Improvements
The simulated annealing hard macro placer had the greatest impact in performance
improvement. Compared to baseline (C0) implementations, replacing the heuristic placer
with the simulated annealing version (C1) increased clock rates on average by 50%. The
next most impactful improvement was the addition of the register re-placement step (C2)
that improved clock rates by 43%. The improvement of the long line optimized router (C3)
had the smallest impact when measured in isolation, providing improved clock rates of only
22%. These results illustrate the impact a good or bad placement can have on the overall
implementation clock rate. The router is also likely to be disadvantaged as the timing
information on which a router could make intelligent decisions is not freely available at a
sufficient level of detail.
Dual combinations of the improvements (C4, C5, and C6) were closely additive in their
improvement of clock rate, showing that each improvement was mostly independent
of the others. C7, the configuration which combined all three of the improvement techniques,
gave the best clock rate improvement: approximately 2× in the single run and average cases,
and approximately 2.5× in the best of 100 runs case.
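As a rough check of the additivity claim, the isolated gains over the C0 baseline in Table 5.7 can be summed and compared with the measured combined results. This is only an illustrative calculation: it assumes that "additive" refers to gains over baseline, and it uses the fact that C5 is the one heuristic-placer configuration that combines the other two improvements (C2 and C3).

$$
\begin{aligned}
\text{C5 (C2 + C3):}\quad & 1 + 0.43 + 0.22 = 1.65\times && \text{(measured: } 1.65\times\text{)}\\
\text{C7 (C1 + C2 + C3):}\quad & 1 + 0.50 + 0.43 + 0.22 = 2.15\times && \text{(measured: } 2.04\times\text{)}
\end{aligned}
$$

The two-way combination matches the sum exactly, while the three-way combination falls slightly short of it, which is consistent with the improvements being mostly, but not entirely, independent.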

Comparison of HMFlow 2011 to Xilinx
Xilinx still produced the fastest clock rates on average when compared to HMFlow
2011. Clock rates of Xilinx-implemented designs were on average almost 50% faster than the
C7 clock rates. However, if the C7 results of Table 5.9 are compared to the Xilinx results
of Table 5.7, the resulting implementation clock rates are much closer, and in two cases
(frequency estimator and brik1) the HMFlow results are actually faster than those produced
by Xilinx.

Variance of Results
It was obvious from the results that the simulated annealing placer introduced variance of the quality of the results when comparing 100 runs of the same design. The lingering
question after these results were obtained was, how much variance existed and how did it
compare to the variance of the Xilinx tools? Table 5.10 shows a summary of the variance of
the 100 runs for both HMFlow 2011 and Xilinx for the 6 benchmarks.
Table 5.10: Variance of HMFlow 2011 (C7) and Xilinx in 100 Compilation Runs
Benchmark          HMFlow 2011   Xilinx
frequency est.     0.368         0.094
trellis decoder    0.637         0.050
brik3              2.054         0.027
brik2              0.437         0.007
multiband corr.    0.460         0.011
brik1              0.396         0.020
Average            0.725         0.035

The variance ρ was calculated across the entire population of 100 implementation
results for each benchmark using the standard equation

$$\rho = \sum_{i=1}^{n} (x_i - \mu)^2 \tag{5.1}$$

where µ is the arithmetic mean calculated by

$$\mu = n^{-1} \sum_{i=1}^{n} x_i. \tag{5.2}$$

The variance of HMFlow 2011 is significantly higher than that of the Xilinx tools,
more than 20× higher. This illustrates the ability of the Xilinx tools to produce results
that are much closer together than those of HMFlow 2011. This is significant because
a designer using HMFlow in a single run scenario is much more
likely to obtain a poor result, due to the large variation of the results, than
a designer using the Xilinx tools.
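For readers who want to reproduce this kind of comparison on their own run data, the sketch below computes the spread of a set of results. Equation 5.1 as printed is a sum of squared deviations, whereas the usual population variance also divides by n; the text does not state which normalization (or which unit) underlies Table 5.10, so both are shown and neither is claimed to reproduce the table. The function names are illustrative.

```python
from statistics import fmean


def sum_sq_dev(xs):
    """Sum of squared deviations from the mean (Equation 5.1 as printed)."""
    mu = fmean(xs)                      # Equation 5.2
    return sum((x - mu) ** 2 for x in xs)


def population_variance(xs):
    """Population variance: the same sum divided by the number of samples."""
    return sum_sq_dev(xs) / len(xs)
```

Because both tools are summarized over the same number of runs, the ratio between their spreads is the same under either normalization.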

5.5 Techniques for Reducing Variance
To try and improve the robustness of the new improvements found in HMFlow 2011,
three techniques were considered to attempt to minimize the total variance of the results
produced by the compilation flow. It was determined through some preliminary tests that
the source of much of the variability in HMFlow 2011 was due to the simulated annealing
placer. Therefore, the techniques to reduce variability changed the behavior of the simulated
annealing placer. The three techniques are summarized below:
1. Change move acceptance (when a move increases the system cost) to be a function of
the hard macro port count
2. Perform a small hard macro re-placement optimization step (similar to register re-placement)
3. Change the cost function to include a factor of the longest wire length between any
two hard macros
Each technique will be referred to hereafter as T1, T2 and T3 respectively. The
remainder of this section will describe each technique to reduce variance in detail and also
present results that show the best configuration to use in each case. Finally, all possible
configurations of the three techniques are compared and the best performing (lowest variance)
is selected.


5.5.1 T1: Move Acceptance a Function of Hard Macro Port Count
T1 is a technique that was designed to address a behavior that was noticed in placing
large hard macros during the simulated annealing process. The hard macros that had several
ports (generally larger hard macros) were being moved less often than other hard macros
with fewer ports. This often led to lower quality placements because the larger hard macros
would not benefit from as many moves as other hard macros. The equation of interest to
this problem is the move acceptance equation

$$r < e^{-\frac{\Delta C}{T}}. \tag{5.3}$$

The acceptance equation determines how often a move is accepted when it increases
the total system cost (moves that decrease total cost are always accepted). When the net
change in total system cost ∆C is large, as is often true in the case of moves involving
hard macros with several ports, the likelihood of accepting the move is lower. The current
temperature T also reduces the likelihood of accepting the move as it decreases. The right
hand side of Equation 5.3 is compared to a random number r that is generated inside the
interval (0, 1]. If the right hand side of the equation is larger than r, the move is taken;
otherwise, it is discarded.
To attempt to equalize the number of moves or perturbations each hard macro received, the acceptance equation was changed from Equation 5.3 to

$$ r < e^{-\frac{\Delta C}{T + QP}}. \tag{5.4} $$

Two factors were added to the denominator of the exponent on the right hand side: first,
the number of ports P involved in the hard macro move and second, a scaling factor Q to
control the impact of the change. If Q = 0, Equation 5.4 reduces to Equation 5.3, so the
variance should be roughly the same (apart from random seed differences) as that of C7
shown in Table 5.10.
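The modified acceptance test can be sketched in a few lines of Python. This is a minimal illustration of Equations 5.3 and 5.4 only, not the HMFlow source: the function and parameter names are hypothetical, the temperature is assumed to be positive, and treating a zero-cost move as always accepted is an assumption.

```python
import math
import random


def acceptance_probability(delta_c, temperature, ports=0, q=0.0):
    """Probability of accepting a move with cost change delta_c.

    Uphill moves use exp(-delta_c / (temperature + q * ports)), i.e.
    Equation 5.4. With q == 0 this is exactly Equation 5.3.
    """
    if delta_c <= 0:
        # Assumption: moves that do not increase cost are always accepted.
        return 1.0
    return math.exp(-delta_c / (temperature + q * ports))


def accept_move(delta_c, temperature, ports=0, q=0.0, rng=random):
    """Return True if the move is taken, False if it is discarded."""
    if delta_c <= 0:
        return True
    # rng.random() lies in [0, 1); subtracting from 1 gives r in (0, 1].
    r = 1.0 - rng.random()
    return r < acceptance_probability(delta_c, temperature, ports, q)


if __name__ == "__main__":
    # Same uphill move, evaluated for a 1-port and an 88-port hard macro.
    for ports in (1, 88):
        for q in (0.0, 10.0, 20.0):
            p = acceptance_probability(50.0, 10.0, ports=ports, q=q)
            print(f"ports={ports:2d} Q={q:4.1f} P(accept)={p:.4f}")
```

Because QP grows with the port count, the same uphill ∆C is accepted more readily for a hard macro with many ports than for one with few, which is the equalizing effect the technique is after.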

Figure 5.13: Plot of Benchmark Brik1 Uphill Accepted Moves for Each Hard Macro
[Plot: number of moves per hard macro (0 to 450) vs. hard macro (increasing port count from
left to right, 0 to 45); series: Q=0, Q=10, Q=20.]

After making this change to the acceptance equation, some preliminary tests were
conducted to verify that the change indeed behaved as expected. To illustrate the behavior
of this technique, consider Figure 5.13 which numbers all of the hard macros from benchmark
brik1 from 1 to 45. Each accepted uphill move (moves with a positive ∆C) for each hard
macro is tabulated for three scenarios: Q set to 0, 10, and 20. The port counts on these 45
hard macros vary from 1 to 88 and the hard macros are sorted in increasing port count from
left to right along the X-axis. When Q is 0 (baseline or C7), many of the hard macros with
larger port counts do not have as many accepted uphill moves. However, as Q increases to
10 and then 20, the hard macros with higher port counts receive a more equal share of the
total accepted uphill moves.
With the desired behavior of this technique controlled by the parameter Q, a parameter sweep
was conducted in order to determine the best value of Q to reduce variance of the
implementation output. A graph of the scaling factor sweep of Q is shown in Figure 5.14
and an enlarged version of the range 0 to 11 in Figure 5.15.

Figure 5.14: T1 Variance of Scaling Factor Q Parameter Sweep
[Plot: variance of clock period (0 to 12) vs. port scaling factor Q (log scale, 10^-1 to 10^5);
series: brik1, brik2, brik3, frequency_estimator, multiband_correlator, trellis_decoder,
Avg. Benchmarks 2011.]

Figure 5.14 shows some impact to the variance in the interval where 0 < Q < 11.
However, when Q > 11, the variance rises significantly. This is likely due to a significant
reduction in the number of accepted uphill moves which allow the simulated annealing process
to escape local minima. Overall, the best performing scaling factor was 2.0 with a variance
of 0.492.
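The shape of such a parameter sweep can be sketched as follows. This is only an outline under stated assumptions: `run_flow` is a hypothetical caller-supplied function that compiles one benchmark with one parameter value and one random seed and returns the resulting clock period, and the use of the sample variance (rather than the population variance) is an assumption, since the text does not say which was used.

```python
import statistics


def sweep(run_flow, benchmarks, values, runs=100):
    """Return {value: variance of clock period averaged over benchmarks}.

    run_flow(benchmark, value, seed) is a hypothetical hook standing in
    for one complete compilation; it must return a clock period.
    """
    result = {}
    for value in values:
        per_benchmark = []
        for bench in benchmarks:
            periods = [run_flow(bench, value, seed) for seed in range(runs)]
            # Assumption: sample variance over the repeated runs.
            per_benchmark.append(statistics.variance(periods))
        result[value] = statistics.mean(per_benchmark)
    return result


def best_value(sweep_result):
    """Parameter value with the lowest average variance."""
    return min(sweep_result, key=sweep_result.get)
```

The same loop applies to any single tunable parameter of the placer; only the meaning of `value` changes.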

5.5.2 T2: Small Hard Macro Re-placement

T2 is a technique that mimics the behavior of the register re-placement improvement
in HMFlow 2011. It was found in some preliminary testing that the register re-placement
step reduced variability in the quality of results produced by HMFlow 2011, and it was thought
that exploiting that technique within the placer could provide further reduction of variability.

Figure 5.15: T1 Variance of Scaling Factor Q Parameter Sweep (Zoomed)
[Plot: variance of clock period (0 to 2.2) vs. port scaling factor Q (log scale); series: brik1,
brik2, brik3, frequency_estimator, multiband_correlator, trellis_decoder,
Avg. Benchmarks 2011.]

The technique works by modifying placement after the simulated annealing process
has completed. It iterates over all of the small hard macros (which are quite likely to
contain registers) in a design and performs re-placement on them so that they are placed
at the centroid of all of their port connections using the same approach as described in
Section 5.3.2. This is a greedy optimization that attempts to reduce wire length and critical
paths so that the final implementation will be closer to the ideal clock rate.
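The greedy pass described above can be sketched as follows. This is a simplified illustration and not the approach of Section 5.3.2 itself: the `HardMacro` record, the use of a plain arithmetic mean for the centroid, and the `is_legal` hook (a stand-in for whatever site-legality and overlap checks a real placer needs) are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class HardMacro:
    name: str
    tiles: int                  # size of the hard macro in tiles
    location: tuple             # (x, y) of the current placement
    connections: list = field(default_factory=list)  # (x, y) of connected ports


def centroid(points):
    """Rounded arithmetic mean of a non-empty list of (x, y) points."""
    n = len(points)
    return (round(sum(x for x, _ in points) / n),
            round(sum(y for _, y in points) / n))


def replace_small_macros(macros, max_tiles, is_legal=lambda macro, xy: True):
    """Move every macro of at most max_tiles to the centroid of its ports.

    Runs once after annealing has finished; returns the names of the
    macros that were moved.
    """
    moved = []
    for macro in macros:
        if macro.tiles > max_tiles or not macro.connections:
            continue
        target = centroid(macro.connections)
        if target != macro.location and is_legal(macro, target):
            macro.location = target
            moved.append(macro.name)
    return moved
```

The `max_tiles` argument is the size ceiling that the sweep in this section tunes.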
The main question in implementing such a technique is, up to what size hard macro
should be moved? There should be a ceiling on the effectiveness of moving ever larger hard
macros as moving larger hard macros will tend to create more problems than it solves. In
order to answer this question, a parameter sweep in hard macro tile size was performed.
Hard macro size is measured by the number of tiles it occupies. The results of the
parameter sweep of hard macro size versus variance of results are shown in Figure 5.16 and
an enlarged version of the graph from 0 to 100 tiles in Figure 5.17.

Figure 5.16: T2 Variance of Hard Macro Size Re-placement Parameter Sweep
[Plot: variance of clock period (0.5 to 3) vs. hard macro re-placement size in tiles (log scale,
10^1 to 10^3); series: brik1, brik2, brik3, frequency_estimator, multiband_correlator,
trellis_decoder, Avg. Benchmarks 2011.]

The average variance plot exhibits step-like behavior at about 150, 400 and
2000 tiles. This is due to a benchmark experiencing significant quality reduction when
trying to move hard macros of a certain size. The lower variance results were obtained
when only moving hard macros of size 150 tiles or less. The lowest variance result (0.499)
was obtained when moving hard macros of 20 tiles or less and this value was chosen as the
final configuration for this technique.

5.5.3 T3: Cost Function Includes Longest Wire

In preliminary tests it was found that total wire length between hard macros was
a better performing cost function than the longest wire in between any two hard macros.
However, when using large hard macros that contain timing verified connections that have
been placed and routed with the higher quality Xilinx tools, only the connections in between
hard macros will limit the maximum achievable clock rate. This would indicate that the
longest wire in between any two hard macros would likely become the critical path for the
design. Therefore, reducing the longest path in between any two hard macros would likely
produce a better result.

Figure 5.17: T2 Variance of Hard Macro Size Re-placement Parameter Sweep (Zoomed)
[Plot: variance of clock period (0.2 to 0.9) vs. hard macro re-placement size in tiles (log
scale, 10^1 to 10^2); series: brik1, brik2, brik3, frequency_estimator, multiband_correlator,
trellis_decoder, Avg. Benchmarks 2011.]

T3 combines the total wire length function and longest wire scaled by some factor
into a single cost function. Again, as in the previous techniques, a parameter sweep was
conducted to find the best variance-minimizing value for the scaling factor of the longest
wire in the cost function. Results of the parameter sweep are shown in Figure 5.18 and an
enlarged area of the graph (values 3000 to 100000) is shown in Figure 5.19.
The parameter sweep for T3 uncovers a significant reduction in variance with
the minimum variance at 0.278 when using a scaling factor of 16384 on the longest wire.
Because of the significant improvement, a more fine-grained sweep was performed around
16384 and can be seen in both Figures 5.18 and 5.19 as the several jagged peaks and valleys
around 16384. Unfortunately, the fine-grained sweep did not uncover any better performing
scaling factors. However, this technique turned out to produce the most significant reduction
of all three techniques used to try to minimize variance.

Figure 5.18: T3 Variance of Longest Wire Scaling Factor Parameter Sweep
[Plot: variance of clock period (0 to 1.4) vs. longest wire scaling factor (log scale, 10^1 to
10^5); series: brik1, brik2, brik3, frequency_estimator, multiband_correlator,
trellis_decoder, Avg. Benchmarks 2011.]

Figure 5.19: T3 Variance of Longest Wire Scaling Factor Parameter Sweep (Zoomed)
[Plot: variance of clock period (0.2 to 0.55) vs. longest wire scaling factor (log scale, around
10^4); series: brik1, brik2, brik3, frequency_estimator, multiband_correlator,
trellis_decoder, Avg. Benchmarks 2011.]

Figure 5.20: Percentage of Total Wire Length to Overall Cost Function Output on Brik1
Benchmark Using T3
[Plot: percentage (%) of total wire length of cost function (0 to 2.5) vs. time in cost function
evaluations (0 to 8 x 10^4).]

To understand the effect this technique has on the cost function, consider Figure 5.20.
This figure plots total wire length as a percentage of the total cost of the system for each cost
function evaluation for the brik1 benchmark. This data was taken using the T3 configuration
with the scaling factor set to 16384. The total wire length only becomes approximately
0.5–2% of the total system cost, which seems insignificant; however, it appears that this
combination produces an effective balance to find good solutions.
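A combined cost function of this kind can be sketched as below. The sketch makes several assumptions that the text does not confirm: the two terms are simply added, wire length is measured as Manhattan distance between connection endpoints, and the function names are hypothetical. Only the scaling factor 16384 comes from the sweep above.

```python
def manhattan(a, b):
    """Manhattan distance between two (x, y) points (assumed wire metric)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def t3_cost(wires, longest_wire_scale=16384):
    """Cost of a placement given its inter-macro connections.

    wires is a list of ((x1, y1), (x2, y2)) pairs. Returns a tuple
    (cost, total_wire_length, longest_wire), where
    cost = total_wire_length + longest_wire_scale * longest_wire.
    """
    lengths = [manhattan(src, dst) for src, dst in wires]
    total = sum(lengths)
    longest = max(lengths, default=0)
    return total + longest_wire_scale * longest, total, longest


def wire_length_share(wires, longest_wire_scale=16384):
    """Total wire length as a percentage of the combined cost."""
    cost, total, _ = t3_cost(wires, longest_wire_scale)
    return 100.0 * total / cost if cost else 0.0
```

With a scaling factor this large, the longest-wire term dominates the sum, and `wire_length_share` reports the kind of small percentage that Figure 5.20 plots.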

5.5.4 Configuration Comparison of Techniques

With three different techniques shown to reduce variance of results in HMFlow 2011
to some degree, it would be advantageous to determine if such techniques could be combined
to further reduce variation of results. Therefore, experimental runs were conducted where all
possible combinations of techniques were combined and their results are given in Table 5.11.
Table 5.11: Average Variance and Frequency using Variance-reducing Techniques

Configuration   Variance   Single Run Freq.   Avg. 100 Run Freq.   Best 100 Freq.
Baseline (C7)   0.725      155MHz             158MHz               196MHz
T1              0.492      160MHz             159MHz               200MHz
T2              0.499      160MHz             158MHz               196MHz
T3              0.278      175MHz             165MHz               195MHz
T1+T2           0.451      164MHz             158MHz               199MHz
T1+T3           0.34       157MHz             161MHz               197MHz
T2+T3           0.303      165MHz             164MHz               193MHz
T1+T2+T3        0.307      159MHz             163MHz               200MHz
Xilinx          0.035      229MHz             226MHz               249MHz

The results shown in Table 5.11 are calculated based on each tool configuration compiling all 6 large hard macro benchmark designs and averaging their variance, single run
implementation clock rate, average implementation clock rate of 100 runs and best implementation clock rate of 100 runs. The baseline (C7) configuration defined in Table 5.6 and
Xilinx tools data are repeated for reference.
The HMFlow 2011 tool configuration with the minimum variance is T3 implemented
by itself. No combination of techniques including T3 was able to further reduce variance
when compared to T3 alone. T3 also had the highest average single run frequency of any
configuration (175MHz) and also the highest frequency of the average of 100 runs (165MHz).
It would appear that reducing variance also increases average implementation clock rate as
the resulting frequencies are grouped tighter together. The only metric in which T3 did not
perform better than the other configurations is the best frequency of 100 runs. This could
also be explained by the reduction in variance, as the best implementations are pushed
closer to the average.
T1 and T2 did reduce variance when compared with the baseline (C7); however, their
variance was almost twice as much as that of T3. T1 and T2 also did not have a significant
impact on the implementation clock rate of the average of 100 runs, matching or slightly
exceeding the baseline. The combinations of the techniques did reduce variance more than T1
or T2 alone, but T3 by itself simply performed better.
Overall, the techniques did not drastically reduce the variance of the placer. This is
likely due to the secondary motive of trying to compile quickly and also using a variety of
hard macro sizes (large and small). However, T3 did reduce variance in results by over 2.5×
and also slightly improved performance. For the effort, the improvements were certainly
worthwhile.
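The ratios quoted in this comparison follow directly from the variance column of Table 5.11. The short script below is only a worked check of that arithmetic; the dictionary and function names are illustrative.

```python
# Average variance per configuration, copied from Table 5.11.
VARIANCE = {
    "Baseline (C7)": 0.725,
    "T1": 0.492,
    "T2": 0.499,
    "T3": 0.278,
    "T1+T2": 0.451,
    "T1+T3": 0.34,
    "T2+T3": 0.303,
    "T1+T2+T3": 0.307,
}


def reduction_vs_baseline(config):
    """How many times smaller the variance is than the C7 baseline."""
    return VARIANCE["Baseline (C7)"] / VARIANCE[config]


def lowest_variance_config():
    """Configuration with the smallest average variance."""
    return min(VARIANCE, key=VARIANCE.get)


if __name__ == "__main__":
    print("lowest variance:", lowest_variance_config())
    for name in ("T1", "T2", "T3"):
        print(f"{name}: {reduction_vs_baseline(name):.2f}x lower than baseline")
    print(f"T1 / T3 variance ratio: {VARIANCE['T1'] / VARIANCE['T3']:.2f}")
```

T3 comes out roughly 2.6 times below the baseline, while T1 and T2 each sit at roughly 1.8 times the variance of T3, in line with the statements above.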

5.6 Runtime Results

Now that circuit quality has been optimized for the different configurations presented
in this chapter, the runtimes of each configuration should also be reported and compared.
Table 5.12 shows the runtime for each of the 6 benchmarks as compiled with the major
HMFlow revisions presented in this chapter. Runtime is measured in seconds and is defined
as the total time to compile a design starting from the time taken to read in a design’s source
files and ending after creating a placed and routed implementation file (XDL or NCD).
As can be seen from Table 5.12, the baseline C0 has the fastest runtime performance
of any tool configuration. This is largely due to its usage of the lower quality but faster
placer and router algorithms present in HMFlow 2010. When the newer simulated annealing,
register re-placement and long-line optimized router are introduced (C7), runtimes increase
by a little less than 50%. T3 boasts the same algorithmic improvements as C7 but with a
significant reduction in variance of result and has roughly the same runtime performance as
C7; the only modification in T3 was to add one extra component to the cost function in the
placer, which is a very small percentage of total runtime.

Table 5.12: Compilation Runtime for Several HMFlow 2011 Configurations vs. Xilinx

Benchmark               Baseline (C0)   C7       HMFlow 2011 (T3)   Xilinx
frequency estimator     6.82s           10.01s   10.91s             392.06s
trellis decoder         11.81s          18.2s    18.57s             461.34s
brik3                   13.81s          28.56s   27.81s             605.16s
brik2                   15.26s          20.94s   19.98s             851.61s
multiband correlator    12.45s          18.82s   18.8s              497.91s
brik1                   15.98s          15.56s   15.66s             848.79s
Average Runtime         12.69s          18.68s   18.62s             609.48s
Speedup (over Xilinx)   48.0×           32.6×    32.7×              -

Runtime, however, is only half of the story. Looking at both runtime and quality of
result provides a better picture of the actual tradeoff occurring as a result of the various
configurations of HMFlow. As performance numbers are scattered throughout this chapter,
they are summarized in Table 5.13 for convenience. The clock rates shown in Table 5.13 are
measured in MHz and were all taken from a single compilation run of the tools.
Table 5.13: Clock Rate Summary (in MHz) for HMFlow 2011 Configurations vs. Xilinx

Benchmark              Baseline (C0)   C7    HMFlow 2011 (T3)   Xilinx
frequency estimator    116             194   182                227
trellis decoder        62              128   166                251
brik3                  66              122   131                241
brik2                  82              153   173                203
multiband correlator   67              146   196                250
brik1                  65              189   199                200
Average Clock Rate     76              155   175                229

Overall, the results can be viewed as a positive outcome for HMFlow. The final
configuration of HMFlow 2011 (T3) can produce implementations over 30× faster than the
Xilinx tools and still obtain clock rates that are 75% of the implementations produced by the
best Xilinx efforts. Put another way, HMFlow produces a placed and routed implementation
that runs at 175 MHz on average and can be obtained in less than 20 seconds. This is in stark
contrast to Xilinx, which will take over 10 minutes to produce a design that will only run
about 30% faster. This is a tremendous result for HMFlow as it demonstrates that rapid
compilation is a very feasible accomplishment in FPGA CAD and could radically change the
way designers create and implement designs for FPGAs.
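The headline figures can be recomputed from Tables 5.12 and 5.13. The script below is just a worked check using the published numbers; the variable and function names are illustrative.

```python
# Per-benchmark runtimes in seconds, copied from Table 5.12.
RUNTIME_S = {
    "C0": [6.82, 11.81, 13.81, 15.26, 12.45, 15.98],
    "C7": [10.01, 18.2, 28.56, 20.94, 18.82, 15.56],
    "T3": [10.91, 18.57, 27.81, 19.98, 18.8, 15.66],
    "Xilinx": [392.06, 461.34, 605.16, 851.61, 497.91, 848.79],
}

# Average clock rates in MHz, copied from Table 5.13.
AVG_CLOCK_MHZ = {"C0": 76, "C7": 155, "T3": 175, "Xilinx": 229}


def mean(values):
    return sum(values) / len(values)


def speedup_over_xilinx(config):
    """Ratio of average Xilinx runtime to the configuration's average runtime."""
    return mean(RUNTIME_S["Xilinx"]) / mean(RUNTIME_S[config])


def clock_fraction_of_xilinx(config):
    """Average clock rate as a fraction of the Xilinx average clock rate."""
    return AVG_CLOCK_MHZ[config] / AVG_CLOCK_MHZ["Xilinx"]


if __name__ == "__main__":
    for cfg in ("C0", "C7", "T3"):
        print(f"{cfg}: avg {mean(RUNTIME_S[cfg]):.2f}s, "
              f"speedup {speedup_over_xilinx(cfg):.1f}x, "
              f"clock {100 * clock_fraction_of_xilinx(cfg):.0f}% of Xilinx")
    print(f"Xilinx avg runtime: {mean(RUNTIME_S['Xilinx']) / 60:.1f} minutes")
```

For T3 this gives a speedup of about 32.7× at roughly three quarters of the Xilinx clock rate, which is the same as saying the Xilinx result is about 30% faster (229/175).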

5.7 Conclusions

This chapter began with HMFlow 2010, which was optimized for rapid compilation times
and utilized small hard macros and the Virtex 4 architecture. Then, the flow was
augmented to support large hard macros and the Virtex 5 architecture, which created
HMFlow 2010a. Once those modifications were in place, three new improvements were added
(simulated annealing placer, register re-placement, and long line-optimized router) to create
HMFlow 2011. Several configurations of these techniques were tested, with C7 performing
best. However, it was noted that variance of result was significant and more improvements
were attempted to reduce it. This resulted in more experiments and configurations,
with T3 being the best configuration overall.
From where this chapter began with HMFlow 2010, average clock rates have improved
by almost 3× and even though implementation clock rate variance is still significant, it was
reduced by 2.5×. All of this was accomplished while delivering compilation times that are
still over 30× faster than the conventional Xilinx flow. Given these performance numbers,
HMFlow offers an attractive alternative to conventional FPGA compilation techniques and
has the potential to increase designer productivity with its rapid compilation benefits.

CHAPTER 6.
THE BIG PICTURE: THE COMPILATION TIME VS.
CIRCUIT QUALITY TRADEOFF

In Chapter 4, the major motivation was to implement a compilation flow (HMFlow
2010) that could compile as fast as possible without regard for circuit quality. With the
success of HMFlow 2010, efforts described in Chapter 5 were focused on improving quality
while trying to also maintain short runtimes in creating HMFlow 2011. With the tradeoff of
compilation runtime and clock rate pulling in opposite directions and affecting FPGA design
methodologies in a variety of ways, trying to understand and convey the implications of such
a tradeoff is difficult to condense down to a nicely formatted LaTeX table. Something more
is required to endow the reader with any reasonable sense of how the tradeoff behaves and
its implications for FPGA designers.
The purpose of this chapter is to provide the reader a view of the big picture of why
such significant efforts were made to build a rapid compilation flow and what impact these
results can have on the future of FPGA design.

6.1 Motivation

It should first be made clear why the compilation time vs. quality of result tradeoff
is significant. To do this, consider a rough representation of current tool implementation
quality vs. runtime offerings in Figure 6.1 where compilation time is on the horizontal
axis and quality of result (clock rate) is on the vertical axis. FPGA vendors provide very
high quality tools to produce high quality of result implementations as this has been what
markets have generally demanded over the past several years. FPGA design starts have
typically required the fastest clock rate possible, using as much of the FPGA as possible for
the cheapest price. Given these difficult pressures on FPGA vendors, compilation runtime
has typically had to suffer as most of the effort to improve tools has been expended in
producing higher quality of result. Hence, the current offerings of FPGA vendor tools are
shown on the right-hand side of Figure 6.1.

Figure 6.1: Generalized Representation of Current Tool Solution Offerings for Compilation
Runtime vs. Quality of Result
[Diagram: quality of result (clock rate) vs. compilation runtime, with regions labeled
"Unsupported Solutions!", "The Range of Current FPGA Vendor Solutions" and "???".]

However, in more recent years, the demands on FPGAs have been changing. With
the benefits of Moore’s Law continually doubling FPGA logic capacities every few years, the
critical pressure that existed previously (to fill an FPGA to capacity in order to cut costs) is
starting to be replaced by growing development costs driven mostly by longer development
times. FPGAs have benefited significantly from Moore’s Law, but workstation processors,
on which the FPGA vendor tools run, have not. Although FPGA vendor tools continue to
improve, the time to compile an FPGA circuit is getting longer with each generation.

Figure 6.2: Quality vs. Runtime Tradeoff for HMFlow 2011 and Xilinx Tools
[Plot: clock rate in MHz (0 to 250) vs. runtime in seconds (0 to 700); points: C0, C7, T3,
Xilinx.]

With changing market pressures that are starting to push for faster compile times,
it is advantageous to enable new ways of compiling FPGA circuits that could allow FPGA
designers the opportunity to take advantage of the left-hand side of Figure 6.1. HMFlow
accomplishes this task by demonstrating compile times over 30× faster than the conventional
flow. In Figure 6.2, the major revisions of HMFlow 2011 are plotted with the Xilinx tools
to produce a tradeoff graph similar to the one presented in Figure 6.1. The data plotted in
Figure 6.2 indicate that there exists a very steep ramp at the beginning
of the tradeoff. That is, when runtimes are very short, increasing runtime by a small amount
would increase clock rate by a larger amount. However, when runtimes are long, adding
additional runtime will not produce as much gain in clock rate.
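The steepness of the ramp can be made concrete by computing the clock rate gained per extra second of runtime between neighboring points of the tradeoff. The sketch below uses the average runtimes and clock rates from Tables 5.12 and 5.13; the names are illustrative, and C7 is left out because its average runtime is slightly longer than that of T3, so it does not lie on the curve between C0 and T3.

```python
# (label, average runtime in seconds, average clock rate in MHz),
# taken from Tables 5.12 and 5.13.
POINTS = [
    ("C0", 12.69, 76),
    ("T3", 18.62, 175),
    ("Xilinx", 609.48, 229),
]


def marginal_gain(a, b):
    """MHz gained per additional second of runtime going from a to b."""
    return (b[2] - a[2]) / (b[1] - a[1])


if __name__ == "__main__":
    for a, b in zip(POINTS, POINTS[1:]):
        print(f"{a[0]} -> {b[0]}: {marginal_gain(a, b):.3f} MHz per second")
```

Going from C0 to T3 buys roughly 17 MHz per additional second of compilation, while going from T3 to the Xilinx flow buys less than 0.1 MHz per additional second.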

6.2 Implications for FPGA Designers

The significance of this tradeoff was also presented in [34] as the authors recognized
the utility of choosing a compilation scheme to match the design situation rather than
providing a one-compile-scheme-fits-all solution. Providing the ability to trade small amounts
of quality for significantly faster runtimes is extremely useful to an FPGA designer and it
could motivate new compilation strategies for rapid prototyping. FPGA designers could
work in a rapidly compiling “debug mode” that could significantly increase the number of
edit-compile-debug turns they could accomplish per day. By accelerating compilation
times, development times become shorter and less costly. Another benefit of this approach
is that once a design is fully verified, the high quality tools can still be used to produce the
final high quality implementation. The results of this dissertation show that there is merit
in approaching compilation from a different angle than the one traditionally taken by FPGA
vendors.

6.3 Contributions

To summarize, the major contributions of this dissertation are listed below:

1. Provided an open source framework to create custom FPGA CAD experiments on
commercial FPGAs called RapidSmith. RapidSmith has received over 500 downloads
worldwide and received the Community Service Award at the 21st International Conference
on Field Programmable Logic and Applications in 2011. RapidSmith served as the
foundation framework for all of the research performed in this dissertation, and has
served and continues to serve as a resource for other research projects at
Brigham Young University and at other universities internationally.
2. Demonstrated a complete custom rapid compilation flow leveraging hard macros called
HMFlow 2010 that compiles designs 10× faster or more than the fastest conventional
vendor tools by accepting clock rates 2–4× slower than vendor tools. HMFlow 2010
targeted real Virtex 4 FPGAs and produced functional implementations demonstrating
the potential of hard macros to serve as a rapid compilation technology.
3. Demonstrated an improved HMFlow 2011 that leverages large hard macros to preserve
valuable timing closure information to accelerate compilation of high performance
designs. HMFlow 2011 included a number of improvements over its predecessor to
improve clock rates to an average of 75% of the best implementations provided by Xilinx
while still compiling an average of 30× faster.
4. Provided insights into the largely unexplored compilation time vs. circuit quality
tradeoff curve to enable new kinds of compilation approaches for rapid prototyping.
These insights will enable new techniques of compilation and design that could
dramatically increase FPGA designer productivity and ultimately lower development costs
for FPGA designs.

6.4 Conclusions

Engineering is about creating solutions to problems and then further enhancing those
solutions to become better, less expensive, more efficient, lower power, smaller and faster.
FPGA circuit design is both a solution and problem that has existed for several years.
FPGAs have enabled significant computational performance in embedded applications for
attractive performance per Watt rates. However, the problem of long expensive design times
has been growing and new techniques to solve this growing problem are needed.
This dissertation aims to provide valuable insight and results showing that FPGA compilation
can be accomplished over an order of magnitude faster than what is conventionally done if
the user is willing to trade a little bit of clock rate performance for a lot of compilation
speedup. Pre-compiled modules or hard macros enable such compilation performance and
this dissertation demonstrates their effectiveness in HMFlow 2010 and HMFlow 2011. These
experiments and results were performed on actual commercial parts and are proof of concept
in their own right. The adoption of hard macros and the techniques presented in this
dissertation would significantly reduce designer development time by enabling rapid
compilation and bring FPGA designer productivity much closer to that of their software
designer counterparts.

6.5 Future Work

This dissertation branches out into an area of relatively unexplored research. Although
the goals of this dissertation have been met, the work performed herein has only
scratched the surface of exploring the implications of creating hard macros for general
purpose design. Future work could expand into such areas as hard macro creation,
representation, placement and routing.
The artificial challenges presented by XDL and its conversion time to NCD are certainly
an issue that would need to be addressed for commercial usage of these techniques. However, if
similar device information could be obtained from other FPGA vendors, the same techniques
could also be applied to their respective architectures and the same performance and results
could be achieved.
As detailed Xilinx timing information is proprietary and not publicly available, it was
not used or leveraged for the work presented in this dissertation. If persons with such access
to timing information were able to implement the techniques presented here, additional
optimizations would be quite productive, especially in the routing algorithm to produce
better implementations.
It is theoretically possible with detailed knowledge of the proprietary Xilinx bitstream
that hard macros could be pieced together at the bitstream level as was mentioned in Chapter 2. If such information were available, future work could include such techniques and
provide even further acceleration of the compilation flow.

REFERENCES
[1] J. Truchard, “CHREC Midyear Workshop Keynote,” June 2011, CEO of National Instruments. 2
[2] Y. Sankar and J. Rose, “Trading Quality For Compile Time: Ultra-Fast Placement For
FPGAs,” in Proceedings of the 1999 ACM/SIGDA seventh international symposium on
Field programmable gate arrays. ACM New York, NY, USA, 1999, pp. 157–166. 3, 15,
16
[3] J. S. Swartz, V. Betz, and J. Rose, “A Fast Routability-Driven Router For FPGAs,”
in FPGA ’98: Proceedings of the 1998 ACM/SIGDA sixth international symposium on
Field programmable gate arrays. New York, NY, USA: ACM, 1998, pp. 140–149. 3, 16
[4] D. Koch, C. Beckhoff, and J. Teich, “ReCoBus-Builder A Novel Tool and Technique
to Build Statically and Dynamically Reconfigurable Systems for FPGAs,” in Field
Programmable Logic and Applications, 2008. FPL 2008. International Conference on,
September 2008, pp. 119–124. 3, 4, 13, 20
[5] R. Tessier, “Fast Placement Approaches for FPGAs,” ACM Trans. Des. Autom. Electron. Syst., vol. 7, no. 2, pp. 284–305, 2002. 4, 14
[6] N. Steiner, “A Standalone Wire Database for Routing and Tracing in Xilinx Virtex,
Virtex-E, and Virtex-II FPGAs,” Master’s thesis, Virginia Polytechnic Institute and
State University, 2002. 4, 20
[7] C. Claus, B. Zhang, M. Huebner, C. Schmutzler, J. Becker, and W. Stechele, “An XDL-based Busmacro Generator for Customizable Communication Interfaces for Dynamically
and Partially Reconfigurable Systems,” in Workshop on Reconfigurable Computing Education at ISVLSI 2007, Porto Alegre, Brazil, May 2007. 4, 20, 36
[8] E. L. Horta and J. W. Lockwood, “Automated Method to Generate Bitstream Intellectual Property Cores for Virtex FPGAs,” in Proc. Field Programmable Logic.2004, 2004.
13
[9] Y. E. Krasteva, F. Criado, E. d. l. Torre, and T. Riesgo, “A Fast Emulation-Based NoC
Prototyping Framework,” in RECONFIG ’08: Proceedings of the 2008 International
Conference on Reconfigurable Computing and FPGAs. Washington, DC, USA: IEEE
Computer Society, 2008, pp. 211–216. 13
[10] J. Coole and G. Stitt, “Intermediate Fabrics: Virtual Architectures for Circuit Portability and Fast Placement and Routing,” in Proceedings of the Eighth IEEE/ACM/IFIP
International Conference on Hardware/software Codesign and System Synthesis, ser.
CODES/ISSS ’10. New York, NY, USA: ACM, 2010, pp. 13–22. 14
[11] V. Betz and J. Rose, “VPR: A New Packing, Placement And Routing Tool For FPGA
Research,” in Proceedings of the 7th International Workshop on Field-Programmable
Logic and Applications. Springer-Verlag London, UK, 1997, pp. 213–222. 16, 19
[12] S. Malhotra, T. Borer, D. Singh, and S. Brown, “The Quartus University Interface Program: enabling advanced FPGA research,” in Field-Programmable Technology, 2004.
Proceedings. 2004 IEEE International Conference on, December 2004, pp. 225–230. 19
[13] Xilinx Design Language Version 1.6, Xilinx, Inc., Xilinx ISE 6.1i Documentation in
ise6.1i/help/data/xdl, July 2000. 20
[14] P. Graham, M. Caffrey, D. Johnson, N. Rollins, and M. Wirthlin, “SEU Mitigation for
Half-latches in Xilinx Virtex FPGAs,” Nuclear Science, IEEE Transactions on, vol. 50,
no. 6, pp. 2139 – 2146, December 2003. 20
[15] N. Weaver, Y. Markovskiy, Y. Patel, and J. Wawrzynek, “Post-placement C-slow
Retiming for the Xilinx Virtex FPGA,” in Proceedings of the 2003 ACM/SIGDA
eleventh international symposium on Field programmable gate arrays, ser. FPGA
’03. New York, NY, USA: ACM, 2003, pp. 185–194. [Online]. Available:
http://doi.acm.org/10.1145/611817.611845 20
[16] V. Degalahal and T. Tuan, “Methodology for High Level Estimation of FPGA Power
Consumption,” in Design Automation Conference, 2005. Proceedings of the ASP-DAC
2005. Asia and South Pacific, vol. 1, January 2005, pp. 657 – 660 Vol. 1. 20
[17] A. A. Sohanghpurwala, “OpenPR: An Open-Source Partial Reconfiguration Tool-Kit
for Xilinx FPGAs,” Master’s thesis, Virginia Tech, December 2010. 20
[18] D. Koch and J. Torresen, “Routing Optimizations for Component-based System Design
and Partial Run-time Reconfiguration on FPGAs,” in Field-Programmable Technology
(FPT’10). International Conference on, December 2010. 20
[19] K. Puttegowda, W. Worek, N. Pappas, A. Dandapani, P. Athanas, and A. Dickerman,
“A Run-time Reconfigurable System for Gene-sequence Searching,” in VLSI Design,
2003. Proceedings. 16th International Conference on, January 2003, pp. 561 – 566. 20
[20] K. Kepa, F. Morgan, K. Kosciuszkiewicz, L. Braun, M. Hübner, and J. Becker, “FPGA
Analysis Tool: High-Level Flows for Low-Level Design Analysis in Reconfigurable Computing,” in Reconfigurable Computing: Architectures, Tools and Applications. Springer
Berlin / Heidelberg, 2009, vol. 5453, pp. 62–73. 20
[21] N. Steiner, A. Wood, H. Shojaei, J. Couch, P. Athanas, and M. French, “Torc: Towards
an Open-Source Tool Flow,” in Proceedings of the 19th Annual ACM/SIGDA International Symposium on Field Programmable Gate Arrays, ser. FPGA ’11. New York,
NY, USA: ACM, 2011. 21
[22] S. Ferguson and E. Ong, “Hessian 2.0 Serialization Protocol,”
http://hessian.caucho.com/doc/hessian-serialization.html, August 2007. 28
[23] K. Shahookar and P. Mazumder, “VLSI Cell Placement Techniques,” ACM Comput.
Surv., vol. 23, pp. 143–220, June 1991. 35
[24] “AR #10901 - 6.1i FPGA Editor - How do I create a hard macro?”
http://www.xilinx.com/support/answers/10901.htm. 36, 38
[25] “Using Three-State Enable Registers in 4000XLA/XV, and Spartan-XL FPGAs
(XAPP123 v2.0),” Xilinx Inc., Tech. Rep., January 2002. 36
[26] A. Lesea and A. Percey, “Negative-Bias Temperature Instability (NBTI) Effects in
90 nm PMOS,” Xilinx Inc., White Paper 224,
http://www.xilinx.com/support/documentation/white_papers/wp224.pdf, November 2005. 51
[27] C. Sechen, “Chip-planning, placement, and global routing of macro/custom cell integrated circuits using simulated annealing,” in Design Automation Conference, 1988.
Proceedings., 25th ACM/IEEE, Jun 1988, pp. 73 –80. 52, 85
[28] P. Maidee, C. Ababei, and K. Bazargan, “Fast Timing-driven Partitioning-based Placement for Island Style FPGAs,” Design Automation Conference, vol. 0, p. 598, 2003.
52
[29] B. W. Kernighan and S. Lin, “An Efficient Heuristic Procedure for Partitioning Graphs,”
The Bell System Technical Journal, vol. 49, no. 1, pp. 291–307, 1970. 52
[30] L. McMurchie and C. Ebeling, “PathFinder: a Negotiation-based Performance-driven
Router for FPGAs,” in Proceedings of the 1995 ACM Third International Symposium
on Field-programmable Gate Arrays, ser. FPGA ’95. New York, NY, USA: ACM, 1995,
pp. 111–117. 54
[31] C. Lavin, B. Nelson, J. Palmer, and M. Rice, “An FPGA-based Space-time Coded
Telemetry Receiver,” in Aerospace and Electronics Conference, 2008. NAECON 2008.
IEEE National, July 2008, pp. 250–256. 55, 79
[32] S. Ghosh and B. Nelson, “XDL-Based Module Generators for Rapid FPGA Design
Implementation,” International Conference on Field Programmable Logic and Applications, vol. 0, pp. 64–69, 2011. 58, 82
[33] J. Lam and D. Jean-Marc, “Performance of a new annealing schedule,” in Proceedings
of the 25th ACM/IEEE Design Automation Conference, ser. DAC ’88. Los Alamitos,
CA, USA: IEEE Computer Society Press, 1988, pp. 306–311. [Online]. Available:
http://dl.acm.org/citation.cfm?id=285730.285780 87
[34] C. Mulpuri and S. Hauck, “Runtime And Quality Tradeoffs In FPGA Placement And
Routing,” in Proceedings of the 2001 ACM/SIGDA ninth international symposium on
Field programmable gate arrays. ACM New York, NY, USA, 2001, pp. 29–36. 120

