ON TWO-STEP ROUTING FOR FPGAS
Guy G.F. Lemieux Stephen D. Brown Daniel Vranesic
Department of Electrical and Computer Engineering, University of Toronto, Canada
lemieux|brown|danv@eecg.utoronto.ca
ABSTRACT
We present results which show that a separate global and detailed
routing strategy can be competitive with a combined routing pro-
cess. Under restricted architectural assumptions, we compute a
new lower bound for detailed routing and show that our detailed
router typically requires no more than two extra routing tracks
above this computed limit. Also, experimental results show that the
Mapping Anomaly presented in [20], which suggests that separated
routing may yield arbitrarily poor results in certain instances, is a
concern only if nets are restricted to a single track domain. Finally,
to motivate future work, we show the latest two-step routing results
that we have achieved with the VPR global router and SEGA detailed
router tools on the largest CBL benchmark circuits.
1. INTRODUCTION
Recent FPGA routing results have suggested that a separate global
and detailed routing strategy is inferior to a combined routing
process [1, 10, 19, 21]. Similarly, the practice of dividing multipoint
nets into multiple two-point nets for routing was thought to
negatively impact routability. In fact, recently published results have
shown that combined routers have used signiﬁcantly fewer routing
tracks than the best-known two-step routers, CGE [4] and SEGA
[11]. However, results obtained with a new global router, VPR,
show that distinct global and detailed routing, combined with mul-
tipoint net division, can be competitive with the latest published
FPGA routing tools. This is encouraging because separate global
and detailed routing of two-point nets may have other practical
benefits such as reduced memory use or compute time.
There is an additional concern that separate global and detailed
routing may suffer from what [20] calls a Mapping Anomaly. This
is a condition where the global route forms such a constraint that
the channel density greatly under-speciﬁes the minimum number
of routing tracks required. After making the architectural assump-
tions suggested by [20], we sometimes detect the presence of a
Mapping Anomaly. Our experimental results indicate that this
anomaly is of critical concern if multipoint nets are constrained to
a single track domain. However, the anomaly was not found to be
present when nets were allowed to be split onto multiple track do-
mains at input and output pins.
Finally, a new lower bound for evaluating the performance of
any detailed router is presented. Although the new bound is not
completely tight, the SEGA detailed router typically routes bench-
marks within two tracks of the bound.
1.1. Paper Overview
The rest of this paper is organized as follows. Section 2 describes
the FPGA architecture model used. Section 3 provides an overview
of previous FPGA routing algorithms against which comparisons
will be made. In Section 4, the Mapping Anomaly, confronting
graph, and other graph-theoretic terminology are deﬁned. In Sec-
tion 5 the empirical methodology and tools used in this paper are
presented, and the results and analysis follow in Section 6. The
conclusions drawn from this data are summarized in Section 7.
2. FPGA ARCHITECTURE MODEL
The style of FPGA architecture assumed in this paper is similar to
the Xilinx XC4000 series, but it is modeled with a set of parameters
that represents a range of architectures. As illustrated in Figure 1,
the architecture comprises a rectangular array of logic blocks with
both horizontal and vertical routing channels, and I/O cells around
the periphery. The contents of the logic blocks (L) are not of in-
terest for this study. The routing channels comprise the wire seg-
ments and switches used to interconnect logic blocks. Wire seg-
ments are organized into both vertical and horizontal tracks; in the
example in Figure 1 there are four tracks per channel and each logic
block has two pins on each of its sides. We assume that all rout-
ing tracks consist of only short wire segments that span a single
logic block. This assumption is made because we wish to compare
results achieved by several recently-produced FPGA routing algo-
rithms, and all of these algorithms’ published results assume only
short wire segments.
A key characteristic of the FPGA model is that the channels
comprise two kinds of blocks, called Switch (S) and Connection
(C) blocks, as illustrated in Figure 1. The S blocks hold routing
switches that can connect one wire segment to another, and the C
blocks house the switches that connect the wire segments to the
logic block pins.
An S block is a rectangular switch box that connects wire seg-
ments in one segment of a channel to those in another. Depending
on the topology, each wire segment on one side of an S block may
be switchable to either all or some fraction of the wiring segments
on each other side of the S block. The flexibility of the S block
is given by the parameter $F_s$, which defines the number of other
wire segments that a wire segment ending at an S block can
connect to. An example S block appears in Figure 1a, in which each
dashed line represents a programmable routing switch; in this
figure, $F_s = 3$. In this study, the S-block topology is assumed to be
disjoint. This means that the wiring tracks are isolated into disjoint
domains by the switch organization. Consequently, if all S-block
switches are turned on, a number of unconnected wiring groups are
created, called track domains. For example, with the S block in
Figure 1a, a signal beginning on track 0 is restricted to wire
segments in track 0, no matter which S-block switches it goes through.

[Figure 1. General Model of an Array-based FPGA: an array of logic
blocks (L) with switch (S) and connection (C) blocks, wire segments,
and I/O cells. a) S block detail. b) C block detail. Tracks are
numbered 0 to 3.]

To appear in Proc. ISPD-97, April 14-16, Napa, CA. © ACM 1997.

Figure 1b illustrates a C block. The tracks are hardwired to pass
through it and can be connected to the logic block pins via a set of
switches. The flexibility of a C block, $F_c$, is defined as the
number of wire segments in the C block that each logic block pin can
connect to. In the figure, a routing option is represented as an X;
for this example, each pin can be connected to two vertical tracks;
hence $F_c = 2$.
An important architectural feature is how the C block is
implemented. If each X is simply a pass transistor, then two or more
switches on the pin may be turned on to permit a routing dogleg,
where the pin and connected wires behave as one electrically
equivalent wire. However, if the X’s along an input (driver) pin are
implemented as a (de)multiplexor, only one connection to the tracks
can be made. In these cases, doglegs are not possible. Many previous
routing studies have assumed that routing doglegs can be used
at both input and driver pins. However, commercial FPGAs such
as Xilinx XC4000 [22] and Lucent ORCA FPGAs [12] do not
permit input pin doglegs. The study in this paper considers both cases:
doglegs at only the driver pin, and doglegs at any pin.
The main advantage provided by the FPGA model described
above is its generality, which supports a wide range of routing ar-
chitectures by changing the number of tracks per channel and the
contents of the C and S blocks. Earlier studies have examined the
effects of the $F_s$ and $F_c$ parameters [16]. Based on those previous
studies, we will use the values $F_s = 3$ and $F_c = W$, where $W$
is the number of tracks per channel, for all of the experiments in
this paper. Note that these same assumptions are also used in recent
publications on routing algorithms [1, 7, 10, 19, 21], and so
are generally accepted as being reasonable.
3. PREVIOUS WORK
This section describes previous work related to FPGA routing that
is directly comparable to the study in this paper.
3.1. FPGA Logic Block and I/O Placement
Many FPGA routing studies have used the benchmark netlists
originally generated for CGE/SEGA. The placement for these
benchmarks was generated by ALTOR [14], a tool originally intended
for standard cell placement. ALTOR used a recursive min-cut bi-
partitioning strategy. By repeatedly partitioning in horizontal and
vertical directions, ALTOR creates a ﬁnal placement.
Recent tools, namely FPR [2], SPLACE [19] and VPR [3],
include placement algorithms that are targeted specifically for FPGA
use. FPR uses a recursive-partitioning technique that is similar to
ALTOR, but each step uses simulated annealing to divide the netlist
into an $m \times n$ grid, for some small fixed $m$ and $n$. Before each
recursive step, FPR also performs some global routing. This
simultaneous placement and global routing strategy is unique among
the FPGA tools considered in this paper. In comparison, SPLACE
and VPR use simulated-annealing placement algorithms. VPR
provides more efficient treatment of high-fanout nets and can
therefore consider more moves than SPLACE in a given amount of CPU
time.
3.2. FPGA Global Routers
The global router LocusRoute [15] was originally intended for
standard cell applications. It accepts a placement and a multipoint
netlist as inputs and breaks the nets into two-point nets. Each two-
point net is routed with two or fewer bends with the objective of
minimizing channel density. A bend cost can be applied to further
discourage bends [18]. The output is a coarse graph for each
connection consisting of a series of adjacent channel segments to guide
it through the FPGA array. The quality of the LocusRoute channel
assignment is measured by the maximum channel density, $D_{\max}$,
which is the largest number of distinct signals occupying a single
channel segment.
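The density measure described above can be sketched in a few lines of Python. This is only an illustration: the function name `max_channel_density` and the encoding of a channel segment as a tuple such as `("H", x, y)` are assumptions made for the example, not LocusRoute's data structures.

```python
from collections import defaultdict


def max_channel_density(global_routes):
    """Return the largest number of distinct nets sharing one channel segment.

    global_routes maps a net name to the list of channel segments that its
    connections pass through (hypothetical encoding: ("H" or "V", x, y)).
    """
    nets_in_segment = defaultdict(set)
    for net, segments in global_routes.items():
        for seg in segments:
            # A set is used so that a net counts once per segment, even if
            # several of its two-point connections use that segment.
            nets_in_segment[seg].add(net)
    return max((len(nets) for nets in nets_in_segment.values()), default=0)


routes = {
    "a": [("H", 0, 0), ("H", 1, 0)],
    "b": [("H", 1, 0), ("V", 1, 0)],
    "c": [("H", 1, 0), ("H", 1, 0), ("V", 1, 0)],
}
density = max_channel_density(routes)
```

In this toy input the segment `("H", 1, 0)` carries all three nets, so it determines the reported density.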
The global routing step of VPR [3] uses a maze router on mul-
tipoint nets in a manner similar to [13]. All nets are routed, ripped
up, and rerouted several times. After every iteration, it accrues a
history cost to channel segments with a density greater than the
target density, $D_{target}$. Subsequent net routings tend to avoid
congested channels unless no alternative exists. VPR finds the
minimum possible $D_{target}$ that successfully routes a circuit with
$D_{\max} \le D_{target}$.
3.3. FPGA Detailed Routers
The Coarse Graph Expansion (CGE) algorithm [4] was specifically
developed for FPGA routing research. It expands all two-point nets
along their global route into a small number of distinct paths,
carefully pruning the search space. Wire resources for the lowest cost
path are committed until the circuit is routed. A rip-up strategy is
employed if needed, in which less pruning of possible choices is
done in hard-to-route areas.
The successor to CGE, SEGment Allocator (SEGA) [11], used a
different cost function structure to make use of long wire segments.
SEGA also made the assumption that a net could be fully expanded
into all possible paths along the global route. Consequently, SEGA
does not re-expand a net when its paths are exhausted. Instead, the
cost function increases a net’s priority as its choices diminish. This
approach yielded good results, so CGE-style rip-up was deemed
unnecessary to the algorithm.
Since SEGA’s original publication date, a number of different
cost functions have been explored to investigate routability and
speed-performance [5]. The cost function used to produce the re-
sults for this paper, called Area, has been the most successful so
far in using the fewest wiring tracks. The Area cost causes SEGA
to first identify the nets which have the fewest remaining
paths. Among these nets, the path with the lowest Demand cost
(akin to CGE’s cost) is chosen.
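The two-level selection just described can be sketched as follows. This is a minimal illustration only: `pick_next_path`, the `remaining_paths` dictionary and the `demand_cost` callback are hypothetical names, and the real Demand cost is not reproduced here.

```python
def pick_next_path(remaining_paths, demand_cost):
    """Choose a (net, path) pair in the order the Area cost is described.

    remaining_paths: dict net -> non-empty list of candidate paths.
    demand_cost:     function path -> number (stand-in for the Demand cost).
    """
    # Step 1: find the nets with the fewest remaining candidate paths.
    fewest = min(len(paths) for paths in remaining_paths.values())
    tied = [net for net, paths in remaining_paths.items() if len(paths) == fewest]
    # Step 2: among those nets, take the path with the lowest cost.
    return min(
        ((net, path) for net in tied for path in remaining_paths[net]),
        key=lambda pair: demand_cost(pair[1]),
    )


remaining = {"n1": ["p1", "p2", "p3"], "n2": ["q1", "q2"], "n3": ["r1", "r2"]}
cost = {"p1": 1, "p2": 1, "p3": 1, "q1": 5, "q2": 3, "r1": 4, "r2": 6}
choice = pick_next_path(remaining, cost.get)
```

Net `n1` has the cheapest paths but is skipped because it still has three choices; the selection is made among `n2` and `n3` only.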
3.4. Combined FPGA Global and Detailed Routers
The Greedy Bin Packing (GBP) algorithm [20] combines both
global and detailed routing into one step. By making the
assumptions that $F_s = 3$, $F_c = W$, and that a disjoint S-block topology is
used, an FPGA is routed by treating every track domain as a bin and
greedily ﬁlling that bin with nets until no more will ﬁt. GBP then
proceeds to the next track domain and repeats the process. In this
way, GBP is similar to the Best Fit Decreasing bin-packing heuris-
tic. Observations that GBP did not densely pack the last few track
domains led to the Orthogonal Greedy Coupling (OGC) algorithm
[21]. By switching from one greedy algorithm to another (which
has a different optimization goal) after some track domains have
been packed, the last few track domains were more densely packed
and fewer routing tracks were used.
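The bin-packing view can be illustrated with a deliberately simplified sketch. It assumes that each net's route is already fixed and that conflicts between nets are given as a symmetric relation; the published GBP algorithm also chooses the routes while packing, which is not modelled here. The names `greedy_domain_packing` and `conflicts` are hypothetical.

```python
def greedy_domain_packing(nets, conflicts):
    """Fill one track domain (bin) at a time with mutually compatible nets.

    nets:      net names in the order they are tried (an assumed ordering).
    conflicts: dict net -> set of nets that cannot share a track domain
               (assumed symmetric).
    Returns a list of domains, each a list of net names.
    """
    remaining = list(nets)
    domains = []
    while remaining:
        current = []
        for net in remaining:
            # A net fits if it conflicts with nothing already in this domain.
            if all(other not in conflicts.get(net, set()) for other in current):
                current.append(net)
        domains.append(current)
        remaining = [n for n in remaining if n not in current]
    return domains


conflicts = {"a": {"b", "c"}, "b": {"a"}, "c": {"a"}, "d": set()}
domains = greedy_domain_packing(["a", "b", "c", "d"], conflicts)
```

The first domain takes `a` and `d`; `b` and `c` conflict with `a` and are deferred to the second domain.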
A series of one-step routing algorithms was presented in [1]. In
these algorithms, multipoint nets are routed one net at a time. If a
net fails to route, it is moved to the front of the net order and rout-
ing is restarted. The FPGA routing resources are represented by a
graph which shrinks as nets are routed. The algorithms differ by the
way they route a multipoint net through the remaining graph. Five
different core algorithms were presented, three of which were further
enhanced using iteration. Of these eight algorithms, four
minimized wirelength by solving the Network Steiner Tree Problem and
four minimized source to sink distance using shortest-paths algo-
rithms. In this paper, we compare our results to those produced by
IKMB, one of the iterated Steiner-tree algorithms.
The FPGA Placement and Routing (FPR) algorithm [2] uses the
same net routing strategy described above. It also uses the IKMB
algorithm to perform detailed routing. However, before each
recursive partitioning step, FPR greedily selects a partial global route
for each net based on rectilinear Steiner arborescences¹ and assigns
nets to speciﬁc S blocks. This allows FPR to balance congestion
across each cut and ﬁx the signal entrance or exit points on each
side before cutting each subpartition.
The TRACER-fpga algorithm [7] also performs combined
routing of multipoint nets. It uses a maze router seeded from the source
and all sinks to route each net. Initially, all nets are routed by
allowing them to share wires. Next, a simulated evolution tech-
nique (similar to simulated annealing) chooses nets for rip-up and
rerouting; nets sharing resources are more likely to be ripped up.
During rerouting, a high cost is used to discourage future sharing.
When no more sharing occurs, a solution has been found. The
TRACER-fpga PR algorithm [10] is similar, except that it avoids
sharing during initial net routing. Also, it uses slacks to order nets
during initial routing and for selection of nets during rip-up. By
using slacks, it gives long nets priority for direct connections and
allows short nets to route around congestion.
The SROUTE algorithm [19] sequentially maze-routes each
multipoint net by searching out the next closest sink. If a path for
a net cannot be found, it is moved to the front of the net order and
routing is restarted. To reduce the maze-routing search space, it
initially follows paths which advance toward the closest sink.
3.5. Summary
A number of routers have been presented which address the routing
problem in slightly different ways. None of the algorithms above
directly address the issue of speed-performance, but some try to
reduce wasted wirelength or take more-direct paths. All of these
algorithms emphasize routability, and all try to minimize
wirelength. The most recent routers (VPR, SROUTE, FPR, TRACER,
IKMB, GBP, and OGC) improve routability by relaxing the
minimum wirelength condition.
Table 1 gives a general overview of various architectural features
and routing techniques that are used by the routers in this paper.
Blank entries in the table mean ‘not applicable’.

¹ An arborescence is a construction which contains the shortest path
from a distinguished vertex, or source, to all other vertices or sinks.

[Figure 2. A sample global routing, $G$, for three nets (a, b, and
c) and the corresponding confronting graph, $H$. Notice the
three connected vertices in $H$ imply that three routing tracks
are required to route. If the multipoint net c is broken into
two-point nets ($c_0c_1$ and $c_1c_2$) and dogleg routes are
permitted, the confronting graph $H'_{dl}$ results and can be routed with
only two tracks. One possible solution is implicitly shown in $G$,
where a dogleg occurs on pin $c_1$. The confronting graph $H'_{ddl}$
shows that when doglegs are allowed at the driver pin only,
three tracks are still required.
Panels: $G$ (pins $c_0$, $c_1$, $c_2$; tracks 0 and 1); $H$ with vertices
a, b, $c_0c_1c_2$; $H'_{dl}$ with vertices a, b, $c_0c_1$, $c_1c_2$; $H'_{ddl}$ with
vertices a, b, $c_0c_1$, $c_0c_2$.]

4. TERMINOLOGY
With the architectural assumptions that $F_s = 3$, $F_c = W$, and
that S blocks are disjoint, the routing problem can be restated as
a graph colouring problem. This observation led to the concepts
of the Mapping Anomaly and a confronting graph in [20]. In this
section, these terms and the underlying graph theory are deﬁned.
A multipoint netlist which has been global routed can be
represented by a forest of trees, $G(V, E)$, or simply $G$. Each net is a
tree, where the driver and sinks form the leaves and intermediate
vertices between the leaves are C or S blocks through which the
net is routed. All vertices are labeled with the $(x, y)$ co-ordinates
of their location in the FPGA model. Additionally, leaf vertices are
labeled with their corresponding logic block pin number.
In this paper, a two-point netlist $G'(V, E)$ is constructed from
$G$ in two different ways. In the first method, routing doglegs are
permitted at input and driver pins of a net, forming $G'_{dl}(V, E)$. The
second method permits routing doglegs at the driver pin only, and
is called $G'_{ddl}(V, E)$. Each connected component of $G'_{dl}$ or $G'_{ddl}$ is
a two-point net; in the latter graph, one endpoint is always a driver
pin.
The disjoint S-block topology divides each routing track into a
separate domain. This property allows the construction of the
confronting graph, $H(V, E)$. Each vertex in $H$ corresponds to a net in
$G$. An edge is placed between two nets (vertices) in $H$ if they travel
through a common C block in $G$, i.e., each net contains a vertex
in $G$ with the same C block $(x, y)$ label. Thus, an edge in $H$
represents an incompatibility: the two nets cannot be assigned to the
same track domain.

Table 1. Comparison of architectural features (upper rows) and routing techniques (lower rows) used by FPGA routers.

Feature or technique             LocusRoute  CGE     SEGA    GBP  OGC  IKMB  TRACER  SROUTE  FPR   VPR
exploits pin equivalence         y                           n    n    n     n       y       n     y
exploits output pin doglegs                  y       y       y    y    y     y       y       y
exploits input pin doglegs                   y/n(a)  y/n(a)  y    y    y     y       n       y
exploits long wire segments      y(b)        n       y       n    n    n     n       n       n     y(b)
performs rip-up and re-routing   y           y       n       n    n    y     y       n       n     y
greedy selection mechanism       y           y       y       y    y    n     n       y       y(c)  n
Steiner tree/arborescence based
  global routing                 n                           n    n    y     n       n       y     n
shortest-path global routing     y                           n    n    n     n       n       y     n
maze global routing              n                           n    n    n     y       y       n     y
guaranteed performance bounds    n           n       n       n    n    y(d)  n       n       y(d)  n
net-order dependent results      y           y       n       n    n    y     n       y       y     n

(a) CGE/SEGA will not dogleg at sinks if one end of the 2-pin input netlist always connects to the driver.
(b) A ‘bend cost’ can be applied to help better exploit long wire segments.
(c) A greedy selection is done to assign global routes at each partitioning step.
(d) Each net is guaranteed to use a number of wires within a constant factor of the minimum (out of those remaining in the FPGA after previous nets are all routed).

Similar confronting graphs can be built for $G'_{dl}(V, E)$ and
$G'_{ddl}(V, E)$, denoted $H'_{dl}(V, E)$ and $H'_{ddl}(V, E)$, respectively.
However, for these graphs an edge is never placed between two-point
nets (vertices) which are part of the same multipoint net
because they may be safely assigned to the same track domain. An
example graph $G$, and the resulting confronting graphs $H$, $H'_{dl}$ and
$H'_{ddl}$ are shown in Figure 2. Note that $G$ is shown embedded in an
array of logic blocks to illustrate the global route.
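The construction rule for confronting graphs can be sketched as follows. The function name `confronting_graph`, the dictionary encoding and the C block coordinates are assumptions made for this example; the layout is hypothetical and was chosen only so that the three graphs behave as described for Figure 2.

```python
from itertools import combinations


def confronting_graph(c_blocks_of, parent_of=None):
    """Build a confronting graph as an adjacency dict.

    c_blocks_of: dict net -> set of C block (x, y) labels on its global route.
    parent_of:   optional dict two-point net -> its multipoint net; two-point
                 nets with the same parent never receive an edge.
    """
    parent_of = parent_of or {}
    graph = {net: set() for net in c_blocks_of}
    for u, v in combinations(c_blocks_of, 2):
        same_parent = u in parent_of and parent_of.get(u) == parent_of.get(v)
        # An edge means the two nets share at least one C block.
        if not same_parent and c_blocks_of[u] & c_blocks_of[v]:
            graph[u].add(v)
            graph[v].add(u)
    return graph


P, M, Q = (0, 1), (1, 1), (2, 1)  # hypothetical C block labels

h = confronting_graph({"a": {P, M}, "b": {M, Q}, "c": {P, Q}})
h_dl = confronting_graph(
    {"a": {P, M}, "b": {M, Q}, "c0c1": {P}, "c1c2": {Q}},
    parent_of={"c0c1": "c", "c1c2": "c"},
)
h_ddl = confronting_graph(
    {"a": {P, M}, "b": {M, Q}, "c0c1": {P}, "c0c2": {P, Q}},
    parent_of={"c0c1": "c", "c0c2": "c"},
)
```

Here `h` is a triangle, `h_dl` is a simple path, and `h_ddl` again contains a triangle (a, b, c0c2), even though the two pieces of net c share a C block without being joined by an edge.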
Using the confronting graph, the detailed routing problem is
mapped to a graph (vertex) colouring problem. In this perspective,
the vertices of $H$ must be assigned a colour (track domain) such
that no two adjacent vertices are assigned the same colour, and the
minimum number of colours is to be used. The graph colouring
problem is NP-complete on general graphs [9], so heuristics are
commonly used to solve it.
This minimum number of colours required to colour $H$ is called
the chromatic number, denoted $\chi(H)$. It is important to note that
$\chi(H)$ represents the minimum number of routing tracks required
for detailed routing of $G$, and a routing solution with this many
tracks is guaranteed to exist.
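As an illustration of the colouring view, the sketch below applies a generic greedy heuristic (largest degree first, lowest free colour). It is not the method used by SEGA or any router in this paper, and it only yields an upper bound on the chromatic number; the function name and the two example graphs are assumptions for the example.

```python
def greedy_colouring(graph):
    """Colour vertices in order of decreasing degree with the lowest free colour.

    graph: dict vertex -> set of adjacent vertices (undirected).
    Returns dict vertex -> colour index, read here as a track domain.
    """
    order = sorted(graph, key=lambda v: (-len(graph[v]), str(v)))
    colour = {}
    for v in order:
        used = {colour[u] for u in graph[v] if u in colour}
        c = 0
        while c in used:
            c += 1
        colour[v] = c
    return colour


# Hypothetical confronting graphs: a triangle and a four-vertex path.
triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
path = {"c0c1": {"a"}, "a": {"c0c1", "b"}, "b": {"a", "c1c2"}, "c1c2": {"b"}}

tracks_triangle = max(greedy_colouring(triangle).values()) + 1
tracks_path = max(greedy_colouring(path).values()) + 1
```

The triangle needs three colours, while the path, which is what results when one vertex of the triangle is split, can be coloured with two.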
The Mapping Anomaly [20] is the observation that $G$ may be
constructed such that $\chi(H)$ can be arbitrarily higher than the
maximum channel density, $D_{\max}$. Since $H$ is implicitly produced by
the global router, the detailed router has no control over $\chi(H)$.
Additionally, the global router attempts to minimize $D_{\max}$ and not
$\chi(H)$ directly, so it may construct pathologically bad $H$
configurations. This observation was used in [20] to support the notion that
global and detailed routing should be combined.
The results presented in this paper suggest the Mapping Anom-
aly may not be a concern if routing doglegs are permitted, but it
is a problem if doglegs are not allowed. This is intuitive because
doglegs permit an ‘escape hatch’ for a signal to avoid interference,
effectively reducing the net’s length. Doglegs in the confronting
graph have the effect of splitting a vertex in $H$ and spreading the
connectivity among the split vertices. The freedom to colour the
split vertices similarly or differently, depending on the colour of
adjacent vertices, often means that fewer colours are required.
It is desirable to compute $\chi(H)$ and use it to determine the
quality of the detailed routing heuristic. However, we could not find an
effective way to directly compute it. Instead, we compute a
well-known lower bound: the clique number of $H$, or $\omega(H)$. The clique
number of a graph is the size of the largest clique, or completely
connected subgraph. Clearly, at least $\omega(H)$ different colours are
needed to colour the largest clique because all of its vertices are
adjacent to each other. Since all of the nets in a C block are completely
connected (thus, forming a clique), the following useful relation is
developed:
$$D_{\max} \le \omega(H) \le \chi(H)$$
A similar relationship holds for the $H'_{dl}$ and $H'_{ddl}$ graphs.
The clique number is useful in two ways. First, it forms a tighter
lower bound to gauge the quality of SEGA. Second, it helps show
the presence of the Mapping Anomaly, as follows. If $\omega$ is much
larger than $D_{\max}$, then $\chi$ must be large, so the Mapping
Anomaly is present. However, if $\omega$ is comparable to $D_{\max}$ then it may
or may not be present. In this case, it is not present only if the
graph can be coloured (routed) with a few colours (tracks) more
than $D_{\max}$.
5. METHODOLOGY AND TOOLS
The approach used in this study is empirical. That is, a set of
benchmark circuits is input to a CAD tool chain and the routing results
are analysed. The CAD tool chain consists of a new placement and
global routing tool, VPR [3], and a detailed routing tool, SEGA
[5, 11]. Two sets of benchmarks are used: older benchmarks
provide a means of comparing to previously published results, and
newer benchmarks allow more rigorous testing of the tools. The
process and tools used are described in detail below. All of the
tools, circuits, and results are available for download.²
5.1. Benchmark Preparation
Benchmark circuits from [11] were widely used to produce
comparative results between routers. To use a new VPR placement or
global routing for these circuits, they had to be converted to a
format understood by VPR. A short program, sega2blif, was
written to extract the multipoint nets from the SEGA input, and output
the connectivity information in BLIF format. The same tool also
wrote out a placement file which could optionally be used by VPR.³
The new benchmark circuits used in this study are from the CAD
Benchmarking Laboratory (CBL) LGSynth93 suite [6]. A total of
198 circuits were converted to BLIF, optimized with SIS [17] and
mapped into 4-input LUTs with FlowMap [8]. They were then run
through the blifmap tool included with VPR to remove clock
signals and, where possible, pack flip-flops into logic blocks. Clock
signals are removed because it is assumed that a global clock
routing resource is available to route them.

² http://www.eecg.utoronto.ca/~lemieux/sega
³ Note that while SEGA permits I/O pins to be in the four corners of the
periphery, VPR does not. Consequently, any corner I/O signals were moved
as short a distance as possible to the next available I/O pad.

5.2. Placement and Routing
All benchmarks were placed and global routed using VPR and de-
tail routed using SEGA. The exact tool setup is described below.
The default VPR placement options were used for all benchmark
circuits. However, for the older benchmarks, VPR was sometimes
told to use the old placement information. Also, VPR required a
parameter to describe the number of physical I/O pins that ﬁt in
the pitch of a logic block. This pitch was set to two for the new
benchmarks, since this is comparable to current technology, and to
an appropriate value for older ones. For the new benchmarks, VPR
was allowed to choose the smallest square logic block array that
ﬁt the I/O padframe or logic block demand. However, the older
benchmarks were consistently restricted to the original FPGA di-
mensions.
For global routing, VPR requires a logic block architecture
specification. A logic block identical to the one used previously was
specified: four functionally-equivalent input pins, one on each side,
and two electrically-equivalent output pins on the right and bottom
sides. VPR also allows a bend cost to be specified. We varied the
bend cost between 0 and 10 on a subset of the new benchmarks and
experimentally determined that a value of 1.25 gave the lowest
total $D_{\max}$ and the lowest total number of tracks required by SEGA
to route. Finally, VPR was restricted to route a net within 3 logic
blocks of a bounding box formed by the sources and sinks.
Prior to detailed routing, the VPR multipoint net format had to
be converted to a two-point net format for SEGA. To do this, a
vpr2sega tool was written. This program can operate in one of
two modes: doglegs and driverdoglegs. In doglegs mode, $G'_{dl}$ is
constructed as follows. A VPR net is read in and the distances
between all pins along the global route are computed and used as edge
weights in a complete graph spanning all the pins. A minimum
spanning tree (MST) is then constructed, starting at the source,
according to Prim’s algorithm. Each edge in the MST is converted
back to a two-point net that follows the global route and joins
the pins. In driverdoglegs mode, a two-point netlist $G'_{ddl}$ is
constructed between the driver and every sink. Although this
representation is not as concise as $G'_{dl}$, it implicitly instructs SEGA that
a net can connect to multiple track domains only at the driver pin.
Once the netlist is converted, SEGA is used with the Area cost
function to ﬁnd the minimum number of tracks required to route.
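The two conversion modes can be sketched as below. This is not the vpr2sega source: the function names, the nested-dictionary distance table and the pin names are assumptions made for illustration, and the distances stand in for lengths measured along the global route.

```python
import heapq


def doglegs_two_point_nets(source, dist):
    """Split one multipoint net into two-point nets along an MST (Prim).

    source: the driver pin.
    dist:   dict pin -> dict pin -> distance along the global route
            (symmetric and complete; a hypothetical input format).
    Returns a list of (pin_already_in_tree, newly_added_pin) pairs.
    """
    in_tree = {source}
    edges = []
    heap = [(d, source, p) for p, d in dist[source].items()]
    heapq.heapify(heap)
    while heap and len(in_tree) < len(dist):
        d, u, v = heapq.heappop(heap)
        if v in in_tree:
            continue  # stale entry: v was reached by a shorter edge earlier
        in_tree.add(v)
        edges.append((u, v))
        for w, dw in dist[v].items():
            if w not in in_tree:
                heapq.heappush(heap, (dw, v, w))
    return edges


def driver_doglegs_two_point_nets(source, dist):
    """Connect the driver directly to every sink."""
    return [(source, p) for p in dist if p != source]


dist = {
    "c0": {"c1": 2, "c2": 5},
    "c1": {"c0": 2, "c2": 3},
    "c2": {"c0": 5, "c1": 3},
}
dl = doglegs_two_point_nets("c0", dist)
ddl = driver_doglegs_two_point_nets("c0", dist)
```

With these distances the doglegs mode yields the pairs c0-c1 and c1-c2, while the driverdoglegs mode yields c0-c1 and c0-c2, so every two-point net in the latter has the driver as one endpoint.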
5.3. SEGA Netlist Analysis
To analyse the SEGA netlist for $D_{\max}$ and construct the
confronting graph, $H$, and its properties, a new tool, chandens, was
written. It reports the maximum channel density and computes
$\omega(H)$. Although computing $\omega(H)$ is known to be NP-hard in
general [9], we have employed a branch-and-bound scheme with
reasonable success. One of the most difficult benchmarks to evaluate
with chandens in this fashion was pdc, requiring about 60 CPU
hours on a 167MHz UltraSPARC. Most other benchmarks were
evaluated in a matter of seconds to minutes.
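A basic branch-and-bound search for the clique number can be sketched as follows. This is a textbook-style scheme written for illustration; the actual bounding and ordering rules inside chandens are not described in the text, so nothing here should be read as that tool's implementation.

```python
def clique_number(graph):
    """Return the size of a largest clique using simple branch and bound.

    graph: dict vertex -> set of adjacent vertices (undirected).
    """
    best = 0

    def expand(size, candidates):
        # 'candidates' are the vertices adjacent to every vertex chosen so far.
        nonlocal best
        if size > best:
            best = size
        for v in list(candidates):
            # Bound: even using every remaining candidate cannot beat 'best'.
            if size + len(candidates) <= best:
                return
            candidates = candidates - {v}
            expand(size + 1, candidates & graph[v])

    expand(0, set(graph))
    return best


example = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c"},
}
omega = clique_number(example)
```

The example graph is a triangle with one pendant vertex, so the search reports a clique of size three and prunes the remaining branches.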
Optionally, the chandens tool can also build the two-point
net versions of the confronting graph, $H'_{dl}$ and $H'_{ddl}$. Since these
graphs are generally less connected than $H$, this has still proven to
be computationally feasible for most benchmarks. However, due to
memory limitations we were unable to completely evaluate some of
the largest benchmarks. In these cases, the largest clique size found
at the time of failure, indicated by a † symbol, is used instead.
6. RESULTS AND ANALYSIS
6.1. Comparison to Previous Routers
The routing results for the older benchmark suite are shown in Ta-
ble 2. When the old ALTOR placement is used, the VPR/SEGA
combination routed all benchmarks with a total of 89 tracks, or
5 tracks fewer than IKMB. TRACER is the only router that
produced better results, using only 85 tracks. If the placement is
modified, the VPR/SEGA combination performed better than all
others, using 9 fewer tracks than SPLACE/SROUTE⁴ and 41 fewer
than FPR. It is unexpected that a two-step router would perform
as well as the combined routers. For these results, SEGA required
one routing track more than the minimum predicted by the clique
size and two more tracks than the minimum predicted by $D_{\max}$,
on average. In the worst case, SEGA required two tracks above the
clique size.
6.2. Results with New Benchmarks
The 198 new benchmarks were all placed and routed. The results
for the 20 largest benchmarks (ranging in size from 1046 to 8381
logic blocks each) are presented in Table 3. Note that some entries,
denoted with a † symbol, could not be exactly computed in a
reasonable time because of excessive memory demands by SEGA and
chandens. The VPR $D_{\max}$ column refers to the maximum channel
density, the old lower bound for detailed routing. The $\omega(H'_{dl})$
column shows the new lower bound for detailed routing with
doglegs, based on the clique number of the doglegs confronting graph.
From Table 3, it is clear that $\omega(H'_{dl})$ is often larger than $D_{\max}$ and
therefore provides a tighter bound for detailed routing. On average,
the clique number tightens the bound by 1.1 tracks.
The SEGA $G'_{dl}$ column shows the actual channel width required
by SEGA to route the benchmark with doglegs. On average, SEGA
requires two tracks above $\omega(H'_{dl})$ to route these large benchmarks,
or 3.1 tracks over $D_{\max}$. As a result, the Mapping Anomaly is not
significantly present when doglegs are permitted.
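As a quick check, the averages quoted above follow directly from the AVG. row of Table 3. The overbar notation for a column average is introduced here for illustration only; $D_{\max}$ denotes the maximum channel density, $\omega(H'_{dl})$ the clique number of the doglegs confronting graph, and $G'_{dl}$ the channel width SEGA needed with doglegs.

```latex
\begin{align*}
\overline{\omega(H'_{dl})} - \overline{D_{\max}} &= 8.7 - 7.6 = 1.1 \\
\overline{G'_{dl}} - \overline{\omega(H'_{dl})} &= 10.7 - 8.7 = 2.0 \\
\overline{G'_{dl}} - \overline{D_{\max}} &= 10.7 - 7.6 = 3.1
\end{align*}
```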
The $\omega(H'_{ddl})$ and SEGA $G'_{ddl}$ columns show the clique number
and channel width required to route with driver doglegs. Although
$\omega(H'_{ddl})$ could not be computed exactly for some circuits, it is
relatively unchanged from $\omega(H'_{dl})$. Despite this, SEGA requires 55%
more tracks than before to route these circuits. For some
benchmarks such as s298, driver doglegs is a considerable restriction
which requires 160% more routing tracks than before. However,
other benchmarks such as tseng were relatively unaffected.
Although the Mapping Anomaly is clearly not present in tseng, we
cannot tell whether it is present in s298. The poor performance by
SEGA may be caused by the Mapping Anomaly or by poor
heuristic behaviour. To prove that the Mapping Anomaly is not at fault,
we would need to find a valid colouring of s298 with just over 6
colours.
Lastly, the $\omega(H)$ column in Table 3 shows the clique number of
the confronting graph produced on the multipoint netlist. This
column represents the lower bound for routing if no doglegs are
permitted at all. Since these clique sizes are considerably larger than
$D_{\max}$, the Mapping Anomaly is present.
The data in Table 3 shows that the Mapping Anomaly has no
effect if doglegs are permitted in the architecture. Although the
driver dogleg restriction does not increase the clique size, it can-
not be said for certain whether the Mapping Anomaly is present.
However, it is strongly present if doglegs are not permitted at all.
In this case, the global router should attempt to compensate for the
‘confronting’ nets. One way to do this is to perform combined
routing, as previous routers have done, with the objective of
minimizing the final channel width. Another way would be to use a better
metric than channel density during global routing. For example, it
may be reasonable to compute the clique number (or an estimate)
of the confronting graph as global routing is done. The best way to
approach this problem is a topic for future research.
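As a rough illustration of the clique-number metric suggested above, the following sketch computes the clique number of a small conflict graph with the Bron-Kerbosch algorithm (with pivoting). The graph encoding is an assumption made for illustration only: vertices stand for nets, and an edge joins two nets that cannot share a track. The names `build_graph` and `clique_number` are hypothetical and are not taken from VPR or SEGA.

```python
from typing import Dict, Hashable, Iterable, Set, Tuple

Graph = Dict[Hashable, Set[Hashable]]


def build_graph(pairs: Iterable[Tuple[Hashable, Hashable]]) -> Graph:
    """Build an undirected adjacency map from conflicting net pairs (assumed input)."""
    graph: Graph = {}
    for a, b in pairs:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def clique_number(graph: Graph) -> int:
    """Return the size of a largest clique (Bron-Kerbosch with pivoting)."""
    best = 0

    def expand(r_size: int, p: Set[Hashable], x: Set[Hashable]) -> None:
        nonlocal best
        if not p and not x:
            # The current clique is maximal: record its size.
            best = max(best, r_size)
            return
        if r_size + len(p) <= best:
            # Even taking every candidate cannot beat the best clique so far.
            return
        # Pivot on the vertex with the most neighbours among the candidates.
        pivot = max(p | x, key=lambda v: len(graph[v] & p))
        for v in list(p - graph[pivot]):
            expand(r_size + 1, p & graph[v], x & graph[v])
            p.remove(v)
            x.add(v)

    expand(0, set(graph), set())
    return best


if __name__ == "__main__":
    # Hypothetical example: nets a, b, c conflict pairwise; d conflicts with c only.
    conflicts = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
    print(clique_number(build_graph(conflicts)))
```

Exact clique computation is exponential in the worst case, so a global router that evaluates this metric repeatedly would more plausibly rely on an estimate, as the text suggests.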
Note that among these routers, only SROUTE restricts doglegs to
drivers. This restriction places it at a disadvantage in comparison
to the others, yet it still performs well.
Table 2. Channel widths required to route older benchmarks. New results are in boldface.
Placement ALTOR SPLACE ALTOR VPR
Global R. LocusRoute FPR VPR
Detailed R. $D_{max}$ CGE SEGA GBP OGC IKMB TRACER SROUTE SEGA
9symml 9 9 9 9 9 8 6 7 7 9 7 6
alu2 10 12 10 11 9 9 9 9 8 10 8 7
alu4 13 15 13 14 12 11 11 12 9 13 10 8
apex7 13 13 13 11 10 10 8 9 6 9 10 5
example2 17 18 17 13 12 11 10 11 7 13 10 5
k2 16 19 16 17 16 15 14 15 11 17 14 10
term1 9 10 9 10 9 8 7 8 5 8 8 5
too_large 11 13 11 12 11 10 9 11 8 11 10 7
vda 14 14 14 13 11 12 11 12 10 13 12 9
TOTAL 112 123 112 110 99 94 85 94 71 103 89 62
Table 3. Channel widths using VPR and SEGA for placement, global and detailed routing of the 20 largest benchmarks.
(* approximate value; the result could not be computed exactly. $D_{max}$ is obtained from VPR; the SEGA($\cdot$) columns are SEGA channel widths.)
Circuit   $D_{max}$  $\omega(H'_{dl})$  SEGA($G'_{dl}$)  $\omega(H'_{ddl})$  SEGA($G'_{ddl}$)  $\omega(H)$
alu4      7          9                  10               9                   16                19
apex2     8          9                  11               10                  20                27
apex4     9          10                 12               10                  19                26
bigkey    6          7                  8                7                   9                 9
clma      9          *10                14               *10                 *24               30
des       6          7                  9                7                   11                11
diffeq    6          7                  9                7                   10                11
dsip      5          6                  7                6                   9                 9
elliptic  8          9                  11               *10                 16                20
ex1010    8          10                 11               *9                  22                29
ex5p      10         11                 13               11                  16                19
frisc     9          10                 13               *10                 18                15
misex3    8          9                  12               9                   17                19
pdc       11         12                 16               *12                 *31               44
s298      5          6                  7                6                   18                26
s38417    6          *7                 8                *7                  10                11
s38584.1  7          *8                 9                *8                  12                11
seq       8          10                 12               10                  18                24
spla      10         11                 14               *11                 26                38
tseng     6          6                  8                7                   9                 9
AVG.      7.6        *8.7               10.7             *8.8                *16.6             20.4
TOTAL     152        *174               214              *176                *331              407
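As a cross-check on Table 3, the sketch below tallies, for each of the 20 circuits, how many tracks SEGA($G'_{dl}$) needs above the clique-number bound $\omega(H'_{dl})$. The values are transcribed from the table; three of the $\omega(H'_{dl})$ entries (clma, s38417, s38584.1) are marked there as approximate. The names `TABLE3_DL` and `gap_histogram` are hypothetical and introduced only for this illustration.

```python
from collections import Counter
from typing import Dict, List, Tuple

# (circuit, omega(H'_dl), SEGA(G'_dl)) transcribed from Table 3.
TABLE3_DL: List[Tuple[str, int, int]] = [
    ("alu4", 9, 10), ("apex2", 9, 11), ("apex4", 10, 12), ("bigkey", 7, 8),
    ("clma", 10, 14), ("des", 7, 9), ("diffeq", 7, 9), ("dsip", 6, 7),
    ("elliptic", 9, 11), ("ex1010", 10, 11), ("ex5p", 11, 13),
    ("frisc", 10, 13), ("misex3", 9, 12), ("pdc", 12, 16), ("s298", 6, 7),
    ("s38417", 7, 8), ("s38584.1", 8, 9), ("seq", 10, 12), ("spla", 11, 14),
    ("tseng", 6, 8),
]


def gap_histogram(rows: List[Tuple[str, int, int]]) -> Dict[int, int]:
    """Count circuits by the number of tracks SEGA needs above the bound."""
    return dict(Counter(sega - omega for _, omega, sega in rows))


def mean_gap(rows: List[Tuple[str, int, int]]) -> float:
    """Average number of tracks above the clique-number bound."""
    return sum(sega - omega for _, omega, sega in rows) / len(rows)


if __name__ == "__main__":
    for gap, count in sorted(gap_histogram(TABLE3_DL).items()):
        print(f"{gap} track(s) above the bound: {count} circuit(s)")
    print(f"mean gap: {mean_gap(TABLE3_DL):.1f}")
```

For the rows as transcribed, the gap averages 2.0 tracks, which agrees with the difference between the AVG entries 10.7 and 8.7.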
6.3. Graphical Results
In the graphs on the following page, we show the same routing
results in a different fashion with all 198 benchmarks included.
The benchmarks are uniformly spread along the horizontal axis.
The vertical axis shows the channel width, in discrete steps, of the
routed circuits. All of the data could be presented in one graph,
but we chose instead to separate them for clarity. Because of this,
the vertical axis has different scales in the graphs. To further im-
prove clarity, we sort the order of the benchmarks differently in
each graph. This allows us to better illustrate trends in the data.
Figures 3 and 4 show how $\omega(H'_{dl})$ and $\omega(H'_{ddl})$,
respectively, form a tighter bound than $D_{max}$ for detailed
routing. In Figure 3, the SEGA result with doglegs is shown to be
very close to the lower bound given by $\omega(H'_{dl})$. In this
case, SEGA typically requires only one routing track above the
minimum to find a solution. The Mapping Anomaly is not present
because SEGA found a valid routing which is close to $D_{max}$. In
this graph, SEGA is exhibiting excellent behaviour.
The corresponding SEGA result for driver doglegs is shown in
Figure 4. Although many circuits require fewer than three routing
tracks above $\omega(H'_{ddl})$, a few require significantly more. In
these cases, since $\omega(H'_{ddl})$ is close to $D_{max}$, we are
not certain whether this is a result of the Mapping Anomaly or poor
heuristic behaviour.
In Figure 5 the clique sizes of $H$, $H'_{dl}$, and $H'_{ddl}$ are
compared. The graph indicates that the dogleg and driver dogleg
clique sizes are very similar, but the no-doglegs clique size,
$\omega(H)$, can grow very large. Since $D_{max}$ is lower than the
lowest line on this graph, the Mapping Anomaly must be strongly
present in the circuits on the left of the graph. As a result,
detailed routing without doglegs will use up many more tracks than
what is predicted by $D_{max}$, even if a perfect algorithm is used.
Although not shown, it is interesting to note that the channel width
from routing $G'_{ddl}$ with SEGA roughly follows (but usually
remains below) $\omega(H)$, even though there is no direct
relationship between them. We speculate that this may be caused by
the presence of the Mapping Anomaly in $H'_{ddl}$ that does not take
the form of a clique until some vertices are merged as in $H$.
(The few results which could not be properly computed are all
included in Table 3 and are approximated by the values shown there.)
7. CONCLUSIONS
The comparison to previous results has shown that a two-step global
and detailed router can be competitive with the one-step approaches
when routability is important. In fact, only TRACER was able to
provide a lower track count than the VPR and SEGA combination. This
result indicates that one should consider more than just routing in
the minimum number of tracks when deciding whether a one- or two-step
router is appropriate. Some of the other issues include
maintainability, design time, expected memory use and compute time,
partitioning of software development effort, circuit delay, and
result quality monitoring. This last point is interesting because the
global router and detailed router can be separately optimized, and
the progress of each can be recorded. Further, the clique number and
chromatic number of the confronting graphs serve as improved lower
bounds for detailed routing if certain architectural assumptions are
made.
The experimental results show that it is important to consider the
Mapping Anomaly in a global router if no doglegs are permitted, but
it is not important to do so if they are. If only driver doglegs are
allowed, we suggest that the Mapping Anomaly may be present, but we
do not have proof. Since this case is important for existing FPGAs,
more research is needed to confirm this.
One way for a global router to account for the Mapping Anomaly is to
perform a detailed route internally, hence becoming a one-step
router. However, another way is to estimate the chromatic number or
clique number as the global route is performed. Minimizing this new
number would be the new optimization goal for the global router. This
is an open problem for future research.
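One inexpensive way to estimate the chromatic number can be sketched as follows, assuming the confronting graph is available as an adjacency map in which vertices are nets and an edge joins two nets that cannot share a track (an assumed encoding). A greedy colouring in largest-degree-first order gives an upper bound on the chromatic number, and the colouring it returns can be checked for validity. The function names are hypothetical, and this is not the method used by any of the routers discussed here.

```python
from typing import Dict, Hashable, Iterable, Set, Tuple

Graph = Dict[Hashable, Set[Hashable]]


def build_graph(pairs: Iterable[Tuple[Hashable, Hashable]]) -> Graph:
    """Build an undirected adjacency map from conflicting net pairs (assumed input)."""
    graph: Graph = {}
    for a, b in pairs:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def greedy_colouring(graph: Graph) -> Dict[Hashable, int]:
    """Colour vertices in order of decreasing degree, using the lowest free colour."""
    order = sorted(graph, key=lambda v: len(graph[v]), reverse=True)
    colour: Dict[Hashable, int] = {}
    for v in order:
        used = {colour[u] for u in graph[v] if u in colour}
        c = 0
        while c in used:
            c += 1
        colour[v] = c
    return colour


def is_valid_colouring(graph: Graph, colour: Dict[Hashable, int]) -> bool:
    """Check that every vertex is coloured and no edge joins equal colours."""
    return all(
        v in colour and all(colour[v] != colour[u] for u in graph[v])
        for v in graph
    )


def colours_used(colour: Dict[Hashable, int]) -> int:
    """Number of distinct colours, an upper bound on the chromatic number."""
    return len(set(colour.values()))


if __name__ == "__main__":
    # Hypothetical example: five nets whose conflicts form a 5-cycle.
    ring = build_graph([(0, 1), (1, 2), (2, 3), (3, 4), (4, 0)])
    colouring = greedy_colouring(ring)
    print(colours_used(colouring), is_valid_colouring(ring, colouring))
```

Together with a clique-number lower bound, such an upper bound brackets the number of colours (tracks, under the paper's architectural assumptions) that the detailed router would need; the greedy result is not guaranteed to be optimal.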
Another way to interpret the routing data is that doglegs (at the
driver and input pins) may be very useful architectural features to
reduce the channel width required for routing. This raises another
topic of future interest: is it area-efficient to fully support
doglegs?
8. ACKNOWLEDGEMENTS
We wish to thank Vaughn Betz for his insightful comments and for
making the necessary modifications to VPR. Mike Hutton undertook the
painstaking effort of optimizing and mapping all of the benchmarks
used in this study. Mike Hutton and Steve Wilton also provided
valuable comments on an early version of this paper.
Figure 3. The number of tracks required by SEGA to route with full
doglegs is slightly higher than the lower bounds formed by
$\omega(H'_{dl})$ and $D_{max}$.
[Plot: Channel Width (0 to 16) versus Benchmark Circuits (0 to 200); curves: SEGA(G'dl), w(H'dl), Dmax.]
Figure 4. The gap between the number of tracks required by SEGA to
route with only driver doglegs and the lower bounds implied by
$\omega(H'_{ddl})$ and $D_{max}$ is more pronounced.
[Plot: Channel Width (0 to 32) versus Benchmark Circuits (0 to 200); curves: SEGA(G'ddl), w(H'ddl), Dmax.]
Figure 5. The lower bound required for routing without doglegs,
$\omega(H)$, is significantly higher than the bounds with doglegs and
driver doglegs, $\omega(H'_{dl})$ and $\omega(H'_{ddl})$,
respectively. This large difference in bounds shows that the Mapping
Anomaly is strongly present if doglegs are not permitted.
[Plot: Channel Width (0 to 45) versus Benchmark Circuits (0 to 200); curves: w(H), w(H'ddl), w(H'dl).]
REFERENCES
[1] M.J. Alexander, G. Robins, “New Performance-Driven FPGA Routing Algorithms,” Design Automation Conference, June 1995.
[2] M.J. Alexander, J.P. Cohoon, J.L. Ganley, G. Robins, “Performance-Oriented Placement and Routing for Field-Programmable Gate Arrays,” European Design Automation Conference, September 1995.
[3] V. Betz, J. Rose, “Directional Bias and Non-Uniformity in FPGA Global Routing Architectures,” IEEE/ACM International Conference on Computer-Aided Design, pp. 652–659, 1996.
[4] S. Brown, J. Rose, Z.G. Vranesic, “A Detailed Router for Field-Programmable Gate Arrays,” IEEE Transactions on Computer Aided Design, 11(5), pp. 620–628, May 1992.
[5] S. Brown, G. Lemieux, M. Khellah, “Segmented Routing for Speed-Performance and Routability in Field-Programmable Gate Arrays,” Journal of VLSI Design, 4(4), pp. 275–291, 1996.
[6] CAD Benchmarking Laboratory, North Carolina State University, LGSynth93 suite, http://www.cbl.ncsu.edu/www/
[7] C.-D. Chen, Y.-S. Lee, A.C.-H. Wu, Y.-L. Lin, “A Performance and Routability Driven Router for FPGAs Considering Path Delays,” IEEE Transactions on Computer-Aided Design, 14(3), pp. 371–374, March 1995.
[8] J. Cong, Y. Ding, “FlowMap: An Optimal Technology Mapping Algorithm for Delay Optimization in Lookup-Table Based FPGA Designs,” IEEE Transactions on Computer-Aided Design, pp. 1–12, January 1994.
[9] M.R. Garey, D.S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W.H. Freeman, New York, NY, 1979.
[10] Y.-S. Lee, A.C.-H. Wu, “A Performance and Routability Driven Router for FPGAs Considering Path Delays,” Design Automation Conference, June 1995.
[11] G. Lemieux, S. Brown, “A Detailed Router for Allocating Wire Segments in FPGAs,” ACM/SIGDA Physical Design Workshop, Lake Arrowhead, CA, pp. 215–226, April 1993.
[12] Lucent Technologies, Field-Programmable Gate Arrays Data Book, October 1996.
[13] L.E. McMurchie, C. Ebeling, “PathFinder: A Negotiation-Based Performance-Driven Router for FPGAs,” ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, pp. 111–117, February 1995.
[14] J.S. Rose, W.M. Snelgrove, Z.G. Vranesic, “ALTOR: An Automatic Standard Cell Layout Program,” Canadian Conference on Very Large Scale Integration, pp. 169–173, November 1985.
[15] J.S. Rose, “Parallel Global Routing for Standard Cells,” IEEE Transactions on Computer-Aided Design, 9(10), pp. 1085–1095, October 1990.
[16] J. Rose, S. Brown, “Flexibility of Interconnection Structures in Field-Programmable Gate Arrays,” IEEE Journal of Solid State Circuits, 26(3), pp. 277–282, March 1991.
[17] E.M. Sentovich et al., “SIS: A System for Sequential Circuit Analysis,” Technical Report No. UCB/ERL M92/41, University of California, Berkeley, 1992.
[18] B. Tseng, J. Rose, S. Brown, “Using Architectural and CAD Interactions to Improve FPGA Routing Architectures,” First International ACM/SIGDA Workshop on Field-Programmable Gate Arrays, pp. 3–8, February 1992.
[19] S.J.E. Wilton, Architectures and Algorithms for Field-Programmable Gate Arrays with Embedded Memories, Ph.D. Dissertation, University of Toronto, 1997. An on-line version of this dissertation may be found at http://www.ee.ubc.ca:80/home/staff/faculty/stevew/etc/www/.
[20] Y.-L. Wu, M. Marek-Sadowska, “An Efficient Router for 2-D Field-Programmable Gate Arrays,” European Design Automation Conference, pp. 412–416, Paris, 1994.
[21] Y.-L. Wu, M. Marek-Sadowska, “Orthogonal Greedy Coupling — A New Optimization Approach to 2-D FPGA Routing,” Design Automation Conference, June 1995.
[22] Xilinx, The Programmable Logic Data Book, 1994.