Accelerating FPGA routing through algorithmic enhancements and connection-aware parallelization by Zhou, Yun et al.
Accelerating FPGA Routing Through Algorithmic
Enhancements and Connection-aware Parallelization
YUN ZHOU, DRIES VERCRUYCE, and DIRK STROOBANDT, Ghent University, Belgium
Routing is a crucial step in Field Programmable Gate Array (FPGA) physical design, as it determines the routes
of signals in the circuit, which impacts the design implementation quality significantly. It can be very time-
consuming to successfully route all the signals of large circuits that utilize many FPGA resources. Attempts
have been made to shorten the routing runtime for efficient design exploration while expecting high-quality
implementations. In this work, we elaborate on the connection-based routing strategy and algorithmic
enhancements to improve the serial FPGA routing. We also explore a recursive partitioning-based parallelization
technique to further accelerate the routing process. To exploit more parallelism by a finer granularity in both
spatial partitioning and routing, a connection-aware routing bounding box model is proposed for the source-
sink connections of the nets. It is built upon the location information of each connection’s source, sink, and
the geometric center of the net that the connection belongs to, different from the existing net-based routing
bounding box that covers all the pins of the entire net. We show that the proposed connection-aware
routing bounding box is more beneficial for parallel routing than the existing net-based routing bounding box.
The quality and runtime of the serial and multi-threaded routers are compared to the router in VPR 7.0.7. The
large heterogeneous Titan23 designs that are targeted to a detailed representation of the Stratix IV FPGA are
used for benchmarking. With eight threads, the parallel router using the connection-aware routing bounding
box model reaches a speedup of 6.1× over the serial router in VPR 7.0.7, which is 1.24× faster than the one
using the existing net-based routing bounding box model, while reducing the total wire-length by 10% and
the critical path delay by 7%.
CCS Concepts: • Hardware → Electronic design automation; Physical design (EDA); Wire routing;
Additional Key Words and Phrases: FPGA routing, timing-driven, connection-based routing, algorithmic
enhancements, connection-aware parallelization, routing bounding box model, partitioning-based
ACM Reference format:
Yun Zhou, Dries Vercruyce, and Dirk Stroobandt. 2020. Accelerating FPGA Routing Through Algorithmic
Enhancements and Connection-aware Parallelization. ACM Trans. Reconfigurable Technol. Syst. 13, 4, Article
18 (August 2020), 26 pages.
https://doi.org/10.1145/3406959
This work was funded by CSC (China Scholarship Council) and co-funded by BOF (Special Research Fund) at Ghent Univer-
sity. The computational resources (Stevin Supercomputer Infrastructure) and services used in this work were provided by
the VSC (Flemish Supercomputer Center), funded by Ghent University, FWO, and the Flemish Government—Department
EWI.
Authors’ addresses: Y. Zhou, D. Vercruyce, and D. Stroobandt, Ghent University, Technologiepark-Zwijnaarde 126, Ghent,
Flanders, 9000, Belgium; emails: {Yun.Zhou, Dries.Vercruyce, Dirk.Stroobandt}@ugent.be.
© 2020 Association for Computing Machinery.
ACM Transactions on Reconfigurable Technology and Systems, Vol. 13, No. 4, Article 18. Pub. date: August 2020.
1 INTRODUCTION
Field Programmable Gate Arrays (FPGAs) are integrated circuits that can be (re)programmed after
fabrication to implement digital designs as their functionalities are not fixed during the production
process. For this purpose, the FPGA fabric consists of a large number of programmable logic blocks,
which can each implement a small amount of digital logic, and programmable routing resources
that allow the logic block inputs and outputs to be connected to form larger circuits.
The FPGA computer-aided design (CAD) flow, including synthesis, technology mapping, packing,
placement, and routing, translates the description of a digital circuit in a hardware description
language to an FPGA configuration bitstream. Routing is an important part of the CAD flow, as
the delay of a circuit implemented in an FPGA is mostly due to routing delays, rather than logic
block delays, and most of an FPGA’s area is devoted to programmable routing [2]. Key metrics
of an efficient FPGA router involve fast runtime and high-quality configurations in terms of the
total wirelength and critical path delay of the circuit. The configuration should efficiently use the
available resources while minimizing the total wirelength and critical path delay.
In the routing phase, the programmable routing architecture of an FPGA is usually modeled
as a routing resource graph (RRG). Given the RRG of a target FPGA device and the netlist of a
placed circuit, an FPGA router works out legal routes for each net (i.e., wires transporting a signal
between a source and one or more sinks) of the circuit. This task is equivalent to the NP-complete problem
of finding disjoint routing trees in the graph, which is time-consuming. With the increasing size
of FPGAs and the circuits, the routing runtime becomes prohibitively large.
There are several approaches to accelerate the routing process. One way is to improve the serial
routing solution. A combined approach of the FPGA architecture and the routing algorithm has
been presented in Reference [5]. Shifting from rerouting congested nets to only rerouting con-
gested source-sink connections of the nets is a routing strategy change that has been proved to
be helpful [20, 22, 23]. Thanks to today’s multi-core commodity processors and multi-processor
workstations that offer significant computing runtime reduction potential through parallelization,
the other promising direction to address the FPGA routing runtime problem is to parallelize the
routing process [4, 6, 7, 11, 12, 14–17, 24, 25]. Among those parallel solutions, partitioning-based
parallel routers [4, 6, 7, 14, 24] have recently gained popularity. In general, those parallel routers
first do spatial partitioning to categorize nets into sets by their bounding boxes that at least enclose
all the pins of the nets. The spatial partitioning is followed by a parallel routing procedure taking
advantage of the fact that nets fitting entirely in one region have no resource competition with
those in other regions.
In this work, we elaborate on the connection-based routing strategy and the algorithmic en-
hancements in our previously published work CRoute [22] to improve the serial routing. We also
explore a recursive partitioning-based parallelization technique to further accelerate the routing
process. To exploit more parallelism by a finer granularity in both spatial partitioning and routing,
a connection-aware routing bounding box model is proposed for the source-sink connections of
nets. It is built upon the location information of each connection’s source, sink and the geomet-
ric center of the net that the connection belongs to, different from the existing net-based routing
bounding box that covers all the pins of the entire net. We show that the proposed connection-aware
routing bounding box is more beneficial for parallel routing than the existing net-based
routing bounding box. The quality and runtime of the serial and multi-threaded routers are com-
pared to the router in VPR 7.0.7. The large heterogeneous Titan23 designs that are targeted to a
detailed representation of the Stratix IV FPGA are used for benchmarking. With eight threads, the
parallel router using the connection-aware routing bounding box model reaches a speedup of 6.1×
over the serial router in VPR 7.0.7, which is 1.24× faster than the one using the existing net-based
routing bounding box model, while reducing the total wire-length by 10% and the critical path
delay by 7%.
Overall, this journal version of CRoute includes the following new material over the conference
paper [22]: a new connection-aware routing bounding box model, connection-aware recursive partitioning
and routing, and a new experimental study on the comparison between parallel versions
of CRoute using the two methods of connection-aware partitioning and net-based partitioning.
The remaining part of this article is organized as follows. Section 2 provides the relevant back-
ground of the FPGA routing problem and connection-based FPGA routing. Algorithmic enhance-
ments to the connection-based routing including the adapted wirelength-driven cost, a timing-
driven version of the connection-based router, and the new connection-aware routing bounding
box model are presented in Section 3, with experimental results given in Section 4. Section 5 de-
scribes the proposed connection-aware parallelization using the new routing bounding box model
for partitioning and routing, with the corresponding experimental study presented in Section 6.
Improvements on the parallel routing and the experimental results are included in Section 7. Fi-
nally, conclusions are drawn in Section 8.
2 BACKGROUND
2.1 FPGA Routing
To deal with the routing problem, the routing architecture of a target FPGA is usually modeled
as a routing resource graph (RRG). In this way, the routing problem of each net in the circuit
is reduced to finding a subgraph of the RRG, called a routing tree. The routing trees of the nets
should be disjoint to avoid short circuits. Each routing tree contains the source and sink nodes of
its associated net, as well as enough wire nodes so that source and sink nodes are connected.
FPGA routers are typically based on the negotiated routing algorithm PathFinder [10], which
balances the competing goals of eliminating congestion and minimizing the delay of critical paths
in an iterative framework. The emphasis of the approach is to adjust the costs of routing re-
sources in a gradual, semi-equilibrium fashion to achieve an optimum distribution of the routing
resources [10]. It allows nets to share resources initially, but subsequently determines which net
gets to keep the shared resource through a negotiation among nets. Connections within the same
net can legally share routing resources so that the total wirelength can be reduced. The nets are
ripped up and rerouted every iteration until no routing resources are illegally shared.
2.2 Connection-based Routing Strategy
Traditional PathFinder-based routing algorithms rip up and reroute all nets every iteration, even
when some nets or parts of the nets may have already been legally routed, which is not necessarily
an efficient routing strategy. Since each net can be regarded as a set of source-sink connections,
the FPGA routing problem definition can also be expressed in terms of connections [20]. Thus,
the routing problem can be simplified to find a single path in the RRG for each connection in the
circuit. Each path starts at the source node and ends at the sink node of its associated connection.
These paths should only share nodes if the corresponding connections have the same source. The
connection-based router partially rips up and reroutes a net by rerouting only its congested connections,
instead of rerouting entire congested nets (including their uncongested connections), as is done
in the router of VPR [8, 9].
2.2.1 Routing a Connection. A connection is routed by expanding nodes starting from its
source. In each expansion step starting from a specific node, which is taken as the current node
n, all its downstream neighbors are explored. The downstream neighbors that have not been ex-
plored previously and those that have been explored but can be granted a smaller estimated path
cost in this step are chosen as the candidate next-hop nodes. The candidate next-hop nodes are
expanded in the same way until the sink of the connection is reached. The path cost of the current
node $f(n)$ consists of three parts: an upstream path cost $c_{prev}(n)$, the congestion cost $c(n)$, and an
expected cost from itself to reach the target sink $c_{exp}(n)$ multiplied by a direction factor α:
$$f(n) = c_{prev}(n) + c(n) + \alpha \cdot c_{exp}(n). \tag{1}$$
The upstream path cost $c_{prev}(n)$ is the sum of the congestion costs of all route nodes along
the upstream path from the current node to the source. The congestion cost $c(n)$ of a node is
the product of its base cost $b(n)$, the present congestion penalty $p(n)$, and a historical congestion
penalty $h(n)$. It is divided by a sharing factor $\mathit{share}(n)$, because multiple connections in a net can
share the same node, as explained in Section 2.2.2:
$$c(n) = \frac{b(n) \cdot p(n) \cdot h(n)}{1 + \mathit{share}(n)}. \tag{2}$$
The expected cost $c_{exp}(n)$ enables a directed search to the target sink of a connection. Instead of
expanding the node with the lowest congestion cost $c(n)$, which was done in the first routability-driven
routers targeted to small FPGAs [1], the node that leads to the lowest path cost is expanded [18].
This results in a narrow wavefront that expands in the direction of the target sink,
controlled by the direction factor α. The direction factor determines how aggressively the router
explores toward the target sink.
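The directed search described above can be illustrated with a minimal Python sketch. It is not the CRoute implementation: the toy graph, the node names, and the cost values are hypothetical, and the priority of a node is taken as $c_{prev}(n) + c(n) + \alpha \cdot c_{exp}(n)$ following Equation (1), with $c(n)$ computed as in Equation (2).

```python
import heapq

def congestion_cost(base, present, historical, share):
    """Node cost c(n) as in Equation (2)."""
    return base * present * historical / (1 + share)

def route_connection(graph, node_cost, expected_cost, source, sink, alpha=1.0):
    """Directed search from source to sink on a toy routing resource graph.

    graph:         dict node -> list of downstream neighbors (hypothetical)
    node_cost:     dict node -> c(n)
    expected_cost: dict node -> c_exp(n)
    Returns (path, path_cost); path is None if the sink is unreachable.
    """
    # best_cost[n] is the lowest known c_prev(n) + c(n)
    best_cost = {source: node_cost[source]}
    parent = {source: None}
    queue = [(best_cost[source] + alpha * expected_cost[source], source)]
    while queue:
        _, n = heapq.heappop(queue)
        if n == sink:
            path = []
            while n is not None:
                path.append(n)
                n = parent[n]
            return path[::-1], best_cost[sink]
        for m in graph.get(n, []):
            cost = best_cost[n] + node_cost[m]
            # Explore unseen neighbors and neighbors that now get a lower path cost
            if m not in best_cost or cost < best_cost[m]:
                best_cost[m] = cost
                parent[m] = n
                heapq.heappush(queue, (cost + alpha * expected_cost[m], m))
    return None, float("inf")

# Toy example: wire "b" is congested (present penalty 3), wire "a" is free.
graph = {"src": ["a", "b"], "a": ["sink"], "b": ["sink"], "sink": []}
node_cost = {
    "src": congestion_cost(1, 1, 1, 0),
    "a": congestion_cost(1, 1, 1, 0),
    "b": congestion_cost(1, 3, 1, 0),
    "sink": congestion_cost(1, 1, 1, 0),
}
expected = {"src": 2, "a": 1, "b": 1, "sink": 0}
path, cost = route_connection(graph, node_cost, expected, "src", "sink", alpha=1.0)
print(path, cost)
```

In this toy case the search avoids the congested wire and returns the path through "a". A larger α would weight the expected cost more strongly and narrow the wavefront further.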
The expected cost $c_{exp}(n)$ is the sum of the estimated wire segment cost, the base cost of the
sink $b(\mathit{sink})$, and the base cost of the sink’s input pin $b(\mathit{ipin})$. The estimated wire segments required
from the current node to the sink are split up into the wire segments in the same direction
as the current node and wire segments in the orthogonal direction of the current node [20]. Consequently,
the estimated wire segment cost consists of a wire segment cost in each of the two
directions. The wire segment cost in each direction is the product of an estimated number of wire
segments ($n_{seg,same}$ or $n_{seg,ortho}$) that are required from the current node to the sink and the base
cost ($b_{same}$ or $b_{ortho}$) in that direction, divided by a sharing factor. Equation (3) shows how the
expected cost is calculated:
$$c_{exp}(n) = \frac{n_{seg,same} \cdot b_{same}}{1 + \mathit{share}(n)} + \frac{n_{seg,ortho} \cdot b_{ortho}}{1 + \mathit{share}(n)} + b(\mathit{ipin}) + b(\mathit{sink}). \tag{3}$$
The wire segment costs in both directions are divided by the sharing factor so that the
expected cost heuristic in the connection router remains admissible for an A* search [20].
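As a small worked instance of Equation (3), the following Python sketch evaluates the expected cost for a node with and without sharing. The function name and all numeric values are hypothetical and chosen only for illustration.

```python
def expected_cost_segments(n_seg_same, n_seg_ortho, b_same, b_ortho,
                           b_ipin, b_sink, share):
    """Expected cost c_exp(n) following Equation (3)."""
    wire = (n_seg_same * b_same + n_seg_ortho * b_ortho) / (1 + share)
    return wire + b_ipin + b_sink

# Three segments in the same direction and two in the orthogonal direction.
unshared = expected_cost_segments(3, 2, 1.0, 1.0, 0.5, 0.5, share=0)
shared = expected_cost_segments(3, 2, 1.0, 1.0, 0.5, 0.5, share=1)
print(unshared, shared)
```

The shared case yields a lower estimate because only the wire segment terms are scaled by the sharing factor; the input pin and sink base costs are not.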
2.2.2 Negotiated Sharing Mechanism. Connections can legally share routing nodes if they are
driven by the same source. For this reason, the cost of a node for a connection should be lower
in case it is already being used by other connections in the same net. It cannot be zero, because
that would force the router to explore these nodes. Instead, the cost of a node in a connection is
divided by a sharing factor $\mathit{share}(n)$, which is equal to the number of connections through this
node that share the same source. The reason lies in the fact that the cost of a connection is the sum
of the base costs of the nodes that realize the connection. If a node is shared between a number of
connections driven by the same source, then the cost of that node has to be shared equally by all
connections using it. By ripping up a connection at a time, the rest of the routing tree remains and
influences the cost of the nodes through the $\mathit{share}(n)$ division. This effectively encourages sharing
routing resources over multiple PathFinder routing iterations.
Figure 1(a) depicts a suboptimal routing tree of a net with two sinks obtained after the first
PathFinder routing iteration. The generation of the suboptimal routing tree is mainly due to the
fact that there are a large number of possible equivalent shortest paths for a single source-sink
connection. For example, connection source-sink0 has 20 equivalent paths with a minimum cost
of 6 wire segments from which one is arbitrarily chosen. In the case of Figure 1(a), the path of
source-sink0 does not share any resource with connection source-sink1. However, there are other
possibilities that would enable more routing resource sharing and result in a routing tree closer
to a minimum Steiner tree, as shown in Figure 1(b). This non-sharing problem worsens if the
Manhattan distance between the source and sink of a connection increases, as the number of
equivalent shortest paths increases exponentially.
Fig. 1. An example of (a) a suboptimal and (b) an optimal routing solution for a net with two sinks.
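The growth in the number of equivalent shortest paths can be checked with a short Python sketch. It assumes an idealized grid of unit-length wire segments, on which every shortest path is a monotone lattice path, and it assumes that the source-sink0 connection of the example spans an offset of 3 by 3 segments (an offset consistent with 6 wire segments and 20 paths, but not stated explicitly in the text).

```python
from math import comb

def equivalent_shortest_paths(dx, dy):
    """Number of monotone lattice paths for a Manhattan offset of (dx, dy)."""
    return comb(dx + dy, dx)

# Square offsets of growing size: the count grows rapidly with the distance.
counts = [equivalent_shortest_paths(d, d) for d in (1, 2, 3, 4, 5)]
print(counts)
```

Under these assumptions an offset of 3 by 3 gives 20 paths, and an offset of 5 by 5 already gives 252, which illustrates why an arbitrarily chosen shortest path is unlikely to overlap with the other connections of the net.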
The negotiated sharing mechanism is intended to alleviate this problem. However, the negotiated
sharing mechanism, even with the sharing factor introduced, only works well if a connection is
routed when other connections from the same net have already been routed. When the router is routing the
first connection of a net, it has no information about the location of the other sinks of the net. The negotiated
sharing mechanism is further improved in Section 3.1.3 by introducing a bias cost relative to the
geometric center of the net.
2.2.3 Routing Schedule. The present congestion penalty $p(n)$ of a node is updated following
Equation (4) whenever a connection is rerouted. Its value is based on the node’s capacity
$\mathit{cap}(n)$ and the occupancy $\mathit{occ}(n)$. The occupancy is the number of nets that are currently using
the node. It is thus equal to one if multiple connections in a net share the same node. The factor
$p_f$ is used to increase the illegal sharing cost as the algorithm progresses:
$$p(n) = \begin{cases} 1, & \mathit{cap}(n) > \mathit{occ}(n), \\ 1 + p_f \, (\mathit{occ}(n) - \mathit{cap}(n) + 1), & \text{otherwise}. \end{cases} \tag{4}$$
The historical congestion penalty $h(n)$ is updated after every PathFinder routing iteration,
as given by Equation (5). The impact of $h(n)$ on the total resource cost is controlled by the
factor $h_f$:
$$h_i(n) = \begin{cases} 1, & i = 1, \\ h_{i-1}(n), & \mathit{cap}(n) \geq \mathit{occ}(n), \\ h_{i-1}(n) + h_f \, (\mathit{occ}(n) - \mathit{cap}(n)), & \text{otherwise}. \end{cases} \tag{5}$$
The way the congestion factors $p_f$ and $h_f$ change as the algorithm progresses is called the
routing schedule.
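The two penalty updates can be written down directly from Equations (4) and (5). The Python sketch below is illustrative only: the function names, the factor values, and the occupancy sequence are hypothetical.

```python
def present_penalty(cap, occ, pf):
    """Present congestion penalty p(n), Equation (4)."""
    if cap > occ:
        return 1.0
    return 1.0 + pf * (occ - cap + 1)

def historical_penalty(h_prev, cap, occ, hf):
    """Historical congestion penalty h_i(n) for i > 1, Equation (5)."""
    if cap >= occ:
        return h_prev
    return h_prev + hf * (occ - cap)

# A node with capacity 1 that is overused in iteration 2 and legal in iteration 3.
cap, hf = 1, 1.0
h = 1.0                      # h_1(n) = 1
history = [h]
for occ in (2, 1):           # occupancy at the end of iterations 2 and 3
    h = historical_penalty(h, cap, occ, hf)
    history.append(h)
print(history)
print(present_penalty(cap, 0, 0.5), present_penalty(cap, 1, 0.5), present_penalty(cap, 2, 0.5))
```

Note that, following Equation (4), a node that is exactly at capacity already receives a present penalty above one, since one more net using it would make it overused. The historical penalty, in contrast, never decreases once a node has been overused.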
3 ALGORITHMIC ENHANCEMENTS
The aforementioned connection-based routing focuses more on the routability than on the timing.
Corresponding cost functions given by Equations (1)–(3) are wirelength-driven. In this section,
we first discuss the enhancements to the connection-based routing algorithm by adapting the
wirelength-driven cost. It is then extended to a timing-driven version for simultaneously optimizing
the wirelength and timing. A connection-aware routing bounding box model is also introduced.
3.1 Adapted Wirelength-driven Cost
The aim of adapting the wirelength-driven cost is to deal with the heterogeneity of modern FPGAs
and to improve the quality of results. First, the wirelength-driven cost functions are adapted to cope
with a heterogeneous architecture that contains multiple wire segment types. Second, a bias cost
is used to improve the negotiated sharing mechanism.
3.1.1 Adapted Base Cost of the Wire Segments. Modern FPGA architectures have routing net-
works with multiple wire segment types [13]. Short wires enable the routability-driven routing of
short connections, while long wires are added to improve the delay of the necessary long connections
and hence improve the maximum clock frequency. The initial version of the connection-based
router uses the same base cost for all wire segment types. However, if wire segments with different
lengths have the same cost, then the router is unaware of the fact that using a long wire segment
for a short connection will result in a large overhead from the unused part of the wire. Long wires
should have a larger cost to ensure that short wires are used if this reduces the wire-length with-
out influencing the maximum delay. The base cost of the wire segments is therefore adapted to its
actual length.
3.1.2 Expected Distance to the Sink of a Connection. The expected remaining cost to the sink of
a connection in Equation (3) is based on an estimated number of wire segments that are required to
reach the sink. However, in case there are multiple wire segment types available, it is not possible
to provide a good estimation as the length of the used wire segments is not known. Therefore, we
do not estimate the number of wire segments, but use an estimation of the total wire-length instead.
The estimation is based on the Manhattan distance from the current node to the sink. It is split
up into a part in the same direction as the current node ($\delta_{same}$) and a part in the orthogonal
direction ($\delta_{ortho}$), similar to Equation (3). Each distance is multiplied
by an average cost per distance in that direction. The cost per distance ($\bar{c}_{same}$ and $\bar{c}_{ortho}$) is the
average of a unit distance cost over all wire segment types, taking into account the number of
wire segments of each type:
$$c_{exp}(n) = \frac{\delta_{same} \cdot \bar{c}_{same}}{1 + \mathit{share}(n)} + \frac{\delta_{ortho} \cdot \bar{c}_{ortho}}{1 + \mathit{share}(n)} + b(\mathit{ipin}) + b(\mathit{sink}). \tag{6}$$
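A possible reading of the average cost per distance is sketched below in Python. The text does not spell out the exact weighting, so two assumptions are made here: the unit distance cost of a segment type is its base cost divided by its length, and the average is weighted by the number of segments of each type. The segment types and values are hypothetical.

```python
def average_cost_per_distance(segment_types):
    """Count-weighted average unit distance cost for one direction.

    segment_types: list of (length, base_cost, count) tuples (hypothetical).
    Assumes unit cost = base_cost / length, weighted by segment count.
    """
    total = sum(count for _, _, count in segment_types)
    weighted = sum(count * base_cost / length
                   for length, base_cost, count in segment_types)
    return weighted / total

def expected_cost_distance(delta_same, delta_ortho, c_same, c_ortho,
                           b_ipin, b_sink, share):
    """Expected cost c_exp(n) following Equation (6)."""
    wire = (delta_same * c_same + delta_ortho * c_ortho) / (1 + share)
    return wire + b_ipin + b_sink

# Two segment types: three length-4 wires per long length-16 wire.
types = [(4, 4.0, 3), (16, 8.0, 1)]
c_avg = average_cost_per_distance(types)
c_exp = expected_cost_distance(8, 4, c_avg, c_avg, 0.5, 0.5, share=0)
print(c_avg, c_exp)
```

With these inputs the estimate no longer depends on guessing which segment type will be used, only on the remaining Manhattan distance in each direction.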
3.1.3 Bias Cost. As discussed in Section 2.2.2, the negotiated sharing mechanism only works
if one of the other connections is routed on a part of one of the shortest paths. When the router
is routing the first connection of a net with Dijkstra, it is clueless about the location of the other
sinks of the net. To help the router with initially choosing a good path from the equivalent shortest
paths, a bias cost is added toward the geometric center of the net, formulated in Equation (7) [19]:
$$c(n) = \frac{b(n) \cdot p(n) \cdot h(n)}{1 + \mathit{share}(n)} + c_{bias}(n), \qquad c_{bias}(n) = \frac{b(n)}{2 \cdot \mathit{fanout}} \cdot \frac{\delta_{m,c}}{\mathit{HPWL}}. \tag{7}$$
The bias cost must have a smaller influence than the wire cost as it is only meant to be a
tiebreaker. The minimum cost of a node is $b(n)/\mathit{fanout}$ in case a wire is shared by all of the connections
in a net. The bias cost will thus maximally be half of the minimum wire cost. The bias cost
depends on the Manhattan distance to the geometric center of the net ($\delta_{m,c}$), which is normalized
against its half perimeter wire-length (HPWL). As the search of routing a connection closes in on
the geometric center, the effect of the bias cost reduces. With the negotiated congestion mecha-
nism, the cost of the nodes (Equation (2)) can only increase. So the effect of the bias cost becomes
smaller toward the later routing iterations.
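The bound on the bias cost can be verified numerically with a short Python sketch of Equation (7). The values are hypothetical, and the bound is checked under the assumption that the Manhattan distance to the geometric center does not exceed the HPWL of the net.

```python
def bias_cost(base, fanout, dist_to_center, hpwl):
    """Bias cost c_bias(n) following Equation (7)."""
    return base / (2 * fanout) * dist_to_center / hpwl

def node_cost_with_bias(base, present, historical, share,
                        fanout, dist_to_center, hpwl):
    """Node cost c(n) including the bias term, Equation (7)."""
    congestion = base * present * historical / (1 + share)
    return congestion + bias_cost(base, fanout, dist_to_center, hpwl)

base, fanout, hpwl = 1.0, 4, 10
min_wire_cost = base / fanout                 # wire shared by all connections
at_center = bias_cost(base, fanout, 0, hpwl)
halfway = bias_cost(base, fanout, 5, hpwl)
farthest = bias_cost(base, fanout, hpwl, hpwl)
print(at_center, halfway, farthest, min_wire_cost)
```

In this example the bias cost ranges from zero at the geometric center to half of the minimum wire cost at a distance equal to the HPWL, so it can break ties between equivalent paths without outweighing the cost of a wire.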
3.2 Timing-driven Connection Router
The connection-based routing principle is extended with a timing-driven implementation to op-
timize for minimum wire-length and maximum clock frequency simultaneously. A criticality is
assigned to all connections in the design during a static timing analysis in each PathFinder routing
iteration, following Equation (8). The criticality of a connection determines if it should be routed
with minimum delay, in case the criticality is large, or with minimum wire-length, if the criticality
is low:
$$f_{crit} = \min\left[\left(1 - \frac{\mathit{slack}}{D_{max}}\right)^{\phi}, \; f_{crit,max}\right]. \tag{8}$$
Equation (9) shows the calculation of the slack of a connection based on the connection’s delay
($T_{del}$), the arrival time of the source ($T_{arr}$), and the required time of the sink ($T_{req}$). The arrival and
required times are calculated in a forward and backward traversal of the timing graph, respectively:
$$\mathit{slack} = T_{req} - T_{arr} - T_{del}. \tag{9}$$
The forward traversal calculates the arrival time of all nodes in the timing graph and gives us the
maximum delay in the circuit ($D_{max}$). This maximum delay is set as the required time of the timing
path leaf nodes in the backward traversal. The slacks are normalized to $D_{max}$ in Equation (8). As a
result, the criticality is between 0 and 1 for all the connections in the design. It is larger as the delay
of a connection is more critical and is capped at $f_{crit,max}$ to prevent deadlock in case a congested
wire is occupied by several critical connections. Typically, $f_{crit,max}$ is equal to 0.99.
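The slack and criticality computation of Equations (8) and (9) can be sketched in Python on a toy timing graph. This is a simplified illustration, not the timing analyzer used in the paper: it assumes a single clock domain, treats every timing edge as a connection, and uses hypothetical node names and delays.

```python
def connection_criticalities(nodes, edges, phi=3.0, f_crit_max=0.99):
    """Criticality per connection following Equations (8) and (9).

    nodes: list of timing graph nodes in topological order (hypothetical)
    edges: dict (u, v) -> delay T_del of the connection from u to v
    """
    # Forward traversal: arrival time of every node
    arrival = {n: 0.0 for n in nodes}
    for v in nodes:
        for (u, w), delay in edges.items():
            if w == v:
                arrival[v] = max(arrival[v], arrival[u] + delay)
    d_max = max(arrival.values())

    # Backward traversal: leaf nodes keep D_max as their required time
    required = {n: d_max for n in nodes}
    for u in reversed(nodes):
        for (x, v), delay in edges.items():
            if x == u:
                required[u] = min(required[u], required[v] - delay)

    crit = {}
    for (u, v), delay in edges.items():
        slack = required[v] - arrival[u] - delay          # Equation (9)
        crit[(u, v)] = min((1 - slack / d_max) ** phi, f_crit_max)  # Equation (8)
    return crit

# Two paths from "a" to "d": a slow one through "b" and a fast one through "c".
nodes = ["a", "b", "c", "d"]
edges = {("a", "b"): 2.0, ("b", "d"): 2.0, ("a", "c"): 1.0, ("c", "d"): 1.0}
crit = connection_criticalities(nodes, edges)
print(crit)
```

The connections on the slow path have zero slack and are capped at 0.99, while the connections on the fast path have a slack of half the maximum delay and, with an exponent of 3, a criticality of only 0.125.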
In case a design has multiple clock domains, the traversals are repeated for each clock domain
separately [8]. Paths between two clock domains are cut to ensure that the router optimizes for
each clock domain separately [13]. The benchmark I/Os are constrained to a virtual I/O clock. To
ensure that the router cannot unrealistically ignore I/O timing, the paths between the netlist clock
domains and the I/O domain are included [13].
3.2.1 Buffered Routing Switches. The current implementation of the timing-driven router is
designed for architectures with wire segments that are driven by buffered routing switches. A more
detailed timing analysis is required to allow the routing of architectures with pass transistors. In such
architectures, the capacitance and resistance seen by a wire depend on the downstream capacitance and
upstream resistance of the routing resources along the connection. In a buffered architecture, the delay
of a wire is only dependent on its own resistance, capacitance, and driving route switch. This
simplification in the first timing-driven version is acceptable as many architectures are buffered.
The detailed representation of the Stratix IV FPGA in the Titan23 design suite [13] is fully buffered,
as is the flagship architecture in the VTR framework [8].
3.2.2 Timing-driven Cost Function. The timing-driven router uses an adapted node cost $c(n)$
and an adapted expected cost $c_{exp}(n)$ to ensure that critical connections focus more on reducing
delay than on resolving congestion, which are expressed in Equations (10) and (11), respectively.
The adapted node cost is the sum of a wire-length-driven cost and a timing-driven cost. The
wire-length-driven cost of a node is set to its congestion cost given by Equation (2). The timing-driven
cost is equal to the delay of that routing resource. The relative importance of the wire-length-driven
and the timing-driven part is determined by the criticality of the connection:
$$c(n) = (1 - f_{crit}) \cdot c_{wld}(n) + f_{crit} \cdot T_{del}, \tag{10}$$
$$c_{exp}(n) = (1 - f_{crit}) \cdot \alpha \cdot c_{exp,wld}(n) + f_{crit} \cdot \beta \cdot c_{exp,td}(n), \tag{11}$$
$$c_{exp,td}(n) = \delta_{same} \cdot \bar{T}_{same} + \delta_{ortho} \cdot \bar{T}_{ortho}. \tag{12}$$
Fig. 2. Runtime-quality trade-off for (a) the wire-length-driven direction factor α and (b) the timing-driven
direction factor β. The results are the average over the Titan23 benchmark designs.
The adapted expected cost $c_{exp}(n)$ consists of two parts: a wire-length-driven part adapted from
Equation (6) of the wire-length-driven router, and a timing-driven part, which is based on an estimation
of the remaining delay to the sink calculated by Equation (12). The parameters α and
β are the wire-length-driven and timing-driven direction factors that are further explained in the
next subsection. The estimated delay to the sink is equal to the estimated distance to that node,
multiplied by an average delay per distance in the same ($\bar{T}_{same}$) and orthogonal ($\bar{T}_{ortho}$) direction
as the wire under consideration. The delay per distance for a direction is based on the average
delay over all wire segments in that direction.
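The blending of the wire-length-driven and timing-driven terms in Equations (10) to (12) is sketched below in Python. The function names and input values are hypothetical, and the sketch assumes that costs and delays have already been brought to comparable scales, which the text does not detail.

```python
def timing_driven_node_cost(f_crit, c_wld, t_del):
    """Adapted node cost c(n), Equation (10)."""
    return (1 - f_crit) * c_wld + f_crit * t_del

def timing_driven_expected_cost(f_crit, alpha, beta, c_exp_wld,
                                delta_same, delta_ortho, t_same, t_ortho):
    """Adapted expected cost c_exp(n), Equations (11) and (12)."""
    c_exp_td = delta_same * t_same + delta_ortho * t_ortho       # Equation (12)
    return (1 - f_crit) * alpha * c_exp_wld + f_crit * beta * c_exp_td

# A non-critical connection sees only the congestion cost ...
non_critical = timing_driven_node_cost(0.0, c_wld=2.0, t_del=0.5)
# ... while a connection with criticality 0.5 weights cost and delay equally.
half_critical = timing_driven_node_cost(0.5, c_wld=2.0, t_del=0.5)

low = timing_driven_expected_cost(0.0, 1.4, 0.7, 10.0, 4, 2, 0.25, 0.25)
mid = timing_driven_expected_cost(0.5, 1.4, 0.7, 10.0, 4, 2, 0.25, 0.25)
print(non_critical, half_critical, low, mid)
```

As the criticality grows, both the node cost and the expected cost shift from the congestion-based terms toward the delay-based terms, so that the search for critical connections is steered by delay rather than by wire cost.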
3.2.3 Timing-driven and Wire-length-driven Direction Factors. To enable a better runtime-
quality trade-off, direction factors are assigned to the wire-length-driven and timing-driven ex-
pected costs in Equation (11). The direction factor α of the wire-length-driven part is large and
allows a fast and aggressive search toward the target sink. Since the critical path delay is more
important, a second, smaller direction factor β is used for the timing-driven part. This increases
the runtime, because more nodes are expanded, but leads to a lower maximum delay.
We rely on the fact that the maximum delay of a design is only determined by the longest paths.
Paths with a small delay can therefore be optimized for wire-length without affecting the maximum
delay. The runtime increase introduced by the small timing-driven direction factor β is minimized
by using an exponent ϕ in the calculation of the criticality as included in Equation (8), called the
criticality exponent. The criticality exponent ϕ is larger than 1, which spreads the criticalities of less
critical connections and more critical connections apart. The larger the value of ϕ, the smaller the
criticality of the non-critical connections will be, resulting in a fast wire-length-driven routing
of these connections with the aggressive direction factor α. In this way, only the highly critical
connections are routed with the slow timing-driven direction factor β. The criticality exponent ϕ
is set to 3, which is experimentally validated as being a good value on average.
The exact values of the direction factors α and β are important for a good runtime-quality trade-
off. Small changes can largely increase runtime or reduce quality (Figure 2). First, we analyze the
timing-driven direction factor β . The geomean runtime and critical path delay of CRoute are
shown relative to each other in Figure 2(b) for a β varying from 1.5 to 0.5. Both the geomean
runtime and critical path delay are highly sensitive to the exact value of β . The runtime is equal to
108 seconds if β is equal to 1.5 and increases by a factor of 2.7× to 291 seconds if β is reduced to 0.5,
thereby gaining 13.9% in critical path delay. The wire-length is not sensitive to β, with a difference
of only 0.4%. An optimal value for β is between 0.6 and 0.9. It is set to 0.7 in the experiments.
The value of the wire-length-driven direction factor α should be larger to enable an aggressive
search toward the sink of a connection. Its value affects both the wire-length and critical path delay,
but the influence is small. If α is reduced from 2 to 1.05, then the geomean wire-length and critical path
delay improve by 3% and 1.2%, respectively, at the cost of a 5.4× runtime increase (Figure 2(a)).
Further reducing α results in extremely long runtimes. The optimal runtime-quality trade-off is
chosen at an α value of 1.4.
3.2.4 Initial Connection Criticality. In the first routing iteration, it is not possible to exactly calculate
the criticality of the connections as the delay of the yet unrouted connections is not known.
A possible solution is to perform a congestion-oblivious first iteration by setting the criticality of
all connections equal to one. The exact delay and the corresponding criticality of the connections
can then be calculated after the first iteration. We do not use this method as it places too much emphasis on
reducing delay instead of congestion in the first routing iteration, even for the non-critical connections
in the design. Instead, we use an estimated delay for the connections in the first iteration.
This delay is equal to the optimistic congestion-oblivious minimum delay used by placement tools.
The minimum delay needs to be calculated only once for a given FPGA architecture and is already
available as placement precedes routing. In this way, we relax the first routing iteration by only
emphasizing delay for the long paths in the design.
3.2.5 Rerouting Critical Connections. The first routing iteration of CRoute uses an optimistic
estimation for the connection delays to calculate their criticalities. This leads to a non-minimal de-
lay for connections with a low criticality, which can affect the maximum clock frequency in later
iterations if the delay of other connections is reduced. Therefore, in each later routing iteration,
all uncongested connections with a criticality larger than a predefined value θf are rerouted. This
allows previously routed uncongested connections to be rerouted if their delay limits the maxi-
mum clock frequency. The influence of rerouting these connections is however small. The total
runtime spent to reroute uncongested critical connections is only 2.5% of the total routing runtime
as only the highly critical connections are rerouted. In case a design has many connections with
a criticality larger than the threshold θf , its value is increased so that a maximum of 3% of the
connections are rerouted.
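The threshold adaptation described above can be sketched as follows. This is a minimal illustration, not the CRoute implementation: the function name, the default value of `theta_f`, and the tie handling are assumptions made here; only the 3% cap comes from the text.

```python
def reroute_threshold(criticalities, theta_f=0.85, max_percent=3):
    """Return the criticality threshold actually used in one iteration.

    theta_f=0.85 is a placeholder default (an assumption, not a value from
    the paper). Connections with a criticality strictly above the returned
    threshold are rerouted.
    """
    budget = len(criticalities) * max_percent // 100  # at most 3% of all connections
    above = sorted((c for c in criticalities if c > theta_f), reverse=True)
    if len(above) <= budget:
        return theta_f  # few enough critical connections: keep the predefined value
    # Too many candidates: raise the threshold to the criticality of the first
    # connection beyond the budget, so at most `budget` connections exceed it.
    return above[budget]


criticalities = [i / 100 for i in range(100)]
threshold = reroute_threshold(criticalities)
rerouted = [c for c in criticalities if c > threshold]
```

With ties at the raised threshold, fewer than 3% of the connections may be rerouted; this is a simplification of the sketch.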
3.2.6 Fixing Illegal Routing Trees. The routing of a source-sink connection of a net is typically
sped up by expanding from the (partial) routing tree of the already routed connections in that
net [8, 23]. In CRoute, each connection is routed from scratch, starting from the source, to max-
imally exploit the negotiated sharing mechanism. A drawback of this methodology is that illegal
routing trees can occur. If a node is congested, the router tries to route around the congestion,
so the routing graph of a net can be temporarily illegal between iterations. An
example of a three-sink net with illegal routing trees is given in Figure 3. The connections of that net
to sink0, sink1, and sink2 use a congested node in iteration i (Figure 3(a)), and the congestion
negotiation mechanism gradually resolves the congestion. In iteration i + 1 (Figure 3(b)) the routing graph
is not a tree as it contains an illegal node that is driven by two nodes. The multi-driver problem of
the illegal node is resolved in iteration i + 2 (Figure 3(c)) and all congestion is resolved in iteration
i + 3 (Figure 3(d)).
Fig. 3. An example of the illegal routing trees of a three-sink net along iterations.
In case there are remaining illegal routing trees with multi-driver nodes after all congestion is
resolved, the connections containing these multi-driver nodes are rerouted. An illegal routing tree
occurs if a net consists of connections with low criticality and connections with high criticality.
The connections with a low criticality use the lowest-cost path in terms of congestion, while the
connections with a high criticality use the lowest-cost path in terms of delay. This problem is
solved by a forced rerouting of all illegal connections in an illegal routing tree along the path of
the connection with the highest criticality, since the maximum clock frequency of a design is more
important than the wirelength.
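To make the notion of a multi-driver node concrete, the sketch below detects such nodes in the routes of one net. It assumes that each connection's route is available as an ordered list of routing node identifiers from the source to the sink; the data layout and function names are illustrative assumptions, and the forced rerouting step itself is not shown.

```python
from collections import defaultdict


def multi_driver_nodes(paths):
    """Return the nodes driven by more than one node within a single net.

    `paths` maps a sink name to the node list source -> ... -> sink of the
    connection routed to that sink (hypothetical representation).
    """
    drivers = defaultdict(set)
    for path in paths.values():
        for parent, child in zip(path, path[1:]):
            drivers[child].add(parent)
    return {node for node, parents in drivers.items() if len(parents) > 1}


def illegal_connections(paths):
    """Return the sinks whose connection passes through a multi-driver node."""
    bad = multi_driver_nodes(paths)
    return sorted(sink for sink, path in paths.items() if bad.intersection(path))


# Node "n" is driven by both "a" and "b", so the union of routes is not a tree.
paths = {
    "sink0": ["src", "a", "n", "s0"],
    "sink1": ["src", "b", "n", "s1"],
    "sink2": ["src", "a", "c", "s2"],
}
print(multi_driver_nodes(paths), illegal_connections(paths))
```

In this example the connections to sink0 and sink1 would be the candidates for the forced rerouting described above.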
3.3 Connection-aware Routing Bounding Box
Usually, as in the PathFinder implementation of VPR, a bounding box for each net is computed
before routing, outside of which routing resources cannot be used by the net [4, 8, 9, 18]. The
bounding box contains all pins of the net plus a buffer in each direction. Each source-sink
connection within the same net has the same bounding box. We refer to it as the net-based routing
bounding box.
CRoute also uses the net-based routing bounding box to restrict the routing resources that each
connection can use [22]. In this work, to exploit a finer granularity in both spatial partitioning and
routing for a parallel version of CRoute, we modify the routing bounding box of each connection
to a connection-aware routing bounding box. The connection-aware routing bounding box for each
pair of source and sink is formed as follows.
For an $N$-fan-out net with its source at $(x_{source}, y_{source})$ and its $N$ sinks at
$\{(x_{sink_i}, y_{sink_i})\}$, where $i \in [0, N-1]$, the geometric center $(x, y)$ is the point
whose coordinates are the mean values of the coordinates of the source and the sinks.
Equations (13) and (14) give the coordinates of the geometric center. The new
routing bounding box of each source-sink connection contains the source pin, the sink pin, and
the geometric center of the net that the connection belongs to, plus a buffer in each direction.
Fig. 4. An example of the net-based bounding box and connection-aware bounding boxes for two connections
of a 2-fan-out net.
In this way, connections within the same net have different routing bounding boxes with finer
granularity than the net-based bounding box:
$$x = \frac{1}{1+N} \cdot \left( x_{source} + \sum_{i=0}^{N-1} x_{sink_i} \right), \tag{13}$$
$$y = \frac{1}{1+N} \cdot \left( y_{source} + \sum_{i=0}^{N-1} y_{sink_i} \right). \tag{14}$$
Given a two-fan-out net, Figure 4 illustrates the new connection-aware bounding boxes and
the net-based bounding box for the net’s two connections. The grey dashed grid represents
the area of the FPGA architecture grid onto which the net has been placed. Each of the two solid blue
rectangles, containing the source pin, one sink pin, and the geometric center of the net, depicts
the connection-aware bounding box of one connection, while the black dashed rectangle
shows the net-based bounding box that covers all the pin locations of the net. To make the
granularity difference between a connection-aware bounding box and a net-based
bounding box easy to see, the buffer of each bounding box in Figure 4 is equal to one FPGA tile. It
should be noted that the buffer of the net-based bounding box in VPR and CRoute [22] is equal to
three FPGA tiles by default. It is also set to three FPGA tiles in our following experimental studies
on the new connection-aware routing bounding box model, so that the influence of the buffer on
the routing is left out. The geometric center is included in each connection’s routing bounding box
so that connections of the same net can share wire segments as much as possible even though
multi-sink nets are decomposed into sets of connections.
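The construction of this section can be sketched as follows, assuming pins are given as integer tile coordinates. Rounding the geometric center outward to whole tiles and leaving out clipping to the device boundary are assumptions of this sketch, and all function names are illustrative.

```python
import math


def geometric_center(source, sinks):
    """Mean of the source and sink coordinates (Equations (13) and (14))."""
    count = 1 + len(sinks)
    x = (source[0] + sum(s[0] for s in sinks)) / count
    y = (source[1] + sum(s[1] for s in sinks)) / count
    return x, y


def bounding_box(points, buffer):
    """Smallest tile-aligned box (xmin, ymin, xmax, ymax) around the points,
    grown by `buffer` tiles in each direction."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (math.floor(min(xs)) - buffer, math.floor(min(ys)) - buffer,
            math.ceil(max(xs)) + buffer, math.ceil(max(ys)) + buffer)


def connection_aware_boxes(source, sinks, buffer=3):
    """One box per source-sink connection: source, sink, and net center."""
    center = geometric_center(source, sinks)
    return [bounding_box([source, sink, center], buffer) for sink in sinks]


def net_based_box(source, sinks, buffer=3):
    """Single box covering all pins of the net."""
    return bounding_box([source] + list(sinks), buffer)


# A two-fan-out net with a one-tile buffer, in the spirit of Figure 4
# (the coordinates are made up for illustration).
source, sinks = (2, 2), [(10, 2), (2, 10)]
print(connection_aware_boxes(source, sinks, buffer=1))
print(net_based_box(source, sinks, buffer=1))
```

For these coordinates, each connection-aware box covers roughly half the area of the net-based box, which is the finer granularity that the partitioning in Section 5 relies on.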
4 EXPERIMENTAL STUDY
The serial timing-driven router with the aforementioned algorithmic enhancements is compared
with the router in VPR 7.0.7 (r75b47d3) in terms of quality of results and required time for routing.
The Titan23 benchmark circuits [13] and an architecture model of the Stratix IV FPGA are used for
benchmarking. The RRG for the target FPGA is extracted from VPR [8] as a file containing the data
of all routing resources and their interconnectivity. The Titan23 designs are packed by MultiPart
[21] and placed by VPR [8]. MultiPart is used for packing as it enables the routing of many
Titan23 designs with the default channel width of 300 [21]. The first three columns of Table 1 show
the circuits used for benchmarking along with the number of nets and connections in each circuit.
Experiments with regard to the routing step are performed 10 times on a workstation with an
Intel E5-2680v3@2.5 GHz. The routing runtimes of VPR and CRoute are the actual time required
for routing, excluding the runtime associated with the generation and loading of the routing
resource graph. Results are compared in terms of the number of required routing iterations (IT),
the number of rerouted connections (ReR, in thousands), routing runtime (RT, in seconds), total
wirelength (WL, in million CLB length), and the critical path delay (CPD, in nanoseconds), as
shown circuit-by-circuit in Table 1. Numbers in brackets are the values of CRoute normalized to
those of the router in VPR 7.0.7.
Table 1. Runtime and Quality Comparison of VPR and CRoute Expressed in the Number of Required
Routing Iterations (It), the Number of Rerouted Connections (ReR), Routing Runtime (RT), Total
Wire-length (All), the Wire-length of the Length 4 (L4) and Length 16 (L16)
Wire Segments, and the Critical Path Delay (CPD)
The CPD is the maximum delay over all clock domains.
On average, CRoute improves the wirelength and critical path delay by 10% and 7%, respec-
tively, with the runtime reduced by 3.5×. Compared to the previously published version in Refer-
ence [22], CRoute with the connection-aware routing bounding box model trades an approximate
1% increase in wirelength for a 1% decrease in the critical path delay.
4.1 Runtime Improvement
The reasons for the runtime gain by CRoute are discussed as follows. First, the total number of
rerouted connections (ReR) is largely reduced (9.1×) by the connection-based routing strategy,
i.e., rerouting congested connections of nets instead of rerouting the entire nets. Moreover,
rerouting only the congested connections also leads to faster convergence, as 7.0× fewer PathFinder
routing iterations are required by CRoute to achieve a congestion-free solution.
To better visualize the runtime improvement, Figure 5 illustrates the runtime breakdown of VPR
and CRoute in terms of the runtime of the first routing iteration, runtime of remaining iterations
to reroute congested nets/connections, runtime of static timing analysis, and other runtime to
initialize data and to calculate routing statistics.
The runtime of the first iteration and other time is approximately equal for VPR and CRoute,
while the runtime of rerouting and timing analysis is largely reduced by CRoute. In the routing
process, a static timing analysis is performed in each iteration. As the required number of routing
iterations is reduced from 60 (VPR) to 9 (CRoute), the time for static timing analysis is largely
reduced by CRoute.
Fig. 5. Routing runtime breakdown of VPR and CRoute averaged over the Titan23 benchmark designs.
4.2 Quality of Results Improvement
CRoute requires fewer routing resources than VPR to route every design, as shown in Table 1. On
average, 10% less wire-length is required and the critical path delay is reduced by 7%. This result is
important as the interconnection network consumes a large fraction of the total FPGA area and the
critical path delay determines the maximum frequency of the circuit. With CRoute, the designs
can be implemented on FPGAs that contain fewer routing resources.
The used Stratix IV-like FPGA architecture contains length 4 (L4) and length 16 (L16) wire seg-
ments. It is important to note that the wire-length improvement is mainly due to the lower usage
of the L16 wire segments. CRoute needs 58% fewer L16 wire segments, while the number
of L4 wire segments required by CRoute is similar to that of VPR. The reduction in L16 wire segments
results from the adapted wire segment cost (Section 3.1.1), which increases the cost
of using long wire segments. Therefore, the L16 wire segments are only used if they
are really required, for example, to reduce the delay of connections on the critical path.
5 CONNECTION-AWARE PARALLELIZATION
Orthogonal to the improvements of the PathFinder-based algorithms to make the serial routing
more efficient, another promising direction to address the FPGA routing runtime problem is to
parallelize the routing process. In this section, we explore the recursive bipartitioning technique
for the parallelization of the routing process.
5.1 Related Partitioning-based Parallel FPGA Routing
In the implementation of PathFinder-based FPGA routing, not all signals have the potential to use
the same routing resources, because each signal has its own bounding box, outside of which
routing resources cannot be used. This makes the partitioning and parallel routing feasible. There
are routers that make use of today’s multi-core processors to accelerate FPGA routing through the
parallelization technique based on a partitioning approach.
Gort and Anderson [4] developed a recursive bi-partitioning technique with load balancing tak-
ing advantage of geographical independence between signals for the parallel routing solution.
Their parallelization alone provides about 2.3× speedup using four cores. The parallel router pro-
duces deterministic results, with no considerable impact on the quality of results.
Shen and Luo [14] presented their partitioning-based parallel router in the context of non-
timing-critical applications. In each level of the recursive partitioning, nets are partitioned into
three subsets: one subset of nets that cross the pseudo cutline, and the two remaining subsets of
nets separated by the cutline. The latter two subsets are routed in parallel after the former one is
routed. To improve the efficiency and reduce runtime of the parallel router, nets that cross two
regions are also parallelized in the same way. A speedup of 7.06× is achieved with a wirelength
increase of 10% using 32 processes, compared to the VPR 7.0 router.
Hoo and Kumar reported another partitioning-based parallel router ParaDRo [7]. The paral-
lelism of ParaDRo is obtained in three phases. First, it uses a lower bounding box expansion
factor (similar to the bounding box buffer in Figure 4) than VPR 7.0 to reduce the overlap be-
tween nets. Second, nets are assigned to nodes of the slicing tree by the recursive bi-partitioning.
Third, nets within a node are further scheduled to be routed in parallel by a decomposition, where
multi-sink nets are decomposed into source-sink connections. The bounding box size of each
source-sink connection is further reduced by restricting its route to be on the perimeter of its
bounding box, to increase the number of nets that can be routed in parallel. With their enhance-
ments, ParaDRo achieves a maximum speedup of 5.4× with eight threads, compared to the VPR
7.0 router.
Wang et al. proposed ParRA [24] based on their serial router [23]. ParRA is composed of a hybrid
partitioning and a parallel routing process. In their partitioning, an FPGA area is divided into
multiple regions and the nets are geographically partitioned in a recursive fashion. Nets spanning
more than one region are further partitioned [24]. Finally, nets are partitioned into two kinds of
subsets: conflict-free subsets consisting of nets that are spanning different regions and independent
of others in the same subset, and local subsets with nets fitting entirely in one region. The conflict-
free subsets are routed one by one with nets in each subset routed in parallel. After the routing
of conflict-free subsets is finished, local subsets are routed in parallel with nets in each subset
routed one by one. ParRA gains speedups of 1.6×, 2.7×, 3.9×, and 5× with 2, 4, 8, and 16 threads,
respectively, relative to the serial router [23].
In general, the recursive spatial partitioning used in the published works is based on the con-
cept of a net, which we mark as net-based partitioning. In this work, we extend the traditional
partitioning-based parallelization method to a connection-aware parallelization approach, relying
on using the proposed connection-aware routing bounding box model for both partitioning and
routing. It is worth mentioning that this is the first work adopting this kind of connection-aware
and geometric center adjusted routing bounding box for both the spatial partitioning and routing.
Although the routers in References [20, 22–24] adopt the connection-based routing strategy, they
keep using the net-based routing bounding box as VPR [8] does. ParaDRo [7] decomposes each
multi-sink net to single-sink virtual nets after the partitioning phase, which are indeed similar to
the source-sink connections in this work. However, it keeps using the net-based bounding box
for its partitioning and thus is not able to reduce the number of connections spanning more than
one region. Moreover, ParaDRo restricts the routing path of each source-sink pair to be on the
perimeter of its bounding box. This highly limits the routing resource sharing among different
connections of the same net, resulting in an increased wirelength when compared with the router
in VPR 7.0.
5.2 Connection-aware Recursive Partitioning
The recursive partitioning framework in this work shares the same principle as the partitioning in
the works mentioned in Section 5.1, taking advantage of the geographical independence of nets in
different regions. The difference lies in the fact that the connection-aware partitioning in this work
deals with the source-sink connections of the nets rather than the nets entirely. The partitioning
uses the information of the connection-aware routing bounding boxes. The aim is to partition
the connections into sets. The way in which geographic partitioning using the connection-aware
routing bounding boxes of connections is applied recursively is visualized in Figure 6. The
corresponding slicing tree is shown in Figure 7(a), where nodes represent sets of connections.
Fig. 6. An example of connection-aware recursive bi-partitioning with load balance in terms of the number
of connections where: callouts in different shapes and colors represent different nets’ terminals; sources and
sinks of nets are named after capital letters and corresponding lower case letters, respectively; each solid line
rectangle indicates the connection-aware routing bounding box of a source-sink connection.
Fig. 7. Corresponding slicing trees of the (a) connection-aware and (b) net-based spatial partitioning for four
threads.
The partitioning starts with dividing the FPGA area into two regions, the left and the right re-
gion, by the first cutline at position x = 8. A connection with a routing bounding box that resides
entirely within one region is geographically independent from those within a different region. The
connections in the left region make up the leftConSet, while the connections in the right region
form the rightConSet. The connections with bounding boxes that overlap both regions form the
crossingConSet. The cutline is determined based on a load balance strategy where the difference
between the number of connections in leftConSet and the number of connections in rightConSet
is minimized, similar to Reference [14] and ParaDRo [7]. In the example
shown in Figure 6, where a connection is named by its sink’s name in lower case letters, the
connection sets after the first partitioning are leftConSet = {a0, a1, b0, c0, c1, d0},
rightConSet = {f0, f1, g0, g1, h1}, and crossingConSet = {c2, d1, e0, e1}. Because of the geographical overlap
with the other two sets, which inevitably leads to competition for routing resources, the connection set
crossingConSet is assigned to the top level (Level 0) node in the slicing tree.
Table 2. Configurations of VPR, CRoute, and Parallel Variants of CRoute
| Routers | Routing strategy | Routing bounding box | Partitioning | Parallelization |
|---|---|---|---|---|
| VPR [8] | net-based | net-based | no | no |
| CRoute | connection-based | connection-aware | no | no |
| ParaCon | connection-based | connection-aware | connection-aware | yes |
| ParaNet | connection-based | net-based | net-based | yes |
If there were only two threads available, then the recursive bipartitioning process would be
complete. For four threads, the left and right regions are further partitioned in the same way,
splitting each region into two parts and assigning the connections in each region into three sets
accordingly, leftConSet, rightConSet, and the crossingConSet, as shown in Figures 6 and 7. The
recursive partitioning will stop when the number of nodes at the bottom level equals the number
of available threads. In this work, the node at the top level of a slicing tree is named the root node,
while the nodes at the bottom level are leaf nodes. Nodes between the top level and the bottom
level are intermediate nodes.
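The recursive bipartitioning described above can be sketched as follows. The sketch makes several assumptions that the text does not fix: a connection is a `(name, (xmin, ymin, xmax, ymax))` pair, a box lies left of a cutline at position `cut` if its maximum coordinate is smaller than `cut`, ties in the load balance are broken in favor of fewer crossing connections, the cut direction alternates between levels, and the number of threads is a power of two.

```python
def classify(conns, cut, axis):
    """Split connections by a cutline at `cut` along `axis` (0 = x, 1 = y)."""
    left = [c for c in conns if c[1][axis + 2] < cut]   # box ends before the cutline
    right = [c for c in conns if c[1][axis] >= cut]     # box starts at or after it
    crossing = [c for c in conns if c[1][axis] < cut <= c[1][axis + 2]]
    return left, right, crossing


def choose_cutline(conns, lo, hi, axis):
    """Pick the cut that minimizes | |left| - |right| | (ties: fewer crossings)."""
    best = None
    for cut in range(lo + 1, hi):
        left, right, crossing = classify(conns, cut, axis)
        key = (abs(len(left) - len(right)), len(crossing))
        if best is None or key < best[0]:
            best = (key, cut)
    if best is None:
        raise ValueError("region too small to split")
    return best[1]


def build_slicing_tree(conns, region, num_threads):
    """Return levels[i] = list of nodes at Level i, each a list of names."""
    depth_max = num_threads.bit_length() - 1  # assumes a power-of-two thread count
    levels = [[] for _ in range(depth_max + 1)]

    def recurse(conns, region, depth):
        if depth == depth_max:  # leaf node: keeps all remaining connections
            levels[depth].append([name for name, _ in conns])
            return
        axis = depth % 2  # alternate vertical and horizontal cutlines
        cut = choose_cutline(conns, region[axis], region[axis + 2], axis)
        left, right, crossing = classify(conns, cut, axis)
        levels[depth].append([name for name, _ in crossing])
        left_region, right_region = list(region), list(region)
        left_region[axis + 2] = cut
        right_region[axis] = cut
        recurse(left, left_region, depth + 1)
        recurse(right, right_region, depth + 1)

    recurse(conns, region, 0)
    return levels


conns = [("a", (0, 0, 3, 3)), ("b", (1, 9, 4, 13)), ("c", (10, 1, 14, 5)),
         ("d", (11, 10, 15, 14)), ("e", (5, 6, 11, 9))]
print(build_slicing_tree(conns, (0, 0, 16, 16), num_threads=4))
```

The same routine yields a net-based slicing tree when every connection of a net is given the net-based bounding box instead.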
To visualize the difference between this connection-aware partitioning and the existing net-
based spatial partitioning, Figure 7(b) is given. It demonstrates the slicing tree of the net-based
partitioning with the same load balance strategy applied to nets in Figure 6. As can be seen, the
connection-aware partitioning reduces the number of crossing connections in the root and
intermediate nodes from 10 to 7. These crossing connections are regarded as a bottleneck for the
maximum speedup of partitioning-based parallel routing.
5.3 Parallel Routing Approach
In this work, the partitioning is performed once following the loading of the placement informa-
tion. Having assigned the connections to nodes of the slicing tree, the routing phase starts to find
a path for each connection.
The routing phase begins with processing the nodes of the slicing tree in a top-down fashion [4,
7, 14]. The root node at the top level of the slicing tree is processed first. After finishing the root
node, it continues routing connections in nodes at the downstream level until the leaf nodes have
been processed. To avoid race conditions, nodes at one level are not processed until the nodes at
its upstream level are completely routed. Connections in each node are routed one by one, while
nodes at the same level are routed in parallel, since they are geographically independent and share
no routing resources.
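The top-down schedule can be sketched as below. This is only an illustration of the ordering constraints, assuming the slicing tree is given as a list of levels, each holding one list of connections per node, and that `route_connection` is a caller-supplied routine (both are assumptions of this sketch). Python threads are used for brevity; for CPU-bound routing they would not give the speedups reported in this work.

```python
from concurrent.futures import ThreadPoolExecutor


def route_node(connections, route_connection):
    """Route the connections of one node serially, one by one."""
    return [route_connection(conn) for conn in connections]


def route_slicing_tree(levels, route_connection, num_threads):
    """Process the slicing tree level by level, from the root to the leaves."""
    results = []
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        for nodes in levels:
            # Nodes of the same level run in parallel. Consuming the map
            # iterator acts as a barrier: the next level only starts once
            # every node of this level has been completely routed.
            for routed in pool.map(lambda node: route_node(node, route_connection), nodes):
                results.extend(routed)
    return results


levels = [[["e"]], [["a", "b"], ["c", "d"]]]
print(route_slicing_tree(levels, str.upper, num_threads=2))
```

Because `pool.map` returns results in submission order, the output order is deterministic even though nodes of one level run concurrently.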
6 EXPERIMENTAL STUDY
In this section, we present and analyze results obtained using the original bi-partitioning and
parallel approach described above, since it is the fundamental step from which the partitioning-
based parallel routers start. We first compare the connection-aware and net-based partitioning,
followed by the routing quality and runtime analysis of corresponding parallel routers.
To clarify routers that are referred to for the comparisons in the following sections, Table 2
shows an overview of the configurations of the routers. ParaCon is a parallel version of CRoute
with the connection-aware partitioning that uses the connection-aware routing bounding boxes,
while ParaNet is another parallel router using the net-based partitioning with the net-based rout-
ing bounding boxes.
Fig. 8. The proportion of the root node’s workload in the total workload of the slicing tree when using the
net-based partitioning.
6.1 Effects of the Connection-aware Routing Bounding Box on Partitioning
Before getting into the evaluation of the parallel routing, we first explore how the connection-
aware partitioning makes a difference in the workload of the nodes at each level of the slicing tree
in comparison with the net-based partitioning. The load balancing strategy in terms of the number
of connections is used. We will start by presenting some necessary definitions of the metrics we
use to evaluate the two partitioning approaches.
Note that there are $2^i$ nodes at Level $i$ of the slicing tree in Figure 7. For a node $n$ at Level $i$, we
take the number of connections as its workload ($load_n$). The workload of Level $i$ is the maximum
workload among the nodes at that level, as Equation (15) shows:
$$load_{level_i} = \max_{0 \le n < 2^i} \{load_n\}, \tag{15}$$
with $0 \le i \le I$, where $2^{I-1} = N_{region}$, and $N_{region}$ is the number of regions obtained through the
partitioning. The number of regions is equal to the number of available threads. The workload of all
the intermediate levels between the root level and the leaf level is given in Equation (16). The total
workload $load_{tree}$ of the entire slicing tree is the sum of the workload at each level (Equation (17)):
$$load_{inters} = \sum_{i=1}^{I-1} load_{level_i}, \tag{16}$$
$$load_{tree} = \sum_{i=0}^{I} load_{level_i}. \tag{17}$$
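As a small worked example of Equations (15) to (17), the sketch below computes the workload metrics from the number of connections per node. The node workloads are made-up numbers for illustration, and the function names are not from the paper.

```python
def level_loads(levels):
    """Equation (15): the workload of a level is the maximum node workload."""
    return [max(nodes) for nodes in levels]


def tree_workload(levels):
    """Root, intermediate (Eq. (16)), leaf, and total (Eq. (17)) workloads.

    `levels[i]` lists the number of connections of each node at Level i.
    """
    loads = level_loads(levels)
    return {
        "root": loads[0],
        "inters": sum(loads[1:-1]),  # levels strictly between root and leaves
        "leaf": loads[-1],
        "tree": sum(loads),
    }


# Hypothetical slicing tree for four threads: one root node, two intermediate
# nodes, and four leaf nodes.
levels = [[4], [2, 1], [3, 2, 3, 2]]
print(tree_workload(levels))
```

Since each level contributes only its most loaded node, the total workload approximates the length of the critical path of the parallel schedule rather than the total number of connections.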
High-fanout nets usually span a large area of the FPGA and therefore have a high chance of being
assigned to the root node. Note that it is the high-fanout nets that dominate the total routing
runtime [3], so a quantitative analysis of the root node is worthwhile. Figure 8 shows the proportion
of the root node’s workload in $load_{tree}$ of the net-based partitioning tree. The workload of the root
node remains the same regardless of the number of regions, while the total workload decreases as
the number of regions increases. Therefore, the workload proportion of the root node increases
with more partitions. The root node workload accounts for 49% of the total workload when the
number of partitioning regions reaches 8. As the number of regions continues to increase, the
proportion levels off at 51%, which means the root node is a bottleneck for the parallel router
using the net-based partitioning.
Fig. 9. The root node’s workload of each circuit using the connection-aware partitioning, normalized to that
of the net-based partitioning.
The proposed connection-aware partitioning can alleviate this problem efficiently. Figure 9
presents circuit-by-circuit the root node’s workload of the connection-aware partitioning, nor-
malized to that of the net-based partitioning. The connection-aware partitioning is able to reduce
the workload of the root node for all the benchmarks. The reduction is 28% on average, which
results from the finer-granularity bounding box of each connection.
To gain a comprehensive view of how the proposed method influences the workloads at different
levels, we also compare the workload of the intermediate levels, the workload of the leaf level,
and the total workload of the slicing tree, depicted by
the green short-dashed line, orange dotted line, and solid black line, respectively, in Figure 10. It
indicates that the connection-aware partitioning also reduces the workload of intermediate levels.
Along with Figure 9, a conclusion can be drawn that the proposed connection-aware partitioning
can reduce the overall connections that span more than one region, which means that more par-
allelism can be exploited than using the net-based partitioning, especially for the time-consuming
high fanout nets. Since fewer connections are assigned to the root level and the intermediate levels,
more connections are partitioned to the leaf nodes of the slicing
tree when using the connection-aware partitioning. Overall, the total workload of the slicing tree
$load_{tree}$ is reduced by the connection-aware partitioning. The solid black line in Figure 10 shows
the geomean $load_{tree}$ across all the benchmarks achieved by the connection-aware partitioning,
normalized to that of the net-based partitioning for different numbers of partitioning regions. It
can be seen that the more partitioning regions there are, the larger the reduction of $load_{tree}$ is.
The reduction levels off at around 15% when there are at least eight regions.
6.2 Multi-threaded Parallel Routing
In the previous section, we showed that the connection-aware partitioning is more beneficial for
parallel routing than the existing net-based partitioning in terms of the workloads of the
partitioning trees.
Fig. 10. Normalized workload of the connection-aware partitioning to that of the net-based partitioning.
Fig. 11. Runtime speedups by ParaNet and ParaCon vs. VPR.
To evaluate the real routing runtime efficiency gain of connection-aware partitioning for paral-
lel routing over the net-based partitioning, we use the multi-threading technique to build two
parallel FPGA routers with the two partitioning methods, where the communication between
threads is based on the shared memory mechanism. As mentioned above, we mark the one with
the connection-aware partitioning as ParaCon, while the other with the net-based partitioning is
marked as ParaNet. We evaluate ParaCon and ParaNet in terms of the runtime speedup and qual-
ity of results (i.e., the wirelength and the critical path delay), taking the router in VPR 7.0.7 [8] as
the baseline. Figure 11 illustrates the overall speedups (geomean speedup across all the circuits) of
ParaCon and ParaNet vs. the VPR router with different numbers of threads. It shows that ParaCon
is faster than ParaNet. Specifically, ParaCon achieves a 1.10×, 1.15×, 1.17×, and 1.18×
larger speedup than ParaNet with 2, 4, 8, and 16 threads, respectively, which is in line with the
total workload reduction trend indicated in Figure 10.
Figure 12 shows the quality of results of ParaCon and ParaNet with different numbers of threads,
where results are normalized to VPR. Both ParaCon and ParaNet have shorter wirelength and
smaller critical path delay than VPR. ParaNet has a slightly better wirelength than ParaCon. This
is because the net-based bounding boxes enable connections of the same net to have more wire
segments that can be shared than the connection-aware bounding boxes. Yet ParaCon has a smaller
critical path delay than ParaNet. When compared to the results of serial CRoute as shown in
Table 1, ParaCon does not lose routing quality, since both the wirelength and the critical path
delay change only minimally, within ±0.5%.
Fig. 12. Quality of results of ParaCon and ParaNet compared to VPR.
Aiming at better analyzing and improving the overall routing runtime, we further profile the
parallel routers by categorizing different phases of CRoute, ParaNet, and ParaCon with different
numbers of threads. Generally, the routing runtime we evaluate can be split into three categories:
the portion of runtime spent routing and rerouting connections (t_routing), the runtime required
to update the timing information and routing statistics (t_update), and other remaining time, in-
cluding time in initializing data (t_other). It is the t_routing that the thread-level parallelization
accelerates. The runtime spent routing and rerouting connections of ParaCon and ParaNet is fur-
ther split into t_root, t_inter, and t_leaf, corresponding to the runtime spent processing the root
node, intermediate nodes and leaf nodes, respectively. For a node $n$ at Level $i$, its real processing
time is $t_n$. The runtime of Level $i$, $t_{level_i}$, is the maximum processing time of the nodes
at that level. Then t_leaf is $t_{level_I}$. The t_inter is obtained by summing up the runtime of all
the intermediate levels. For the serial CRoute, we take its t_routing as its t_root:
$$t_{level_i} = \max_{0 \le n < 2^i} \{t_n\}, \tag{18}$$
$$t\_inter = \sum_{i=1}^{I-1} t_{level_i}. \tag{19}$$
Fig. 13. Runtime contributions of (a) ParaNet and (b) ParaCon expressed in percentage of CRoute.
Figure 13 illustrates the runtime contributions including t_root, t_inter, t_leaf, t_update, and
t_other of ParaNet and ParaCon. A notable difference in Figure 13(b) compared to Figure 13(a)
is the decrease in t_root, which stems from the workload reduction achieved by the connection-
aware partitioning shown in Figure 9. In both figures, the percentage of t_inter goes up and t_leaf
decreases with the increasing number of threads, because the number of connections assigned to
intermediate levels increases with the increasing number of partitions, while connections assigned
to leaf nodes are reduced.
With eight threads, t_root plus t_inter accounts for the majority of the runtime of both parallel
routers based on the original recursive geographical partitioning. It should be noted that t_root
plus t_inter is the time spent routing and rerouting connections that span more than one parti-
tioning region. It is promising that the speedups can be improved by exploring more parallelism
among those crossing connections.
7 PARALLEL ROUTING WITH IMPROVEMENTS
As can be seen from Figures 11 and 13, the multi-threaded router based on the original bi-partitioning
method does not scale well for large heterogeneous circuits, even for the one with
a finer granularity routing bounding box. There are possibilities to improve the parallelism of the
original partitioning-based parallel routing.
18:22 Y. Zhou et al.
Fig. 14. An example of the slicing tree with further splitting.
ParaDRo [7] extracts further parallelism by scheduling nets within a partition to be routed in
parallel. The routing path of each source-sink pair is restricted to be on the perimeter of its bound-
ing box (i.e., the restrictive perimeter bounding box). If the source-sink pairs have no overlapped
restrictive perimeter bounding boxes, then they are routed in parallel. However, the restrictive
bounding boxes highly limit the routing resources that can be shared by source-sink connections
within the same net. In terms of routing the four Titan23 benchmark circuits including neuron,
stereo_vision, segmentation and denoise with eight threads (Figure 19 in Reference [7]), ParaDRo
approximately increases the total wirelength by 2%, 2%, 11%, and 10%, respectively, compared with
the router in VPR (7.0 r4292). Its worst degradation in the critical path delay can be 5% [7].
ParRA [24] takes all the nets in the non-leaf nodes as the inter-subregion nets and then assigns
them to conflict-free subsets. Nets in each conflict-free subset are geographically independent of
each other, in terms of the net-based bounding boxes. The nets in each conflict-free subset are
routed in parallel and the conflict subsets are processed one by one. In this way, the percentage of
the parallel portion of routing is increased. However, finding a good assignment of nets spanning
more than one region to the conflict-free subset is nontrivial due to a large number of those nets in
the large Titan23 benchmark circuits. It is equivalent to the independent set problem, which is NP-
complete. Moreover, the way in which the first net not assigned yet is chosen each time influences
the formation of those conflict-free subsets and therefore matters for the routing. ParRA is able to
route seven circuits of the Titan23 benchmark suite with increased channel width. Similar quality
of results is obtained by ParRA when compared to VPR 7.0.7.
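To illustrate why the order in which nets are picked influences the resulting conflict-free subsets, the following is a minimal first-fit sketch in Python. It is not ParRA's actual assignment procedure. The `BBox` type, the function names, and the sample coordinates are hypothetical, and two nets are treated as conflicting whenever their net-based bounding boxes overlap or touch.

```python
from typing import List, NamedTuple


class BBox(NamedTuple):
    """Axis-aligned net-based routing bounding box (hypothetical type)."""
    x_min: int
    y_min: int
    x_max: int
    y_max: int


def overlaps(a: BBox, b: BBox) -> bool:
    """True if the two boxes share at least one grid location."""
    return (a.x_min <= b.x_max and b.x_min <= a.x_max
            and a.y_min <= b.y_max and b.y_min <= a.y_max)


def greedy_conflict_free_subsets(boxes: List[BBox]) -> List[List[int]]:
    """First-fit heuristic: put each net into the first subset whose
    members all have bounding boxes disjoint from it."""
    subsets: List[List[int]] = []
    for idx, box in enumerate(boxes):
        for subset in subsets:
            if all(not overlaps(box, boxes[j]) for j in subset):
                subset.append(idx)
                break
        else:
            # The net conflicts with every existing subset: open a new one.
            subsets.append([idx])
    return subsets


# Three made-up nets: net 0 and net 1 overlap, net 2 is far away from both.
nets = [BBox(0, 0, 2, 2), BBox(1, 1, 3, 3), BBox(4, 4, 5, 5)]
subsets = greedy_conflict_free_subsets(nets)            # [[0, 2], [1]]
subsets_reversed = greedy_conflict_free_subsets(nets[::-1])  # [[0, 1], [2]]
```

Processing the same three nets in reverse order groups the isolated net with a different partner, which is the order dependence noted above. The nets inside one subset could be routed in parallel, while the subsets themselves are processed one after another.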
Since we expect to keep the high quality of CRoute and route as many Titan23 benchmark
circuits as possible, we further geographically bi-partition the connections that span more than
one region as an attempt to improve both ParaCon and ParaNet.
Corresponding to the slicing tree in Figure 7, further geographically bi-partitioning the crossing
connections means partitioning the root node and the intermediate nodes of the slicing tree in the
same way as the slicing tree is generated (Section 5.2). It is as simple as recalling the partitioning
function already developed for the nodes. Figure 14 shows an example of the new slicing tree
with further splitting. Connections in each non-leaf node are further geographically partitioned
into three subsets, taking the connection-aware partitioning tree in Figure 7(a) of the connections in
Figure 6 as the baseline. The crossingConSet = {c2, d1, e0, e1} of the root node in Figure 7(a) is
further partitioned into three sets of connections: crossingConSet = {e0, e1}, leftConSet = {d1}, and
rightConSet = {c2} with a horizontal cutline. The intermediate nodes at Level 1 in Figure 7(a) cannot
be further split. The parallelization approach is generally the same as described in Section 5.3,
with the node containing the crossingConSet = {e0, e1} being the first to be processed, while the two
nodes at Level 1 are processed in parallel once the routing of the node at Level 0 is completed. We
explore the performance gain of ParaCon over ParaNet in the following section.
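The further splitting of a crossing set can be sketched as one more call to a bi-partitioning routine with a perpendicular cutline. The Python snippet below is a minimal illustration under stated assumptions: the `BBox` type, the `bipartition` function, the cutline positions, and all coordinates are hypothetical (the real geometry is that of Figure 6), and a bounding box that touches the cutline is treated as crossing it.

```python
from typing import Dict, List, NamedTuple, Tuple


class BBox(NamedTuple):
    """Connection-aware routing bounding box (hypothetical coordinates)."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def bipartition(boxes: Dict[str, BBox], cut: float,
                axis: str) -> Tuple[List[str], List[str], List[str]]:
    """Split connections into (left, right, crossing) sets for one cutline.

    axis="x" models a vertical cutline at x = cut, and axis="y" models a
    horizontal cutline at y = cut. For a horizontal cutline, "left" holds
    the connections below it and "right" those above it.
    """
    left: List[str] = []
    right: List[str] = []
    crossing: List[str] = []
    for name, b in boxes.items():
        lo, hi = (b.x_min, b.x_max) if axis == "x" else (b.y_min, b.y_max)
        if hi < cut:
            left.append(name)
        elif lo > cut:
            right.append(name)
        else:
            crossing.append(name)  # the box spans (or touches) the cutline
    return left, right, crossing


# Made-up boxes chosen so that all four connections cross the vertical
# cutline x = 5 of the root node.
root_crossing = {
    "c2": BBox(4, 6, 7, 9),
    "d1": BBox(3, 1, 6, 4),
    "e0": BBox(2, 3, 8, 7),
    "e1": BBox(4, 4, 6, 6),
}

# Further splitting: apply the same routine again with a horizontal cutline.
left, right, crossing = bipartition(root_crossing, cut=5, axis="y")
```

With these made-up coordinates the result reproduces the sets named in the text: leftConSet = {d1}, rightConSet = {c2}, and crossingConSet = {e0, e1}.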
Fig. 15. Normalized workload of the connection-aware partitioning to that of the net-based partitioning
with further splitting.
7.1 Further Partitioning
7.2 Experimental Results
Similar to the analysis in Section 6, we also present results with regard to the workload of the
slicing tree. The workloads of the root node, the intermediate nodes, and the total workload of the
connection-aware partitioning, normalized to those of the net-based partitioning, are shown in
Figure 15 by the blue long dashed line, green short dashed line, and solid black line, respectively.
The workload of the leaf nodes remains the same as in Figure 10, since the further partitioning is
not carried out on the leaf nodes generated by the original partitioning in Section 5.2, to keep the
same number of available threads. The workload of the root node is further reduced. A reduction
of 45% on average is achieved by the connection-aware partitioning. The workload of interme-
diate nodes is also reduced. The total workload is reduced by 20% with the connection-aware
partitioning.
Figure 16 displays the geomean speedups across all circuits of ParaNet and ParaCon with further
splitting when given different numbers of threads, still taking the results of VPR shown in Sec-
tion 4 as the baseline. Compared with Figure 11, it can be seen that the runtime speedups of both
ParaNet and ParaCon are improved by further partitioning the crossing connections in each non-
leaf node. ParaCon has a speedup improvement of 1.16×, 1.18×, 1.24×, and 1.24× over ParaNet,
with 2, 4, 8, and 16 threads, respectively. Compared with the speedups of ParaCon and ParaNet
shown in Figure 11, the runtime speedup gain achieved by ParaCon over ParaNet gets larger with
connections spanning more than one region being further partitioned. This is because the speedup
improvements of parallel routing benefit more from the fine granularity of connection-aware rout-
ing bounding boxes than the relatively coarse net-based routing bounding boxes.
Figure 17 shows the quality of routing results in terms of the circuit wirelength and critical
path delay. It can be seen that the connection-aware partitioning with crossing connections fur-
ther partitioned has no impact on the quality of the wirelength and the critical path delay, when
comparing Figures 17 and 12.
Fig. 16. Runtime speedups of ParaNet and ParaCon with further splitting vs. VPR.
Fig. 17. Quality of results of ParaCon and ParaNet with further partitioning compared to VPR.
The runtime difference of ParaCon and ParaNet mainly exists in the routing and rerouting of
connections. When only considering the time spent routing and rerouting connections, ParaCon
obtains a speedup improvement of 1.20×, 1.23×, 1.33×, and 1.35× over ParaNet.
8 CONCLUSIONS
In this work, we elaborate on the connection-based routing principle and present a connection-
based timing-driven router CRoute embedded with algorithmic enhancements. The router is
analyzed through a comparison with the router in VPR 7.0.7 for routing the Titan23 benchmark
designs. CRoute improves both the routing runtime and the quality of results.
We also explore the recursive partitioning-based parallelization technique to further acceler-
ate the routing process. We propose the connection-aware parallelization approach, where the
partitioning deals with each connection of the nets, instead of the nets entirely, based on the new
connection-aware routing bounding box model. The new connection-aware routing bounding box
model is more beneficial for parallel routing than the existing net-based routing bounding box,
while having no significant impact on the circuit wirelength and critical path delay. The better
runtime efficiency benefits from the reduction of connections spanning more than one region,
which stems from the finer granularity routing bounding boxes of connections of multi-sink nets.
The speedup gain by the connection-aware partitioning against the net-based version is larger
when further improvements are introduced.
To our knowledge, this is the first work reporting the new connection-aware routing bounding
box method, a new experimental study on the parallel routers using the connection-aware and
the net-based partitioning methods, and the runtime contributions of different levels of the parti-
tioning trees to the overall runtimes of the parallel routers. We believe that our connection-aware
parallelization approach can inspire new insights into parallel routing. We expect that other existing
partitioning-based parallel routers can be further accelerated by the proposed connection-aware
partitioning, since it is highly compatible with them.
The connection-aware partitioning is orthogonal to other fine-grain level enhancements, such as
the possible parallelization of the node expansion of the PathFinder algorithm. Moreover, parallel
routing with the fine-grain parallelization using the proposed connection-aware routing bounding
box may also obtain more runtime improvement than using the existing net-based routing bound-
ing box, relying on the fact that there are fewer wire segments to be expanded during the maze
expansion of PathFinder-based algorithms.
ACKNOWLEDGMENT
The authors thank Elias Vansteenkiste for his contributions to initial versions of CRoute.
REFERENCES
[1] Vaughn Betz and Jonathan Rose. 1997. VPR: A new packing, placement and routing tool for FPGA research. In Pro-
ceedings of the 7th International Workshop on Field-Programmable Logic and Applications (FPL’97). 213–222.
[2] Vaughn Betz and Jonathan Rose. 1999. FPGA routing architecture: Segmentation and buffering to optimize speed and
density. In Proceedings of the ACM/SIGDA 7th International Symposium on Field Programmable Gate Arrays (FPGA’99).
59–68.
[3] X. Chen, J. Zhu, and M. Zhang. 2011. Timing-driven routing of high fanout nets. In Proceedings of the 21st International
Conference on Field Programmable Logic and Applications. 423–428.
[4] M. Gort and J. H. Anderson. 2012. Accelerating FPGA routing through parallelization and engineering enhancements
special section on PAR-CAD 2010. IEEE Trans. Comput.-Aided Design Integr. Circ. Syst. 31, 1 (2012), 61–74.
[5] M. Gort and J. H. Anderson. 2013. Combined architecture/algorithm approach to fast FPGA routing. IEEE Trans. Very
Large Scale Integr. Syst. 21, 6 (2013), 1067–1079.
[6] C. H. Hoo, Y. Ha, and A. Kumar. 2016. ParaFRo: A hybrid parallel FPGA router using fine grained synchronization
and partitioning. In Proceedings of the 26th International Conference on Field Programmable Logic and Applications
(FPL’16). 1–11.
[7] Chin Hau Hoo and Akash Kumar. 2018. ParaDRo: A parallel deterministic router based on spatial partitioning and
scheduling. In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA’18).
67–76.
[8] Jason Luu, Jeffrey Goeders, Michael Wainberg, Andrew Somerville, Thien Yu, Konstantin Nasartschuk, Miad Nasr,
Sen Wang, Tim Liu, Nooruddin Ahmed, Kenneth B. Kent, Jason Anderson, Jonathan Rose, and Vaughn Betz. 2014.
VTR 7.0: Next generation architecture and CAD system for FPGAs. ACM Trans. Reconfig. Technol. Syst. 7, 2, Article 6
(July 2014), 30 pages.
[9] Jason Luu, Ian Kuon, Peter Jamieson, Ted Campbell, Andy Ye, Wei Mark Fang, Kenneth Kent, and Jonathan Rose.
2011. VPR 5.0: FPGA CAD and architecture exploration tools with single-driver routing, heterogeneity and process
scaling. ACM Trans. Reconfig. Technol. Syst. 4, 4, Article 32 (Dec. 2011), 23 pages.
[10] L. McMurchie and C. Ebeling. 1995. PathFinder: A negotiation-based performance-driven router for FPGAs. In Pro-
ceedings of the 3rd International ACM Symposium on Field-Programmable Gate Arrays. 111–117.
[11] Y. Moctar, M. Stojilović, and P. Brisk. 2018. Deterministic parallel routing for FPGAs based on Galois parallel execution
model. In Proceedings of the 28th International Conference on Field Programmable Logic and Applications (FPL’18). 21–
25.
[12] Y. O. M. Moctar and P. Brisk. 2014. Parallel FPGA routing based on the operator formulation. In Proceedings of the
51st ACM/EDAC/IEEE Design Automation Conference (DAC’14). 1–6.
[13] Kevin E. Murray, Scott Whitty, Suya Liu, Jason Luu, and Vaughn Betz. 2015. Timing-driven titan: Enabling large
benchmarks and exploring the gap between academic and commercial CAD. ACM Trans. Reconfig. Technol. Syst. 8, 2,
Article 10 (Mar. 2015), 18 pages.
[14] Minghua Shen and Guojie Luo. 2015. Accelerate FPGA routing with parallel recursive partitioning. In Proceedings of
the IEEE/ACM International Conference on Computer-Aided Design (ICCAD ’15). 118–125.
[15] Minghua Shen and Guojie Luo. 2017. Corolla: GPU-accelerated FPGA routing based on subgraph dynamic expansion.
In Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays (FPGA’17). 105–114.
[16] M. Shen and N. Xiao. 2018. Fine-grained parallel routing for FPGAs with selective expansion. In Proceedings of the
IEEE 36th International Conference on Computer Design (ICCD’18). 577–586.
[17] M. Shen and N. Xiao. 2018. Load balance-aware multi-core parallel routing for large-scale FPGAs. In Proceedings of
the IEEE 36th International Conference on Computer Design (ICCD’18). 595–602.
[18] Jordan S. Swartz, Vaughn Betz, and Jonathan Rose. 1998. A fast routability-driven router for FPGAs. In Proceedings
of the ACM/SIGDA 6th International Symposium on Field Programmable Gate Arrays (FPGA’98). 140–149.
[19] Elias Vansteenkiste. 2016. New FPGA Design Tools and Architectures. Ph.D. Dissertation. Ghent University.
[20] E. Vansteenkiste, K. Bruneel, and D. Stroobandt. 2013. A connection-based router for FPGAs. In Proceedings of the
International Conference on Field-Programmable Technology (FPT’13). 326–329.
[21] D. Vercruyce, E. Vansteenkiste, and D. Stroobandt. 2018. How preserving circuit design hierarchy during FPGA pack-
ing leads to better performance. IEEE Trans. Comput.-Aided Design Integr. Circ. Syst. 37, 3 (2018), 629–642.
[22] D. Vercruyce, E. Vansteenkiste, and D. Stroobandt. 2019. CRoute: A fast high-quality timing-driven connection-based
FPGA router. In Proceedings of the IEEE 27th Annual International Symposium on Field-Programmable Custom Com-
puting Machines (FCCM’19). 53–60.
[23] D. Wang, Z. Duan, C. Tian, B. Huang, and N. Zhang. 2018. A runtime optimization approach for FPGA routing.
IEEE Trans. Comput.-Aided Design Integr. Circ. Syst. 37, 8 (2018), 1706–1710.
[24] D. Wang, Z. Duan, C. Tian, B. Huang, and N. Zhang. 2020. ParRA: A shared memory parallel FPGA router using
hybrid partitioning approach. IEEE Trans. Comput.-Aided Design Integr. Circ. Syst. 39, 4 (2020), 830–842.
[25] C. Zhu, J. Wang, and J. Lai. 2013. A novel net-partition-based multithread FPGA routing method. In Proceedings of
the 23rd International Conference on Field Programmable Logic and Applications. 1–4.
Received December 2019; revised May 2020; accepted June 2020
