Attention Routing: track-assignment detailed routing using
  attention-based reinforcement learning by Liao, Haiguang et al.
Proceedings of the ASME 2020 International Design Engineering Technical Conferences &
Computers and Information in Engineering Conference
IDETC/DAC 2020
August, 2020, St. Louis, USA
IDETC2020-19380
ATTENTION ROUTING: TRACK-ASSIGNMENT DETAILED ROUTING USING
ATTENTION-BASED REINFORCEMENT LEARNING
HAIGUANG LIAO¹, QINGYI DONG¹, XULIANG DONG¹, WENTAI ZHANG¹,
WANGYANG ZHANG², WEIYI QI², ELIAS FALLON², LEVENT BURAK KARA¹∗
¹ Carnegie Mellon University, Pittsburgh, PA 15213
² Cadence Design Systems, San Jose, CA 95134
ABSTRACT
In the physical design of integrated circuits, global and detailed
routing are critical stages that determine the interconnected paths of
each net on a circuit while satisfying the design constraints. Existing
routers and routability predictors either resort to expensive approaches
that lead to high computational times, or use heuristics that do not
generalize well. Even though new, learning-based routing methods have
been proposed to address this need, their requirements for labelled data
and their difficulties in handling complex design rule constraints have
limited their adoption in advanced technology node physical design
problems. In this work, we propose a new router, the attention router,
which is the first attempt to solve the track-assignment detailed routing
problem by applying reinforcement learning. Complex design rule
constraints are encoded into the routing algorithm, and an
attention-model-based REINFORCE algorithm is applied to solve the most
critical step in routing: sequencing the device pairs to be routed. The
attention router and its baseline genetic router are applied to solve
different analog circuit problem sets from commercial advanced
technologies. The attention router demonstrates generalization ability to
unseen problems and achieves more than 100× acceleration over the genetic
router without severely compromising the routing solution quality.
Increasing the number of training problems greatly improves the
performance of the attention router. We also discover a similarity
between the attention router and the baseline genetic router in terms of
positive correlations in cost and routing patterns, which demonstrates
the attention router’s ability to be utilized not only as a detailed
router but also as a predictor for routability and congestion.
∗Address all correspondence to this author.
1 INTRODUCTION
Integrated circuits (IC) are becoming increasingly sophisticated, keeping
pace with Moore’s Law [1]. To solve the increasingly complex IC system
design problems, new advanced electronic design automation (EDA) tools
are needed to help engineers, especially in the domain of advanced
technology node (< 16 nm) IC designs. In the physical design flow of ICs,
a critical step is routing, where paths connecting separate groups of
devices are generated based on the locations of devices determined in the
previous placement step. To make the problem tractable, routing is
addressed in two stages: global routing and detailed routing [2]. While
global routing coarsely assigns the space resources used for routing,
detailed routing generates the exact routes that connect electric
components. Placement and routing, while applied sequentially, are
interdependent: a good placement makes routing simpler, and quantitative
routing measures can in turn be used to assess the quality of a placement
solution.
This material is based upon work supported by DARPA under Contract No. HR0011-18-3-0010. Any opinions, findings and conclusions or recommendations expressed
in this material are those of the author(s) and do not necessarily reflect the views of the U.S. Government. Distribution Statement "A" (Approved for Public Release,
Distribution Unlimited). The views, opinions and/or findings expressed are those of the author and should not be interpreted as representing the official views or policies
of the Department of Defense or the U.S. Government.
arXiv:2004.09473v2 [cs.LG] 22 May 2020
To achieve successful and high quality IC physical designs,
prior works have emphasized quality of routing [3,4,5] vs. speed
of attaining a solution [6, 7], and have developed routability
prediction algorithms [8, 9]. However, existing routing algorithms
are primarily based on heuristic methods, which impose stringent
constraints and therefore do not generalize well to unseen problems.
Although there have been several learning-based methods to improve the
performance of routing algorithms [10,11,12], these approaches either
only work as routability predictors or are hampered by limited
generalization ability and the inability to account for complex design
rule constraints, which are becoming increasingly sophisticated in
advanced technology node IC design. As such, fast routing algorithms with
strong generalization ability are urgently needed. In this work, we
present attention routing, an application of an attention-model-based
reinforcement learning (RL) model to the track-assignment detailed
routing problem on advanced technology node problem sets. The routing
algorithm is designed to encode the design rules into the
track-assignment steps. The RL algorithm addresses one of the most
critical steps in routing, which is determining the best order in which
to route the set of device pairs, such that the overall solution quality
is maximized. To the best of our knowledge, this work is the first
attempt to solve the detailed routing problem using RL. The RL model is a
policy gradient method based on the attention model [13]. We describe our
attention router and also a genetic router, which is based on genetic
algorithms (GA). Both methods are tested on commercial advanced
technology node IC problem sets, and their performance is compared and
analyzed.
2 BACKGROUND AND RELATED WORK
2.1 Width Spacing Pattern
In advanced technology nodes, the manufacturing and design rule
constraints (e.g. those due to multi-patterning) have significantly
increased the complexity of the physical design task.
As a result, it is becoming increasingly more challenging for
layout designers to parse and memorize all the design rules. A
further layer of abstraction is introduced to address this issue,
namely the Width Spacing Pattern (WSP).
WSPs define a set of track patterns that consist of different
width and spacing configurations for metal wires. By restricting
the routes on the WSP rows and tracks, many design rules asso-
ciated with full custom designs, including those concerning the
spacing, minimum widths, and coloring rules can be avoided.
In this work, as we attempt to solve the detailed routing prob-
lem for analog circuits in advanced technologies (FinFET), the
routing strategies we present follow the design rule specifications
through the adoption of WSP abstraction.
2.2 Track-Assignment Routing
As mentioned above, routing is typically divided into two
stages: global routing and detailed routing [2]. In global routing,
the routing resources are allocated into sub-regions (in the WSP
setting, rows) and the detailed router will then implement the
planned routes, satisfying various constraints (e.g. open, short, and
design rule checks (DRC)). However, when many routes are sequentially
implemented, the two-step solution could result in undesirable detours
for global routes that are planned to be straight
[14]. Such issues can be addressed with a time-consuming rip-
up and re-route strategy, which involves heuristics that depend
on the routing style and technology requirements [15]. Another
approach is to insert a track assignment step between the global
and detailed routing stages that aims to solve the routing problem
in a more hierarchical manner [16, 17, 18, 19, 20].
FIGURE 1: Schematic showing track assignment constraints.
The goal of track assignment is to place the long routes onto
tracks defined by the WSP, with the constraints imposed by tech-
nology, routing resources, as well as conflicting routes being si-
multaneously considered. It also facilitates addressing layout de-
pendent issues such as crosstalk [19]. After the long routes are
embedded on the tracks, the detailed router’s job is to connect
those components (instance terminals, instTerms) belonging to
the same net, thereby significantly reducing the routing search
space.
Let us denote $t_i$ as the $i$-th track, $T$ the set of tracks defined by
the WSP, and $s_{ik}$ the $k$-th occupied interval on track $t_i$, as
shown in Fig. 1. Then, the utilized track resources on $t_i$ are
$u_{t_i} = \bigcup_k s_{ik}$. Similarly, the instTerms to be assigned are
denoted $r_i \in I$, where $r_i$ is the $i$-th instTerm and $I$ is the
instTerm set extracted from the global routing result. InstTerm $r_i$ is
assignable to track $t_j$ iff the respective track interval is not
occupied ($r_i \cap u_{t_j} = \emptyset$). The task of assigning instTerm
$r_i$ onto track $t_j$, denoted as $m_{ij}$, is associated with an
assignment cost $C_{ij}$, which reflects costs including track
occupation, perpendicular connection, and via insertion [18]. The
track-assignment step therefore decides a mapping $M$ which assigns all
the instTerms onto the available tracks without any conflict, while
minimizing the assignment cost:

$$M^* = \arg\min_{M} \Big\{ \sum_{m_{ij} \in M} C_{ij} \Big\} \quad \text{s.t.} \quad \forall m_{ij} \in M^*,\ r_i \cap u_{t_j} = \emptyset \tag{1}$$
Note that this is a modified weighted bipartite matching
problem, which is known to be NP-complete. As in [18], we
solve it using a heuristic based algorithm, and the details are dis-
cussed in Section 3.1.
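The assignability condition in the constraint of Eqn. 1 can be illustrated with a minimal Python sketch. It assumes that instTerms and occupied track intervals are closed integer x-ranges `(x1, x2)`; the names `overlaps` and `is_assignable` are hypothetical and are not taken from the authors' implementation.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # closed x-range (x1, x2); an assumption of this sketch


def overlaps(a: Interval, b: Interval) -> bool:
    """Return True if two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


def is_assignable(inst_term: Interval, occupied: List[Interval]) -> bool:
    """Check that the instTerm does not intersect the union of occupied intervals."""
    return not any(overlaps(inst_term, s) for s in occupied)


track_usage = [(0, 3), (8, 10)]            # occupied intervals on one track
print(is_assignable((4, 7), track_usage))  # fits in the free gap
print(is_assignable((3, 5), track_usage))  # touches the first occupied interval
```

Treating intervals as closed means that two ranges sharing an endpoint count as a conflict; whether the actual design rules require additional spacing is not stated in the text.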
More specifically, in our case (as shown in Figure 2), the routing task
consists of two sub-tasks: routing the instTerms on the appropriate
tracks and connecting the instTerms. Although an instTerm could consist
of many pins with different x-coordinates, an instTerm must be routed on
a single track, which makes the track assignment formulation suitable.
Therefore, in the proposed approach, instTerm routing is addressed with
track assignment, and we then use an attention-based RL model to solve
the most critical part of actually routing the assigned instTerms.
FIGURE 2: Schematic showing track assignment and routing.
2.3 Attention-based REINFORCE
Recent work [13] on using attention-model-based REINFORCE, a
reinforcement learning (RL) algorithm, to solve combinatorial problems
has demonstrated near-optimal performance with significant generalization
capability compared to existing heuristic-based methods. It outperforms
previous work including the Pointer Network (PN) [21], the actor-critic
version of PN [22] and the LSTM version of PN [23] on widely studied
problem sets including the Travelling Salesman Problem (TSP), the
Orienteering Problem (OP) and the prize collecting TSP. In these
problems, the solution can always be formulated as a sequential decision
process. One important reason they tend to be solved reasonably well with
RL is that they can be modelled as a Markov Decision Process (MDP).
In solving such combinatorial problems, attention-model-based REINFORCE
uses an existing policy gradient RL algorithm: REINFORCE. In a policy
gradient RL algorithm, a policy model $p(a \mid s, \theta)$ is learned
that maps the state $s$ of a problem at each time step to a probability
distribution over all actions $a$, by iteratively optimizing the policy
model parameters $\theta$ over training samples. The cost that the
training process aims to minimize is the expectation of the reward $r$
collected after a policy $p$ has been rolled out for an episode, which
can be expressed as $\mathbb{E}_p[r(\tau)]$ [24]. Following the policy
gradient theorem, the gradient of the cost can be expressed as shown in
Eqn. 2:

$$\nabla \mathbb{E}_{\pi_\theta}[r(\tau)] = \mathbb{E}_{\pi_\theta}\Big[ r(\tau) \Big( \sum_{t=1}^{T} \nabla \log p_\theta(a_t \mid s_t) \Big) \Big] \tag{2}$$
which can be sampled and approximated from training data.
In REINFORCE, the above gradient of cost function is used
to optimize or train the policy model. However, the training pro-
cess of REINFORCE tends to be unstable due to the delayed
reward mechanism of REINFORCE. Thus, REINFORCE with
baseline is applied to stabilize REINFORCE.
In attention-model-based REINFORCE [13], the problem formulation, taking
the Travelling Salesman Problem (TSP) as an example, can be described as
follows. The solution is defined as a tour
$\pi = (\pi_1, \ldots, \pi_n)$, which is a sequence of the $n$ nodes
(cities) in a TSP problem $s$. The input of the policy model is based on
the graph structure (layout of cities) of the TSP, and the policy model
outputs a probability distribution $p_\theta(\pi_t \mid s, \pi_{1:t-1})$
over all the nodes that may be visited at the next time step $t$, during
which nodes already visited are masked so that they have zero probability
of being visited again. Based on this, a problem policy
$p_\theta(\pi \mid s)$ is defined as in Eqn. 3, which is the product of
the probabilities over the $n$ steps. If at each time step the node with
the highest probability is chosen, the solution (path) is said to be
obtained by a deterministic greedy rollout:

$$p_\theta(\pi \mid s) = \prod_{t=1}^{n} p_\theta(\pi_t \mid s, \pi_{1:t-1}) \tag{3}$$
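The factorization in Eqn. 3 and the difference between a sampled and a greedy rollout can be sketched in a few lines of Python. This is only an illustration: `score_fn` is a hypothetical stand-in for the attention decoder, which is not reproduced here, and masking is implemented simply by leaving visited nodes out of the softmax.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

# score_fn(visited_prefix, candidate) -> unnormalised logit; a hypothetical
# placeholder for the learned decoder.
ScoreFn = Callable[[Sequence[int], int], float]


def rollout(n: int, score_fn: ScoreFn, greedy: bool,
            rng: random.Random) -> Tuple[List[int], float]:
    """Build a permutation of range(n) step by step; return it with log p(pi | s)."""
    tour: List[int] = []
    log_prob = 0.0
    for _ in range(n):
        # Masking: nodes already in the tour are excluded from the softmax.
        candidates = [j for j in range(n) if j not in tour]
        logits = [score_fn(tour, j) for j in candidates]
        m = max(logits)
        weights = [math.exp(x - m) for x in logits]
        total = sum(weights)
        probs = [w / total for w in weights]
        if greedy:
            k = max(range(len(candidates)), key=probs.__getitem__)
        else:
            k = rng.choices(range(len(candidates)), weights=probs)[0]
        tour.append(candidates[k])
        log_prob += math.log(probs[k])  # sum of logs = log of the product in Eqn. 3
    return tour, log_prob


# Toy scoring rule (an assumption): prefer nodes with a small index.
toy_score = lambda prefix, j: -float(j)
print(rollout(4, toy_score, greedy=True, rng=random.Random(0)))
print(rollout(4, toy_score, greedy=False, rng=random.Random(0)))
```

The greedy call is deterministic, while the sampled call depends on the random seed; both return the log-probability that a REINFORCE-style update would differentiate.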
Based on the problem policy and the policy gradient theorem, the gradient
of the loss function used in the attention model is defined as in Eqn. 4.
For a TSP problem, the loss term $L(\pi)$ is the total tour length of the
TSP problem instance. When applying the attention-based model to solve
detailed routing problems, the loss is changed accordingly:

$$\nabla \mathcal{L}(\theta \mid s) = \mathbb{E}_{p_\theta(\pi \mid s)}\big[ (L(\pi) - b(s)) \nabla \log p_\theta(\pi \mid s) \big] \tag{4}$$
A rollout baseline b(s) is applied that is periodically updated
in the following ways: b(s) is the cost of a solution from a de-
terministic greedy rollout of the policy defined by the best model
so far. In actual implementation, a t-test is applied to ensure that
the baseline is based on the best model so far.
The policy model is realized with an attention-based encoder-decoder
model, which can be considered a Graph Attention Network [25], as shown
in Fig. 3. The encoder is similar to the Transformer architecture [26]
encoder. After a first learned linear projection layer, the major part of
the encoder consists of N attention layers, each consisting of two
sublayers: a multi-head attention (MHA) layer and a fully connected
feed-forward (FF) layer. A skip connection [27] and batch normalization
(BN) [28] are applied at each sublayer. Formally, the attention layers
can be expressed as follows:
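$$\hat{h}_i = \mathrm{BN}^{l}\Big( h_i^{(l-1)} + \mathrm{MHA}_i^{l}\big( h_1^{(l-1)}, \ldots, h_n^{(l-1)} \big) \Big) \tag{5}$$

$$h_i^{(l)} = \mathrm{BN}^{l}\big( \hat{h}_i + \mathrm{FF}^{l}(\hat{h}_i) \big) \tag{6}$$

where $\hat{h}_i$ represents the output values (in vector form) of the
MHA sublayer of the $l$-th layer at the $i$-th node, and $h_i^{(l)}$
represents the output values of the $l$-th layer at the $i$-th node,
obtained from $\hat{h}_i$ after the FF sublayer and BN.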
At the heart of the MHA structure is the attention mecha-
nism which can be summarized as a weighted message passing
between the nodes in a graph. The weight of a message value that
a node receives from a neighbor depends on the compatibility of
its query with the key of the neighbor [13]. For each node, its
corresponding key, query and value are obtained by projecting the
node embedding/input with the corresponding parameter matrices,
whose weights are automatically learned during training. The
decoder consists of only one layer of MHA and it outputs the
probability distribution of nodes to be visited at each time step
based on the embeddings from the encoder and the output gener-
ated at the previous time steps.
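The weighted message passing described above corresponds to scaled dot-product attention. The following pure-Python sketch shows a single attention head over a fully connected graph. It is an illustration under simplifying assumptions: the queries, keys and values are passed in directly rather than produced by learned projection matrices, and the multi-head combination, skip connections and BN are omitted.

```python
import math
from typing import List

Vector = List[float]


def attention(queries: List[Vector], keys: List[Vector],
              values: List[Vector]) -> List[Vector]:
    """Single-head scaled dot-product attention; one output vector per query."""
    d_k = len(keys[0])
    dim = len(values[0])
    outputs = []
    for q in queries:
        # Compatibility of this node's query with the key of every node.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]  # softmax over the neighbours
        # Message received by the node: attention-weighted sum of the values.
        outputs.append([sum(w * v[c] for w, v in zip(weights, values))
                        for c in range(dim)])
    return outputs


# With identical keys every neighbour receives the same weight, so the
# output is the plain average of the values.
print(attention([[1.0, 0.0]], [[0.0, 0.0], [0.0, 0.0]], [[1.0, 2.0], [3.0, 4.0]]))
```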
Inspired by the similarity between existing combinatorial
problems and the critical sequencing step underlying detailed
routing algorithms, we propose an attention-based REINFORCE
model as a new way to address the detailed routing problem.
In applying the attention-based model to solve detailed routing, while
the model architecture remains the same, the loss function is modified
and the definition of a node now becomes a pair of instTerms, which is
addressed in detail in Section 3 (Method). (Another motivation for
applying a learning-based algorithm to solve detailed routing is the
difficulty of applying a robust heuristic-based or hard-coded method to
the sequencing step. It is worth mentioning that although there exist
simpler algorithms [13] for standard combinatorial problems, such as
Nearest Neighbors for TSP, the unique hierarchical nature and complexity
of the IC physical design flow and routing make these algorithms not
readily applicable to detailed routing [12].)
FIGURE 3: Encoder-decoder model structures. Adopted from
[13].
2.4 Genetic Algorithm
Genetic algorithms (GA) [29] have been widely used to solve combinatorial
optimization problems [30, 31]. A GA proceeds through iterations of
generations, as schematically illustrated in Fig. 4. In each generation,
a population consisting of a pool of chromosomes is first generated,
either randomly or from the elite parents of the previous generation. A
fitness function is then applied to calculate the fitness of each
chromosome in the population. The proportion of chromosomes in the
original population with the highest fitness scores is selected as the
elites, which become the parents of the next generation. A new population
is generated by crossover and mutation operations on the chromosomes of
the elites. In the next generation iteration, the previous population is
replaced by the newly generated population.
In this work, GA, as a comparison to the attention model, is used in the
genetic router to determine the sequence of instTerm pairs to be routed,
which significantly affects the quality of detailed routing solutions. GA
has been one of the best methods for solving combinatorial optimization
problems in IC physical design [32, 33, 34], especially when other
heuristics and learning methods are not readily available. Unfortunately,
although it tends to work well on small-scale problems with no stringent
run time requirements, it suffers from a lack of generalization ability,
large run time cost and limited scalability. In this work, we propose to
use the attention router as an alternative to the genetic router in
track-assignment detailed routing.
FIGURE 4: Generation iteration of genetic algorithm (GA).
3 METHOD
Fig. 5 illustrates our attention router model. First, all problem files
in a specific problem set, which describe the design information
including instTerm locations and net information, are read in and parsed
by the Initializer. The Track Assigner is then applied to complete the
track assignments for all instTerms in each problem. Once complete, the
exact locations of all instTerms are determined. Next, the Pin Decomposer
is applied to all the nets of each problem to further simplify each
problem for the subsequent routing into the form of a set of instTerm
pairs.
FIGURE 5: Pipeline of our attention routing.
In routing instTerm pairs, GA (for the genetic router) and the
attention-based REINFORCE model (for the attention router) are applied to
determine the sequence in which the instTerm pairs of a problem are
routed. GA Sequencing is executed only once, while Attention Sequencing
is first trained by solving problems from the training set, and then
applied to problems in both the training and test sets. During execution
of GA Sequencing and Attention Sequencing, the same Pattern Router
function is utilized to compute the exact routes connecting each instTerm
pair and to calculate the cost of the solution. The problem solution is a
concatenation of the actual routes for all the instTerm pairs, and the
cost of a solution is defined as a weighted sum of wirelength (WL) and
number of openings (#Open), which is the number of instTerm pairs that
remain unconnected due to the lack of a feasible route for the given
problem. The cost is given in Eqn. 7. Since openings are highly
undesirable in the physical design process, in this research the weights
are set as $w_1 = 1$, $w_2 = 10$.

$$\mathrm{Cost} = w_1 \cdot \mathrm{WL} + w_2 \cdot \#\mathrm{Open} \tag{7}$$
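As a small worked example of Eqn. 7, the sketch below evaluates the cost with the weights stated in the text. The function name `routing_cost` and the two example solutions are hypothetical.

```python
def routing_cost(wirelength: float, num_open: int,
                 w1: float = 1.0, w2: float = 10.0) -> float:
    """Weighted sum of wirelength and number of openings (Eqn. 7)."""
    return w1 * wirelength + w2 * num_open


# Hypothetical solutions: a slightly longer but fully connected routing
# is preferred over a shorter one that leaves one instTerm pair open.
print(routing_cost(wirelength=120, num_open=0))  # 120.0
print(routing_cost(wirelength=115, num_open=1))  # 125.0
```

With these weights, one opening costs as much as ten units of wirelength, so the penalty only dominates when wirelength differences between candidate solutions are small.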
The details of the individual modules of our method are provided next.
Python and PyTorch (a machine learning framework) are used for
implementing the proposed algorithm.
3.1 Track Assigner
The first step of routing is to assign instTerms to WSP tracks.
As the x-coordinates of all instTerms are fixed, the routing prob-
lem is reduced to finding the appropriate track while satisfying
the assignment rules and ensuring no short circuits.
Not all tracks can be used to route an instTerm. For illustration, in
Fig. 6 (a) with three tracks, Tracks 1 and 2 can only contain gate (G)
and source/drain (S/D) terminals, while Track 3 can have both G and S/D
terminals. Specifically, in this work, we have seven tracks per row,
where the G terminals should be on tracks 1, 2, 6, and 7, and S/D
terminals on tracks 2, 3, 4, 5, and 6. Hence, for instTerms composed of
both G and S/D, only tracks 2 and 6 shall be used.
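The track restriction above amounts to a set intersection. The sketch below uses the track numbers given in the text; the function name `allowed_tracks` and the boolean flags are hypothetical.

```python
G_TRACKS = {1, 2, 6, 7}        # tracks that accept gate (G) terminals
SD_TRACKS = {2, 3, 4, 5, 6}    # tracks that accept source/drain (S/D) terminals


def allowed_tracks(has_gate: bool, has_sd: bool) -> set:
    """Tracks on which an instTerm with the given terminal types may be placed."""
    tracks = set(range(1, 8))   # seven tracks per row
    if has_gate:
        tracks &= G_TRACKS
    if has_sd:
        tracks &= SD_TRACKS
    return tracks


print(sorted(allowed_tracks(has_gate=True, has_sd=True)))  # [2, 6]
```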
Two graphs are used in the track assigner, namely the overlap graph and
the assignment graph (Fig. 6). The overlap graph models the conflicts
between instTerms: each node represents an instTerm, and an edge exists
between two nodes if they belong to different nets and their x-ranges
overlap (implying that they cannot be assigned to the same track). This
constitutes a horizontal constraint graph [2]. The assignment graph is a
weighted bipartite graph: the nodes on one side are the instTerms and the
nodes on the other side are the available tracks. If an instTerm is
assignable to a track, these two nodes are connected by an edge whose
weight is the assignment cost. Because each instTerm is constrained to
route on the tracks of its containing row, the vertical connection and
via costs can be omitted, and we model the cost considering only the
horizontal track utilization.
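A minimal sketch of how the overlap graph could be built is given below. The record layout `(name, net, x1, x2)` and the function name are assumptions made for illustration; x-ranges are treated as closed intervals.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

# (name, net, x1, x2): a hypothetical record layout for this sketch
InstTerm = Tuple[str, str, int, int]


def build_overlap_graph(inst_terms: List[InstTerm]) -> Dict[str, Set[str]]:
    """Adjacency sets: an edge joins instTerms of different nets whose x-ranges overlap."""
    graph: Dict[str, Set[str]] = {name: set() for name, _, _, _ in inst_terms}
    for (a, net_a, a1, a2), (b, net_b, b1, b2) in combinations(inst_terms, 2):
        if net_a != net_b and a1 <= b2 and b1 <= a2:
            graph[a].add(b)
            graph[b].add(a)
    return graph


terms = [("r1", "n1", 0, 4), ("r2", "n2", 3, 6), ("r3", "n1", 2, 5), ("r4", "n3", 7, 9)]
for name, neighbours in build_overlap_graph(terms).items():
    print(name, sorted(neighbours))
```

In this example r1 and r3 overlap in x but belong to the same net, so no conflict edge is created between them.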
FIGURE 6: Illustration of the track assignment problem: (1) the instTerms and tracks, (2) the overlap graph, (3) the assignment graph.
With the help of the two graphs, the track assignment problem is reduced
to matching the instTerm nodes to the track nodes in the assignment graph
while minimizing the matching cost, such that no conflicting instTerm
nodes in the overlap graph are matched to the same track. For the
standard bipartite matching problem, [35] provides a polynomial-time
algorithm, but the problem becomes NP-complete [18] after introducing the
assignment conflict constraint. As instTerm splitting is not allowed, we
modified the algorithm presented in [18] to solve the instTerm track
assignment problem, and the algorithm is described in Algorithm 1.
Algorithm 1: The track assignment algorithm.
Input : Netlist containing the instTerm and track information
Output: Assigned instTerm-track pairs
Build overlap graph G_O = (V_O, E_O) and assignment graph G_A = (V_A, E_A, w);
while Exists assignable instTerms do
    Find the largest clique K_m in G_O;
    Perform weighted bipartite matching on the sub-graph G_m = (V_m, E_m),
        where V_m ⊆ V_A, E_m ⊆ E_A, V_m ∈ K_m;
    Assign the uniquely assignable instTerms to the corresponding track
        (look-ahead heuristic in [18]);
    Update G_O and G_A: remove assigned instTerm nodes and associated edges;
end
3.2 Pin Decomposer
Each net is composed of multiple instTerms. Each instTerm
has the coordinates $(x_1, x_2, y)$, with $x_1, x_2, y \in \mathbb{Z}^+$.
In order to simplify the problem, instead of directly working on the sequence
of nets, we first decompose each net into multiple two-instTerm
pairs, so that after the decomposition, our model will produce the
best sequence of these instTerm pairs. Kruskal’s algorithm [36]
is utilized to construct a Minimum Spanning Tree (MST) first, as
the MST naturally reveals the pin pairs that should be connected
as shown in Fig. 7a.
In order to create the MST, a distance matrix is needed, where each
element (i, j) in this matrix is the distance between instTerm i and
instTerm j. However, because we are dealing with instTerms (bars) instead
of nodes (points), the distance is computed as the minimum Manhattan
distance between the instTerm bars, as shown in Fig. 7b and Fig. 7c. Note
that even after this decomposition, we are still dealing with instTerm
pairs rather than node pairs.
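The pin decomposition described above can be sketched as follows. The code assumes that each instTerm is a horizontal bar `(x1, x2, y)` with `x1 <= x2`, computes the minimum Manhattan distance between bars, and runs Kruskal's algorithm with a simple union-find; the function names are hypothetical and tie-breaking between equal distances is an arbitrary choice of this sketch.

```python
from itertools import combinations
from typing import List, Tuple

Bar = Tuple[int, int, int]  # (x1, x2, y) with x1 <= x2


def bar_distance(a: Bar, b: Bar) -> int:
    """Minimum Manhattan distance between two horizontal bars."""
    dx = max(0, max(a[0], b[0]) - min(a[1], b[1]))  # 0 when the x-ranges overlap
    dy = abs(a[2] - b[2])
    return dx + dy


def decompose_net(bars: List[Bar]) -> List[Tuple[int, int]]:
    """Kruskal's algorithm: return the MST edges as index pairs of instTerms."""
    parent = list(range(len(bars)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    edges = sorted(
        (bar_distance(bars[i], bars[j]), i, j)
        for i, j in combinations(range(len(bars)), 2)
    )
    pairs = []
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:           # the edge joins two separate components
            parent[ri] = rj
            pairs.append((i, j))
    return pairs


net = [(0, 2, 0), (1, 3, 2), (6, 8, 0)]
print(decompose_net(net))  # [(0, 1), (0, 2)]
```

A net with k instTerms always yields k - 1 instTerm pairs, which become the nodes of the sequencing problem.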
3.3 Pattern Router
We route each instTerm pair sequentially using a simplified pattern
router [37] in the rectilinear space, which can be modelled as a graph
$G(V, E)$. In the graph, each edge $e_{ij}$ has a capacity $c_{ij}$,
which is initialized to 1 before routing. In routing the instTerm pairs
of a problem, the routable paths consist of edges $e_{ij}$,
$i, j \in \mathbb{Z}^+$, with non-zero capacity.
FIGURE 7: Pin decomposer. (a) An MST reveals the instTerm pairs.
(b)(c) Calculation of the distance between two instTerms.
3.3.1 Routing two vertices. We loop through all combinations of
$(v_i, v_j)$, where $v_i$ is a vertex from instTerm i and $v_j$ is a
vertex from instTerm j. For each combination, we use our simplified
pattern router, where we only consider “L” patterns and then “Z” patterns
if “L” patterns fail.
“L” pattern routing. There are 2 kinds of “L” patterns: upper “L” and
lower “L”, which are shown in Fig. 8a and Fig. 8b respectively. Straight
lines are also considered a special case of the “L” pattern, when two
instTerms overlap in the x axis or share the same y coordinate.
“Z” pattern routing. If “L” patterns fail, our router employs
“Z” pattern routing. There are also two kinds of “Z” patterns, as
shown in Fig. 8c and Fig. 8d.
If none of the above patterns can route $(v_i, v_j)$, an opening occurs.
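The “L” pattern step can be sketched on a unit grid as follows. This is an illustrative simplification, not the authors' router: vertices are integer grid points, every unit edge has capacity 1, a set named `used` (hypothetical) records edges whose capacity has dropped to 0, and the “Z” pattern fallback is left out.

```python
from typing import List, Optional, Set, Tuple

Point = Tuple[int, int]
Edge = Tuple[Point, Point]


def segment(a: Point, b: Point) -> List[Edge]:
    """Unit edges of an axis-aligned segment from a to b (empty when a == b)."""
    (x1, y1), (x2, y2) = a, b
    if y1 == y2:
        return [((x, y1), (x + 1, y1)) for x in range(min(x1, x2), max(x1, x2))]
    return [((x1, y), (x1, y + 1)) for y in range(min(y1, y2), max(y1, y2))]


def route_l(a: Point, b: Point, used: Set[Edge]) -> Optional[List[Edge]]:
    """Try both L-shaped paths between a and b; edges in `used` have capacity 0."""
    for corner in ((b[0], a[1]), (a[0], b[1])):
        path = segment(a, corner) + segment(corner, b)
        if not any(e in used for e in path):
            used.update(path)   # capacity 1 -> 0 once an edge is routed
            return path
    return None                 # the caller would fall back to "Z" patterns


used_edges: Set[Edge] = set()
print(route_l((0, 0), (2, 1), used_edges))  # first L shape is free
print(route_l((0, 0), (2, 1), used_edges))  # second L shape is used instead
print(route_l((0, 0), (2, 1), used_edges))  # both L shapes are blocked
```

When the two points share an x or y coordinate, one of the two segments is empty and the path degenerates to a straight line, which matches the special case mentioned in the text.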
3.3.2 Post processing. Once $v_i$ and $v_j$ are routed successfully, we
obtain the path that connects them. Any redundancies in the path are
removed. Fig. 8e and Fig. 8f illustrate that the redundancy is the
overlapping part between the path and instTerms i, j.
3.4 Attention Model Implementation
In order to find an optimized routing solution that minimizes the loss
among all possible routing sequences, we use an attention-based
encoder-decoder model with a rollout baseline. We define each problem
instance as a graph with $n$ nodes, where each node $n_i$,
$i \in \{1, \ldots, n\}$, represents an instTerm pair between instTerm
$i$ and instTerm $j$. Each instTerm pair is in the form:

$$(x_{i1}, x_{i2}, y_i, x_{j1}, x_{j2}, y_j, l), \quad \forall i, j,\ i \neq j,$$

where $x_{i1}, x_{i2}, y_i$ represent the xy-coordinates of instTerm $i$.
Similarly, $x_{j1}, x_{j2}, y_j$ represent the xy-coordinates of instTerm
$j$; $l$ represents the net index, which is later used in the routing
part and thus will not be discussed in detail here. We define the
solution, the routing sequence $\pi$, as a permutation of the $n$ nodes.
FIGURE 8: Two pin router. (a)(b) “L” pattern routing, (c)(d) “Z”
pattern routing, (e)(f) Post Processing.
Since the number of instTerms varies in each routing problem, to ensure
that each problem instance $s$ has the same number of nodes $n$ (instTerm
pairs), we apply one of two possible padding strategies to the problem
instance: Pad Random and Pad Empty. In
the Pad Random strategy, we uniformly sample xy-coordinates
in the domain of all instTerms; in the Pad Empty strategy, we
are not concerned with the actual coordinates and instead pad
with all-zero nodes in the form of (0,0,0,0,0,0,0). In each strat-
egy, we set the graph size n to be the maximum graph size of all
the problem instances, and pad nodes to problem instances of
smaller size. After experiments with the two padding strategies,
we decide to use the Pad Empty strategy, which performs more
stably and generates routing sequences with smaller loss. We
think the result of the Pad Random strategy can be improved if
the coordinate sampling is done based on the original coordinate
distribution of the instTerms instead of a uniform sampling.
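The Pad Empty strategy can be illustrated with a short sketch. It assumes that each problem is a list of 7-tuples in the node format defined above; the names `EMPTY_NODE` and `pad_empty` are hypothetical.

```python
from typing import List, Tuple

Node = Tuple[int, int, int, int, int, int, int]  # (xi1, xi2, yi, xj1, xj2, yj, l)
EMPTY_NODE: Node = (0, 0, 0, 0, 0, 0, 0)


def pad_empty(problems: List[List[Node]]) -> List[List[Node]]:
    """Pad every problem with all-zero nodes up to the largest graph size."""
    n = max(len(p) for p in problems)
    return [p + [EMPTY_NODE] * (n - len(p)) for p in problems]


batch = [
    [(0, 2, 1, 4, 6, 1, 0), (1, 3, 2, 1, 3, 5, 1)],
    [(5, 7, 3, 9, 9, 3, 0)],
]
padded = pad_empty(batch)
print([len(p) for p in padded])  # [2, 2]
```

How the padded nodes are treated during routing and cost evaluation is not described in the text, so this sketch covers only the input preparation.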
After careful parameter tuning, we set the number of training batches to
B = 20 and the training batch size to T = 5. We train our model for
E = 100 epochs.
Our problem sets contain different total numbers of problem instances,
each of which we split into 60% training, 20% validation, and 20% test
cases. Each epoch is a walkthrough of all the problem instances in the
training set. In each batch, the
dataset loader loads 5 out of the training problem instances in
sequence as the current training batch. At the end of each epoch,
the model is evaluated on the validation set, and the average loss
is computed. After completion of all epochs, the model with the
smallest average loss is used to evaluate the test set by generating
corresponding routing sequences to each test problem.
We define the loss
$\mathcal{L}(\theta \mid s) = \mathbb{E}_{p_\theta(\pi \mid s)}[L(\pi)]$,
where $L(\pi)$ is a vector of length 5, containing the losses of each
problem instance in the batch returned by the pattern router discussed in
Section 3.3. The loss of each problem instance returned by the pattern
router is defined as a weighted sum of the total wire length and the
number of openings obtained with the routing sequence $\pi$ generated by
the attention model, as shown in Eqn. 7. We optimize the loss
$\mathcal{L}(\theta \mid s)$ by gradient descent, using the REINFORCE
gradient estimator with rollout baseline $b(s)$ [13]:

$$\nabla \mathcal{L}(\theta \mid s) = \mathbb{E}_{p_\theta(\pi \mid s)}\big[ (L(\pi) - b(s)) \nabla \log p_\theta(\pi \mid s) \big] \tag{8}$$
At the end of each epoch, we perform a one-sided t-test be-
tween the current model and the baseline model with a signifi-
cance parameter α to decide whether or not the baseline model
should be updated. The algorithm is described in Algorithm 2.
Algorithm 2: Attention Sequencing
Input : Number of epochs E, number of batches per epoch B, batch size T,
        significance α
Output: Sequence based on best policy
Init θ, θ_BL ← θ;
for epoch = 1, ..., E do
    for batch = 1, ..., B do
        t_i ← SampleInstance() ∀i ∈ {1, ..., T};
        π_i ← SampleRollout(t_i, p_θ) ∀i ∈ {1, ..., T};
        π_i^BL ← GreedyRollout(t_i, p_θBL) ∀i ∈ {1, ..., T};
        ∇L ← ∑_{i=1}^{T} (L(π_i) − L(π_i^BL)) ∇_θ log p_θ(π_i);
        θ ← Adam(θ, ∇L);
    end
    if OneSidedPairedTTest(p_θ, p_θBL) ≤ α then
        θ_BL ← θ;
    end
end
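The baseline update test at the end of each epoch can be sketched with the Python standard library. This is an assumption-laden illustration: it compares the paired t statistic with a critical value supplied by the caller (for example from a t table) instead of computing a p-value, and the function names and cost lists are hypothetical. The statistic is undefined when all paired differences are identical.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def paired_t_statistic(candidate_costs: Sequence[float],
                       baseline_costs: Sequence[float]) -> float:
    """t statistic of the paired differences (candidate - baseline)."""
    diffs = [c - b for c, b in zip(candidate_costs, baseline_costs)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))


def should_update_baseline(candidate_costs: Sequence[float],
                           baseline_costs: Sequence[float],
                           t_critical: float) -> bool:
    """One-sided test: update only if the candidate is significantly cheaper."""
    return paired_t_statistic(candidate_costs, baseline_costs) < -t_critical


# Hypothetical greedy-rollout costs of both models on four validation problems.
current = [10.0, 11.0, 9.0, 10.0]
baseline = [12.0, 14.0, 11.0, 13.0]
# 2.353 is the one-sided 5% critical value of the t distribution with 3 degrees of freedom.
print(should_update_baseline(current, baseline, t_critical=2.353))
```

Using a paired test is natural here because both models are evaluated on the same problem instances.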
3.5 Genetic Algorithm (GA) Sequencing
The GA-based sequencing, which serves as a comparison to the
attention-based model in this work, follows the typical generation
iterations of GA shown in Fig. 4, with crossover and mutation operations
within each generation. The details of the GA sequencing are shown in
Algorithm 3. In this problem, each chromosome consists of an ordered
vector of numbers, representing the routing sequence for all instTerm
pairs in a problem. Since we are trying to minimize the cost in Eqn. 7,
the fitness of a chromosome is the negative value of the cost in Eqn. 7
obtained by solving the corresponding problem with the sequence indicated
by the chromosome. Model parameters for the GA sequencing are set as:
generation number: 10, population size: 10, elites size: 4, number of
mutations: 1. Note that a limited number of generations is chosen to keep
the run time of GA sequencing from becoming too long.
Since each chromosome in our algorithm is a sequence rather than a set of
independent numbers, each number can appear only once in a chromosome. To
preserve this uniqueness, the crossover and mutation operations adopted
in this research are illustrated in Fig. 9. In generating a new child’s
chromosome, partially matched crossover is adopted. After crossover, a
newly generated chromosome is obtained. In the mutation step, two genes
at two randomly selected locations in the newly generated chromosome
switch their positions. These crossover and mutation methods guarantee
that all generated children represent a legal sequence.
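The two operators can be sketched as follows. This is one common formulation of partially matched crossover (PMX) and a swap mutation; the exact variant used by the authors is not specified in the text, and the cut points `lo` and `hi` are passed explicitly here so that the example is reproducible.

```python
import random
from typing import List


def pmx(parent_a: List[int], parent_b: List[int], lo: int, hi: int) -> List[int]:
    """PMX: copy parent_a[lo:hi] into the child, fill the rest from parent_b."""
    n = len(parent_a)
    child = list(parent_b)
    child[lo:hi] = parent_a[lo:hi]
    copied = set(parent_a[lo:hi])
    # Map each copied gene to the gene parent_b holds at the same position.
    mapping = {parent_a[i]: parent_b[i] for i in range(lo, hi)}
    for i in list(range(lo)) + list(range(hi, n)):
        gene = parent_b[i]
        while gene in copied:   # conflict: follow the mapping until it is resolved
            gene = mapping[gene]
        child[i] = gene
    return child


def swap_mutation(chromosome: List[int], rng: random.Random) -> List[int]:
    """Exchange the genes at two distinct random positions."""
    i, j = rng.sample(range(len(chromosome)), 2)
    mutated = list(chromosome)
    mutated[i], mutated[j] = mutated[j], mutated[i]
    return mutated


a = [1, 2, 3, 4, 5, 6, 7, 8]
b = [3, 7, 5, 1, 6, 8, 2, 4]
child = pmx(a, b, lo=3, hi=6)
print(child)                                   # [3, 7, 8, 4, 5, 6, 2, 1]
print(swap_mutation(child, random.Random(0)))  # still a permutation of 1..8
```

Both operators only rearrange genes, so every child remains a valid routing sequence in which each instTerm pair appears exactly once.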
Algorithm 3: Genetic Algorithm Sequencing
Input : Number of generations G, population size P, elites size Q,
        number of mutations M
Output: Last generation of sequencing (chromosome)
Init chromosomes {C_1, ..., C_P} in first generation;
for generation = 1, ..., G do
    Select elites E_1, ..., E_Q based on Fitness Score;
    for i = 1, ..., P do
        C_i ← CrossOver(E_i, E_j), i, j ∈ {1, ..., Q};
        C_i ← Mutation(C_i);
    end
end
4 EXPERIMENTS
In order to assess the performance of the attention router in comparison
with the baseline genetic router, both algorithms are applied to detailed
routing problems from two problem sets, Small and Large, which are both
analog design problems based on commercial advanced node technologies
(sub-16 nm technology). Specifically, the Small problem set consists of
different placement solutions for Comparators and OpAmp, while the Large
problem set consists of different placement solutions of an
Analog-to-Digital Converter (ADC). For the Small problem set, the number
of instTerms in each problem ranges from 10 to 100, and for the Large
problem set, the number of instTerms in each problem ranges from 100 to
1000.
FIGURE 9: Crossover and mutation methods used in the GA sequencing.
We ran experiments on both problem sets. Three experiments
were conducted using the Small data set with 100, 500, and 5000
training problems, denoted Small100, Small500, and Small5000,
respectively. Two experiments were conducted using the Large data
set with 100 and 500 training problems, denoted Large100 and
Large500, respectively. In the genetic router, GA sequencing is
run for each of the problems in the problem sets. In the
attention router, attention sequencing is trained iteratively on
the training sets and then applied to previously unseen problems
in the test sets. For the four sets of experiments, the key
parameters of the attention model are batch size = 5 and epoch
number = 100. Increasing the batch size and the number of epochs
significantly improves the attention model’s performance; the
epoch number is therefore set so that the model gains enough
learning experience without spending too much time on training.
All experiments are run on a workstation with an Intel Core
i7-6850 CPU. Training the attention router takes around 6 minutes
per epoch on problem set Small500 and around 25 minutes per epoch
on problem set Large500.
5 RESULTS AND DISCUSSIONS
5.1 Training
Fig. 10 shows cost versus training epochs for problem
sets Small100 and Small500. The high variation during training
is an intrinsic property of the REINFORCE policy gradient
algorithm used in the attention model. It can be explained by
the “delayed reward” of the REINFORCE algorithm used to optimize
the network, as shown in Eqn. 2. In that equation, the reward
signal is not obtained until all T actions {a1, ..., aT} have
been taken; it is then multiplied by the sum of the log-values
of pθ(at|st) over all steps to form the gradient values used to
optimize the policy network. This delayed reward makes training
unstable because there is no clear indication of how much each
action in the sequence {a1, ..., aT} contributes to the reward.
As a result, the gradient given by the policy gradient theorem in
Eqn. 2 provides only rough guidance on the optimization
direction, rather than a more desirable one that would lead to
monotonically decreasing cost values.
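The delayed-reward effect can be made concrete with a small Python sketch. It assumes a deliberately simple stand-in policy (one logit per item, with a softmax over the items not yet sequenced) instead of the attention model, and a hypothetical `toy_cost` instead of the routing cost. All names are illustrative. The point is only that a single terminal cost scales the gradient contribution of every step.

```python
import math
import random


def sample_sequence(theta, rng):
    """Sample a sequence step by step and accumulate d/d_theta of sum_t log p(a_t | s_t)."""
    remaining = list(range(len(theta)))
    grad_logp = [0.0] * len(theta)
    sequence = []
    while remaining:
        weights = [math.exp(theta[k]) for k in remaining]
        total = sum(weights)
        probs = [w / total for w in weights]
        choice = rng.choices(remaining, weights=probs)[0]
        # Gradient of log-softmax: indicator of the chosen item minus its probability.
        for k, p in zip(remaining, probs):
            grad_logp[k] -= p
        grad_logp[choice] += 1.0
        sequence.append(choice)
        remaining.remove(choice)
    return sequence, grad_logp


def reinforce_gradient(theta, cost_fn, rng):
    """One-sample REINFORCE estimate: terminal cost times the summed log-prob gradient."""
    sequence, grad_logp = sample_sequence(theta, rng)
    cost = cost_fn(sequence)              # available only after all T actions
    return [cost * g for g in grad_logp], cost, sequence


def toy_cost(sequence):
    """Hypothetical cost: total displacement from the identity ordering."""
    return float(sum(abs(gene - position) for position, gene in enumerate(sequence)))


rng = random.Random(0)                    # fixed seed, illustration only
theta = [0.0, 0.0, 0.0, 0.0]
gradient, cost, sequence = reinforce_gradient(theta, toy_cost, rng)
print(sequence, cost, gradient)
```

Every step in the sampled sequence is credited or blamed by the same scalar `cost`, regardless of which individual choices were good or bad, which is one way to see why single-sample estimates are noisy.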
The variation in the training process of REINFORCE is
commonly remedied by introducing a baseline [24], which can be
described as a gauge of the difficulty of the problem the model
is solving and usually leads to faster learning in the REINFORCE
model. In this work, although a baseline has been implemented,
the variation in training is still present. It might be further
reduced with techniques such as learning rate decay and the use
of critic networks [13], which will be part of our future work.
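The variance-reducing role of a baseline can be illustrated with a toy Python experiment. This is a sketch under strong simplifications: a one-step policy that picks uniformly among four actions, hypothetical action costs in `ACTION_COSTS`, and a constant baseline equal to the mean cost. None of these values come from the paper. Subtracting a baseline that does not depend on the action leaves the expected gradient unchanged while shrinking its spread.

```python
import random
import statistics

ACTION_COSTS = [100.0, 101.0, 102.0, 103.0]   # hypothetical costs, one per action


def gradient_samples(baseline, num_samples=2000, seed=0):
    """Single-sample estimates of d(expected cost)/d(theta_0) for a uniform softmax policy."""
    rng = random.Random(seed)
    p = 1.0 / len(ACTION_COSTS)               # all logits equal, so p(action) is uniform
    samples = []
    for _ in range(num_samples):
        action = rng.randrange(len(ACTION_COSTS))
        dlogp = (1.0 if action == 0 else 0.0) - p   # d log p(action) / d theta_0
        samples.append((ACTION_COSTS[action] - baseline) * dlogp)
    return samples


no_baseline = gradient_samples(baseline=0.0)
with_baseline = gradient_samples(baseline=statistics.mean(ACTION_COSTS))
print(statistics.pvariance(no_baseline), statistics.pvariance(with_baseline))
```

With the large common offset in the costs, the estimates without a baseline are dominated by that offset, whereas the baseline-corrected estimates depend only on how each action compares with the average.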
FIGURE 10: Cost vs. training epochs on (a) Small100, and (b)
Small500 sets.
5.2 Attention router performance
5.2.1 Performance comparison between attention
model and GA. Figure 11 shows the cost comparison between
the attention router and the genetic router for all problems
in the training and test sets of Small500 and Large500. In each
plot, the horizontal axis corresponds to problem indices sorted
in ascending order of the genetic router’s results. For problem
set Small500, the genetic router performs better than the
attention router on almost all problems, but the difference in
cost between the two routers on a given problem is rather small
(mostly within 100). For problem set Large500, which has
approximately ten times as many instTerms per problem as
Small500, the genetic router still performs better overall, but
the number of cases in which the attention router outperforms GA
is higher. Prior work [38] has argued that a genetic algorithm
applied to large-scale problems tends to exhibit high
computational cost accompanied by a degradation of solution
quality. This is primarily due to the increased complexity of the
problems: the number of possible sequences of n instTerm pairs is
O(n!), which makes increasingly larger problems intractable for
GA with limited computational cost.
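A short worked calculation in Python shows how quickly the sequence space outgrows the GA budget. The GA settings (10 generations, population 10) are those reported earlier in the paper. The values of n and the rough budget of G × P evaluated sequences are assumptions made here for illustration only.

```python
import math

GENERATIONS = 10   # GA settings reported in the text
POPULATION = 10


def search_space(num_pairs):
    """Number of distinct orderings of num_pairs instTerm pairs."""
    return math.factorial(num_pairs)


def ga_coverage(num_pairs):
    """Fraction of the search space reachable with roughly G * P evaluations."""
    budget = GENERATIONS * POPULATION
    return min(1.0, budget / search_space(num_pairs))


for n in (5, 10, 20):      # hypothetical problem sizes
    print(n, search_space(n), ga_coverage(n))
```

Already at n = 10 there are 3,628,800 orderings, so a budget of about 100 evaluated sequences covers well under 0.01% of the space, and the fraction shrinks factorially as n grows.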
Comparing the performance of the attention router on the
training and test sets (Fig. 11) shows that, in both problem
sets, its performance on the test sets is similar to that on the
training sets. This implies that the attention router can solve
previously unseen problems once it is trained on the training
set. This is due to the attention model’s ability to learn proper
strategies across various spatial structures of the instTerm
pairs. The multi-head attention (MHA) mechanism can be seen as a
message passing method that allows each instTerm pair to
communicate with all other instTerm pairs in the same problem
regarding their relative spatial information [13]. In this way,
an instTerm pair’s spatial configuration within a space shared by
other instTerm pairs can be continuously monitored and factored
in. This spatial information is then used to form the sequential
decisions for the final sequencing of the instTerm pairs.
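The message passing view of multi-head attention can be sketched with standard-library Python alone. This is a minimal illustration, not the paper's model: the projection matrices are random and untrained, the input features (terminal coordinates of each instTerm pair) are hypothetical, and everything else in the encoder is left out. Each row of the attention weights shows how much one pair "listens" to every other pair.

```python
import math
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.5) for _ in range(cols)] for _ in range(rows)]


def multi_head_attention(embeddings, num_heads, rng):
    """Scaled dot-product self-attention with untrained (random) projections."""
    n, d = len(embeddings), len(embeddings[0])
    d_head = d // num_heads
    head_outputs, head_weights = [], []
    for _ in range(num_heads):
        wq, wk, wv = (random_matrix(d, d_head, rng) for _ in range(3))
        q, k, v = matmul(embeddings, wq), matmul(embeddings, wk), matmul(embeddings, wv)
        scores = matmul(q, [list(col) for col in zip(*k)])       # q @ k^T, n x n
        weights = [softmax([s / math.sqrt(d_head) for s in row]) for row in scores]
        head_outputs.append(matmul(weights, v))                  # weighted messages
        head_weights.append(weights)
    concat = [sum((head[i] for head in head_outputs), []) for i in range(n)]
    w_out = random_matrix(num_heads * d_head, d, rng)            # output projection
    return matmul(concat, w_out), head_weights


rng = random.Random(0)   # fixed seed, illustration only
# Hypothetical features: (x1, y1, x2, y2) terminal coordinates of three instTerm pairs.
pairs = [[0.0, 0.0, 1.0, 0.0], [0.2, 0.1, 0.9, 0.3], [5.0, 5.0, 6.0, 5.5]]
updated, weights = multi_head_attention(pairs, num_heads=2, rng=rng)
print(weights[0])
```

Since every pair attends to every other pair regardless of how many pairs a problem contains, the same weights can be applied to problems of different sizes, which is consistent with the generalization behaviour discussed above.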
FIGURE 11: Cost comparison across different problems: (a)
Small500 training, (b) Small500 test, (c) Large500 training, (d)
Large500 test.
Figure 12 compares the cost vs. run time of the attention
router and the genetic router on problem sets Small500 and
Large500. The attention router’s results are shown as blue dots
and the genetic router’s results as red dots. In both plots,
while the cost range of the attention router’s results is
slightly higher than that of the genetic router, the run time of
the attention router is more than two orders of magnitude (100×)
shorter. For the Small500 problem set, the genetic router takes
more than 10 seconds to solve each problem, while the attention
router takes less than 0.1 seconds. For the Large500 problem set,
the genetic router takes close to 100 seconds per problem, while
the attention router takes only a little more than 0.1 seconds.
This significant increase in speed is due to the different
algorithmic structures of the attention model and GA. Once the
attention model has been trained off-line on the training set, it
is applied to new problems in a forward fashion, primarily
through matrix multiplications and without iterations. The
genetic router, on the other hand, solves each problem anew,
without the ability to learn from previously solved problems. The
significant run-time acceleration makes the attention router a
new alternative to the GA router, especially in the early stages
of IC design where placement decisions are yet to be made
upstream in the workflow. In such instances, the inner
optimization involving detailed routing can be significantly
accelerated by using the attention model to provide useful
guidance to the placement algorithm, leveraging the positive
correlations between results from the attention model and GA
(Fig. 13). However, as our results suggest, for the ultimate
detailed routing decisions, the genetic router currently provides
better quality solutions (Fig. 11).
FIGURE 12: Cost vs. run time on test sets for problems in (a)
Small500, (b) Large500. The same problems are connected with
gray lines.
Routability prediction is crucial in the placement step of
IC physical design [39, 40]. To achieve a successful chip design,
the placement step always needs to assess quickly and accurately
whether a good routing solution exists for a given placement
solution. Existing routability prediction algorithms [39, 40, 41,
42] have mainly focused on the global routing stage. Even those
that take the detailed routing stage into account [39] use
supervised learning, which makes them dependent on other routers
to provide labelled data. The attention router in this research
provides a promising way to perform routability prediction by
leveraging the positive correlation of its results with those of
the genetic router, which is a feasible option for providing high
quality detailed routing solutions. Another advantage is that,
since the attention router uses RL, it does not rely on
supervised learning, which requires labelled training data. Yet,
in order to assess the accuracy of routability prediction with
the attention router, the model needs to be tested on more
problems across different problem sets, which will be part of our
future work.
FIGURE 13: Cost comparison between attention model and GA
in (a) Small500, (b) Large500.
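One simple way to quantify such a positive correlation is a rank correlation between the costs of the two routers over a set of placements. The Python sketch below computes Spearman's coefficient from scratch. The cost values are made up purely for illustration and are not results from the paper, and the helper names are hypothetical.

```python
import math


def ranks(values):
    """1-based ranks, with ties receiving the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result


def pearson(x, y):
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Made-up costs for five hypothetical placements (not data from the paper).
attention_cost = [120.0, 340.0, 560.0, 410.0, 980.0]
ga_cost = [100.0, 300.0, 450.0, 500.0, 900.0]
print(spearman(attention_cost, ga_cost))   # 0.9 for these made-up values
```

A coefficient close to 1 would mean that ranking placements by the fast attention router's cost orders them almost as the slower genetic router would, which is the property a placement-stage predictor needs.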
5.2.2 Effect of Training Sample Numbers on the
Performance of Attention Model. To investigate the effect
of the number of training samples on the performance of the
attention model, the model is trained on the problem set
Small5000. Fig. 14 shows the performance of the trained model on
the training and test sets. Compared to the model trained on
Small500 (Fig. 11a and b), the solutions produced by the
attention router improve significantly, approaching the genetic
router’s solutions. A study of the sample efficiency of the
attention router remains part of our future work.
FIGURE 14: Cost comparison with increased number of training
problems: (a) Small5000 training, (b) Small5000 test.
5.2.3 Final Routes. The routing results of the attention
router and the genetic router on a randomly chosen problem from
the Large500 problem set are shown in Fig. 15. Black dots and
bars correspond to instTerms, and the colored lines represent the
actual routes. As seen, the solutions of the two routers share
some similarities. For instance, high congestion regions (shown
in red circles indicating densely configured routes) are located
in similar regions of the physical space. The magnified views of
the same area for the attention router (c,d) and the genetic
router (e,f) also show similar route patterns and similar
unconnected instTerms. This type of similarity suggests that the
attention router can be used by upstream modules to rapidly
predict congestion regions as well as the instTerms that may
remain open in the detailed routing stages.
FIGURE 15: Final routes: (a) Attention model and (b) GA on
a problem in Large500 problem set; magnified routes on same
regions of (c,d) attention model solution and (e,f) GA solution.
6 CONCLUSIONS
We present a new approach to track-assignment detailed
routing using RL. A detailed routing pipeline, which we call the
attention router, is developed to take complex design rule
constraints into account, and the attention-model-based REINFORCE
algorithm is applied to order the instTerm pairs. The attention
router and a baseline genetic router are tested on different
analog circuit problem sets based on commercial advanced
technologies. The attention router demonstrates a generalization
ability to unseen problems from the same problem set after
appropriate training. While the genetic router can produce
slightly better quality solutions, the attention router achieves
more than 100× acceleration compared to the genetic router
without a severe degradation of the routing solution. Increasing
the number of training problems also greatly improves the
performance of the attention router on both training and test
sets. Positive correlations in terms of cost are also found
between the attention router and the genetic router, which opens
the possibility of applying the attention router as a routability
predictor in the placement stage. Similarities in the routing
solution patterns (congestion regions and locations of
disconnected instTerms) are also observed, which demonstrate the
attention router’s ability to work as a more fine-grained
congestion predictor and a predictor of disconnected instTerm
locations in detailed routing. Future work includes analyzing the
correlation between the attention router and the genetic router
in terms of cost and congestion. The sample efficiency of the
model will also be studied to provide guidance on training set
size when solving problems of different sizes.
7 ACKNOWLEDGEMENTS
This work is funded by the DARPA IDEA program
(HR0011-18-3-0010; Funder ID: 10.13039/100006502). The au-
thors would like to thank Prof. Barnabas Poczos for his useful
feedback.
REFERENCES
[1] Schaller, R. R., 1997. “Moore’s law: past, present and fu-
ture”. IEEE spectrum, 34(6), pp. 52–59.
[2] Sherwani, N. A., 2012. Algorithms for VLSI physical de-
sign automation. Springer Science & Business Media.
[3] Hu, J., and Sapatnekar, S. S., 2001. “A survey on multi-net
global routing for integrated circuits”. Integration, 31(1),
pp. 1–49.
[4] Mo, F., Tabbara, A., and Brayton, R. K., 2001. “A
force-directed maze router”. In IEEE/ACM Interna-
tional Conference on Computer Aided Design. ICCAD
2001. IEEE/ACM Digest of Technical Papers (Cat. No.
01CH37281), IEEE, pp. 404–407.
[5] Cong, J., Kahng, A. B., Robins, G., Sarrafzadeh, M., and
Wong, C.-K., 1992. “Provably good performance-driven
global routing”. IEEE Transactions on Computer-Aided
Design of Integrated Circuits and Systems, 11(6), pp. 739–
752.
[6] Soukup, J., 1978. “Fast maze router”. In Proceedings of the
15th Design Automation Conference, IEEE Press, pp. 100–
102.
[7] Chang, Y.-J., Lee, Y.-T., and Wang, T.-C., 2008. “Nthu-
route 2.0: a fast and stable global router”. In 2008
IEEE/ACM International Conference on Computer-Aided
Design, IEEE, pp. 338–343.
[8] Li, W., and Banerji, D. K., 1999. “Routability prediction
for hierarchical fpgas”. In Proceedings Ninth Great Lakes
Symposium on VLSI, IEEE, pp. 256–259.
[9] Pui, C.-W., Chen, G., Ma, Y., Young, E. F., and Yu, B.,
2017. “Clock-aware ultrascale fpga placement with ma-
chine learning routability prediction”. In 2017 IEEE/ACM
International Conference on Computer-Aided Design (IC-
CAD), IEEE, pp. 929–936.
[10] Qi, Z., Cai, Y., and Zhou, Q., 2014. “Accurate prediction
of detailed routing congestion using supervised data learn-
ing”. In 2014 IEEE 32nd International Conference on Com-
puter Design (ICCD), IEEE, pp. 97–103.
[11] Tabrizi, A. F., Rakai, L., Darav, N. K., Bustany, I., Behjat,
L., Xu, S., and Kennings, A., 2018. “A machine learning
framework to identify detailed routing short violations from
a placed netlist”. In 2018 55th ACM/ESDA/IEEE Design
Automation Conference (DAC), IEEE, pp. 1–6.
[12] Liao, H., Zhang, W., Dong, X., Poczos, B., Shimada, K.,
and Burak Kara, L., 2020. “A deep reinforcement learn-
ing approach for global routing”. Journal of Mechanical
Design, 142(6).
[13] Kool, W., Van Hoof, H., and Welling, M., 2018. “At-
tention, learn to solve routing problems!”. arXiv preprint
arXiv:1803.08475.
[14] Hetzel, A., 1998. “A sequential detailed router for huge
grid graphs”. In Proceedings Design, Automation and Test
in Europe, IEEE, pp. 332–338.
[15] Chen, H.-Y., and Chang, Y.-W., 2009. “Global and de-
tailed routing”. In Electronic Design Automation. Elsevier,
pp. 687–749.
[16] Sriram, M., Kang, S.-M., et al., 1992. “Detailed layer
assignment for mcm routing”. In International Confer-
ence on Computer Aided Design: Proceedings of the 1992
IEEE/ACM international conference on Computer-aided
design, Vol. 1992, pp. 386–389.
[17] Zhou, H., and Wong, D., 1999. “Global routing with
crosstalk constraints”. IEEE Transactions on computer-
aided design of integrated circuits and systems, 18(11),
pp. 1683–1688.
[18] Batterywala, S., Shenoy, N., Nicholls, W., and Zhou, H.,
2002. “Track assignment: A desirable intermediate step
between global routing and detailed routing”. In Proceed-
ings of the 2002 IEEE/ACM international conference on
Computer-aided design, pp. 59–66.
[19] Wu, D., Hu, J., Mahapatra, R., and Zhao, M., 2004. “Layer
assignment for crosstalk risk minimization”. In ASP-DAC
2004: Asia and South Pacific Design Automation Confer-
ence 2004 (IEEE Cat. No. 04EX753), IEEE, pp. 159–162.
[20] Liu, X., Zhang, Y., Yeap, G. K., Chu, C., Sun, J., and Zeng,
X., 2010. “Global routing and track assignment for flip-
chip designs”. In Proceedings of the 47th Design Automa-
tion Conference, pp. 90–93.
[21] Vinyals, O., Fortunato, M., and Jaitly, N., 2015. “Pointer
networks”. In Advances in neural information processing
systems, pp. 2692–2700.
[22] Bello, I., Pham, H., Le, Q. V., Norouzi, M., and Bengio, S.,
2016. “Neural combinatorial optimization with reinforce-
ment learning”. arXiv preprint arXiv:1611.09940.
[23] Nazari, M., Oroojlooy, A., Snyder, L., and Takáč, M., 2018.
“Reinforcement learning for solving the vehicle routing
problem”. In Advances in Neural Information Processing
Systems, pp. 9839–9849.
[24] Sutton, R. S., and Barto, A. G., 2018. Reinforcement learn-
ing: An introduction. MIT press.
[25] Veličković, P., Cucurull, G., Casanova, A., Romero, A.,
Lio, P., and Bengio, Y., 2017. “Graph attention networks”.
arXiv preprint arXiv:1710.10903.
[26] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I., 2017. “At-
tention is all you need”. In Advances in neural information
processing systems, pp. 5998–6008.
[27] He, K., Zhang, X., Ren, S., and Sun, J., 2016. “Deep resid-
ual learning for image recognition”. In Proceedings of the
IEEE conference on computer vision and pattern recogni-
tion, pp. 770–778.
[28] Ioffe, S., and Szegedy, C., 2015. “Batch normalization:
Accelerating deep network training by reducing internal co-
variate shift”. arXiv preprint arXiv:1502.03167.
[29] Whitley, D., 1994. “A genetic algorithm tutorial”. Statistics
and computing, 4(2), pp. 65–85.
[30] Jing, T., Lim, M. H., and Ong, Y. S., 2003. “A parallel hy-
brid ga for combinatorial optimization using grid technol-
ogy”. In The 2003 Congress on Evolutionary Computation,
2003. CEC’03., Vol. 3, IEEE, pp. 1895–1902.
[31] Mühlenbein, H., 1992. “Parallel genetic algorithms in
combinatorial optimization”. In Computer science and operations
research. Elsevier, pp. 441–453.
[32] Lienig, J., and Thulasiraman, K., 1993. “A genetic algo-
rithm for channel routing in vlsi circuits”. Evolutionary
Computation, 1(4), pp. 293–311.
[33] Lienig, J., 1996. “A parallel genetic algorithm for two
detailed routing problems”. In 1996 IEEE International
Symposium on Circuits and Systems. Circuits and Systems
Connecting the World. ISCAS 96, Vol. 4, IEEE, pp. 508–
511.
[34] Esbensen, H., 1994. “A macro-cell global router based on
two genetic algorithms”. In European Design Automation
Conference: Proceedings of the conference on European
design automation, Vol. 19, pp. 428–433.
[35] Karp, R. M., Vazirani, U. V., and Vazirani, V. V., 1990.
“An optimal algorithm for on-line bipartite matching”. In
Proceedings of the twenty-second annual ACM symposium
on Theory of computing, pp. 352–358.
[36] Kruskal, J. B., 1956. “On the shortest spanning subtree of
a graph and the traveling salesman problem”. Proceedings
of the American Mathematical society, 7(1), pp. 48–50.
[37] Kastner, R., Bozorgzadeh, E., and Sarrafzadeh, M., 2002.
“Pattern routing: use and theory for increasing pre-
dictability and avoiding coupling”. IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Sys-
tems, 21(7), pp. 777–790.
[38] Rivera, W., 2001. “Scalable parallel genetic algorithms”.
Artificial intelligence review, 16(2), pp. 153–168.
[39] Zhou, Q., Wang, X., Qi, Z., Chen, Z., Zhou, Q., and Cai,
Y., 2015. “An accurate detailed routing routability predic-
tion model in placement”. In 2015 6th Asia Symposium on
Quality Electronic Design (ASQED), IEEE, pp. 119–122.
[40] Chan, P. K., Schlag, M. D., and Zien, J. Y., 1993. “On
routability prediction for field-programmable gate arrays”.
In Proceedings of the 30th international Design Automation
Conference, pp. 326–330.
[41] Xie, Z., Huang, Y.-H., Fang, G.-Q., Ren, H., Fang, S.-Y.,
Chen, Y., and Hu, J., 2018. “Routenet: Routability pre-
diction for mixed-size designs using convolutional neural
network”. In 2018 IEEE/ACM International Conference
on Computer-Aided Design (ICCAD), IEEE, pp. 1–8.
[42] Brown, S. D., Rose, J., and Vranesic, Z. G., 1993.
“A stochastic model to predict the routability of field-
programmable gate arrays”. IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Sys-
tems, 12(12), pp. 1827–1838.
