Timing-Constrained Global Routing with Buffered Steiner Trees by Rotter, Daniel
Timing-Constrained Global Routing
with Buffered Steiner Trees
Dissertation
submitted for the degree of Doctor of Natural Sciences (Dr. rer. nat.)
to the Faculty of Mathematics and Natural Sciences
of the Rheinische Friedrich-Wilhelms-Universität Bonn
by
Daniel Rotter
from
Siegburg
Bonn, March 15, 2017
Prepared with the approval of the Faculty of Mathematics and Natural Sciences of the
Rheinische Friedrich-Wilhelms-Universität Bonn
First referee: Prof. Dr. Jens Vygen
Second referee: Prof. Dr. Stephan Held
Date of the doctoral examination: June 2, 2017
Year of publication: 2017
Acknowledgements
This thesis would not have been possible without the support of many people.
First, I would like to thank my supervisors Prof. Dr. Jens Vygen and Prof. Dr. Stephan
Held for their guidance and helpful discussions. Already during my Bachelor studies, Jens
attracted my interest to Discrete Mathematics and accompanied me on the long journey
through my Bachelor, Master, and PhD studies. Without Stephan’s supervision I would
have got stuck inside the IBM-jungle too many times. I would also like to thank Prof. Dr.
Bernhard Korte for creating excellent working conditions at this institute.
Special thanks go to my former and present colleagues for helpful discussions and
productive collaboration on several topics such as timing optimization, global routing,
resource sharing, and pangea. Especially, I would like to thank Siad Daboul, Dr. Nicolai
Hähnle, Dr. Dirk Müller, Pietro Saccardi, Rudolf Scheifele, and Dr. Ulrike Schorr.
I am thankful to many people at IBM for sharing their experience with me. In particular,
I would like to thank Alexander J Suess for patiently answering questions about many
timing related topics, Nancy Y Zhou and Steven Quay for helping me to run BuffOpt
here in Bonn, and Harald Folberth and Friedrich Schröder for the close and friendly
collaboration with pangea.
I am grateful to Dr. Ulrich Brenner, Siad Daboul, Prof. Dr. Stephan Held, Dr. Dirk
Müller, Dr. Ulrike Schorr, and Prof. Dr. Jens Vygen for proofreading this thesis or parts of
it, and for valuable remarks and comments.
I would like to emphasize the importance of a good and friendly working atmosphere
beyond everyday office life. Among many others I would like to thank Anna Borutzky, Dr.
Ulrich Brenner, Siad Daboul, Dr. Nicolai Hähnle, Friederike Michaelis, Dr. Dirk Müller,
Pietro Saccardi, Dr. Tomás Silveira Salles, Dr. Jan Schneider, Dr. Ulrike Schorr, and
Kristina Stellwag for on- and off-season barbecues, business trips that were both productive
and fun, celebrating carnival with me, for cheering me up during the most exhausting periods
of my PhD studies, and for all occasions that required ordering pizza to the conference
room.
My biggest thanks go to my parents Gisela and Bernd Rotter for their great assistance
during my whole life and to my girlfriend Eva Börgens for her lovely support and endless
patience while I was writing this thesis.

Contents
1 Introduction
2 Basic Concepts of Chip Design
2.1 Steiner Trees
2.2 Topologies
2.3 The Structure of a Computer Chip
2.4 Packing of Wires
2.4.1 Routing Layers and Wire Codes
2.4.2 Global and Detailed Routing
2.4.3 Global Routing Graph
2.5 Timing Optimization
2.5.1 Signals
2.5.2 Standard Timing Graph
2.5.3 Static Timing Analysis
2.5.4 Electrical Properties
2.5.5 Buffering
2.5.6 Power Consumption
3 Global Routing from a Timing Point of View
3.1 Timing: An Essential Objective for Global Routing
3.2 Min-Max Resource Sharing
3.2.1 Resources and Customers: An Abstract View on Global Routing
3.2.2 Algorithms for Min-Max Resource Sharing
3.3 Modeling Timing
3.3.1 The Timing Graph
3.3.2 New Resources and Customers
3.4 Lower and Upper Bounds on Arrival Times
3.4.1 Arrival Time Intervals Based on Lower Delay Bounds
3.4.2 Shrinking Arrival Time Intervals with Upper Delay Bounds
3.4.3 Further Reduction of Arrival Time Intervals
3.5 Properties of Low-Congestion Solutions
3.6 Block Solvers
3.6.1 A Simple but Unstable Block Solver for Arrival Time Customers
3.6.2 Stabilizing Arrival Time Computation by Iteration
3.6.3 Stabilizing Arrival Time Computation with Newton’s Method
3.7 Overall Algorithm
3.8 Obtaining Integral Solutions
4 Buffering-and-Routing Oracles
4.1 Minimum Cost Buffered Steiner Trees
4.1.1 Buffer Space Resources
4.2 Delay Models
4.2.1 The Elmore Delay Model
4.2.2 Generalized Non-Linear Delay Model
4.3 The Minimum Cost Buffered Steiner Tree Problem
4.3.1 Hardness Results
4.3.2 Existing Algorithms for Special Cases
4.4 Cost-Delay Minimum Steiner Tree Problem with Loops
4.4.1 Shortening the Model: Eliminating Pin Properties
4.4.2 Shortening the Model: Representing Repeaters by Loops
4.4.3 Problem Formulation
4.4.4 Necessity of Conservative Edge Costs
4.5 Cost-Delay Minimum Steiner Path Problems
4.5.1 NP-Hardness
4.5.2 A Fully Polynomial Time Approximation Scheme
4.5.3 Unbuffered Non-Linear Steiner Paths
4.6 Cost-Delay Minimum Steiner Trees with Fixed Topologies
4.6.1 Preparation for the Proof of Theorem 4.12
4.6.2 Proof of Theorem 4.12
4.6.3 Topology Embeddings in Graphs without Loops
4.7 Electrical and Polarity Constraints
5 Topology Generation
5.1 Placed Topologies
5.1.1 Properties of Placed Topologies
5.1.2 Contradicting Objectives
5.1.3 Delay-Minimum Placed Topologies
5.2 Nonapproximability
5.3 Bicriteria Approximation
5.3.1 Previous Work
5.3.2 A Bicriteria-Approximation Algorithm
5.4 Shallow-Light Topologies with Criticalities
5.5 Topology Optimization
5.5.1 Placement of Steiner Points in Delay-Optimum Solutions
5.5.2 Changing Component Layout and Steiner Point Positions
5.5.3 Optimization with Greedy
5.6 Experimental Results
5.6.1 Layout of Tables
5.6.2 The Impact of Optimization
5.6.3 Comparison between Bicriteria and Bounds for Length and Slack
5.6.4 Comparison between Bicriteria and Greedy
6 On the Way to a Practical Algorithm: Virtual Buffering
6.1 A Linear Delay Model for Steiner Trees
6.2 Shortest Paths and Optimum Topology Embeddings
6.3 Speed-up Techniques for Practical Instances
6.3.1 Reducing Running Time by Limiting Search Areas
6.3.2 Future Costs
6.4 Reach-Aware
6.4.1 Reach-Aware 2-Dimensional Steiner Trees
6.4.2 Reach-Awareness by Restricting the Routing Area
6.5 Experimental Results
6.6 Port and Assertion Generation
6.6.1 Hierarchical Design Flows
6.6.2 Abutted Hierarchy and Port Assignment
6.6.3 Assertion Generation
7 Buffering a Given Steiner Tree
7.1 The Minimum Cost Steiner Tree Buffering Problem
7.1.1 Connecting the Detailed Pin Shapes
7.1.2 Steiner Tree Transformations
7.1.3 Elmore Delay Model with Slew Propagation
7.1.4 Problem Formulation
7.2 Previous Work
7.2.1 Buffering by Dynamic Programming
7.2.2 The Fast Buffering Algorithm
7.3 An Algorithm for Cost-Based Buffering
7.3.1 Candidates and Candidate Pairs
7.3.2 Dominance
7.3.3 Infeasible Repeater Positions
7.3.4 Overview of the Algorithm
7.3.5 Pre-Processing
7.3.6 The Move Step
7.3.7 Speed-up Techniques for the Move Step in the Fast Version
7.3.8 The Merge Step
7.3.9 Choosing a Final Solution
8 BonnRouteBuffer: A Tool for Global Buffering
8.1 Overview
8.1.1 BonnRouteBuffer as Part of BonnRouteGlobal
8.1.2 Block Solver for Arrival Time Customers
8.1.3 Block Solver for Net Customers
8.1.4 Slew Updates
8.2 Layer and Wire Code Assignments
8.3 Experimental Results
8.3.1 Comparison with the IBM Physical Design Flow
8.4 Conclusions and Future Work
Bibliography
Chapter 1
Introduction
Placement, routing, timing optimization – these are probably the most important steps of
any physical chip design flow, i. e. in a simplified way, placing the circuits, connecting the
pins, and optimizing the critical paths such that all signals arrive in time.
Several years ago, these steps were often solved separately, but this approach
seems to be insufficient for modern technologies. As electronic devices have become
smaller, faster, and more powerful, computer chips have had to keep up with this development.
Nowadays, the number of critical signals is overwhelmingly large while the available routing
space is limited: roughly 1 km of wire needs to be packed on a chip that is not larger than
1 cm².
“Optimizing the critical paths” very likely results in re-computing solutions for major
parts of the chip if timing constraints are ignored during placement and routing.
The need for keeping an eye on timing during routing can be shown easily. Assume
that the pins in Figure 1.1 have to be connected by horizontal and vertical wires. From a
routing point of view, both solutions, 1.1(a) and 1.1(b), seem to be equally good as both
trees are shortest possible. If we assume that the green pin s is a source pin from which
we want to send a signal to t1, t2, and t3, we might prefer Solution 1.1(b) since here, the
paths from s to all sinks are shortest possible as well.
Figure 1.1: Two shortest connections of the pins consisting of horizontal and vertical wires only.
(a) A shortest connection. A signal starting at s arrives late at t3. (b) Another shortest connection.
Without taking timing constraints into account we might choose the left solution in which the
signal starting at s arrives late at t3.
This is of course a simplified example. In practice, things are much more complicated.
Instead of dealing with one instance only, we have to connect millions of sets of pins (called
nets) which compete for the same routing space containing
• routing space on low layers allowing a dense packing of wires but on which signal
delays are large, and
• routing space on high layers that yield space for only a small fraction of wires but
that allow a fast signal transportation.
If the routing space needed by Solution 1.1(b) is used by other trees, we have to decide
whether we choose a different solution (like Solution 1.1(a)) and take a detour or an
unfavorable choice of layers into account, or if we re-route other nets in order to free the
space needed by Solution 1.1(b).
Another difference between the simple example and practice is the estimate of the time
signals need to traverse a connection. In practice the delay along a wire grows (roughly)
quadratically with its length and replacing long wires by several shorter wires can result in
overall better timing. This is done by a step called buffering in which additional gates with
one input and one output are inserted such that the computed logical function remains
unchanged.
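The effect of buffering on a quadratically growing wire delay can be illustrated with a small calculation. The following Python sketch assumes a deliberately simplified model that is not the delay model used later in this thesis: a wire of length ℓ has delay a·ℓ², and each inserted buffer adds a fixed delay. The function names and all constants are hypothetical.

```python
def wire_delay(length, a=1.0):
    """Delay of an unbuffered wire; grows quadratically with its length (assumed model)."""
    return a * length ** 2


def buffered_delay(length, num_buffers, a=1.0, buffer_delay=2.0):
    """Delay of a wire that buffers split into num_buffers + 1 equal segments.

    Each segment contributes a quadratic wire delay and each buffer a fixed
    delay; both constants are illustrative assumptions.
    """
    segments = num_buffers + 1
    return segments * wire_delay(length / segments, a) + num_buffers * buffer_delay


def best_buffer_count(length, max_buffers=20, a=1.0, buffer_delay=2.0):
    """Number of equally spaced buffers that minimizes the total delay."""
    return min(range(max_buffers + 1),
               key=lambda k: buffered_delay(length, k, a, buffer_delay))


if __name__ == "__main__":
    for k in range(4):
        print(k, buffered_delay(10.0, k))
    print("best number of buffers:", best_buffer_count(10.0))
```

In this toy model the total wire delay shrinks roughly like 1/(k + 1) with k buffers, while the buffer delay grows linearly in k, so there is a finite best number of buffers.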
Assume that in our example the horizontal wires need to be buffered. The yellow areas
in Figure 1.2 depict the regions where we can insert a buffer. In all other regions, the
available placement space is not sufficient.
Figure 1.2: Available buffer space influences both routing and buffering of a net. The horizontal
wires need to be buffered but buffers can only be inserted in the yellow areas.
(a) An optimally buffered connection if there is placement space everywhere.
(b) Restricted placement space completely changes wiring and buffering of an optimum solution.
Here, we observe how the available placement space influences routing and buffering of
the depicted net. Depending on placement constraints, Solution 1.2(b) can be optimum
even if the path length from s to t3 is large.
Although research is still far away from solving (most of) the tasks arising during
physical design by one algorithm that takes all objectives into account, successful design
flows nowadays have to combine tasks that have been solved separately in the past.
In this thesis we show how to solve the two central problems buffering and global
routing by one single tool taking into account timing constraints, placement congestion, and
routing congestion. We investigate both theoretical and practical aspects of the problem.
In Chapter 2 we introduce the basic concepts of chip design needed in this thesis. On
the routing side we will focus on global routing. During global routing, Steiner trees are
computed in a coarse graph to quickly estimate global routability of a chip. On the timing
side, we will concentrate on the buffering problem. In order to understand and solve that
problem, we need knowledge about electrical properties of transistors and wires, and static
timing analysis. Without re-defining it we use notations concerning the theory of graphs
and concepts in combinatorial optimization from the book by Korte and Vygen [KV12]. As
the only exception to this we give a non-standard definition of a Steiner tree in Section 2.1.
In Chapter 3 we show how to model timing inside the Min-Max Resource Sharing
Problem. The resource sharing approach has already proved to be effective for the Standard
Global Routing Problem in theory and practice by Müller, Radke, Vygen [MRV11]. We
extend their model and incorporate timing constraints. As the most important sub-task we
have to compute a solution for a single net. The resource sharing framework can be applied
to various phases of optimization and to various delay models.
For the concrete goal of solving the combination of buffering and global routing, we
have to compute a buffered Steiner tree in the global routing graph minimizing a sum
of costs for timing, wiring congestion, placement congestion, and other metrics such as
net length and power consumption. We show in Chapter 4 that this problem is already
NP-hard for simple delay models that ignore slew effects, such as the Elmore delay model,
and give a fully polynomial time approximation scheme for instances with a constant number
of pins. The key idea to obtain such an algorithm is to enumerate all possible routing
topologies for a net and to compute a buffered embedding of each of them.
Already for medium-sized instances, a complete enumeration of all topologies is far
beyond practicability. As an alternative we develop in Chapter 5 fast algorithms to compute
one topology with provably good properties concerning timing and net length.
In contrast to the theoretical algorithms of Chapter 4 we concentrate on practical
algorithms in Chapters 6 and 7. We divide the complex task of computing a buffered
Steiner tree into the problem of computing a timing- and routing-congestion-aware Steiner
tree (Chapter 6) and the task of buffering a given Steiner tree taking into account costs for
routing congestion, placement congestion, timing, and electrical violations (Chapter 7).
Combining the results of Chapters 3 to 7 yields the tool BonnRouteBuffer that
is part of the BonnTools suite of physical design optimization tools developed at the
Research Institute for Discrete Mathematics, University of Bonn. We show results on
real-world VLSI instances provided by our industrial cooperation partner IBM in Chapter 8.
As a byproduct of our main tool for global routing with buffered Steiner trees we obtain
an algorithm for timing- and congestion-aware port assignment and assertion generation
that can be used in early parts of physical design (Chapter 6).

Chapter 2
Basic Concepts of Chip Design
Life in the 21st century is certainly marked by the technology we are surrounded with.
While 20 years ago the property of containing a computer chip was (at least almost)
reserved for computers, surprisingly many devices are equipped with great computing
power nowadays. TVs, mobile phones, watches, and even ordinary domestic appliances
like refrigerators and heating systems may contain computer chips superior to the largest
supercomputers built in the 1980s.
One of the major requirements to make such a development possible was the produc-
tion of smaller and faster computer chips. The process of creating these modern chips
consisting of several billions of transistors is called chip design, or VLSI (=very large
scale integration) design, and is probably one of the most fascinating applications of
mathematical optimization. Due to its enormous complexity, a detailed introduction to
VLSI design would certainly go beyond the scope of this thesis. Instead, we focus on the
following aspects in this chapter:
1. In Section 2.3 we briefly explain how a chip is structured and introduce some notation
to denote the parts of a chip.
2. To produce smaller chips we have to pack many wires on a small area. In Section 2.4
we will see how this is accomplished in today’s computer chips. The core problem
that arises thereby is the Standard Global Routing Problem that we also formulate.
3. To produce faster chips we have to reduce signal delays through wires and the
building blocks of a chip. Many different problems have to be solved to fulfill that task.
After a short introduction to the basic concepts of signal propagation in Section 2.5.1
we consider one of these problems, the Buffering Problem, in detail in Section 2.5.5.
In addition to the chip design related topics mentioned above we will set up some
mathematical notation. In large parts of this thesis we will build tree-like structures with
certain properties like small length, small delays, and small costs. In Section 2.1 we give a
slightly unusual but very general and flexible definition for Steiner trees. An important
property of a Steiner tree is its topology which we define in Section 2.2.
For all other notation related to graph theory we refer to the book of Korte and
Vygen [KV12]. In the whole thesis we denote the logarithm with base 2 by log and the
natural logarithm by ln. Logarithms with a base x ≠ 2, e are denoted by log_x.
2.1 Steiner Trees
The most important structure of this thesis is the Steiner tree:
Definition 2.1 (Steiner tree) Let G be a directed or undirected graph and N ⊆ V(G).
Let s ∈ N be a special vertex (the source of N). A Steiner tree for N is a pair (A, κ)
such that A is an arborescence rooted at s with N ⊆ V(A) in which the set
$\{\nu \in V(A) : \delta^+_A(\nu) = \emptyset\}$ of leaves of A is equal to N \ {s} and κ is a function
\[ \kappa : V(A) \cup E(A) \to V(G) \cup (E(G) \cup \{\circ\}) \]
such that
• κ(ν) ∈ V(G) for all ν ∈ V(A),
• κ(ζ) ∈ E(G) ∪ {◦} for all ζ ∈ E(A),
• for each (ν, ω) ∈ E(A) either κ((ν, ω)) = ◦ and κ(ν) = κ(ω) or
\[ \kappa((\nu, \omega)) = \begin{cases} (\kappa(\nu), \kappa(\omega)) & \text{if } G \text{ is directed,} \\ \{\kappa(\nu), \kappa(\omega)\} & \text{if } G \text{ is undirected,} \end{cases} \]
• κ(v) = v for v ∈ N.
If N = {s, t}, we call a Steiner tree for N an s-t Steiner path. The vertices in V (A)\N
are called Steiner vertices or Steiner nodes.
This definition is somewhat unusual as in the literature a Steiner tree is usually defined
as a connected and acyclic subgraph of G containing N.
Let c : E(G) ∪ {◦} → ℝ≥0 with c(◦) = 0. If we are interested in finding Steiner trees
(A, κ) minimizing
\[ \sum_{\zeta \in E(A)} c(\kappa(\zeta)), \]
an optimum solution is always attained by (A, κ) such that |κ⁻¹(e)| ≤ 1 for e ∈ E(G). In
this case we can identify A with the image of the restriction of κ to the set {ζ ∈ E(A) :
κ(ζ) ≠ ◦} which is indeed a directed and acyclic subgraph of G containing N. We obtain
the usual definition of a Steiner tree.
In this thesis we will mainly be interested in computing Steiner trees minimizing more
complex cost functions including costs for delays. Restricting to solutions for which κ is
injective would be too severe. Although we do not identify the complete Steiner tree with
its image we identify N ⊆ V (A) with its image in G under κ.
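To make Definition 2.1 concrete, the following Python sketch checks its conditions for an undirected graph G and evaluates the cost ∑ c(κ(ζ)). The data layout is an assumption made only for this illustration: edges of G are `frozenset` pairs, A is a list of arcs, κ is a dictionary, and the constant `LOOP` stands for the symbol ◦. All names are hypothetical.

```python
LOOP = "loop"  # plays the role of the symbol ◦ in Definition 2.1


def is_steiner_tree(g_vertices, g_edges, net, source, a_arcs, kappa):
    """Check the conditions of Definition 2.1 for an undirected graph G."""
    nodes = {source} | {x for arc in a_arcs for x in arc}
    children = {x: [] for x in nodes}
    parent = {}
    for nu, omega in a_arcs:
        if omega in parent:  # a second entering arc: A is not an arborescence
            return False
        parent[omega] = nu
        children[nu].append(omega)
    if source in parent:  # the root must not have an entering arc
        return False
    # Every vertex of A has to be reachable from the root s.
    reached, stack = {source}, [source]
    while stack:
        for child in children[stack.pop()]:
            reached.add(child)
            stack.append(child)
    if reached != nodes:
        return False
    # The leaves of A are exactly the elements of N \ {s}.
    if {x for x in nodes if not children[x]} != set(net) - {source}:
        return False
    # kappa maps vertices of A to vertices of G and fixes the elements of N.
    if any(kappa.get(x) not in g_vertices for x in nodes):
        return False
    if any(kappa[v] != v for v in net):
        return False
    # Each arc is mapped to the loop symbol or to the edge between the images.
    for nu, omega in a_arcs:
        image = kappa.get((nu, omega))
        if image == LOOP:
            if kappa[nu] != kappa[omega]:
                return False
        elif image != frozenset((kappa[nu], kappa[omega])) or image not in g_edges:
            return False
    return True


def steiner_tree_cost(a_arcs, kappa, cost):
    """Sum of c(kappa(zeta)) over all arcs zeta of A, with c(LOOP) = 0."""
    return sum(0 if kappa[arc] == LOOP else cost[kappa[arc]] for arc in a_arcs)
```

Because κ is not required to be injective, two arcs of A may be mapped to the same edge of G; in that case `steiner_tree_cost` counts the edge twice.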
It is easy to see that a Steiner tree for N with source s in G exists if and only if all
t ∈ N are reachable from s in G. This condition can be checked in linear time by breadth
first search or depth first search and is usually satisfied in our practical application.
Henceforth we will assume that Steiner trees for our instances exist without explicitly
mentioning it.
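The existence test mentioned above can be sketched with a breadth-first search. The adjacency-dictionary representation and the function name are assumptions made for this example.

```python
from collections import deque


def steiner_tree_exists(adjacency, net, source):
    """Return True iff every vertex of the net is reachable from the source.

    adjacency maps each vertex of G to an iterable of its out-neighbours
    (for an undirected graph, list each edge in both directions).
    """
    reached = {source}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adjacency.get(v, ()):
            if w not in reached:
                reached.add(w)
                queue.append(w)
    return all(t in reached for t in net)
```

Each vertex enters the queue at most once and each adjacency list is scanned at most once, which gives the linear running time.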
Sometimes we are interested in rectilinear Steiner trees rather than in Steiner trees in
graphs. Using Definition 2.1 we can define a rectilinear Steiner tree as follows:
Definition 2.2 (Rectilinear Steiner tree) Let n ∈ ℕ and let N ⊂ ℝⁿ be a finite set
containing an element s ∈ N. An n-dimensional rectilinear Steiner tree for N is a
Steiner tree (A, κ) for N in the infinite and undirected graph
\[ \bigl(\mathbb{R}^n, \{\{p, q\} : p, q \in \mathbb{R}^n,\ p \neq q\}\bigr) \]
such that κ(ν) and κ(ω) differ in at most one coordinate for all (ν, ω) ∈ E(A).
Figure 2.1: Two Steiner trees for which the sets κ(E(A)) are equal.
(a) An undirected graph G and a net N consisting of the blue and green nodes.
(b) A Steiner tree (A, κ) for N. The path A[s,t4] is colored brown, the red edge is mapped to ◦ by κ.
(c) A Steiner tree that cannot be represented using the usual Steiner tree definition.
(d) Edges in G used by the two Steiner trees.
In nearly all applications we will be interested in the delay along paths inside Steiner trees.
Definition 2.3 Let (A, κ) be a Steiner tree and let ν, ω ∈ V(A) such that ω is reachable
from ν in A. We define $A_{[\nu,\omega]}$ as the unique ν-ω path in A.
By the assumption that A is rooted at s, each element of N \ {s} is reachable from s.
This property is important for our application as we will use Steiner trees to propagate
“signals” from s to all remaining elements of N. Note that $\{\nu \in V(A) : \delta^+_A(\nu) = \emptyset\} = N \setminus \{s\}$
implies $V(A_{[s,t]}) \cap N = \{s, t\}$ for t ∈ N.
Examples of Steiner trees can be found in Figure 2.1.
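The path of Definition 2.3 is easy to compute when the arborescence is stored by predecessor pointers. The following Python sketch assumes such a `parent` dictionary; the representation and the function name are illustrative choices, not notation from this thesis.

```python
def tree_path(parent, nu, omega):
    """Return the vertices of the unique nu-omega path in an arborescence.

    parent maps every non-root vertex to its predecessor. A ValueError is
    raised if omega is not reachable from nu.
    """
    path = [omega]
    while path[-1] != nu:
        if path[-1] not in parent:
            raise ValueError("omega is not reachable from nu")
        path.append(parent[path[-1]])
    path.reverse()  # the walk went from omega towards the root
    return path
```

Walking from ω towards the root is sufficient because every vertex of an arborescence has at most one entering arc, so the path is unique whenever it exists.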
2.2 Topologies
An important property of a Steiner tree is its topology.
Definition 2.4 (Topology) A topology T for a net N with source s is an arborescence
with N ⊆ V(T) such that T is rooted at s and
\[ |\delta^+(v)| = \begin{cases} 0 & \text{if } v \in N \setminus \{s\}, \\ 1 & \text{if } v = s, \\ 2 & \text{otherwise.} \end{cases} \]
Vertices v with |δ⁺(v)| = 2 are called Steiner vertices or Steiner nodes.
Definition 2.5 Let T be a topology and let v, w ∈ V(T) such that w is reachable from v
in T. We define $T_{[v,w]}$ as the unique v-w path in T.
Definition 2.6 (Embedding of a topology) An embedding of a topology T for N
is a Steiner tree (A, κ) for N (in an arbitrary graph) such that there exists a function
φ : V(T) → V(A) with
• φ(v) = v for all v ∈ N,
• for (v, w) ∈ E(T), $\varphi(v) \in V(A_{[s,\varphi(w)]})$ and no vertex of $V(A_{[\varphi(v),\varphi(w)]}) \setminus \{\varphi(v), \varphi(w)\}$
is in the image of φ.
We say that T is the underlying topology of (A, κ) if (A, κ) is an embedding of T .
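The degree conditions of Definition 2.4 can be checked directly. The Python sketch below assumes that the given arcs already form an arborescence rooted at the source and only verifies the out-degrees; the function names are hypothetical.

```python
from collections import Counter


def has_topology_degrees(arcs, net, source):
    """Check the out-degree conditions of Definition 2.4.

    Assumes that arcs already form an arborescence rooted at source.
    """
    out_degree = Counter(v for v, _ in arcs)
    nodes = {source} | set(net) | {x for arc in arcs for x in arc}
    for v in nodes:
        if v == source:
            expected = 1
        elif v in net:
            expected = 0
        else:
            expected = 2
        if out_degree[v] != expected:
            return False
    return True


def steiner_nodes(arcs, net):
    """Vertices of the topology that do not belong to the net."""
    return {x for arc in arcs for x in arc} - set(net)
```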
Figure 2.2: Example of two different topologies underlying the same Steiner tree and one
topology that does not underlie that tree. Here, N = {s, t1, t2, t3, t4, t5}.
(a) Steiner tree (A, κ). (b) A topology underlying (A, κ). (c) Another topology underlying (A, κ).
(d) A topology not underlying (A, κ).
Note that each Steiner tree has an underlying topology but that this topology is not
unique. See Figure 2.2 for an example. The property that (A, κ) is an embedding of T is
independent of the function κ. Instead, we have an embedding relation $T \xrightarrow{\varphi} A \xrightarrow{\kappa} G$.
From the strict degree constraints of a topology we can immediately derive some
well-known properties:
Lemma 2.7 (“well known”)
• The number of Steiner nodes in a topology for N is |N| − 2.
• The number of different topologies for N is
\[ \prod_{i=1}^{|N|-2} (2i-1) = \frac{(2|N|-4)!}{2^{|N|-2}\,(|N|-2)!}. \]
Proof For a topology T let S be the set of Steiner nodes. It holds that
\[ |N| + |S| - 1 = \sum_{v \in V(T)} |\delta^-_T(v)| = |E(T)| = \sum_{v \in V(T)} |\delta^+_T(v)| = 1 + 2|S| \]
which implies the first fact. To prove the second fact we mention that for k ≥ 3 all
topologies for an instance with k pins can be obtained from the topologies for k − 1 pins
by subdividing an edge and that the topologies created that way are pairwise different. By
the first fact, the number of edges in a topology for k pins, i.e. for k − 1 sinks, is equal
to 2k − 3 and hence,
(# topologies for k sinks) = (2k − 3) · (# topologies for k − 1 sinks).
For a more detailed proof see Lemma 1.4 of [BZ15]. □
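The counting argument of the proof can be reproduced for small nets. The following Python sketch evaluates both sides of the formula in Lemma 2.7 and, as a cross-check, generates topologies by repeatedly subdividing an edge as in the proof. Representing a topology as a `frozenset` of arcs and labelling Steiner nodes by their insertion step are assumptions of this illustration.

```python
from math import factorial


def count_topologies_product(num_pins):
    """Left-hand side of Lemma 2.7: the product of (2i - 1) for i = 1, ..., |N| - 2."""
    result = 1
    for i in range(1, num_pins - 1):
        result *= 2 * i - 1
    return result


def count_topologies_formula(num_pins):
    """Right-hand side of Lemma 2.7: (2|N| - 4)! / (2^(|N| - 2) (|N| - 2)!)."""
    m = num_pins - 2
    return factorial(2 * m) // (2 ** m * factorial(m))


def enumerate_topologies(sinks, source="s"):
    """Generate topologies as frozensets of arcs by subdividing edges."""
    topologies = {frozenset({(source, sinks[0])})}
    for index, sink in enumerate(sinks[1:]):
        steiner = ("steiner", index)
        extended = set()
        for topology in topologies:
            for v, w in topology:
                # Subdivide (v, w) by a new Steiner node and attach the new sink to it.
                extended.add((topology - {(v, w)})
                             | {(v, steiner), (steiner, w), (steiner, sink)})
        topologies = extended
    return topologies


if __name__ == "__main__":
    for num_pins in range(2, 8):
        sinks = ["t%d" % i for i in range(1, num_pins)]
        print(num_pins, count_topologies_product(num_pins),
              count_topologies_formula(num_pins), len(enumerate_topologies(sinks)))
```

The enumeration grows as fast as the formula predicts, so it is only meant for very small nets.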
2.3 The Structure of a Computer Chip
Since this thesis will be largely technology independent, we can think of a computer chip
as a collection of wires that connect smaller cells and the chip itself.
These cells can be gates implementing the basic logic functions such as and, or, and
not, or they can be more complex chips again. Each of these cells contains pre-defined
parts to which wires have to be connected. These parts are called pins and are usually
very small such that we may consider them as points. The chip itself also contains pins,
the so-called primary pins. In the context of signal propagation we distinguish between
two types of pins, input pins and output pins (see Section 2.5.1). Especially if chips are
built hierarchically, pins are also called ports (see Section 6.6).
The large set of pins is partitioned into subsets, called nets. Except for a few special
cases that we do not consider in this thesis, each net contains exactly one source that
distributes the signal to the sinks of the same net. A source is either a primary input pin
or an output pin of a cell. All other pins of a net are primary output pins or input pins of
cells. Wires have to be arranged such that they form a tree for each net as we will see in
Section 2.4 in more detail.
For a more comprehensive description of the actual, much more complicated structure
of computer chips see [KRV07], [HHV15], and [Vyg16].
2.4 Packing of Wires
2.4.1 Routing Layers and Wire Codes
Wires are usually arranged in routing layers. Within each layer either only horizontal or
only vertical wires are allowed. In the first case we say that a layer has preferred direction
x while in the latter case it has preferred direction y. For the sake of simplicity we do not
consider wires that go against the preferred direction (so-called jogs) in this introductory
chapter although they are sometimes allowed in practice. Wires on adjacent routing layers
can be connected by a wire in z-direction, a so-called via.
Wires have a certain width and require a certain distance (spacing) to each other.
Widths and spacings are determined by a mapping layer ↦ (width, spacing) that is called
wire code. Wire widths and spacings influence packing and timing optimization as follows:
• Wires with small width and spacing consume less routing space and are thus easier
to pack.
• Wires with large width have a small resistance and wires with large spacing have
small capacitance. These wires allow a faster signal propagation.
Each design has a default wire code that usually assigns the smallest possible widths and
spacings to all layers. Even these smallest possible values are larger on higher layers
than on the lowest layers, which is caused by a different usage of metal.
The metal used on higher layers allows a faster signal propagation. As a result, optimization
algorithms that take into account packing (congestion) and timing constraints tend to
put wires of trees for critical nets on higher layers while uncritical nets are connected by
wires on the lower layers. If it is required to speed up signal propagation on a wire on a
low layer, it is possible to change the wire code such that a larger width and spacing is
assigned there.
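A wire code as described in this section is simply a mapping from layers to (width, spacing) pairs. The Python sketch below illustrates the packing side of the trade-off under a simplifying assumption that is not taken from this thesis: every wire occupies its width plus one spacing. All numbers, names, and the two example wire codes are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WireType:
    width: float
    spacing: float

    @property
    def pitch(self):
        """Space consumed by one wire in the simplified model: width plus spacing."""
        return self.width + self.spacing


# Hypothetical wire codes: layer -> (width, spacing); values grow on higher layers.
DEFAULT_WIRE_CODE = {1: WireType(1.0, 1.0), 2: WireType(1.0, 1.0), 3: WireType(2.0, 2.0)}
WIDE_WIRE_CODE = {1: WireType(2.0, 2.0), 2: WireType(2.0, 2.0), 3: WireType(4.0, 4.0)}


def tracks_available(wire_code, layer, channel_width):
    """Number of parallel wires of the given wire code that fit into a channel."""
    return int(channel_width // wire_code[layer].pitch)


if __name__ == "__main__":
    for layer in sorted(DEFAULT_WIRE_CODE):
        print(layer,
              tracks_available(DEFAULT_WIRE_CODE, layer, 20.0),
              tracks_available(WIDE_WIRE_CODE, layer, 20.0))
```

In this toy setting, switching a layer to the wide wire code halves the number of wires that fit, which is the packing price paid for faster signal propagation.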
2.4.2 Global and Detailed Routing
The design step in which nets are connected by wires is called detailed routing (see e. g.
[Ges+13][Ahr+15]). The output of this step is a set of Steiner trees (one for each net) in
which each edge represents a wire in the available routing space (Section 2.4.1) with an
assigned wire code.
Apart from the obvious requirement that wires must not overlap, the output of a
detailed routing has to obey complex design rules and has to ensure sufficiently fast signal
propagation. Among all solutions that satisfy these requirements we are looking for a
solution with minimum total wire length.
The Detailed Routing Problem is hard in theory and practice. Even the simpler related
problems of computing one shortest rectilinear Steiner tree ([GJ77], [GJ79] page 208f) and
computing a set of edge disjoint paths in a grid graph [Sch09] are NP-hard.
To solve the problem in practice, there is a prior step called global routing which models
the total routing space by a coarser grid graph. Instead of computing edge disjoint trees,
we are looking for a set of Steiner trees in that graph obeying certain capacity constraints.
Complex design rules are not taken into account. Finally, the solution of the global routing
step serves as a guide for the detailed routing.
In its most basic form the Global Routing Problem can be defined as follows:
Standard Global Routing Problem
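Instance: A directed or undirected graph $G$,
functions $\mathrm{cap} : E(G) \to \mathbb{R}_{\ge 0}$ and
$\mathrm{length} : E(G) \cup \{\circ\} \to \mathbb{R}_{\ge 0}$ with $\mathrm{length}(\circ) = 0$,
a set $\mathcal{N}$ of nets with $N \subseteq V(G)$ for $N \in \mathcal{N}$.
Output: A Steiner tree $(A_N, \kappa_N)$ for $N$ in $G$ for each $N \in \mathcal{N}$ such that
$$\sum_{N \in \mathcal{N}} |\{\zeta \in E(A_N) : \kappa_N(\zeta) = e\}| \le \mathrm{cap}(e) \quad \text{for all edges } e \in E(G)$$
and among all these solutions
$$\sum_{N \in \mathcal{N}} \sum_{\zeta \in E(A_N)} \mathrm{length}(\kappa_N(\zeta))$$
is minimum.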
In the above definition the space consumption of a wire is assumed to be 1. In the
simplified setting of the Standard Global Routing Problem, this assumption makes sense
since the selection of a non-default wire code is not necessary. In Chapter 3 we show how
to incorporate timing constraints into the global routing problem. In this extended model,
an edge ζ of a Steiner tree consumes a value usg(ζ) from its assigned edge in the global
routing graph. This value is then equal to the sum of width and spacing of the wire and
depends on the wire code.
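To illustrate the two conditions of the Standard Global Routing Problem, the following Python sketch checks the capacity constraint and evaluates the objective for a given set of trees. The data layout (each tree given as the list of global routing edges that its tree edges are mapped to) and the names `check_routing`, `cap`, `length`, and `trees` are assumptions made for this example only. The sketch verifies a candidate solution; it does not compute one, and it ignores the artificial element ◦ of length 0.

```python
from collections import Counter

def check_routing(trees, cap, length):
    """trees: dict net -> list of global routing edges, one entry per tree edge.
    cap, length: dict edge -> number.
    Returns (capacities respected?, total length of all trees)."""
    # every tree edge consumes one unit of capacity of the edge it is mapped to
    usage = Counter(e for edges in trees.values() for e in edges)
    feasible = all(usage[e] <= cap[e] for e in usage)
    total = sum(length[e] for edges in trees.values() for e in edges)
    return feasible, total

# hypothetical toy instance with three global routing edges and two nets
cap = {("a", "b"): 1, ("b", "c"): 2, ("a", "c"): 1}
length = {("a", "b"): 1.0, ("b", "c"): 1.0, ("a", "c"): 3.0}
trees = {"n1": [("a", "b"), ("b", "c")], "n2": [("b", "c")]}
feasible, total = check_routing(trees, cap, length)
```

In the timing-aware model of Chapter 3 the unit consumption would be replaced by a value usg(ζ) depending on the wire code.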
[Figure: global routing graph drawn with coordinate axes x, y, and z]
Figure 2.3: Partition of the chip area into tiles and the global routing graph formed by this
partition in the case of four routing layers. Wires on higher layers are wider than wires on lower
layers.
2.4.3 Global Routing Graph
In most practical applications the global routing graph is a 3-dimensional grid graph
arising from a subdivision of the chip area as follows. Let Z ∈ N be the number of
wiring planes, let [0,W ]× [0, H] be the chip area, and let 0 = x0 < x1 < . . . < xw = W ,
0 = y0 < . . . < yh = H be horizontal and vertical cuts (w, h ∈ N). The vertices of the
global routing graph represent the tiles. Two vertices are joined by an edge if one of the
following conditions holds:
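• They represent tiles with centers $\left(\frac{x_i + x_{i+1}}{2}, \frac{y_j + y_{j+1}}{2}, z\right)$ and $\left(\frac{x_{i+1} + x_{i+2}}{2}, \frac{y_j + y_{j+1}}{2}, z\right)$ of
the same layer $z$ with $x$ as the preferred routing direction on $z$.
• They represent tiles with centers $\left(\frac{x_i + x_{i+1}}{2}, \frac{y_j + y_{j+1}}{2}, z\right)$ and $\left(\frac{x_i + x_{i+1}}{2}, \frac{y_{j+1} + y_{j+2}}{2}, z\right)$ of
the same layer $z$ with $y$ as the preferred routing direction on $z$.
• They represent tiles with centers of the form $\left(\frac{x_i + x_{i+1}}{2}, \frac{y_j + y_{j+1}}{2}, z\right)$ and
$\left(\frac{x_i + x_{i+1}}{2}, \frac{y_j + y_{j+1}}{2}, z + 1\right)$.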
Figure 2.3 depicts a global routing graph. An edge between adjacent tiles represents the
routing space between them.
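The three edge conditions can be sketched in a few lines of Python. The indexing is an assumption of this example: a tile is identified by a triple (i, j, z) with 0-based indices, where (i, j) refers to the rectangle between the cuts x_i, x_{i+1} and y_j, y_{j+1}, and the list `directions` gives the preferred direction of each layer. The function name `global_routing_graph` is hypothetical.

```python
def global_routing_graph(w, h, directions):
    """Edges of the 3-dimensional grid graph on w x h tiles per layer.
    directions[z] is "x" or "y", the preferred direction of layer z."""
    edges = []
    num_layers = len(directions)
    for z, d in enumerate(directions):
        for i in range(w):
            for j in range(h):
                if d == "x" and i + 1 < w:      # neighbour in x-direction
                    edges.append(((i, j, z), (i + 1, j, z)))
                if d == "y" and j + 1 < h:      # neighbour in y-direction
                    edges.append(((i, j, z), (i, j + 1, z)))
                if z + 1 < num_layers:          # via edge to the next layer
                    edges.append(((i, j, z), (i, j, z + 1)))
    return edges

# 3 x 2 tiles on two layers with alternating preferred directions
edges = global_routing_graph(3, 2, ["x", "y"])
```

For this small instance the graph has 4 edges on the x-layer, 3 edges on the y-layer, and 6 via edges.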
In the definition of the Standard Global Routing Problem a net is a subset of the vertex
set. This can be achieved by projecting a pin to the node representing a tile that has
non-empty intersection with that pin. To overcome the inaccuracy induced by these pin
projections, Saccardi [Sac15] proposed a continuous global routing model that takes into
account the actual pin positions.
Often, the layers of the global routing graph can be grouped into pairs such that for
each of these layer pairs {z, z′},
• |z − z′| = 1 and hence they have different preferred direction,
• wires on z and z′ use the same metal,
• the default wire code assigns the same width and spacing to wires on z and z′.
After contracting the vias between two grouped layers, the global routing graph can be
considered a special case of the more general global routing graph structure with vertex set
$M \times \{1, \dots, \frac{Z}{2}\}$ for a finite metric space $(M, \mathrm{dist})$. The distance function dist allows us to
define a geometric distance between two vertices as the distance between their projections
onto M . We will use this property in later parts of this thesis (see e. g. Section 6.3.2).
2.5 Timing Optimization
In this section we briefly describe the basic concepts of timing optimization in VLSI design
that are needed in this thesis. For a more comprehensive overview on timing optimization
see [Hel08], [Sap04], and [Sch15].
2.5.1 Signals
The most important concept in timing optimization is a signal. A signal can be considered
a voltage change over time at a certain pin. For the sake of simplicity we imagine that
a chip has only two electrical potentials. The ground voltage V0 represents the logical
value 0. The positive voltage over ground, Vdd, represents the logical value 1. If the
voltage changes from ground voltage V0 to Vdd we speak of a rising signal. If the voltage
changes from Vdd to V0 we speak of a falling signal. The voltage change occurs gradually
and requires a certain time (see Figure 2.4). The time at which the voltage at a pin has
reached 50%Vdd is called arrival time. We call the time the signal needs to change from
10%Vdd to 90%Vdd (in case of a rising signal) or from 90%Vdd to 10%Vdd (in case of a
falling signal) respectively, the slew.
[Figure: voltage plotted over time for a rising signal, with the 10%, 50%, 90%, and 100% levels of Vdd, the arrival time, and the slew marked]
Figure 2.4: Example of a voltage change from V0 to Vdd, i. e. a rising signal. The time at which
the voltage reaches 50%Vdd is called arrival time. The time needed for the signal to change from
10%Vdd to 90%Vdd is called slew.
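The definitions of arrival time and slew can be made concrete with a small Python sketch. It assumes, purely for illustration, that a rising signal is given as a piecewise linear waveform sampled at a few time points; the function names `crossing_time` and `arrival_and_slew` and the sample data are hypothetical.

```python
def crossing_time(samples, level):
    """First time at which a piecewise linear rising waveform reaches `level`.
    samples: list of (time, voltage) pairs sorted by time."""
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if v0 < v1 and v0 <= level <= v1:
            # linear interpolation inside the segment that crosses the level
            return t0 + (level - v0) * (t1 - t0) / (v1 - v0)
    raise ValueError("waveform never reaches the requested level")

def arrival_and_slew(samples, vdd):
    """Arrival time (50% of Vdd) and slew (10% to 90% of Vdd) of a rising signal."""
    arrival = crossing_time(samples, 0.5 * vdd)
    slew = crossing_time(samples, 0.9 * vdd) - crossing_time(samples, 0.1 * vdd)
    return arrival, slew

# the voltage stays at 0 until time 1 and then rises linearly to Vdd = 1 at time 3
samples = [(0.0, 0.0), (1.0, 0.0), (3.0, 1.0)]
arrival, slew = arrival_and_slew(samples, vdd=1.0)
```

For a falling signal the same idea applies with the roles of the 10% and 90% levels exchanged.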
(a) A small chip with 16 pins and 4 gates. The
green pins are the timing starting points, the blue
pins are the timing end points.
(b) Corresponding standard timing graph consisting
of 16 vertices, 9 net edges, and 7 gate edges.
Figure 2.5: Example of a small chip with corresponding standard timing graph. Two signal paths
are highlighted (red and yellow). Each of these paths induces a signal at the orange pin. In this
thesis we do not distinguish between different timing phases and assume that these signals are
equal.
2.5.2 Standard Timing Graph
A signal at a source pin of a net induces signals at all sink pins of the same net. A signal at
an input pin of a logic gate can induce signals at output pins of that gate. This dependency
is modeled as an acyclic directed graph, the standard timing graph. The vertices of the
standard timing graph are the pins of the chip. Two pins p, q are joined by an edge (p, q) if
• p is source and q is sink pin of the same net, or
• p is input and q is output pin of the same logic gate.
Figure 2.5 shows an example of a small timing graph.
The vertex set $V$ of a standard timing graph can be written as
$V = V_{in} \,\dot\cup\, V_{out} \,\dot\cup\, (V \setminus (V_{in} \cup V_{out}))$.
Vertices in $V_{in}$ represent pins at which signals start (primary input pins or
latch output pins). The in-degree of these vertices is zero. Signals end in vertices in Vout.
These vertices have no outgoing edges and represent timing endpoints (primary output pins
and latch input pins). In Figure 2.5 the set Vin is colored green and Vout is colored blue.
Maximal paths in the standard timing graph, i. e. Vin - Vout paths, are called signal paths.
Their number can be exponential in the number of pins (see Figure 3.1 for an example).
The yellow and the red path highlighted in Figure 2.5 are examples of signal paths.
A vertex of the standard timing graph can be reachable from several timing start points
that originate different signals. In order to distinguish different signals (e. g. signals with
different origins) we can associate them with a phase.
Taking different timing phases and their different arrival times, required arrival times, and
delays into consideration is a straightforward task for all problems addressed in this thesis
but requires a more complicated notation. For the sake of simplicity we do not distinguish
between different timing phases and assume that there is exactly one signal.
2.5.3 Static Timing Analysis
The major task in (late mode) timing optimization is to make sure that the signal arrives
in time at all timing end points. Static timing analysis (Hitchcock et al. [HSC82]) is a
method to check this property.
Let D be a standard timing graph. We assume that we are given arrival times at(v) at
the timing start points v ∈ Vin ⊂ V (D) and required arrival times rat(w) at the timing
end points w ∈ Vout ⊂ V (D). Let d(e) be the time the signal needs to traverse an edge
e ∈ E(D). This value is called delay and there are several delay models that can be used
to approximate it. In this thesis we will get to know three different delay models: the basic
variant of the Elmore delay model (Section 4.2.1), the linear delay model (Section 6.1),
and the Elmore Delay Model with Slew Propagation that takes slew effects into account
(Section 7.1.3).
We say that all timing constraints are met if for each signal path P starting in v ∈ Vin
and ending in w ∈ Vout,
$$\mathrm{at}(v) + \sum_{e \in E(P)} d(e) \le \mathrm{rat}(w).$$
Since the number of signal paths is usually very large, we cannot use this definition directly.
Instead, we propagate the latest arrival time along the standard timing graph in topological
order.
Starting with the given arrival times at timing start points we compute the latest
arrival time at a vertex u ∈ V (D)\Vin recursively with the formula
$$\mathrm{at}(u) = \max\{\mathrm{at}(x) + d((x, u)) : (x, u) \in \delta^-_D(u)\}.$$
The signal arrives in time if at(w) ≤ rat(w) for all w ∈ Vout. This condition can be checked
in linear time.
During early design phases, it will rarely be the case that all timing constraints are
satisfied. Instead, we have to speed up critical parts of the chip. To identify the most
critical parts we first propagate required arrival times along the standard timing graph in
reverse topological order. For u ∈ V (D)\Vout we recursively set
$$\mathrm{rat}(u) = \min\{\mathrm{rat}(x) - d((u, x)) : (u, x) \in \delta^+_D(u)\}.$$
We define the worst slack at a pin u ∈ V (D) as
wsl(u) = rat(u)− at(u).
The smaller wsl(u), the more critical pin u is.
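The two propagation steps and the slack computation can be sketched in Python as follows. This is a minimal illustration that assumes fixed edge delays given as a dictionary; in the thesis the delays come from one of the delay models mentioned above. The function name `static_timing_analysis` and the toy instance are hypothetical, and `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

def static_timing_analysis(edges, delay, at_start, rat_end):
    """edges: list of (p, q) of an acyclic timing graph; delay: dict (p, q) -> float;
    at_start: arrival times at timing start points; rat_end: required arrival
    times at timing end points. Returns (at, rat, slack) as dictionaries."""
    preds, succs = {}, {}
    for p, q in edges:
        preds.setdefault(q, []).append(p)
        succs.setdefault(p, []).append(q)
        preds.setdefault(p, [])
        succs.setdefault(q, [])
    order = list(TopologicalSorter(preds).static_order())  # predecessors first
    at = dict(at_start)
    for u in order:                      # latest arrival times, topological order
        if preds[u]:
            at[u] = max(at[x] + delay[(x, u)] for x in preds[u])
    rat = dict(rat_end)
    for u in reversed(order):            # required arrival times, reverse order
        if succs[u]:
            rat[u] = min(rat[x] - delay[(u, x)] for x in succs[u])
    slack = {u: rat[u] - at[u] for u in order}
    return at, rat, slack

# toy instance: start points a and b, end point d
edges = [("a", "c"), ("b", "c"), ("c", "d")]
delay = {("a", "c"): 2.0, ("b", "c"): 1.0, ("c", "d"): 3.0}
at, rat, slack = static_timing_analysis(edges, delay, {"a": 0.0, "b": 2.0}, {"d": 6.0})
```

Each edge is inspected a constant number of times, which reflects the linear running time stated above.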
2.5.4 Electrical Properties
Signal delays depend on electrical capacitances and slews that we must compute efficiently.
Capacitance. We need to compute the capacitance at the source pin or at the Steiner
nodes of a Steiner tree for a net N . Given capacitances at all sink pins of N and
the capacitance per length for all layer / wire code pairs, we can compute the missing
capacitances in linear time by propagation in reverse topological order. The required
capacitances and capacitance per length values are usually given with the library and the
design.
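The propagation in reverse topological order can be sketched as a bottom-up pass over the tree. The representation is an assumption of this example: the Steiner tree is rooted at the source and given by child lists, each wire carries a length and a capacitance per length, and the downstream capacitance of a node is the sum of its own pin capacitance, the wire capacitances to its children, and the children's downstream capacitances. All names are hypothetical.

```python
def downstream_capacitance(children, wire_len, cap_per_len, sink_cap, root):
    """children: dict node -> list of children in the tree rooted at `root`;
    wire_len[(u, v)], cap_per_len[(u, v)]: data of the wire from u to child v;
    sink_cap: dict sink -> pin capacitance."""
    # collect nodes so that every node appears after its parent
    order, stack = [], [root]
    while stack:
        u = stack.pop()
        order.append(u)
        stack.extend(children.get(u, []))
    cap = {}
    for u in reversed(order):            # children are handled before parents
        cap[u] = sink_cap.get(u, 0.0) + sum(
            cap[v] + wire_len[(u, v)] * cap_per_len[(u, v)]
            for v in children.get(u, []))
    return cap

# source s, one Steiner node v, two sinks t1 and t2
children = {"s": ["v"], "v": ["t1", "t2"]}
wire_len = {("s", "v"): 2.0, ("v", "t1"): 1.0, ("v", "t2"): 3.0}
cap_per_len = {e: 0.5 for e in wire_len}
cap = downstream_capacitance(children, wire_len, cap_per_len,
                             {"t1": 1.0, "t2": 2.0}, "s")
```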
Slew. For each timing start point p ∈ Vin the slew slew(p) at p comes with the design
data. For each gate edge (v, w) of the standard timing graph we are given a function
$\mathrm{outslew}_{(v,w)} : \mathbb{R}_{\ge 0} \times \mathbb{R}_{\ge 0} \to \mathbb{R}_{\ge 0}$
that depends on the slew at v and the capacitance at w. Slew changes induced by
propagation along wires are given by functions
$\mathrm{wireslew}_{(z,wc)} : \mathbb{R}_{\ge 0} \times \mathbb{R}_{\ge 0} \times \mathbb{R}_{\ge 0} \to \mathbb{R}_{\ge 0}$
for each pair (z,wc) of layer z and wire code wc. These functions depend on the length of
the wire, the slew at v, and the capacitance at w.
In practice, these functions also depend on other parameters such as the type of the
signal at v and w (e. g. rising or falling signal). For the sake of simplicity we ignore this
fact here. As for the timing phases, distinguishing between rising and falling signals does
not require new algorithmic ideas for all parts of this thesis.
After having computed Steiner trees for all nets and capacitances at all nodes of the
Steiner trees, we can compute slews at all nodes by a linear number of evaluations of the
above functions.
Except for monotonicity in each argument we do not make any assumptions on the
functions outslew and wireslew but use them as black-box functions.
2.5.5 Buffering
Buffering is one of the main tasks in timing optimization and a main topic of this thesis.
It consists of reducing electrical capacitances per net by inserting further gates, so-called
repeaters. Repeaters can be either
• circuits implementing the identity function (so-called buffers), or
• circuits that turn rising signals into falling signals and vice versa (inverters).
We call a set containing all possible types of repeaters a repeater library. Repeaters of
different type can differ by size, power consumption, capacitance limit, and their timing
behavior. In this thesis we perform repeater insertion alongside the computation of Steiner
trees. Our task will be to compute a buffered Steiner tree for each net:
Definition 2.8 (buffered Steiner tree) Let L be a finite repeater library, let G be a
directed or undirected graph, and let N ⊆ V (G) be a net. We assume that for l ∈ L we are
given a set V (l) ⊆ V (G) of nodes at which insertion of a repeater of type l is allowed.
A buffered Steiner tree is a Steiner tree $(A, \kappa)$ for $N$ in $G$ together with a function
$b : V(A) \to L \cup \{\square\}$ such that $b(t) = \square$ for $t \in N$ and $b(\nu) = l \Rightarrow \kappa(\nu) \in V(l)$.
Setting $b(\nu) = l \in L$ means that Steiner point $\nu$ is associated with a repeater of type $l$,
while $b(\nu) = \square$ means that $\nu$ is not associated with a repeater.
Of course one has to be careful that the logic function computed by the chip stays the
same. In the context of buffering we deal with the netlist $\mathcal{N}$ obtained from the original
netlist $\mathcal{N}_{orig}$ by removing all repeaters. For each sink pin $t$ of a net $N \in \mathcal{N}$ we are given a
polarity
$\mathrm{pol}(t) \in \{\mathrm{invert}, \mathrm{ident}\}$.
This value encodes the parity of the number of inverters on the path between the source $s$ of $N$ and
$t$ in $\mathcal{N}_{orig}$. When computing a buffered Steiner tree $((A, \kappa), b)$ for $N$ we have to make
sure that the number
$$|\{\nu \in V(A_{[s,t]}) : b(\nu) \in L_{\mathrm{inverter}}\}|$$
of inverters on the path between $s$ and $t$ is
• even if pol(t) = ident, and
• odd if pol(t) = invert.
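The parity condition can be checked by walking from each sink to the source and counting inverters. The sketch below assumes that the buffered tree is given by parent pointers and that Python's `None` plays the role of "no repeater"; the function name `polarity_ok` and the repeater type name "INV" are hypothetical.

```python
def polarity_ok(parent, b, inverters, source, sink_pol):
    """parent: dict node -> parent in the tree rooted at `source`;
    b: dict node -> repeater type (missing or None means no repeater);
    inverters: set of repeater types that invert the signal;
    sink_pol: dict sink -> "invert" or "ident"."""
    for t, pol in sink_pol.items():
        count, v = 0, t
        while True:                      # walk the unique path from t up to the source
            if b.get(v) in inverters:
                count += 1
            if v == source:
                break
            v = parent[v]
        if (count % 2 == 1) != (pol == "invert"):
            return False
    return True

# tree s - u - t1 and u - w - t2 with inverters at u and w
parent = {"u": "s", "t1": "u", "w": "u", "t2": "w"}
b = {"u": "INV", "w": "INV"}
ok = polarity_ok(parent, b, {"INV"}, "s", {"t1": "invert", "t2": "ident"})
```

Here the path to t1 contains one inverter and the path to t2 contains two, so both polarities are respected.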
There are several objectives to measure the quality of a buffered Steiner tree such
as minimizing routing and placement congestion, and minimizing delays (see Chapters 4
and 7). The latter objective depends on the timing model we use. If we are using a delay
model that considers electrical capacitances, the impact of buffering on delays is usually
very large. Without buffering (i. e. if no Steiner node ν is associated with a repeater) we would
always end up with hopelessly bad timing, as we explain now.
In the presence of long interconnections, large wire capacitances and hence huge delays
cannot be avoided. With buffering we can subdivide a long wiring path into several smaller
pieces. Although we have to take the extra delay through the newly inserted repeaters into
account, buffering can improve the timing of a design significantly. While the delay through
a long wiring path can be almost quadratic in its length, the delay along an optimally
buffered path is (almost) linear in its length.
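The quadratic versus linear growth can be seen in a small numeric sketch. It uses a deliberately simplified Elmore-style model that is an assumption of this example and not the delay model of the thesis: a wire has resistance r and capacitance c per unit length, a segment of length l driving a load C has delay r·l·(c·l/2 + C), and every repeater contributes a fixed delay and a fixed input capacitance. All names and numbers are hypothetical.

```python
def unbuffered_delay(length, r, c):
    """Delay of an unbuffered wire in the simplified model: r * c * length^2 / 2."""
    return r * c * length ** 2 / 2.0

def buffered_delay(length, k, r, c, rep_cap, rep_delay):
    """Same wire split into k equal segments, each ending in a repeater with
    input capacitance rep_cap and fixed delay rep_delay."""
    seg = length / k
    return k * (rep_delay + r * seg * (c * seg / 2.0 + rep_cap))

r, c, rep_cap, rep_delay = 1.0, 1.0, 0.5, 2.0
short_plain = unbuffered_delay(10.0, r, c)
long_plain = unbuffered_delay(20.0, r, c)
short_buf = buffered_delay(10.0, 2, r, c, rep_cap, rep_delay)
long_buf = buffered_delay(20.0, 4, r, c, rep_cap, rep_delay)
```

With a fixed repeater spacing, doubling the length quadruples the unbuffered delay but only doubles the buffered one.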
In addition to timing improvements for long paths, buffering is important if nets have
many sinks. For these nets, the accumulated capacitances of the sink pins alone can
result in unacceptably bad timing. Buffering helps to replace a large net by several smaller ones.
Apart from the positive effect on the timing of a chip, capacitance reductions that
result from a good buffering are important to obey electrical constraints. At source pins
we are usually given a capacitance limit. Obeying these limits is necessary to reduce
electromigration effects and to guarantee strengths of signals.
Besides capacitance limits there are slew limits at all sink pins as well as a slew limit for
the timing phase that has to be obeyed by all signals. The latter is typically very tight
and dominates slew limits at sink pins. Without capacitance reductions that result from
buffering it would be impossible to obey slew limits.
2.5.6 Power Consumption
A chip consumes power and during timing optimization we have to keep this power
consumption as small as possible. We distinguish between static and dynamic power
consumption. Static power is consumed by each circuit when it is not switching. Dynamic
power is consumed by a circuit when it is charging or discharging and is proportional to the
product of its downstream capacitance and a switching factor that estimates the switching
frequency of the circuit.
In the context of the buffering problem we assume that we are given a function
$\mathrm{power} : L \cup \{\square\} \to \mathbb{R}_{\ge 0}$
that determines the static power consumption of a repeater of type $l \in L$ and that assigns
$\mathrm{power}(\square) = 0$, where $\square$ stands for "no repeater" as in Definition 2.8. When we compute
a buffered Steiner tree $((A, \kappa), b)$ we want to minimize the total static power consumption
$$\sum_{\nu \in V(A)} \mathrm{power}(b(\nu)).$$
For the sake of a simpler notation we do not address the dynamic power consumption
directly in this thesis. In all parts in which we compute buffered Steiner trees, optimizing
dynamic power consumption can be done analogously to optimizing the delay along a
newly inserted repeater and the source gate of a net.
Chapter 3
Global Routing from a Timing Point
of View
In this chapter we deal with the global routing problem introduced in Section 2.4.2.
One of the most important results on global routing is by Müller, Radke, and Vygen
[MRV11], who approximated the Standard Global Routing Problem within a factor that
is arbitrarily close to the approximation guarantee of an algorithm for the Minimum Cost
Steiner Tree Problem. They also achieved good results on practical VLSI instances.
We show how to incorporate global static timing constraints into their approach while
increasing the size of the model only polynomially. Our approach works for many delay models,
including the Elmore delay model (Section 4.2.1) and the linear delay model (Section 6.1).
The results of this chapter are joint work with Stephan Held, Dirk Müller, Rudolf
Scheifele, Vera Traub, and Jens Vygen [Hel+17][Hel+15].
3.1 Timing: An Essential Objective for Global Routing
Global routing is an essential part of any modern physical design flow. It serves as
preparation for detailed routing, is used for quick congestion estimation during placement,
and is input to many steps in timing optimization. To achieve timing closure, it is
necessary that nets on timing-critical paths are routed on high layers (cf. Section 2.4.1)
and that connections to the most critical sinks are as short as possible. In some cases, wires need
to have a wire code that allows even faster signal propagation at the cost of an increased
routing space consumption. Making the right layer and wire code choices, and choosing
efficient routing topologies that trade off delay against routing congestion, is a key task in global
routing.
As many algorithms for global routing used in practice are not able to optimize timing
directly, there is a step called layer and wire code assignment in which wire codes and a
range of wiring planes are assigned to timing-critical nets. These assignments then serve
as constraints during non-timing-driven global routing. An example of a congestion-driven
layer assignment algorithm that uses a timing-unaware global router as a black box for
congestion analysis is CATALYST [Wei+13]. For more details on the layer assignment
step see Section 8.2.
These two-step approaches have a big disadvantage: Layer ranges and wire codes can
only be assigned to entire nets. For nets containing both critical and uncritical sinks such
an approach can involve a significant waste of routing resources on higher layers.
To overcome these limitations, several methods to address timing during global routing
directly have been introduced.
For netlists consisting of two-terminal nets only, Albrecht et al. [Alb+02] gave a fully
polynomial time approximation scheme (FPTAS) that finds a global routing minimizing
buffer area, routing congestion, and sink delays using a multicommodity flow approach
including net-based delay constraints.
Huang et al. [Hua+93] used net-based delay bounds and rejected Steiner trees violating
these bounds. Hong et al. [Hon+97] generalized this idea and introduced path-based delay
bounds. Now, Steiner trees leading to a delay violation of a path through the net are
discarded. The major disadvantage of the algorithms by [Hua+93] and [Hon+97] is that
routing capacities might be violated in order to meet net- or path-based delay bounds that
serve as hard constraints.
In contrast to the previous authors, Vygen [Vyg04] treated path-based delay
bounds similarly to routing capacity constraints and was thus able to optimize
timing together with routing congestion. As the number of critical paths is usually
exponential in the size of the netlist, this approach does not yield a polynomial time
algorithm. The algorithm by Vygen [Vyg04] is based on the Min-Max Resource Sharing
Problem which we describe in more detail in Section 3.2.
Other notable approaches to global routing with timing constraints have been developed
by Hu and Sapatnekar [HS02], Yan and Lin [YL04], and Yan, Lee, Chen, and Huang
[Yan+06]. They all start with timing-driven but congestion-unaware Steiner trees for
all nets, which are then embedded into the global routing graph such that congestion is minimized. The
approaches differ in the way the trees are embedded. Recently, Samanta et al. [Sam+15]
proposed to pre-compute a set of alternative timing-driven Steiner trees for each net.
3.2 Min-Max Resource Sharing
3.2.1 Resources and Customers: An Abstract View on Global Routing
In this section we consider the global routing problem from a more abstract point of view.
Let N be a net and let (A, κ) be a Steiner tree for N in a global routing graph. The Steiner
tree (A, κ) consumes from several resources:
• Since edges in A model wires on the actual chip, (A, κ) consumes a certain amount
of routing space.
• A is also used to distribute signals from the electrical source s of N to its electrical
sinks. All signal paths can be considered as resources with capacities equal to the
difference between the required arrival time at the path’s end point and the signal’s
arrival time at the starting point. The time a signal needs to traverse the unique
path in A from s to a sink in N\{s} (and also the time needed to traverse the gate
having s as its output pin) can be regarded as consumption of (A, κ) from these
resources.
• Depending on the application we might also want to minimize net length or power
consumption and notice that a Steiner tree consumes from these resources as well.
To obtain a mathematical model we consider a net as a customer consuming from a
certain set of resources with given resource capacities. The goal of the Min-Max Resource
Sharing Problem is to achieve that the total combined consumption of all customers from
each resource is upper-bounded by its capacity, or, if this is not achievable, to bound the
maximum violation of a resource capacity.
Min-Max Resource Sharing Problem
Instance: A set $C$ of customers, a set $R$ of resources,
a convex set $B_c$ of solutions for each $c \in C$,
convex functions $\mathrm{usg}_{c,r} : B_c \to \mathbb{R}_{\ge 0}$ for all $c \in C$, $r \in R$.
Output: A solution $\mathrm{sol}(c) \in B_c$ for all $c \in C$ such that the maximum resource
consumption
$$\max_{r \in R} \sum_{c \in C} \mathrm{usg}_{c,r}(\mathrm{sol}(c))$$
is minimum.
A solution sol(c) ∈ Bc for a customer c ∈ C is also called a block. The functions usgc,r
specify which fraction of a resource r a solution for a customer c consumes. Note
that the minimum does not necessarily exist unless all Bc are compact (which will always
be the case in our application). Since we focus on approximation algorithms, the existence
of the minimum will not be important to us.
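If the set $B_c^I$ of solutions for a customer $c \in C$ is finite we define $B_c$ to be the set of
formal convex combinations of $B_c^I$,
$$B_c := \left\{ \sum_{\mathrm{sol} \in B_c^I} \mu_{\mathrm{sol}} \cdot \mathrm{sol} \;:\; \mu_{\mathrm{sol}} \ge 0 \text{ for all } \mathrm{sol} \in B_c^I,\ \sum_{\mathrm{sol} \in B_c^I} \mu_{\mathrm{sol}} = 1 \right\}.$$
A discrete function $\mathrm{usg}^I_{c,r} : B_c^I \to \mathbb{R}_{\ge 0}$ can be extended to $B_c$ by
$$\mathrm{usg}_{c,r}\left( \sum_{\mathrm{sol} \in B_c^I} \mu_{\mathrm{sol}} \cdot \mathrm{sol} \right) := \sum_{\mathrm{sol} \in B_c^I} \mu_{\mathrm{sol}} \cdot \mathrm{usg}^I_{c,r}(\mathrm{sol}).$$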
For the Standard Global Routing Problem, $C$ is the set of nets and $R$ is the set of edges
in the global routing graph $G$. Solutions $B_N^I$ for a net $N \in C$ are the Steiner trees for $N$.
The function $\mathrm{usg}^I_{N,e}$ tells which fraction of the available routing space at an edge $e \in E(G)$
is consumed by a Steiner tree for $N$. We obtain $B_N$ and $\mathrm{usg}_{N,e}$ as described above. We can
think of $\mathrm{usg}^I_{N,e}(\mathrm{sol}(N))$ as the sum of wire widths and spacings of edges of Steiner tree
$\mathrm{sol}(N)$ for $N$ mapped to $e$. By this interpretation it is easy to model usage of non-default
wire codes and assignment of an extra spacing larger than the minimum spacing specified
by the wire code. We will henceforth not mention spacings of wires explicitly.
Additional constraints and objectives such as minimizing total net length, total power
consumption, or optimizing manufacturing yield can be modeled by adding further resources
(see [MRV11], [Mül06], [Vyg04]).
Figure 3.1: The number of timing paths can be exponential in the size of the netlist.
Modeling timing constraints is more difficult. A naive approach would be to insert
a resource for each signal path. Unfortunately, the number of these paths is usually
exponential in the size of the netlist. An example of a netlist with an exponential number
of signal paths is shown in Figure 3.1. Each path from the primary input (the green
vertex) to the primary output colored in blue can contain the upper or the lower input
pin of each of the four and-gates, leading to a total number of $2^{\#\text{and-gates}}$ maximal paths.
Timing graphs that occur in practice are much more complex than depicted here and even
enumerating all signal paths would lead to a prohibitively long running time. In this chapter we
show how to model timing constraints by a polynomial number of additional resources and
customers. For most practical instances, this number is even linear.
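The growth of the number of signal paths can be reproduced with a short Python sketch. It builds a chain of and-gates in the spirit of Figure 3.1 (the exact netlist of the figure is not available here, so the pin names and the chain structure are assumptions) and counts the signal paths by dynamic programming in topological order instead of enumerating them. `graphlib` requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter

def count_signal_paths(edges, starts, ends):
    """Number of paths from a start point to an end point in an acyclic graph."""
    preds = {}
    for p, q in edges:
        preds.setdefault(q, []).append(p)
        preds.setdefault(p, [])
    paths = {}
    for u in TopologicalSorter(preds).static_order():
        # a start point carries one path, any other pin sums over its predecessors
        paths[u] = 1 if u in starts else sum(paths[x] for x in preds[u])
    return sum(paths[w] for w in ends)

def and_chain(k):
    """Standard timing graph of k and-gates in a row; both inputs of each gate
    are driven by the output of the previous gate (or by the primary input)."""
    edges, driver = [], "in"
    for i in range(k):
        for pin in ("a", "b"):
            edges.append((driver, f"g{i}.{pin}"))        # net edge
            edges.append((f"g{i}.{pin}", f"g{i}.out"))   # gate edge
        driver = f"g{i}.out"
    edges.append((driver, "out"))
    return edges

paths_4 = count_signal_paths(and_chain(4), {"in"}, {"out"})
paths_40 = count_signal_paths(and_chain(40), {"in"}, {"out"})
```

Counting takes linear time even for 40 gates, whereas listing the paths one by one would not be practical.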
Block solvers. Usually, the sets Bc are not given explicitly but by oracle functions
that optimize linear functions over them, called block solvers. For given resource prices
$\mathrm{price} : R \to \mathbb{R}_{\ge 0}$ the block solver for a customer $c \in C$ returns $\mathrm{sol}(c) \in B_c$ approximately
minimizing
$$\sum_{r \in R} \mathrm{price}(r) \cdot \mathrm{usg}_{c,r}(\mathrm{sol}(c)).$$
Clearly, if these functions fail to return good solutions, we cannot hope for a good
approximation for the overall problem. However, if we have block solvers with finite
approximation ratios, we can find provably good solutions to the resource sharing problem
as we present in the next section.
Note that the Min-Max Resource Sharing Problem is a generalization of the well-known
Multicommodity Flow Problem: If all nets in our global routing application are two-terminal
nets, finding (fractional) Steiner trees for all nets is identical to finding (fractional) paths.
Hence, the Standard Global Routing Problem is identical to the Multicommodity Flow
Problem in this special case. Garg and Könemann [GK07] solved several variants of it
using a multiplicative price update strategy. The algorithms we present in the next section
use similar strategies.
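For a two-terminal net, a block solver reduces to a shortest path computation with respect to the priced usages. The following Python sketch uses Dijkstra's algorithm for this purpose; the graph representation, the function name `path_block_solver`, and the toy instance are assumptions of this example, and the sink is assumed to be reachable from the source.

```python
import heapq

def path_block_solver(adj, price, usg, s, t):
    """adj: dict vertex -> list of neighbours; price[e], usg[e]: price and usage
    of the undirected edge e = frozenset({u, v}). Returns an s-t path that
    minimizes the sum of price(e) * usg(e)."""
    dist, prev, heap = {s: 0.0}, {}, [(0.0, s)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                     # outdated heap entry
        if u == t:
            break
        for v in adj[u]:
            e = frozenset((u, v))
            nd = d + price[e] * usg[e]
            if nd < dist.get(v, float("inf")):
                dist[v], prev[v] = nd, u
                heapq.heappush(heap, (nd, v))
    path, v = [t], t
    while v != s:                        # reconstruct the path backwards
        v = prev[v]
        path.append(v)
    return path[::-1]

adj = {"a": ["b", "c"], "b": ["a", "d"], "c": ["a", "d"], "d": ["b", "c"]}
usg = {frozenset(e): 0.5 for e in [("a", "b"), ("b", "d"), ("a", "c"), ("c", "d")]}
price = {frozenset(("a", "b")): 1.0, frozenset(("b", "d")): 1.0,
         frozenset(("a", "c")): 3.0, frozenset(("c", "d")): 1.0}
cheap_first = path_block_solver(adj, price, usg, "a", "d")
price[frozenset(("a", "b"))] = 10.0      # the edge a-b has become popular
after_update = path_block_solver(adj, price, usg, "a", "d")
```

Raising the price of a heavily used edge makes the solver switch to the alternative route, which is the effect the price updates in the next section rely on.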
3.2.2 Algorithms for Min-Max Resource Sharing
In this section we give an overview of existing algorithms for the Min-Max Resource
Sharing Problem. The common idea of these algorithms is the following: We define a price
for each resource (initially all prices are 1) and use the block solvers to iteratively generate
solutions for all customers that are approximately optimum with respect to the current
prices. Whenever a solution consumes from a resource, we increase the resource’s price
and iterate this process with the new prices.
This way, “popular” resources become more and more expensive such that block solvers for
some customers will eventually pick solutions not using these popular resources any more.
In the end, customer c will receive the arithmetic mean of all its solutions. Note that the
arithmetic mean is contained in Bc due to the convexity assumption. For a more formal
description of this concept see Algorithm 1.
In most practical applications we prefer integral solutions over arbitrary elements of
Bc. For instance, we cannot realize a convex combination of Steiner trees on the actual
chip. To round fractional solutions, we use a combination of randomized rounding [RT87]
and traditional rip-up and re-route (see Section 3.8).
Instance: Resources R, customers C, number of phases p ∈ N,
convex sets Bc for all c ∈ C.
Output: Solutions sol(c) ∈ Bc for all c ∈ C.
(1) set price(r) := 1 for all resources r ∈ R
(2) for i = 1 to p do
(3)     for all customers c ∈ C:
(4)         compute $\mathrm{sol}_i(c) \in B_c$ approximately minimum w. r. t. the current prices
(5)         increase price(r) for resources r used by $\mathrm{sol}_i(c)$
(6) return $\left( \frac{1}{p} \cdot \sum_{i=1}^{p} \mathrm{sol}_i(c) \right)_{c \in C}$
Algorithm 1: General algorithm for solving Min-Max Resource Sharing Problems. Step (5) has
to be specified.
The first polynomial time approximation algorithm for the general version of the
Min-Max Resource Sharing Problem is due to Grigoriadis and Khachiyan [GK94]. For
each β > 0, their algorithm computes a (1 + β) · σ approximation using $\tilde{O}(|C| \cdot |R| \cdot \beta^{-2})$
calls to block solvers with approximation guarantee σ (step (4) of Algorithm 1). Unfortunately, their
result is restricted to the case that σ can be chosen arbitrarily close to 1. In 2008, Jansen
and Zhang [JZ08] got rid of the restriction to σ while obtaining the same approximation
ratio and running time as [GK94]. The fastest known algorithm is by Müller, Radke and
Vygen [MRV11]. Instead of a quadratic dependence on the instance size, the running time
of their algorithm only grows linearly in the number of resources and customers. More
precisely, they proved the following result:
Theorem 3.1 ([MRV11]) One can solve the Min-Max Resource Sharing Problem with
approximation ratio $(1 + \beta) \cdot \sigma$ for any $\beta > 0$ in $O(\theta(|C| + |R|) \log |R| (\log \log |R| + \beta^{-2}))$
time. Here, $\sigma \ge 1$ is the worst approximation ratio of a block solver and $\theta$ is the time for
an oracle call. If there exists a solution $\{\mathrm{sol}^*(c) : \mathrm{sol}^*(c) \in B_c \text{ for } c \in C\}$ such that
$\frac{1}{2} \le \max_{r \in R} \sum_{c \in C} \mathrm{usg}_{c,r}(\mathrm{sol}^*(c)) \le 2$, the running time reduces to $O(\theta(|C| + |R|) \beta^{-2} \log |R|)$.
One key idea to achieve an approximation ratio arbitrarily close to the approximation
guarantee of the block solvers is to update prices in a multiplicative way. In the algorithm
of Müller, Radke and Vygen [MRV11], a resource price $p$ is updated to $p \cdot e^{\gamma \cdot \mathrm{usg}}$ for a
constant γ > 0 after a block solver has computed a new solution that consumes a fraction
of usg from that particular resource. A detailed description of their core algorithm can be
found in Algorithm 2.
3.3 Modeling Timing
As demonstrated by Figure 3.1, the number of paths in the timing graph can be exponentially
large. Thus, adding a resource for each signal path would result in a set R of exponential
size and the original algorithm of Müller, Radke and Vygen [MRV11] would no longer
be a polynomial time algorithm. Modeling timing-critical paths only does not avoid this
problem since firstly, this number can still be too large, and secondly, paths that have been
uncritical before can become critical when their timing is ignored.
Instance: Resources R, customers C, number of phases $p \in \mathbb{N}$,
          convex sets $B_c$ for all $c \in C$,
          convex functions $usg_{c,r} : B_c \to \mathbb{R}_{\ge 0}$ for all $c \in C$, $r \in R$,
          price adjust factor $\gamma > 0$.
Output:   Solutions $sol_c \in B_c$ for all $c \in C$.
① set $price(r) := 1$ for all resources $r \in R$.
② set $X_c := 0$ and $x_{c,sol} := 0$ for all $c \in C$ and $sol \in B_c$.
③ for i = 1 to p do
④     while there is $c \in C$ with $X_c < i$:
⑤         compute $sol(c) \in B_c$ approximately minimum w. r. t. current prices
⑥         set $\xi := \min\{i - X_c,\ 1/\max\{usg_{c,r}(sol(c)) : r \in R\}\}$
⑦         set $x_{c,sol} := x_{c,sol} + \xi$ and $X_c := X_c + \xi$
⑧         set $price(r) := price(r) \cdot e^{\gamma \cdot \xi \cdot usg_{c,r}(sol(c))}$ for all $r \in R$
⑨ return $\Bigl( \frac{1}{p} \sum_{sol \in B_c} x_{c,sol} \cdot sol \Bigr)_{c \in C}$
Algorithm 2: Algorithm of Müller, Radke, and Vygen [MRV11] for the Min-Max Resource
Sharing Problem.
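The control flow of Algorithm 2 can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that every customer has a finite list of candidate solutions, so that the block solver is an exact minimum over that list. The names `resource_sharing`, `usage`, and the two-customer toy instance are hypothetical and not taken from [MRV11].

```python
import math

def resource_sharing(candidates, resources, usage, phases, gamma):
    """Sketch of Algorithm 2 for finite candidate sets (assumption of this example).

    candidates: dict customer -> list of candidate solutions
    usage(c, sol, r): relative usage of resource r by solution sol of customer c
    Returns dict customer -> {solution: weight}; weights of a customer sum to 1.
    """
    price = {r: 1.0 for r in resources}            # step 1
    x = {c: {} for c in candidates}                # step 2: x_{c,sol}
    X = {c: 0.0 for c in candidates}               # step 2: X_c
    for i in range(1, phases + 1):                 # step 3
        for c in candidates:                       # step 4 (one admissible order)
            while X[c] < i:
                # step 5: exact block solver over the finite candidate list
                sol = min(candidates[c],
                          key=lambda s: sum(price[r] * usage(c, s, r)
                                            for r in resources))
                peak = max(usage(c, sol, r) for r in resources)
                remaining = i - X[c]
                # step 6: take at most the fraction that uses 100% of a resource
                xi = remaining if peak == 0 else min(remaining, 1.0 / peak)
                # step 7
                x[c][sol] = x[c].get(sol, 0.0) + xi
                X[c] = float(i) if xi == remaining else X[c] + xi
                # step 8: multiplicative price update
                for r in resources:
                    price[r] *= math.exp(gamma * xi * usage(c, sol, r))
    # step 9: convex combination, stored as weights per solution
    return {c: {s: w / phases for s, w in x[c].items()} for c in candidates}

# Toy instance: two customers, each may occupy resource "r1" or "r2" completely.
resources = ["r1", "r2"]
candidates = {"A": ["r1", "r2"], "B": ["r1", "r2"]}
usage = lambda c, sol, r: 1.0 if sol == r else 0.0
result = resource_sharing(candidates, resources, usage, phases=4, gamma=1.0)
load = {r: sum(w * usage(c, s, r) for c in result for s, w in result[c].items())
        for r in resources}
```

In this toy instance the rising price of an occupied resource steers the second customer to the other resource, so no resource is loaded beyond its capacity.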
3.3.1 The Timing Graph
We now construct a directed graph D that helps us model timing in global routing more
efficiently. Let $\bar{D}$ denote the standard timing graph introduced in Section 2.5.2. The vertex
set V(D) consists of all vertices of $\bar{D}$ except for output pins of logic gates. More precisely,
the vertex set V(D) is defined as
\[ V(D) = V_{in} \,\dot\cup\, V_{gate} \,\dot\cup\, V_{out}, \]
where $V_{in}$ is the set of timing starting points (primary input and latch output pins), $V_{out}$ is
the set of timing end points (primary output and latch input pins), and $V_{gate}$ contains all
input pins of the logic gates of the chip. Note that all signal paths start in $V_{in}$ and end in
$V_{out}$. The edges E(D) of D correspond to signal propagation. Whenever we have a path
$\bar{P}$ in $\bar{D}$, we have a corresponding path P in D that arises from $\bar{P}$ by shortcutting all
subpaths of length 2 that have a gate output pin as their middle vertex. More formally, E(D)
is the set
\[ \{(v, w) : (v, w) \in E(\bar{D}) \text{ and } v, w \in V(D)\} \,\dot\cup\,
   \{(v, w) : \text{there is a gate output pin } x \text{ such that } (v, x), (x, w) \in E(\bar{D})\}. \]
Figure 3.2 depicts D. Note that
\[ |V(D)| \le |V(\bar{D})| \quad\text{and}\quad |E(D)| \le |E(\bar{D})| + M, \]
where $M = \sum_{x \text{ output pin of a gate}} |\delta^-(x)| \cdot |\delta^+(x)|$ with degrees taken in $\bar{D}$.
In practice, the number of input pins of a logic gate is a small constant and hence,
$M = O(|E(\bar{D})|)$, i. e. the size of D is linear in the size of $\bar{D}$. As $\bar{D}$ is acyclic, D is
acyclic as well.
It is also possible to use $D = \bar{D}$ directly, but not considering the gates' output pins results
in faster convergence in practice and simplifies modeling the dependence of gate delays on the
capacitances of solutions for net customers.
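The construction of E(D) can be sketched in a few lines of Python. The function name `reduced_timing_graph` and the pin names in the example are hypothetical; the sketch assumes that no edge of the standard timing graph joins two gate output pins.

```python
def reduced_timing_graph(std_edges, gate_outputs):
    """Return E(D): keep edges avoiding gate output pins, shortcut the rest."""
    preds, succs = {}, {}
    for v, w in std_edges:
        succs.setdefault(v, []).append(w)
        preds.setdefault(w, []).append(v)
    # Edges of the standard timing graph between two vertices of V(D).
    edges = {(v, w) for v, w in std_edges
             if v not in gate_outputs and w not in gate_outputs}
    # Shortcut every path (v, x), (x, w) whose middle vertex x is a gate output.
    for x in gate_outputs:
        for v in preds.get(x, []):
            for w in succs.get(x, []):
                edges.add((v, w))
    return edges

# One gate g with inputs g.a, g.b and output g.out driving h.a and out1.
std_edges = [("in1", "g.a"), ("in2", "g.b"),
             ("g.a", "g.out"), ("g.b", "g.out"),
             ("g.out", "h.a"), ("g.out", "out1")]
reduced = reduced_timing_graph(std_edges, {"g.out"})
```

Here the gate output has two incoming and two outgoing edges, so it contributes 2 · 2 = 4 shortcut edges, which matches its term in the bound M.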
(a) Signal path in the standard timing graph.
(b) Corresponding path in graph D. Vertices of D are primary input and latch output pins (Vin), input
pins of logic gates (Vgate), and primary output and latch input pins (Vout).
Figure 3.2: Directed graph D arising from the standard timing graph described in Section 2.5.2.
We use D to model timing within the Min-Max Resource Sharing model.
3.3.2 New Resources and Customers
The timing graph D can be used to model static timing within the resource sharing
framework. Our model is based on the following simple and well-known observation:
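Proposition 3.2 ("Folklore") Let D be a directed graph and let $d : E(D) \to \mathbb{R}_+$,
$a : V(D) \to \mathbb{R}_+$ be two functions.
If $a(v) + d(e) \le a(w)$ for all $e = (v, w) \in E(D)$, then
\[ a(v) + \sum_{e \in E(P)} d(e) \le a(w) \quad \text{for all $v$-$w$ paths $P$ in $D$.} \]
Proof For a path P with vertices $v_1, v_2, \dots, v_k$ it holds that
\begin{align*}
a(v_1) + \sum_{e \in E(P)} d(e)
  &= a(v_1) + \sum_{i=2}^{k} d((v_{i-1}, v_i)) + \sum_{i=2}^{k-1} a(v_i) - \sum_{i=2}^{k-1} a(v_i) \\
  &= \sum_{i=2}^{k} \bigl( a(v_{i-1}) + d((v_{i-1}, v_i)) \bigr) - \sum_{i=2}^{k-1} a(v_i) \\
  &\le \sum_{i=2}^{k} a(v_i) - \sum_{i=2}^{k-1} a(v_i) = a(v_k). \qquad \square
\end{align*}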
The result of Proposition 3.2 is not surprising since we have already used a similar
result during the static timing analysis explained in Section 2.5.3: An exponential number
of delay bounds on timing paths can be checked in linear time by defining arrival times. If
we can find arrival times for all vertices of D such that
• arrival times of pins in Vin coincide with the signals' given arrival times,
• arrival times of pins in Vout coincide with required arrival times, and
• the delay along an edge in D is not larger than the difference between the arrival
times of its endpoints,
then, all signals arrive in time.
To make use of this knowledge in the resource sharing model we need a lower bound
amin(v) and an upper bound amax(v) on the arrival time at a vertex v ∈ V (D) in any
timing-feasible solution. In Section 3.4 we describe how to choose these bounds.
In this section we show how to extend the resource sharing model based on the
assumption that arrival time intervals [amin(v), amax(v)] are given for all vertices. As in the
resource sharing formulation of the Standard Global Routing Problem ([MRV11]) there will
be a customer for each net and resources for routing congestion and other objectives like
net length and power. However, these will no longer be the only customers and resources.
Timing resources and arrival time customers. Each edge e = (v, w) ∈ E(D) is
included as a timing resource with capacity amax(w) − amin(v), and we assume that this
capacity is strictly larger than zero. We include each vertex v ∈ Vgate ∪ Vout as an arrival time
customer whose set of feasible solutions is the interval
[amin(v), amax(v)].
A solution a(v) ∈ [amin(v), amax(v)] consumes an amount of
• amax(v) − a(v) from each resource e ∈ δ−(v), and
• a(v) − amin(v) from each resource e ∈ δ+(v).
Net customers also consume from timing resources. Let N be a net with source pin s
and let (A, κ) be a Steiner tree for N .
For t ∈ N\{s} we can compute the delay delay(A,κ)(s, t) along the unique s-t path
in (A, κ). If s is the output pin of a gate g, we can also compute the delay delayg(u, s)
through g between an input pin u of g and s.
In this thesis we assume that delays are positive and depend on (A, κ) only. This is the
case for the linear delay model we use to estimate timing before buffering (Section 6.1) and
for the Elmore delay model (Section 4.2.1) we use to measure timing in a buffered netlist.
If delays are influenced by slews, this resource sharing model can still serve as a heuristic.
Net customer N consumes
• delay(A,κ)(s, t) from timing resource (s, t) for each t ∈ N\{s} if s ∈ Vin and
• delayg(u, s) + delay(A,κ)(s, t) from timing resource (u, t) for each input pin u of g
and each t ∈ N\{s} if s is an output pin of gate g.
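To see how the three kinds of consumption add up on one timing resource (v, w), the following Python sketch computes the relative usage for an edge between two arrival time customers. The function name and the numbers are hypothetical; `delay` stands for the net customer's total consumption (net delay plus gate delay where applicable).

```python
def timing_resource_usage(a_v, a_w, delay, amin_v, amax_w):
    """Relative usage of timing resource (v, w); a value above 1 is a violation."""
    capacity = amax_w - amin_v            # assumed to be strictly positive
    consumed = (a_v - amin_v)             # arrival time customer v (edge in delta^+(v))
    consumed += delay                     # net customer
    consumed += (amax_w - a_w)            # arrival time customer w (edge in delta^-(w))
    return consumed / capacity

# a(v) + delay <= a(w) holds exactly when the usage is at most 1.
ok = timing_resource_usage(a_v=3, a_w=7, delay=4, amin_v=1, amax_w=10)
violated = timing_resource_usage(a_v=3, a_w=7, delay=5, amin_v=1, amax_w=10)
```

The capacity amax(w) − amin(v) cancels the interval borders, so the resource is within capacity precisely when the chosen arrival times leave enough room for the delay.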
Timing relaxation resources. In addition to timing resources, we add a relaxation
resource relax(v) for timing endpoints v ∈ Vout with amax(v) > amin(v). This resource has
capacity $\frac{a_{\max}(v) - a_{\min}(v)}{\beta}$, where β > 0 is the parameter from Theorem 3.1. Arrival time
customer v is the only customer that consumes from it. Solution a(v) consumes
\[ \frac{a_{\max}(v) - a_{\min}(v)}{\beta} + a(v) - a_{\min}(v). \]
By this definition, each arrival time solution consumes at least 100% of the capacity, and selecting an
arrival time later than amin(v) leads to a violation and hence to a large resource price.
This way, selecting an arrival time larger than amin(v) is possible but very expensive.
(a) Delay violation due to bad choice of arrival times and Steiner tree for the black net.
(b) After re-choosing arrival times and rerouting the black net we achieve feasibility.
Figure 3.3: Interaction of timing resources and arrival time customers. The blue net customer
consumes from the orange and the black net customer from the red resource. Their usage is
equal to the delay through the Steiner tree and the pin's source gate. Arrival time customers also
consume from timing resources. The earlier we choose the arrival time of v, the less we consume
from the red timing resource but the more we consume from the orange one.
Example of the model. Figure 3.3 shows how timing resources and arrival time customers
interact. In Figure 3.3(a), the red timing resource is violated. The three customers
consuming from it are the arrival time customers v and w, and the net connected by the
black Steiner tree. The black Steiner tree itself is short, leading to a small delay through
the inverter in the middle. However, the detour on the path to w leads to a huge delay through
the Steiner tree and hence to a large consumption from the red resource. As the arrival
time at v is late and the arrival time at w is early, both arrival time customers consume
a large amount from the red resource. On the other hand, the consumptions from the
orange timing resource are small. The customers consuming from that resource are u, v,
and the net connected by the blue Steiner tree.
Figure 3.3(b) shows how to fix the violation of the red resource. Choosing an earlier
arrival time at v leads to an increased consumption from the orange timing resource but
to a decreased consumption from the red resource. We also choose a different black Steiner
tree. Although the net length of the new tree (and hence the delay through the inverter
in the middle) is significantly larger than before, the path to w through the tree is now as
short as possible and thus the overall consumption from the red resource is smaller.
3.4 Lower and Upper Bounds on Arrival Times
We now show how to obtain the arrival time intervals [amin(v), amax(v)] for all vertices
v ∈ V (D) from which we can choose arrival times a(v) that satisfy the first condition of
Proposition 3.2. For the sake of faster convergence and better stability of the algorithm we
want to achieve that these intervals are as small as possible.
3.4.1 Arrival Time Intervals Based on Lower Delay Bounds
For e ∈ E(D) let dlb(e) be a lower bound on the delay along e. Independent of the delay
model, dlb(e) = 0 is always a valid choice. However, for some delay models better lower
bounds can be computed in polynomial time.
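By setting
\[ a^{\rightarrow}_{lb}(v) =
   \begin{cases}
     at(v) & \text{if } v \in V_{in}, \\
     \max\{a^{\rightarrow}_{lb}(u) + d_{lb}((u, v)) : (u, v) \in \delta^-(v)\} & \text{otherwise}
   \end{cases} \]
we can define lower bounds on the actual arrival time.
If $a^{\rightarrow}_{lb}(v)$ exceeds rat(v) for some v ∈ Vout, we have no chance to meet all timing constraints
and we have to relax required arrival times.
We define
\[ \tau_r := \max\bigl\{ 0,\ \max\{a^{\rightarrow}_{lb}(v) - rat(v) : v \in V_{out}\} \bigr\} \]
and propagate (possibly relaxed) required arrival times using the formulas
\[ a^{\leftarrow}_{lb}(v) =
   \begin{cases}
     rat(v) + \tau_r & \text{if } v \in V_{out}, \\
     \min\{a^{\leftarrow}_{lb}(w) - d_{lb}((v, w)) : (v, w) \in \delta^+(v)\} & \text{otherwise.}
   \end{cases} \]
Setting $a_{\min}(v) = a^{\rightarrow}_{lb}(v)$ and $a_{\max}(v) = a^{\leftarrow}_{lb}(v)$ would already be a valid choice due to
Proposition 3.3.
Proposition 3.3 It holds that $a^{\rightarrow}_{lb}(v) \le a^{\leftarrow}_{lb}(v)$ for all v ∈ V(D).
Proof For v ∈ Vout, the statement is true by definition of $\tau_r$. Otherwise, let w ∈ V(D)
such that $a^{\leftarrow}_{lb}(v) = a^{\leftarrow}_{lb}(w) - d_{lb}((v, w))$. By induction, $a^{\rightarrow}_{lb}(w) \le a^{\leftarrow}_{lb}(w)$ and hence,
\[ a^{\leftarrow}_{lb}(v) = a^{\leftarrow}_{lb}(w) - d_{lb}((v, w)) \ge a^{\rightarrow}_{lb}(w) - d_{lb}((v, w)) \ge a^{\rightarrow}_{lb}(v). \qquad \square \]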
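The two propagations can be sketched in Python as follows. The function name `lower_bound_intervals` is hypothetical. The example instance is a reading of the numbers quoted for Figure 3.4 (a path with three edges of lower delay bound 5 and a bypass through v with two edges of lower delay bound 1); the vertex names x, y, and t are assumptions of this sketch.

```python
from graphlib import TopologicalSorter

def lower_bound_intervals(edges, d_lb, at, rat):
    """Forward/backward propagation with lower delay bounds.

    at: arrival times at the vertices of V_in, rat: required arrival times at V_out.
    Returns (forward bounds, backward bounds, relaxation tau_r).
    """
    preds, succs = {}, {}
    for u, v in edges:
        preds.setdefault(v, []).append(u)
        succs.setdefault(u, []).append(v)
        preds.setdefault(u, [])
        succs.setdefault(v, [])
    order = list(TopologicalSorter(preds).static_order())  # predecessors first
    fwd = {}
    for v in order:
        fwd[v] = at[v] if v in at else max(fwd[u] + d_lb[(u, v)] for u in preds[v])
    tau = max(0, max(fwd[v] - rat[v] for v in rat))
    bwd = {}
    for v in reversed(order):
        bwd[v] = rat[v] + tau if v in rat else min(bwd[w] - d_lb[(v, w)] for w in succs[v])
    return fwd, bwd, tau

d_lb = {("s", "x"): 5, ("x", "y"): 5, ("y", "t"): 5, ("s", "v"): 1, ("v", "t"): 1}
fwd, bwd, tau = lower_bound_intervals(list(d_lb), d_lb, at={"s": 0}, rat={"t": 10})
```

On this instance the critical path forces a relaxation of 5, while the bypass vertex v keeps the wide interval [1, 14] discussed in the next section.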
3.4.2 Shrinking Arrival Time Intervals with Upper Delay Bounds
For nodes on uncritical paths, the resulting intervals can be large as Figure 3.4 shows. In
this simple example, v is completely uncritical and any time between 1 and 14 would be
a feasible choice of arrival time. Counterintuitively, arrival time selection for v seems to be a
more difficult task than for the other nodes, whose intervals contain one point only.
In the course of the resource sharing algorithm we might change a(v) several times
although any choice that leaves enough time for the incident edges would probably be
equally good.
Figure 3.4: Example of arrival time ($a^{\rightarrow}_{lb}$) and required arrival time ($a^{\leftarrow}_{lb}$) propagation with respect
to lower delay bounds $d_{lb}$ shown at each edge. All arrival time intervals $[a^{\rightarrow}_{lb}, a^{\leftarrow}_{lb}]$ are feasible but
at the uncritical vertex v the interval is large.
In practice, such a situation can lead to slow convergence and can be avoided by
tightening the intervals. To do so, we define upper bounds dub(e) on the delay along an
edge e ∈ E(D) as the delay in the worst reasonable solution.
As before, we can propagate arrival times and required arrival times with respect to
delays $d_{ub}$ to obtain
\[ a^{\rightarrow}_{ub}(v) =
   \begin{cases}
     at(v) & \text{if } v \in V_{in} \\
     \max\{a^{\rightarrow}_{ub}(u) + d_{ub}((u, v)) : (u, v) \in \delta^-(v)\} & \text{otherwise}
   \end{cases} \]
and
\[ a^{\leftarrow}_{ub}(v) =
   \begin{cases}
     rat(v) & \text{if } v \in V_{out} \\
     \min\{a^{\leftarrow}_{ub}(w) - d_{ub}((v, w)) : (v, w) \in \delta^+(v)\} & \text{otherwise.}
   \end{cases} \]
We can interpret $a^{\rightarrow}_{ub}(v)$ as the latest and $a^{\leftarrow}_{ub}(v)$ as the earliest reasonable arrival time
at v. It is not necessary to choose $a(v) > a^{\rightarrow}_{ub}(v)$ as already $a(v) = a^{\rightarrow}_{ub}(v)$ leaves enough
time for all edges on each path in D ending in v. Similarly, choosing $a(v) = a^{\leftarrow}_{ub}(v)$ leaves
enough time for all edges on a path in D starting at v, and choosing $a(v) < a^{\leftarrow}_{ub}(v)$ is
pointless.
Using this knowledge we define tightened arrival time intervals.
Definition 3.4 Let v ∈ V(D). If $a^{\rightarrow}_{ub}(v) > a^{\leftarrow}_{ub}(v)$, we define
\[ a_{\min}(v) := \max\{a^{\rightarrow}_{lb}(v), a^{\leftarrow}_{ub}(v)\} \quad\text{and}\quad
   a_{\max}(v) := \min\{a^{\leftarrow}_{lb}(v), a^{\rightarrow}_{ub}(v)\}. \]
If $a^{\rightarrow}_{ub}(v) \le a^{\leftarrow}_{ub}(v)$, we fix the arrival time at v to
\[ a_{\min}(v) := a_{\max}(v) := \frac{a^{\rightarrow}_{ub}(v) + a^{\leftarrow}_{ub}(v)}{2}. \]
We say that v is critical if $a^{\rightarrow}_{ub}(v) > a^{\leftarrow}_{ub}(v)$ and call v uncritical otherwise. If v is uncritical,
any arrival time in $[a^{\rightarrow}_{ub}(v), a^{\leftarrow}_{ub}(v)]$ leaves enough time for all paths in D ending in v or
starting with v and we can fix the arrival time to an arbitrary value within that interval.
In this thesis we use the simple solution from Definition 3.4 and select the center of this
interval. In the extended version of [Hel+17] we show how to distribute positive slack
along uncritical paths by fixing the arrival time of an uncritical vertex v to
\[ a_{\min}(v) := a_{\max}(v) := (1 - \mu(v)) \cdot a^{\rightarrow}_{ub}(v) + \mu(v) \cdot a^{\leftarrow}_{ub}(v) \]
for some choice of μ(v) ∈ [0, 1] proportional to the fraction of the delay consumed at v in a
longest path through v.
The next proposition tells us that we obtain arrival time intervals as desired.
(a) Taking into account that in each feasible solution the signal arrives at u at time 2 we can decrease
amax(v) from 8 to 6. Iterating this argument for (v, w), we can decrease amax(w) to 8.
(b) In any feasible solution the signal arrives at w′ at time 5 and selecting an arrival time earlier than
6 at w is not necessary. Using this observation we can increase amin(v) to 3.
Figure 3.5: Examples in which we can tighten arrival time intervals further. Each edge e is
labeled $d_{lb}(e)/d_{ub}(e)$. Arrival time intervals are shown next to internal vertices.
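Proposition 3.5 If $0 \le d_{lb}(e) \le d_{ub}(e)$ for all e ∈ E(D), then
1. $[a_{\min}(v), a_{\max}(v)] \ne \emptyset$ for all v ∈ V(D), and
2. $a_{\max}(v) + d_{ub}((v, w)) \le a_{\min}(w)$ for all (v, w) ∈ E(D) for which v or w is uncritical.
Proof We first show 1. If v is uncritical, $a_{\min}(v) = a_{\max}(v)$ and 1 is fulfilled trivially. Let
v be critical. If $a_{\min}(v) = a^{\rightarrow}_{lb}(v)$ and $a_{\max}(v) = a^{\leftarrow}_{lb}(v)$, then $a_{\min}(v) \le a_{\max}(v)$ follows from
Proposition 3.3. Inequality $d_{lb}(e) \le d_{ub}(e)$ guarantees $a^{\rightarrow}_{lb}(v) \le a^{\rightarrow}_{ub}(v)$ and hence yields 1
in the case $a_{\min}(v) = a^{\rightarrow}_{lb}(v)$, $a_{\max}(v) = a^{\rightarrow}_{ub}(v)$. In the remaining cases, $a_{\min}(v) = a^{\leftarrow}_{ub}(v)$.
Then $a^{\leftarrow}_{ub}(v) < a^{\rightarrow}_{ub}(v)$ follows from the definition of criticality, and $a^{\leftarrow}_{ub}(v) \le a^{\leftarrow}_{lb}(v)$
holds because $d_{lb}(e) \le d_{ub}(e)$ and $\tau_r \ge 0$. Hence,
$a_{\min}(v) \le \min\{a^{\leftarrow}_{lb}(v), a^{\rightarrow}_{ub}(v)\} = a_{\max}(v)$.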
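To prove 2 let (v, w) ∈ E(D) with v or w uncritical. If v is uncritical and w is critical,
\[ a_{\max}(v) = \frac{a^{\rightarrow}_{ub}(v) + a^{\leftarrow}_{ub}(v)}{2} \le a^{\leftarrow}_{ub}(v)
   \le a^{\leftarrow}_{ub}(w) - d_{ub}((v, w)) \le a_{\min}(w) - d_{ub}((v, w)). \]
If v is critical and w is uncritical,
\[ a_{\max}(v) + d_{ub}((v, w)) \le a^{\rightarrow}_{ub}(v) + d_{ub}((v, w)) \le a^{\rightarrow}_{ub}(w) \le a_{\min}(w). \]
If both v and w are uncritical,
\begin{align*}
a_{\max}(v) + d_{ub}((v, w))
  &= \frac{1}{2} \Bigl( \bigl( a^{\rightarrow}_{ub}(v) + d_{ub}((v, w)) \bigr) + \bigl( a^{\leftarrow}_{ub}(v) + d_{ub}((v, w)) \bigr) \Bigr) \\
  &\le \frac{1}{2} \bigl( a^{\rightarrow}_{ub}(w) + a^{\leftarrow}_{ub}(w) \bigr) \\
  &= a_{\min}(w). \qquad \square
\end{align*}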
Assuming $d_{ub}(e) = 3 \cdot d_{lb}(e)$ in the example of Figure 3.4 we have $a^{\rightarrow}_{ub}(v) = 3$ and
$a^{\leftarrow}_{ub}(v) = 7$. The uncritical vertex v gets the trivial arrival time interval [5, 5].
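Definition 3.4 can be checked on this example with a short Python sketch. The helper names `propagate` and `tightened_intervals` are hypothetical, and the instance is again a reading of the numbers quoted for Figure 3.4 with assumed vertex names x, y, and t.

```python
from graphlib import TopologicalSorter

def propagate(edges, delay, at, rat):
    """Forward arrival times and backward required arrival times for one delay map."""
    preds, succs = {}, {}
    for u, v in edges:
        preds.setdefault(v, []).append(u)
        succs.setdefault(u, []).append(v)
        preds.setdefault(u, [])
        succs.setdefault(v, [])
    order = list(TopologicalSorter(preds).static_order())
    fwd, bwd = {}, {}
    for v in order:
        fwd[v] = at[v] if v in at else max(fwd[u] + delay[(u, v)] for u in preds[v])
    for v in reversed(order):
        bwd[v] = rat[v] if v in rat else min(bwd[w] - delay[(v, w)] for w in succs[v])
    return fwd, bwd

def tightened_intervals(edges, d_lb, d_ub, at, rat):
    """Arrival time intervals [amin(v), amax(v)] following Definition 3.4."""
    fwd_lb, _ = propagate(edges, d_lb, at, rat)
    tau = max(0, max(fwd_lb[v] - rat[v] for v in rat))
    relaxed = {v: rat[v] + tau for v in rat}       # relaxed required arrival times
    _, bwd_lb = propagate(edges, d_lb, at, relaxed)
    fwd_ub, bwd_ub = propagate(edges, d_ub, at, rat)
    intervals = {}
    for v in fwd_lb:
        if fwd_ub[v] > bwd_ub[v]:                  # v is critical
            intervals[v] = (max(fwd_lb[v], bwd_ub[v]), min(bwd_lb[v], fwd_ub[v]))
        else:                                      # v is uncritical: fix the center
            center = (fwd_ub[v] + bwd_ub[v]) / 2
            intervals[v] = (center, center)
    return intervals

d_lb = {("s", "x"): 5, ("x", "y"): 5, ("y", "t"): 5, ("s", "v"): 1, ("v", "t"): 1}
d_ub = {e: 3 * d for e, d in d_lb.items()}
intervals = tightened_intervals(list(d_lb), d_lb, d_ub, at={"s": 0}, rat={"t": 10})
```

All vertices on the long path remain critical and keep their one-point intervals, while v becomes the only uncritical vertex.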
3.4.3 Further Reduction of Arrival Time Intervals
In some cases the arrival time intervals can be reduced further. The right interval border
amax(v) of vertex v in Figure 3.5(a) is given by $a^{\rightarrow}_{ub}(v) = 8$ and the computation of that
arrival time assumed the maximum delay 4 along the edge entering u. Since u is critical,
that edge has to be realized as fast as possible and the signal cannot arrive at v later than
6 in any timing-feasible solution (even if connection (u, v) is realized in the slowest possible
way). As a consequence, we can reduce amax(v) to 6. By the same argument we observe
that we can also reduce amax(w) and obtain an instance for which all arrival time intervals
contain one point only.
To perform reductions of right interval borders we traverse the nodes in V (D) in
topological order. If for v ∈ V (D),
amin(v) < amax(v) and amax(u) + dub((u, v)) < amax(v) for all (u, v) ∈ δ−(v),
we decrease amax(v) to
amax(v) := max{amin(v),max{amax(u) + dub((u, v)) : (u, v) ∈ δ−(v)}}. (3.1)
Note that this choice guarantees that all timing intervals remain non-empty. Assuming
that the signal does not arrive at a predecessor u of v later than amax(u), we can conclude
that the signal will also not arrive at v later than amax(u) + dub((u, v)) and the update is
feasible.
In the situation shown in Figure 3.5(b) we can increase the left interval border amin(v)
of vertex v. Here, the left interval borders of both successors are large – independent of
the signal’s arrival time at v. Both w and w′ do not benefit from a small arrival time at v
and choosing a(v) = 3 leaves enough time for the signal to arrive early enough at w and
w′ even if both connections leaving v are realized in the slowest possible way.
Similar to the previous algorithm we can increase lower bounds in linear time. This
time we traverse Vgate in reverse topological order. If we encounter a vertex v ∈ Vgate with
amin(v) < amax(v) and amin(v) < amin(w)− dub((v, w)) for all (v, w) ∈ δ+(v),
we set
amin(v) := min{amax(v),min{amin(w)− dub((v, w)) : (v, w) ∈ δ+(v)}}. (3.2)
The next proposition states correctness of the algorithms.
Proposition 3.6 After decreasing right interval borders and increasing left interval borders
by the above algorithms, the following statements are true:
1. Proposition 3.5 still holds,
2. for all v ∈ V (D)\Vin with amin(v) < amax(v) there is (u, v) ∈ δ−(v) such that
amax(u) + dub((u, v)) ≥ amax(v),
3. for all v ∈ Vgate with amin(v) < amax(v) there is (v, w) ∈ δ+(v) such that amin(v) +
dub((v, w)) ≥ amin(w).
Proof This follows directly from the definition of the new interval borders. 
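Both reduction passes, (3.1) and (3.2), can be sketched in Python. The function name `reduce_intervals` is hypothetical. The example is an instance consistent with the numbers quoted for Figure 3.5(a); the endpoint names t1 and t2 and their one-point intervals are assumptions of this sketch, since the figure only shows intervals at internal vertices.

```python
from graphlib import TopologicalSorter

def reduce_intervals(edges, d_ub, amin, amax, gate_inputs):
    """Apply update (3.1) in topological order, then (3.2) in reverse order."""
    amin, amax = dict(amin), dict(amax)
    preds = {v: [] for v in amin}
    succs = {v: [] for v in amin}
    for u, v in edges:
        preds[v].append(u)
        succs[u].append(v)
    order = list(TopologicalSorter(preds).static_order())
    for v in order:                                   # decrease right borders
        if preds[v] and amin[v] < amax[v]:
            latest = max(amax[u] + d_ub[(u, v)] for u in preds[v])
            if latest < amax[v]:
                amax[v] = max(amin[v], latest)        # (3.1)
    for v in reversed(order):                         # increase left borders
        if v in gate_inputs and succs[v] and amin[v] < amax[v]:
            earliest = min(amin[w] - d_ub[(v, w)] for w in succs[v])
            if amin[v] < earliest:
                amin[v] = min(amax[v], earliest)      # (3.2)
    return amin, amax

d_ub = {("s", "u"): 4, ("u", "t1"): 4, ("u", "v"): 4, ("v", "w"): 2, ("w", "t2"): 2}
amin = {"s": 0, "u": 2, "v": 6, "w": 8, "t1": 4, "t2": 10}
amax = {"s": 0, "u": 2, "v": 8, "w": 9, "t1": 4, "t2": 10}
new_amin, new_amax = reduce_intervals(list(d_ub), d_ub, amin, amax, {"u", "v", "w"})
```

Because the right borders are processed in topological order, the reduction at v is already visible when w is processed, which is the iteration of the argument described for Figure 3.5(a).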
3.5 Properties of Low-Congestion Solutions
The next theorem tells us that all signals arrive in time if there is no violation of a timing
resource. Consequently, we have expressed our timing constraints in terms of resource
sharing correctly.
Theorem 3.7 Assume that $\tau_r = 0$. Let $\mathcal{N}$ be the set of all nets and let $(sol(c))_{c \in C}$ be a
solution, i. e.
• sol(c) is a (possibly fractional) routing of c if c is a net customer (i. e. $c \in \mathcal{N}$), and
• sol(c) ∈ [amin(c), amax(c)] is an arrival time if c ∈ V(D)\Vin is an arrival time
customer.
The following holds:
1. If $(sol(c))_{c \in C}$ does not violate any capacity constraint of a timing or relaxation
resource, the (possibly fractional) routing $(sol(N))_{N \in \mathcal{N}}$ meets all timing constraints.
2. If the routing $(sol(N))_{N \in \mathcal{N}}$ meets all timing constraints and the delay along each edge
e ∈ E(D) defined by that solution lies between $d_{lb}(e)$ and $d_{ub}(e)$, there exist arrival
times sol(v) ∈ [amin(v), amax(v)] for v ∈ Vgate ∪ Vout such that the solution $(sol(c))_{c \in C}$
with
\[ sol(c) =
   \begin{cases}
     sol(N) & \text{if } c = N \text{ is a net customer} \\
     sol(v) & \text{if } c = v \text{ is an arrival time customer}
   \end{cases} \]
does not violate any timing or relaxation resource.
Proof To prove Statement 1 we use Proposition 3.2 and show that for any signal path
$P = v_1 v_2 \dots v_k$ in D consisting of nets $N_1, \dots, N_{k-1}$,
\[ sol(v_i) + d_i \le sol(v_{i+1}) \quad \text{holds for } i = 1, \dots, k - 1, \]
where $d_i := usg_{N_i,(v_i, v_{i+1})}(sol(N_i)) \cdot (a_{\max}(v_{i+1}) - a_{\min}(v_i))$ is the delay from $v_i$ to $v_{i+1}$ and
$sol(v_1) := at(v_1)$ denotes the signal's arrival time at its origin $v_1 \in V_{in}$.
Correctness of the inequality follows from the equivalence
\[ sol(v_i) + d_i \le sol(v_{i+1}) \iff
   (sol(v_i) - a_{\min}(v_i)) + d_i + (a_{\max}(v_{i+1}) - sol(v_{i+1})) \le a_{\max}(v_{i+1}) - a_{\min}(v_i). \tag{3.3} \]
For i > 1, the summand $sol(v_i) - a_{\min}(v_i)$ is exactly the usage of the arrival time customer
$v_i$ from timing resource $(v_i, v_{i+1})$ (and 0 for i = 1). The summand $a_{\max}(v_{i+1}) - sol(v_{i+1})$
is equal to the usage of arrival time customer $v_{i+1}$ from that resource. Hence, the left-hand
side of the second inequality in (3.3) is equal to the total usage of resource $(v_i, v_{i+1})$, which
is upper bounded by its capacity $a_{\max}(v_{i+1}) - a_{\min}(v_i)$ by assumption.
Since relax($v_k$) is not violated and $\tau_r = 0$, $sol(v_k) = a_{\min}(v_k) = rat(v_k)$. Hence,
$at(v_1) + \sum_{i=1}^{k-1} d_i \le rat(v_k)$.
Now we show Statement 2. For e ∈ E(D) let d(e) be the delay along e in routing
$(sol(N))_{N \in \mathcal{N}}$. We use static timing analysis (Section 2.5.3) to compute arrival times at(v)
for v ∈ V(D)\Vin by forward propagation of delays d(e) (e ∈ E(D)) starting with the
initial arrival times at(v) for v ∈ Vin. For the sake of simplicity we consider timing start
points v ∈ Vin as arrival time customers with arrival time interval [at(v), at(v)] in this
proof.
For v ∈ V (D) we define sol(v) as the projection of at(v) into [amin(v), amax(v)], i. e.
sol(v) = min{max{amin(v), at(v)}, amax(v)}.
Since the routing solution is timing-feasible and by Proposition 3.2,
at(v) + d(e) ≤ at(w) for all e = (v, w) ∈ E(D). We show by case distinction that the
inequalities sol(v) + d(e) ≤ sol(w) hold as well. Let e = (v, w) ∈ E(D).
If v or w is uncritical, Proposition 3.5 implies $sol(v) + d(e) \le a_{\max}(v) + d_{ub}(e) \le
a_{\min}(w) \le sol(w)$. It remains to consider the case that both v and w are critical.
• If sol(v) ≤ at(v) and sol(w) ≥ at(w), then sol(w) ≥ at(w) ≥ at(v) + d(e) ≥ sol(v) + d(e).
• Now we consider the case at(v) < sol(v), in particular $sol(v) = a_{\min}(v)$.
  – If the left border of v has been increased during the arrival time interval reduction
    algorithm, $sol(w) \ge a_{\min}(w) \overset{(3.2)}{\ge} a_{\min}(v) + d(e) = sol(v) + d(e)$.
  – The case $sol(v) = a^{\rightarrow}_{lb}(v)$ is not possible as $d(e') \ge d_{lb}(e')$ for all e′ ∈ E(D) and
    hence, $sol(v) > at(v) \ge a^{\rightarrow}_{lb}(v)$.
  – In the remaining case, $sol(v) = a^{\leftarrow}_{ub}(v)$ and $sol(w) \ge a_{\min}(w) \ge a^{\leftarrow}_{ub}(w) \ge
    a^{\leftarrow}_{ub}(v) + d(e) = sol(v) + d(e)$.
• Finally, assume that at(w) > sol(w) (⇒ $sol(w) = a_{\max}(w)$) and $sol(v) > a_{\min}(v)$.
  – If the right border of w has been decreased during the arrival time interval
    reduction algorithm, $sol(v) \le a_{\max}(v) \overset{(3.1)}{\le} a_{\max}(w) - d(e) = sol(w) - d(e)$.
  – The case $sol(w) = a^{\rightarrow}_{ub}(w)$ is impossible due to the inequality $d(e') \le d_{ub}(e')$ for
    all e′ ∈ E(D), which implies $at(w) \le a^{\rightarrow}_{ub}(w)$.
  – If $a^{\leftarrow}_{lb}(w) = sol(w)$, then $sol(v) \le a_{\max}(v) \le a^{\leftarrow}_{lb}(v) \le a^{\leftarrow}_{lb}(w) - d(e) = sol(w) - d(e)$.
As the routing meets all timing constraints, $\tau_r = 0$ and at(v) ≤ rat(v) for v ∈ Vout.
We conclude that sol(v) ≤ rat(v) for all timing endpoints and hence, no timing endpoint
relaxation resources are violated. With (3.3) we conclude that no timing resources are
violated. □
It is also possible to give a theoretical lower bound on the worst slack if no resource is
violated by more than a factor of 1 + β.
Theorem 3.8 Let $(sol(c))_{c \in C}$ be a solution as in Theorem 3.7. For e ∈ E(D) let d(e) be
the delay along e in that solution.
If no resource is used by more than 1 + β, then the worst slack
\[ \min_{\substack{P \text{ path from} \\ v_0 \in V_{in} \text{ to } v_k \in V_{out}}}
   \Bigl( rat(v_k) - at(v_0) - \sum_{e \in E(P)} d(e) \Bigr) \]
is at least $-(\tau_r + \beta H)$, where
\[ H := \max_{\substack{P \text{ path from} \\ V_{in} \text{ to } V_{out}}} \Bigl( \sum_{e \in E(P)} d_{ub}(e) \Bigr)
   + \max_{\substack{P \text{ path from} \\ V_{in} \text{ to } V_{out}}} \Bigl( \sum_{e \in E(P)} (d_{ub}(e) - d_{lb}(e)) \Bigr) |E(P)|. \]
We omit a detailed proof here but refer to [Hel+17].
3.6 Block Solvers
The most important step in the algorithm of Müller, Radke, and Vygen [MRV11] is to find
sol(c) ∈ B(c) for customer c ∈ C such that
\[ \sum_{r \in R} price(r) \cdot usg_{c,r}(sol(c)) \]
is (approximately) minimum.
An algorithm performing this task is called block solver. The call to the block solver is done
in step ⑤ of Algorithm 2. For the Standard Global Routing Problem the customers are
exactly the nets in our netlist and the task of their block solvers is the Minimum Cost Steiner
Tree Problem. The Minimum Cost Steiner Tree Problem can be approximated within a
factor of 1.39 in polynomial time ([Byr+13], [Goe+12]) or within a factor of 2 in almost-
linear time by the algorithm of Prim for the Minimum Spanning Tree Problem [Pri57].
This situation changes completely after adding timing resources and arrival time customers.
Apart from the fact that we have a different type of customers now, the task of the block
solver for the net customer gets more complicated when taking timing into account. In
addition to resource usage costs for congestion, net length, and power resources (that can
be translated to edge costs in the global routing graph), net customers have to pay for
consumption from timing resources. The exact block solver depends on the particular delay
model and for most common delay models, the resulting task is difficult. Often, constant
factor approximations can only exist for special cases unless P=NP.
Since most of the later chapters of this thesis will be dedicated to the problem of finding
block solvers of net customers for various delay models (Chapters 4 to 7) we restrict to
block solvers for arrival time customers here.
Block solvers for arrival time customers. The task of the block solver for arrival
time customer v ∈ Vgate is to find an element a(v) ∈ [amin(v), amax(v)] minimizing
\[ f(a(v)) := \sum_{(u,v) \in \delta^-(v)} \Bigl( price((u, v)) \cdot \frac{a_{\max}(v) - a(v)}{a_{\max}(v) - a_{\min}(u)} \Bigr)
   + \sum_{(v,w) \in \delta^+(v)} \Bigl( price((v, w)) \cdot \frac{a(v) - a_{\min}(v)}{a_{\max}(w) - a_{\min}(v)} \Bigr). \]
Similarly, the block solver for an arrival time customer v ∈ Vout of a timing endpoint
has to find a(v) ∈ [amin(v), amax(v)] minimizing
\[ f(a(v)) := price(relax(v)) \cdot \Bigl( 1 + \beta \cdot \frac{a(v) - a_{\min}(v)}{a_{\max}(v) - a_{\min}(v)} \Bigr)
   + \sum_{(u,v) \in \delta^-(v)} \Bigl( price((u, v)) \cdot \frac{a_{\max}(v) - a(v)}{a_{\max}(v) - a_{\min}(u)} \Bigr), \]
which is equivalent to minimizing
\[ f(a(v)) - price(relax(v))
   = \beta \cdot price(relax(v)) \cdot \frac{a(v) - a_{\min}(v)}{a_{\max}(v) - a_{\min}(v)}
   + \sum_{(u,v) \in \delta^-(v)} \Bigl( price((u, v)) \cdot \frac{a_{\max}(v) - a(v)}{a_{\max}(v) - a_{\min}(u)} \Bigr). \]
Figure 3.6: Problem of the simple block solver for timing resources. We assume that price adjust
parameter γ in Algorithm 2 is 1 and the current usages of all timing resources are 0. Choosing
arrival time 2 at v is optimum but after the price update, 1 is optimum.
This can be considered a special case of the task of a block solver for an arrival time
customer in Vgate: Insert a new vertex v′ and a new edge (v, v′) for v ∈ Vout. The new
edge represents relax(v) and gets price price((v, v′)) := β · price(relax(v)). After setting
$a_{\max}(v') := a_{\max}(v)$ and $\delta^+(v) := \{(v, v')\}$ we can write f(a(v)) − price(relax(v)) in the
form of the objective f(a(v)) of an arrival time customer in Vgate.
In the remainder of Section 3.6 we consider a particular arrival time customer v ∈
Vgate ∪ Vout and show how to compute a(v).
3.6.1 A Simple but Unstable Block Solver for Arrival Time Customers
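Since f is linear with slope
\[ \sum_{(v,w) \in \delta^+(v)} \frac{price((v, w))}{a_{\max}(w) - a_{\min}(v)}
   - \sum_{(u,v) \in \delta^-(v)} \frac{price((u, v))}{a_{\max}(v) - a_{\min}(u)}, \tag{3.4} \]
it attains its minimum at the borders and the following choice of a(v) is optimum:
\[ a(v) =
   \begin{cases}
     a_{\min}(v) & \text{if } (3.4) \ge 0 \\
     a_{\max}(v) & \text{otherwise.}
   \end{cases} \]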
Although this choice is optimum, it leads to slow convergence in practice. Even if
preceding and succeeding resources are (nearly) balanced, one of the borders of the timing
interval is returned. In case of poor lower and upper bounds (Section 3.4), these intervals
are large and switching from amin(v) to amax(v) or vice versa between two iterations can
change the situation completely.
Figure 3.6 shows how quickly a switch between interval borders can occur. Recall
that the algorithm of Müller, Radke, and Vygen [MRV11] updates resource prices as
$price(r) := price(r) \cdot e^{\gamma \cdot \xi \cdot usg_{c,r}(sol(c))}$. In Figure 3.6, γ and ξ are chosen to be 1 and initially, the usages
from timing resources are zero, leading to a resource price of 1 for each resource. Thus,
a(v) = 2 is the optimum arrival time at v. The solution a(v) = 2 induces a relative resource
usage of 1/2 from all timing resources corresponding to edges leaving v and hence a
resource price of $\sqrt{e} > 1.6$ for these resources. From the other timing resources, nothing is
consumed. Consequently, after choosing a(v) = 2 (and the resulting price update step),
the other interval border a(v) = 1 has become optimum and the previous solution a(v) = 2
has become the overall worst solution.
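The switch can be reproduced with a small Python sketch of the simple block solver. The function names are hypothetical. The instance follows the description of Figure 3.6 as read from the text: three incoming and two outgoing timing resources at v, all with price 1 and capacity 2, and the interval [1, 2] at v.

```python
import math

def slope(in_edges, out_edges):
    """Slope (3.4); every edge is given as a pair (price, capacity)."""
    return (sum(p / c for p, c in out_edges)
            - sum(p / c for p, c in in_edges))

def simple_block_solver(amin_v, amax_v, in_edges, out_edges):
    """Return the interval border minimizing the linear function f."""
    return amin_v if slope(in_edges, out_edges) >= 0 else amax_v

amin_v, amax_v, gamma, xi = 1.0, 2.0, 1.0, 1.0
in_edges = [(1.0, 2.0)] * 3      # capacity amax(v) - amin(u) = 2 - 0
out_edges = [(1.0, 2.0)] * 2     # capacity amax(w) - amin(v) = 3 - 1

first = simple_block_solver(amin_v, amax_v, in_edges, out_edges)

# Price update for the chosen arrival time (step 8 of Algorithm 2).
out_edges = [(p * math.exp(gamma * xi * (first - amin_v) / c), c) for p, c in out_edges]
in_edges = [(p * math.exp(gamma * xi * (amax_v - first) / c), c) for p, c in in_edges]

second = simple_block_solver(amin_v, amax_v, in_edges, out_edges)
```

A single price update is enough to flip the sign of the slope, so the solver jumps from one border of the interval to the other.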
① for i = 1, . . . , n do
②     compute $a_i(v) \in \{a_{\min}(v), a_{\max}(v)\}$ minimizing $f(a_i(v))$
③     set $price(r) := price(r) \cdot e^{\gamma \cdot \frac{1}{n} \cdot usg_{v,r}(a_i(v))}$ for r ∈ R
④ return $a(v) := \frac{1}{n} \sum_{i=1}^{n} a_i(v)$.
Algorithm 3: Iterated block solver for arrival time customers.
3.6.2 Stabilizing Arrival Time Computation by Iteration
This problem can be overcome by running the block solver for n > 1 iterations in each
resource sharing phase, thereby selecting a fraction of 1/n from each computed solution
(Algorithm 3).
Following the proof of Müller, Radke, and Vygen [MRV11], taking only a small fraction of
a solution returned by the block solver does not invalidate the approximation ratio of
their result. Decreasing ξ in line ⑥ of Algorithm 2 has an impact on running time only.
For n → ∞ we will end up with an arrival time that is still optimum after the price update:
Theorem 3.9 Let γ ≥ 0 and let
\[ g(a) := \sum_{(v,w) \in \delta^+(v)} \frac{price((v, w))}{a_{\max}(w) - a_{\min}(v)} \cdot e^{\gamma \cdot \frac{a - a_{\min}(v)}{a_{\max}(w) - a_{\min}(v)}}
   - \sum_{(u,v) \in \delta^-(v)} \frac{price((u, v))}{a_{\max}(v) - a_{\min}(u)} \cdot e^{\gamma \cdot \frac{a_{\max}(v) - a}{a_{\max}(v) - a_{\min}(u)}}. \]
Let a(v) be the output of Algorithm 3, let $\hat{a}$ be the unique root of g(a) and let
$a^* := \min\{\max\{a_{\min}(v), \hat{a}\}, a_{\max}(v)\}$ be the projection of $\hat{a}$ into the arrival time interval
[amin(v), amax(v)].
It holds that $|a(v) - a^*| \le \frac{a_{\max}(v) + a_{\min}(v)}{n}$. In particular, the output of Algorithm 3
converges to $a^*$ for n → ∞.
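Proof We may assume that amin(v) < amax(v) as otherwise, the statement is trivial.
First note that g is strictly monotonically increasing. We denote by $price_i(r)$ the price of
resource r ∈ R and by $f_i$ the function f at the beginning of iteration i. By definition of
the price update step in Algorithm 2 it holds for (v, w) ∈ δ+(v) and (u, v) ∈ δ−(v) that
\[ price_i((v, w)) = price_1((v, w)) \cdot e^{\frac{\gamma}{n} \cdot \sum_{i'=1}^{i-1} \frac{a_{i'}(v) - a_{\min}(v)}{a_{\max}(w) - a_{\min}(v)}} \quad\text{and}\quad
   price_i((u, v)) = price_1((u, v)) \cdot e^{\frac{\gamma}{n} \cdot \sum_{i'=1}^{i-1} \frac{a_{\max}(v) - a_{i'}(v)}{a_{\max}(v) - a_{\min}(u)}}. \]
If the root $\hat{a}$ of g satisfies $\hat{a} \le a_{\min}(v)$, then $0 \le g(a_{\min}(v))$ and for i ≤ n:
\begin{align*}
f_i(a_{\min}(v))
  &= \sum_{(u,v) \in \delta^-(v)} price_i((u, v)) \cdot \frac{a_{\max}(v) - a_{\min}(v)}{a_{\max}(v) - a_{\min}(u)} \\
  &< \sum_{(u,v) \in \delta^-(v)} price_1((u, v)) \cdot e^{\gamma \cdot \frac{a_{\max}(v) - a_{\min}(v)}{a_{\max}(v) - a_{\min}(u)}} \cdot \frac{a_{\max}(v) - a_{\min}(v)}{a_{\max}(v) - a_{\min}(u)} \\
  &\le \sum_{(v,w) \in \delta^+(v)} price_1((v, w)) \cdot \frac{a_{\max}(v) - a_{\min}(v)}{a_{\max}(w) - a_{\min}(v)} \\
  &\le f_i(a_{\max}(v)).
\end{align*}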
Global Routing from a Timing Point of View 35
Hence, we will always choose the left interval border and a(v) = amin(v) = a∗. Analogously,
if a∗ ≥ amax(v), we will always choose the right interval border and end up with a(v) =
amax(v) = a
∗.
In the case amin(v) < a∗ < amax(v) we show that for fixed n, Algorithm 3 selects
solution amin(v) at most qn times, where
qn =
⌈
amax(v)− a∗
amax(v)− amin(v) · n
⌉
.
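In the beginning of an iteration i ≤ n in which we have already selected the left interval
border q_n times, it holds that
\[
\begin{aligned}
f_i(a_{\min}(v))
&= \sum_{(u,v)\in\delta^-(v)} \mathrm{price}_1((u,v)) \cdot e^{\frac{\gamma}{n}\cdot q_n\cdot\frac{a_{\max}(v)-a_{\min}(v)}{a_{\max}(v)-a_{\min}(u)}} \cdot \frac{a_{\max}(v)-a_{\min}(v)}{a_{\max}(v)-a_{\min}(u)} \\
&\ge \sum_{(u,v)\in\delta^-(v)} \mathrm{price}_1((u,v)) \cdot e^{\gamma\cdot\frac{a_{\max}(v)-a^*}{a_{\max}(v)-a_{\min}(u)}} \cdot \frac{a_{\max}(v)-a_{\min}(v)}{a_{\max}(v)-a_{\min}(u)} \\
&= \sum_{(v,w)\in\delta^+(v)} \mathrm{price}_1((v,w)) \cdot e^{\gamma\cdot\frac{a^*-a_{\min}(v)}{a_{\max}(w)-a_{\min}(v)}} \cdot \frac{a_{\max}(v)-a_{\min}(v)}{a_{\max}(w)-a_{\min}(v)} \\
&= \sum_{(v,w)\in\delta^+(v)} \mathrm{price}_1((v,w)) \cdot e^{\frac{\gamma}{n}\left(n-n\cdot\frac{a_{\max}(v)-a^*}{a_{\max}(v)-a_{\min}(v)}\right)\cdot\frac{a_{\max}(v)-a_{\min}(v)}{a_{\max}(w)-a_{\min}(v)}} \cdot \frac{a_{\max}(v)-a_{\min}(v)}{a_{\max}(w)-a_{\min}(v)} \\
&> \sum_{(v,w)\in\delta^+(v)} \mathrm{price}_1((v,w)) \cdot e^{\frac{\gamma}{n}\left((i-1)-n\cdot\frac{a_{\max}(v)-a^*}{a_{\max}(v)-a_{\min}(v)}\right)\cdot\frac{a_{\max}(v)-a_{\min}(v)}{a_{\max}(w)-a_{\min}(v)}} \cdot \frac{a_{\max}(v)-a_{\min}(v)}{a_{\max}(w)-a_{\min}(v)} \\
&\ge \sum_{(v,w)\in\delta^+(v)} \mathrm{price}_1((v,w)) \cdot e^{\frac{\gamma}{n}\left((i-1)-q_n\right)\cdot\frac{a_{\max}(v)-a_{\min}(v)}{a_{\max}(w)-a_{\min}(v)}} \cdot \frac{a_{\max}(v)-a_{\min}(v)}{a_{\max}(w)-a_{\min}(v)} \\
&= f_i(a_{\max}(v)).
\end{aligned}
\]
Thus, we select ai(v) = amax(v) in that iteration.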
An analogous calculation shows that we select the right interval border amax(v) at most
$\bar{q}_n$ times with
\[
\bar{q}_n := \left\lceil \frac{a^*-a_{\min}(v)}{a_{\max}(v)-a_{\min}(v)} \cdot n \right\rceil .
\]
From
\[
\left(\frac{a_{\max}(v)-a^*}{a_{\max}(v)-a_{\min}(v)} \cdot n\right) + \left(\frac{a^*-a_{\min}(v)}{a_{\max}(v)-a_{\min}(v)} \cdot n\right) = n
\]
we conclude that $n \le q_n + \bar{q}_n \le n + 1$ and hence,
• ai(v) = amin(v) must have been selected in at least $q_n - 1$ iterations, and
• ai(v) = amax(v) must have been selected in at least $\bar{q}_n - 1$ iterations.
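Hence,
\[
a(v) \le \frac{1}{n}\left(\left(\frac{a_{\max}(v)-a^*}{a_{\max}(v)-a_{\min}(v)}\cdot n+1\right) a_{\min}(v) + \left(\frac{a^*-a_{\min}(v)}{a_{\max}(v)-a_{\min}(v)}\cdot n+1\right) a_{\max}(v)\right) = a^* + \frac{a_{\max}(v)+a_{\min}(v)}{n},
\]
and
\[
a(v) \ge \frac{1}{n}\left(\left(\frac{a_{\max}(v)-a^*}{a_{\max}(v)-a_{\min}(v)}\cdot n-1\right) a_{\min}(v) + \left(\frac{a^*-a_{\min}(v)}{a_{\max}(v)-a_{\min}(v)}\cdot n-1\right) a_{\max}(v)\right) = a^* - \frac{a_{\max}(v)+a_{\min}(v)}{n}.
\]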

Theorem 3.9 implies that Algorithm 3 needs
\[
O\left( (|\delta^+(v)| + |\delta^-(v)|) \cdot \frac{a_{\max}(v)+a_{\min}(v)}{\delta} \right)
\]
time to approximate a∗ up to additive accuracy δ > 0.
3.6.3 Stabilizing Arrival Time Computation with Newton’s Method
Instead of spending this pseudo-polynomial running time we use Newton’s method to
approximate a∗.
Theorem 3.10 We can approximate a∗ from Theorem 3.9 up to accuracy δ > 0 in time
\[
O\left( (|\delta^+(v)| + |\delta^-(v)|) \cdot \log\log\left(\frac{a_{\max}(v)-a_{\min}(v)}{\delta}\right) \right).
\]
Proof If g(amin(v)) ≥ 0 or g(amax(v)) ≤ 0 we return a∗ = amin(v) or a∗ = amax(v),
respectively. So assume that amin(v) < a∗ < amax(v), i. e. the root of g lies strictly inside the interval.
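Recall that function g(a) is of the form
\[
\begin{aligned}
g(a) :={}& \sum_{(v,w)\in\delta^+(v)} \frac{\mathrm{price}((v,w))}{a_{\max}(w)-a_{\min}(v)} \cdot e^{\gamma\cdot\frac{a-a_{\min}(v)}{a_{\max}(w)-a_{\min}(v)}}
 - \sum_{(u,v)\in\delta^-(v)} \frac{\mathrm{price}((u,v))}{a_{\max}(v)-a_{\min}(u)} \cdot e^{\gamma\cdot\frac{a_{\max}(v)-a}{a_{\max}(v)-a_{\min}(u)}} \\
={}& \sum_{(v,w)\in\delta^+(v)} \frac{\mathrm{price}((v,w))}{a_{\max}(w)-a_{\min}(v)} \cdot e^{\gamma\cdot\frac{a_{\max}(v)-a_{\min}(v)}{a_{\max}(w)-a_{\min}(v)}\cdot\left(\frac{a-a_{\min}(v)}{a_{\max}(v)-a_{\min}(v)}\right)} \\
&+ \sum_{(u,v)\in\delta^-(v)} \left(-\frac{\mathrm{price}((u,v))}{a_{\max}(v)-a_{\min}(u)} \cdot e^{\gamma\cdot\frac{a_{\max}(v)-a_{\min}(v)}{a_{\max}(v)-a_{\min}(u)}}\right) \cdot e^{\gamma\cdot\left(-\frac{a_{\max}(v)-a_{\min}(v)}{a_{\max}(v)-a_{\min}(u)}\right)\cdot\left(\frac{a-a_{\min}(v)}{a_{\max}(v)-a_{\min}(v)}\right)} .
\end{aligned}
\]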
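Hence, by substituting $x(a) := \frac{a-a_{\min}(v)}{a_{\max}(v)-a_{\min}(v)}$ and by defining
\[
p(e) := \begin{cases}
\dfrac{\mathrm{price}((v,w))}{a_{\max}(w)-a_{\min}(v)} & \text{if } e = (v,w) \in \delta^+(v) \\[2ex]
-\dfrac{\mathrm{price}((u,v))}{a_{\max}(v)-a_{\min}(u)} \cdot e^{\gamma\cdot\frac{a_{\max}(v)-a_{\min}(v)}{a_{\max}(v)-a_{\min}(u)}} & \text{if } e = (u,v) \in \delta^-(v),
\end{cases}
\]
and
\[
q(e) := \begin{cases}
\dfrac{a_{\max}(v)-a_{\min}(v)}{a_{\max}(w)-a_{\min}(v)} & \text{if } e = (v,w) \in \delta^+(v) \\[2ex]
-\dfrac{a_{\max}(v)-a_{\min}(v)}{a_{\max}(v)-a_{\min}(u)} & \text{if } e = (u,v) \in \delta^-(v),
\end{cases}
\]
we can write g(a) = g(x(a)), where, as a function of x, g is of the form
\[
g(x) = \sum_{j=1}^{m} p_j e^{\gamma q_j x}
\]
for m = |δ+(v)| + |δ−(v)| and constants $p_j, q_j \in \mathbb{R}$ with $p_j \cdot q_j > 0$ and $|q_j| \le 1$ for
j = 1, . . . , m.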
Starting with an arbitrary x0 ∈ [0, 1], Newton’s method iteratively sets $x_{i+1} := x_i - \frac{g(x_i)}{g'(x_i)}$.
By Taylor’s theorem,
\[
0 = g(x(a^*)) = g(x_i) + g'(x_i) \cdot (x(a^*) - x_i) + \frac{g''(\xi)\,(x(a^*) - x_i)^2}{2}
\]
for some value ξ ∈ [x(a∗), xi] ∪ [xi, x(a∗)], which implies
\[
|x(a^*) - x_{i+1}| = \frac{|g''(\xi)|}{2\,|g'(x_i)|} \cdot |x(a^*) - x_i|^2 .
\]
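Now,
\[
\frac{|g''(\xi)|}{|g'(x_i)|}
= \frac{\left|\sum_{j=1}^{m} p_j q_j^2 \gamma^2 e^{\gamma q_j \xi}\right|}{\left|\sum_{j=1}^{m} p_j q_j \gamma e^{\gamma q_j x_i}\right|}
\le \max_{j=1}^{m} \frac{|p_j q_j^2 \gamma^2 e^{\gamma q_j \xi}|}{|p_j q_j \gamma e^{\gamma q_j x_i}|}
\le \max_{j=1}^{m} |q_j| \gamma \cdot e^{\gamma\cdot|\xi - x_i|}
\le \gamma \cdot e^{\gamma\cdot|x(a^*) - x_i|}
\]
for all xi and all ξ between x(a∗) and xi, and hence,
\[
|x(a^*) - x_i| \le \frac{\gamma \cdot e^{\gamma\cdot|x(a^*) - x_{i-1}|}}{2} \cdot |x(a^*) - x_{i-1}|^2 .
\]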
As starting point x0 we choose a point with $|x_0 - x(a^*)| \le \frac{1}{2\gamma}$. Using the fact that g
is monotonically increasing, this can be achieved by a constant number of iterations of
binary search for fixed γ:
Initially, we set ymin := 0 and ymax := 1. While $y_{\max} - y_{\min} > \frac{1}{\gamma}$ we halve the length
of interval [ymin, ymax] without losing the property that root x(a∗) is contained in it. If
$g\left(\frac{y_{\max}+y_{\min}}{2}\right) < 0$, we know $x(a^*) \ge \frac{y_{\max}+y_{\min}}{2}$ by monotonicity of g and we can update
ymin to $\frac{y_{\max}+y_{\min}}{2}$. Otherwise, $x(a^*) \in \left[y_{\min}, \frac{y_{\max}+y_{\min}}{2}\right]$ and we set $y_{\max} := \frac{y_{\max}+y_{\min}}{2}$.
This binary search procedure needs $O(\max\{1, \log \gamma\}) = O(1)$ iterations for fixed γ and after termination,
$x_0 := \frac{y_{\max}+y_{\min}}{2}$ fulfills $|x_0 - x(a^*)| \le \frac{1}{2\gamma}$.
Now we show by induction that for all i ≥ 0,
1. $|x(a^*) - x_i| \le |x(a^*) - x_0|$ and
2. $|x(a^*) - x_i| \le \left(\frac{\gamma\cdot e^{\gamma|x(a^*)-x_0|}}{2}\right)^{2^i-1} \cdot |x(a^*) - x_0|^{2^i}$.
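Since both statements are trivial for i = 0 we assume that i > 0. It holds that
\[
\begin{aligned}
|x(a^*) - x_i| &\le \frac{\gamma \cdot e^{\gamma\cdot|x(a^*)-x_{i-1}|}}{2} \cdot |x(a^*) - x_{i-1}|^2 \\
&\le \left(\frac{\gamma \cdot e^{\gamma\cdot|x(a^*)-x_0|}}{2} \cdot |x(a^*) - x_0|\right) \cdot |x(a^*) - x_0| \\
&\le \frac{\sqrt{e}}{4} \cdot |x(a^*) - x_0| \\
&< 0.413 \cdot |x(a^*) - x_0|
\end{aligned}
\]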
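which implies 1. Statement 2 follows from
\[
|x(a^*) - x_i| \le \frac{\gamma \cdot e^{\gamma\cdot|x(a^*)-x_0|}}{2} \cdot \left(\left(\frac{\gamma \cdot e^{\gamma|x(a^*)-x_0|}}{2}\right)^{2^{i-1}-1} \cdot |x(a^*) - x_0|^{2^{i-1}}\right)^2
= \left(\frac{\gamma \cdot e^{\gamma|x(a^*)-x_0|}}{2}\right)^{2^i-1} \cdot |x(a^*) - x_0|^{2^i} .
\]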
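We can use 2 to bound the error in iteration i as
\[
|x(a^*) - x_i| \le \left(\frac{\gamma \cdot e^{\gamma|x(a^*)-x_0|}}{2}\right)^{2^i-1} \cdot |x(a^*) - x_0|^{2^i}
\le \left(\frac{\gamma \cdot \sqrt{e}}{2}\right)^{2^i-1} \cdot \left(\frac{1}{2\gamma}\right)^{2^i}
\le \frac{1}{\gamma} \cdot \frac{1}{2^{2^i}} .
\]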
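Thus, after $i = \log\log\left(\frac{a_{\max}(v)-a_{\min}(v)}{\delta\cdot\gamma}\right)$ Newton steps, each of which takes O(|δ+(v)| +
|δ−(v)|) time, the absolute errors can be bounded as
\[
|x(a^*) - x_i| \le \frac{1}{\gamma} \cdot \frac{1}{2^{2^{\log\log\left(\frac{a_{\max}(v)-a_{\min}(v)}{\delta\cdot\gamma}\right)}}} = \frac{\delta}{a_{\max}(v)-a_{\min}(v)}
\]
and $|a^* - x^{-1}(x_i)| = (a_{\max}(v) - a_{\min}(v)) \cdot |x(a^*) - x_i| \le \delta$. □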
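The procedure from the proof can be sketched in Python as follows. This is a minimal illustration under stated assumptions and not the implementation used in this thesis: the names `newton_root` and `to_arrival_time`, the step-size stopping rule with tolerance `tol`, and the cap `max_steps` are hypothetical choices (the proof instead fixes the number of Newton steps in advance). The input is the transformed function g(x) = Σ_j p_j e^{γ q_j x} with p_j · q_j > 0, and we assume g(0) < 0 < g(1).

```python
import math

def newton_root(p, q, gamma, tol=1e-12, max_steps=50):
    """Approximate the root in [0, 1] of g(x) = sum_j p[j] * exp(gamma * q[j] * x).

    Assumes p[j] * q[j] > 0 for all j (so g is strictly increasing)
    and g(0) < 0 < g(1). Returns the approximation and the number of Newton steps.
    """
    def g(x):
        return sum(pj * math.exp(gamma * qj * x) for pj, qj in zip(p, q))

    def dg(x):
        return sum(pj * qj * gamma * math.exp(gamma * qj * x) for pj, qj in zip(p, q))

    # Binary search: shrink [y_min, y_max] around the root until its length
    # is at most 1/gamma, so the midpoint is within 1/(2*gamma) of the root.
    y_min, y_max = 0.0, 1.0
    while y_max - y_min > 1.0 / gamma:
        mid = 0.5 * (y_min + y_max)
        if g(mid) < 0:
            y_min = mid
        else:
            y_max = mid
    x = 0.5 * (y_min + y_max)

    # Newton steps x_{i+1} := x_i - g(x_i) / g'(x_i).
    steps = 0
    while steps < max_steps:
        step = g(x) / dg(x)
        x -= step
        steps += 1
        if abs(step) <= tol:  # hypothetical stopping rule, not taken from the proof
            break
    return x, steps

def to_arrival_time(x, a_min, a_max):
    """Invert the substitution x(a) = (a - a_min) / (a_max - a_min)."""
    return a_min + x * (a_max - a_min)
```

For example, with p = (1, −e^{2.4}), q = (1, −1) and γ = 4, the function is g(x) = e^{4x} − e^{2.4−4x}, whose root is x = 0.3.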
3.7 Overall Algorithm
After extending our resource sharing model according to Section 3.3 we apply Algorithm 2.
As block solver for arrival time customers we can either use a block solver from Section 3.6.1
or 3.6.2 (in which case we obtain a provably good overall algorithm) or the Newton based
block solver from Section 3.6.3 (in which case we obtain fast convergence and good results).
Note that the proof of Müller, Radke, and Vygen [MRV11] cannot be applied directly if we
use the latter block solver as we cannot prove an approximation ratio with respect to the
initial prices.
To accelerate convergence of arrival times even further, we iterate the computation of
all arrival time customers for a constant number of times in each resource sharing phase.
The overall algorithm is depicted in Algorithm 4.
Theorem 3.11 Let β > 0. Given a routing oracle with approximation ratio σ and
$p = O(\beta^{-2} \log |R|)$, lines ①–⑪ of Algorithm 4 compute a (fractional) solution that
minimizes the maximum resource usage up to a factor σ(1 + β), assuming that this
minimum is between 1/2 and 2 if we use the block solver from Section 3.6.1 or 3.6.2 in ⑩.
If σ = 1 and there is a global routing that satisfies all routing and timing constraints,
the algorithm computes a fractional solution such that no edge is overloaded by more than
a factor 1 + β and the worst slack is at least −τr − βH, where H is the constant from
Theorem 3.8.
Proof This follows from optimality of the simple block solver from Section 3.6.1 and from
Theorems 3.1, 3.7 and 3.8. 
① for i = 1, . . . , p do
②     for each net N do
③         set X := 0
④         while X < 1 do
⑤             call routing oracle for N to obtain a solution (A, κ)
⑥             set ξ := min{1 − X, 1 / max_{r∈R} usg_{N,r}((A, κ))}
⑦             set price(r) := price(r) · e^{γ·ξ·usg_{N,r}((A,κ))} for r ∈ R
⑧     for j = 1, . . . , n do
⑨         for each arrival time customer v do
⑩             a(v) := ComputeAT(v)
⑪             set price(r) := price(r) · e^{γ·n^{−1}·usg_{v,r}(a(v))} for r ∈ R
⑫ Iterated randomized rounding
⑬ Rip-up and re-route
Algorithm 4: Overall algorithm for timing-constrained global routing with constants p, n, γ.
Steps ⑤ and ⑩ have to be specified.
3.8 Obtaining Integral Solutions
Having computed a fractional solution, the remaining task is to round the solutions of net
customers to integral solutions such that the maximum resource violation does not increase
too much. Müller, Radke, and Vygen [MRV11] proposed a randomized rounding approach
to perform this task. Independently for all (net) customers N with solution $\sum_{i=1}^{k} x_i\,\mathrm{sol}_i$
where $x_i > 0$ for i = 1, . . . , k, $\sum_{i=1}^{k} x_i = 1$, and $\mathrm{sol}_i \in B(N)$ is an integral solution for
i = 1, . . . , k, we select solution $\mathrm{sol}_i$ with probability $x_i$. Müller, Radke, and Vygen [MRV11]
proved that the maximum relative resource violation does not increase by more than a
factor of 1 + δ with high probability. For a precise statement see Theorem 22 in [MRV11].
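The rounding step itself is simple to state in code. The following Python sketch assumes a hypothetical data layout in which each customer maps to a list of (x_i, sol_i) pairs whose weights are positive and sum to 1; the function name `round_solutions` is ours and does not refer to any existing tool.

```python
import random

def round_solutions(fractional, rng):
    """Pick one integral solution per customer, independently of the others.

    fractional: dict mapping a customer to a list of (weight, solution) pairs,
                where the weights are positive and sum to 1 (a convex combination).
    rng:        a random.Random instance, passed in for reproducibility.
    """
    rounded = {}
    for customer, convex_combination in fractional.items():
        weights = [w for w, _ in convex_combination]
        solutions = [s for _, s in convex_combination]
        # Select sol_i with probability x_i.
        rounded[customer] = rng.choices(solutions, weights=weights, k=1)[0]
    return rounded
```

A customer whose fractional solution consists of a single tree with weight 1 always keeps that tree, so only customers with a genuinely fractional solution are affected by the random choice.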
A common approach to decrease resource violations that have been introduced by
randomized rounding is rip-up and re-route, i. e. replacing a solution of a customer by
another solution. The new solution can either be chosen from the set of solutions that have
been computed in the fractional phase but have not been selected, or it can be re-computed
from scratch.
During rip-up and re-route it is necessary to update arrival times as these might have
become sub-optimum at the end of Algorithm 2. Let v ∈ V (D) be an arrival time customer.
For r ∈ R let priceI(r, x) : R× R≥0 → R≥0 be the cost function used during rip-up and
re-route, i. e. for a resource r ∈ R, priceI(r, x) is the resource usage cost of r if a fraction
of x is used from it. We may either still use exponential cost functions as in the resource
sharing phase or we select a completely different function. The only assumptions we make
are that all functions priceI(r, .) are strictly monotonically increasing, continuous and
differentiable.
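For r ∈ R let usg(r) be the relative usage from r by all customers except for v. Let
\[
h(x) := \sum_{(v,w)\in\delta^+(v)} \mathrm{price}_I\left((v,w),\ \mathrm{usg}((v,w)) + \frac{x - a_{\min}(v)}{a_{\max}(w) - a_{\min}(v)}\right)
 - \sum_{(u,v)\in\delta^-(v)} \mathrm{price}_I\left((u,v),\ \mathrm{usg}((u,v)) + \frac{a_{\max}(v) - x}{a_{\max}(v) - a_{\min}(u)}\right).
\]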
Instance: A strictly monotonically increasing and differentiable function
h : [amin, amax] → R with root x∗ such that h(amin) < 0 < h(amax),
a precision δ > 0.
Output: A value x ∈ [amin, amax] such that |x∗ − x| < δ.
① set xl := amin, xu := amax.
② while xu − xl > δ do
③     set x1 := (xl + xu) / 2
④     set x2 := x1 − h(x1) / h′(x1)
⑤     if sgn(h(x1)) ≠ sgn(h(x2)) do
⑥         set x3 := (h(x1) · x2 − h(x2) · x1) / (h(x1) − h(x2))
⑦     else
⑧         set x3 := x2
⑨     set xl := max{x ∈ {xl, x1, x2, x3} : h(x) ≤ 0},
⑩     set xu := min{x ∈ {xu, x1, x2, x3} : h(x) ≥ 0}.
⑪ return xl.
Algorithm 5: A combination of Newton’s method and binary search.
We update the arrival time a(v) to the projection of the root x∗ of h into the interval
[amin(v), amax(v)] by Newton’s method. As this method is not guaranteed to converge fast
enough for general cost functions, we can combine this with a binary search as shown in
Algorithm 5. The binary search guarantees that we need at most O(log(|amax(v) − amin(v)|))
iterations while Newton’s method yields quadratic convergence as soon as we have
approached x∗ close enough. If root x∗ lies between two solution candidates x1 and x2 in
one iteration, we compute the intersection x3 of the straight line through (x1, h(x1)) and
(x2, h(x2)) with the x-axis.
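A possible Python rendering of this combination is sketched below. It is an illustration under our own assumptions rather than the exact procedure: the function name `newton_bisection` is hypothetical, the secant candidate is taken as the zero of the straight line through (x1, h(x1)) and (x2, h(x2)), and candidates that fall outside the current bracket [xl, xu] are simply discarded so that h is only evaluated inside its domain.

```python
def newton_bisection(h, dh, a_min, a_max, delta):
    """Bracket the root of a strictly increasing function h on [a_min, a_max].

    Assumes h(a_min) < 0 < h(a_max); dh is the derivative of h.
    Returns the lower end x_l of a bracket [x_l, x_u] with x_u - x_l <= delta.
    """
    x_l, x_u = a_min, a_max
    while x_u - x_l > delta:
        x1 = 0.5 * (x_l + x_u)              # bisection candidate
        candidates = [x1]
        slope = dh(x1)
        if slope > 0:
            x2 = x1 - h(x1) / slope         # Newton candidate
            if x_l <= x2 <= x_u:            # assumption: drop candidates outside the bracket
                candidates.append(x2)
                h1, h2 = h(x1), h(x2)
                if (h1 < 0) != (h2 < 0) and h1 != h2:
                    # The root lies between x1 and x2: zero of the secant line.
                    candidates.append((h1 * x2 - h2 * x1) / (h1 - h2))
        for x in candidates:
            hx = h(x)
            if hx <= 0:
                x_l = max(x_l, x)
            if hx >= 0:
                x_u = min(x_u, x)
    return x_l
```

Since the midpoint x1 is always among the candidates, the bracket at least halves in every iteration, which mirrors the role of the binary search; the Newton and secant candidates can only shrink it further.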
Recently, Hähnle [Häh15] developed an alternative to the arrival time customer approach
described in this chapter. Similar to Vygen [Vyg04] he modeled timing constraints within
the Min-Max Resource Sharing Problem by adding a resource for each maximal path in
the timing graph. By a modified algorithm and a new analysis, Hähnle [Häh15] showed
how to obtain a polynomial running time although the number of resources is exponential.
He reduced the dependency of the running time on the number n of signal paths to log(n)
and achieved a running time of $\tilde{O}((|R'| + m) \cdot \log(n))$, where m is the number of edges in
the timing graph and R′ is the set of resources different from the signal path resources.
Chapter 4
Buffering-and-Routing Oracles
After having developed a general framework to incorporate timing constraints into a global
routing algorithm, we concentrate on the buffering problem in this chapter. The goal of
this thesis is to optimize the timing of a chip by buffering and by global routing without
violating routing or placement constraints. A delay model that is widely used to estimate
the timing of a buffered netlist is the Elmore Delay model [Elm48]. We prove that even
the problem of finding a single Steiner path (cf. Definition 2.1) minimizing a weighted sum
of Elmore delay and edge costs is NP-hard but can be approximated arbitrarily close to
the optimum.
We generalize the FPTAS to obtain a polynomial time algorithm that finds an almost
optimum buffered Steiner tree in the case that the number of sinks is constant.
Some results of this chapter are joint work with Nicolai Hähnle.
4.1 Minimum Cost Buffered Steiner Trees
With the timing-constrained global routing framework described in Chapter 3 we can
directly address timing within a model that has been designed for the Standard Global
Routing Problem originally. Instead of solving tasks from the fields routing and timing
separately we can now obey constraints from both in one algorithm and output Steiner
trees that are good with respect to congestion and delays.
In this thesis we go even one step further. Instead of just making global routing more
timing-aware, we will solve the buffering problem completely within the resource sharing
algorithm. In contrast to the situation of global routing where the solutions of net customers
were Steiner trees, we seek to find buffered Steiner trees (see Definition 2.8). In this sense,
the block solver for net customers becomes a combination of a routing and a buffering
oracle.
4.1.1 Buffer Space Resources
Besides routing congestion and timing, a buffering solution heavily affects the placement
problem. Similar to routing space, placement space is a limited resource. In some areas of
a chip standard logic is densely packed, leaving space for a few repeaters only. Other parts
of the chip are completely reserved for larger macros and repeaters must not be inserted
there at all.
[Figure with axes x, y, and layer.]
Figure 4.1: Global routing graph arising from a partition of a rectangular chip area into rectangular
tiles in the case of two routing layers. The partition defines placement bins such as the gray area.
To model placement space inside the resource sharing algorithm we divide the chip
area into placement bins $B_1 \mathbin{\dot\cup} B_2 \mathbin{\dot\cup} \dots \mathbin{\dot\cup} B_k$. For each bin $B_i$ we add a buffer space
resource with capacity equal to the available buffer space size($B_i$) inside $B_i$. A function
Ψ : V (G) → {B1, B2, . . . , Bk} tells us to which placement bin a node of G belongs.
Let L be a repeater library and let ((A, κ), b) be a buffered Steiner tree for a net
customer N . Let size(l) denote the size of repeater l ∈ L and define size(□) := 0. Net
customer N consumes an amount of
\[
\sum_{\nu\in V(A):\ \Psi(\kappa(\nu)) = B_i} \mathrm{size}(b(\nu))
\]
from buffer space resource $B_i$.
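The consumption of buffer space can be accumulated bin by bin. The following Python sketch uses a hypothetical dictionary-based representation (`kappa`, `b`, `psi` and `size` are plain dictionaries, and the function name is ours); a missing entry or `None` in `b` plays the role of "no repeater", which has size 0.

```python
from collections import defaultdict

def buffer_space_usage(tree_nodes, kappa, b, psi, size):
    """Sum repeater sizes per placement bin for one buffered Steiner tree.

    tree_nodes: iterable of tree vertices (the set V(A))
    kappa:      dict, tree vertex -> node of the global routing graph G
    b:          dict, tree vertex -> repeater type (absent or None: no repeater)
    psi:        dict, node of G -> placement bin
    size:       dict, repeater type -> size
    """
    usage = defaultdict(float)
    for nu in tree_nodes:
        repeater = b.get(nu)
        if repeater is not None:
            usage[psi[kappa[nu]]] += size[repeater]
    return dict(usage)
```

Bins that receive no repeater do not appear in the result, which corresponds to a consumption of 0 from the respective buffer space resource.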
In the global routing graph described in Section 2.4.3 there is a natural subdivision of
the chip area into placement bins, namely
\[
\bigl\{ [x_i, x_{i+1}] \times [y_j, y_{j+1}] : i \in \{0, \dots, w-1\},\ j \in \{0, \dots, h-1\} \bigr\},
\]
where $0 = x_0 < x_1 < \dots < x_w = W$, $0 = y_0 < \dots < y_h = H$ is a subdivision of
the chip area [0,W ] × [0, H]. The function Ψ maps a node representing a tile center
$\left(\frac{x_i + x_{i+1}}{2}, \frac{y_j + y_{j+1}}{2}, z\right)$ to the bin $[x_i, x_{i+1}] \times [y_j, y_{j+1}]$ containing it. Figure 4.1 illustrates
this.
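For this grid subdivision, Ψ amounts to two binary searches over the cut coordinates. The sketch below is a hypothetical helper (the name `make_psi` and the representation of a bin by its index pair (i, j) are our assumptions); it expects coordinates inside the chip area [0, W] × [0, H].

```python
from bisect import bisect_right

def make_psi(xs, ys):
    """Build the bin lookup for cuts xs = [x_0, ..., x_w] and ys = [y_0, ..., y_h].

    Returns a function mapping a node (x, y, z) to the index pair (i, j)
    of the bin [x_i, x_{i+1}] x [y_j, y_{j+1}] containing (x, y).
    """
    def psi(node):
        x, y, _layer = node  # the layer z does not influence the bin
        # bisect_right counts the cuts <= x; clamp so that x = W lands in the last bin.
        i = min(bisect_right(xs, x) - 1, len(xs) - 2)
        j = min(bisect_right(ys, y) - 1, len(ys) - 2)
        return i, j
    return psi
```

Tile centers lie strictly inside their bin, so the boundary convention chosen here does not matter for the nodes of the global routing graph.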
Note that for each repeater type l ∈ L there is exactly one layer below the wiring
layers on which we can place it. Usually, this is the lowest possible layer but the pins of large
repeaters can be on higher layers, too. It is trivial to extend this placement resource model
and deal with repeaters that overlap with more than one tile. For the sake of simplicity
and a simpler notation we omit this extension in this thesis.
4.2 Delay Models
In this section we describe two models that can be used to estimate the delay along a path
inside a buffered Steiner tree ((A, κ), b) for a net N with source s.
For the rest of this chapter we use the notation of the case that G is a directed graph.
All definitions and results can easily be adapted to the undirected case by replacing each
undirected edge by two directed edges. Recall that for ν, ω ∈ V (A) such that ω is reachable
from ν we denote the unique ν-ω path in A by A[ν,ω] (Definition 2.3).
4.2.1 The Elmore Delay Model
The delay along a buffered Steiner tree ((A, κ), b) in a global routing graph G depends
on various parameters. Besides its length, electrical properties of gates and wires such as
capacitance and resistance have great impact on the delay along a path.
One popular model is the Elmore delay model [Elm48] that we define now. Let
ν, ω ∈ V (A) such that ν = s or b(ν) ∈ L and P := A[ν,ω] does not contain any internal
repeaters, i. e. b(ν′) = □ for ν′ ∈ V (P )\{ν, ω}. Furthermore, we assume that we are given
• electrical capacitances cap(ν ′) for all ν ′ ∈ V (P )\{ν}, and
• resistance res(ν) and total output capacitance outcap(ν) at ν.
By replacing each edge in the global routing graph by a copy for each wire code, we can
achieve that each edge e ∈ E(G) has pre-defined wire capacitance cap(e) and resistance
res(e) that only depends on the edge itself. Recall that it is possible that adjacent vertices
ν, ω are placed at the same position and κ((ν, ω)) = ◦. We set res(◦) = cap(◦) = 0.
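The Elmore delay of P is defined as
\[
\mathrm{Elmore}(P) := \mathrm{res}(\nu) \cdot \mathrm{outcap}(\nu) + \sum_{\zeta=(\nu,\omega)\in E(P)} \mathrm{res}(\kappa(\zeta)) \cdot \left(\frac{\mathrm{cap}(\kappa(\zeta))}{2} + \mathrm{cap}(\omega)\right). \qquad (4.1)
\]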
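Formula (4.1) translates directly into a short loop. In the following Python sketch the function name `elmore_path` and the tuple layout of `edges` are hypothetical; the downstream capacitances cap(ω) are assumed to be given as input.

```python
def elmore_path(res_driver, outcap_driver, edges):
    """Evaluate formula (4.1) for a path P without internal repeaters.

    res_driver:    resistance res(nu) of the start vertex of P
    outcap_driver: total output capacitance outcap(nu) at the start vertex
    edges:         list of (res_e, cap_e, cap_head) per edge of P, where
                   cap_head is the capacitance cap(omega) at the head of the edge
    """
    delay = res_driver * outcap_driver
    for res_e, cap_e, cap_head in edges:
        delay += res_e * (cap_e / 2.0 + cap_head)
    return delay
```

As a small worked example, take a driver with resistance 2 and two consecutive wire segments with resistance 1 and capacitance 1 each, ending in a sink of capacitance 1, without branches. Then the capacitances at the heads of the two segments are 2 and 1, the output capacitance of the driver is 3, and the Elmore value is 2 · 3 + 1 · (0.5 + 2) + 1 · (0.5 + 1) = 10.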
In early design stages, the value ln(2) ·Elmore(P ) is a reasonable estimate on the delay
along P (see [Vyg16]). In this definition, the delays are independent of slews. To improve
the accuracy of the Elmore delay model, modern timing engines incorporate slew effects.
We call the resulting extension to the Elmore delay model the Elmore Delay Model with
Slew Propagation (see Section 7.1.3). In later design stages, even more accurate delay
models such as rapid interconnection circuit evaluation (RICE) [RP94] and current-based
delay models [CW03] are preferred. Despite the higher accuracy, those delay models are
harder to compute and lack explicit formulas which makes it more difficult to use them in
optimization.
In the context of timing-constrained global routing we assume that resistances and
capacitances of the edges in the global routing graph are given. The same holds for
capacitances of input pins of logic gates, macros, latches, and of primary inputs as well as
resistances of output pins.
This information allows us to define res(s), cap(t) of a sink pin t ∈ N\{s}, and res(ν)
and cap(ν) of a Steiner node ν with b(ν) ∈ L.
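For a Steiner point ν with b(ν) = □ we define recursively
\[
\mathrm{cap}(\nu) = \sum_{\zeta=(\nu,\omega)\in\delta^+_A(\nu)} \bigl(\mathrm{cap}(\kappa(\zeta)) + \mathrm{cap}(\omega)\bigr).
\]
Similarly, we define outcap(ν) for all ν ∈ V (A) such that ν = s or b(ν) ∈ L by the same
formula
\[
\mathrm{outcap}(\nu) = \sum_{\zeta=(\nu,\omega)\in\delta^+_A(\nu)} \bigl(\mathrm{cap}(\kappa(\zeta)) + \mathrm{cap}(\omega)\bigr).
\]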
The reason why we use a different notation to denote the total capacitance outcap(ν)
that a vertex ν has to drive is that Steiner nodes associated with repeaters can appear as
starting and as end points of paths.
Formula (4.1) enables us to compute the Elmore-value of any maximal subpath of
((A, κ), b) without internal repeaters. Let t ∈ N\{s} and let $\nu^{\mathrm{rpt}}_1, \dots, \nu^{\mathrm{rpt}}_k$ be the nodes in
$V(A_{[s,t]})$ with b(.) ∈ L (in the order in which they appear in $A_{[s,t]}$). If k ≥ 1, we define
\[
\mathrm{Elmore}(A_{[s,t]}) := \mathrm{Elmore}\bigl(A_{[s,\nu^{\mathrm{rpt}}_1]}\bigr) + \left(\sum_{i=1}^{k-1} \mathrm{Elmore}\bigl(A_{[\nu^{\mathrm{rpt}}_i,\nu^{\mathrm{rpt}}_{i+1}]}\bigr)\right) + \mathrm{Elmore}\bigl(A_{[\nu^{\mathrm{rpt}}_k,t]}\bigr).
\]
4.2.2 Generalized Non-Linear Delay Model
To demonstrate that the results of this chapter do not depend on the explicit formula for
Elmore delay and to simplify notation we present a slightly more general delay model that
we describe now.
Assume we are given the following functions:
• µ : E(G) ∪ L ∪ (N\{s})→ R≥0,
• F : (E(G) ∪ L ∪ {s})× R≥0 → R≥0 such that F (x, .) : R≥0 → R≥0 is monotonically
increasing for each x ∈ E(G) ∪ L ∪ {s}.
The function F determines the delay along an edge, a repeater, or the source and
depends on a capacitance determined by the µ function. We set µ(◦) := 0 and define
F (◦, .) to be the constant zero function. Similar to the Elmore delay case we compute
cap(ν) for each vertex ν ∈ V (A) and outcap(ν) for each vertex ν ∈ V (A) such that ν = s
or b(ν) ∈ L recursively by the formulas
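\[
\mathrm{cap}(\nu) := \begin{cases}
\mu(\nu) & \text{if } \nu \in N\setminus\{s\} \\
\mu(b(\nu)) & \text{if } b(\nu) \in L \\
\sum_{\zeta=(\nu,\omega)\in\delta^+_A(\nu)} \bigl(\mathrm{cap}(\omega) + \mu(\kappa(\zeta))\bigr) & \text{otherwise,}
\end{cases}
\]
\[
\mathrm{outcap}(\nu) := \sum_{\zeta=(\nu,\omega)\in\delta^+_A(\nu)} \bigl(\mathrm{cap}(\omega) + \mu(\kappa(\zeta))\bigr).
\]
For t ∈ N\{s} we define the delay through the unique s-t path in A as
\[
\mathrm{delay}^{\mu,F}_{((A,\kappa),b)}(t) := F(s, \mathrm{outcap}(s)) + \sum_{\zeta=(\nu,\omega)\in E(A_{[s,t]})} F(\kappa(\zeta), \mathrm{cap}(\omega)) + \sum_{\nu\in V(A_{[s,t]}):\ b(\nu)\in L} F(b(\nu), \mathrm{outcap}(\nu)).
\]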
The Elmore delay model is the special case of this model with
µ(e) = cap(e) for e ∈ E(G),
µ(t) = cap(t) for t ∈ N\{s},
µ(l) = cap(l) for l ∈ L,
F (s, x) = res(s) · x,
F (l, x) = res(l) · x for l ∈ L, and
$F(e, x) = \mathrm{res}(e) \cdot \left(\frac{\mathrm{cap}(e)}{2} + x\right)$ for e ∈ E(G).
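To make the recursion concrete, the generalized delay model can be evaluated as in the following Python sketch. The representation is our own assumption: the tree is given by hypothetical `parent` and `children` dictionaries, `kappa` maps tree edges to edges of G, a missing entry in `b` means that no repeater is placed, and `elmore_F` instantiates F for the Elmore special case listed above.

```python
def sink_delay(t, source, parent, children, kappa, b, sinks, mu, F):
    """Delay from the source to sink t in the generalized delay model."""
    def outcap(v):
        # Total capacitance that vertex v has to drive.
        return sum(cap(w) + mu[kappa[(v, w)]] for w in children.get(v, ()))

    def cap(v):
        if v in sinks:
            return mu[v]          # input capacitance of a sink pin
        if b.get(v) is not None:
            return mu[b[v]]       # a repeater hides everything behind it
        return outcap(v)          # plain Steiner point

    total = F(source, outcap(source))
    v = t
    while v != source:            # walk the path from t back to the source
        u = parent[v]
        total += F(kappa[(u, v)], cap(v))
        if b.get(u) is not None:
            total += F(b[u], outcap(u))
        v = u
    return total

def elmore_F(res, wire_cap):
    """F for the Elmore special case; res and wire_cap are plain dictionaries."""
    def F(x, load):
        if x in wire_cap:         # x is an edge of G
            return res[x] * (wire_cap[x] / 2.0 + load)
        return res[x] * load      # x is the source or a repeater type
    return F
```

On a path s → a → t with two unit wires, source resistance 2 and sink capacitance 1, this yields a delay of 10 without repeaters. Placing a repeater with input capacitance 0.5 and resistance 1 at a reduces the load seen by the source from 3 to 1.5 and gives 2 · 1.5 + 1 · (0.5 + 0.5) + 1 · (0.5 + 1) + 1 · 2 = 7.5.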
4.3 The Minimum Cost Buffered Steiner Tree Problem
Now we have collected all ingredients to state the problem of the net customer’s oracle.
Minimum Cost Buffered Steiner Tree Problem
Instance: A repeater library L associated with functions
size : L ∪ {□} → R≥0 and power : L ∪ {□} → R≥0
such that size(□) = power(□) = 0.
A graph G with edge and placement costs
c : (E(G) ∪ {◦}) ∪ V (G)→ R≥0 with c(◦) = 0.
A net N ⊆ V (G) with source s and sink delay costs
λ : N\{s} → R≥0.
A static power cost cpower ∈ R≥0.
Timing functions
µ : E(G) ∪ L ∪ (N\{s})→ R≥0
and
F : (E(G) ∪ L ∪ {s})× R≥0 → R≥0
such that F (x, .) is monotonically increasing for each x ∈ E(G)∪L∪{s}.
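Output: A buffered Steiner tree ((A, κ), b) for N in G minimizing
\[
\sum_{\zeta\in E(A)} c(\kappa(\zeta)) + \sum_{t\in N\setminus\{s\}} \lambda(t) \cdot \mathrm{delay}^{\mu,F}_{((A,\kappa),b)}(t) + \sum_{\nu\in V(A)} c(\kappa(\nu)) \cdot \mathrm{size}(b(\nu)) + c_{\mathrm{power}} \cdot \sum_{\nu\in V(A)} \mathrm{power}(b(\nu)).
\]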
The cost c(e) for an edge e ∈ E(G) of the global routing graph contains all costs for
using e such as resource prices of congestion resources and a net length resource. Costs
c(v) for v ∈ V (G) are resource prices for buffer space resources while λ(t) is the sum of
prices over all timing resources entering a sink t ∈ N\{s}. We also want to minimize the
static power consumption of the newly inserted repeaters and add a static power resource
into the resource sharing framework. Its price is cpower.
Optimizing dynamic power consumption can be done analogously to optimizing the
delay through a repeater and through the source and requires insertion of an additional
resource for dynamic power (cf. Section 2.5.6).
In this chapter we do not distinguish buffers and inverters and do not consider electrical
constraints like slew- and capacitance constraints. In Section 4.7 we return to the question
how to incorporate polarity and capacitance constraints. Obeying slew constraints and
using delay models in which delays are influenced by slews is harder.
4.3.1 Hardness Results
It is easy to see that the Minimum Cost Buffered Steiner Tree Problem is NP-hard. In fact,
this is true even for special cases of it.
If cpower = c(v) = λ(t) = 0 for all v ∈ V (G) and t ∈ N\{s}, we obtain the classical
Minimum Cost Steiner Tree Problem that has been proved to be NP-hard by Garey and
Johnson [GJ77], [GJ79] page 209f.
Chuzhoy et al. [Chu+05] proved that for the following problem no o(log log |N |)-approximation
algorithm exists unless every problem in NP can be solved in $O(n^{\log\log\log n})$
time where n is the instance size:
Instance: An undirected graph G with functions c, ρ : E(G) ∪ {◦} → R≥0 with
c(◦) = ρ(◦) = 0.
A net N with source s and sink delay costs λ : N\{s} → R≥0.
Output: A Steiner tree (A, κ) for N in G minimizing
\[
\sum_{\zeta\in E(A)} c(\kappa(\zeta)) + \sum_{t\in N\setminus\{s\}} \lambda(t) \cdot \sum_{\zeta\in E(A_{[s,t]})} \rho(\kappa(\zeta)).
\]
This is the special case of the Minimum Cost Buffered Steiner Tree Problem for which
L = ∅, F (s, .) is the constant zero function, for e ∈ E(G), F (e, .) is the constant function
with value ρ(e), and where G arises from an undirected graph by directing each edge in
both directions. In Chapter 6 we will study that problem in greater detail. There, we refer
to that problem as the Minimum Cost Steiner Tree Problem with Linear Delays.
Even if delays are measured by the Elmore delay model, the result of Chuzhoy et
al. [Chu+05] shows that the task of a net customer’s block solver is hard to approximate.
Let L = {l} such that l is allowed to be added at each node of G with price 0 (i. e.
c(v) = 0 for all v ∈ V (G)). For e ∈ E(G) let res(e) := ρ(e) and cap(e) := 0. We set
cpower = res(s) = res(l) := 0 and cap(t) = cap(l) = 1 for t ∈ N\{s}.
If (A, κ) is a Steiner tree for N in G such that A is rooted at s, we define b : V (G) →
L ∪ {□} by
\[
b(v) = \begin{cases} \square & \text{if } v \in N \\ l & \text{otherwise.} \end{cases}
\]
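Then,
\[
\sum_{\zeta\in E(A)} c(\kappa(\zeta)) + \sum_{t\in N\setminus\{s\}} \lambda(t) \cdot \sum_{\zeta\in E(A_{[s,t]})} \rho(\kappa(\zeta)) = \sum_{\zeta\in E(A)} c(\kappa(\zeta)) + \sum_{t\in N\setminus\{s\}} \lambda(t) \cdot \mathrm{Elmore}(s, t).
\]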
On the other hand, if ((A, κ), b) is a buffered Steiner tree, we can assume that b(v) = l
for all v ∈ V (A)\N (inserting a repeater at each Steiner point does not increase capacitances
and hence delays, and does not introduce any costs).
The problem of computing a shortest rectilinear Steiner tree in which all source-sink
paths are shortest possible is known as the Rectilinear Steiner Arborescence Problem. Shi
and Su [SS05] proved NP-hardness of that problem.
In Section 4.5.1 we show that the special case of the Minimum Cost Buffered Steiner
Tree Problem with |N | = 2 is NP-hard as well. This is true even if delays are measured by
the Elmore delay model.
4.3.2 Existing Algorithms for Special Cases
Due to its complexity, we will not be able to solve the general version of the Minimum
Cost Buffered Steiner Tree Problem. Instead, we concentrate on special cases.
For the case that c(e) = 0 for e ∈ E(G), L = ∅, and delays are measured by the Elmore
delay model with edge resistances and capacitances of the form
res(e) = constres · length(e), cap(e) = constcap · length(e)
for all edges e ∈ E(G) (where length : E(G) → R>0 is some function), Scheifele [Sch14]
gave the first constant factor approximation algorithm. For general graph instances he
obtains an approximation ratio of 4.11. If G is a 2-dimensional grid graph and the length
of an edge is the ℓ1-distance between its adjacent vertices, he achieves an approximation
guarantee of 3.39.
For the Rectilinear Steiner Arborescence Problem a 2-approximation algorithm (Rao et
al. [Rao+92], Córdova and Lee [CL94]) and a polynomial time approximation scheme (Lu
and Ruan [LR00]) are known.
For the case that the global routing graph is a 2-dimensional grid graph there have
been several heuristic approaches for simultaneous Steiner tree construction and buffering.
Okamoto and Cong [OC96] combined the computation of a rectilinear Steiner arborescence
with dynamic programming based buffering. For Steiner arborescence computation they
used the A-Tree algorithm by Cong et al. [CLZ93]. The first algorithm for buffering a
given Steiner tree by dynamic programming is due to Van Ginneken [Van90]. An overview
of dynamic programming based algorithms for the buffering problem can be found in
Section 7.2. Similar approaches as in [OC96] have been used by Hrkić and Lillis [HL02],
and Hu et al. [Hu+03].
4.4 Cost-Delay Minimum Steiner Tree Problem with Loops
In this section we slightly change the problem formulation of the Minimum Cost Buffered
Steiner Tree Problem. This allows a simplification of notation, a reduction of variables,
and usage of a slightly more general delay model.
4.4.1 Shortening the Model: Eliminating Pin Properties
First, we may assume that the capacitance µ(t) is zero for each sink pin t ∈ N\{s}. This
can be achieved by adding a new vertex t′ to G connected with t by an edge e = (t, t′).
Edge e will receive a µ-value equal to the original µ-value of t and cost c(e) = 0. The
function F (e, .) can be set to be the constant zero function. Vertex t′ replaces t in N .
By a similar technique we can assume that the source delay function F (s, .) is constantly
zero. As before, we add a new node s′ as well as an edge e = (s′, s) to G and replace s by
s′ in N . We set F (e, .) to the original function F (s, .) and c(e) = µ(e) = 0.
4.4.2 Shortening the Model: Representing Repeaters by Loops
Now we show how to avoid dealing with repeaters directly. Instead, we model the effect
of inserting a repeater by an edge traversal as follows. For each repeater l ∈ L and each
v ∈ V (G) such that inserting repeater l at position v is allowed, we add a loop edge
el = (v, v) to G. Traversal of el models adding a repeater of type l at position v. We set
c(el) := c(v) · size(l) + cpower · power(l) and F (el, .) := F (l, .). To model the capacitance
change after repeater insertion we replace the former function µ by a function
∆ : E(G)× R≥0 → R≥0
such that ∆(el, .) is the constant function with value µ(l). For edges e ∈ E(G) that do not
model repeater insertion we define ∆(e, .) to be the function x ↦ x + µ(e) while ∆(◦, .) is
the identity function R≥0 → R≥0.
The capacitance cap(ν) of a vertex ν ∈ V (A) is then recursively defined as cap(ν) = 0
for ν ∈ N\{s} and
\[
\mathrm{cap}(\nu) = \sum_{\zeta=(\nu,\omega)\in\delta^+_A(\nu)} \Delta(\kappa(\zeta), \mathrm{cap}(\omega))
\]
otherwise.
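The transformation of repeaters into loop edges can be sketched as follows. This Python fragment is only an illustration with hypothetical names (`add_repeater_loops`, `wire_delta`, and the dictionary layout of a loop edge are our assumptions); `allowed(l, v)` stands for the test whether repeater l may be inserted at position v.

```python
def add_repeater_loops(vertices, library, allowed, c_vertex, c_power, size, power, mu, F):
    """Create one loop edge (v, v) per allowed pair of repeater l and vertex v."""
    loops = []
    for v in vertices:
        for l in library:
            if allowed(l, v):
                loops.append({
                    "edge": (v, v),
                    "repeater": l,
                    # c(e_l) := c(v) * size(l) + c_power * power(l)
                    "cost": c_vertex[v] * size[l] + c_power * power[l],
                    # F(e_l, .) := F(l, .)
                    "F": (lambda load, l=l: F(l, load)),
                    # Delta(e_l, .) is the constant function with value mu(l)
                    "Delta": (lambda x, l=l: mu[l]),
                })
    return loops

def wire_delta(mu_e):
    """Delta(e, .) for an edge that does not model a repeater: x -> x + mu(e)."""
    return lambda x: x + mu_e
```

The constant function on loop edges expresses that a repeater shields the upstream part of the tree from all capacitance behind it, whereas an ordinary edge only adds its own capacitance.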
4.4.3 Problem Formulation
The following problem definition summarizes our new model and the resulting optimization
problem. The Minimum Cost Buffered Steiner Tree Problem is a special case of this new
problem.
Cost-Delay Minimum Steiner Tree Problem with Loops
Instance: A digraph G with edge costs c : E(G)→ R≥0, possibly containing loops.
A net N ⊆ V (G) with source s and sink delay costs λ : N\{s} → R≥0.
A function F : E(G)× R≥0 → R≥0 such that F (e, .) is monotonically
non-decreasing for each e ∈ E(G).
A function ∆ : E(G)× R≥0 → R≥0 such that ∆(e, .)
• is a constant function if e is a loop edge and
• is monotonically non-decreasing with ∆(e, x) ≥ x for all x ∈ R≥0
otherwise.
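We define c(◦) = 0, F (◦, .) to be the constant zero function, and ∆(◦, .)
to be the identity function.
Output: A Steiner tree (A, κ) for N in G minimizing
\[
\sum_{\zeta\in E(A)} c(\kappa(\zeta)) + \sum_{t\in N\setminus\{s\}} \lambda(t) \cdot \left(\sum_{\zeta=(\nu,\omega)\in E(A_{[s,t]})} F(\kappa(\zeta), \mathrm{cap}(\omega))\right),
\]
where for ω ∈ V (A), cap(ω) is recursively defined as
\[
\mathrm{cap}(\omega) = \sum_{\zeta=(\omega,\omega')\in\delta^+_A(\omega)} \Delta(\kappa(\zeta), \mathrm{cap}(\omega')).
\]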
Note that the recursive definition of cap already implies that cap(t) = 0 for t ∈ N\{s}. We
call the special case of the Cost-Delay Minimum Steiner Tree Problem with Loops where
N = {s, t} the Cost-Delay Minimum Steiner Path Problem with Loops.
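For the path case N = {s, t}, the objective can be evaluated by a single backward pass from the sink, since every path vertex has exactly one outgoing tree edge. The following Python sketch assumes a hypothetical representation of the path as a list of triples (c, F, ∆), ordered from s to t; the function name `path_cost` is ours.

```python
def path_cost(edges, lam):
    """Objective value of an s-t path in the Cost-Delay Minimum Steiner Path Problem with Loops.

    edges: list of (c, F, Delta) per path edge in the order from s to t, where
           c is the edge cost and F, Delta are functions of one capacitance argument
    lam:   delay cost lambda(t) of the sink
    """
    cap_head = 0.0   # cap(t) = 0 at the sink
    cost = 0.0
    delay = 0.0
    for c, F, Delta in reversed(edges):
        cost += c
        delay += F(cap_head)          # F(kappa(zeta), cap(omega)) with omega the head
        cap_head = Delta(cap_head)    # capacitance at the tail of the edge
    return cost + lam * delay
```

As a made-up instance (not the one of any figure in this thesis), consider three cost-free edges with F(x) = x, ∆(x) = x for the first, F = 0, ∆(x) = x + 100 for the second, and F = 0, ∆(x) = x + 1 for the third. The path has cost 101. Inserting a loop edge with F = 0 and ∆ ≡ 1 between the first and the second edge hides the large capacitance and reduces the cost to 1.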
4.4.4 Necessity of Conservative Edge Costs
Recall that for the Standard Minimum Steiner Tree Problem and for the Shortest Path
Problem it is essential to restrict the edge costs such that there are no negative cycles. An
edge cost function for which no cycle with negative total cost exists is called conservative
edge cost. Finding a shortest path in a graph with non-conservative edge costs is NP-hard
as a straightforward reduction from the NP-hard Longest Path Problem shows ([GJ79]
page 213).
In our case, the restriction that ∆(e, x) ≥ x for all non-loop edges e ∈ E(G) and all
x ∈ R≥0 is necessary to ensure that traversal of cycles does not decrease the overall cost of
a Steiner tree. Note that this condition is fulfilled for the Elmore delay model.
It is easy to see that allowing e. g. a function ∆(e, .) = (x ↦ α · x) for a constant α < 1
may lead to undesired situations: If we do not enforce that the κ function of a Steiner
tree (A, κ) is injective, no finite optimum solution might exist. If we require injectivity,
we obtain an optimization problem for which no approximation algorithm exists unless
P = NP. The last fact is even true for the Cost-Delay Minimum Steiner Path Problem
with Loops and can be proved by a straightforward reduction from the Hamiltonian s-t
Path Problem:
Lemma 4.1 Let p be a polynomial. Unless P = NP there is no $2^{p(|V(G)|)}$-approximation
algorithm for the modification of the Cost-Delay Minimum Steiner Path Problem with Loops
in which we omit the condition ∆(e, x) ≥ x for non-loop edges e ∈ E(G) and require that
the returned buffered Steiner tree ((A, κ), b) has the property |κ−1(e)| ≤ 1 for all e ∈ E(G).
Proof Let N = {s, t} and let G be a directed graph with n := |V (G)| vertices. We
construct a graph G′ from G by adding vertices s′, t′ and edges (s′, s), (t, t′) with
F ((s′, s), x) = x, F ((t, t′), x) = 0, ∆((t, t′), x) = x + 1 for all x ∈ R≥0. We set
$\Delta(e, x) = \frac{x}{2^{p(n+2)}+1}$ and F (e, x) = 0 for all edges e ∈ E(G). Edge costs c(e) are zero
for all edges e ∈ E(G′). Then, the total cost of an s′-t′ path containing all vertices in G
is more than a factor $2^{p(n+2)}$ smaller than the cost of any shorter s′-t′ path. Hence, no
$2^{p(|V(G')|)}$-approximation algorithm can exist unless P = NP. □
While for the Shortest Steiner Tree Problems in Graphs with non-negative (and hence
conservative) edge costs, loops do not play any role, they are essential in our case as they
model repeater insertion. With introduction of loop edges we have relaxed the condition
∆(e, x) ≥ x that we require for non-loop edges e ∈ E(G) and x ∈ R≥0. Consequently,
optimum solutions can contain non-trivial cycles. Such a situation is depicted in Figure 4.2.
Despite these situations we can still bound the length of an optimum solution, as the
next lemma states. Having such a bound is necessary to obtain a polynomial time
algorithm for (special cases of) the Cost-Delay Minimum Steiner Tree Problem with Loops.
Lemma 4.2 Let G,N, λ, F, c,∆ be an instance of the Cost-Delay Minimum Steiner Tree
Problem with Loops such that N = {s, t}. Let h be the number of loop edges in G. There
exists an optimum solution ((A, κ), b) with |V (A)| ≤ (1 + h) · |V (G)|.
Proof Without loss of generality we may assume that λ(t) = 1. Let ((A, κ), b) be an
optimum solution with |V (A)| minimum. Let v ∈ κ(A) such that κ^{−1}(v) = {x1, x2, . . . , xk}
and xi is reachable from xj in A if and only if i ≥ j (i. e. the xi appear in that order in
path A).
[Figure 4.2: Instance of the Cost-Delay Minimum Steiner Tree Problem with Loops for which the optimum solution uses a non-trivial cycle in G. (a) Graph G with functions ∆ and F ; here, N = {s, t}, c(e) = 0 for e ∈ E(G), and λ(t) = 1. (b) Two Steiner trees: the upper tree has cost 101, the lower tree has cost 3 and uses a loop edge and a non-trivial cycle.]
If there is i ∈ {1, . . . , k − 1} such that cap(xi) ≥ cap(xi+1), removing A[xi,xi+1] − xi
from A but connecting xi with the successor of xi+1 in A (respectively identifying xi with
t if xi+1 = t) would not increase the cost of A but would decrease |V (A)|. We conclude
that A[xi,xi+1] contains a loop edge for each i ∈ {1, . . . , k − 1}.
Now, note that each loop edge e is used by A at most once as otherwise, removing the
part between the first and the last traversal of e (including the first traversal but excluding
the last one) decreases |V (A)| without increasing the cost.
This implies that k ≤ h+ 1 and the lemma follows. 
The bound of Lemma 4.2 is tight up to a constant factor:
Lemma 4.3 For each n, h ∈ N there is an instance G,N, λ, F, c,∆ of the Cost-Delay
Minimum Steiner Tree Problem with Loops with |N | = 2, |V (G)| = 2 + h+ n, and h loop
edges in which each optimum solution (A, κ) satisfies |V (A)| ≥ (h + 1)(n + 2).
Proof Let G be the directed graph with vertices
V (G) = {s, t, v1, . . . , vn, w1, . . . , wh}.
We add edges (s, v1), (vn, t), and (vi, vi+1) for i = 1, . . . , n− 1. For each j ∈ {1, . . . , h} we
insert edges (vn, wj), (wj , v1), and attach a loop edge ej at wj .
We define N = {s, t} with λ(t) = 1, c(e) = 0 for all e ∈ E(G), and for x ∈ R≥0
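$$
\begin{aligned}
F((s, v_1), x) &= x, \\
F((w_j, v_1), x) &= \begin{cases} 0 & \text{if } x \le j \\ 1 & \text{otherwise} \end{cases} && \text{for all } j = 1, \dots, h, \\
F(e, x) &= 0 && \text{for all edges } e \in E(G) \setminus \delta^-(v_1), \\
\Delta((v_n, t), x) &= x + h, \\
\Delta(e_j, x) &= j - 1 && \text{for all } j = 1, \dots, h, \\
\Delta(e, x) &= x && \text{for all non-loop edges } e \in E(G) \setminus \{(v_n, t)\}.
\end{aligned}
$$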
See Figure 4.3(a) for an illustration of G in the case n = 5, h = 3.
[Figure 4.3: Instance for the Cost-Delay Minimum Steiner Tree Problem with Loops constructed in the proof of Lemma 4.3 with n = 5, h = 3. The number of edges in each optimum s-t Steiner path is large. (a) Graph G and timing functions. (b) A Steiner path with cost 0 and minimum number of vertices.]
First observe that the Steiner path traversing edges
(s, v1), (v1, v2), . . . , (vn−1, vn), (vn, w1), e1, (w1, v1),
(v1, v2), . . . , (vn−1, vn), (vn, w2), e2, (w2, v1),
. . .
(v1, v2), . . . , (vn−1, vn), (vn, wh), eh, (wh, v1),
(v1, v2), . . . , (vn−1, vn), (vn, t)
in that order has cost 0 (see Figure 4.3(b)).
To complete the proof of the lemma it remains to show that this path has the least
number of vertices among all Steiner paths with cost 0. Let A be such an s-t Steiner path
with cost 0. By definition of F ((s, v1), .) and the ∆-functions, A must traverse e1, which
requires traversing (w1, v1) afterwards.
Using the same argument we observe that each Steiner path with cost 0 that tra-
verses edge (wj , v1) for j ∈ {1, . . . , h − 1} must also traverse ej+1 and thus (wj+1, v1).
Consequently, A traverses all edges entering v1.
As each non-trivial cycle in G containing v1 and a loop edge has n+2 vertices and since
the part of A between the first and last occurrence (say including the first but excluding the
last occurrence) must contain at least n+ 2 vertices, we conclude |V (A)| ≥ (h+ 1)(n+ 2).
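The construction can be checked numerically. The following is a minimal sketch, assuming λ(t) = 1 and a path given as the list of traversed graph edges from s to t; the helper names `path_cost`, `lemma_4_3_instance` and `long_path` as well as the tuple encoding of vertices and edges are our own choices for illustration and do not appear in the construction itself. Capacitances are propagated backwards from t, exactly as in the definition of the cost of a Steiner path.

```python
def path_cost(edges, c, F, Delta):
    """Cost of a Steiner path given by its traversed edges from s to t (lambda(t) = 1).

    Capacitances are propagated backwards from t, starting with capacitance 0.
    """
    cap, total = 0.0, 0.0
    for e in reversed(edges):
        total += c(e) + F(e, cap)   # cost of e depends on the downstream capacitance
        cap = Delta(e, cap)         # capacitance seen at the tail of e
    return total


def lemma_4_3_instance(n, h):
    """Cost functions of the instance from the proof of Lemma 4.3.

    Edges are pairs (tail, head); the loop e_j at w_j is encoded as (w_j, w_j).
    """
    s, t = "s", "t"
    v = lambda i: ("v", i)
    w = lambda j: ("w", j)

    def c(e):
        return 0.0

    def F(e, x):
        tail, head = e
        if e == (s, v(1)):
            return x
        if head == v(1) and tail[0] == "w":      # edge (w_j, v_1)
            return 0.0 if x <= tail[1] else 1.0
        return 0.0

    def Delta(e, x):
        tail, head = e
        if e == (v(n), t):
            return x + h
        if tail == head:                          # loop edge e_j
            return float(tail[1] - 1)
        return x

    return s, t, v, w, c, F, Delta


def long_path(n, h):
    """Edge sequence of the cost-0 path described in the proof."""
    s, t, v, w, *_ = lemma_4_3_instance(n, h)
    chain = [(v(i), v(i + 1)) for i in range(1, n)]
    edges = [(s, v(1))] + chain
    for j in range(1, h + 1):
        edges += [(v(n), w(j)), (w(j), w(j)), (w(j), v(1))] + chain
    edges.append((v(n), t))
    return edges
```

For n = 5 and h = 3 the long path has 27 edges, hence 28 = (h + 1)(n + 2) vertices, while the direct path s, v1, . . . , vn, t pays F ((s, v1), h) = h.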

In instances arising in practice it is often the case that either no repeater or all repeaters
can be inserted at a node. In this case, the bound of Lemma 4.2 improves to (|L|+1)·|V (G)|
where L is the repeater library. This statement has essentially the same proof as Lemma 4.2
and tightness can easily be proved by attaching |L| loop edges at each inner node of an s-t
path of length n+ 2. This graph defines G and each of the |L| loop edges at each inner
node represents a repeater of a certain type. By defining cost functions similar to the ones
used in the proof of Lemma 4.3 we can make sure that an optimum Steiner path actually
uses each of these loops.
4.5 Cost-Delay Minimum Steiner Path Problems
In this section we study the Cost-Delay Minimum Steiner Path Problem with Loops. Recall
from Section 4.4.3 that this problem is the special case of the Cost-Delay Minimum Steiner
Tree Problem with Loops where N = {s, t}.
If G does not even contain loops, we refer to the problem as the Cost-Delay Minimum
Steiner Path Problem. Without loss of generality we assume λ(t) = 1 in this section.
Generally speaking, we are looking for a shortest Steiner path between two vertices s
and t in this section (recall the definition of a Steiner path: Definition 2.1). The classical
shortest path problem in which the cost of traversing an edge e ∈ E(G) is c(e) ≥ 0 is one
of the best understood problems in combinatorial optimization. Well-known algorithms
like Dijkstra’s algorithm [Dij59] or the algorithms by Moore [Moo59], Bellman [Bel58], and
Ford [For56] compute optimum solutions efficiently.
As the cost of traversing an edge is dependent on a capacitance in our case, the problem
becomes much harder and the classical algorithms cannot be applied anymore.
In Section 4.5.1 we show that the Cost-Delay Minimum Steiner Path Problem with
Loops is NP-hard even in very restricted special cases. Despite its hardness we can still
give a fully polynomial time approximation scheme in Section 4.5.2. In Section 4.5.3 we
show how to substantially improve the running time of that FPTAS in the situation of the
Cost-Delay Minimum Steiner Path Problem (without loops).
4.5.1 NP-Hardness
Now we prove that the Cost-Delay Minimum Steiner Path Problem (and hence the Cost-
Delay Minimum Steiner Path Problem with Loops) is NP-hard even if delays are measured
by the Elmore delay model.
The results of this section are joint work with Nicolai Hähnle.
Our reductions use NP-completeness of the famous Partition Problem (see the book of
Garey and Johnson, page 47 [GJ79]).
Lemma 4.4 ([GJ79]) The following problem is NP-complete:
Instance: Rationals x1, . . . , xn ∈ Q>0 with ∑_{i=1}^{n} xi = 2.
Question: Is there a subset I ⊆ {1, . . . , n} such that ∑_{i∈I} xi = 1?
Theorem 4.5 The Cost-Delay Minimum Steiner Path Problem is NP-hard if for each
edge e ∈ E(G)
• ∆(e, .) is of the form x ↦ x + µ(e) with µ(e) ≥ 0 and
• F (e, x) is equal to F (e, x) := f(∆(e, x)) − f(x) for a monotonically increasing, strictly
convex, and twice differentiable function f : R≥0 → R.
This is true even if
• G arises from an s-t path by replacing each edge by two parallel edges,
• c(e) · µ(e) = 0 for all e ∈ E(G), and
• c(e) · c(e′) = µ(e) · µ(e′) = 0 for all pairs e, e′ of parallel edges.
[Figure 4.4: Graph G in the proof of Theorem 4.5. Each edge e is labeled c(e), µ(e): for i = 1, . . . , n the two parallel edges from vi to vi+1 are yi with label 0, xi and ei with label f′(1) · xi, 0; s = v1 and t = vn+1.]
Proof Let f and F be as described in the theorem. First, observe that for any Steiner
path (P, κ) in G,
$$
\sum_{\zeta=(\nu,\omega)\in E(P)} F(\kappa(\zeta), \operatorname{cap}(\omega)) = f\Bigl(\sum_{\zeta\in E(P)} \mu(\kappa(\zeta))\Bigr) - f(0).
$$
Let x1, . . . , xn be an instance of Partition. We construct an instance of the Cost-Delay
Minimum Steiner Path Problem by
$$
G = \bigl(\{v_1, \dots, v_{n+1}\},\ \{e_i = (v_i, v_{i+1}),\ y_i = (v_i, v_{i+1}) : i = 1, \dots, n\}\bigr),
$$
$$
c(e_i) = f'(1) \cdot x_i,\quad c(y_i) = 0,\quad \mu(e_i) = 0,\quad \mu(y_i) = x_i \quad \text{for } i = 1, \dots, n,\qquad v_1 = s,\ v_{n+1} = t.
$$
Figure 4.4 depicts G. For any Steiner path (P, κ) let
$$
I(P, \kappa) := \bigl\{ i \in \{1, \dots, n\} : P \text{ contains an edge } \zeta \text{ with } \kappa(\zeta) = y_i \bigr\}.
$$
Furthermore, define g(x) := f ′(1) · (2− x) + f(x)− f(0). Note that (P, κ) has cost
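$$
\begin{aligned}
\sum_{\zeta\in E(P)} c(\kappa(\zeta)) + \sum_{\zeta=(\nu,\omega)\in E(P)} F(\kappa(\zeta), \operatorname{cap}(\omega))
&= f'(1) \cdot \sum_{i\notin I(P,\kappa)} x_i + f\Bigl(\sum_{i\in I(P,\kappa)} x_i\Bigr) - f(0) \\
&= f'(1) \cdot \Bigl(2 - \sum_{i\in I(P,\kappa)} x_i\Bigr) + f\Bigl(\sum_{i\in I(P,\kappa)} x_i\Bigr) - f(0) \\
&= g\Bigl(\sum_{i\in I(P,\kappa)} x_i\Bigr).
\end{aligned}
$$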
We claim that x1, . . . , xn is a yes-instance for Partition if and only if there is a Steiner
path with cost at most g(1).
⇒: Let I ⊆ {1, . . . , n} such that ∑_{i∈I} xi = ∑_{j∈{1,...,n}\I} xj = 1.
Let P = ({ω1, . . . , ωn+1}, {ζ1, . . . , ζn}) with
κ(ωi) = vi for i = 1, . . . , n + 1, and κ(ζi) = ei if i ∈ I and κ(ζi) = yi otherwise, for i = 1, . . . , n,
i. e. we take the “upper” edge (according to Figure 4.4) for indices in I and the lower
edge otherwise. By construction, I(P, κ) = {1, . . . , n}\I and thus, (P, κ) has cost
g(1).
[Figure 4.5: Graphs used in the proof of Theorem 4.6. (a) Instance G, c, µ of the problem of Theorem 4.5; each edge e is labeled c(e), µ(e), namely 0, yi for one of the two parallel edges and xi, 0 for the other. (b) Arising instance G, c, w, d of the Cost-Delay Minimum Path Problem; each edge e is labeled c(e), w(e), µ(e), namely 0, yi, 2yi and xi, 0, 0 respectively.]
⇐: Let (P, κ) be a Steiner path with cost at most g(1). Since f is strictly convex, g is
strictly convex as well. As g′(1) = −f ′(1) + f ′(1) = 0 and g′′ ≥ 0 by convexity, 1 is a
local, and hence global minimum of g.
By strict convexity, this minimum is unique which implies that ∑_{i∈I(P,κ)} xi = 1.

The strict convexity assumption is indeed necessary. If f is a linear function and ∆
and F are as described in Theorem 4.5, the resulting problem is equal to the Standard
Shortest Path Problem which is of course not NP-hard. Analogously to Theorem 4.5 one
can show that the corresponding maximization problem is NP-hard if f is strictly concave.
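For small instances the equivalence used in the proof of Theorem 4.5 can be verified by enumeration. The sketch below assumes f(x) = x², so that f′(1) = 2 and g(x) = 2 · (2 − x) + x² = (x − 1)² + 3; the function names are hypothetical and the enumeration is exponential in n, so this only illustrates the reduction and is not an algorithm for Partition.

```python
from fractions import Fraction
from itertools import product


def reduction_costs(xs, f, f_prime_at_1):
    """Yield (I, cost) for every s-t path in the graph of Figure 4.4.

    I is the set of indices at which the path uses y_i (c = 0, mu = x_i);
    at all other indices it uses e_i (c = f'(1) * x_i, mu = 0).
    """
    n = len(xs)
    for choice in product([False, True], repeat=n):
        I = frozenset(i for i in range(n) if choice[i])
        mu_sum = sum((xs[i] for i in I), Fraction(0))
        rest = sum((xs[i] for i in range(n) if i not in I), Fraction(0))
        yield I, f_prime_at_1 * rest + f(mu_sum) - f(Fraction(0))


def has_partition_via_paths(xs):
    """Decide Partition by asking for a Steiner path of cost at most g(1)."""
    f = lambda x: x * x                                   # strictly convex, f'(1) = 2
    g1 = 2 * (2 - 1) + f(Fraction(1)) - f(Fraction(0))    # g(1) = 3
    return any(cost <= g1 for _, cost in reduction_costs(xs, f, 2))
```

Exact rational arithmetic is used so that the comparison with g(1) is not affected by rounding.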
We can now show that finding a Steiner path minimizing the sum of linear edge costs
and Elmore delay is NP-hard.
Theorem 4.6 The Cost-Delay Minimum Steiner Path Problem is NP-hard if for each
edge e ∈ E(G)
• ∆(e, .) is of the form x ↦ x + µ(e) with µ(e) ≥ 0 and
• F (e, x) is equal to F (e, x) = r(e) · (µ(e)/2 + x) with r(e) ≥ 0.
This is true even if
• G arises from an s-t path by replacing each edge by two parallel edges,
• c(e) · µ(e) = 0 for all e ∈ E(G), and
• c(e) · c(e′) = µ(e) · µ(e′) = r(e) · r(e′) = 0 for all pairs e, e′ of parallel edges.
Proof Let G, c, µ be an instance of the special case of the Cost-Delay Minimum Steiner
Path Problem proved to be NP-hard in Theorem 4.5 where f(x) = x^2. We obtain an instance G, c̄, r, µ̄
of the variant introduced in Theorem 4.6 by defining c̄(e) = c(e), r(e) = µ(e), µ̄(e) = 2µ(e)
(see Figure 4.5).
Let (P, κ) be an s-t Steiner path in G, E′ = {ζ ∈ E(P ) : c(κ(ζ)) = 0}. Let ζ1, . . . , ζk
be the ordering of the edges in E′ as they appear in P .
The cost of (P, κ) w. r. t. the cost function of the variant of Theorem 4.6 is then
$$
\begin{aligned}
&\sum_{\zeta\in E(P)\setminus E'} \bar{c}(\kappa(\zeta)) + \sum_{i=1}^{k} r(\kappa(\zeta_i)) \cdot \Bigl(\frac{\bar{\mu}(\kappa(\zeta_i))}{2} + \sum_{j=i+1}^{k} \bar{\mu}(\kappa(\zeta_j))\Bigr) \\
&\quad= \sum_{\zeta\in E(P)\setminus E'} c(\kappa(\zeta)) + \sum_{i=1}^{k} \bigl(\mu(\kappa(\zeta_i))\bigr)^2 + \sum_{i=1}^{k}\sum_{j=i+1}^{k} 2\,\mu(\kappa(\zeta_i))\,\mu(\kappa(\zeta_j)) \\
&\quad= \sum_{\zeta\in E(P)\setminus E'} c(\kappa(\zeta)) + \Bigl(\sum_{\zeta\in E'} \mu(\kappa(\zeta))\Bigr)^2.
\end{aligned}
$$
This is exactly the cost of (P, κ) w. r. t. the objective function of the variant of Theorem 4.5.
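The identity between the two objective functions can also be checked numerically. The following sketch represents a path simply as a list of per-edge values from s to t; the function names and this list encoding are assumptions made for illustration only. An edge of the instance of Theorem 4.5 is a pair (c, µ) and is mapped to the triple (c, r, µ̄) = (c, µ, 2µ) of the Elmore variant.

```python
def elmore_cost(edges):
    """Linear cost plus Elmore delay of a path.

    edges: list of (c, r, mu_bar) in the order from s to t.
    An edge with resistance r, capacitance mu_bar and downstream capacitance cap
    contributes r * (mu_bar / 2 + cap).
    """
    cap, total = 0.0, 0.0
    for c, r, mu_bar in reversed(edges):
        total += c + r * (mu_bar / 2 + cap)
        cap += mu_bar
    return total


def quadratic_cost(edges):
    """Objective of Theorem 4.5 with f(x) = x**2; edges: list of (c, mu)."""
    return sum(c for c, _ in edges) + sum(mu for _, mu in edges) ** 2


def to_elmore_instance(edges):
    """Transformation from the proof: c stays, r := mu, mu_bar := 2 * mu."""
    return [(c, mu, 2 * mu) for c, mu in edges]
```

For example, the path with edge values (0, 0.5), (1, 0), (0, 0.25), (0, 1) has cost 1 + 1.75² = 4.0625 under both objective functions.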

4.5.2 A Fully Polynomial Time Approximation Scheme
As we do not have a chance to solve the Cost-Delay Minimum Steiner Path Problem
with Loops optimally (unless P = NP) we have to aim for approximation algorithms. In
this section we give an approximation algorithm to compute a Steiner path with total
cost at most 1 + ε times the cost of an optimum solution. The running time will be
polynomial in the size of the instance and 1/ε. Such an algorithm is called a fully polynomial
time approximation scheme (FPTAS) in the literature.
The algorithm itself is similar to the algorithms of Moore [Moo59], Bellman [Bel58],
and Ford [For56] for the minimum cost t-s path problem in the graph arising from G by
reversing all edges. The major difference is that we round costs to powers of 1 + ε and
store labels with minimum capacitance for each cost-class at each node.
For the proof we need the following well-known lemma:
Lemma 4.7 For all x > 1 it holds that e^{1−1/x} ≤ x ≤ e^{x−1}.
Proof The second inequality is a simple consequence of the Mean-Value Theorem which
yields y ∈ [1, x] such that (e^{x−1} − e^{1−1})/(x − 1) = e^{y−1} ≥ 1.
To prove the first inequality we note that the function f(x) = 1 − 1/x − ln(x) has its only
maximum at x = 1. The upper bound f(x) ≤ f(1) = 0 implies ln(x) ≥ 1 − 1/x ⇒ x ≥ e^{1−1/x}.

Theorem 4.8 Let (G,N = {s, t}, c, F,∆) be an instance of the Cost-Delay Minimum
Steiner Path Problem with Loops and k ∈ N. We require that an s-t Steiner path in G with
at most k edges exists. Denote by n := |V (G)| the number of vertices and by m := |E(G)|
the number of edges in G. Let
• cost↓ > 0 be a lower bound on the cost of all Steiner paths with positive cost that end
in t,
• cost↑ > 0 be an upper bound on the cost of all Steiner paths with at most k edges that
end in t.
For each ε > 0 there is an algorithm that finds an s-t Steiner path (P, κ) with |E(P )| ≤ k
and cost at most (1 + ε) times the cost of an optimum s-t Steiner path with at most k edges.
Let θ be the maximum running time for an evaluation of one of the functions F , ∆,
and the restriction of ⌈log_{1+ε/(2k)}(.)⌉ to the interval [cost↓, cost↑]. The running time of the
algorithm is
$$
O\Bigl(k^2 \cdot m \cdot \frac{\log(\mathrm{cost}^{\uparrow}/\mathrm{cost}^{\downarrow})}{\varepsilon} \cdot \theta\Bigr).
$$
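Proof We define δ := ε/(2k). Note that by this choice of δ and Lemma 4.7,
$$
(1+\delta)^k \le e^{\delta k} = e^{\varepsilon/2} = \sum_{i=0}^{\infty} \frac{(\varepsilon/2)^i}{i!} = 1 + \varepsilon \cdot \sum_{i=1}^{\infty} \frac{\varepsilon^{i-1}}{2^i\, i!} \le 1 + \varepsilon \cdot \sum_{i=1}^{\infty} \frac{1}{2^i} = 1 + \varepsilon.
$$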
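As a quick sanity check of this chain of inequalities, the following worked instance plugs in ε = 0.1 and k = 10. These values are chosen only for illustration, and the decimals are rounded.

```latex
\[
\delta = \frac{\varepsilon}{2k} = 0.005, \qquad
(1+\delta)^{k} = 1.005^{10} \approx 1.0511
\;\le\; e^{\delta k} = e^{0.05} \approx 1.0513
\;\le\; 1 + \varepsilon = 1.1 .
\]
```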
We proceed analogously to the algorithms of Moore [Moo59], Bellman [Bel58], and
Ford [For56] and propagate labels (v, i) ∈ V (G) × (Z ∪ {−∞}) associated with a capacitance
cap(v, i) ∈ R≥0. A label (v, i) with cap(v, i) < ∞ corresponds to a v-t Steiner path (P, κ)
in G with
• total cost at most (1 + δ)^i (where (1 + δ)^{−∞} := 0) and
• capacitance of v in (P, κ) at most cap(v, i).
Initially, we only have a label (t,−∞) with cap(t,−∞) = 0 corresponding to the trivial
t-t Steiner path. We iterate over all edges in G k times. When we propagate along
edge e = (v, w) we do the following: For all labels (w, i) at w we create a label (v, i′)
associated with value cap(v, i′) = ∆(e, cap(w, i)).
If label (w, i) corresponds to the w-t Steiner path (P, κ), (v, i′) corresponds to the
following v-t Steiner path (P ′, κ′). P ′ arises from P by adding a new node ν and an edge
ζ between ν and the starting point of P . We set κ′(ν) = v, κ′(ζ) = e, and κ′(ω) = κ(ω)
for ω ∈ V (P ) ∪ E(P ).
The index i′ ∈ Z is chosen minimum such that (1 + δ)^i plus the cost increase induced
by the new edge ζ is upper bounded by (1 + δ)^{i′}. We keep the new label unless we have an
existing label of the form (v, i′) with smaller capacitance.
A formal description of the algorithm can be found in Algorithm 6. We use the notation
(P (v, i), κ(v, i)) to denote the Steiner path associated with a label (v, i).
We use the notation start(P ) to denote the starting point of a path P .
Approximation guarantee: Denote the total cost of a Steiner path (P, κ) by cost(P, κ) and
the capacitance of a node ν ∈ V (P ) in (P, κ) by cap_{(P,κ)}(ν). The properties
cost(P (v, i), κ(v, i)) ≤ (1 + δ)^i and cap_{(P (v,i),κ(v,i))}(start(P (v, i))) ≤ cap(v, i)
for all labels (v, i) are fulfilled by construction.
Let (P, κ) be any s-t Steiner path with vertices V (P ) = {s = νη, νη−1, . . . , ν1, ν0 = t}
and edges E(P ) = {ζη, ζη−1, . . . , ζ1} (in that order) such that η ≤ k. We may assume
that κ(ζ) ≠ ◦ for all ζ ∈ E(P ). For j ∈ {0, . . . , η} we define (Pj , κj) to be the νj-t
sub-Steiner path of (P, κ), i. e. Pj consists of vertices νj , . . . , ν0 and edges ζj , . . . , ζ1.
The function κj is the restriction of κ to V (Pj) ∪ E(Pj).
Instance: An instance (G,N = {s, t}, c, F,∆) of the Cost-Delay Minimum Steiner
Path Problem with Loops, a natural number k.
Output: An s-t Steiner path (P, κ) in G with |E(P )| ≤ k or fail if such a Steiner
path does not exist.
1○ set Q(v) := {(v, i) : i ∈ Z ∪ {−∞}}, cap(v, i) = ∞ for all v ∈ V (G), i ∈ Z ∪ {−∞}
2○ set cap(t,−∞) = 0, P (t,−∞) = ({t}, ∅), κ(t,−∞) = (t ↦ t)
3○ for j = 1 to k do
4○     for e = (v, w) ∈ E(G) do
5○         for labels (w, i) ∈ Q(w) with cap(w, i) < ∞ do
6○             propagate((w, i), e)
7○ if cap(s, i) = ∞ for all (s, i) ∈ Q(s) do
8○     return fail
9○ else
10○     select (s, i) ∈ Q(s) with cap(s, i) ≠ ∞ and i minimum among these labels
11○     return (P (s, i), κ(s, i))
function propagate(label (w, i), edge e = (v, w)):
1○ set x := (1 + δ)^i + c(e) + F (e, cap(w, i))
2○ set i′ := ⌈log_{1+δ}(x)⌉ if x > 0 and i′ := −∞ if x = 0
3○ set P := P (w, i), κ := κ(w, i)
4○ augment P by a new vertex ν and an edge ζ = (ν, start(P ))
5○ set κ(ν) = v, κ(ζ) = e
6○ if cap(v, i′) > ∆(e, cap(w, i)) do
7○     set cap(v, i′) = ∆(e, cap(w, i))
8○     set P (v, i′) = P , κ(v, i′) = κ
Algorithm 6: FPTAS for the Cost-Delay Minimum Steiner Path Problem with Loops. We use
the notation start(P ) to denote the starting point of a path P .
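The label propagation of Algorithm 6 can be sketched in a few lines of Python. This is a minimal illustration under several assumptions of our own, not a tuned implementation: edges are triples (name, tail, head) so that parallel edges and loops can be distinguished, labels are kept in dictionaries instead of arrays, each label stores its path explicitly instead of predecessor information, and every round propagates from a snapshot of the labels so that each stored path has at most k edges. The helper `partition_instance` (the graph of Figure 4.4 with f(x) = x²) and all function names are hypothetical.

```python
import math

NEG_INF = float("-inf")


def fptas_steiner_path(vertices, edges, s, t, c, F, Delta, k, eps):
    """Return (upper bound on the cost, edges of an s-t path) or None."""
    delta = eps / (2 * k)
    log_base = math.log1p(delta)

    def bound(i):
        return 0.0 if i == NEG_INF else (1 + delta) ** i

    def cost_class(x):
        # smallest i with (1 + delta)**i >= x; -inf for x = 0
        if x <= 0:
            return NEG_INF
        i = math.ceil(math.log(x) / log_base)
        while (1 + delta) ** (i - 1) >= x:   # guard against rounding errors
            i -= 1
        while (1 + delta) ** i < x:
            i += 1
        return i

    # labels[v][i] = (capacitance at v, edges of the corresponding v-t path)
    labels = {v: {} for v in vertices}
    labels[t][NEG_INF] = (0.0, ())
    for _ in range(k):
        snapshot = {v: list(lab.items()) for v, lab in labels.items()}
        for e in edges:
            _, v, w = e
            for i, (cap, path) in snapshot[w]:
                i_new = cost_class(bound(i) + c(e) + F(e, cap))
                new_cap = Delta(e, cap)
                old = labels[v].get(i_new)
                if old is None or new_cap < old[0]:   # keep minimum capacitance
                    labels[v][i_new] = (new_cap, (e,) + path)
    if not labels[s]:
        return None
    i_best = min(labels[s])
    return bound(i_best), list(labels[s][i_best][1])


def partition_instance(xs):
    """Graph of Figure 4.4 for f(x) = x**2, i.e. f'(1) = 2 (illustrative helper)."""
    edges = []
    for i in range(len(xs)):
        edges.append((("e", i), i, i + 1))
        edges.append((("y", i), i, i + 1))
    mu = lambda e: xs[e[0][1]] if e[0][0] == "y" else 0.0
    c = lambda e: 2.0 * xs[e[0][1]] if e[0][0] == "e" else 0.0
    F = lambda e, x: (x + mu(e)) ** 2 - x ** 2
    Delta = lambda e, x: x + mu(e)
    return list(range(len(xs) + 1)), edges, c, F, Delta


def path_cost(path, c, F, Delta):
    """Exact cost of a path given by its edges from s to t."""
    cap, total = 0.0, 0.0
    for e in reversed(path):
        total += c(e) + F(e, cap)
        cap = Delta(e, cap)
    return total
```

On the instance built from x = (0.5, 0.5, 0.25, 0.75) with k = 4 and ε = 0.1, the optimum cost is g(1) = 3, and the returned bound lies between the exact cost of the returned path and 3.3.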
Claim: Let j ∈ {0, . . . , η}. After the j-th iteration of the loop in line 3○, Q(κ(νj))
contains a label (κ(νj), i) such that
(1 + δ)^i ≤ cost(Pj , κj) · (1 + δ)^j and cap(κ(νj), i) ≤ cap_{(P,κ)}(νj).
Proof (of the claim) For j = 0 this is trivial. Let j > 0 and assume that after
iteration j − 1 of the loop in line 3○, Q(κ(νj−1)) contains a label (κ(νj−1), i) as
desired. When we call propagate for label (κ(νj−1), i) and edge κ(ζj) in 6○, we set
x := (1 + δ)^i + c(κ(ζj)) + F (κ(ζj), cap(κ(νj−1), i)) and i′ := ⌈log_{1+δ}(x)⌉.
Since
$$
\begin{aligned}
(1+\delta)^{i'} &= (1+\delta)^{\lceil \log_{1+\delta}((1+\delta)^i + c(\kappa(\zeta_j)) + F(\kappa(\zeta_j),\, \operatorname{cap}(\kappa(\nu_{j-1}), i))) \rceil} \\
&\le (1+\delta)^{\lceil \log_{1+\delta}(\operatorname{cost}(P_{j-1},\kappa_{j-1}) \cdot (1+\delta)^{j-1} + c(\kappa(\zeta_j)) + F(\kappa(\zeta_j),\, \operatorname{cap}_{(P,\kappa)}(\nu_{j-1}))) \rceil} \\
&\le \operatorname{cost}(P_j, \kappa_j) \cdot (1+\delta)^j
\end{aligned}
$$
and
$$
\operatorname{cap}(\kappa(\nu_j), i') \le \Delta(\kappa(\zeta_j), \operatorname{cap}(\kappa(\nu_{j-1}), i)) \le \Delta(\kappa(\zeta_j), \operatorname{cap}_{(P,\kappa)}(\nu_{j-1})) = \operatorname{cap}_{(P,\kappa)}(\nu_j),
$$
label (κ(νj), i′) fulfills the conditions of the claim after iteration j of the loop in line 3○.
(claim)
For the optimum s-t Steiner path (P ∗, κ∗) with |E(P ∗)| ≤ k, the claim for j = η
yields a label (s, i) such that
(1 + δ)^i ≤ (1 + δ)^k · cost(P∗, κ∗) ≤ (1 + ε) · cost(P∗, κ∗).
Running time: We start by bounding the number of possible values for i such that a label
(v, i) can have finite capacitance.
By definition of cost↓ and cost↑, the number I of different possible exponents of
(1 + δ) is at most
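$$
k + \bigl\lceil \log_{1+\delta}(\mathrm{cost}^{\uparrow}) \bigr\rceil - \bigl\lceil \log_{1+\delta}(\mathrm{cost}^{\downarrow}) \bigr\rceil + 2
= O\Bigl(k + \frac{\log(\mathrm{cost}^{\uparrow}/\mathrm{cost}^{\downarrow})}{\log(1+\delta)}\Bigr)
= O\Bigl(k + \frac{\log(\mathrm{cost}^{\uparrow}/\mathrm{cost}^{\downarrow})}{\delta}\Bigr).
$$
The last equality uses Lemma 4.7. With x := 1 + δ, the inequality e^{1−1/x} ≤ x of
Lemma 4.7 implies 1 + 1/δ = 1/(1 − 1/(δ + 1)) ≥ 1/ln(1 + δ) and hence, 1/log(1 + δ) = O(1/δ).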
To perform propagate in time O(θ) we store all labels (v, i) with v ∈ V (G)
and i ∈ {⌈log_{1+δ}(cost↓)⌉ − 1, . . . , ⌈log_{1+δ}(cost↑)⌉ + k} ∪ {−∞} with corresponding
capacitances in |V (G)| arrays of size I (we do not store the other labels). Instead of
storing Steiner paths directly we store the input to propagate of the call in which
a label has been updated as predecessor information.
We obtain a total running time of
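$$
O(k m I \theta) = O\Bigl(k^2 m \theta + k m \theta \cdot \frac{\log(\mathrm{cost}^{\uparrow}/\mathrm{cost}^{\downarrow})}{\delta}\Bigr) = O\Bigl(k^2 m \theta \cdot \frac{\log(\mathrm{cost}^{\uparrow}/\mathrm{cost}^{\downarrow})}{\varepsilon}\Bigr).
$$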

Corollary 4.9 Denote cost↓, cost↑, θ, m, n as in Theorem 4.8 and assume that log(cost↑/cost↓)
is polynomially bounded in the input size. There is an FPTAS for the Minimum Cost
Buffered Steiner Path Problem with running time
$$
O\Bigl(m^3 \cdot n^2 \cdot \frac{\log(\mathrm{cost}^{\uparrow}/\mathrm{cost}^{\downarrow})}{\varepsilon} \cdot \theta\Bigr).
$$

The condition that log(cost↑/cost↓) is polynomially bounded is naturally fulfilled if the
functions ∆(e, .) are either constant or of the form x ↦ x + µ(e), and if the result F (e, x)
is of polynomial size (i. e. can be expressed by a polynomial number of bits) for all input
values x of polynomial size. This is the case when we model the problem of computing a
cheapest path minimizing linear costs and costs for Elmore delay.
Computing ⌈log_{1+δ}(x)⌉ for values x of polynomial size can be done by first computing an
approximation y of log_{1+δ}(x) with a bounded absolute error β > 0 and returning
min{i : (1 + δ)^i ≥ x, i ∈ N ∩ {⌊y − β⌋, . . . , ⌈y + β⌉}}.
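The same idea can be sketched with floating-point arithmetic: compute an approximate logarithm, step safely below it, and then search upwards for the smallest valid exponent. The function name `ceil_log` and the default error bound `beta = 1.0` are assumptions for illustration. The two guard loops make the result independent of the exact quality of the approximation, as long as powers of 1 + δ are evaluated monotonically.

```python
import math


def ceil_log(x, delta, beta=1.0):
    """Return the smallest integer i with (1 + delta)**i >= x, for x > 0."""
    if x <= 0:
        raise ValueError("x must be positive")
    y = math.log(x) / math.log1p(delta)   # approximation of log_{1+delta}(x)
    i = math.floor(y - beta)              # start below the true value
    while (1 + delta) ** i >= x:          # guard in case the start was still too high
        i -= 1
    while (1 + delta) ** i < x:           # smallest exponent that reaches x
        i += 1
    return i
```

For δ = 0.5, for instance, `ceil_log(2.25, 0.5)` evaluates to 2 because 1.5² = 2.25, whereas any slightly larger x yields 3.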
4.5.3 Unbuffered Non-Linear Steiner Paths
Although the main motivation of this thesis is its application in buffering, the Cost-Delay
Minimum Steiner Path Problem is interesting for other applications, too.
• A runner might get tired at some point and will get slower the more distance she or
he has travelled.
• Millions of commuters get up early every day to arrive at their jobs before rush hour.
The price for using a street depends on the traffic volume on it that is increasing in
time.
• In later design steps of VLSI design we might want to find a path with minimum
Elmore delay without making further design changes. In particular, we do not want
to insert or remove repeaters.
In all of these applications the cost of traversing an edge is dependent on the path
travelled before or afterwards. This can easily be modeled as a Cost-Delay Minimum Steiner
Path Problem. We show how to use the property ∆(e, x) ≥ x for all e ∈ E(G), x ∈ R≥0 to
reduce the running time of the FPTAS presented in Theorem 4.8 significantly. Note that
this property is indeed provided in all applications mentioned above.
The following theorem is joint work with Nicolai Hähnle.
Theorem 4.10 Let (G,N = {s, t}, c, F,∆) be an instance of the Cost-Delay Minimum
Steiner Path Problem (i. e. G does not contain loops, in particular ∆(e, x) ≥ x for all
e ∈ E(G), x ∈ R≥0).
As in Theorem 4.8 denote by n := |V (G)| the number of vertices and by m := |E(G)|
the number of edges in G. Let cost↓ > 0 be a lower bound on the cost of all Steiner paths
with positive cost ending in t and let cost↑ > 0 be an upper bound on the cost of all Steiner
paths ending in t. Also, let θ be defined as in Theorem 4.8.
If log(cost↑/cost↓) is polynomially bounded in the input size, there is an FPTAS for
the Cost-Delay Minimum Steiner Path Problem with running time
$$
O\Bigl(n \cdot \frac{\log(\mathrm{cost}^{\uparrow}/\mathrm{cost}^{\downarrow})}{\varepsilon} \cdot \Bigl(m\theta + n \cdot \log\Bigl(n \cdot \frac{\log(\mathrm{cost}^{\uparrow}/\mathrm{cost}^{\downarrow})}{\varepsilon}\Bigr)\Bigr)\Bigr).
$$
Proof The proof is similar to the proof of Theorem 4.8. Instead of processing labels like
in the algorithms of Moore [Moo59], Bellman [Bel58], and Ford [For56], we proceed in a
Dijkstra [Dij59] manner, i. e. we store non-permanent labels in a heap and propagate from
a “minimum” label in each iteration. To compare two labels (v, i) and (v′, i′) we say that
(v, i) < (v′, i′) if i < i′, or i = i′ and cap(v, i) < cap(v′, i′).
A full description of the algorithm can be found in Algorithm 7.
Approximation guarantee: Note that by the non-negativity of F and c, the monotonicity
of the F (e, .) and ∆(e, .), and by the property ∆(e, x) ≥ x for all e ∈ E(G) and
x ∈ R≥0, the labels chosen in 5○ are always non-decreasing with respect to the
relation ≤.
Let (P, κ) be any s-t Steiner path with V (P ) = {s = νη, νη−1, . . . , ν0 = t}, E(P ) =
{ζη, ζη−1, . . . , ζ1} (in that order). As in the proof of Theorem 4.8 let (Pj , κj) be the
νj-t sub-Steiner path of (P, κ) for j = 0, . . . , η.
We also reuse the notation cost(P, κ) to denote the cost of a path (P, κ) and cap_{(P,κ)}(ω)
to denote the capacitance of a node ω ∈ V (P ) in path (P, κ).
Instance: An instance (G,N = {s, t}, c, F,∆) of the Cost-Delay Minimum Steiner
Path Problem.
Output: An s-t Steiner path (P, κ) in G.
1○ set Q(v) := {(v, i) : i ∈ Z ∪ {−∞}}, cap(v, i) = ∞ for all v ∈ V (G), i ∈ Z ∪ {−∞}
2○ set cap(t,−∞) = 0, P (t,−∞) = ({t}, ∅), κ(t,−∞) = (t ↦ t)
3○ set S := ∅
4○ while ⋃_{v∈V (G)} Q(v) ≠ ∅ do
5○     select (w, i) ∈ (⋃_{v∈V (G)} Q(v)) \ S such that (i, cap(w, i)) is lex. minimum
6○     if w = s do
7○         return (P (w, i), κ(w, i))
8○     set S = S ∪ {(w, i)}
9○     for e = (v, w) ∈ δ−(w) do
10○         propagate((w, i), e) # function from Algorithm 6
Algorithm 7: FPTAS for the Cost-Delay Minimum Steiner Path Problem.
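A Dijkstra-style variant along the lines of Algorithm 7 might look as follows. This sketch makes several assumptions that are not part of the algorithm above: it uses Python's binary heap from `heapq` with lazy deletion of outdated entries instead of a Fibonacci heap with decrease-key, it chooses δ = ε/(2n), it stores paths explicitly, edges are triples (name, tail, head), and the optional parameter `cost_upper` (playing the role of cost↑) prunes labels so that the search terminates even if s is not reachable. All names are hypothetical.

```python
import heapq
import math

NEG_INF = float("-inf")


def dijkstra_fptas(vertices, edges, s, t, c, F, Delta, eps, cost_upper=math.inf):
    """Return (upper bound on the cost, edges of an s-t path) or None."""
    delta = eps / (2 * max(1, len(vertices)))
    log_base = math.log1p(delta)

    def bound(i):
        return 0.0 if i == NEG_INF else (1 + delta) ** i

    def cost_class(x):
        # smallest i with (1 + delta)**i >= x; -inf for x = 0
        if x <= 0:
            return NEG_INF
        i = math.ceil(math.log(x) / log_base)
        while (1 + delta) ** (i - 1) >= x:
            i -= 1
        while (1 + delta) ** i < x:
            i += 1
        return i

    incoming = {v: [] for v in vertices}
    for e in edges:
        incoming[e[2]].append(e)

    best = {(t, NEG_INF): 0.0}            # smallest known capacitance per label
    heap = [(NEG_INF, 0.0, 0, t, ())]     # (class, capacitance, tie-breaker, node, path)
    permanent = set()
    pushes = 1
    while heap:
        i, cap, _, w, path = heapq.heappop(heap)
        if (w, i) in permanent or cap > best[(w, i)]:
            continue                      # outdated heap entry
        if w == s:
            return bound(i), list(path)
        permanent.add((w, i))
        for e in incoming[w]:
            x = bound(i) + c(e) + F(e, cap)
            if x > cost_upper:
                continue
            key = (e[1], cost_class(x))
            new_cap = Delta(e, cap)
            if key not in permanent and new_cap < best.get(key, math.inf):
                best[key] = new_cap
                heapq.heappush(heap, (key[1], new_cap, pushes, e[1], (e,) + path))
                pushes += 1
    return None
```

The tie-breaking counter in the heap entries only prevents Python from comparing node names or paths; the selection order is still lexicographic in (i, cap(w, i)).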
Claim: One of the following conditions holds for all 0 ≤ j ≤ η:
1. At some point of the algorithm there exists a label (κ(νj), i) ∈ Q(κ(νj)) s. t.
(1 + δ)^i ≤ cost(Pj , κj) · (1 + δ)^j and cap(κ(νj), i) ≤ cap_{(P,κ)}(νj).
2. j > 0 and the algorithm exits with a label (s, i) with
(1 + δ)^i ≤ cost(Pj−1, κj−1) · (1 + δ)^{j−1}.
Proof (of the claim) The proof of the claim is similar to the proof of the claim
used in the proof of Theorem 4.8.
For j = 0 this is trivial. Let j > 0 and assume that there exists a label (κ(νj−1), i)
fulfilling the first condition (if the second condition holds for j − 1, it automatically
holds for all larger indices and we are done as F is non-negative).
First, assume that (κ(νj−1), i) is selected in 5○. After we call propagate with input
label (κ(νj−1), i) and input edge κ(ζj), there must be a label (κ(νj), i′) with
(1 + δ)^{i′} ≤ cost(Pj , κj) · (1 + δ)^j and cap(κ(νj), i′) ≤ cap_{(P,κ)}(νj).
The proof of this property is equal to the proof of the claim used in Theorem 4.8.
If (κ(νj−1), i) is never selected, the algorithm must have stopped before. By
monotonicity, the exit label (s, ī) must satisfy ī ≤ i and we are in Case 2. (claim)
As in Theorem 4.8 the claim implies the approximation ratio.
Running time: As seen in the proof of the running time of Theorem 4.8 the number I of
labels in Q(v) with finite capacitance is
$$
O\Bigl(n + \frac{\log(\mathrm{cost}^{\uparrow}/\mathrm{cost}^{\downarrow})}{\delta}\Bigr) = O\Bigl(\frac{\log(\mathrm{cost}^{\uparrow}/\mathrm{cost}^{\downarrow})}{\delta}\Bigr).
$$
Since we never select from the set S of permanent labels in 5○, we conclude that
propagate is called at most I times for each edge.
To find minimum labels in 5○, we use a Fibonacci heap [FT87]. During the whole
algorithm, we have O(nI) insert and deletemin operations as well as mI decrease
key operations and thus, the total running time for all heap operations is O(I · (m +
n · log(n · I))). Putting everything together we obtain a running time of
$$
O\bigl(I \cdot (m + n \cdot \log(n \cdot I)) + I m \theta\bigr)
= O\Bigl(n \cdot \frac{\log(\mathrm{cost}^{\uparrow}/\mathrm{cost}^{\downarrow})}{\varepsilon} \cdot \Bigl(m\theta + n \cdot \log\Bigl(n \cdot \frac{\log(\mathrm{cost}^{\uparrow}/\mathrm{cost}^{\downarrow})}{\varepsilon}\Bigr)\Bigr)\Bigr).
$$

4.6 Cost-Delay Minimum Steiner Trees with Fixed Topologies
In this section we deal with the question of how to solve the Cost-Delay Minimum Steiner
Tree Problem with Loops for nets N that consist of more than two pins.
By the result of Chuzhoy et al. [Chu+05] cited in Section 4.3.1 there is no o(log log |N |)-
approximation algorithm unless every problem in NP can be solved in O(n^{log log log n}) time.
This holds even if F (e, .) is a constant function for each edge e ∈ E(G) and all criticalities
λ(t) are equal. The best known approximation algorithm for this special case is due to
Meyerson, Munagala and Plotkin [MMP00] and has approximation ratio O(log(|N |)).
We will now show how to approximate the Cost-Delay Minimum Steiner Tree Problem
with Loops if we already know the topology of the output.
By using the results of Sections 4.5.2 and 4.5.3 we will combine embeddings of so-called
sub-topologies to compute an overall embedding.
Definition 4.11 Let T be a topology for N and v̂ ∈ V (T ). The sub-topology T (v̂) of T
with source v̂ is the graph with vertex set
V (T (v̂)) := {ŵ ∈ V (T ) : v̂ ∈ V (T[s,ŵ])}
and edge set
E(T (v̂)) := {(û, ŵ) ∈ E(T ) : û, ŵ ∈ V (T (v̂))}.
An illustration of a sub-topology can be found in Figure 4.6. Note that for a node û,
û + (û, v̂) + T (v̂) is again a topology.
Theorem 4.12 Let G,N,F, c,∆, λ be an instance of the Cost-Delay Minimum Steiner
Tree Problem with Loops and let T be a topology for N . Let
• cost↓ > 0 be a lower bound on the cost of a Steiner path with positive costs in G
• cost↑ be an upper bound on the cost of any Steiner tree for N in G.
Let θ be the maximum running time for one evaluation of F , ∆, and the restriction of
⌈log_{1+ε/(2|N |(k+2))}(.)⌉ to the interval [cost↓, cost↑].
For each ε > 0 there is an algorithm that computes an embedding (A, κ) of T with cost
at most 1 + ε times the cost of an optimum embedding of T . The running time is
$$
O\Bigl(|N|^2 \cdot \Bigl(\Bigl(m + \frac{n|N|}{\varepsilon}\Bigr)\, m^2 \cdot n^2 \cdot \frac{\log(\mathrm{cost}^{\uparrow}/\mathrm{cost}^{\downarrow})}{\varepsilon} \cdot \theta\Bigr)\Bigr).
$$
By the second fact of Lemma 2.7, this theorem provides an FPTAS for instances of
the Cost-Delay Minimum Steiner Tree Problem with Loops with |N | constant and for
which log(cost↑/cost↓) is polynomially bounded in the input size: enumerate and embed
all possible topologies.
In Sections 4.6.1 and 4.6.2 we prove Theorem 4.12. The high-level idea is to obtain
an embedding of T by composition of embeddings of the edges in T . As in the FPTAS of
Theorem 4.8 we represent solution candidates by labels.
We identify nodes of T with nodes in the output that we produce. The reason for that
is that we want to shorten notation by not mentioning the function φ of Definition 2.6
explicitly.
4.6.1 Preparation for the Proof of Theorem 4.12
Representing embeddings of sub-topologies by labels
For an edge (v̂, ŵ) ∈ E(T ) we will create labels (u, i)_{(v̂,ŵ)} for all u ∈ V (G) and i ∈ Z ∪ {−∞}.
Labels are associated with capacitances cap_{(v̂,ŵ)}(u, i) and if this value is finite, also with
an embedding (A, κ) of the arborescence v̂ + (v̂, ŵ) + T (ŵ) with
• cost at most (1 + δ)^i,
• κ(v̂) = u, and
• cap_{(A,κ)}(v̂) ≤ cap_{(v̂,ŵ)}(u, i).
(4.2)
Merging labels at the same graph node
Let ŵ be a Steiner node of T with δ+_T (ŵ) = {(ŵ, ŵ1), (ŵ, ŵ2)}. At ŵ we have to merge
labels representing embeddings for T (ŵ1) and T (ŵ2). Let
$$
M'(\hat{w}) := \Bigl\{ \bigl\langle (u, i_1)_{(\hat{w},\hat{w}_1)}, (u, i_2)_{(\hat{w},\hat{w}_2)} \bigr\rangle : u \in V(G),\ \operatorname{cap}_{(\hat{w},\hat{w}_1)}(u, i_1) < \infty,\ \operatorname{cap}_{(\hat{w},\hat{w}_2)}(u, i_2) < \infty \Bigr\}
$$
be the set of all label pairs with finite capacitance. We say that a pair
⟨(u, i1)_{(ŵ,ŵ1)}, (u, i2)_{(ŵ,ŵ2)}⟩ is δ-dominated by a pair ⟨(u, k1)_{(ŵ,ŵ1)}, (u, k2)_{(ŵ,ŵ2)}⟩ if
$$
\bigl\lceil \log_{1+\delta}((1+\delta)^{i_1} + (1+\delta)^{i_2}) \bigr\rceil \ge \bigl\lceil \log_{1+\delta}((1+\delta)^{k_1} + (1+\delta)^{k_2}) \bigr\rceil
$$
and
$$
\operatorname{cap}_{(\hat{w},\hat{w}_1)}(u, i_1) + \operatorname{cap}_{(\hat{w},\hat{w}_2)}(u, i_2) \ge \operatorname{cap}_{(\hat{w},\hat{w}_1)}(u, k_1) + \operatorname{cap}_{(\hat{w},\hat{w}_2)}(u, k_2).
$$
Let Mδ(ŵ) be a maximal subset of M′(ŵ) with the property that no label pair in Mδ(ŵ)
is δ-dominated by another pair in Mδ(ŵ).
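To make the definition concrete, the following sketch computes such a maximal set for a single graph node u by enumerating all label pairs and keeping the Pareto-optimal ones with respect to merged cost class and total capacitance. This is a naive baseline that inspects all pairs, not a faster procedure; the function names and the dictionary representation of the labels (index ↦ finite capacitance) are assumptions made for illustration.

```python
import math

NEG_INF = float("-inf")


def merged_class(i1, i2, delta):
    """ceil(log_{1+delta}((1+delta)**i1 + (1+delta)**i2)), with (1+delta)**(-inf) := 0."""
    power = lambda i: 0.0 if i == NEG_INF else (1 + delta) ** i
    x = power(i1) + power(i2)
    if x == 0.0:
        return NEG_INF
    g = math.ceil(math.log(x) / math.log1p(delta))
    while (1 + delta) ** (g - 1) >= x:   # guard against rounding errors
        g -= 1
    while (1 + delta) ** g < x:
        g += 1
    return g


def non_dominated_pairs(labels1, labels2, delta):
    """Return index pairs (i1, i2) forming a maximal set without delta-domination.

    labels1, labels2: dicts mapping a label index to its finite capacitance
    for the two outgoing topology edges at the same graph node.
    """
    candidates = sorted(
        (merged_class(i1, i2, delta), cap1 + cap2, i1, i2)
        for i1, cap1 in labels1.items()
        for i2, cap2 in labels2.items()
    )
    result, best_cap = [], math.inf
    for _, cap_sum, i1, i2 in candidates:
        if cap_sum < best_cap:   # strictly less capacitance than every cheaper pair
            result.append((i1, i2))
            best_cap = cap_sum
    return result
```

With δ = 1, labels {0: 5, 1: 3} for the first edge and {0: 4, 2: 1} for the second, the pair (0, 2) is dropped because the pair (1, 2) has the same merged class 3 but smaller total capacitance.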
As seen in the running time proof of Theorem 4.8 we have up to O(log(cost↑/cost↓)/δ)
labels at a vertex. This leads to a set M′(ŵ) of size O(n · (log(cost↑/cost↓)/δ)^2). The next
lemma tells us that we can compute Mδ(ŵ) in a faster running time.
Lemma 4.13 Using the notation of Theorem 4.12 the set Mδ(ŵ) has size
O(n · log(cost↑/cost↓)/δ) and can be computed in time O(n · log(cost↑/cost↓)/δ^2 · θ).
Proof Fix $u \in V(G)$ and write $l_i := (u, i)_{(”w,”w_1)}$ and $l'_i := (u, i)_{(”w,”w_2)}$ for the labels at u.
Let $z < Z \in \mathbb{Z}$ such that all labels $l_i$ and $l'_i$ with $i \notin \{z, \dots, Z\} \cup \{-\infty\}$ have
infinite capacitance.
Using bucket sort we can sort the $l_i$ and $l'_i$ by i in O(Z − z) time. We can assume
that for i < j the cap-value of $l_i$ (respectively of $l'_i$) is not smaller than the cap-value of $l_j$
(respectively of $l'_j$). If this is not the case, we erase $l_j$ (respectively $l'_j$) as it can never be contained in a
non-δ-dominated label pair.
For simplicity we assume that we have labels
$l_{-\infty}, l_z, l_{z+1}, \dots, l_Z$ and $l'_{-\infty}, l'_z, l'_{z+1}, \dots, l'_Z$.
If label $l_{-\infty}$ does not exist, we create a dummy label $l_{-\infty}$ with cap-value ∞. If a label $l_i$
is missing but $l_{i-1}$ exists, we may set $l_i := l_{i-1}$ (respectively $l_i := l_{-\infty}$ for i = z). Similarly
for the $l'_i$.
For $i, j \in \mathbb{Z} \cup \{-\infty\}$ define $g(i, j) := \bigl\lceil \log_{1+\delta}\bigl((1+\delta)^i + (1+\delta)^j\bigr) \bigr\rceil$. To create the set
$M_\delta(”w)$ we find labels $l_i$, $l'_j$ for each $\beta \in \{-\infty, z, z+1, \dots, Z\}$ such that $g(i, j) = \beta$ and i
and j are maximal in the sense that $g(i+1, j),\ g(i, j+1) > \beta$. By exchanging the roles of the
$l_i$ and $l'_j$ it suffices to show how to find i and j with i ≥ j.
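If $\beta = -\infty$, $i = j = -\infty$ is the solution. Let $\beta \ge z$. If $i \le \bigl\lfloor \beta - 1 - \frac{1}{\log(1+\delta)} \bigr\rfloor - 1$, then
\[
g(i, j) \le \left\lceil \frac{\log\Bigl(2 \cdot (1+\delta)^{\beta - 2 - \frac{1}{\log(1+\delta)}}\Bigr)}{\log(1+\delta)} \right\rceil
= \left\lceil \frac{1 + \bigl(\beta - 2 - \frac{1}{\log(1+\delta)}\bigr) \log(1+\delta)}{\log(1+\delta)} \right\rceil
\le \frac{1}{\log(1+\delta)} + \Bigl(\beta - 2 - \frac{1}{\log(1+\delta)}\Bigr) + 1 = \beta - 1.
\]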
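Hence, i must lie between $\bigl\lfloor \beta - 1 - \frac{1}{\log(1+\delta)} \bigr\rfloor$ and β, and there are at most
$O\bigl(\frac{1}{\log(1+\delta)}\bigr) = O\bigl(\frac{1}{\delta}\bigr)$ possible choices for i.
With $r := \log_{1+\delta}\bigl((1+\delta)^{\beta} - (1+\delta)^{i}\bigr)$ we have $\log_{1+\delta}\bigl((1+\delta)^{i} + (1+\delta)^{r}\bigr) = \beta$. This implies
that $l_i$ and $l'_{\lfloor r \rfloor}$ is a label pair with $g(i, \lfloor r \rfloor) = \beta$ and $i$, $\lfloor r \rfloor$ maximal.
As in the running time proof of Theorem 4.8 it follows that $Z - z = O\bigl(\frac{\log(\mathrm{cost}^{\uparrow}/\mathrm{cost}^{\downarrow})}{\delta}\bigr)$,
which implies the lemma. □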
The topology embedding algorithm
Let k be an upper bound on the number of edges in any optimum Steiner path between
two distinct nodes in G. Let $\delta := \frac{\varepsilon}{2 \cdot (|N| \cdot (k+2))}$.
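The role of this choice can be made explicit by a short calculation. The following worked step is a sketch under two assumptions of ours that are not stated at this point: that δ equals ε/(2·|N|·(k+2)) and that 0 < ε ≤ 1.

```latex
\[
(1+\delta)^{|N|\cdot(k+2)}
  \;\le\; e^{\delta \cdot |N| \cdot (k+2)}
  \;=\; e^{\varepsilon/2}
  \;\le\; 1+\varepsilon .
\]
% First step: 1 + x <= e^x for all real x.
% Last step: x -> e^{x/2} is convex, equals 1 + x at x = 0, and
% e^{1/2} < 2 = 1 + x at x = 1, so e^{x/2} <= 1 + x on [0, 1].
```

This is the bound that is needed when the accumulated rounding error over at most |N| levels of the topology, with at most k + 2 edges per Steiner path, is compared with the desired precision.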
We embed edges of T into G in reverse topological ordering, i. e. we start with the
edges entering the pins in N\{s} and end with the edge leaving the source s of N . During
the embeddings we create labels of the form (u, i)(”v, ”w) with u ∈ V (G), i ∈ Z ∪ {−∞},
(”v, ”w) ∈ E(T), as described before.
To produce these labels we run Algorithm 6 on Page 57 until line 6○ (including 6○).
We refer to this algorithm as the modified algorithm.
If we embed an edge (”v, t) ∈ E(T) entering a sink t ∈ N\{s}, we run the modified
algorithm on instance $(G, \{”v, t\}, c, \lambda(t) \cdot F, \Delta)$ with precision $\varepsilon/|N|$ (”v is considered the source
of the dummy net {”v, t} and is ignored by the modified algorithm). As parameter bounding
the length of the longest path we choose k + 2. With this choice, the δ-value in the proof
of Theorem 4.8 coincides with the δ-value in this algorithm. For u ∈ V(G)\{t} the output
labels $(u, i) = (u, i)_{(”v,t)}$ are already as desired. For label (t, −∞) the corresponding path
(P, κ) might consist of zero edges. We extend it to a path with one edge ζ with κ(ζ) = ◦.
Labels (t, i) with i > −∞ are not needed.
Figure 4.6: Sub-topologies of T in the proof of Theorem 4.12. T(”w) consists of ”w + (”w, ”w1) + T(”w1)
and ”w + (”w, ”w2) + T(”w2). To find an embedding of T = T(”v) we combine embeddings of the
topologies ”w + (”w, ”wi) + T(”wi) with a u-u′ Steiner path. (In the figure, ”v is placed at u ∈ V(G) and ”w is placed at u′ ∈ V(G).)
If we embed an edge (”v, ”w) ∈ E(T) such that ”w is a Steiner point (in T) with outgoing
edges (”w, ”w1), (”w, ”w2), we do the following. First, we create the set $M_\delta(”w)$. Let G′(”w)
arise from G by adding a new vertex w and by inserting an edge (u, w) for each label pair
$\langle (u, i_1)_{(”w,”w_1)}, (u, i_2)_{(”w,”w_2)} \rangle \in M_\delta(”w)$. Edge (u, w) gets properties
\begin{align*}
c((u, w)) &= (1+\delta)^{\lceil \log_{1+\delta}((1+\delta)^{i_1} + (1+\delta)^{i_2}) \rceil},\\
F((u, w), x) &= x,\\
\Delta((u, w), x) &= x + \mathrm{cap}_{(”w,”w_1)}(u, i_1) + \mathrm{cap}_{(”w,”w_2)}(u, i_2).
\end{align*}
Let $\lambda := \sum_{t \in N \cap V(T(”v))} \lambda(t)$. We run the modified algorithm on instance
$(G'(”w), \{”v, w\}, c, \lambda \cdot F, \Delta)$ with precision $\varepsilon/|N|$ (”v is considered the source of net {”v, w} and is again ignored by
the modified algorithm). As parameter bounding the length of the output path we choose
k + 2, which is an upper bound on the number of edges in any optimum path in G′(”w).
Let u ∈ V (G). The algorithm provides labels of the form (u, i) associated with
a u-w Steiner path (P, κ) with cost at most (1 + δ)i. We create label (u, i)(”v,”w) with
cap(”v, ”w)(u, i) = cap(u, i). With ζ denoting the edge entering w in P , this label corresponds
to the embedding of ”v + (”v, ”w) + T ( ”w) that consists of the combination of (P − ζ, κ) and
the embeddings for ”w + (”w, ”w1) + T ( ”w1) and ”w + (”w, ”w2) + T (”w2). See Figure 4.6 for an
illustration.
If we embed the edge (s, ”w) leaving the source, the topology s+ (s, ”w) + T (”w) is equal
to T . We return the Steiner tree associated with the label (s, i)(s, ”w) for which i is minimum.
4.6.2 Proof of Theorem 4.12
We run the algorithm described in the previous section. As parameter k bounding the
number of edges in every possibly optimum Steiner path we choose k = m · n, which is a
feasible choice by Lemma 4.2.
First, observe that the topology embeddings associated with labels created during
the algorithm have Properties (4.2). For labels created during the embedding of an edge
entering a sink this is true by the proof of Theorem 4.8. For the other labels this is true by
construction of the modified graphs G′( ”w) with ”w ∈ V (T )\N (and again by Theorem 4.8).
Proof of approximation guarantee
To prove the approximation guarantee denote the number of edges in the longest path in
T leaving a vertex ”v ∈ V (T ) by pi(”v).
Claim: For all edges (”v, ”w) ∈ E(T), all nodes u ∈ V(G), and any embedding of ”v + (”v, ”w) + T(”w)
• for which the number of edges in each Steiner path is at most k,
• for which ”v is mapped to u by the corresponding κ function,
• with cost c, and
• for which the capacitance of ”v is χ,
we create a label $(u, i)_{(”v,”w)}$ such that $(1+\delta)^i \le c \cdot (1+\delta)^{\mathrm{pi}(”v) \cdot (k+2)}$ and $\mathrm{cap}_{(”v,”w)}(u, i) \le \chi$.
Proof (of the claim) For edges (”v, ”w) with ”w ∈ N\{s} we have already shown this in
the proof of Theorem 4.8. Assume that ”w is a Steiner point with successors ”w1 and ”w2.
Let u ∈ V (G) and let (B∗, κ∗) be any embedding of ”v + ( ”v, ”w) + T (”w) with κ∗( ”v) = u
and for which the total number of edges in each path is at most k. Let c be the cost of
(B∗, κ∗) and let χ be the capacitance of ”v in (B∗, κ∗).
Embedding (B∗, κ∗) consists of
• an embedding (Br, κr) of topologies ”w + ( ”w, ”wr) + T (”wr) with κ∗( ”w) = κr( ”w)
(r = 1, 2),
• an embedding of ({ ”v, ”w}, {( ”v, ”w)}) that corresponds to an u-w Steiner path (P ∗, κP ∗)
in the modified graph G′(”w) constructed during the algorithm before embedding
( ”v, ”w). This Steiner path has at most k + 1 edges. Node w is the artificial node in
G′( ”w)\V (G).
Denote the total cost of $(B_r, \kappa_r)$ by $c_r$ (r = 1, 2). By induction we may assume that for
r = 1, 2 we have created a label $(u', j_r)_{(”w,”w_r)}$ that satisfies the claim for (”w, ”wr) ∈ E(T),
node u′ := κ∗(”w), and embedding $(B_r, \kappa_r)$. In particular,
\[ (1+\delta)^{j_1} \le (1+\delta)^{\mathrm{pi}(”w) \cdot (k+2)} \cdot c_1, \text{ and} \qquad (4.3) \]
\[ (1+\delta)^{j_2} \le (1+\delta)^{\mathrm{pi}(”w) \cdot (k+2)} \cdot c_2. \qquad (4.4) \]
Without loss of generality we may assume that
$\langle (u', j_1)_{(”w,”w_1)}, (u', j_2)_{(”w,”w_2)} \rangle$ belongs to
the set $M_\delta(”w)$ we computed before embedding (”v, ”w) and the last edge ζ of P∗ corresponds
to that label pair.
By the proof of Theorem 4.8 we produce a label (u, i) with cap(u, i) ≤ χ and such that
$(1+\delta)^i$ is upper bounded by $(1+\delta)^{|E(P^*)|} \le (1+\delta)^{k+1}$ times the cost of P∗. Since P∗
consists of ζ plus a u-u′ Steiner path with cost $c - c_1 - c_2$, it holds that
\[
(1+\delta)^i \le (1+\delta)^{k+1} \cdot \Bigl( (1+\delta)^{\lceil \log_{1+\delta}((1+\delta)^{j_1} + (1+\delta)^{j_2}) \rceil} + (c - c_1 - c_2) \Bigr).
\]
By a simple calculation we verify that
\[
(1+\delta)^{\lceil \log_{1+\delta}((1+\delta)^{j_1} + (1+\delta)^{j_2}) \rceil} \le (1+\delta) \cdot \bigl((1+\delta)^{j_1} + (1+\delta)^{j_2}\bigr).
\]
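The calculation behind this inequality is short. As a sketch, with the abbreviation x (a symbol introduced here only for this step) for the logarithm inside the ceiling:

```latex
\[
x := \log_{1+\delta}\bigl((1+\delta)^{j_1} + (1+\delta)^{j_2}\bigr),
\qquad \lceil x \rceil \le x + 1,
\]
\[
(1+\delta)^{\lceil x \rceil} \;\le\; (1+\delta)^{x+1}
  \;=\; (1+\delta)\cdot(1+\delta)^{x}
  \;=\; (1+\delta)\cdot\bigl((1+\delta)^{j_1} + (1+\delta)^{j_2}\bigr).
\]
% The middle step uses that t -> (1+\delta)^t is increasing for \delta > 0.
```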
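Consequently,
\begin{align*}
(1+\delta)^i &\le (1+\delta)^{k+1} \cdot \Bigl( (1+\delta) \cdot \bigl((1+\delta)^{j_1} + (1+\delta)^{j_2}\bigr) + (c - c_1 - c_2) \Bigr)\\
&\le (1+\delta)^{k+2} \cdot \Bigl( (1+\delta)^{\mathrm{pi}(”w) \cdot (k+2)} \cdot (c_1 + c_2) + (c - c_1 - c_2) \Bigr)\\
&\le (1+\delta)^{(\mathrm{pi}(”w)+1) \cdot (k+2)} \cdot c.
\end{align*}
The second inequality uses Inequalities (4.3) and (4.4). Together with the observation
pi(”v) ≥ pi(”w) + 1 this proves the claim. □ (claim)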
Let (s, ”w) be the edge leaving s in T and let OPT be the cost of an optimum embedding
of T. The claim yields a label $(s, i)_{(s,”w)}$ such that
\[
(1+\delta)^i \le (1+\delta)^{\mathrm{pi}(s) \cdot (k+2)} \cdot \mathrm{OPT} < (1+\delta)^{|N| \cdot (k+2)} \cdot \mathrm{OPT} \le (1+\varepsilon) \cdot \mathrm{OPT}.
\]
Hence, the described algorithm is indeed an FPTAS.
Proof of running time
To achieve the claimed running time we have to speed up the modification of Algorithm 6
when we use it on instances with the modified graphs G′(”w). As the artificial node
w ∈ V(G′(”w))\V(G) has no outgoing edges, the edges entering w can only occur at the
end of any path with endpoint w. Hence, we need not consider them in Step 4○ of
Algorithm 6 if the iteration counter i in Step 3○ is larger than 1. When we process them
in iteration i = 1, the set Q(w) contains only one element, namely (w, −∞). Together
with the observation that cost↑ is an upper bound on the cost of any path in G′(”w) and
Theorem 4.8 we conclude that each invocation of the modified algorithm has running time
\[
O\Bigl( |N| \cdot m^3 \cdot n^2 \cdot \frac{\log(\mathrm{cost}^{\uparrow}/\mathrm{cost}^{\downarrow})}{\varepsilon} \cdot \theta \Bigr)
\]
although it is sometimes called on a graph with
$O\Bigl( |N| m n \Bigl( \frac{\log(\mathrm{cost}^{\uparrow}/\mathrm{cost}^{\downarrow})}{\varepsilon} \Bigr) \Bigr)$
edges. The time to create a set $M_\delta(”w)$ is
$O\Bigl( |N|^2 m^2 n^3 \cdot \frac{\log(\mathrm{cost}^{\uparrow}/\mathrm{cost}^{\downarrow})}{\varepsilon^2} \cdot \theta \Bigr)$
by Lemma 4.13.
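By Lemma 2.7, the number of edges in T is O(|N|). We obtain a total running time of
\[
O\Bigl( |N|^2 \cdot m^3 \cdot n^2 \cdot \frac{\log(\mathrm{cost}^{\uparrow}/\mathrm{cost}^{\downarrow})}{\varepsilon} \cdot \theta \Bigr)
+ O\Bigl( |N|^3 m^2 n^3 \, \frac{\log(\mathrm{cost}^{\uparrow}/\mathrm{cost}^{\downarrow})}{\varepsilon^2} \cdot \theta \Bigr)
= O\Bigl( |N|^2 m^2 n^2 \, \frac{\log(\mathrm{cost}^{\uparrow}/\mathrm{cost}^{\downarrow})}{\varepsilon} \Bigl( \frac{|N| n}{\varepsilon} + m \Bigr) \theta \Bigr).
\]
□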
4.6.3 Topology Embeddings in Graphs without Loops
In the case that G does not contain loops one can obtain an improved running time.
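Theorem 4.14 Let $\varepsilon > 0$. In the situation of Theorem 4.12 we can embed topology T with
approximation guarantee $1 + \varepsilon$ and running time
\[
O\Biggl( |N|^2 \cdot n \cdot \frac{\log(\mathrm{cost}^{\uparrow}/\mathrm{cost}^{\downarrow})}{\varepsilon} \cdot
\Biggl( \Bigl( \frac{|N| n^2}{\varepsilon} \cdot \theta_{\log} \Bigr) + m\theta + n \cdot \log\Bigl( n \cdot \frac{\log(\mathrm{cost}^{\uparrow}/\mathrm{cost}^{\downarrow})}{\varepsilon} \Bigr) \Biggr) \Biggr)
\]
if G does not contain any loops.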
Sketch of the proof. The proof works exactly as the proof of Theorem 4.12. The only
difference is that we use Algorithm 7 from Page 60 without lines 6○ and 7○ to produce
labels. If G does not contain loops, k = n − 1 is an upper bound on the number of edges
in any optimum Steiner path.
As in the running time proof of Theorem 4.12 we observe that if the end point w of a
path search has no outgoing edge, all of its incoming edges are considered at most once.
We conclude that the total running time for all O(|N|) invocations of the algorithm is
\[
O\Biggl( |N|^2 \cdot n \cdot \frac{\log(\mathrm{cost}^{\uparrow}/\mathrm{cost}^{\downarrow})}{\varepsilon} \cdot
\Biggl( m(\theta_F + \theta_\Delta + \theta_{\log}) + n \cdot \log\Bigl( n \cdot \frac{\log(\mathrm{cost}^{\uparrow}/\mathrm{cost}^{\downarrow})}{\varepsilon} \Bigr) \Biggr) \Biggr).
\]
Together with Lemma 4.13 we obtain the claimed running time. □
4.7 Electrical and Polarity Constraints
With Theorems 4.12 and 4.14 we have obtained fully polynomial time approximation
schemes for the Minimum Cost Buffered Steiner Tree Problem if the topology of a solution
is known or if the number of sinks is constant. Recall that the Minimum Cost Buffered
Steiner Tree Problem does not distinguish between buffers and inverters, and ignores
capacitance and slew limits.
It is trivial to modify the algorithm such that it obeys capacitance limits. For all loop
edges e modeling insertion of a repeater l ∈ L with capacitance limit caplim(l) we can
simply replace function $F(e, \cdot)$ by function $\bar{F}(e, \cdot)$ with
\[
\bar{F}(e, x) = \begin{cases} F(e, x) & \text{if } x \le \mathrm{caplim}(l),\\ \infty & \text{otherwise.} \end{cases}
\]
With the same trick we can enforce that $\bar{F}(e, x) = \infty$ for all edges e leaving source s and a
value x larger than the capacitance limit of s.
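The replacement of F by a capacitance-limited version can be pictured as a thin wrapper around the original function. The sketch below is illustrative only: the names `with_cap_limit`, `F`, and `cap_limit` are our own, and we assume that F is available as a Python callable taking an edge and a capacitance value.

```python
import math


def with_cap_limit(F, cap_limit):
    """Wrap F so that capacitance values above cap_limit cost infinity."""

    def F_limited(edge, x):
        # Inside the limit the original function is used unchanged.
        return F(edge, x) if x <= cap_limit else math.inf

    return F_limited
```

For instance, wrapping a linear function with limit 10.0 leaves its value at x = 4.0 unchanged and returns infinity at x = 10.5, so no label can be propagated over the edge with such a capacitance.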
Another class of constraints that can easily be incorporated into the algorithms are
polarity constraints (Section 2.5.5). For each label (u, i)( ”v,”w) we store a required polarity
pol(u, i). Initial labels (i. e. labels corresponding to the sinks) receive the sink’s required
polarity. Propagating along a loop edge modeling an inverter switches the polarity
requirement. We allow the creation of two different labels of the form (u, i)(”v, ”w) if they have
distinct required polarities and only execute the updates in Steps 7○ and 8○ of the
modification of Algorithm 6 if there is another label with smaller capacitance
• at the same node,
• in the same cost class, and
• with the same required polarity.
When computing the sets Mδ(”w) before embedding edges entering a Steiner node ”w in
the given topology, we only allow to merge labels which have the same required polarity.
Finally, we output the Steiner tree corresponding to the best source label with required
polarity ident. It is easy (but a bit technical) to verify that with these modifications
the algorithm of Theorem 4.12 remains an FPTAS and that the asymptotic running time does not
increase.
It is an open problem to incorporate slew effects into the results from this chapter
without losing the provable quality.
Chapter 5
Topology Generation
The results of Chapter 4 are more of theoretical than of practical interest. Even for
smaller nets the number of possible topologies grows extremely fast as the following table
demonstrates (cf. Lemma 2.7). Enumerating all topologies to apply Theorem 4.12 or
Theorem 4.14 is unacceptable unless the number of sinks is very small.
|N|    # topologies
  2                 1
  3                 1
  4                 3
  5                15
  6               105
  7               945
  8            10 395
  9           135 135
 10         2 027 025
 11        34 459 425
 12       654 729 075
 13    13 749 310 575
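The values in the table can be reproduced by a product of odd numbers. The closed form used below, (2|N| − 5)!! for |N| ≥ 3, is our own reading of the table and is assumed rather than quoted from Lemma 2.7; the function name is hypothetical.

```python
def num_topologies(n_pins):
    """Number of topologies for a net with n_pins pins (source included).

    Assumption of this sketch: the count equals the double factorial
    (2 * n_pins - 5)!! = 1 * 3 * 5 * ... * (2 * n_pins - 5), with the
    empty product 1 for n_pins = 2 and n_pins = 3.
    """
    if n_pins < 2:
        raise ValueError("a net needs a source and at least one sink")
    count = 1
    for odd in range(3, 2 * n_pins - 4, 2):
        count *= odd
    return count
```

Each additional pin multiplies the count by the next odd number, which explains why enumeration is hopeless already for nets of moderate size.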
In this chapter we show how to compute a topology with nice properties in terms of delay
and net length. Such a topology can then be used as the initial topology for Theorem 4.12
or Theorem 4.14.
The focus in this chapter will be both theoretical and practical. Besides provable
bounds for delays and lengths, the topology generation routines presented in this chapter
are fast and provide good solutions on practical VLSI repeater tree instances. The latter
property is reinforced by post-optimization.
Almost all results of this chapter are joint work with Stephan Held.
5.1 Placed Topologies
The main object of this chapter is a topology T for a net N together with a placement
function p : V(T) → M that specifies a position for each node of T in a metric space
(M, dist). We call such a pair (T, p) a placed topology.
In the case that the global routing graph is a 3-dimensional grid graph as described in
Section 2.4.3, the 2-dimensional plane $\mathbb{R}^2$ together with the $\ell_1$ norm
\[ \|(x_1, y_1) - (x_2, y_2)\|_1 := |x_1 - x_2| + |y_1 - y_2| \]
is a canonical candidate for (M, dist), i. e. $(M, \mathrm{dist}) = (\mathbb{R}^2, \ell_1)$.
If we deal with a global routing graph G that does not allow a geometric interpretation
but has a cost function c : E(G) → R≥0, the metric closure of (G, c) yields the metric
space.
5.1.1 Properties of Placed Topologies
Length. The distance function dist allows us to define the length of a placed topology
(T, p) as
\[
\mathrm{length}(T, p) := \sum_{(”v,”w) \in E(T)} \mathrm{dist}(p(”v), p(”w)).
\]
In most parts of this chapter the positions p are clear from the context and we will
just write length(T ) instead of length(T, p). We also use the above definition to define the
length of other graph structures such as paths and branchings that have positions in M
assigned to their nodes. By computing topologies with small length we hope to end up
with a Steiner tree with small total costs for congestion and net length after applying the
algorithm of Theorem 4.12.
Delay. To achieve small timing costs we need to define the delay of a placed topology.
The following delay model was introduced by Bartoschek et al. [Bar+10] and estimates
both the delay impact of path lengths and of the increased capacitance induced by sibling
branches.
Let s be the source of net N, let t ∈ N\{s}, and let b ≥ 0 be a constant. Bartoschek
et al. [Bar+10] estimate the delay from s to t in (T, p) as
\[
\mathrm{delay}_{(T,p)}(s, t) = \Bigl( \sum_{(”v,”w) \in E(T_{[s,t]})} \mathrm{dist}(p(”v), p(”w)) \Bigr) + b \cdot \bigl( |E(T_{[s,t]})| - 1 \bigr).
\]
Due to the special graph structure of a topology defined in Definition 2.4, (|E(T[s,t])|−1)
is exactly the number of bifurcations on the unique s-t path in T . Each of these bifurcations
is assumed to increase the delay by the constant b. The delay along a path without
bifurcations is assumed to be proportional to the length along it, which is a reasonable
assumption in an optimally buffered path. Up to scaling we may assume that the delay
along a path without bifurcations is equal to its length. If positions p are clear from the
context, we write delayT (s, t) instead of delay(T,p)(s, t).
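Both quantities are easy to evaluate for a concrete placed topology. The following sketch assumes the metric space (R², ℓ1), a topology given as a list of directed edges (parent, child), and positions stored in a dictionary; all function and variable names are our own and only serve as illustration.

```python
def l1_dist(p, q):
    """l1 distance between two points in the plane."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def topology_length(edges, pos):
    """length(T, p): sum of the edge lengths of the placed topology."""
    return sum(l1_dist(pos[v], pos[w]) for v, w in edges)


def delay(edges, pos, source, sink, b):
    """delay_(T,p)(s, t): path length plus b per bifurcation on the s-t path."""
    parent = {w: v for v, w in edges}  # every non-source node has one parent
    path_length, n_edges, node = 0, 0, sink
    while node != source:
        pred = parent[node]
        path_length += l1_dist(pos[pred], pos[node])
        n_edges += 1
        node = pred
    return path_length + b * (n_edges - 1)
```

As a usage example, take the source s and one Steiner point a both at (0, 0), and sinks t1 at (2, 1) and t2 at (−1, 3), with edges (s, a), (a, t1), (a, t2). The length is 7, and with b = 0.5 the delay to t1 is 3 + 0.5 = 3.5, since the s-t1 path has two edges and therefore one bifurcation.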
5.1.2 Contradicting Objectives
For many instances, short topologies and topologies with short delays look different. An
example of such an instance is shown in Figure 5.1.
Computing short topologies is equivalent to the Shortest Steiner Tree Problem in
(M, dist) and can thus be approximated within a constant factor from optimum (e. g.
see [Byr+13; Goe+12]). An example of such a short topology can be found in Figure 5.1(b).
A major drawback of such a solution is that delays on paths between the source and timing
critical sinks can be large. In this example, the path between s and t4 is roughly three
times longer than a shortest path and has three bifurcations that slow down the signal
delay even further. If t4 is timing critical, the topology in Figure 5.1(b) is not a good
solution. In contrast to the short topology, Figure 5.1(a) shows an example of a topology
in which all path delays are small. Here, we could achieve that path lengths between s
and the sinks t ∈ N\{s} are (almost) shortest possible by placing all Steiner points close
to the source’s position. The number of bifurcations on each source-sink path is exactly
two. In this example it is not possible to obtain a source-sink path with one bifurcation
without creating paths with three bifurcations.
Figure 5.1: Trade-off between length and delays of a topology for N = {s, t1, t2, t3, t4} embedded
into $(\mathbb{R}^2, \ell_1)$. (a) A topology with small delays but with huge length. (b) A topology with small length
but with $\mathrm{delay}_{(T,p)}(s, t_4)$ large. (c) A topology trading off length and delays.
Whether the topology depicted in Figure 5.1(a) is an optimum solution from a timing point
of view depends of course on the precise definition of delay optimality. In this section we
consider the following two variants:
Variant a) Given required arrival times rat(t) ∈ R≥0 for t ∈ N\{s} we want to compute
a placed topology (T, p) such that
\[ \mathrm{delay}_{(T,p)}(s, t) \le \mathrm{rat}(t) \quad \text{for all } t \in N \setminus \{s\} \]
or decide that such a topology does not exist.
Variant b) Given criticalities λ(t) ∈ R≥0 for t ∈ N\{s} we want to compute a placed
topology (T, p) minimizing
\[ \sum_{t \in N \setminus \{s\}} \lambda(t) \cdot \mathrm{delay}_{(T,p)}(s, t). \]
In Section 5.1.3 we list existing algorithms that can compute optimum solutions to
both variants. Since delay-optimum topologies can have large lengths, one is usually
looking for a trade-off between delay and length. A placed topology trading off these two
contradicting objectives is called a shallow-light topology in the literature. An example can be
found in Figure 5.1(c).
Similar to the two variants of delay-optimum topologies we consider the following two
problems.
Shallow-Light Topology Problem with Required Arrival Times
Instance: A metric space (M, dist),
a net N with source s ∈ N and positions p : N →M ,
a constant bifurcation delay penalty b ≥ 0,
required arrival times rat : N\{s} → R≥0.
Output: A topology T for N and positions p : V(T)\N → M such that
• $\mathrm{delay}_{(T,p)}(s, t) \le \mathrm{rat}(t)$ for all t ∈ N\{s} and
• length(T, p) is minimum,
or the decision that such a topology does not exist.
Shallow-Light Topology Problem with Criticalities
Instance: A metric space (M, dist),
a net N with source s ∈ N and positions p : N →M ,
a constant bifurcation delay penalty b ≥ 0,
criticalities λ(t) ∈ R≥0 for t ∈ N\{s}.
Output: A topology T for N and positions p : V(T)\N → M minimizing
\[ \mathrm{length}(T, p) + \sum_{t \in N \setminus \{s\}} \lambda(t) \cdot \mathrm{delay}_{(T,p)}(s, t). \]
5.1.3 Delay-Minimum Placed Topologies
To compute shallow-light topologies, we have to compute delay-optimum topologies for
Variant a) and b) as defined in the previous section.
Since it is always possible to achieve that all paths are shortest paths by placing all
Steiner points at the source’s position, it suffices to balance the depths of
the sinks in the computed topology.
For Variant a) we know upper bounds for depths of sinks in feasible solutions:
Definition 5.1 For a net N with source s, positions p : N → M, required arrival times
rat : N\{s} → R≥0, and a bifurcation delay penalty b ≥ 0 we define for each sink
t ∈ N\{s}:
\[
\mathrm{bif}(t) := \begin{cases} |N| - 2 & \text{if } b = 0,\\[2pt] \Bigl\lfloor \dfrac{\mathrm{rat}(t) - \mathrm{dist}(p(s), p(t))}{b} \Bigr\rfloor & \text{if } b > 0. \end{cases}
\]
Hence, to solve Variant a) it suffices to find a topology T for N such that
\[ |E(T_{[s,t]})| - 1 \le \mathrm{bif}(t) \quad \text{for all } t \in N \setminus \{s\}. \]
Bartoschek et al. [Bar+10] gave a polynomial time algorithm that computes such a
topology. Even stronger, they proved that their algorithm maximizes the worst slack
\[ \min \bigl\{ \mathrm{rat}(t) - \mathrm{delay}_{(T,p)}(s, t) : t \in N \setminus \{s\} \bigr\}. \]
Their algorithm has a running time of $O(|N|^2)$ and is sketched in Section 5.3.1.
An even faster method to solve Variant a) or b) is the Huffman Coding Algorithm [Huf52]
(Algorithm 8).
Although that algorithm is well known, formal proofs can rarely be found in the literature.
We therefore repeat the proofs of the most basic properties in the following.
Lemma 5.2 (“well known”) The Huffman Coding algorithm (Algorithm 8) has running
time O(|N | · log(|N |)).
If sinks are sorted by their bif-value or by their λ-values, respectively, the running time of
the algorithm reduces to O(|N |).
Instance: A net N with source s and
    Variant a) values bif : N\{s} → R.
    Variant b) criticalities λ(t) ∈ R for t ∈ N\{s}.
Output: A topology for N.
1○ set X = N\{s}, Y = ∅, T = (N, ∅)
2○ while |X ∪ Y| > 1 do
3○     let ”v, ”w ∈ X ∪ Y be the two elements with
           Variant a) largest bif-values
           Variant b) smallest λ-values
4○     let ˚uffl be a new node with p(˚uffl) = s and
           Variant a) bif(˚uffl) = min{bif(”v), bif(”w)} − 1
           Variant b) λ(˚uffl) = λ(”v) + λ(”w)
5○     insert ˚uffl and edges (˚uffl, ”v), (˚uffl, ”w) to T
6○     set X = X\{”v, ”w}, Y = (Y\{”v, ”w}) ∪ {˚uffl}
7○ add an edge between s and the unique remaining element of X ∪ Y to T
Algorithm 8: Huffman Coding algorithm for Variant a) and b) as defined in Section 5.1.2.
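A direct transcription of Algorithm 8 may help to see both variants side by side. The sketch below is not the linear-time implementation discussed in Lemma 5.2; it uses a binary heap from the Python standard library, ignores positions (all new nodes are implicitly placed at the source), and uses names of our own choosing such as `huffman_topology` and the node labels `("steiner", k)`.

```python
import heapq
from itertools import count


def huffman_topology(source, sinks, values, variant):
    """Build a topology as in Algorithm 8.

    values: dict sink -> bif-value (variant "a") or criticality (variant "b").
    Returns (edges, root_value), where edges are directed pairs
    (parent, child) and root_value is the bif-value (variant "a") or the
    total criticality (variant "b") of the node attached to the source.
    """
    tie = count()  # tie-breaker so that node names are never compared
    sign = -1 if variant == "a" else 1  # largest bif first, smallest lambda first
    heap = [(sign * values[t], next(tie), t) for t in sinks]
    heapq.heapify(heap)
    steiner_ids = count(1)
    edges = []
    while len(heap) > 1:
        key_v, _, v = heapq.heappop(heap)
        key_w, _, w = heapq.heappop(heap)
        u = ("steiner", next(steiner_ids))
        if variant == "a":
            new_value = min(sign * key_v, sign * key_w) - 1
        else:
            new_value = key_v + key_w
        edges += [(u, v), (u, w)]
        heapq.heappush(heap, (sign * new_value, next(tie), u))
    root_key, _, root = heap[0]
    edges.append((source, root))
    return edges, sign * root_key
```

For criticalities 1, 1, 2, 4 the two least critical sinks are merged first and end up with three bifurcations on their source-sink paths, while the most critical sink has only one.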
Proof We assume that the sinks are sorted. As after each iteration of the while-loop
2○, the cardinality |X ∪ Y | is reduced by 1, it suffices to show that lines 3○– 6○ can be
implemented to run in constant time. To achieve that we store X and Y in two arrays
sorted by their bif-value in non-increasing or by their λ-value in non-decreasing order,
respectively.
Since the values bif(˚uffl) = min{bif(”v), bif(”w)} − 1 decrease or the values λ(˚uffl) =
λ(”v) + λ(”w) increase in each iteration, appending ˚uffl to the back of Y in 6○ maintains the
sorting. Together with the observation that the elements ”v and ”w in line 3○ can be found
among the first two elements of each array and can implicitly be removed by storing the
indices of the first elements that are supposed to be contained in each array, we can indeed
perform 3○– 6○ in constant time.
If sinks are not sorted in the input, the initial array storing X must be sorted. That
takes O(|N | · log(|N |)) time. We obtain the claimed running times. 
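The two-array idea of the proof can be sketched for Variant b) as follows. This is an illustration under our own assumptions: the input is a list of criticalities that is already sorted in non-decreasing order, the arrays X and Y are modeled by double-ended queues instead of arrays with start indices, and the node names are hypothetical.

```python
from collections import deque


def huffman_sorted(weights):
    """Linear-time merge order for criticalities sorted non-decreasingly.

    Returns a list of merges (new_node, child_1, child_2) in the order in
    which the while-loop of Algorithm 8 would create the new nodes.
    """
    X = deque((w, ("sink", i)) for i, w in enumerate(weights))
    Y = deque()  # merged nodes; their weights arrive in non-decreasing order
    merges = []

    def pop_smallest():
        # The smallest remaining element is at the front of X or of Y.
        if not Y or (X and X[0][0] <= Y[0][0]):
            return X.popleft()
        return Y.popleft()

    while len(X) + len(Y) > 1:
        weight_v, v = pop_smallest()
        weight_w, w = pop_smallest()
        u = ("steiner", len(merges))
        merges.append((u, v, w))
        Y.append((weight_v + weight_w, u))  # appending keeps Y sorted
    return merges
```

Each iteration performs a constant number of queue operations, which mirrors the argument that lines 3○ to 6○ run in constant time once the sinks are sorted.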
To prove optimality we need the well known inequality by Kraft [Kra49].
Lemma 5.3 ([Kra49]) Let (M, dist) be a metric space, let N be a net with source s, and
let bif : N\{s} → N be a function.
There exists a topology T for N in which the number of bifurcations on the s-t path is
upper-bounded by bif(t) for each t ∈ N\{s} if and only if
\[ \sum_{t \in N \setminus \{s\}} 2^{-\mathrm{bif}(t)} \le 1. \]
By setting the bif-function as in Definition 5.1 we can use Kraft’s inequality to check if a
feasible solution to Variant a) exists.
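This feasibility check is simple to carry out. The sketch below combines the bif-values of Definition 5.1 with Kraft’s inequality; the function names and the dictionary-based input format are our own assumptions. The Kraft sum is evaluated with exact fractions, whereas the floor in the bif-values is computed in floating point and may need extra care for inputs at rounding boundaries.

```python
import math
from fractions import Fraction


def bif_values(rat, dist_from_source, b, n_pins):
    """bif(t) as in Definition 5.1 for every sink t.

    rat, dist_from_source: dicts sink -> required arrival time and
    dist(p(s), p(t)); b: bifurcation penalty; n_pins: |N|.
    """
    if b == 0:
        return {t: n_pins - 2 for t in rat}
    return {t: math.floor((rat[t] - dist_from_source[t]) / b) for t in rat}


def kraft_feasible(bif):
    """Check whether the sum of 2^(-bif(t)) over all sinks is at most 1."""
    if any(value < 0 for value in bif.values()):
        return False  # a single term 2^(-bif(t)) would already exceed 1
    return sum(Fraction(1, 2 ** value) for value in bif.values()) <= 1
```

For three sinks with slack 2 each and b = 1 we obtain bif-values 2, 2, 2 and a Kraft sum of 3/4, so a feasible topology exists. With b = 2 the bif-values drop to 1 and the Kraft sum 3/2 shows infeasibility.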
Lemma 5.4 (“well known”) If the initial set of sinks together with bif-values as defined
in Definition 5.1 fulfill Kraft’s inequality, the Huffman Coding algorithm returns a topology
that is feasible for Variant a).
Proof We show that the set X ∪ Y fulfills Kraft’s inequality at each step of the algorithm.
Consider any iteration of the while-loop in line 2○ and assume that Kraft’s inequality
holds before that iteration. Let ”v and ”w be as defined in line 3○, without loss of generality
bif(”v) ≥ bif(”w). Let ˚uffl be the new element created in line 4○ with bif-value bif(”w) − 1
(i. e. $2^{-\mathrm{bif}(˚uffl)} = 2 \cdot 2^{-\mathrm{bif}(”w)}$). The Kraft sum $\sum_{”x \in (X \cup Y \cup \{˚uffl\}) \setminus \{”v, ”w\}} 2^{-\mathrm{bif}(”x)}$ equals the Kraft sum
of instance X ∪ Y after decreasing bif(”v) to bif(”w).
In the remainder of this proof we show that even after decreasing the bif-value of ”v to
bif(”w), Kraft’s inequality is fulfilled. Assume for contradiction that this is not the case (in
particular, bif(”v) > bif(”x) for ”x ∈ (X ∪ Y)\{”v}). Let Z be a set of
\[
|Z| = \Bigl( 1 - \sum_{”x \in X \cup Y} 2^{-\mathrm{bif}(”x)} \Bigr) \cdot 2^{\min\{\mathrm{bif}(”x) \,:\, ”x \in X \cup Y\}}
\]
new elements with an assigned bif-value bif(˚uffl) = min{bif(”x) : ”x ∈ X ∪ Y} for ˚uffl ∈ Z.
Since $\sum_{”x \in X \cup Y \cup Z} 2^{-\mathrm{bif}(”x)} = 1$ and bif(”v) > 0,
\[
2^{\mathrm{bif}(”v)} = 2^{\mathrm{bif}(”v)} \cdot \sum_{˚uffl \in X \cup Y \cup Z} 2^{-\mathrm{bif}(˚uffl)} = \sum_{˚uffl \in X \cup Y \cup Z} 2^{\mathrm{bif}(”v) - \mathrm{bif}(˚uffl)}
\]
is an even number. On the other hand, the sum $\sum_{˚uffl \in X \cup Y \cup Z} 2^{\mathrm{bif}(”v) - \mathrm{bif}(˚uffl)}$ has exactly one
odd summand (for ˚uffl = ”v), which is a contradiction. □
Lemma 5.5 (“well known”) The Huffman Coding algorithm for Variant b) returns a
topology T with $\sum_{t \in N \setminus \{s\}} \lambda(t) \cdot \mathrm{delay}_T(s, t)$ minimum.
Proof We prove optimality by induction on |N|. For |N| ≤ 2 this is obvious. Let |N| ≥ 3
and let ”v, ”w ∈ X (= N\{s}) be the two sinks selected in line 3○ in the first iteration. Let
T be the topology returned by the Huffman Coding algorithm and let T∗ be a topology
with $\sum_{t \in N \setminus \{s\}} \lambda(t) \cdot \mathrm{delay}_{T^*}(s, t)$ minimum. Let ˚uffl ∈ V(T∗)\N be a Steiner point with
$|E(T^*_{[s,˚uffl]})|$ maximum.
As λ(”v) and λ(”w) are minimum among all sinks we may assume that the successors of
˚uffl are exactly ”v and ”w (exchanging ”v or ”w with a successor of ˚uffl cannot make T∗ worse).
Hence, T∗ consists of a topology $\bar{T}^*$ for net N′ := (N\{”v, ”w}) ∪ {˚uffl} plus edges (˚uffl, ”v),
(˚uffl, ”w). By induction hypothesis the Huffman Coding algorithm finds a solution $\bar{T}$ for
instance N′ with λ(˚uffl) = λ(”v) + λ(”w) such that
\[
\sum_{t \in N' \setminus \{s\}} \lambda(t) \cdot b \cdot \bigl(|E(\bar{T}_{[s,t]})| - 1\bigr) \le \sum_{t \in N' \setminus \{s\}} \lambda(t) \cdot b \cdot \bigl(|E(\bar{T}^*_{[s,t]})| - 1\bigr).
\]
Note that the part of the algorithm for net N after the first iteration is equivalent to
the algorithm for net N′, i. e. T consists of $\bar{T}$ plus edges (˚uffl, ”v), (˚uffl, ”w). The inequality
\begin{align*}
\sum_{t \in N \setminus \{s\}} \lambda(t) \cdot b \cdot \bigl(|E(T_{[s,t]})| - 1\bigr)
&= \Bigl( \sum_{t \in N' \setminus \{s\}} \lambda(t) \cdot b \cdot \bigl(|E(\bar{T}_{[s,t]})| - 1\bigr) \Bigr) + b \cdot (\lambda(”v) + \lambda(”w))\\
&\le \Bigl( \sum_{t \in N' \setminus \{s\}} \lambda(t) \cdot b \cdot \bigl(|E(\bar{T}^*_{[s,t]})| - 1\bigr) \Bigr) + b \cdot (\lambda(”v) + \lambda(”w))\\
&= \sum_{t \in N \setminus \{s\}} \lambda(t) \cdot b \cdot \bigl(|E(T^*_{[s,t]})| - 1\bigr)
\end{align*}
concludes the proof. □
5.2 Nonapproximability
In [HR13] we stated that the existence of a constant factor approximation for the Shallow-
Light Topology Problem with Required Arrival Times would imply P = NP. A detailed
proof of this statement can be found in [Rot12]; we omitted it in the conference paper
[HR13]. The stronger version that we prove now is joint work with Nicolai Hähnle, who
noticed that the analysis of [Rot12] can be strengthened.
From now on and for the rest of this chapter, topologies always mean placed topologies
although we do not always mention the placement function explicitly.
Theorem 5.6 There is no $|N|^{\beta}$-approximation algorithm for the Shallow-Light Topology
Problem with Required Arrival Times for any constant β < 1 unless P = NP.
Proof Assume there is an approximation algorithm with approximation ratio $|N|^{\beta}$ for
β < 1. We use this algorithm to decide an NP-complete variant of Satisfiability. Let $\mathcal{C}$ be
a set of clauses over variables X = {x1, . . . , xn} where $n = 2^k$ for some k ∈ N\{1} and
each literal appears in at most two clauses. It is NP-hard to decide if a set of clauses of
this special form is satisfiable (the proof immediately follows from [Tov84]). Furthermore,
we may assume that $|\mathcal{C}| \le 2 \cdot n$.
Define $\bar{X} := \{\bar{x}_1, \dots, \bar{x}_n\}$ as the set of negated literals and interpret the variables in X
as the non-negated literals. Let $\mathcal{C}'$ be a set of $2n - |\mathcal{C}|$ elements. We define (M, dist) as the
metric closure of an undirected graph G which is defined as follows (see also Figure 5.2(a)).
Let $\alpha = \frac{2+3\beta}{1-\beta}$, $m = \lceil n^{\alpha} \rceil$, $\varepsilon = \frac{1}{m}$, and
\[ V(G) = \mathcal{C} \cup \mathcal{C}' \cup \{s\} \cup X \cup \bar{X} \cup \{t_{ij} : i \in \{1, \dots, n\},\ j \in \{1, \dots, m\}\}. \]
Include edges
• {s, χ} for all $\chi \in X \cup \bar{X}$ with length 1,
• {χ, C} for all $\chi \in X \cup \bar{X}$, $C \in \mathcal{C}$ such that χ ∈ C with length 1,
• {χ, C′} for all $\chi \in X \cup \bar{X}$, $C' \in \mathcal{C}'$ with length 1,
• {χ, $t_{ij}$} for all $\chi \in X \cup \bar{X}$ s.t. $\chi = x_i$ or $\chi = \bar{x}_i$, j ∈ {1, . . . , m} with length ε.
Let $N := \{s\} \cup \mathcal{C} \cup \mathcal{C}' \cup \{t_{ij} : i \in \{1, \dots, n\},\ j \in \{1, \dots, m\}\}$, b = 1, p(t) = t for all t ∈ N,
rat(C) = k + m + 3 for $C \in \mathcal{C} \cup \mathcal{C}'$, and $\mathrm{rat}(t_{ij}) = 1 + \varepsilon + j + k$ for all i ∈ {1, . . . , n}, j ∈ {1, . . . , m}.
Using Definition 5.1 it holds that
\begin{align*}
\sum_{t \in N \setminus \{s\}} 2^{-\mathrm{bif}(t)} &= 2n \cdot \bigl( 2^{-(m+k+1)} \bigr) + n \cdot \sum_{j=1}^{m} 2^{-(j+k)}\\
&= n \cdot 2^{-(m+k)} + n \cdot \bigl( 2^{-k} \cdot (1 - 2^{-m}) \bigr)\\
&= n \cdot 2^{-k} = 1
\end{align*}
and by Lemma 5.3, a feasible topology for the constructed instance exists. As Kraft’s
inequality is satisfied with equality, the number of Steiner vertices on an s-t path is uniquely
determined by bif(t) for every t ∈ N\{s}. The following claim proves the theorem.
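The Kraft sum of the constructed instance can be checked numerically for small parameters. In the sketch below the function name is our own; the bif-values m + k + 1 for the 2n clause sinks and j + k for the sinks $t_{ij}$ are the ones used in the calculation above, and exact fractions avoid rounding effects.

```python
from fractions import Fraction


def kraft_sum_of_reduction(k, m):
    """Kraft sum of the instance built in the proof, with n = 2^k variables."""
    n = 2 ** k
    # 2n sinks in C and C', each with bif-value m + k + 1.
    clause_part = 2 * n * Fraction(1, 2 ** (m + k + 1))
    # For every variable index i one sink t_ij with bif-value j + k, j = 1..m.
    literal_part = n * sum(Fraction(1, 2 ** (j + k)) for j in range(1, m + 1))
    return clause_part + literal_part
```

For every choice of k ≥ 2 and m ≥ 1 the sum equals exactly 1, in accordance with the statement that Kraft’s inequality holds with equality.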
Claim: If $\mathcal{C}$ is satisfiable, a feasible topology for N with length at most $3n + nm \cdot \varepsilon$ exists (see
Figure 5.2(b)). Otherwise, each feasible topology has cost at least $2n + k + (1 + \varepsilon n) \cdot m > |N|^{\beta} \cdot (3n + nm \cdot \varepsilon)$.
Proof (of the claim) We prove that $\mathcal{C}$ is satisfiable if and only if there is a feasible topology
T for N in which no Steiner point ”v with $|E(T_{[s,”v]})| = k + m$ is placed at position s.
Figure 5.2: Graph G and a light topology without delay violations in the proof of Theorem 5.6.
(a) The graph G which defines (M, dist) in the proof of Theorem 5.6 (m = 3). (b) Feasible topology
with small length. Labels indicate positions. x1 = x4 = true, x2 = x3 = false satisfies all clauses.
The corresponding instance of Satisfiability is X = {x1, x2, x3, x4}, C1 = {x1, x2},
C2 = {x1, x2, x3}, C3 = {x1, x2, x4}, C4 = {x2, x3}, C5 = {x3, x4}. For simplicity, we choose
m = 3.
First, we assume that $\mathcal{C}$ is satisfiable. Fix a satisfying truth assignment. We can define
a function $\psi' : \mathcal{C} \to X \cup \bar{X}$ with $\psi'(C) \in \{\chi \in C : \chi \text{ is a true literal}\}$.
Note that $|\psi'^{-1}(\chi)| \le 2$ for each true literal χ. By choice of $|\mathcal{C}'|$ we can extend ψ′ to
a function $\psi : \mathcal{C} \cup \mathcal{C}' \to X \cup \bar{X}$ such that $|\psi^{-1}(\chi)| = 2$ for each true literal χ. For each
such χ place m + 1 Steiner points $”v^{\chi}_j$ at position χ (j = 1, . . . , m + 1) and include edges
$(”v^{\chi}_j, ”v^{\chi}_{j+1})$ and $(”v^{\chi}_j, t_{ij})$ for j = 1, . . . , m where $\chi \in \{x_i, \bar{x}_i\}$. Also include edges $(”v^{\chi}_{m+1}, C)$
for each $C \in \psi^{-1}(\chi)$. The resulting branching can be extended to a topology T for N by
a balanced binary tree for
\[ \{ ”v^{\chi}_1 : \chi \text{ is a true literal} \} \cup \{s\} \]
in which all inner vertices are positioned at s. Figure 5.2(b) depicts this solution.
It is easy to see that this topology satisfies $\mathrm{delay}_T(s, t) = \mathrm{rat}(t)$ for each t ∈ N\{s} and
hence is feasible. The length of T is $3n + nm \cdot \varepsilon$.
The set of Steiner points with distance k + m from the source is exactly the set
$\{ ”v^{\chi}_m : \chi \text{ is a true literal} \}$. None of these vertices is placed at position s.
Conversely, let T be a feasible topology for N in which no Steiner point ”v with
$|E(T_{[s,”v]})| = k + m$ is placed at position s. If there is a sink t′ for which $T_{[s,t']}$ is not a shortest
path, $|E(T_{[s,t']})| - 1$ must be at most
\[
\Bigl\lfloor \mathrm{rat}(t') - \sum_{(”v,”w) \in E(T_{[s,t']})} \mathrm{dist}_G(p(”v), p(”w)) \Bigr\rfloor < \mathrm{bif}(t).
\]
By Lemma 5.3 we get a contradiction. Hence, all Steiner points must be placed in
$\{s\} \cup X \cup \bar{X}$. Since
\[
1 \ge \sum_{t \in T} 2^{-\min\{\mathrm{bif}(t),\, |E(T_{[s,t']})|\}} \ge \sum_{t \in T} 2^{-\mathrm{bif}(t)} = 1,
\]
$\mathrm{bif}(t') = |E(T_{[s,t']})|$ for all t ∈ N\{s}, which implies that the predecessor of terminal $t_{im}$
(i ∈ {1, . . . , n}) must be a Steiner point ”v with $|E(T_{[s,”v]})| = k + m$ and which reaches a
sink in $\mathcal{C} \cup \mathcal{C}'$. Thus, it must be placed in $\{x_i, \bar{x}_i\}$. Its successor must be positioned at the
same point. Let
\[ V' := \{ ”v \in V(T) \setminus N : |E(T_{[s,”v]})| = k + m + 1 \}. \]
Since the successors of elements in V′ are exactly the sinks in $\mathcal{C} \cup \mathcal{C}'$, $|V'| = |\mathcal{C} \cup \mathcal{C}'|/2 = n$.
We conclude that for each i = 1, . . . , n there is exactly one $\chi \in \{x_i, \bar{x}_i\}$ such that there is
a vertex of V′ with position χ.
We use this property to define a truth assignment of {x1, . . . , xn} as follows:
for each i = 1, . . . , n set
\[
x_i := \begin{cases} \text{true} & \text{if there is } ”v \in V' \text{ such that } p(”v) = x_i,\\ \text{false} & \text{otherwise.} \end{cases}
\]
Let $C \in \mathcal{C}$ and let ”v ∈ V′ be the predecessor of C in T. Since $T_{[s,C]}$ is a shortest path,
p(”v) corresponds to a true literal contained in C. Thus, $\mathcal{C}$ is satisfiable.
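Assume that $\mathcal{C}$ is not satisfiable. Let $T$ be any feasible topology for $N$. As seen before, all paths contained in $T$ are shortest paths. Let $\gamma$ be the number of Steiner points placed at $s$. All of the $|N| - 2 - \gamma = 2n + mn - 1 - \gamma$ other Steiner points are placed with distance 1 to the source. Since $\mathcal{C}$ is not satisfiable, at least one Steiner point $v$ with $|E(T_{[s,v]})| = k+m$, and hence all Steiner points from which $v$ is reachable in $T$, must be placed at position $s$. Thus, $\gamma \ge k+m$. For $v \in V(T) \setminus \{s\}$ let $\mathrm{pred}(v)$ denote the predecessor of $v$ in $T$. We have
\begin{align*}
\mathrm{length}(T) &= \sum_{v \in V(T) \setminus \{s\}} \mathrm{dist}_G(p(v), p(\mathrm{pred}(v))) \\
&\ge \sum_{v \in V(T) \setminus \{s\}} \big(\mathrm{dist}_G(s, p(v)) - \mathrm{dist}_G(s, p(\mathrm{pred}(v)))\big) \\
&= \sum_{t \in N \setminus \{s\}} \mathrm{dist}_G(s, t) - \sum_{v \in V(T) \setminus N} \mathrm{dist}_G(s, p(v)) \\
&\ge 4n + (1 + \varepsilon) \cdot nm - 2n - mn + 1 + \gamma \\
&> 2n + k + (1 + \varepsilon n) \cdot m.
\end{align*}
By choice of $\alpha$, $4 \le n = n^{\alpha(1-\beta) - 3\beta - 1}$ and hence,
\[ n^{(\alpha+3)\beta} \cdot 4n \le n^{\alpha} \le m < 2n + k + (1 + \varepsilon n) m. \qquad (5.1) \]
By choice of $m$ and $\varepsilon$,
\[ |N|^{\beta} \cdot (3n + nm \cdot \varepsilon) = (n \cdot m + 2n)^{\beta} \cdot (3n + nm \cdot \varepsilon) < n^{(\alpha+3) \cdot \beta} \cdot 4n. \qquad (5.2) \]
(5.1) and (5.2) yield the result. □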
Note that the Huffman Coding algorithm [Huf52] (Algorithm 8 for Variant a)) is an $|N|$-approximation algorithm as it produces a feasible solution with cost
\[ \sum_{t \in N \setminus \{s\}} \mathrm{dist}(p(s), p(t)) < |N| \cdot \mathrm{OPT}, \]
where OPT denotes the length of an optimum feasible placed topology. In this sense, $\beta = 1$ is the smallest value such that the Shallow-Light Topology Problem with Required Arrival Times admits an $O(|N|^{\beta})$ approximation (unless P = NP).
5.3 Bicriteria Approximation
Since the Shallow-Light Topology Problem with Required Arrival Times cannot be approximated within a non-trivial factor (unless P = NP), we now relax the delay constraints
\[ \mathrm{delay}_T(s, t) \le \mathrm{rat}(t). \]
Our goal is to obtain short topologies in which the delay violations $\mathrm{delay}_T(s, t) - \mathrm{rat}(t)$ are not too large. In Section 5.4 we return to the Shallow-Light Topology Problem with Criticalities.
5.3.1 Previous Work
For the case that the bifurcation delay penalty $b$ is zero there are several examples of algorithms that compute topologies bounding both path lengths and topology length. For the case $M = \{p(t) : t \in N\}$, Alpert et al. [Alp+95] gave an algorithm that combines the algorithms of Prim [Pri57] and Dijkstra [Dij59]. Both algorithms extend a current topology $T$ by a new node $w$ with a position $p(w) \notin \{p(v) : v \in V(T)\}$ and an edge $(v, w)$ with $v \in V(T)$. While Prim's algorithm selects $v$ and $p(w)$ such that $\mathrm{dist}(p(v), p(w))$ is minimum, Dijkstra's algorithm selects $v$ and $p(w)$ such that $\mathrm{delay}_{(T,p)}(s, v) + \mathrm{dist}(p(v), p(w))$ is minimum. By choosing $v$ and $p(w)$ minimizing
\[ \xi \cdot \mathrm{dist}(p(v), p(w)) + (1 - \xi) \cdot \big(\mathrm{delay}_{(T,p)}(s, v) + \mathrm{dist}(p(v), p(w))\big) \quad \text{for some } \xi \in [0, 1] \]
we obtain the Prim-Dijkstra Algorithm by Alpert et al. [Alp+95]. Although that algorithm turns out to produce good results in practice, theoretical bounds are not known for $\xi \in (0, 1)$.
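The selection rule above can be illustrated by a minimal sketch, assuming terminals in the plane with $\ell_1$ distances and a naive cubic-time search over all candidate pairs. The function name `prim_dijkstra` and the input layout are hypothetical and not taken from [Alp+95].

```python
def prim_dijkstra(points, source, xi):
    """Grow a tree from `source` using the blended key described in the text.

    points: list of (x, y) terminal positions, source: index into points,
    xi = 1 behaves like Prim's rule, xi = 0 like Dijkstra's rule.
    Returns (parent, delay) dictionaries indexed by terminal.
    """
    def dist(a, b):
        # l1 distance between two terminals (assumption of this sketch)
        return abs(points[a][0] - points[b][0]) + abs(points[a][1] - points[b][1])

    parent = {source: None}
    delay = {source: 0.0}
    remaining = set(range(len(points))) - {source}
    while remaining:
        best = None
        for v in delay:              # v already in the tree
            for w in remaining:      # w is the candidate new node
                key = xi * dist(v, w) + (1 - xi) * (delay[v] + dist(v, w))
                if best is None or key < best[0]:
                    best = (key, v, w)
        _, v, w = best
        parent[w] = v
        delay[w] = delay[v] + dist(v, w)
        remaining.remove(w)
    return parent, delay
```

With `xi` between 0 and 1 the sketch interpolates between the two extremes; a practical implementation would use a priority queue instead of the exhaustive search.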
Cong et al. [Con+92] developed an algorithm that trades off topology length and the length of the longest path. For a given value $\varepsilon > 0$ their algorithm computes a topology with length at most $1 + \frac{2}{\varepsilon}$ times the cost of a minimum spanning tree for $N$. The length of each source-sink path is at most a factor $1 + \varepsilon$ longer than $\max_{t \in N \setminus \{s\}} \{\mathrm{dist}(p(s), p(t))\}$.
Roughly speaking, the algorithm of Cong et al. [Con+92] operates in two phases. In the first phase, a short topology is computed that arises from a minimum spanning tree by applying local transformations to achieve the degree constraints. Bounds on path lengths are ignored in that phase. This topology is then traversed by a depth-first search. When an edge $(v, w)$ is traversed, we check if $\mathrm{delay}_T(s, w) \le (1 + \varepsilon) \cdot \mathrm{dist}(p(s), p(w))$ holds (where $T$ is the current topology). If this is not the case, edge $(v, w)$ is replaced by a direct connection between $s$ and $w$.
This algorithm was improved by Khuller et al. [KRY95] who achieved path length bounds of $\mathrm{delay}_T(s, t) \le (1 + \varepsilon) \cdot \mathrm{dist}(p(s), p(t))$ for each $t \in N \setminus \{s\}$ while obtaining the same bound on topology length. Their algorithm proceeds similarly to the algorithm of Cong et al. [Con+92]. The only difference is that edge $(v, w)$ is considered for a second time after all vertices reachable from $w$ in $T$ have been visited. During that second traversal, it is checked if the delay to $v$ can be reduced by connecting it directly to $w$, i.e. if $\mathrm{delay}_T(s, v) > \mathrm{delay}_T(s, w) + \mathrm{dist}(p(v), p(w))$. This inequality can be fulfilled if a direct connection between $s$ and $w$ has been added before. If the inequality is fulfilled, the edge entering $v$ in $T$ is replaced by $(w, v)$.
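The two traversal checks can be sketched as follows. This is a simplified illustration for $b = 0$, assuming $\ell_1$ distances in the plane, pairwise distinct positions, no degree constraints, and an initial tree given as child lists; the name `shallow_light_tree` and the data layout are hypothetical. It uses the usual relaxation formulation and is not claimed to match [KRY95] line by line.

```python
def l1(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def shallow_light_tree(pos, root, children, eps):
    """pos: vertex -> (x, y); children: vertex -> list of children in the
    initial tree rooted at `root`. Returns (parent, d), where d[v] is an
    upper bound on the root-v path length in the final tree."""
    d = {v: float("inf") for v in pos}
    parent = {v: None for v in pos}
    d[root] = 0.0

    def relax(u, v):
        # reconnect v below u if that shortens the path estimate of v
        cand = d[u] + l1(pos[u], pos[v])
        if cand < d[v]:
            d[v] = cand
            parent[v] = u

    def dfs(u):
        # first visit: enforce the stretch bound by a direct root connection
        if d[u] > (1 + eps) * l1(pos[root], pos[u]):
            d[u] = l1(pos[root], pos[u])
            parent[u] = root
        for w in children.get(u, []):
            relax(u, w)   # forward traversal of (u, w)
            dfs(w)
            relax(w, u)   # second (backward) traversal of the same edge

    dfs(root)
    return parent, d
```

Since `d` never increases after a vertex has been checked, every vertex ends with `d[v] <= (1 + eps) * dist(root, v)`. The recursion is kept for readability; deep trees would need an explicit stack.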
In this section we generalize the result of Khuller et al. [KRY95] to the case that b > 0.
We also get rid of the requirement that the short topology computed in the first phase
arose from a minimum spanning tree.
Figure 5.3: Median $p$ of three points $p_1, p_2, p_3$.
Bartoschek et al. [Bar+10] proposed an algorithm for the Shallow-Light Topology Problem with Required Arrival Times. They first sort the sinks $N \setminus \{s\}$ by their bif-value (see Definition 5.1) in non-decreasing order. Let $t_1, \dots, t_{|N|-1}$ be that ordering. Starting with the topology $(\{s, t_1\}, \{(s, t_1)\})$ they insert the remaining sinks $t_2, \dots, t_{|N|-1}$ into the current topology (in that order). A sink $t_i$ is inserted by subdividing an edge $(v, w)$ by a new vertex $u$ and adding an edge between $u$ and $t_i$. Steiner node $u$ is positioned such that $p(u)$ lies both on a shortest $p(v)$-$p(w)$ path and on a shortest $p(v)$-$p(t_i)$ path and $\mathrm{dist}(p(v), p(u))$ is maximum. In the most relevant case that $(M, \mathrm{dist}) = (\mathbb{R}^2, \ell_1)$, this point is exactly the median of $p(t_i)$, $p(v)$, and $p(w)$.
Definition 5.7 (Median) The median of three points $p_1, p_2, p_3 \in \mathbb{R}^2$ is the point $p \in \mathbb{R}^2$ for which $p_x$ is the median of $(p_1)_x, (p_2)_x, (p_3)_x$ and $p_y$ is the median of $(p_1)_y, (p_2)_y, (p_3)_y$.
Figure 5.3 illustrates this definition. Bartoschek et al. [Bar+10] choose edge $(v, w)$ such that it minimizes a convex combination
\[ (1 - \xi) \cdot \text{“length of the resulting topology”} - \xi \cdot \text{“worst slack of the resulting topology”}, \]
where $\xi$ is a parameter that can take values between 0 and 1.
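Definition 5.7 translates directly into a few lines of code. The following sketch (function names `median3` and `l1` are illustrative) computes the coordinate-wise median; in the $\ell_1$ plane the result lies on a shortest path between every pair of the three points.

```python
def l1(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def median3(p1, p2, p3):
    """Coordinate-wise median of three points in the plane (Definition 5.7)."""
    xs = sorted((p1[0], p2[0], p3[0]))
    ys = sorted((p1[1], p2[1], p3[1]))
    return (xs[1], ys[1])
```

For example, `median3((0, 0), (4, 1), (2, 3))` evaluates to `(2, 1)`.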
Although the algorithm of Bartoschek et al. [Bar+10] turns out to be successful on
practical repeater tree instances, provable delay bounds or topology length bounds are only
known for the case ξ = 0 where the computed topology is at most as large as a minimum
spanning tree for N , and for the case ξ = 1 where it maximizes the worst slack.
5.3.2 A Bicriteria-Approximation Algorithm
We now formulate a so-called bicriteria-approximation algorithm. This result is joint work
with Stephan Held [HR13] and is based on the algorithm of Khuller et al. [KRY95].
Theorem 5.8 Let $(M, \mathrm{dist})$ be a metric space and let $N$ be a net with source $s \in N$ and positions $p : N \to M$. Let $b \ge 0$ be a bifurcation delay penalty and let $\mathrm{rat} : N \setminus \{s\} \to \mathbb{R}_{\ge 0}$ be required arrival times such that a feasible solution for the Shallow-Light Topology Problem with Required Arrival Times exists (see Lemma 5.3). Let $(T_{\mathrm{init}}, p_{\mathrm{init}})$ be any placed topology. For each $\varepsilon > 0$ we can compute a placed topology $(T, p)$ such that
\[ \mathrm{delay}_{(T,p)}(s, t) \le (1 + \varepsilon) \cdot \mathrm{rat}(t) + 2b \quad \text{for all } t \in N \setminus \{s\} \text{ and} \qquad (5.3) \]
\[ \mathrm{length}(T, p) < \left(1 + \frac{2}{\varepsilon}\right) \cdot \mathrm{length}(T_{\mathrm{init}}, p_{\mathrm{init}}) + \frac{4b \cdot (|N| - 1)}{\varepsilon}. \qquad (5.4) \]
The running time of the algorithm is $O(|N| \log |N| + \psi)$, where $\psi$ is the time needed to query $\mathrm{dist}(p(v), p(w))$ for all $(v, w) \in E(T_{\mathrm{init}})$ and $\mathrm{dist}(s, t)$ for all $t \in N \setminus \{s\}$.
Note that $\psi = O(|N|)$ in many applications, e.g. in the case that $(M, \mathrm{dist}) = (\mathbb{R}^2, \ell_1)$. In the remainder of this section we prove Theorem 5.8.
Description of the algorithm. Let $s'$ be the successor of $s$ in $T_{\mathrm{init}}$ and let $\overleftrightarrow{T_{\mathrm{init}}}$ be the directed graph with vertex set $V(T_{\mathrm{init}})$ and edge set $\{(v, w), (w, v) : (v, w) \in E(T_{\mathrm{init}})\}$. Note that $\overleftrightarrow{T_{\mathrm{init}}}$ is Eulerian. We perform an Eulerian walk in $\overleftrightarrow{T_{\mathrm{init}}} - s$ starting at $s'$. During the walk we keep track of a branching $B$ and an estimate $d(v)$ on the delay of the $s$-$v$ path in the final topology for each vertex $v$. Initially, set $B := T_{\mathrm{init}} - s$ and $d(s') := \mathrm{dist}(p(s), p(s'))$ (see Figure 5.4(a) for an example). Throughout the whole algorithm, for vertices $v \in V(B)$ that are not roots (i.e. for vertices $v \in V(B)$ such that $|\delta^-_B(v)| = 1$) we recursively set $d(v) := d(u) + b + \mathrm{dist}(p(u), p(v))$, where $(u, v) \in E(B)$ is the unique edge entering $v$ in $B$. By construction, each forward edge $(v, w) \in E(T_{\mathrm{init}})$ is visited prior to its backward counterpart $(w, v)$, and when $(w, v)$ is visited, the tour has finished visiting the vertices in the subtree of $T_{\mathrm{init}}$ rooted at $w$.
When we visit a forward edge $(v, w) \in E(T_{\mathrm{init}})$, we do nothing if $w \notin N \setminus \{s\}$. Otherwise, $w \in N \setminus \{s\}$ is a leaf and we check if
\[ d(w) > (1 + \varepsilon) \cdot \mathrm{rat}(w). \qquad (5.5) \]
If this is the case, we delete the edge $(v, w)$. The sink $w$ becomes a new root of $B$ and we set $d(w) = \mathrm{dist}(p(s), p(w)) + b \cdot \mathrm{bif}(w)$ (see Figures 5.4(a) and 5.4(b)).
When we visit a backward edge $(w, v) \in E(\overleftrightarrow{T_{\mathrm{init}}}) \setminus E(T_{\mathrm{init}})$, we check whether it is better to merge the current subtree of $B$ rooted at $v$ with the connected component of $B$ containing $w$. More precisely, we check if
\[ d(v) > d(w) + \mathrm{dist}(w, v) + b. \qquad (5.6) \]
Note that by the definition of $d$, this can only be the case if the edge $(v, w)$ is not in $B$ anymore. If Condition (5.6) is true, we
• delete the edge currently entering $v$ (unless $v$ is a root of $B$),
• subdivide the edge currently entering $w$ by a Steiner vertex placed at $p(w)$ and connect it to $w$ if $w$ is not a root of $B$,
• create a new Steiner point $u$ placed at $p(w)$, connect it to $v$ and $w$, and set $d(u) = d(w)$ if $w$ is a root.
See Figure 5.4(c) for an illustration. The vertex $u$ is the new root of the connected component of $B$ containing $v$ and $w$.
When we have finished the Eulerian walk, we make sure that $|\delta^+(v)| = 2$ for all $v \in V(B) \setminus N$. If $|\delta^+(v)| + |\delta^-(v)| \le 1$ for a Steiner point $v$, we delete it. If $v \in V(B) \setminus N$ has both out-degree and in-degree equal to one, we delete it and connect its predecessor with its successor.
Figure 5.4: The branching $B$ at different stages of the Eulerian walk. The wide orange circles mark the head vertex of the currently visited edge in $\overleftrightarrow{T_{\mathrm{init}}}$. The blue boxes mark roots in $B$.
(a) Initial branching $B$ when $(T_{\mathrm{init}}, p_{\mathrm{init}})$ is given by Figure 5.1(b). When visiting a forward arc $(v, t_4) \in E(T_{\mathrm{init}})$, we check if $d(t_4) > (1 + \varepsilon) \cdot \mathrm{rat}(t_4)$.
(b) If so, a new connected component of $B$ with root $t_4$ is created.
(c) When visiting a backward arc $(t_4, v)$, we reconnect $v$ if (5.6) holds.
Let $X$ be the set of roots of connected components of $B$ (e.g. boxed vertices in Figure 5.4(c)). Note that $s' \in X$ unless there are no sinks left in the connected component of $B$ containing $s'$ after the Eulerian walk. Set $\mathrm{rat}'(t') := d(t') + b$ for $t' \in X$. Let $\mathrm{bif}' : X \to \mathbb{N}$ be defined analogously to bif in Definition 5.1.
We have
\[ \sum_{t' \in X} 2^{-\mathrm{bif}'(t')} \le \frac{1}{2} + \frac{1}{2} \cdot \sum_{t \in N \setminus \{s\}} 2^{-\mathrm{bif}(t)} \le 1 \]
and hence, a topology $T_{\mathrm{toplvl}}$ for net $X \cup \{s\}$ in which $\mathrm{delay}_{T_{\mathrm{toplvl}}}(s, t') \le \mathrm{rat}'(t')$ for all $t' \in X$ exists and can be computed in $O(|X| \cdot \log(|X|)) \subseteq O(|N| \cdot \log(|N|))$ time using Huffman Coding [Huf52]. All Steiner vertices in $T_{\mathrm{toplvl}}$ are placed at position $p(s)$ and hence,
\[ \mathrm{length}(T_{\mathrm{toplvl}}) = \sum_{t' \in X} \mathrm{dist}(p(s), p(t')). \]
Finally, the algorithm returns topology T = Ttoplvl +B (Figure 5.1(c) in our example).
Positions of all nodes are chosen according to their position in Ttoplvl or in B respectively.
Since the running time claim of Theorem 5.8 is clear, it suffices to prove the worst slack bound and the length bound.
Proof of the worst slack bound of Theorem 5.8. Let $t \in N \setminus \{s\}$ be a sink. After the first visit of $t$, $d(t) \le (1 + \varepsilon) \cdot \mathrm{rat}(t)$ by (5.5). Note that $d(t)$ increases only if an edge on the path from the root of its connected component to $t$ is subdivided by a Steiner point during its second visit, i.e. after checking (5.6). Due to the subdivision, this can happen at most once. With $\mathrm{rat}'(t') = d(t') + b$ for all $t' \in X$, we conclude that $\mathrm{delay}_T(s, t) \le (1 + \varepsilon) \cdot \mathrm{rat}(t) + 2b$.
Proof of the length bound of Theorem 5.8. First note that
\[ \mathrm{length}(B) \le \mathrm{length}(T_{\mathrm{init}}) - \mathrm{dist}(s, s') \]
holds at the end of the Eulerian walk. Since $\mathrm{length}(T) = \mathrm{length}(B) + \mathrm{length}(T_{\mathrm{toplvl}})$, it suffices to estimate $\mathrm{length}(T_{\mathrm{toplvl}})$.
Let $X_1 := \{t_1, \dots, t_k\}$ be the set of sinks for which Condition (5.5) was true when we traversed the edge entering it. We assume that the elements of $X_1$ are sorted by the time they are traversed by the Eulerian walk (i.e. we visited $t_i$ before $t_{i+1}$ for all $1 \le i \le k - 1$). Let $X$ be the set of roots of $B$ at the end of the Eulerian walk as defined in the algorithm. By construction, for each $x \in X \setminus \{s'\}$ it holds that $p(x) = p(t_i)$ ($i \in \{1, \dots, k\}$), where $t_i$ is the unique sink from $X_1$ in the connected component of $B$ rooted at $x$. Hence,
\[ \mathrm{length}(T_{\mathrm{toplvl}}) = \sum_{x \in X} \mathrm{dist}(s, x) \le \mathrm{dist}(s, s') + \sum_{i=1}^{k} \mathrm{dist}(s, t_i). \]
Figure 5.5: $t_{i-1}$-$t_i$ sub-tour in the length bound proof of Theorem 5.8 for the case $t_{i-1} \ne s$.
In the remainder of the proof we show that
\[ \sum_{i=1}^{k} \mathrm{dist}(s, t_i) < \frac{2}{\varepsilon} \cdot \mathrm{length}(T_{\mathrm{init}}) + \frac{4b \cdot (|N| - 1)}{\varepsilon}. \]
Define $t_0 := s$ and $d(s) := 0$ at any time of the algorithm. Consider the time when we visit a sink $t_i \in X_1$ ($i \in \{1, \dots, k\}$) in the Eulerian walk.
Let $P_i$ be the $t_{i-1}$-$t_i$ path in $\overleftrightarrow{T_{\mathrm{init}}}$ and let $x \in V(P_i)$ such that $P_i$ is the union of a $t_{i-1}$-$x$ path $Q$ consisting of backward edges only and an $x$-$t_i$ path $R$ consisting of forward edges only (see Figure 5.5). If $i = 1$, $x := s$ and $Q$ is the trivial path with $E(Q) = \emptyset$. Let
• $d_1$ denote the function $d$ at the time right before traversing the first edge of $Q$,
• $d_2$ denote the function $d$ at the time right before traversing the first edge of $R$,
• $d_3$ denote the function $d$ at the time when we check Condition (5.5) for $t_i$.
Due to Condition (5.6) it holds that
\[ d_2(w) \le d_1(v) + b + \mathrm{dist}(v, w) \quad \text{for all } (v, w) \in E(Q). \]
As we do not delete edges of $R$ while traversing the $t_{i-1}$-$t_i$ sub-part of the Eulerian walk,
\[ d_3(w') = d_2(v') + b + \mathrm{dist}(v', w') = d_1(v') + b + \mathrm{dist}(v', w') \quad \text{for all } (v', w') \in E(R). \]
Consequently,
\[ d_3(t_i) \le d_1(t_{i-1}) + |E(P_i)| \cdot b + \mathrm{length}(P_i) = \mathrm{rat}(t_{i-1}) + |E(P_i)| \cdot b + \mathrm{length}(P_i). \]
The last equality follows from the equality $d_1(t_{i-1}) = \mathrm{rat}(t_{i-1})$, which is trivial for $i = 1$ and is true by construction and definition of $d_1$ for $i > 1$. By choice of $t_i$ as a sink for which Condition (5.5) holds, $(1 + \varepsilon) \cdot \mathrm{rat}(t_i) < d_3(t_i)$ and hence,
\[ (1 + \varepsilon) \cdot \mathrm{rat}(t_i) < \mathrm{rat}(t_{i-1}) + |E(P_i)| \cdot b + \mathrm{length}(P_i). \]
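Summing up over all $i = 1, \dots, k$ yields
\[ (1 + \varepsilon) \cdot \sum_{i=1}^{k} \mathrm{rat}(t_i) < \sum_{i=0}^{k-1} \mathrm{rat}(t_i) + \sum_{i=1}^{k} \big(\mathrm{length}(P_i) + b \cdot |E(P_i)|\big). \]
Since the $P_i$ are pairwise disjoint parts of the Eulerian walk through $\overleftrightarrow{T_{\mathrm{init}}}$, $\sum_{i=1}^{k} \mathrm{length}(P_i) \le 2 \cdot \mathrm{length}(T_{\mathrm{init}})$ and $\sum_{i=1}^{k} |E(P_i)| \le 2 \cdot |E(T_{\mathrm{init}})| = 4(|N| - 1) - 2$. Combination of these inequalities (and the observation $\mathrm{dist}(s, t_i) \le \mathrm{rat}(t_i)$ for all $i \in \{1, \dots, k\}$) concludes the proof:
\[ \sum_{i=1}^{k} \mathrm{dist}(s, t_i) < \frac{2 \cdot \mathrm{length}(T_{\mathrm{init}})}{\varepsilon} + \frac{4b \cdot (|N| - 1)}{\varepsilon}. \]
□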
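The last combination step of the length bound proof of Theorem 5.8 can be spelled out as follows. This is a sketch that reads the $i = 0$ term as $\mathrm{rat}(t_0) = d(s) = 0$, in line with the convention $t_0 := s$, $d(s) := 0$ used in that proof; this reading is an assumption of the sketch.

```latex
\begin{align*}
(1+\varepsilon)\sum_{i=1}^{k}\mathrm{rat}(t_i)
  &< \sum_{i=0}^{k-1}\mathrm{rat}(t_i)
     + \sum_{i=1}^{k}\bigl(\mathrm{length}(P_i) + b\,|E(P_i)|\bigr) \\
  &\le \sum_{i=1}^{k}\mathrm{rat}(t_i)
     + 2\,\mathrm{length}(T_{\mathrm{init}}) + b\,\bigl(4(|N|-1)-2\bigr)
     && \text{(using } \mathrm{rat}(t_0)=0 \le \mathrm{rat}(t_k)\text{)} \\
\Longrightarrow\quad
\varepsilon\sum_{i=1}^{k}\mathrm{rat}(t_i)
  &< 2\,\mathrm{length}(T_{\mathrm{init}}) + 4b\,(|N|-1) \\
\Longrightarrow\quad
\sum_{i=1}^{k}\mathrm{dist}(s,t_i)
  \le \sum_{i=1}^{k}\mathrm{rat}(t_i)
  &< \frac{2\,\mathrm{length}(T_{\mathrm{init}})}{\varepsilon}
     + \frac{4b\,(|N|-1)}{\varepsilon}.
\end{align*}
```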
5.4 Shallow-Light Topologies with Criticalities
The bicriteria-approximation algorithm can be used to approximate the Shallow-Light
Topology Problem with Criticalities as we show in this section.
An easy observation is that the Huffman Coding algorithm (Algorithm 8, Variant b))
is a 2-approximation in the case that λ(t) ≥ 1 for all t ∈ N\{s}. In the general case we
can use the bicriteria algorithm to approximate the Shallow-Light Topology Problem with
Criticalities as the next theorem shows.
Theorem 5.9 Let $N$ be a net with source $s$, positions $p : N \to M$ inside a metric space $(M, \mathrm{dist})$, and sink criticalities $\lambda(t) \in \mathbb{R}_{\ge 0}$ for $t \in N \setminus \{s\}$. Let $b \ge 0$ be a bifurcation delay penalty and let $\beta \ge 1$ such that an approximation algorithm $A$ for the Shortest Steiner Tree Problem with approximation guarantee $\beta$ exists.
There is an algorithm for the Shallow-Light Topology Problem with Criticalities that computes a placed topology $(T, p)$ such that
\begin{align*}
& \mathrm{length}(T, p) + \sum_{t \in N \setminus \{s\}} \lambda(t) \cdot \mathrm{delay}_{(T,p)}(s, t) \\
& \quad \le (1 + \varepsilon') \Big( \mathrm{length}(T^*, p) + \sum_{t \in N \setminus \{s\}} \lambda(t)\, \mathrm{delay}_{(T^*,p)}(s, t) \Big) + 2b \Big( \sum_{t \in T} \lambda(t) + \frac{2}{\varepsilon'} (|N| - 1) \Big),
\end{align*}
where $T^*$ is an optimum solution and $\varepsilon'$ is the unique positive solution to
\[ \Big(1 + \frac{2}{\varepsilon'}\Big) \cdot \beta = (1 + \varepsilon'). \]
If $\psi$ is the time needed to run $A$ on instance $N$, $(M, \mathrm{dist})$ and to query $\mathrm{dist}(s, t)$ for all $t \in N \setminus \{s\}$, the running time of the algorithm is $O(|N| \log(|N|) + \psi)$.
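Proof We use algorithm $A$ (plus local transformations to ensure degree constraints) to compute a topology $T_{\mathrm{init}}$ such that $\mathrm{length}(T_{\mathrm{init}}) \le \beta \cdot \mathrm{length}(T^*)$. We use the Huffman Coding Algorithm 8 to compute an optimum solution $T_{\mathrm{toplvl}}$ to Variant b) as defined in Section 5.1.2. In particular,
\[ \sum_{t \in N \setminus \{s\}} \lambda(t) \cdot \mathrm{delay}_{T_{\mathrm{toplvl}}}(s, t) \le \sum_{t \in N \setminus \{s\}} \lambda(t) \cdot \mathrm{delay}_{T^*}(s, t). \]
Together with required arrival times $\mathrm{rat}(t) := \mathrm{delay}_{T_{\mathrm{toplvl}}}(s, t)$ for $t \in N \setminus \{s\}$ we obtain a feasible instance for the Shallow-Light Topology Problem with Required Arrival Times. Let $T$ be the output of the bicriteria algorithm of Theorem 5.8 with initial topology $T_{\mathrm{init}}$ and $\varepsilon := \varepsilon'$. It holds that
\begin{align*}
& \mathrm{length}(T, p) + \sum_{t \in N \setminus \{s\}} \lambda(t)\, \mathrm{delay}_{(T,p)}(s, t) \\
& \quad \le \Big(1 + \frac{2}{\varepsilon'}\Big) \mathrm{length}(T_{\mathrm{init}}) + \frac{4b \cdot (|N| - 1)}{\varepsilon'} + \sum_{t \in N \setminus \{s\}} \lambda(t) \big( (1 + \varepsilon') \cdot \mathrm{delay}_{T_{\mathrm{toplvl}}}(s, t) + 2b \big) \\
& \quad = (1 + \varepsilon') \Big( \frac{\mathrm{length}(T_{\mathrm{init}})}{\beta} + \sum_{t \in N \setminus \{s\}} \lambda(t)\, \mathrm{delay}_{T_{\mathrm{toplvl}}}(s, t) \Big) + 2b \Big( \frac{2}{\varepsilon'} (|N| - 1) + \sum_{t \in N \setminus \{s\}} \lambda(t) \Big) \\
& \quad \le (1 + \varepsilon') \Big( \mathrm{length}(T^*) + \sum_{t \in N \setminus \{s\}} \lambda(t)\, \mathrm{delay}_{T^*}(s, t) \Big) + 2b \cdot \Big( \sum_{t \in T} \lambda(t) + \frac{2}{\varepsilon'} \cdot (|N| - 1) \Big).
\end{align*}
The claim about the running time is clear. □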
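The defining equation of $\varepsilon'$ in Theorem 5.9 is equivalent to the quadratic $\varepsilon'^2 + (1 - \beta)\,\varepsilon' - 2\beta = 0$, so its positive root has a closed form. The helper below is a small sketch for evaluating it numerically; the function name `eps_prime` is hypothetical.

```python
import math

def eps_prime(beta):
    """Positive solution of (1 + 2/eps) * beta = 1 + eps for beta >= 1.

    Multiplying by eps gives eps^2 + (1 - beta) * eps - 2 * beta = 0,
    whose positive root is returned.
    """
    if beta < 1:
        raise ValueError("an approximation guarantee satisfies beta >= 1")
    return ((beta - 1) + math.sqrt((beta - 1) ** 2 + 8 * beta)) / 2
```

Substituting the returned value back into $(1 + 2/\varepsilon') \cdot \beta$ and $1 + \varepsilon'$ gives a quick sanity check for a given $\beta$.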
Corollary 5.10 There are algorithms for the Shallow-Light Topology Problem with Criticalities with absolute error
\[ 2b \cdot \Big( \sum_{t \in T} \lambda(t) + \frac{2}{\varepsilon'} \cdot (|N| - 1) \Big) \]
and running time, relative error, and restriction according to the following table:
     running time                              relative error   restriction
1.   $O(|E(G)| + |V(G)| \cdot \log(|V(G)|))$   3.57             $(M, \mathrm{dist})$ = metric closure of graph $G$
2.   $O(|N| \cdot \log(|N|))$                  2.23             $(M, \mathrm{dist}) = (\mathbb{R}^2, \ell_1)$
3.   “polynomial”                              1.88             $(M, \mathrm{dist})$ = metric closure of graph $G$
4.   “polynomial”                              1.42             $(M, \mathrm{dist})$ is a Minkowski metric
Proof For 1.–4. we select the following algorithms for $A$ and apply Theorem 5.9.
• For 1. we use Mehlhorn's Algorithm [Meh88], which is a 2-approximation for the shortest Steiner tree problem in graphs and has running time $O(|E(G)| + |V(G)| \log(|V(G)|))$.
• For 2. we compute a minimum spanning tree in the Delaunay triangulation (see [HS75]), which is a $\frac{3}{2}$-approximation by the result of Hwang [Hwa76]. Since the Delaunay triangulation can be computed in $O(|N| \cdot \log(|N|))$ time and has $O(|N|)$ edges, we obtain a total running time of $O(|N| \cdot \log(|N|))$.
• For 3. we use the algorithm by Byrka et al. [Byr+13], which is a 1.39-approximation.
• For 4. we use the approximation scheme by Arora [Aro98] with $\varepsilon = 0.004$.
□
5.5 Topology Optimization
The theoretical length bound is not very strong. A non-optimized implementation of
the bicriteria algorithm of Theorem 5.8 would yield results far worse than the topologies
computed by the algorithm of Bartoschek et al. [Bar+10].
In this section we show how to improve the results of the bicriteria algorithm without losing the theoretical bounds. Although some optimizations can equally be applied to general metric spaces, we restrict ourselves to the case $(M, \mathrm{dist}) = (\mathbb{R}^2, \ell_1)$. The results of this section are joint work with Stephan Held.
5.5.1 Placement of Steiner Points in Delay-Optimum Solutions
Finding new positions for all Steiner nodes of a given topology such that delays along
the source-sink paths are not increased and the total topology length is minimized is
possible in polynomial time since the problem can be formulated as linear program (see e. g.
Rockel [Roc16]). Maßberg [Maß15] gave a polynomial time combinatorial algorithm that
combines dynamic programming with binary search to find such positions. Rockel [Roc16]
observed that the linear program is the dual of a Min-Cost-Flow problem and obtained a
second polynomial time combinatorial algorithm.
The most obvious reason why the output of the bicriteria algorithm is too long is the placement of Steiner points in the top-level topology. According to the following calculation, which holds for all topologies in which all source-sink paths are shortest paths, net length becomes smaller the larger the distances between the source and the Steiner points are:
\begin{align*}
\mathrm{length}(T_{\mathrm{toplvl}}) &= \sum_{(v, w) \in E(T_{\mathrm{toplvl}})} \big( \mathrm{dist}(p(s), p(w)) - \mathrm{dist}(p(s), p(v)) \big) \\
&= \sum_{v \in V(T_{\mathrm{toplvl}})} \big( |\delta^-_{T_{\mathrm{toplvl}}}(v)| - |\delta^+_{T_{\mathrm{toplvl}}}(v)| \big) \cdot \mathrm{dist}(p(s), p(v)) \\
&= \sum_{t \in N \setminus \{s\}} \mathrm{dist}(p(s), p(t)) - \sum_{v \in V(T_{\mathrm{toplvl}}) \setminus N} \mathrm{dist}(p(s), p(v)).
\end{align*}
In this sense, choosing position $p(s)$ for each Steiner point is the worst decision we can make with respect to length.
Instead, when nodes $v$ and $w$ are selected in Step ③ of Algorithm 8, we need to find the point that is both on a shortest $p(s)$-$p(v)$ path and on a shortest $p(s)$-$p(w)$ path and has maximum distance to $p(s)$. For $(M, \mathrm{dist}) = (\mathbb{R}^2, \ell_1)$, this point is exactly the median of $p(s)$, $p(v)$, and $p(w)$ (see Definition 5.7).
If the set of nodes from which we can choose in Step ③ is large, we can furthermore select $v$, $w$ close to each other, or such that the median of $p(s)$, $p(v)$, and $p(w)$ is far away from the source. The next lemma tells us how this can be accomplished if all sinks have a sufficiently large bif-value.
Lemma 5.11 Let $N$ be a net with source $s$ and positions $p : N \to \mathbb{R}^2$, and let $\mathrm{bif} : N \setminus \{s\} \to \mathbb{N}$ such that $\mathrm{bif}(t) = H := \lceil \log(|N| - 1) \rceil$ for all $t \in N \setminus \{s\}$. Let $T^*$ be a (short) topology for $N$ with respect to $\ell_1$-distances. Then, we can compute a placed topology $(T, p)$ with length at most $H \cdot \mathrm{length}(T^*)$ and $\mathrm{delay}_{(T,p)}(s, t) \le \mathrm{rat}(t)$ for all $t \in N \setminus \{s\}$ in time $O(|N| \log(|N|))$.
The proof of this lemma is inspired by the results of Rao et al. [Rao+92].
Proof Let $C$ be a Hamiltonian cycle through $N \setminus \{s\}$ with
\[ \mathrm{length}(C) := \sum_{\{v, w\} \in E(C)} \|p(v) - p(w)\|_1 \le 2 \cdot \mathrm{length}(T^*). \]
Cycle $C$ can be obtained by the double-tree algorithm starting with $T^*$ in $O(|C|)$ time. We use $C$ to compute a matching $M$ covering $2 \cdot \lfloor \frac{|N| - 1}{2} \rfloor$ sinks with
\[ \mathrm{length}(M) := \sum_{\{v, w\} \in E(M)} \|p(v) - p(w)\|_1 \le \frac{1}{2} \cdot \mathrm{length}(C) \le \mathrm{length}(T^*). \]
Matching $M$ is obtained in $O(|N|)$ time by taking every second edge in $C$ (for details see Rao et al. [Rao+92]). Choosing nodes $v$, $w$ for which $\{v, w\} \in M$ in the first $|M|$ iterations of the Huffman Coding algorithm (Algorithm 8 Variant a)) is a valid choice. We place all Steiner vertices at the median of $p(s)$ and the positions of the selected vertices.
Let $X'$ and $Y'$ be the sets $X$ and $Y$ of the Huffman Coding algorithm after the first $|M|$ iterations, respectively. $X' \cup Y'$ consists of the newly created Steiner vertices and at most one initial sink.
By choice of the Steiner points’ positions, C can be considered a Hamiltonian cycle
through X ′ ∪Y ′ and by short-cutting we transform C into a cycle C ′ through X ′ ∪Y ′ with
length(C ′) ≤ length(C) ≤ 2 · length(T ∗).
If X ′ ∪ Y ′ contains an initial sink (which is the case if and only if this set of initial sinks
has odd cardinality), we decrease its bif-value from H to H − 1. As seen in the proof of
Lemma 5.4, this does not violate feasibility of the resulting instance. We decrease H by 1
and iterate the same procedure until there is only one sink left that we directly connect
with s.
It is clear that the delay along the s-t path in the computed topology is at most rat(t)
for all t ∈ N\{s}.
In each of the first H−1 iterations we produce edges with total cost at most length(T ∗).
After the (H − 1)-st iteration we have at most 2 elements left in X ∪ Y that are both
placed within the bounding box of N . Thus, the length of the edges added in the last
iteration plus the length of the edge connecting the final sink to s is upper-bounded by the
length of the bounding box of N which is at most length(T ∗). We obtain the claim about
the cost bound.
Let $C_i$ be the cycle $C$ in iteration $i$. After initially sorting all sinks by their bif-value, the running time of the $i$-th iteration is proportional to $|C_i|$. Equation $\sum_{i=1}^{H+1} |C_i| = O(|N|)$ (by Lemma 2.7) yields the claimed running time. □
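The matching step in the proof of Lemma 5.11 can be sketched as follows. The two alternating edge sets of an even cycle together have length $\mathrm{length}(C)$, so the cheaper one has length at most $\frac{1}{2} \cdot \mathrm{length}(C)$. Handling an odd cycle by short-cutting one vertex first is an assumption of this sketch (the text refers to Rao et al. [Rao+92] for details), and the function name `matching_from_cycle` is hypothetical.

```python
def l1(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def matching_from_cycle(cycle, pos):
    """cycle: vertices in cyclic order; pos: vertex -> (x, y).

    Returns a list of vertex pairs covering 2 * floor(len(cycle) / 2)
    vertices whose total l1 length is at most half the cycle length.
    """
    verts = list(cycle)
    if len(verts) % 2 == 1:
        # short-cut one vertex; by the triangle inequality the cycle
        # through the remaining vertices is not longer
        verts.pop()
    n = len(verts)
    if n < 2:
        return []
    even = [(verts[i], verts[i + 1]) for i in range(0, n, 2)]
    odd = [(verts[i], verts[(i + 1) % n]) for i in range(1, n, 2)]

    def cost(matching):
        return sum(l1(pos[a], pos[b]) for a, b in matching)

    return even if cost(even) <= cost(odd) else odd
```

Each returned pair would then be joined by a Steiner vertex placed at the median of $p(s)$ and the two positions, as described in the proof.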
The algorithm of the proof of Lemma 5.11 can be generalized to general bif-values with $\sum_{t \in N \setminus \{s\}} 2^{-\mathrm{bif}(t)} \le 1$. For $H = \max\{\mathrm{bif}(t) : t \in N \setminus \{s\}\}$ down to 1 we compute a short maximum matching covering the sinks with bif-value $\ge H$ and compute a topology based on these matchings as in Lemma 5.11. The length of the computed topology will be smaller the smaller the number of iterations $\max\{\mathrm{bif}(t) : t \in N \setminus \{s\}\}$ is. Counterintuitively, we obtain larger lengths for uncritical instances. In fact, many instances contain uncritical sinks for which the original bif-value according to Formula 5.1 is larger than necessary. Lemma 5.12 shows how to efficiently decrease the bif-value of uncritical sinks such that we still have an instance for which a topology without delay violations exists.
Lemma 5.12 Let $N$ be a net with source $s$ and let $\mathrm{bif} : N \setminus \{s\} \to \mathbb{N}$ such that $\sum_{t \in N \setminus \{s\}} 2^{-\mathrm{bif}(t)} \le 1$. We can find $H \in \mathbb{N}$ minimum such that $\sum_{t \in N \setminus \{s\}} 2^{-\min\{\mathrm{bif}(t), H\}} \le 1$ in time $O(|N| \log(|N|))$.
Proof Initially, we sort the sinks by their bif-value. For all $H \in \mathbb{N}$ this yields a sorting with respect to values $\min\{\mathrm{bif}(t), H\}$ as well.
We find the minimum value for $H$ by binary search in the interval $[\lfloor \log(|N|) \rfloor, |N| - 2]$. For a candidate value of $H$ we run Variant a) of the Huffman Coding Algorithm 8 with bif-values $\min\{\mathrm{bif}(t), H\}$. By Lemma 5.2, one call of the algorithm needs time $O(|N|)$ (since the input is already sorted). By Lemmas 5.3 and 5.4, the algorithm finds a topology $T$ for which $|E(T_{[s,t]})| - 1 \le \min\{\mathrm{bif}(t), H\}$ for each $t \in N \setminus \{s\}$ if and only if $\sum_{t \in N \setminus \{s\}} 2^{-\min\{\mathrm{bif}(t), H\}} \le 1$. □
In the case that $H = \lceil \log(|X|) \rceil$ holds for the top-level instance $X \cup \{s\}$ in the bicriteria algorithm of Theorem 5.8 after application of Lemma 5.12, computing top-level topology $T_{\mathrm{toplvl}}$ by the algorithm of Lemma 5.11 can significantly improve the theoretical and practical length bound.
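The search for the minimum cap $H$ of Lemma 5.12 can be sketched as a binary search over a monotone predicate. As a simplification, this sketch evaluates the Kraft-type sum directly with exact fractions instead of running the Huffman Coding algorithm for each candidate, and it searches the interval $[0, \max \mathrm{bif}]$ rather than the interval used in the proof. The names `kraft_sum` and `min_cap` are hypothetical.

```python
from fractions import Fraction

def kraft_sum(bif_values, cap):
    """Exact value of sum over sinks of 2^(-min(bif(t), cap))."""
    return sum(Fraction(1, 2 ** min(b, cap)) for b in bif_values)

def min_cap(bif_values):
    """Smallest H >= 0 with kraft_sum(bif_values, H) <= 1.

    The sum is non-increasing in H, so binary search applies.
    Assumes a non-empty list of non-negative integers.
    """
    hi = max(bif_values)
    if kraft_sum(bif_values, hi) > 1:
        raise ValueError("instance is infeasible even without a cap")
    lo = 0
    while lo < hi:
        mid = (lo + hi) // 2
        if kraft_sum(bif_values, mid) <= 1:
            hi = mid
        else:
            lo = mid + 1
    return lo
```

For the bif-values `[1, 5, 5]` the cap can be lowered to 2, since $\frac{1}{2} + \frac{1}{4} + \frac{1}{4} = 1$ while capping at 1 would give a sum of $\frac{3}{2}$.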
The following result is based on Lemma 3.2 of Elkin and Solomon [ES15] and appeared
in [HR13] (Theorem 3).
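Theorem 5.13 Let $N$ be a net with source $s$ and let $\mathrm{bif} : N \setminus \{s\} \to \mathbb{N}$ such that $\mathrm{bif}(t) = H := \lceil \log(|N| - 1) \rceil$ for $t \in N \setminus \{s\}$. Let $T^*$ be any (short) topology for $N$ with respect to $\ell_1$-distances and let $\varepsilon > 0$ such that
\[ \sum_{t \in N \setminus \{s\}} \|p(t) - p(s)\|_1 \le \frac{2}{\varepsilon} \cdot \mathrm{length}(T^*). \]
Then, the placed topology $(T, p)$ computed with Lemma 5.11 has length at most
\[ \Big(1 + \Big\lceil \log\Big(\frac{2}{\varepsilon}\Big) \Big\rceil \Big) \cdot \mathrm{length}(T^*). \]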
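Proof We may assume that $|N| - 1$ is a power of 2 as otherwise, we can insert $2^H - (|N| - 1)$ additional sinks with bif-value $H$ at source position. For $i = 1, \dots, H+1$ we denote the set of edges produced in the $i$-th iteration by $E_i$ ($E_{H+1}$ consists of the edge leaving $s$ in $T$ only). Note that the number of sinks reachable from the endpoint of an edge in $E_i$ is $2^{i-1}$. As the statement follows directly from Lemma 5.11 if $\lceil \log(\frac{2}{\varepsilon}) \rceil + 1 \ge H$, we may assume that $\lceil \log(\frac{2}{\varepsilon}) \rceil \le H - 2$. It holds that
\begin{align*}
\sum_{t \in N \setminus \{s\}} \|p(s) - p(t)\|_1 &= \sum_{i=1}^{H+1} \Big( \sum_{(v, w) \in E_i} |\{t \in N \setminus \{s\} : w \in V(T_{[s,t]})\}| \cdot \|p(v) - p(w)\|_1 \Big) \\
&= \sum_{i=1}^{H+1} \Big( 2^{i-1} \cdot \sum_{(v, w) \in E_i} \|p(v) - p(w)\|_1 \Big) \\
&\ge \frac{2}{\varepsilon} \cdot \sum_{i = \lceil \log(2/\varepsilon) \rceil + 1}^{H+1} \Big( \sum_{(v, w) \in E_i} \|p(v) - p(w)\|_1 \Big).
\end{align*}
Thus,
\[ \mathrm{length}(T^*) \ge \frac{\varepsilon}{2} \sum_{t \in N \setminus \{s\}} \|p(t) - p(s)\|_1 \ge \sum_{i = \lceil \log(2/\varepsilon) \rceil + 1}^{H+1} \Big( \sum_{(v, w) \in E_i} \|p(v) - p(w)\|_1 \Big) \]
and
\[ \mathrm{length}(T) = \sum_{i=1}^{H+1} \Big( \sum_{(v, w) \in E_i} \|p(v) - p(w)\|_1 \Big) \le \Big( \Big\lceil \log\Big(\frac{2}{\varepsilon}\Big) \Big\rceil + 1 \Big) \cdot \mathrm{length}(T^*). \]
□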
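Note that the condition $H = \lceil \log(|N| - 1) \rceil$ is always fulfilled in the case $b = 0$ and after applying Lemma 5.11. Using Theorem 5.13 for building top-level topology $T_{\mathrm{toplvl}}$ in Theorem 5.8 improves the length bound to
\[ \Big(2 + \Big\lceil \log\Big(\frac{2}{\varepsilon}\Big) \Big\rceil \Big) \cdot \mathrm{length}(T_{\mathrm{init}}) \quad \text{if } 0 < \varepsilon \le 2. \]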
Figure 5.6: Instance for which placement of Steiner points results in large topology length. Here, we assume that a forward violation has been detected at $t_1$ and backward violations at the orange edges.
(a) Initial topology. Sink $t_1$ is supposed to be critical while all other sinks are uncritical.
(b) Branching (black and orange) and top-level topology (gray). Branching root $x_1$ is far away from $s$.
5.5.2 Changing Component Layout and Steiner Point Positions
In this section we concentrate on the placement of Steiner points inserted during the back-connect step performed after finding out that Equation (5.5) in the bicriteria algorithm of Theorem 5.8 is not fulfilled. Recall that during that step we place Steiner points at the position of the new edge's tail. This can lead to poor results, as Figure 5.6 shows.
The following optimizations avoid these problems:
Optimization 5.14 (Rearrange Steiner points) Traverse a branching in reverse topological order. When we visit a Steiner point $v$ that is not a root, we set $p(v)$ to the median of its two children and its parent. If $v$ is a root, we set $p(v)$ to the median of its children and $s$.
Optimization 5.14 neither increases the length nor decreases the worst slack of a branching $B$ and can be performed in running time linear in the size of $B$.
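Optimization 5.14 can be sketched as a single bottom-up pass. The data layout below (dictionaries `pos`, `children`, `parent`, a set `steiner`) and the function names are assumptions made for illustration; Steiner points are expected to have exactly two children, and others are skipped.

```python
def median3(p1, p2, p3):
    xs = sorted((p1[0], p2[0], p3[0]))
    ys = sorted((p1[1], p2[1], p3[1]))
    return (xs[1], ys[1])

def rearrange_steiner_points(pos, children, parent, steiner, source_pos):
    """pos: vertex -> (x, y); children: vertex -> list of children;
    parent: vertex -> parent, or None for roots of the branching;
    steiner: set of Steiner vertices. Updates and returns pos."""
    # pre-order listing: every vertex appears before its descendants
    order = []
    stack = [v for v in pos if parent.get(v) is None]
    while stack:
        v = stack.pop()
        order.append(v)
        stack.extend(children.get(v, []))
    # reversed pre-order visits children before parents
    for v in reversed(order):
        kids = children.get(v, [])
        if v not in steiner or len(kids) != 2:
            continue
        # roots use the source position, other Steiner points their parent
        third = source_pos if parent.get(v) is None else pos[parent[v]]
        pos[v] = median3(pos[kids[0]], pos[kids[1]], third)
    return pos
```

Each vertex is visited once, so the pass runs in time linear in the size of the branching.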
Optimization 5.15 (Change component layout) Let $x$ be a root of a non-trivial connected component $H$ of the branching $B$ built by the algorithm. We replace all directed edges by undirected ones, delete $x$ but add an edge between its two children.
We sort all edges $\{v, w\}$ by the distance between $s$ and the median of $p(s)$, $p(v)$, and $p(w)$ in non-decreasing order.
For an edge $e = \{v, w\}$, we subdivide $e$ by a Steiner point $v'$ placed on the median of $p(v)$, $p(w)$, $p(s)$ and direct all edges away from $v'$. We compute a required arrival time of $v'$ by propagating rat-values of all sinks in $B$ to $v'$ in reverse topological order.
By Huffman Coding we can check if a top-level topology without delay violations for the resulting instance exists. If this is the case, we keep the changes to $B$; otherwise we revert the transformations and continue with the next edge.
Figure 5.7 illustrates the two optimizations.
Theorem 5.16 Applying Optimization 5.15 before building the top-level topology Ttoplvl
in the bicriteria algorithm of Theorem 5.8 does not violate Properties (5.3) and (5.4) of
Theorem 5.8. The running time of applying Optimization 5.15 to one component H is
O(k log k + |E(H)| · (|E(H)|+ k)),
where k is the number of connected components.
Figure 5.7: Effect of Optimization 5.14 and 5.15 on the instance of Figure 5.6.
(a) Result after applying Optimization 5.14.
(b) Possible result after applying Optimization 5.15: We check if a feasible top-level topology exists after creating connection points nearer to the source by subdividing e.g. the left horizontal edge.
Proof First, note that in the iteration in which we select the edge connecting the former children of $x$, the achievable worst slack is at least as large as stated by Property (5.3). We conclude that $\sum_{v \text{ root of } B} \mathrm{dist}(p(s), p(v))$ is not increased by Optimization 5.15, which implies Property (5.4). Property (5.3) is clear since we apply Huffman Coding in each iteration.
To achieve the claimed running time, we first sort the set of connected components by their bif-value in non-decreasing order. Transformation of the component and rat-propagation in each iteration takes $O(|E(H)|)$ time while restoring the ordering of the set of branching roots as well as the application of Huffman Coding takes $O(k)$ time. □
5.5.3 Optimization with Greedy
Right before the computation of the top-level topology Ttoplvl it is possible that two
connected components with very different bif′-values are located close together. It is unlikely
that these two components are joined during Huffman Coding, although joining them would result
in a small length. In Figure 5.8, the black Steiner point is supposed to have a small bif′-value
since it has a critical successor t4. Sink t2 is supposed to be uncritical and hence has a
larger bif′-value. Joining t2 and the black node would result in a short topology, although
the Huffman Coding algorithm would probably not do that.
One idea is to check if we can include the non-critical component into the critical one
by subdividing a non-critical edge of the critical component.
We will describe a procedure similar to the greedy algorithm by Bartoschek et al. [Bar+10]
to merge components.
[Figure 5.8: Branching B for which Huffman Coding finds a long top-level topology.
(a) Branching B. Sink t4 is supposed to be critical while all other sinks are uncritical.
(b) Branching (black and orange) and top-level topology (gray).]
Optimization 5.17 (Optimization with greedy) Let $C_1, \dots, C_k$ be the set of connected
components (different from $\{s\}$) of a branching $B$ and let $\bar{x}_1, \dots, \bar{x}_k$ be the
corresponding roots. Initially, $B$ is the branching in the bicriteria algorithm of Theorem 5.8
before building Ttoplvl.
We select a connected component $C_i$ that we have not selected in any previous iteration
and try to merge it into another component. We consider two different ways of merging
components:
(i) We insert $C_i$ into a different component $C_j$ by subdividing an edge
$\grave{e} = (\bar{v}, \bar{w}) \in E(C_j)$ by a Steiner point $\mathring{u}$ placed at the median of
$p(\bar{v})$, $p(\bar{w})$, $p(\bar{x}_i)$, and adding the edge $(\mathring{u}, \bar{x}_i)$.
See Figure 5.9(a) for a visualization.
(ii) We merge $C_i$ and another component $C_j$ by connecting $\bar{x}_i$ and $\bar{x}_j$ with a new
Steiner point $\mathring{u}$ placed at the median of $p(\bar{s})$, $p(\bar{x}_i)$, $p(\bar{x}_j)$.
New edges $(\mathring{u}, \bar{x}_i)$ and $(\mathring{u}, \bar{x}_j)$ connect the new root
$\mathring{u}$ with the former roots $\bar{x}_i$ and $\bar{x}_j$. An example of this operation
is shown in Figure 5.9(b).
For all possible branchings $\tilde{B}$ obtained by Operations (i) and (ii) we compute the
maximal achievable worst slack $\mathrm{wsl}(\tilde{B})$ of a top-level topology connecting
$\tilde{B}$ by
• computing required arrival times and bif-values of the roots of components of $\tilde{B}$ by
backward propagation of required arrival times at sinks, and by
• computing arrival times of the roots of components of $\tilde{B}$ by applying Huffman Coding
to the set of these roots with the newly computed bif-values.
Among all candidate branchings $\tilde{B}$ with
$$\mathrm{wsl}(\tilde{B}) \ge 0 \quad \text{and} \qquad (5.7)$$
$$\mathrm{length}(\tilde{B}) + \sum_{\bar{x} \text{ root of } \tilde{B}} \|p(s) - p(\bar{x})\|_1 \;\le\; \mathrm{length}(B) + \sum_{\bar{x} \text{ root of } B} \|p(s) - p(\bar{x})\|_1 \qquad (5.8)$$
we select the one minimizing $(1 - \xi) \cdot \mathrm{length}(\tilde{B}) - \xi \cdot \mathrm{wsl}(\tilde{B})$
for a trade-off parameter $\xi \in [0, 1]$.
If such a branching $\tilde{B}$ exists, we set $B := \tilde{B}$ and iterate.
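The selection rule of Optimization 5.17 can be illustrated by a short sketch. This is not the implementation used for the experiments: the record `Candidate` and the function `select_candidate` are hypothetical names, and the worst slack of every candidate is assumed to have been computed beforehand (for example by Huffman Coding as described above). The field `root_dist` stands for the sum of L1-distances between the source and the roots.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Candidate:
    """Hypothetical summary of a (candidate) branching."""
    length: float     # length of the branching
    root_dist: float  # sum of ||p(s) - p(x)||_1 over all roots x
    wsl: float        # maximal achievable worst slack


def select_candidate(current: Candidate,
                     candidates: Iterable[Candidate],
                     xi: float) -> Optional[Candidate]:
    """Pick the feasible candidate minimizing (1 - xi) * length - xi * wsl.

    A candidate is feasible if it satisfies Condition (5.7), wsl >= 0, and
    Condition (5.8), i.e. length + root_dist does not exceed the value of the
    current branching. Returns None if no candidate is feasible.
    """
    if not 0.0 <= xi <= 1.0:
        raise ValueError("xi must lie in [0, 1]")
    bound = current.length + current.root_dist
    feasible = [c for c in candidates
                if c.wsl >= 0.0 and c.length + c.root_dist <= bound]
    if not feasible:
        return None
    return min(feasible, key=lambda c: (1.0 - xi) * c.length - xi * c.wsl)
```

With ξ = 0 the rule only looks at length, with ξ = 1 only at worst slack; Conditions (5.7) and (5.8) act as filters in both cases.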
Figure 5.10 shows the effect of Optimization 5.17 on the instance shown in Figure 5.8. Note
that the algorithm of Bartoschek et al. [Bar+10] is equivalent to Optimization 5.17 if we
start with the initial branching (N, ∅) consisting of singletons only, select components Ci
according to the bif-values of their roots in non-decreasing order, and if we omit Conditions
(5.7) and (5.8).
Theorem 5.18 Applying Optimization 5.17 before building the top-level topology Ttoplvl
in the bicriteria algorithm of Theorem 5.8 does not violate Properties (5.3) and (5.4) of
Theorem 5.8. The running time is O(k² · (|E(B)| + k)) where k is the number of connected
components of the initial branching.
Proof Since Conditions (5.7) and (5.8) guarantee that Properties (5.3) and (5.4) are
preserved respectively, we only have to show how to obtain the claimed running time.
At the beginning of the algorithm we compute worst slack and bif-value of each root in
the initial branching. Furthermore, we sort the roots by their bif-value in non-decreasing
order. During the algorithm we keep wsl and bif values up-to-date and preserve the sorting.
[Figure 5.9: Examples of new branching candidates in Figure 5.8. For all candidates, components
$C_i$ and $C_j$ are merged.
(a) New branching resulting from subdivision of an edge $\grave{e}$ by a Steiner node
$\mathring{u}$ that is connected to the root $\bar{x}_i$ of another component.
(b) New branching resulting from insertion of a new Steiner node $\mathring{u}$ that is connected
with two roots $\bar{x}_i$ and $\bar{x}_j$.]
Let $i$ be a fixed iteration index. Let $B_i$ be the branching $B$ at the beginning of that
iteration and let $k_i$ be the number of components of $B_i$. We try to merge component $C_{i'}$
of $B_i$ with another component in iteration $i$. Let $\grave{e} = (\bar{v}, \bar{w})$ be an edge of
a component $C_j \neq C_{i'}$. If we subdivide $\grave{e}$ as described in (i), the new worst slack
of $\bar{x}_j$ will be the minimum of the old worst slack of $\bar{x}_j$ and
$\mathrm{wsl}(\bar{w}) - b$. If we merge $C_{i'}$ with another component $C_j$ by inserting a new
Steiner node $\mathring{u}$ connected with $\bar{x}_i$ and $\bar{x}_j$ as described in (ii), the
worst slack of the new root $\mathring{u}$ is
$$\min\{\mathrm{wsl}(\bar{x}_i) - \|p(\mathring{u}) - p(\bar{x}_i)\|_1,\ \mathrm{wsl}(\bar{x}_j) - \|p(\mathring{u}) - p(\bar{x}_j)\|_1\} - b.$$
Clearly, determining $\mathrm{length}(\tilde{B})$ and
$\sum_{\bar{x} \text{ root of } \tilde{B}} \|p(s) - p(\bar{x})\|_1$ for a candidate branching
$\tilde{B}$ can be done in constant time given these values for $B_i$.
Hence, checking if a candidate branching $\tilde{B}$ satisfies (5.7) and (5.8), and determining
the value $(1 - \xi) \cdot \mathrm{length}(\tilde{B}) - \xi \cdot \mathrm{wsl}(\tilde{B})$ can be
done in time $O(k_i)$ by Huffman Coding.
In addition, the worst slack information of the finally selected branching $B_{i+1}$ allows us
to update bif-values in constant time and the sorting of roots in $O(k_i)$ time. We also
update worst slack information at every node of the selected candidate branching as
it is needed to evaluate candidate branchings in the next iteration. This takes time
$O(|E(B_i)| + |V(B_i)|) = O(|E(B_i)| + k_i)$.
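The constant-time update for Operation (ii) can be sketched as follows. The sketch assumes two-dimensional positions, interprets "median" as the coordinate-wise median of the three points, and charges one delay unit per unit of L1-length as in the formula above; the function names are hypothetical.

```python
from typing import Tuple

Point = Tuple[float, float]


def l1(p: Point, q: Point) -> float:
    """L1 (rectilinear) distance between two points."""
    return abs(p[0] - q[0]) + abs(p[1] - q[1])


def median3(a: Point, b: Point, c: Point) -> Point:
    """Coordinate-wise median of three points (assumed meaning of 'median')."""
    xs = sorted((a[0], b[0], c[0]))
    ys = sorted((a[1], b[1], c[1]))
    return (xs[1], ys[1])


def merge_roots(p_s: Point,
                p_xi: Point, wsl_xi: float,
                p_xj: Point, wsl_xj: float,
                b: float) -> Tuple[Point, float]:
    """Operation (ii): join two roots below a new Steiner point.

    Returns the position of the new root and its worst slack
    min{wsl(x_i) - |p(u) - p(x_i)|_1, wsl(x_j) - |p(u) - p(x_j)|_1} - b,
    where b is the bifurcation delay penalty.
    """
    p_u = median3(p_s, p_xi, p_xj)
    wsl_u = min(wsl_xi - l1(p_u, p_xi), wsl_xj - l1(p_u, p_xj)) - b
    return p_u, wsl_u
```

For a source at (0, 0) and roots at (4, 2) and (2, 6), the new root is placed at (2, 2), so the two new edges have lengths 2 and 4.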
Using the fact that the expression $|E(B_i)| + k_i$ is increased by at most 2 in each of the at
most $k_1$ iterations, we obtain a total running time of
$$O(k_i \cdot (|E(B_i)| + k_i)) = O(k_1 \cdot (|E(B_1)| + k_1))$$
for iteration $i$, and the claimed running time follows. □
[Figure 5.10: Effect of Optimization 5.17 on the instance of Figure 5.8(a).
(a) Branching B after applying Optimization 5.17. The delay on the critical path to t4 did not
increase.
(b) Since we achieved that t2 is no longer a sink, we can build a shorter top-level topology.]
5.6 Experimental Results
In this section we show results of the bicriteria algorithm of Theorem 5.8 and the optimiza-
tions described in Section 5.5 on practical repeater tree instances. As initial topologies we
computed approximately shortest topologies. For up to 9 terminals we used the FLUTE
algorithm by Chu and Wong [CW08] and for larger nets we ran an optimized variant of
Prim’s algorithm [Pri57] on the Delaunay triangulation [HS75] that has an approximation
guarantee equal to the Steiner ratio 3/2 (see [Hwa76]).
Our testbed consists of 584 592 repeater tree instances from 32 real-world designs in
14 nm and 22 nm technology. The designs are provided by our cooperation partner IBM.
Each instance has at least 4 pins, i. e. we excluded trivial instances with up to 3 pins. The
largest instance has 12 063 pins. We deactivated pre-clustering of sinks for large instances.
The bifurcation delay penalties b vary from 3.4 ps to 5.3 ps. Depending on the design,
this corresponds to the delay along a wire of length between 5.5 µm and 11.5 µm. The
chosen delay parameters correspond to the lowest available layer although long, critical
connections would rather be routed on higher layers. By this choice of delay parameters,
wire delays are rather large and detours can easily result in bad worst slacks.
All tests were conducted on a machine with an Intel Xeon E5-2699 processor running at
2.20 GHz.
5.6.1 Layout of Tables
Let $\mathcal{I}$ be the set of all 584 592 instances. For an algorithm $A$ and $I \in \mathcal{I}$ let
• $\mathrm{length}(A(I))$ be the length of the topology computed by algorithm $A$ for instance $I$,
• $\mathrm{delay}_{A(I)}(t)$ be the delay from the source of $I$ to sink $t$ of $I$ inside the
topology computed by $A$ for instance $I$,
• $\mathrm{wsl}(A(I)) := \min\{\mathrm{rat}(t) - \mathrm{delay}_{A(I)}(t) : t \text{ sink in } I\}$ be
the worst slack of the output of $A$ on instance $I$, and
• $\mathrm{sns}(A(I)) := \sum_{t \text{ sink in } I} \min\{0,\ \mathrm{rat}(t) - \mathrm{delay}_{A(I)}(t)\}$
be the sum of negative slacks of the output of $A$ on instance $I$.
When comparing two algorithms $A_{\mathrm{main}}$ and $A_{\mathrm{ref}}$ we are interested in the length
ratios $\frac{\mathrm{length}(A_{\mathrm{main}}(I))}{\mathrm{length}(A_{\mathrm{ref}}(I))}$ and the
differences $\min\{0, \mathrm{wsl}(A_{\mathrm{main}}(I))\} - \min\{0, \mathrm{wsl}(A_{\mathrm{ref}}(I))\}$
and $\mathrm{sns}(A_{\mathrm{main}}(I)) - \mathrm{sns}(A_{\mathrm{ref}}(I))$ for $I \in \mathcal{I}$
(worst slack difference and sum of negative slacks difference). For these values we display
maximum (max), minimum (min) and average (av) values over instance groups
$\mathcal{I}' \subseteq \mathcal{I}$ containing instances of certain sizes. In addition, we report
total length ratios
$\frac{\sum_{I \in \mathcal{I}'} \mathrm{length}(A_{\mathrm{main}}(I))}{\sum_{I \in \mathcal{I}'} \mathrm{length}(A_{\mathrm{ref}}(I))}$
for these groups.
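The quantities reported in the tables can be computed as in the following sketch. It only restates the definitions above; the function names and the dictionary-based input format (required arrival times and delays indexed by sink) are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def worst_slack(rat: Dict[str, float], delay: Dict[str, float]) -> float:
    """wsl: minimum of rat(t) - delay(t) over all sinks t."""
    return min(rat[t] - delay[t] for t in rat)


def sum_negative_slacks(rat: Dict[str, float], delay: Dict[str, float]) -> float:
    """sns: sum of min{0, rat(t) - delay(t)} over all sinks t."""
    return sum(min(0.0, rat[t] - delay[t]) for t in rat)


def wsl_diff(wsl_main: float, wsl_ref: float) -> float:
    """Worst slack difference; positive slacks are clipped to 0 first."""
    return min(0.0, wsl_main) - min(0.0, wsl_ref)


def length_summary(lengths: List[Tuple[float, float]]) -> Dict[str, float]:
    """Summarize (length_main, length_ref) pairs of one instance group.

    'max', 'min' and 'av' refer to per-instance ratios, while 'total' is the
    ratio of the summed lengths, so long instances dominate it.
    """
    ratios = [main / ref for main, ref in lengths]
    return {
        "max": max(ratios),
        "min": min(ratios),
        "av": sum(ratios) / len(ratios),
        "total": sum(m for m, _ in lengths) / sum(r for _, r in lengths),
    }
```

The difference between `av` and `total` explains why both rows can differ noticeably within one instance group.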
[Figure 5.11: Instance on which optimization could reduce topology length without degrading the
worst slack. For better visibility, very short edges are not drawn and some sink nodes are plotted
in a different shade of blue.
(a) Bicriteria for ε = 0.1, without optimization: |N| = 19, length 168 µm, WSL −63 ps.
(b) Bicriteria for ε = 0.1, with optimization: |N| = 19, length 92 µm, WSL −63 ps.]
As a rule of thumb we can say that Amain did a better job with respect to length if and
only if numbers in the length ratio column of the tables in this section are smaller than 1.
It did a better job with respect to worst slack and sum of negative slack if numbers in the
respective slack difference column are larger than 0.
5.6.2 The Impact of Optimization
We ran the bicriteria algorithm with parameters ε = 0, 0.1, 0.5, and 1.
In the main run (Amain) we enabled the optimizations described in Section 5.5. We
used the improved placement of Steiner points resulting from a better top-level topology
(Section 5.5.1) and the optimization of component layouts (Section 5.5.2) for all instances.
Since the optimization with greedy is more time consuming, we select different configurations
based on the number k of connected components of the branching after the Eulerian walk.
If k ≤ 100, we proceed as described in Section 5.5.3. We use parameter ξ := 1 − ε.
If k is large, we cannot try all possible candidate branchings. If 101 ≤ k ≤ 1000, we
restrict to candidate branchings in which the selected component Ci is merged with the
component containing the successor of the source in the initial topology. This component
is usually the largest one as it contains the parts of Tinit that have not been ripped-up
during the Eulerian walk. To decrease the running time further, we avoid the application of
Huffman Coding in each iteration if 101 ≤ k ≤ 1000. Instead, when evaluating a candidate
branching in which two components with roots $\bar{x}_i$ and $\bar{x}_j$ are replaced by a component
with root $\mathring{u}$, we replace Condition (5.7) by the stronger but computationally easier
condition $2^{-\mathrm{bif}(\mathring{u})} \le 2^{-\mathrm{bif}(\bar{x}_i)} + 2^{-\mathrm{bif}(\bar{x}_j)}$.
If k > 1000, we do not optimize with greedy at all.
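The simplified test for 101 ≤ k ≤ 1000 amounts to a single comparison per candidate. The sketch below assumes integral bif-values; the function names are hypothetical. Reading the condition as "the merge does not increase the sum of the terms 2^(−bif) over all roots", in the spirit of Kraft's inequality, is an interpretation and is not stated in this section.

```python
def kraft_term(bif: int) -> float:
    """Contribution 2^(-bif) of a root with the given bif-value."""
    return 2.0 ** (-bif)


def merge_allowed(bif_u: int, bif_xi: int, bif_xj: int) -> bool:
    """Check 2^(-bif(u)) <= 2^(-bif(x_i)) + 2^(-bif(x_j)).

    u is the root of the merged component, x_i and x_j are the two roots it
    replaces. No Huffman Coding run is needed for this test.
    """
    return kraft_term(bif_u) <= kraft_term(bif_xi) + kraft_term(bif_xj)
```

For example, two roots with bif-value 3 may be replaced by a root with bif-value 2, but not by a root with bif-value 1.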
In the reference algorithm Aref we disabled the optimization of component layouts and
the optimization with greedy (Sections 5.5.2 and 5.5.3). The improved top-level topology
generation from Section 5.5.1 remained active also for the reference run. With this setting,
the reference algorithm coincides with the version of the bicriteria algorithm used in [HR13].
The results are shown in Table 5.1. Running times for the main algorithm range from
51 seconds for ε = 1 to almost 2 minutes for ε = 0. Roughly half of that time is spent
in topology optimization. Computation of the initial topologies takes about 30 seconds.
Taking into account that topology generation is a preparation step for the much slower
buffer insertion, all these running times are negligible.
Average and total values in the length ratio columns of the table point out an overall
positive effect of optimization on topology lengths. For the small values of ε (0 and 0.1),
these improvements are substantial although delay bounds are very tight. With larger
values of ε (0.5 and 1), the initial topology is feasible for many instances and the effect of
optimization is smaller.

                                       ε = 0                         ε = 0.1
|N|     # instances          wsl-diff    sns-diff  length   wsl-diff    sns-diff  length
                                 [ps]        [ps]   ratio       [ps]        [ps]   ratio
≤ 10        468 838  max           10         546    1.71         31         110    1.62
                     min          -10        -371    0.35        -87        -472    0.36
                     av             0           0    0.99          1           1    0.99
                     total                           0.97                           0.98
11-100      112 216  max           10       1 600    1.89         39       3 290    1.73
                     min          -10      -4 238    0.29       -103      -4 289    0.19
                     av             0         -12    0.93          0         -10    0.88
                     total                           0.90                           0.86
> 100         3 538  max           10       1 974    1.95         28      24 930    1.62
                     min           -5     -56 659    0.42        -65    -450 543    0.35
                     av             0        -633    0.88         -2      -1 058    0.81
                     total                           0.89                           0.81
all         584 592  max           10       1 974    1.95         39      24 930    1.73
                     min          -10     -56 659    0.29       -103    -450 543    0.19
                     av             0          -6    0.98          1          -7    0.96
                     total                           0.94                           0.91

                                       ε = 0.5                       ε = 1
|N|     # instances          wsl-diff    sns-diff  length   wsl-diff    sns-diff  length
                                 [ps]        [ps]   ratio       [ps]        [ps]   ratio
≤ 10        468 838  max           10          47    1.32         10          33    1.28
                     min         -231        -524    0.36        -40        -154    0.53
                     av             0           0    1.00          0           0    1.00
                     total                           1.00                           1.00
11-100      112 216  max          132       1 600    1.31         73       1 560    1.32
                     min         -175      -6 481    0.34       -220      -7 254    0.42
                     av            -3         -56    0.93         -3         -65    0.96
                     total                           0.92                           0.96
> 100         3 538  max           86      21 696    1.18         78       8 966    1.46
                     min         -285  -1 419 415    0.35       -286  -2 350 104    0.44
                     av           -18      -2 688    0.82        -30      -3 999    0.86
                     total                           0.81                           0.86
all         584 592  max          132      21 696    1.32         78       8 966    1.46
                     min         -285  -1 419 415    0.34       -286  -2 350 104    0.42
                     av            -1         -27    0.98         -1         -37    0.99
                     total                           0.95                           0.97

Table 5.1: Comparison between the bicriteria algorithm including all optimizations of Section 5.5
(Amain) and the version of [HR13] that does not contain component layout optimization and
optimization with greedy (Aref). We compare the results for two small values ε = 0, 0.1 (on top)
and for two larger values ε = 0.5, 1 (at the bottom).
The columns entitled length ratio show the ratios length(Amain)/length(Aref). Numbers below 1 in
these columns show that optimization decreases topology lengths. The wsl-diff columns show the
differences min{0, wsl(Amain)} − min{0, wsl(Aref)} while the sns-diff columns display
sns(Amain) − sns(Aref). Negative values in the columns wsl-diff and sns-diff show that optimization
degrades timing. A more detailed description of how the table is arranged can be found in
Section 5.6.1.

[Figure 5.12: Instance on which optimization could reduce topology lengths without degrading
the worst slack.
(a) Bicriteria algorithm for ε = 0.1, without optimization: |N| = 70, length 1 994 µm, WSL −74 ps.
(b) Bicriteria algorithm for ε = 0.1, with optimization: |N| = 70, length 1 013 µm, WSL −74 ps.]
On some instances optimization increased the length of the returned solution. The
reason for this effect is that we run the faster non-optimized version of Huffman coding
to evaluate worst slacks of branching candidates and the improved version of Huffman
coding from Section 5.5.1 could find a longer solution for the new branching than for the
old one. One possibility to overcome this would be to always use the optimized version
and reject solutions that increase topology lengths. However, this situation occurs rarely
and we prefer to take the runtime benefit from using the fast Huffman coding.
The overall improvement with respect to length comes at the cost of degraded timing.
With average degradations between 0 and 1 ps, the degradations of the average worst slack
are tiny. For sinks t with large required arrival times the bound (1 + ε) · rat(t) can become
weak if ε > 0. As optimization tries to shorten lengths as long as these delay bounds are
met, timing can degrade a lot on such instances. Examples of instances for which topology
optimization degrades the worst slack a lot are shown in Figures 5.13 and 5.14. In both
instances, critical sinks are connected to the source by direct connections in the top-level
topology of the non-optimized bicriteria algorithm. Optimization with greedy decreases
the number of these direct connections but paths with detours arise. With ε = 0.5, 50%
detour is allowed and since delay parameters correspond to lower layers here, these detours
result in huge worst slack degradations. To prevent the optimization with greedy from using
the freedom given by weak delay bounds, one can always choose ξ = 1 in the optimization
with greedy algorithm.
Minimizing the sum of negative slacks is no direct optimization goal and the average
degradations of the sum of negative slacks by the optimizations of Section 5.5 are small. The
larger degradations can be avoided by excluding branching candidates in the optimization
with greedy that result in a too large degradation of the sum of negative slacks. Examples
of instances for which optimization could improve the results substantially can be found in
Figures 5.11 and 5.12. On both instances we could decrease topology lengths by avoiding
long edges of the top-level topology. In Figure 5.12(b) the length improvement is almost
a factor of 2. In both cases, optimization did not degrade the worst slack.
                                 ε = 0             ε = 0.1
|N|     # instances          wsl-diff  length   wsl-diff  length
                                 [ps]   ratio       [ps]   ratio
≤ 10        468 838  max            3    2.89          3    2.33
                     min          -10    0.97        -99    0.97
                     av            -2    1.03         -1    1.01
                     total               1.04               1.01
11-100      112 216  max            2    3.87          1    3.68
                     min          -10    0.96       -106    0.95
                     av            -4    1.29         -5    1.15
                     total               1.33               1.15
> 100         3 538  max            0    5.22          2    4.58
                     min          -10    1.00       -104    1.00
                     av            -6    1.87        -14    1.57
                     total               1.87               1.55
all         584 592  max            3    5.22          3    4.58
                     min          -10    0.96       -106    0.95
                     av            -2    1.08         -2    1.04
                     total               1.20               1.10

                                 ε = 0.5           ε = 1
|N|     # instances          wsl-diff  length   wsl-diff  length
                                 [ps]   ratio       [ps]   ratio
≤ 10        468 838  max            3    2.01          3    2.13
                     min         -293    0.96       -547    0.98
                     av            -3    1.01         -3    1.00
                     total               1.00               1.00
11-100      112 216  max            0    2.79          0    2.23
                     min         -342    0.95       -562    0.95
                     av           -16    1.09        -24    1.05
                     total               1.08               1.04
> 100         3 538  max            0    2.91          0    2.81
                     min         -357    1.00       -704    1.00
                     av           -57    1.33       -106    1.23
                     total               1.28               1.18
all         584 592  max            3    2.91          3    2.81
                     min         -357    0.95       -704    0.95
                     av            -6    1.03         -8    1.01
                     total               1.05               1.03

Table 5.2: Comparison between the bicriteria algorithm including all optimizations of Section 5.5
(Amain) and bounds on length and worst slack. We compare the results for the two small values
ε = 0, 0.1 (on top) and for two larger values ε = 0.5, 1 (at the bottom).
The columns entitled length ratio show the ratios length(Amain)/length(Aref1), where Aref1 is the
algorithm computing the initial short topologies. The wsl-diff columns show the differences
min{0, wsl(Amain)} − min{0, wsl(Aref2)}, where Aref2 is the Huffman Coding Algorithm. A more
detailed description of how the table is arranged can be found in Section 5.6.1.

[Figure 5.13: Example of an instance for which optimization of the bicriteria algorithm with
ε = 0.5 degrades timing a lot. Here, the greedy algorithm with ξ = 0.5 gives a better solution.
(a) Bicriteria algorithm with ε = 0.5, without optimization: |N| = 7, length 2 329 µm, WSL −376 ps.
(b) Bicriteria algorithm with ε = 0.5, with optimization: |N| = 7, length 2 007 µm, WSL −618 ps.
(c) Greedy algorithm with ξ = 0.5: |N| = 7, length 2 011 µm, WSL −378 ps.]
5.6.3 Comparison between Bicriteria and Bounds for Length and Slack
We now compare the optimized bicriteria algorithm Amain from Section 5.6.2 with bounds
on length and worst slack. We compare the lengths of the computed topologies with the
lengths of the initial topologies. For instances with up to 9 sinks, the initial topologies are
shortest possible as they have been computed by the FLUTE algorithm [CW08]. Initial
topologies for larger instances are not necessarily optimum and can hence be longer than
the output of the bicriteria algorithm. We compare worst slacks with the worst slacks
of the output of the Huffman Coding Algorithm [Huf52]. Due to numerical issues it is
possible that values of the form $\frac{\mathrm{rat}(t) - \mathrm{dist}(p(s), p(t))}{b}$ are stored
with a small error. These errors can result in an error of 1 when applying the
$\lfloor \cdot \rfloor$-operation during computation of bif(t), and the worst slack of the Huffman
topologies can be smaller than the optimum by at most b.
According to Table 5.2, for ε = 0 we are close to the almost optimum timing of the Huffman
coding topology. On average, the worst slack lies only 2 ps below the almost-optimum. The
maximum worst slack degradation is 10 ps, which coincides with the worst slack guarantee
of 2b. On average, the topology lengths are 8% larger than the approximately shortest
topologies. For the largest instances the average and total length ratios are much larger,
resulting in a total length increase of 20% over all instances. Increasing ε from 0 to 0.1
does not degrade timing of most instances. The worst slack of timing outliers degrades by
roughly 100 ps but we gain substantial length improvements.
Topology lengths can be decreased further by increasing ε. For ε = 1, the total topology
lengths are within only 3% of optimum. The improved lengths for ε > 0 go along
with timing degradations. The average deviation from the almost optimum worst slack
increases from 2 ps for ε = 0, 0.1 to 8 ps for ε = 1. For the largest values of ε, 0.5 and 1,
there are some instances with a worst slack deviation of several hundred picoseconds. As
these timing degradations are unacceptable in almost all scenarios, parameters ε = 0.5 and
1 are usually not chosen in practice.
                                       ε = 0                         ε = 0.1
|N|     # instances          wsl-diff    sns-diff  length   wsl-diff    sns-diff  length
                                 [ps]        [ps]   ratio       [ps]        [ps]   ratio
≤ 10        468 838  max           15       3 064    2.74         20         672    2.45
                     min          -10        -323    0.22        -91        -565    0.29
                     av            -2           3    0.94         -1           2    0.93
                     total                           0.95                           0.95
11-100      112 216  max           48      30 937    3.11         43      15 216    2.34
                     min          -10      -7 079    0.11        -93      -7 323    0.26
                     av            -2         106    0.95         -3          71    0.94
                     total                           0.94                           0.94
> 100         3 538  max           61     624 469    2.61        112     216 446    2.09
                     min          -10     -50 412    0.15        -79    -177 589    0.30
                     av             0       2 675    0.97         -4       1 812    1.01
                     total                           0.90                           1.04
all         584 592  max           61     624 469    3.11        112     216 446    2.34
                     min          -10     -50 412    0.11        -93    -177 589    0.23
                     av            -2          39    0.94         -2          28    0.92
                     total                           0.94                           0.94

                                       ε = 0.5                       ε = 1
|N|     # instances          wsl-diff    sns-diff  length   wsl-diff    sns-diff  length
                                 [ps]        [ps]   ratio       [ps]        [ps]   ratio
≤ 10        468 838  max          218       1 076    1.79        413       1 491    2.06
                     min         -287        -776    0.43       -287      -1 148    0.75
                     av            -2          -1    0.96          0          -1    1.00
                     total                           0.98                           1.00
11-100      112 216  max          252      13 200    2.04        475       8 835    2.17
                     min         -325      -9 291    0.50       -533     -15 628    0.64
                     av           -10         -55    0.97         -2           4    1.01
                     total                           0.98                           0.99
> 100         3 538  max          489     633 770    2.07      1 724   2 551 448    2.63
                     min         -249    -286 890    0.73       -401    -305 011    0.74
                     av           -22         698    1.08         44      10 778    1.10
                     total                           1.06                           1.05
all         584 592  max          489     633 770    2.67      1 724   2 551 448    2.63
                     min         -325    -286 890    0.43     -1 124    -305 011    0.64
                     av            -4          -8    0.96          0          65    1.00
                     total                           0.99                           1.00

Table 5.3: Comparison between the bicriteria algorithm including all optimizations of
Section 5.5 (Amain) and the greedy topology algorithm by Bartoschek et al. [Bar+10] with trade-off
parameter ξ(ε) = 1 − ε. We compare the results for two small values ε = 0, 0.1 (on top) and for
two larger values ε = 0.5, 1 (at the bottom).
The columns entitled length ratio show the ratios length(Amain)/length(Aref). The wsl-diff columns
show the differences min{0, wsl(Amain)} − min{0, wsl(Aref)} while the sns-diff columns display
sns(Amain) − sns(Aref). A more detailed description of how the table is arranged can be found in
Section 5.6.1.

[Figure 5.14: Example of an instance for which optimization of the bicriteria algorithm with
ε = 0.5 degrades timing a lot. Here, the greedy algorithm with ξ = 0.5 gives better solutions.
(a) Bicriteria algorithm with ε = 0.5, without optimization: |N| = 249, length 6 616 µm, WSL −173 ps.
(b) Bicriteria algorithm with ε = 0.5, with optimization: |N| = 249, length 4 944 µm, WSL −311 ps.
(c) Greedy algorithm with ξ = 0.5: |N| = 249, length 5 058 µm, WSL −110 ps.]
5.6.4 Comparison between Bicriteria and Greedy
In this section we use an optimized version of the greedy topology algorithm by Bartoschek
et al. [Bar+10] with trade-off parameter ξ(ε) = 1 − ε as reference.
Table 5.3 shows the results. The greedy algorithm needed around 9 minutes to process all
instances. Although this is slower than the bicriteria algorithm by a factor of 5, this is still
a small running time. In practice, large instances are pre-clustered by a fast clustering
algorithm of Maßberg and Vygen [MV08] and with this pre-clustering, the greedy algorithm
is only slightly slower than the bicriteria algorithm.
Overall, the greedy algorithm produces topologies with slightly better worst slacks
but with slightly worse sums of negative slacks. The only exception is the configuration
ε = 0.5 where the greedy algorithm seems to push more on timing. For parameters ε = 0.1
(⇒ ξ = 0.9) and ε = 0 (⇒ ξ = 1) that are most relevant in practice, the average worst
slack degradation of the bicriteria algorithm lies around 2 ps which corresponds to roughly
b/2. In contrast to the greedy algorithm, the bicriteria algorithm has a configuration (i. e.
ε = 0) in which it is guaranteed not to produce timing outliers. When using larger values
of ε such as 0.5 and 1, both algorithms produce solutions that have a worst slack far away
from optimum.
For the smaller choices of ε, the bicriteria algorithm produces shorter topologies. The
total net length reduction lies around 6%. The average net length improvement can be
even higher. While the greedy algorithm is the most commonly used topology generation
algorithm during design phases that aim at maximizing the worst slack, the bicriteria
algorithm appears to be well-suited in our application of timing-constrained global routing
as the significantly shorter topology lengths improve routability a lot.
Instances for which the bicriteria algorithm yields worse results than the greedy
algorithm can be found in Figures 5.13 and 5.14. Here, the initial short topologies
contain detours and are bad w. r. t. timing. The bicriteria algorithm can repair most timing
violations by long connections of the top-level topology (as is done in the non-optimized
version) or can recover most connections of the initial topology at the cost of degraded
timing (as is done in the optimized version). The greedy algorithm does not make use
of a bad initial topology and builds relatively short topologies with good timing.
Instances for which the bicriteria algorithm achieves better results than the greedy algorithm
are shown in Figure 5.15.
[Figure 5.15: Instances for which the bicriteria algorithm with ε = 0.1 behaves better than the
greedy algorithm with ξ = 0.9.
(a) Greedy algorithm with ξ = 0.9: |N| = 17, length 258 µm, WSL −47 ps.
(b) Bicriteria algorithm with ε = 0.1: |N| = 17, length 174 µm, WSL −32 ps.
(c) Greedy algorithm with ξ = 0.9: |N| = 45, length 597 µm, WSL −71 ps.
(d) Bicriteria algorithm with ε = 0.1: |N| = 45, length 529 µm, WSL −61 ps.
(e) Greedy algorithm with ξ = 0.9: |N| = 40, length 644 µm, WSL −40 ps.
(f) Bicriteria algorithm with ε = 0.1: |N| = 40, length 473 µm, WSL −28 ps.]
Chapter 6
On the Way to a Practical
Algorithm: Virtual Buffering
Although we can find an almost optimum solution to the Minimum Cost Buffered Steiner
Tree Problem with a given topology in polynomial time, the practical running time is
still too high. To speed up the repeater tree construction, we consider the Steiner tree
construction problem and the problem of buffering a given Steiner tree separately from
each other.
In this chapter we restrict ourselves to the first of these sub-problems and show how
to compute a Steiner tree which trades off congestion for linear timing. The linear delay
model is a simple delay model that estimates the delay of a buffered Steiner tree without
actually inserting repeaters. In this sense, such a Steiner tree can be considered a virtual
buffering.
As in Chapter 4, we assume that the global routing graph is a directed graph although
all results can equally be applied to the undirected case.
6.1 A Linear Delay Model for Steiner Trees
While computing a buffered Steiner tree as in Chapter 4 was computationally expensive, a
topology with good properties w. r. t. both linear delays and lengths can be computed fast
(Chapter 5). A natural question arising at this point is how fast we can compute a Steiner
tree trading off linear delays and congestion.
Definition 6.1 Let $G$ be a directed graph with delays $\rho : E(G) \cup \{\circ\} \to \mathbb{R}_{\ge 0}$
with $\rho(\circ) = 0$. Let $(A, \kappa)$ be a Steiner tree in $G$ for a net $N$ with source $s$.
For $t \in N \setminus \{s\}$ we define the linear delay between $s$ and $t$ in $(A, \kappa)$ as
$$\mathrm{linear\_delay}_{(A,\kappa)}(s, t) := \sum_{\zeta \in E(A_{[s,t]})} \rho(\kappa(\zeta)).$$
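Definition 6.1 simply sums the delays of the graph edges used on the path from the source to a sink. A minimal sketch, assuming the Steiner tree is given by hypothetical parent pointers in which every node stores its predecessor together with the delay ρ(κ(ζ)) of the connecting edge:

```python
from typing import Dict, Tuple


def linear_delay(parent: Dict[str, Tuple[str, float]], s: str, t: str) -> float:
    """Sum of edge delays on the unique s-t path of a Steiner tree.

    parent[v] = (u, rho) means that the tree edge (u, v) is mapped to a graph
    edge with delay rho (rho = 0 for edges mapped to the symbol 'circle').
    """
    total = 0.0
    node = t
    while node != s:
        node, rho = parent[node]
        total += rho
    return total
```

For a tree with edges s→a (delay 2), a→t1 (delay 1.5) and a→t2 (delay 0), the linear delays of t1 and t2 are 3.5 and 2.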
In the linear delay model, gate delays are not influenced by Steiner trees, i. e. the delay
along a gate is constant for each gate. As for the buffered case, our goal is to compute a
Steiner tree minimizing a weighted sum of edge costs c : E(G)→ R≥0 and delays. More
precisely, we want to solve the following problem:
Minimum Cost Steiner Tree Problem with Linear Delays
Instance: A graph G with edge costs c : E(G)→ R≥0 and
linear edge delays ρ : E(G) ∪ {◦} → R≥0 with ρ(◦) = 0.
A net N ⊆ V (G) with source s and sink delay costs λ : N\{s} → R≥0.
Output: A Steiner tree $(A, \kappa)$ for $N$ in $G$ minimizing
$$\sum_{\zeta \in E(A)} c(\kappa(\zeta)) + \sum_{t \in N \setminus \{s\}} \lambda(t) \cdot \mathrm{linear\_delay}_{(A,\kappa)}(s, t).$$
This is of course the special case of the Minimum Cost Buffered Steiner Tree Problem
for which L = ∅ and the functions F (e, .) are constant for each edge e ∈ E(G). Recall
that Chuzhoy et al. [Chu+05] proved for this problem that no o(log log |N|) approximation
algorithm exists unless every problem in NP can be solved in $O(n^{\log \log \log n})$ time
(where n is the instance size).
6.2 Shortest Paths and Optimum Topology Embeddings
While we have seen that the general Minimum Cost Buffered Steiner Tree Problem is
NP-hard for two-terminal nets, this is obviously not the case here (if P ≠ NP).
By defining the traversal cost of an edge e as c(e) + λ(t) · ρ(e) (where N = {s, t}), we
see that the two-terminal version of the Minimum Cost Steiner Tree Problem with Linear
Delays is just a Standard Shortest Path Problem and hence can be solved optimally in time
O(|E(G)|+ |V (G)| log(|V (G)|)).
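The reduction to a shortest path problem can be sketched as follows. The adjacency-list format and the function name are assumptions for illustration. The sketch uses a binary heap, which gives O(|E(G)| log |V(G)|) rather than the bound stated above; that bound requires a Fibonacci heap.

```python
import heapq
from typing import Dict, List, Tuple

# Outgoing edges of a vertex: (head, cost c(e), delay rho(e)).
Edge = Tuple[str, float, float]


def two_terminal_cost(graph: Dict[str, List[Edge]],
                      s: str, t: str, lam: float) -> float:
    """Optimum value for N = {s, t}: shortest s-t path w.r.t. c(e) + lam * rho(e)."""
    dist = {s: 0.0}
    heap = [(0.0, s)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # outdated heap entry
        if u == t:
            return d
        for v, c, rho in graph.get(u, []):
            nd = d + c + lam * rho
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return float("inf")  # t is not reachable from s
```

Increasing the delay cost λ(t) shifts the optimum from cheap but slow paths to expensive but fast ones.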
The next theorem shows how to embed a given topology into G in an optimum way.
This theorem is just an improved version of the theorems of Section 4.6 and is equal to
Theorem 11 in [Hel+17].
Theorem 6.2 Let G,N, c, ρ, λ be an instance of the Minimum Cost Steiner Tree Problem
with Linear Delays. Let T be a topology for N .
We can compute in O(|N| · (|E(G)| + |V(G)| log(|V(G)|))) time a Steiner tree (A, κ)
for N with underlying topology T such that its cost
$$\sum_{\zeta \in E(A)} c(\kappa(\zeta)) + \sum_{t \in N \setminus \{s\}} \lambda(t) \cdot \mathrm{linear\_delay}_{(A,\kappa)}(s, t)$$
is minimum among all embeddings of T.
Proof The algorithm is similar to the algorithms of the proof of Theorems 4.12 and 4.14.
As in their proofs we process the edges of T in reverse topological ordering. While processing
an edge ( ”v, ”w) ∈ E(T ) we compute labels (u, α(”v,”w)(u))( ”v, ”w) for all u ∈ V (G). These labels
correspond to embeddings of the topology ”v + ( ”v, ”w) + T (”w) with total cost at most
α(”v, ”w)(u) and in which ”v has position u. We also specify λ-values for all Steiner points.
If we embed an edge (”v, t) ∈ E(T ) with t ∈ N\{s}, Dijkstra’s algorithm [Dij59] on
the graph
(
V (G), {(u′, u) : (u, u′) ∈ E(G)}) with cost function cost(e) = c(e) + λ(t) · ρ(e)
On the Way to a Practical Algorithm: Virtual Buffering 103
starting at t ∈ V (G) produces these labels and we obtain corresponding embeddings by
transforming paths (in G) into Steiner paths. As in Theorems 4.12 and 4.14 we have to
extend the otherwise trivial embedding corresponding to (t,−∞) by an edge ζ mapped to
κ(ζ) = ◦ without increasing costs.
If we embed edge (”v, ”w) ∈ E(T ) entering a Steiner node ”w /∈ N\{s}, let δ+T ( ”w) =
{ ”w1, ”w2} and assume that for all u ∈ V (G) we have already computed labels (u, α(”w, ”wi)(u))(”w, ”wi)
as desired (i = 1, 2). We set λ( ”w) := λ(”w1) + λ( ”w2) and run Dijkstra’s algorithm on the
graph arising from G by reversing all edges and adding a new node w with incident edges
(w, u) with ρ((w, u)) = 0 and c((w, u)) = α(”w, ”w1)(u) + α( ”w,”w2)(u) for all u ∈ V (G). Vertex
w serves as start node and we use the cost function cost(e) = c(e) + λ(”w) · ρ(e).
Due to the choice of costs of the edges leaving w, w − u paths found by this algorithm
naturally correspond to embeddings of ”v + ( ”v, ”w) + T (”w) and the algorithm produces
labels as desired.
Let (s, ”w) ∈ E(T ) be the unique edge leaving s in T . We output the embedding
corresponding to (s, α(s, ”w)(s)).
Since the running time of the overall algorithm is dominated by the O(|N |) applications
of Dijkstra’s algorithm, each taking time O(|E(G)|+ |V (G)| log(|V (G)|)), the running time
is clear.
To prove correctness we show that for each (”v, ”w) ∈ E(T ) and u ∈ V (G) there is no
embedding of ”v + ( ”v, ”w) + T (”w) where ”v is positioned at u that has cost smaller than
α(”v,”w)(u).
For ”w ∈ N\{s} this is clear by the correctness of Dijkstra’s algorithm. For ”w ∈
V (T )\N let (A∗, κ∗) be an optimum embedding of ”v + (”v, ”w) + T (”w) with κ∗(”v) = u. If
δ+(”w) = {”w1, ”w2}, (A∗, κ∗) consists of embeddings (Ai, κi) of ”w + (”w, ”wi) + T (”wi) for
i = 1, 2 plus a κ∗(”w)-u path. By induction and construction, the cost of the edges
between w and u in the modified graph created for the application of Dijkstra’s algorithm
during processing of (”v, ”w) ∈ E(T ) does not exceed the sum of costs of (A1, κ1) and
(A2, κ2). By correctness of Dijkstra’s algorithm, α(”v,”w)(u) does not exceed the cost of
(A∗, κ∗). 
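The dynamic program of the proof can be illustrated by a small Python sketch. It is not the implementation used in this thesis: it assumes a binary topology given as a dictionary `children`, a graph given as a list of directed edges `(u, v, c, rho)`, and it returns only the optimum cost (the label at the source) instead of the embedding itself. All function and parameter names are hypothetical. The artificial start node w of the proof is modelled by initial keys rather than by an explicit vertex.

```python
import heapq

def dijkstra(radj, init, lam):
    """Multi-source Dijkstra on the reversed graph.

    radj[u] lists (p, c, rho) for every original edge (p, u); init maps
    start vertices to initial keys; an edge costs c + lam * rho.
    """
    dist = dict(init)
    heap = [(key, u) for u, key in init.items()]
    heapq.heapify(heap)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # outdated heap entry
        for p, c, rho in radj.get(u, ()):
            nd = d + c + lam * rho
            if nd < dist.get(p, float("inf")):
                dist[p] = nd
                heapq.heappush(heap, (nd, p))
    return dist

def embed_cost(edges, children, root_child, source, sink_pos, lam):
    """Cost of a cheapest embedding of the topology (labels only, no paths).

    edges: directed edges (u, v, c, rho) of G; children: Steiner node -> (w1, w2);
    sink_pos: sink node -> vertex of G; lam: sink node -> criticality;
    root_child: the unique successor of the source in the topology.
    """
    radj = {}
    for u, v, c, rho in edges:
        radj.setdefault(v, []).append((u, c, rho))
    vertices = {x for u, v, _, _ in edges for x in (u, v)}

    def alpha(w):
        # Returns (labels for the topology edge entering w, lambda-value of w).
        if w in sink_pos:
            return dijkstra(radj, {sink_pos[w]: 0.0}, lam[w]), lam[w]
        (a1, l1), (a2, l2) = (alpha(x) for x in children[w])
        # Edges leaving the artificial node w become initial keys a1[u] + a2[u].
        init = {u: a1[u] + a2[u] for u in vertices if u in a1 and u in a2}
        return dijkstra(radj, init, l1 + l2), l1 + l2

    return alpha(root_child)[0].get(source, float("inf"))
```

Each topology edge triggers exactly one call of `dijkstra`, which mirrors the O(|N|) applications of Dijkstra's algorithm counted in the running time analysis.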
Similar to Chapter 4 we have now developed an optimum algorithm for the special
case of the Minimum Cost Steiner Tree Problem with Linear Delays where |N | is constant.
For larger sets of sinks a solution for the Shallow-Light Topology Problem with Criticalities
(see Theorem 5.9) serves as a good starting topology.
6.3 Speed-up Techniques for Practical Instances
Although the running time of the algorithm contained in the proof of Theorem 6.2 is
already smaller than that of its buffered extension (Theorem 4.12), it is still not fast enough for
practical application.
Recall that in many applications, G is a 3-dimensional grid graph (see Section 2.4.3)
and that the placement step, which usually precedes the global routing step in any
physical design flow, tries to place the circuits such that connected pins are not far away
from each other. As a result, the pins of most nets are local in the sense that optimum
Steiner trees for them are contained inside small sub-grids of G. Traversing the whole
graph during each of the several million path searches would be far too time-consuming.
6.3.1 Reducing Running Time by Limiting Search Areas
To achieve a fast running time of the overall algorithm it is necessary to reduce the number
of potential Steiner points, i. e. the number of vertices u ∈ V (G) for which we create a
label (u, α(”v, ”w)(u))(”v, ”w) during embedding of a topology edge (”v, ”w).
For positions that are far away from the bounding box of a net's terminals it is usually easy
to see that placing a Steiner point there cannot lead to an optimum solution. The
cost of a topology embedding that is easy to compute (e. g. an embedding where all Steiner
points are placed at the source’s position) or an embedding that we have computed in an
earlier phase of the resource sharing algorithm (Chapter 3) might already be smaller than
the length of a shortest path to that position. When running Dijkstra’s algorithm we do
not have to wait until labels at these distant positions are created.
Simultaneous embedding of sibling paths. Instead of excluding Steiner point can-
didates explicitly, we use a dynamic approach to determine when we can stop a Dijkstra
path search. Assume that we are given a topology T for a net N . Let ”v ∈ V (T ) and
assume that we have already computed labels corresponding to embeddings of the sub-
topologies rooted at the successors of ”v (see the proof of Theorem 6.2). If ”v is the source
of N , we can certainly stop the label generation as soon as we have labeled the source’s
position permanently. Assume that ”v is a Steiner point with outgoing edges (”v, ”w1) and
(”v, ”w2) ∈ E(T ). Instead of embedding these edges one after the other, we run both path
searches used for the embeddings simultaneously. This approach is similar to the widely
used bidirectional Dijkstra search (see [Nic66]).
Each point u that is visited by both path searches is a possible location for ”v. The
combined cost of the embeddings of ”v + (”v, ”w1) + T (”w1) and ”v + (”v, ”w2) + T (”w2) with
κ( ”v) = u is equal to the sum of keys of the two labels at u. Since the keys of the labels
selected in each iteration of Dijkstra’s algorithm are monotonically increasing, the cost
of these embeddings will be larger for Steiner point candidates found in later iterations.
After the first node in the global routing graph has been labeled permanently by both
path searches, we compute a bound on the cost of the embeddings of ”v + (”v, ”w1) + T (”w1)
and ”v + (”v, ”w2) + T (”w2) and skip all labels whose key exceeds this bound.
In the case that we embed a placed topology, the costs of labels at the first node u that is
permanently labeled by both path searches plus an estimate on the cost (with respect to
the cost function c(.)+λ(”v) ·ρ(.)) of the path between u and the position of the predecessor
of ”v in T can be used as such a bound. In the next section we show how to compute an
estimate on path costs by a landmark based A∗ approach.
Bounded embedding tolerances. In practice, Steiner point positions computed by the
optimized version of the bicriteria algorithm of Chapter 5 are very good unless the design
is highly congested. Geometric information obtained by the positions of the endpoints of
edges in the placed topology can serve as a guideline during the path searches.
In the case that the global routing graph is a 3-dimensional grid graph as described in
Section 2.4.3, we can use placement information of the initial topology to impose a limited
embedding tolerance on topology edges. Let i ∈ {1, 2} and let BB be the bounding box of
• all nodes in the global routing graph where we have an initial label for the path
search used to embed (”v, ”wi), and
• the position of ”v in the initial placed topology.
Figure 6.1: Visualization of embedding tolerance (axes: x, y, layer). In this example we wish to find a path from
one of the blue vertices to the green vertex. The bounding box of these pins (blue box) is extended
in each direction (yellow box). During the path search we forbid all vertices within the red boxes.
Let tol > 0 be a parameter. The value of tol serves as a trade-off between quality and
running time and can be chosen depending on the timing criticality of (”v, ”wi) in T
and on the aspect ratio of BB. We extend BB by tol in each direction and restrict the
path search to nodes within this extended bounding box. An example of such a restricted
routing area can be found in Figure 6.1. In the picture we choose tol = 1 and restrict the
path search to the nodes within the blue and yellow box.
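The restriction can be sketched in a few lines of Python. The sketch assumes that grid nodes are integer triples (x, y, layer) and that BB is extended by tol in all three coordinates; the function names are hypothetical and only illustrate the filtering step, not the path search itself.

```python
def extended_bounding_box(nodes, tol):
    """Bounding box of the given (x, y, layer) nodes, extended by tol per direction."""
    lo = tuple(min(n[i] for n in nodes) - tol for i in range(3))
    hi = tuple(max(n[i] for n in nodes) + tol for i in range(3))
    return lo, hi

def inside(node, box):
    """True if the node may be labeled during the restricted path search."""
    lo, hi = box
    return all(lo[i] <= node[i] <= hi[i] for i in range(3))
```

During a path search, a neighbor would only be pushed to the heap if `inside(neighbor, box)` holds, so a larger tol trades running time for solution quality.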
6.3.2 Future Costs
In the case that the initial topology is a placed topology, which is always the case if we
are embedding two-terminal nets, path searches have a target. By exchanging the original
cost function with its reduced costs we can decrease the number of labels produced during
target directed path searches (see [GH05]).
Recall that the cost functions we use during Dijkstra’s algorithm are of the form
cost((v, w)) = c((v, w)) + λ · ρ((v, w)).
Definition 6.3 For functions π_c, π_ρ : V (G) → R≥0 such that for all (v, w) ∈ E(G),
• c((v, w)) + π_c(w) − π_c(v) ≥ 0 and
• ρ((v, w)) + π_ρ(w) − π_ρ(v) ≥ 0,
we call the function V (G) → R≥0, v ↦ π_c(v) + λ · π_ρ(v) a feasible potential and the edge
cost function
\[ \mathrm{cost}_{\pi_c+\lambda\pi_\rho}((v, w)) = \mathrm{cost}((v, w)) + \pi_c(w) - \pi_c(v) + \lambda \cdot (\pi_\rho(w) - \pi_\rho(v)) \]
reduced costs. We also call π_c respectively π_ρ a feasible potential or potential
function if it satisfies the above inequalities.
Note that for s, t ∈ V (G) and any s-t path P in G,
\[ \mathrm{cost}(P) = \mathrm{cost}_{\pi_c+\lambda\pi_\rho}(P) + (\pi_c(s) + \lambda\pi_\rho(s)) - (\pi_c(t) + \lambda\pi_\rho(t)) \]
and hence, a shortest path with respect to the original cost function is also a shortest path
with respect to reduced costs.
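The identity follows from a telescoping sum. As a short worked derivation, write π for the feasible potential (the sum of the c-potential and λ times the ρ-potential) and let P consist of the edges (v_{i−1}, v_i) for i = 1, …, k with v_0 = s and v_k = t. The symbols π, v_i and k are introduced here only for illustration.

```latex
\begin{align*}
\mathrm{cost}_{\pi}(P)
  &= \sum_{i=1}^{k} \bigl( \mathrm{cost}((v_{i-1}, v_i)) + \pi(v_i) - \pi(v_{i-1}) \bigr) \\
  &= \mathrm{cost}(P) + \pi(v_k) - \pi(v_0) \\
  &= \mathrm{cost}(P) + \pi(t) - \pi(s).
\end{align*}
```

The correction term π(t) − π(s) does not depend on P, so the costs of all s-t paths are shifted by the same amount and their order is preserved.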
The definition of π_c and π_ρ, as well as the running time that is necessary to compute
these values, heavily depend on the structure of the global routing graph. We now define
π_c and π_ρ in the case that the vertices of the global routing graph G can be written as
M × {1, . . . , Z} for a finite metric space (M, dist)
and a finite number Z ∈ N of routing layers (see Section 2.4.3). In this case we can define
the geometric distance ||v, w|| between two vertices v, w ∈ V (G) as the distance with
respect to dist of the projections of v and w onto M.
Different wire codes that influence delays and space consumption of wires can be
modeled easily by inserting further routing layers. Henceforth, we will omit mentioning
wire codes explicitly.
For the functions π_ρ and π_c that we define now, π_c(v) + λπ_ρ(v) is a lower bound on the
cost of a shortest path between a node v ∈ V (G) and the target t of the path search.
Definition of π_ρ by geometric lower bounds. For an edge e ∈ E(G) connecting two
nodes on the same routing layer, the estimated delay ρ(e) usually depends on the geometric
length of e and the layer only (resp. the combination of layer and wire code). If e ∈ E(G)
connects different layers, ρ(e) depends on the layers only.
For z ∈ {1, . . . , Z} we are usually given a delay per length value ρwire(z) ∈ R≥0 for z
while for a pair z1, z2 ∈ {1, . . . , Z} of adjacent wiring planes (i. e. |z1 − z2| = 1) we are
given a via delay ρvia(z1, z2) ∈ R≥0 such that ρvia(z1, z2) = ρvia(z2, z1). For v ∈ V (G) we
denote the layer on which v is located by layer(v). Using this notation we write
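\[ \rho((v, w)) = \begin{cases} \|v, w\| \cdot \rho_{\mathrm{wire}}(\mathrm{layer}(v)) & \text{if } \mathrm{layer}(v) = \mathrm{layer}(w), \\ \rho_{\mathrm{via}}(\mathrm{layer}(v), \mathrm{layer}(w)) & \text{otherwise.} \end{cases} \]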
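Let t ∈ V (G) be the target of the path search. For v ∈ V (G) we define
\[ \pi_\rho(v) := \min_{z \in \{1,\dots,Z\}} \left\{ \|v, t\| \cdot \rho_{\mathrm{wire}}(z) + \sum_{z'=\min\{\mathrm{layer}(t),z\}}^{\max\{\mathrm{layer}(t),z\}-1} \rho_{\mathrm{via}}(z', z'+1) + \sum_{z'=\min\{\mathrm{layer}(v),z\}}^{\max\{\mathrm{layer}(v),z\}-1} \rho_{\mathrm{via}}(z', z'+1) \right\}. \]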
This can indeed be used to obtain a feasible potential as the next lemma shows:
Lemma 6.4 For any edge (v, w) ∈ E(G) it holds that ρ((v, w)) + π_ρ(w) − π_ρ(v) ≥ 0.
Proof Let H be the graph with vertex set V (H) = M × {1, . . . , Z} and edge set
E(H) = {(v, w) : layer(v) = layer(w) or (|layer(v) − layer(w)| = 1 and ||v, w|| = 0)}.
For v ∈ V (H) the length of a shortest v-t path in H with respect to ρ-costs is equal to
π_ρ(v) and hence, π_ρ(w) + ρ(e) ≥ π_ρ(v) for all e = (v, w) ∈ E(H).
Now, the lemma follows from the fact that G is a subgraph of H.
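The geometric lower bound and the inequality of Lemma 6.4 can be checked on a small grid with the following Python sketch. It assumes nodes of the form (x, y, layer), the ℓ1 metric on M, a dictionary `rho_wire` mapping each layer to its delay per length, and a dictionary `rho_via` mapping z to the via delay between layers z and z + 1. These data structures and function names are hypothetical.

```python
def via_delay_sum(a, b, rho_via):
    """Sum of via delays on the way between layers a and b."""
    return sum(rho_via[z] for z in range(min(a, b), max(a, b)))

def pi_rho(v, t, rho_wire, rho_via):
    """Geometric lower bound on the delay of a v-t path."""
    dist = abs(v[0] - t[0]) + abs(v[1] - t[1])  # assumes the l1 metric on M
    return min(
        dist * rho_wire[z]
        + via_delay_sum(t[2], z, rho_via)
        + via_delay_sum(v[2], z, rho_via)
        for z in rho_wire
    )

def rho(v, w, rho_wire, rho_via):
    """Delay of a wire edge (same layer) or a via edge (adjacent layers)."""
    if v[2] == w[2]:
        return (abs(v[0] - w[0]) + abs(v[1] - w[1])) * rho_wire[v[2]]
    return rho_via[min(v[2], w[2])]
```

Feasibility means that `rho(v, w, ...) + pi_rho(w, t, ...) - pi_rho(v, t, ...)` is nonnegative for every edge (v, w), which can be verified by enumerating all edges of a small test grid.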
(a) Large parts of the shortest s-t and s-v̌ paths coincide. The reduced costs of edge (v, w) are small.
(b) The shortest s-t and s-v̌ paths are completely different. The reduced costs are equal to the original costs.
Figure 6.2: Idea of the landmark-based approach by Goldberg and Harrelson [GH05], who choose
the potential function max{0, dist_(G,c)(v, v̌) − dist_(G,c)(t, v̌)}. Figure 6.2(a) is taken from [GH05]
(Figure 2).
Henke [Hen16] defined several more advanced future cost functions on weighted grid
graphs. On instances that contain many blockages that extend to all available routing
layers it would be more accurate to use one of those.
Definition of π_c by landmark-based future costs. As the c-cost of an edge usually
does not depend on the length of the edge only, defining good future costs for the cost function c
is harder. We use the landmark-based A∗ approach by Goldberg and Harrelson [GH05] to
define π_c. This strategy has already been used by Müller [Mül09].
We start by explaining the high-level idea of their approach. Assume that we want to
find a shortest s-t path and we already know shortest paths from all vertices in G to a
particular node v̌ ∈ V (G). The vertex v̌ is called a landmark. If we are lucky, large parts of
the shortest s-v̌ and the shortest s-t path coincide and we can use our knowledge of the
s-v̌ path to achieve that reduced edge costs on the s-t path are small.
Figure 6.2, which is inspired by Figure 2 from [GH05], depicts this idea. For v ∈ V (G) let
dist_(G,c)(v, v̌) be the c-cost of a shortest v-v̌ path in G. Since the edge (v, w) in Figure 6.2(a)
lies on a shortest v-v̌ path, dist_(G,c)(v, v̌) − dist_(G,c)(w, v̌) = c((v, w)). In other words, the
reduced cost of (v, w) with respect to the feasible potential v ↦ dist_(G,c)(v, v̌) is zero and
we are encouraged to use that edge. Note that using the potential v ↦ dist_(G,c)(v, v̌) is
identical to using the potential v ↦ dist_(G,c)(v, v̌) − dist_(G,c)(t, v̌) because dist_(G,c)(t, v̌) is
constant.
In the case that shortest paths between s and v̌ and between t and v̌ are different
(Figure 6.2(b)), it is not a good idea to use the potential defined before. The reduced edge
cost of (v, w) would be even larger than the original costs and we would be discouraged to
use (v, w). Instead, we would prefer to use the trivial potential v ↦ 0 in that case.
Of course we are not able to find out sufficiently fast whether an edge (v, w) lies in the
intersection of a shortest v-v̌ path with a shortest t-v̌ path. By checking whether
dist_(G,c)(v, v̌) − dist_(G,c)(t, v̌) is positive or not, and by using the trivial potential function
in case of negativity, we can at least distinguish between the two extreme cases depicted in
Figure 6.2. By Lemma 2.2 of [GH05], v ↦ max{0, dist_(G,c)(v, v̌) − dist_(G,c)(t, v̌)} is indeed
a potential function and if V̌ ⊆ V (G) is a set of landmarks,
\[ \pi_c(v) := \max_{\check v \in \check V} \left\{ \max\{0, \mathrm{dist}_{(G,c)}(v, \check v) - \mathrm{dist}_{(G,c)}(t, \check v)\} \right\} \]
is again a potential function.
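A minimal Python sketch of this landmark potential is given below. It assumes a graph given as a list of directed edges `(u, v, c)` in which every vertex can reach every landmark; the function names are hypothetical. Distances to a landmark are obtained by one Dijkstra run on the reversed graph.

```python
import heapq

def distances_to(landmark, edges):
    """c-cost of a shortest path from every vertex to the landmark."""
    radj = {}
    for u, v, c in edges:
        radj.setdefault(v, []).append((u, c))
    dist = {landmark: 0.0}
    heap = [(0.0, landmark)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # outdated heap entry
        for p, c in radj.get(u, ()):
            if d + c < dist.get(p, float("inf")):
                dist[p] = d + c
                heapq.heappush(heap, (d + c, p))
    return dist

def landmark_potential(v, t, landmark_dists):
    """Maximum over all landmarks of max{0, dist(v, landmark) - dist(t, landmark)}."""
    return max(max(0.0, d[v] - d[t]) for d in landmark_dists)
```

On a path with landmarks at both ends, the potential of a vertex equals its exact distance to the target whenever the target lies between the vertex and one of the landmarks, which is the favorable case of Figure 6.2(a).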
(a) Path search without future cost.
(b) Path search with future cost.
Figure 6.3: Visualization of the number of vertices labeled during the path search from t to s.
Pictures are shown in 2 dimensions although the path searches were performed in a 3-dimensional
grid graph. Each rectangle represents 12 nodes. The darker the color of a rectangle, the more of
the represented vertices were visited. If none of these vertices were visited, the rectangle is white.
The large gray boxes are blockages.
To speed up Dijkstra’s algorithm it is important to choose sets V̌ such that for many
path searches there exists a landmark for which the target lies “between” the source and the
landmark as in Figure 6.2(a). Goldberg and Harrelson [GH05] show how to find suitable
landmark sets in general graphs and for geometric graphs such as road graphs. If G is a
3-dimensional grid graph as in Section 2.4.3, it is often a good idea to select the corners of
the chip area on both the lowest and the highest routing layer.
The major drawback of the landmark approach of Goldberg and Harrelson [GH05] is of
course the fact that we need to pre-compute the distances to all landmarks. Since we make
millions, or even billions, of path searches in the same graph during a whole timing-constrained
global routing, this effort pays off. According to the price update strategy
of the resource sharing algorithm by Müller, Radke, and Vygen [MRV11] (Algorithm 1),
prices never decrease and hence, a feasible potential remains feasible during the whole
algorithm.
At some points of the resource sharing algorithm the previously computed distances
to the landmarks will only yield poor lower bounds for the new prices and we need to
re-compute the distances. To detect these situations during the algorithm we compare the
actual cost of all computed shortest paths with the lower bound provided by the feasible
potential π_c(v) + λπ_ρ(v). Whenever the ratio between the actual cost and the estimated
cost has become too large for a sufficiently large number of path searches, we update
landmark distances. To avoid that the number of landmark re-computations becomes
too large, we strictly forbid re-computations if the number of path searches since the last
re-computation is too small.
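One way to organize such a re-computation rule is sketched below in Python. The class name, the three thresholds and the exact counting scheme are assumptions made for illustration; the text above does not fix concrete values or data structures.

```python
class LandmarkRefreshPolicy:
    """Decides when landmark distances should be re-computed (illustrative only)."""

    def __init__(self, max_ratio, min_bad, min_searches):
        self.max_ratio = max_ratio        # tolerated ratio actual / estimated cost
        self.min_bad = min_bad            # poor estimates needed before a refresh
        self.min_searches = min_searches  # searches required between two refreshes
        self.searches = 0
        self.bad = 0

    def record(self, actual_cost, estimated_cost):
        """Registers one path search and returns True if a refresh is due."""
        self.searches += 1
        if actual_cost > self.max_ratio * estimated_cost:
            self.bad += 1
        if self.bad >= self.min_bad and self.searches >= self.min_searches:
            self.searches = self.bad = 0  # counters restart after a refresh
            return True
        return False
```

The second condition in `record` corresponds to forbidding re-computations while too few path searches have been performed since the last one.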
Good future costs have an enormous impact on the number of permanently labeled
vertices as Figure 6.3 shows.
(a) Steiner tree with small linear delay that does not admit a good buffering. The capacitance at
x is too large in each buffering solution.
(b) Avoiding routing over the blockage can improve the results after buffering although linear
delays increase.
Figure 6.4: While computing Steiner trees with minimum linear delays as input to buffering we
have to take blockages on the placement layer into account. In the picture we assume that the
gray area is an area without placement space. We are not allowed to insert repeaters there but we
are allowed to put wires on top of it.
6.4 Reach-Aware
Recall that the main goal of this thesis is the computation of buffered Steiner trees
minimizing the sum of costs for placement congestion, routing congestion, and timing.
With respect to routing congestion, the output of the algorithm of Theorem 6.2 is
already well-suited as a starting point for the actual buffer insertion (see Chapter 7). To
achieve a large correlation between linear delays (Definition 6.1) and the delays after
buffering we must take placement congestion into account already during Steiner tree
computation. While Steiner trees with small linear delays usually have small delays after
they are buffered in a timing-optimum way, we will probably get poor results if limited
placement space prevents us from inserting repeaters at optimum positions. Especially on
chips that have large areas without placement space (blockages) this is a severe problem.
Figure 6.4(a) shows an example of a Steiner tree with small linear delays. Due to the
large blockage in the middle of the picture we cannot prevent a large electrical capacitance
at point x and hence, we cannot achieve a good timing after buffering. To avoid such a
situation we have to make sure that connected components over placement blockages are
not too large. Instead of completely forbidding routing space over these blockages (which
is too restrictive in nearly all cases), our goal is to compute a so-called reach-aware Steiner
tree that we define now for rectilinear Steiner trees.
In the following we denote by π : R³ → R² the canonical projection (x, y, z) ↦ (x, y)
and by layer : R³ → R the function (x, y, z) ↦ z.
Definition 6.5 (Reach-aware rectilinear Steiner tree) Let B be a finite set of axis-
parallel rectangles (blockages) and let (A, κ) be a 3-dimensional rectilinear Steiner tree
(Definition 2.2). Let Γz ∈ R≥0 for all z ∈ layer(κ(V (A))) (reach lengths).
We may assume that for all (ν, ω) ∈ E(A), all inner points of the (possibly degenerate)
line segment between π(κ(ν)) and π(κ(ω)) either lie completely inside or completely outside the
union ⋃_{B∈B} B of all blockages. In the first case we call the edge blocked.
We say that (A, κ) is reach-aware if for all connected subgraphs A′ of A consisting of
blocked edges only,
\[ \sum_{(\nu,\omega) \in E(A')} \frac{\|\pi(\kappa(\nu)) - \pi(\kappa(\omega))\|_1}{\Gamma_{\mathrm{layer}(\kappa(\nu))}} \le 1. \tag{6.1} \]
Note that we can check in linear time if a given Steiner tree is reach-aware.
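A possible linear-time check is sketched below in Python. Since all summands in (6.1) are nonnegative, it suffices to test the maximal connected components of blocked edges. The sketch assumes that each tree edge is given as a tuple `(nu, omega, l1_length, layer, blocked)` and that all reach lengths are strictly positive; this input format is hypothetical.

```python
from collections import defaultdict

def is_reach_aware(edges, reach):
    """edges: (nu, omega, l1_length, layer, blocked); reach: layer -> reach length > 0."""
    adj = defaultdict(list)
    for nu, omega, length, layer, blocked in edges:
        if blocked:
            weight = length / reach[layer]
            adj[nu].append((omega, weight))
            adj[omega].append((nu, weight))
    seen = set()
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, total = [start], 0.0
        while stack:  # depth-first search through one component of blocked edges
            x = stack.pop()
            for y, weight in adj[x]:
                total += weight  # every edge is counted from both endpoints
                if y not in seen:
                    seen.add(y)
                    stack.append(y)
        if total / 2 > 1:
            return False
    return True
```

Every node and every blocked edge is handled a constant number of times, so the running time is linear in the size of the tree.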
6.4.1 Reach-Aware 2-Dimensional Steiner Trees
In the case that the layers on all edges are identical, the previous definition says that a
rectilinear Steiner tree is reach-aware if and only if the length of each connected component
over a blockage does not exceed a certain threshold.
The problem of computing such a Steiner tree with minimum total geometric length is
called the Length-Restricted Rectilinear Steiner Tree Problem.
Müller-Hannemann and Peyer [MP03] introduced that problem and gave a 2-approximation
algorithm with running time O((|N| + |B|)² log(|N| + |B|)). Under the mild assumption
that each component of the blocked area has a constant number of corners, Held and
Spirkl [HS14] achieved the same approximation guarantee in almost linear running time
O((|N| + |B|) log²(|N| + |B|)).
If the reach length is 0, we obtain the well-studied Obstacle Avoiding Rectilinear Steiner
Tree Problem for which numerous fast 2-approximation algorithms exist, see e. g. [Fen+06],
[Lin+08], [LZM08], and [Liu+09].
Bihler [Bih15] extended the algorithm by Held and Spirkl [HS14]. In the case that
each component of the blocked area has a constant number of corners, he provides a
2-approximation algorithm with running time O((|N| + |B|) log²(|N| + |B|)) that can
simultaneously handle
• blockages over which routing is allowed if reach length constraints are obeyed,
• blockages that completely block the routing space (i. e. reach length 0),
• blockages that block the routing space in horizontal direction but allow length-
restricted routing in vertical direction, and
• blockages that block the routing space in vertical direction but allow length-restricted
routing in horizontal direction.
The Length-Restricted Rectilinear Steiner Tree Problem is certainly far away from the
problem of computing a 3-dimensional rectilinear reach-aware Steiner tree minimizing
the objective function of the Minimum Cost Steiner Tree Problem with Linear Delays. First,
it does not take into account different reach lengths on different layers and second, it
completely ignores delay constraints.
However, we can still use the algorithm of the previously mentioned authors to compute
an initial short topology Tinit for the bicriteria algorithm (see Section 5.3). The resulting
topology will be a better starting point for the algorithm of Theorem 6.2.
6.4.2 Reach-Awareness by Restricting the Routing Area
In addition to a better choice of starting topology we enforce reach-awareness by restricting
the global routing graph. This simple method has already been used for a long time in
BonnRouteGlobal and BonnRoute [Ges+13], [Ahr+15], [Hel+15]. We could improve
running time and flexibility by a new implementation. We explain the approach in the case
that the global routing graph is the 3-dimensional grid graph introduced in Section 2.4.3.
(a) Initial Steiner tree that is not reach-aware. The red subgraph is too large.
(b) Remainder of the initial tree after removing the component over the blockage. The red edges are
forbidden in G′. Horizontal edges are forbidden in the indicated direction only.
(c) Final reach-aware Steiner tree consisting of the remainder of the initial tree and a tree for
{s′, t3, t4}.
Figure 6.5: Visualization of the algorithm that computes reach-aware Steiner trees by forbidding
edges described in Section 6.4.2. We assume that traversing 4 or more edges over blockages results
in a reach violation.
In a first round, we compute a Steiner tree (A, κ) without considering reach-awareness
at all (see Figure 6.5(a)). For practical VLSI instances the fraction of reach-aware Steiner
trees that we build without explicitly trying to is typically large. For the trees which
indeed violate reach constraints we rip-out connected components over blockages that
violate Inequality (6.1) in Definition 6.5. In Figure 6.5(a) this is exactly the component
consisting of the red edges. To gain more flexibility in the subsequent embedding step we
successively remove antennas, i. e. Steiner nodes with degree 1 and edges incident to them.
These paths are colored orange in Figure 6.5(a). The set of erased edges can be written as
a collection (A1, κ1), . . . , (Ak, κk) of Steiner trees. For i = 1, . . . , k the set
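\[ N_i := \{\nu : \nu \text{ is the root of } A_i\} \cup \{\kappa(\nu) : \nu \text{ is a leaf of } A_i\} \subset V(G) \]
together with criticalities
\[ \lambda(\kappa(\nu)) = \sum_{\substack{t \in N \setminus \{s\} \text{ reachable} \\ \text{from } \nu \text{ in } A}} \lambda(t) \]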
for leaves ν of Ai form an instance of the Minimum Cost Steiner Tree Problem with Linear
Delays. In Figure 6.5(b), there is one new instance that needs to be reconnected. This
instance has source s′ and sinks t3 and t4.
We solve the resulting instances in a subgraph G′ of G with the property that all
paths in G′ satisfy Definition 6.5. Together with the restriction that we do not allow
to create Steiner points over blockages we end up with a reach-aware Steiner tree after
combining the remainder of (A, κ) with the computed reach-aware Steiner trees for the
sub-instances Ni. Note that Steiner points over blockages can easily be forbidden by
removing labels (u, α(”v, ”w)(u))( ”v, ”w) for which the projection of u to R2 lies within such
a blocked area after the path search we perform to embed an edge ( ”v, ”w) of the initial
topology (see Theorem 6.2).
To build G′ we gradually remove certain edges from G. First, we remove all via edges
over blockages. Whenever we use an edge e over a blockage we have to follow the whole
straight path over the blockage until we reach a node that is not located over a blockage.
The reason is that by construction of G we can only switch between horizontal and vertical
direction by traversing a via edge. If this straight path violates Inequality (6.1), we delete
blocked edge e. The resulting graph G′ can be pre-computed globally. In Figure 6.5(b) the
edges in E(G)\E(G′) are printed in red. The four red horizontal wires are forbidden in
one direction only. In the picture we assume that traversing four consecutive edges over
blockages results in a violation of reach length. When we enter the horizontal edge at point
v, we have to follow the direct horizontal path to point w. As this path consists of 4 edges
we have to forbid the edge going from v to the right.
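The directed edge removal can be illustrated for a single row of the grid. The following Python sketch is a simplified reading of the rule above, not the implementation in BonnRouteGlobal: it assumes unit-length edges, a list `blocked` whose entry i states whether the edge between nodes i and i + 1 lies over a blockage, and a hypothetical parameter `max_edges` giving the largest number of consecutive blocked edges that still respects the reach length.

```python
def forbidden_directed_edges(blocked, max_edges):
    """Returns the directed edges (tail, head) of one grid row that are removed in G'."""
    n = len(blocked)
    forbidden = set()
    run = 0
    for i in range(n - 1, -1, -1):  # blocked edges still ahead when travelling right
        run = run + 1 if blocked[i] else 0
        if run > max_edges:
            forbidden.add((i, i + 1))
    run = 0
    for i in range(n):              # blocked edges still ahead when travelling left
        run = run + 1 if blocked[i] else 0
        if run > max_edges:
            forbidden.add((i + 1, i))
    return forbidden
```

For a blockage spanning four consecutive edges and `max_edges = 3`, only the edge entering the blockage is forbidden in each direction, which matches the one-directional red edges described for Figure 6.5(b).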
We finally output the tree consisting of the remainder of the initial tree and the
computed Steiner trees for the sub-instances Ni. An example of the final tree is depicted
in Figure 6.5(c). The Steiner tree for the sub-instance {s′, t3, t4} is printed in brown.
In the rare case that a pin (i. e. an element of a net) is inaccessible in G′, no reach-aware
Steiner tree without vias over blockages exists. In this case, we do not perform the path
search for embedding the unique edge incident to that pin in G′, but in our original graph
G. To avoid unnecessarily long paths over blockages also in this case, we increase the cost
of edges in E(G)\E(G′) by a large value (e. g. by |V (G)| ·max{c(e) : e ∈ E(G)}).
In this case we indeed cannot guarantee to find a reach-aware path, even if such a
path exists. To avoid such an undesired behavior it is possible to run the path search
introduced in Theorem 4.10. As cost function c we choose the cost function of the Dijkstra
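path search with linear delays (which is of the form c(e) + λ · ρ(e)). As the function ∆ we
use
\[ \Delta((v, w), x) = x + \frac{\|\pi(v) - \pi(w)\|_1}{\Gamma_{\mathrm{layer}(v)}} \]
and F can be defined as
\[ F(e, x) = \begin{cases} 0 & \text{if } x \le 1, \\ \infty & \text{otherwise.} \end{cases} \]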
By Definition 6.5 and by Theorem 4.10, we will obtain a reach-aware path if such a path
exists. The path itself will be nearly optimum. This improved approach can certainly
be used as a replacement for all path searches that we would perform in G′ otherwise.
However, from a running time point of view the standard path search in G′ is preferable.
6.5 Experimental Results
We implemented the timing-constrained global routing approach described in Chapter 3 as
extension to the resource sharing based 3D global router BonnRouteGlobal [Ges+13]
that is part of the BonnTools suite developed by the Research Institute for Discrete
Mathematics and is used inside the IBM design environment. As a delay model we used
the linear delay model introduced in Section 6.1. The topology embedding algorithm
of Theorem 6.2 together with all speed-up techniques and improvements described in
this chapter serves as block solver for the net customers. In this section we refer to this
algorithm as TCGRLin (Timing Constrained Global Routing with Linear delays).
We ran this algorithm on 11 microprocessor units in 14 nm and 22 nm technology that
were provided by IBM. All netlists are unbuffered and do not contain layer and wire code
assignments.
For all runs we used 16 threads on a machine with a 2.20 GHz Intel Xeon E5-2699
processor. The experiments we present here coincide with the experiments for the linear
delay model presented in [Hel+17] although the testbed is different.
Linear delay parameters. As global routing graph for all these designs we used the
standard global routing graph of Section 2.4.3. In this setting, linear delays ρ((v, w)) along
edges (v, w) with wire code wc can be written as
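\[ \rho((v, w)) = \begin{cases} \|v, w\|_1 \cdot \rho_{\mathrm{wire}}(\mathrm{layer}(v), wc) & \text{if } \mathrm{layer}(v) = \mathrm{layer}(w) \\ \rho_{\mathrm{via}}(\mathrm{layer}(v), \mathrm{layer}(w), wc) & \text{otherwise} \end{cases} \]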
for parameters ρwire and ρvia for each layer / wire code pair (cf. Section 6.3.2).
To compute these constants we use a library preprocessing by Bartoschek et al. [Bar+09].
The wire delay ρwire of a wire on a given plane and with a given wire code is equal to the
delay per length in a long repeater chain. For more details we refer to the PhD thesis of
Bartoschek [Bar14] or to Section 7.3.5. As via delay ρvia we compute the delay through a
via, assuming a default output capacitance and input slew.
The underlying input topologies for Theorem 6.2 are computed with the optimized
variant of the bicriteria algorithm of Chapter 5 with bifurcation penalty b = 0. The reason
for choosing b = 0 is not that the embedding algorithm cannot handle the case b > 0
but rather that there is no reference algorithm that is able to deal with bifurcation delay
penalties.
Clustering. Note that the bicriteria algorithm requires uniform delays per length, although
the parameters ρwire differ considerably between the routing layers and wire codes. For
example, on the largest unit U11 the wire delay is 0.5 ps per micrometer on the lowest layer
and 0.09 ps per micrometer on the highest layer. For that reason we start by clustering
nearby sinks using the clustering algorithm of Maßberg and Vygen [MV08]. Each set of
clustered sinks is connected by the optimized bicriteria algorithm (Chapter 5) using the
wire delay on the lowest layer for the default wire code. As source for each cluster we use
a dummy node placed on the projection of the net’s source into the bounding box of the
sinks that belong to that cluster. These sources are then connected to the net’s source by
a bicriteria topology that uses the delay per length parameters on the highest layer and
for the wire code that makes the fastest signal propagation possible. The combination
of the topologies connecting the sinks of each cluster and the topology connecting the
dummy sources yields the initial topology that is embedded by the embedding algorithm
of Theorem 6.2.
Lower and upper delay bounds. The timing-constrained global routing algorithm
requires lower and upper delay bounds (Section 3.4). Since gate delays are not influenced
by Steiner trees in the linear delay model, it suffices to define these bounds for wire delays.
Let t be a sink pin and let s be the source of the net containing t. Let layer(s) and
layer(t) be the layer on which s and t are located respectively. The delay
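\[ \min_{z \text{ layer}} \left( \sum_{z'=\min\{\mathrm{layer}(s),z\}}^{\max\{\mathrm{layer}(s),z\}-1} \min_{wc \text{ wire code}} \rho_{\mathrm{via}}(z', z'+1, wc) + \mathrm{dist}(s, t) \cdot \min_{wc \text{ wire code}} \rho_{\mathrm{wire}}(z, wc) + \sum_{z'=\min\{\mathrm{layer}(t),z\}}^{\max\{\mathrm{layer}(t),z\}-1} \min_{wc \text{ wire code}} \rho_{\mathrm{via}}(z', z'+1, wc) \right) \]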
is a lower bound on the wire delay between s and t (cf. Section 6.3.2 and Henke [Hen16]).
This bound assumes that the distance dist(s, t) is realized by wires with the same wire
delay. Since in our instances, each layer has an adjacent layer for which the wire delay ρwire
with respect to all wire codes is similar, this is a reasonable assumption (cf. Section 2.4.3).
To improve this lower bound in situations in which all layers have different characteristics,
it is possible to compute an s-t path with minimum delay directly.
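The lower bound can be evaluated directly from the delay parameters. The Python sketch below assumes nested dictionaries `rho_wire[z][wc]` (delay per length on layer z with wire code wc) and `rho_via[z][wc]` (delay of a via between layers z and z + 1); these containers and the function name are hypothetical.

```python
def lower_delay_bound(dist, layer_s, layer_t, rho_wire, rho_via):
    """Lower bound on the wire delay between a source and a sink at distance dist."""
    def vias(a, z):
        # Cheapest via stack between layers a and z over all wire codes.
        return sum(min(rho_via[k].values()) for k in range(min(a, z), max(a, z)))

    return min(
        vias(layer_s, z) + dist * min(rho_wire[z].values()) + vias(layer_t, z)
        for z in rho_wire
    )
```

For long connections the bound is typically attained on a fast upper layer, because the additional via delays are then outweighed by the smaller delay per length.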
Defining an upper bound on delay is more difficult. As before, we start by computing
the delay along a straight path. Instead of selecting optimum layers and wire codes, we
select the lowest available layer and use the default wire code. We multiply this straight
path delay with a detour factor that accounts for both
1. detours caused by the choice of the embedded topology, and
2. detours caused by the embedding of edges of that topology.
We also add a small constant to the resulting bound.
We start by computing a short topology for each instance. The ratio between the
ℓ1-length of the source-sink path in that short topology and the geometric ℓ1-distance
between source and sink serves as an upper bound on the detour inside the topology
(although it is no strict upper bound). The detours caused by embedding are quite small
on most instances. We multiply the detour factor with an additional factor of 1.5 and
relax this bound further by adding the delay needed to send the signal from one tile to its
neighboring tile and back. This delay bound is no strict upper bound but turns out to
be suitable in practice. For the largest unit U11 this bound is only violated for 0.003%
of all timing resources. None of them lie on timing paths that do not meet their timing
specification at the end. On the other units the situation is similar.
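The heuristic upper bound described above can be summarized in a few lines. This is only a sketch under assumptions: the function and parameter names are hypothetical, the straight-path delay on the lowest layer with the default wire code is taken as a given input, and the relaxation "to the neighboring tile and back" is interpreted as twice a one-tile hop delay.

```python
def upper_delay_bound(straight_delay, topo_path_len, l1_dist,
                      tile_hop_delay, extra_factor=1.5):
    """Heuristic (non-strict) upper bound on the wire delay of one source-sink pair.

    straight_delay : delay of a straight path on the lowest available layer
                     with the default wire code (assumed precomputed)
    topo_path_len  : l1-length of the source-sink path in a short topology
    l1_dist        : geometric l1-distance between source and sink
    tile_hop_delay : delay for moving the signal to a neighboring tile (assumption)
    """
    # Detour factor of the topology; for coinciding pins we assume no detour.
    detour = topo_path_len / l1_dist if l1_dist > 0 else 1.0
    # Additional factor of 1.5 plus the delay of a round trip to the neighboring tile.
    return straight_delay * detour * extra_factor + 2.0 * tile_hop_delay


example_bound = upper_delay_bound(straight_delay=100.0, topo_path_len=15.0,
                                  l1_dist=10.0, tile_hop_delay=4.0)
```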
Reference algorithm. We compare TCGRLin with the layer assignment algorithm
CATALYST [Wei+13] that is part of the IBM design flow. CATALYST computes a
timing and congestion-aware layer assignment by making calls to a 2-dimensional global
router. This assignment is then passed as constraint to the timing-unaware version of
BonnRouteGlobal. In addition to these layer assignments, the reference run pre-assigns
net length bounds to critical nets and bounds the source-to-sink distances inside topologies
for high fan-out nets. These bounds are respected by BonnRouteGlobal. In addition,
we post-optimize the output of CATALYST with BonnLayerOpt, the layer assignment
tool of the BonnTools suite. This step is necessary for a fair comparison since CATALYST
estimates delays based on 2-dimensional Steiner trees with a star-like topology instead of
propagating delays along global wires as we do in the final BonnLayerOpt step.
Results. The results are shown in Table 6.1. The first lines (“Bounds”) show worst
slacks (wsl) and sum of negative slacks (sns) in the case that all delays are equal to their
lower bound. We also ran the timing-unaware version of BonnRouteGlobal without
any assignments and embedded approximately shortest Steiner trees without considering
congestion or timing. The best routing overload among all runs is reported as “bound”
on overload (ol) while shortest obtained wiring lengths (wl) and the smallest number of
required vias yield “bounds” for these two metrics.
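For readers who want to reproduce the two timing metrics from a list of slack values, a minimal sketch follows. The function names are hypothetical, and which pins or endpoints contribute a slack value is assumed to follow the definitions given earlier in this thesis; here a plain list of numbers is used.

```python
def worst_slack(slacks):
    """Worst slack (wsl): the minimum over all given slack values."""
    return min(slacks)


def sum_negative_slacks(slacks):
    """Sum of negative slacks (sns): only negative values contribute."""
    return sum(s for s in slacks if s < 0)


example_slacks = [-26.0, 5.0, -3.0, 0.0]  # made-up values in ps
wsl = worst_slack(example_slacks)
sns = sum_negative_slacks(example_slacks)
```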
The remaining lines show worst slack (wsl), sum of negative slacks (sns), routing
overload (ol), wiring length (wl), number of required vias (vias), and total running
time (wall time) for the reference run and for TCGRLin. For our algorithm we report
both results of the fractional solution after the resource sharing phase (Lines ①–⑪ in
Algorithm 4) and of the final integral solution.
Unit  #nets      cycle   Experiment             wsl   sns      ol      wl     vias    wall time
                 time                           [ps]  [ns]             [m]    [K]     [h:mm:ss]
U1    10 188     250 ps  “Bounds”               11    0        0       0.80   25      -
                         CATALYST [Wei+13]      -114  -4       0       0.80   28      0:03:40
                         TCGRLin (fractional)   -32   -1       1       0.80   31      0:00:15
                         TCGRLin (final)        -26   -1       0       0.80   30      0:02:31
U2    21 385     240 ps  “Bounds”               -62   -44      0       3.37   195     -
                         CATALYST [Wei+13]      -228  -220     8 032   4.14   225     0:04:58
                         TCGRLin (fractional)   -118  -107     8       3.47   214     0:01:43
                         TCGRLin (final)        -90   -96      36      3.43   210     0:04:01
U3    140 380    240 ps  “Bounds”               -57   -5       0       1.77   834     -
                         CATALYST [Wei+13]      -83   -59      0       2.07   1 259   0:07:53
                         TCGRLin (fractional)   -58   -13      0       1.88   1 228   0:01:36
                         TCGRLin (final)        -58   -12      0       1.86   1 229   0:04:43
U4    156 800    264 ps  “Bounds”               -44   -6       4       5.04   938     -
                         CATALYST [Wei+13]      -72   -36      4       5.10   1 700   0:09:52
                         TCGRLin (fractional)   -52   -10      53      5.18   1 738   0:05:22
                         TCGRLin (final)        -52   -9       53      5.20   1 816   0:10:06
U5    165 379    274 ps  “Bounds”               -306  -457     0       9.78   952     -
                         CATALYST [Wei+13]      -782  -3 163   12      10.78  2 191   0:12:00
                         TCGRLin (fractional)   -307  -755     1 002   9.90   1 499   0:02:16
                         TCGRLin (final)        -306  -682     0       9.91   1 614   0:06:43
U6    304 718    790 ps  “Bounds”               -6    -0       6       8.73   1 172   -
                         CATALYST [Wei+13]      -30   -0.3     6       9.29   1 877   0:07:35
                         TCGRLin (fractional)   -40   -0.2     9       8.87   1 792   0:02:17
                         TCGRLin (final)        -30   -0.1     7       8.88   1 949   0:08:35
U7    355 234    208 ps  “Bounds”               -99   -12 709  16      13.51  1 461   -
                         CATALYST [Wei+13]      -100  -12 766  42 000  15.18  2 918   0:23:07
                         TCGRLin (fractional)   -102  -12 881  177     13.85  2 579   0:08:21
                         TCGRLin (final)        -100  -12 810  19      13.86  2 833   0:22:22
U8    361 684    184 ps  “Bounds”               -237  -32      0       8.54   2 147   -
                         CATALYST [Wei+13]      -329  -88      124     8.59   2 982   0:13:32
                         TCGRLin (fractional)   -244  -46      0       8.83   2 985   0:04:05
                         TCGRLin (final)        -248  -45      0       8.85   3 132   0:13:15
U9    413 199    208 ps  “Bounds”               -36   -3       0       12.37  1 712   -
                         CATALYST [Wei+13]      -58   -26      0       13.26  3 140   0:11:12
                         TCGRLin (fractional)   -40   -13      0       12.60  3 016   0:03:20
                         TCGRLin (final)        -38   -10      0       12.62  3 406   0:11:54
U10   477 869    208 ps  “Bounds”               -116  -3       933     10.32  2 021   -
                         CATALYST [Wei+13]      -193  -227     1 860   11.11  3 265   0:12:40
                         TCGRLin (fractional)   -134  -42      1 734   10.93  3 302   0:03:55
                         TCGRLin (final)        -125  -48      1 539   10.92  3 485   0:15:04
U11   1 257 242  184 ps  “Bounds”               -80   -1 380   0       36.37  7 539   -
                         CATALYST [Wei+13]      -181  -1 868   9       36.79  10 993  0:55:45
                         TCGRLin (fractional)   -91   -1 479   0       37.02  10 983  0:16:23
                         TCGRLin (final)        -85   -1 463   0       37.14  12 073  0:49:22
Table 6.1: Comparison between CATALYST [Wei+13] and our timing-constrained global routing
algorithm with linear delay model and the topology embedding algorithm of Theorem 6.2 as block
solver (TCGRLin). All netlists were unbuffered and timing is evaluated with the linear delay model
(Definition 6.1). For TCGRLin we show results for fractional solutions after the resource sharing
phase and results of the final solutions.
(a) Congestion of U4 for CATALYST and TCGRLin. TCGRLin tries to use higher layers until the
congestion target is met.
(b) Congestion of U7 for CATALYST and TCGRLin. Layer assignments in the CATALYST
run create congestion.
Figure 6.6: Congestion plots for instances U4 (left) and U7 (right). White, green, and yellow
indicate that an edge is used by a small fraction only. Orange and red edges are used to almost
their full amount and purple edges indicate a violation of their resource capacity. The images show
the maxima over all routing layers.
On all units TCGRLin is at least as good as the CATALYST algorithm with respect
to timing. On U6 and U7 both algorithms compute solutions with a similar worst slack;
on all other units, the worst slack improved significantly, by 20 ps or more. On U3, U5,
U7, U9, and U11, the achieved worst slacks are close to optimum.
Except for the routing-critical unit U10 (where a routing without overload does not
seem to exist), TCGRLin could find a feasible or almost feasible solution with respect
to routability. Small overloads on U2, U4, U6, and U7 exhibit inaccuracies of the global
routing model at macro borders and when connecting pins that belong to macros. Typically,
they do not impact detailed routability.
A congestion plot of U4 can be found in Figure 6.6(a). Here, every edge is colored
according to the fraction to which its corresponding congestion resource is used. Light
colors such as white, green, and yellow indicate that an edge is used by a small fraction
only. Orange and red edges are used to almost their full amount and purple edges indicate
a violation of their resource capacity. The images show the maxima over all routing layers.
In the congestion plot of TCGRLin, most edges are printed in orange. In these areas, the
highest layers are used by a fraction of up to 95% which is the congestion target for the
designs. Many of these edges are yellow or green in the congestion plot for the CATALYST
run which indicates that they are used by 80% or less.
The two units with the fewest nets (U1 and U2) are integration instances. Although the
number of nets and the number of pins in each net is small, these instances are challenging
global routing tasks as they contain many blockages and the distances between the pins are
large. Similar to these integration designs, units U5, U7, and U10 have many blockages and
contain nets with sinks that are far apart from each other. Here, good timing solutions can
only be achieved if most parts of the long connections are realized on the highest available
layers. Simultaneously, the many blockages make it hard to find a routable solution. Except
for U7, the layer assignment algorithm failed to compute good solutions with respect to
timing for these instances. On U2 and U7 the solution was not routable. In contrast to
the layer assignment algorithm, the timing-constrained global routing algorithm could
Figure 6.7: Congestion plots and slack histograms on instance U11. Panels: Bounds
(timing-unaware and routing-unaware), CATALYST [Wei+13], and TCGRLin.
avoid hot-spots by locally using lower layers in routing-critical areas. As an example of
such a behavior congestion plots of U7 can be found in Figure 6.6(b). The CATALYST
algorithm created large routing hot-spots while other parts show a small utilization of
routing resources. The results suggest that especially on these integration instances the
new approach is superior to the classical layer assignment algorithm.
With respect to wire length and vias, neither algorithm dominates the other. On some
units, selecting timing-driven topologies in TCGRLin increases the total wire length and
the more flexible usage of different routing layers consumes more vias than the layer
assignment algorithm. However, in many cases, layer assignments worsen routability a lot
and many nets need to take detours to avoid congestion. On these units, the CATALYST
runs produce more wiring length and need more vias.
The comparison between fractional and final solutions in TCGRLin shows that we can
recover the quality of the fractional solution after rounding and rip-up and re-route. Since
the fractional solution also contains global routes from early phases, integral solutions are
sometimes even better in all metrics.
The results suggest that on unbuffered instances layer assignment approaches have a
limited power to fulfill timing constraints. First, assignments of large nets to high layers are
often inhibited as they create congestion by also forcing connections to uncritical sinks to
the limited routing space on the assigned layers. Second, the timing-unaware global routing
may choose unfavorable topologies. In contrast, the delay prices and the implicit delay
bounds computed throughout TCGRLin ensure that the eventually critical connections are
short and use layers that make fast signal propagation possible. The large timing benefit
does not go along with larger routing overloads or higher running times.
A visual comparison between TCGRLin and CATALYST can be found in Figure 6.7.
Below the congestion plots in the upper image there are slack histograms that show the slack
distribution of all gates, where each gate is represented by its worst slack. Here, the yellow
bar represents all gates with slack zero, lower bars show uncritical gates that have positive
slacks, and the upper bars represent gates with negative slacks.
The pictures show that the slack distribution of the result of TCGRLin is similar to
the slack distribution of a solution that does not take routing into account at all. Layer
utilizations are large with TCGRLin (larger than with e. g. a timing-unaware algorithm) but
we do not create any violations of congestion resources. In contrast to that, the reference
                 wsl       sns         ol
After phase 1    −112 ps   −1 511 ns   7 224
After phase 3    −88 ps    −1 512 ns   6 386
After phase 5    −89 ps    −1 525 ns   34
After phase 10   −89 ps    −1 519 ns   0
After phase 25   −90 ps    −1 475 ns   0
Figure 6.8: Development of timing and congestion on instance U11 during the resource sharing
phases.
algorithm creates some local hot-spots at macro borders and ends up with sub-optimum
timing. The pictures in Figure 6.8 show the progress of our algorithm during the resource
sharing phases. Already after the first five phases we have eliminated most timing problems
with an almost feasible fractional global routing. In the remaining phases, we fix the
remaining timing violations, improve the sum of negative slacks, and remove the overflow
on the few violated edges.
6.6 Port and Assertion Generation
Apart from the fact that Steiner trees computed by timing-constrained global routing with
the linear delay model are a good basis for buffering, the linear delay model has another
nice application in the floor planning step – one of the earliest parts of physical design.
6.6.1 Hierarchical Design Flows
Due to the increasing complexity of computer chips, hierarchical approaches are very
popular. Instead of designing a whole chip at once, it is subdivided into smaller units that
are optimized individually. In a later step, solutions for the single units are combined to
obtain a solution for the whole chip. In this step the units are considered as large macros
with pre-defined ports that need to be connected to ports of other units and to primary
ports of the top-level. Of course, these units can then be designed hierarchically again.
This way, the tremendous amount of design and computation work can be distributed
to different designers and to different machines more easily. By reduction of instance sizes
it becomes possible for the single designer and machine to put more effort into optimization.
Hierarchical approaches also have advantages with respect to stability and turn-around
time since local changes no longer require loading data or even re-computing solutions
for the whole chip. Although the flexibility of optimization tools is limited by imposing
hierarchy on a chip, results of hierarchical design flows are often even better than the
results of a flat design. An explanation for this paradox can be found in the complexity of the
chip design task. Since even most of the tiny sub-tasks that arise in the theory of VLSI design
are NP-hard, optimization tools are not optimal. A reasonable subdivision into sub-units
serves as a guide for heuristic approaches.
The success of hierarchy depends on the amount of interconnection between the sub-
units and between the units and the top-level. If signal paths traverse several units, it
is difficult to optimize the timing of a single unit without knowing the delay of the path
through the remaining units. To enable timing optimization on a unit-level, I/O-assertions
for the unit’s ports need to be generated. These assertions include
• arrival times, capacitance limits, and output slews for primary input ports
• required arrival times, capacitances, and slew limits for primary output ports.
In Section 6.6.3 we show how to use timing-constrained global routing with the linear delay
model to create these assertions.
As wiring inside the unit and wiring of the top-level interconnections share the same
routing space, one has to be careful with routing space consumption. When we compute a
routing of a sub-unit, we must not use all the routing space on high performance layers
although that seems to be a good solution from the unit’s perspective.
To enable the application of hierarchy, classical design flows have strictly divided the
routing layers between unit-level and top-level. For a unit ceiling z that is strictly smaller
than the number Z of available wiring planes, the unit internal routing is allowed to
use layers 1, . . . , z while layers z + 1, . . . , Z above the unit are reserved for the top-level
instances. A major drawback of this approach is that uncritical connections of the top-level
nets are forced to use routing resources on high performance layers that must not be used
for the critical nets on unit-level. Since the amount of wires that fit on the highest layers
is usually small, congestion problems may arise. To overcome this issue, the macros that
represent units on the top-level instance are often not packed as densely as possible. This
way, an unnecessarily large amount of unused space on the placement layer is created.
Figure 6.9 shows a (simplified) example of a top-level instance. The spacing between
the sub-units is quite large resulting in a large amount of unused placement space. No
routing of the top-level nets uses routing space within the units while nets of the units are
completely connected within the units. These connections are not shown in the picture.
For the hierarchical design flow to be successful it is essential that all ports of the units
(here depicted as black and gray boxes) have good assertions. The ports’ positions also
need to be determined beforehand and have a large impact on timing and routability.
6.6.2 Abutted Hierarchy and Port Assignment
To reduce the amount of unused placement space and to avoid the necessity of an a priori
separation of routing space we follow an abutted hierarchy approach. Development of this
approach is joint work with Harald Folberth, Stephan Held, Pietro Saccardi, and Friedrich
Schröder. Our main goal is that units can be placed without enforcing space in
between.
We assume that we are given both positions and I/O-assertions for initial ports of
the units. A netlist tells us how these ports are connected. As shown by Bartoschek et
al. [Bar+10] we can determine an estimate ρ(e) on the delay for traversing an edge e in the
global routing graph in an almost optimally buffered netlist. Recall that we have already
used these estimates in Sections 5.1.1 and 6.5.
We run a timing-constrained global routing with the linear delay model that is allowed
to use the whole available routing space to connect the initial nets (Figure 6.10(a)). Unit
internal connections can either be completely ignored in that step, or the unit-level nets
Figure 6.9: A top-level instance connecting ports of sub-units and ports of the top-level. Input
ports are plotted black and output ports are gray. In a classical design flow wires connecting these
ports may use layers above the units’ ceilings and routing space in the gaps between units only.
Routing space within the units is reserved for unit internal connections.
can be included into the netlist. If we decide for the latter solution, we enforce that unit
internal connections do not cross their unit’s boundary by restricting their routing area.
The solution of the initial timing-constrained global routing is used to replace the
initial ports with new ports at all positions at which the Steiner trees (for top-level nets)
cross unit borders. Simultaneously, the Steiner trees determine how these new ports
are connected and hence define new nets that can be classified into the following three
categories (Figure 6.10(b)):
1. Nets that connect an initial output port of a unit to a new output port at the border
of the same unit. The new port replaces the initial port in a net of the unit’s netlist.
In Figure 6.10(b) and Figure 6.10(c), an example of such a net is marked with 1.
The left gray port of the green unit is replaced by the new white port on the bottom
left of the green unit.
2. Nets that connect a new input port of a unit to new output ports of the same unit
and to initial input ports of that unit. By replacing each initial input port by all
pins to which this port is connected in the unit’s netlist, we obtain a new net that
we add to the unit’s netlist. In Figure 6.10(b) and Figure 6.10(c) such a situation is
marked with 2.
3. Nets that connect new output ports of units (respectively primary input ports of
the top-level instance) to new input ports of different units (respectively to primary
output ports of the top-level). These nets form the new top-level instance. If the
spacing between the units is indeed small, these nets are trivial. These trivial
connections are drawn in blue in Figure 6.10(c). The edge marked by 3. is an
example.
We obtain netlists for unit-level and top-level instances. Due to strict design rules it is
usually forbidden to access ports by Steiner trees containing jogs, vias, or wire segments
parallel to unit boundaries close to unit boundaries. To forbid these structures we restrict
the routing space close to unit boundaries.
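The port creation step described above can be sketched as follows. This is an illustration under assumptions, not the implementation used in the thesis: a Steiner tree is given as a list of directed edges between global routing graph nodes, `unit_of` is a hypothetical map from a node to the unit containing it (missing or `None` for top-level area), and a new port is reported for every tree edge whose endpoints lie in different units.

```python
def new_ports(tree_edges, unit_of):
    """Return (unit, direction, edge) triples for all unit border crossings.

    tree_edges : directed edges (u, v) of a Steiner tree, oriented away from the source
    unit_of    : dict mapping a node to its unit id; nodes outside all units are
                 missing or mapped to None (hypothetical representation)
    """
    ports = []
    for u, v in tree_edges:
        unit_u, unit_v = unit_of.get(u), unit_of.get(v)
        if unit_u == unit_v:
            continue  # edge stays inside one unit (or inside top-level area)
        if unit_u is not None:
            ports.append((unit_u, "output", (u, v)))  # signal leaves unit_u
        if unit_v is not None:
            ports.append((unit_v, "input", (u, v)))   # signal enters unit_v
    return ports


# Two abutted units A and B: the tree path a -> b -> c -> d crosses the border once.
example_edges = [("a", "b"), ("b", "c"), ("c", "d")]
example_units = {"a": "A", "b": "A", "c": "B", "d": "B"}
example_ports = new_ports(example_edges, example_units)
```

For abutted units, a single crossing yields an output port of one unit and an input port of the neighboring unit at the same position, which corresponds to the trivial top-level nets of category 3.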
(a) Abutted units. Each of the units gets the full layer range.
(b) Initial ports (black and gray) with Steiner trees (black) that define the units’ new ports
(white). For better visibility we draw in 2 dimensions and draw connections to the interior
of the unit for two ports only (brown).
(c) Nets after creation of new ports. Only nets within units (dashed) and trivial top-level nets
(blue) remain.
Markers 1., 2., and 3. in (b) and (c) indicate examples of the three net categories.
Figure 6.10: Timing-constrained global routing can be used to define a new set of ports. By
abutting units we reduce the amount of unused space on the placement layer and we no longer
need to divide routing space between unit-level and top-level.
6.6.3 Assertion Generation
The arrival time customer framework can be used to generate assertions for the newly
created ports. Let (A, κ) be a Steiner tree for a top-level net N that crosses at least one
unit border. Let s be its source. Arrival time customers yield an arrival time at s: If s is a
primary port, the arrival time at(s) is given with the instance. Otherwise, the source port
is a unit’s output port and we can use the solution of the arrival time customers at the
preceding input port plus the time it takes to propagate through the unit. Similarly, we
obtain a required arrival time rat(t) at each sink t ∈ N\{s}.
Let s′1, . . . , s′k be the points where (A, κ) crosses the source’s unit border if s is an
output port of a unit, and let s′1 = s, k = 1 if s is a top-level port. For each sink t ∈ N\{s}
that is neither a top-level port nor an input port of the source’s unit let t′ be the point
nearest to t where A[s,t] crosses the border of the unit to which t belongs. For sinks that will
not be moved during port assignment we set t′ := t. We assign arrival times at(s′i) := at(s)
for i = 1, . . . , k and at(t′) := rat(t) to the sinks. By propagating the linear delay values
ρ(e) from the s′i along (A, κ) in topological ordering, we can define arrival times at(v) at
the other points v where (A, κ) crosses unit borders. These values serve as assertions for
arrival times and required arrival times for the corresponding ports.
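The propagation of linear delay values along the tree can be sketched as a breadth-first traversal, which visits the vertices of an arborescence in topological order. The data layout is an assumption made for illustration: `children` maps a tree vertex to its successors, `rho` maps a tree edge to its linear delay ρ(e), and `start_nodes` plays the role of the points s′1, . . . , s′k, whose arrival times are fixed to at(s).

```python
from collections import deque


def propagate_arrival_times(children, rho, start_nodes, at_source):
    """Propagate arrival times from the start nodes along a tree.

    children    : dict vertex -> list of successor vertices (hypothetical layout)
    rho         : dict (v, w) -> linear delay of tree edge (v, w)
    start_nodes : crossing points s'_1, ..., s'_k with at(s'_i) := at(s)
    """
    fixed = set(start_nodes)
    at = {s: at_source for s in start_nodes}
    queue = deque(start_nodes)
    while queue:
        v = queue.popleft()
        for w in children.get(v, []):
            if w in fixed:
                continue  # keep the assigned value at(s'_i) = at(s)
            at[w] = at[v] + rho[(v, w)]
            queue.append(w)
    return at


# Made-up example: s1 is a crossing point, p and q are later border crossings.
example_children = {"s1": ["a"], "a": ["p", "q"]}
example_rho = {("s1", "a"): 5.0, ("a", "p"): 3.0, ("a", "q"): 7.5}
example_at = propagate_arrival_times(example_children, example_rho, ["s1"], 100.0)
```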
Together with the delay-per-length values that define the values ρ(e), Bartoschek et
al. [Bar+10] compute default capacitances and slews for given layers and wire codes. At
input ports these values can be used as assertions for capacitance limits and output slew.
At output ports they can be used as assertions for slew limit and capacitance.
Let t ∈ N\{s} and let s′ ∈ {s′1, . . . , s′k} such that t′ is reachable from s′ in A. One
can think of several improvements to refine assertions depending on the slack
wsl(t′) := at(t′) − at(s′) − linear_delay_(A,κ)(s′, t′).
Distributing positive slacks. In case of positive slack on a path we distribute the
slack to all units. We compute a delay scaling factor α > 1 such that max_{t∈N\{s}} wsl(t′)
becomes zero if we increase all delays by a factor of α. It is also possible to scale delays
non-uniformly.
Distributing negative slacks. In case of negative slacks, we distribute the slack
between source and sink units. In all other units, and for top-level connections, negative
slack is usually unwanted. To accomplish this, we increase at(t′) by −wsl(t′)/2 for all
t ∈ N\{s} with wsl(t′) < 0. For i = 1, . . . , k we decrease at(s′i) by 1/2 · max{−wsl(t′) :
t reachable from s′i in A}. This method can be refined further by distributing the negative
slack non-uniformly between the source and the sink continent.
Improved assertions by buffered Steiner trees. Instead of computing assertions
based on the linear delay model, we can directly compute buffered Steiner trees. Instead
of using default values for capacitance and slew, we can propagate the actual slews, slew
limits, capacitances and capacitance limits along the fully buffered Steiner tree and obtain
more accurate assertions. When computing these buffered Steiner trees it is important
that these are electrically correct and have a good timing with respect to a delay model
that measures slew effects. Buffered Steiner trees computed with the algorithms presented
in Chapter 4 do not fulfill these requirements. In Chapter 7 we show how to compute a
low-cost buffering of a given Steiner tree. In addition to routing and placement congestion,
these costs model penalties for capacitance and slew violations, and for slew-dependent
delays.
The methods described in this section are not only of theoretical interest but are
actually used in practice at IBM. An example output of a real-world top-level design for
which ports and assertions have been created as described is shown in Figure 6.11.
(a) Top-level view on the design.
Each unit is plotted in a different
color. The top-level routing is not
shown.
(b) Zoom into the top-level routing
shows distribution of the wires that
define ports.
(c) Zoom into Figure 6.11(b)
shows created ports (white
boxes).
Figure 6.11: Real-world output of port assignment. Pictures are shown in two dimensions although
the used global routing graph is a 3-dimensional graph.
Chapter 7
Buffering a Given Steiner Tree
Buffering algorithms in today’s physical design flows typically work in two steps.
1. In a first step a Steiner tree or a topology is computed.
2. This Steiner tree or topology is then buffered.
Since we already discussed algorithms for the computation of topologies (Chapter 5)
and Steiner trees (Chapter 6) in previous chapters, we concentrate on the second step in
this chapter. As before we try to minimize costs for delays, routing congestion, placement
congestion, and other costs including costs for net length and power. We also try to obey
capacitance limits and slew limits.
After having recalled existing buffering algorithms (Section 7.2) we will introduce a new
approach to realize Step 2. This algorithm unites classical dynamic programming algorithms
(e. g. Van Ginneken [Van90]) with the Fast Buffering Algorithm by Bartoschek [Bar14] and
Bartoschek et al. [Bar+09]. We discuss two versions of it. In a basic version we propagate
many candidates and obtain good solutions. We show how to make use of information
based on repeater library and input characteristics to prune most candidates without
degrading the results too much. We obtain a fast mode of our algorithm that can be used
even for large designs.
After the bicriteria algorithm of Section 5.3 to compute a routing topology and the
topology-embedding algorithm contained in the proof of Theorem 6.2, the algorithm
presented in this chapter is the last building block of a (heuristic) buffering-and-routing
oracle that can be used in the resource-sharing-based timing-constrained global routing
framework introduced in Chapter 3. This oracle will be fast enough for practical application.
7.1 The Minimum Cost Steiner Tree Buffering Problem
In this section we give a detailed formulation for the Minimum Cost Steiner Tree Buffering
Problem. Since this chapter is targeted at improving results on practical VLSI instances
we have to make several deviations from the setting in the previous chapters. In particular,
we no longer make use of the simple version of the Elmore delay model [Elm48] described in
Section 4.2.1 that does not consider slew effects.
7.1.1 Connecting the Detailed Pin Shapes
The first deviation concerns the global routing graph. Recall that the standard global
routing graph Gcoarse introduced in Section 2.4.3 has a vertex for each tile of a partition of
the routing area and joins two vertices if they represent adjacent tiles. All pins of a net
are mapped to a graph node representing a tile that has non-empty intersection with that
pin. We call these nodes global pins. From a timing point of view this strategy has many
drawbacks.
• If the distance between the nodes representing source and sink of the same net
is larger than the actual distance between the pins, we might insert unnecessary
repeaters.
• If the distance between global pins is shorter than between the corresponding detailed
pins, we underestimate capacitances and slews. This can lead to timing degradations
and electrical violations.
This situation is even worse considering the fact that nodes to which we can assign repeaters
need a position in the coarse graph Gcoarse. Refining the partition would certainly decrease
these over- or underestimates but can also increase running times.
To avoid this problem we do not restrict ourselves to Gcoarse during buffering. Instead,
we consider the infinite graph Gfine with vertex set V (Gfine) = [0,W ]× [0, H]× {1, . . . , Z},
where W ∈ R>0 is the width, H ∈ R>0 is the height of the chip image, and Z ∈ N is the
number of wiring planes. We have an edge between two nodes (x, y, z) and (x′, y′, z′) in
Gfine if they have exactly two coordinates in common. For simplicity, we also add edges
between any pair of points in the same tile on the same routing layer – even if we create
jogs that way. As before, different wire codes can be modeled as parallel edges in Gfine.
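The edge set of Gfine can be checked with a small predicate. The sketch assumes a uniform tile grid with tile width `tile_w` and tile height `tile_h` and assigns points on a tile border to the tile with the larger index; both choices, like the function names, are assumptions made for illustration.

```python
import math


def tile_of(p, tile_w, tile_h):
    """Tile (column, row, layer) containing point p = (x, y, z); uniform grid assumed."""
    x, y, z = p
    return (math.floor(x / tile_w), math.floor(y / tile_h), z)


def is_edge_in_gfine(p, q, tile_w, tile_h):
    """True if {p, q} is an edge of G_fine as described above."""
    if p == q:
        return False
    common = sum(a == b for a, b in zip(p, q))
    if common == 2:
        return True  # straight wire segment or via
    # Additional edges: any two points in the same tile on the same layer.
    return tile_of(p, tile_w, tile_h) == tile_of(q, tile_w, tile_h)


straight = is_edge_in_gfine((0.0, 0.0, 1), (5.0, 0.0, 1), 10.0, 10.0)
via = is_edge_in_gfine((0.0, 0.0, 1), (0.0, 0.0, 3), 10.0, 10.0)
jog_in_tile = is_edge_in_gfine((1.0, 1.0, 1), (2.0, 3.0, 1), 10.0, 10.0)
diagonal_across_tiles = is_edge_in_gfine((1.0, 1.0, 1), (12.0, 3.0, 1), 10.0, 10.0)
```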
It is easy to transform a Steiner tree (Aglobal, κglobal) in Gcoarse connecting the global
pins into a Steiner tree (A, κ) in Gfine that connects the original shapes such that the
number of crossings of tile borders is identical for both trees. Kiefner [Kie16] has shown
how to minimize tree length as first and delays as secondary objective.
As before, we measure congestion at tile borders (that are represented by edges in
Gcoarse). When we query congestion cost or consumption of an edge e = {(x, y, z), (x′, y′, z′)}
∈ E(Gfine) there are two cases. Either e connects points in the same tile, and hence does
not consume from congestion resources, or e is a straight path segment and consumes
from all congestion resources of tile borders that are crossed by e. Using this definition of
congestion for edges in Gfine, (Aglobal, κglobal) and (A, κ) consume exactly the same amount
from all congestion resources.
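The two cases of the congestion query can be illustrated for edges within one routing layer. The sketch assumes a uniform tile grid and identifies a tile border with the pair of coarse tiles it separates; vias are left out of scope, and all names are hypothetical.

```python
import math


def crossed_borders(p, q, tile_w, tile_h):
    """Tile borders (as pairs of adjacent coarse tiles) crossed by the edge {p, q}.

    Only edges within a single routing layer are covered by this sketch.
    """
    (x1, y1, z1), (x2, y2, z2) = p, q
    if z1 != z2:
        raise ValueError("sketch covers edges within one routing layer only")
    i1, j1 = math.floor(x1 / tile_w), math.floor(y1 / tile_h)
    i2, j2 = math.floor(x2 / tile_w), math.floor(y2 / tile_h)
    if (i1, j1) == (i2, j2):
        return []  # same tile: no congestion resource is consumed
    if y1 == y2:   # straight segment in x-direction
        return [((i, j1, z1), (i + 1, j1, z1)) for i in range(min(i1, i2), max(i1, i2))]
    if x1 == x2:   # straight segment in y-direction
        return [((i1, j, z1), (i1, j + 1, z1)) for j in range(min(j1, j2), max(j1, j2))]
    raise ValueError("not an edge of G_fine")


long_segment = crossed_borders((3.0, 4.0, 2), (27.0, 4.0, 2), 10.0, 10.0)
inside_tile = crossed_borders((1.0, 1.0, 1), (2.0, 3.0, 1), 10.0, 10.0)
```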
When we discuss algorithms to buffer a given Steiner tree we rather think of Steiner
trees in Gfine instead of Gcoarse.
7.1.2 Steiner Tree Transformations
Input to the Minimum Cost Steiner Tree Buffering Problem is a Steiner tree (A, κ), say
in the graph Gfine defined in the previous section. In principle we would like to obtain
a buffered Steiner tree ((A, κ), b) as defined in Section 4.1. However, just computing a
function b : V (A)→ L for a discrete repeater library L would be insufficient due to formal
reasons.
(a) Initial Steiner tree consisting of a long path on high layers. We cannot insert repeaters
into the Steiner tree directly. We assume that the sink has ident polarity.
(b) We subdivide the upper path three times and place the middle Steiner node on the placement
layer. We assign an inverter to that node. To meet the polarity constraint we have to insert a
second inverter. We subdivide the edge leaving the source by a Steiner node at the root’s
position and assign the second inverter to it.
Figure 7.1: During buffering we allow subdivisions of the initial Steiner tree.
• It is possible that function κ does not map any Steiner node to a graph node in which
buffer insertion is allowed. For example, (A, κ) could be a 3-dimensional rectilinear
Steiner tree for which all wires in x- or in y- direction use a high layer. Without
allowing insertion of paths containing Steiner points mapped to graph nodes on
which buffer insertion is allowed (e. g. nodes on the lowest layer), we cannot insert
any repeater and cannot even output a logically correct solution. Such a situation
can be found in Figure 7.1.
• Edges in (A, κ) could be long. Again, we must allow edges to be subdivided to achieve
good results with respect to timing and to meet all polarity constraints. As before,
Figure 7.1 depicts such a situation.
• Sometimes, polarity constraints force us to insert many inverters. To overcome that
problem we allow local topology changes as shown in Figure 7.2.
[Figure 7.2: two panels, each showing a net with sinks t1, ..., t5 labeled invert, ident, invert, ident, invert]
(a) Initial Steiner tree for a net with 5 sinks. Sinks t2 and t4 have ident polarity; t1, t3, and t5 have invert polarity. Without topology changes we have to insert at least three inverters.
(b) After local topology changes there is a legal buffering with one inverter.
Figure 7.2: In order to save inverters we allow local topology changes during buffering. In this picture we assume that we may assign repeaters to the white Steiner nodes.
The following definition summarizes the allowed transformations to the initial Steiner tree.
Definition 7.1 (Feasible modification) We say that a Steiner tree (A′, κ′) is a feasible
modification of a Steiner tree (A, κ) if we can reach (A′, κ′) from (A, κ) by successively
applying the following operations:
1. Multiple subdivision of an edge in A (without changing the placement of the adjacent
vertices)
2. Let ω ∈ V (A) such that δ+A(ω) = {(ω, ω1), (ω, ω2)} and δ−A(ω) = {(ν, ω)}. Let ω′1
and ω′2 be new vertices placed at the position of ω. Replace ω and its incident edges
by ω′1, ω′2, (ν, ω′1), (ν, ω′2), (ω′1, ω1), (ω′2, ω2).
3. Let ω1, ω2 ∈ V (A) such that ω1 and ω2 have the same position u and the same
predecessor ν. Insert a new vertex ω with position u and replace (ν, ω1), (ν, ω2) by
(ν, ω), (ω, ω1), (ω, ω2).
We say that Operation 2 sets ω into parallel mode while Operation 3 resolves the
parallel mode at ω1 and ω2. Observe that the Steiner trees in Figures 7.1(b) and 7.2(b)
are feasible modifications of the Steiner trees in Figures 7.1(a) and 7.2(a) respectively.
7.1.3 Elmore Delay Model with Slew Propagation
Due to its simplicity, the Elmore delay model (Section 4.2.1) is a frequently used delay
model in timing optimization. For modern technologies the timing impact of slew effects
on delays is significant and ignoring them can lead to poor results. For that reason we
extend the simple version of the Elmore delay model of Section 4.2.1 by slew propagation.
We call the resulting delay model the Elmore Delay Model with Slew Propagation.
We now describe how to compute timing information of a buffered Steiner tree ((A, κ), b)
for a net N with source s in a graph G. We can think of G being the graph Gfine introduced
in Section 7.1.1. Recall the function κ : V (A) ∪ E(A)→ V (G) ∪ (E(G) ∪ {◦}) for a graph
G as defined in Section 2.1. As in Section 4.1, a function b : V (A) → L ∪ {□} assigns
(Steiner) nodes of A to repeaters of a finite repeater library L or to □ if we do not insert a
repeater at a node.
Instead of using a particular delay model directly, we assume that we are given delay
functions
dwire : E(G)× R≥0 × R≥0 → R≥0,
drepeater : L× R≥0 × R≥0 → R≥0, and
dsource : R≥0 → R≥0.
The above functions are treated as black-box functions but we assume that dsource is
monotonically increasing, and dwire and drepeater are monotonically increasing in the second
and the third argument (when the other arguments are fixed). We will explain later how
these functions yield a delay model.
Capacitances. The last input of all delay functions is a capacitance that can be computed
by propagation in reverse topological order (see Sections 2.5.4 and 4.2.1). We assume that
capacitances cap(t) are given for all sinks t ∈ N\{s}, as well as the input capacitance cap(l) for all
repeaters l ∈ L. A function cap : E(G)→ R≥0 determines the capacitance increase along
an edge in G. For wiring edges the latter function is usually just the product of the length
of the edge and a value that determines the capacitance increase per length. This value
depends on the edge’s wire code and layer. For via edges the capacitance increase is a
constant that also depends on the wire code and the connected layers. For the sake of a
simpler notation we define cap(◦) := 0.
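We recursively define the capacitance of a node ν ∈ V (A)\(N\{s}) by
\[
\mathrm{cap}(\nu) :=
\begin{cases}
\mathrm{cap}(b(\nu)) & \text{if } b(\nu) \in L, \\
\sum_{(\nu,\omega) \in \delta^+_A(\nu)} \bigl( \mathrm{cap}(\omega) + \mathrm{cap}(\kappa((\nu,\omega))) \bigr) & \text{otherwise,}
\end{cases}
\]
and the downstream capacitance of a node ν that is associated with a repeater as
\[
\mathrm{outcap}(\nu) := \sum_{(\nu,\omega) \in \delta^+_A(\nu)} \bigl( \mathrm{cap}(\omega) + \mathrm{cap}(\kappa((\nu,\omega))) \bigr).
\]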
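The recursion for cap and outcap can be illustrated by a minimal Python sketch. The `Node` container, the field names, and the toy numbers are assumptions made for illustration only; wire and via capacitances are passed directly as numbers instead of being looked up via κ.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Vertex of the tree A (hypothetical container, not from the thesis)."""
    name: str
    repeater: Optional[str] = None      # b(nu); None means "no repeater"
    sink_cap: Optional[float] = None    # cap(t), set only for sinks
    # outgoing edges as (child, capacitance of the wire or via leading to it)
    children: List[Tuple["Node", float]] = field(default_factory=list)


def outcap(node, input_cap):
    """Downstream capacitance: sum of cap(child) + cap(edge) over leaving edges."""
    return sum(cap(child, input_cap) + wire_cap for child, wire_cap in node.children)


def cap(node, input_cap):
    """Capacitance seen when looking into `node` from its predecessor."""
    if node.sink_cap is not None:       # sink pin: capacitance is part of the input
        return node.sink_cap
    if node.repeater is not None:       # a repeater shields everything below it
        return input_cap[node.repeater]
    return outcap(node, input_cap)      # plain Steiner node or the source


# Toy instance: s --2.0--> v (carries a buffer) --3.0--> t
t = Node("t", sink_cap=0.5)
v = Node("v", repeater="buf", children=[(t, 3.0)])
s = Node("s", children=[(v, 2.0)])
library_input_cap = {"buf": 1.0}
print(cap(s, library_input_cap), outcap(v, library_input_cap))
```

The sketch recomputes sub-tree values on every call; a single pass in reverse topological order, as described in the text, would store each value once.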
Capacitance violations. Source pins of all repeaters and of the net N itself have
capacitance limits. These are given by a function caplim : {s} ∪ L→ R≥0. We denote the
total amount of capacitance violations by
\[
\mathrm{capvio}(((A,\kappa),b)) := \max\{0, \mathrm{cap}(s) - \mathrm{caplim}(s)\}
+ \sum_{\nu \in V(A) :\, b(\nu) \in L} \max\{0, \mathrm{outcap}(\nu) - \mathrm{caplim}(b(\nu))\}.
\]
Slews. Wire and repeater delays (dwire, drepeater) can depend on slews. To propagate
slews we assume functions
outslew : (E(G) ∪ L)× R≥0 × R≥0 → R≥0 and outslewsource : R≥0 → R≥0.
As before, we assume that outslewsource is monotonically increasing and outslew is
monotonically increasing in the second and the third argument. After having computed
cap(ν) for all ν ∈ V (A) and outcap(ν) for ν ∈ V (A) with b(ν) ∈ L we recursively set
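\[
\mathrm{slew}(\nu) :=
\begin{cases}
\mathrm{outslew}_{\mathrm{source}}(\mathrm{cap}(s)) & \text{if } \nu = s, \\
\mathrm{outslew}(\kappa((\omega,\nu)), \mathrm{slew}(\omega), \mathrm{cap}(\nu)) & \text{if } \delta^-_A(\nu) = \{(\omega,\nu)\} \text{ and } b(\nu) \notin L, \\
\mathrm{outslew}(b(\nu), \mathrm{inslew}(\nu), \mathrm{outcap}(\nu)) & \text{otherwise,}
\end{cases}
\]
where inslew(ν) is defined as outslew(κ((ω, ν)), slew(ω), cap(ν)) for Steiner nodes ν associated
with a repeater b(ν) ∈ L and having entering arc (ω, ν) (cf. Section 2.5.4).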
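As an illustration of the forward slew propagation, the following Python sketch processes the nodes in topological order. The dictionary-based tree encoding and the three toy slew functions are assumptions; in practice the black-box functions outslew and outslewsource would come from a timing engine.

```python
def propagate_slews(order, parent, wire, repeater, cap, outcap,
                    outslew_wire, outslew_rep, outslew_source):
    """Compute slew(v) for all nodes and inslew(v) for nodes carrying a repeater.

    order: nodes in topological order, source first (illustrative encoding).
    """
    slew, inslew = {}, {}
    for v in order:
        u = parent.get(v)
        if u is None:                          # v is the source s
            slew[v] = outslew_source(cap[v])
            continue
        arriving = outslew_wire(wire[(u, v)], slew[u], cap[v])
        if repeater.get(v) is None:            # no repeater at v
            slew[v] = arriving
        else:                                  # repeater input sees `arriving`
            inslew[v] = arriving
            slew[v] = outslew_rep(repeater[v], arriving, outcap[v])
    return slew, inslew


# Toy black-box functions, monotone in slew and capacitance (made up).
def toy_source(c): return 1.0 + c
def toy_wire(e, s, c): return s + e * c
def toy_rep(l, s, c): return 0.5 * s + c

order = ["s", "v", "t"]
parent = {"v": "s", "t": "v"}
wire = {("s", "v"): 2.0, ("v", "t"): 1.0}      # one number per edge (assumption)
repeater = {"v": "buf"}
cap = {"s": 3.0, "v": 1.0, "t": 0.5}
outcap = {"v": 3.5}
slew, inslew = propagate_slews(order, parent, wire, repeater, cap, outcap,
                               toy_wire, toy_rep, toy_source)
print(slew, inslew)
```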
Slew limits and slew violations. Similar to capacitance limits, we have to obey slew
limits at all sink pins. Note that the set of sink pins includes the input pins of the newly
inserted repeaters, and obeying these limits means bounding the value of inslew(ν) for a
Steiner node ν ∈ V (A) assigned to a repeater b(ν) ∈ L. Let
slewlim : (N\{s}) ∪ L→ R≥0
be the function that yields the slew limits of the sinks of N and of the input pins of the
repeaters. We denote the total amount of slew violations of ((A, κ), b) by
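\[
\mathrm{slewvio}(((A,\kappa),b)) := \sum_{t \in N \setminus \{s\}} \max\{0, \mathrm{slew}(t) - \mathrm{slewlim}(t)\}
+ \sum_{\nu \in V(A),\, b(\nu) \in L} \max\{0, \mathrm{inslew}(\nu) - \mathrm{slewlim}(b(\nu))\}.
\]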
In the algorithms considered in this chapter we often propagate slew limits backwards through
wires and repeaters, i. e. we ask:
• For (ν, ω) ∈ E(A) such that ω has capacitance cap(ω) and slew limit slewlim(ω),
what is the largest slew at ν that does not cause a violation?
• For ν ∈ V (A) with b(ν) ∈ L such that the output pin of the repeater assigned to ν
drives capacitance outcap(ν) and has slew limit slewlim(ν), what is the largest slew
at the input of that repeater that does not create a slew violation?
As all slew functions are monotonically increasing with increasing input slew, we can
answer both questions by binary search. Sometimes, it is even possible to let the timing
engine propagate slew limits directly. In this chapter we just assume that we can efficiently
propagate slews both backwards and forwards along (sub-trees of) buffered Steiner trees.
Wire delay. Let ζ = (ν, ω) ∈ E(A). If κ(ζ) = ◦, its delay is 0. Otherwise, κ(ζ) ∈ E(G)
and we define the delay of traversing ζ as dwire(κ(ζ), slew(ν), cap(ω)).
Repeater delay. Let ν ∈ V (A) such that b(ν) ∈ L. The delay needed to traverse the
repeater assigned to ν is drepeater(b(ν), inslew(ν), outcap(ν)).
Source delay. The delay dsource(cap(s)) measures the delay through the source gate (if
s is an output pin of a logic gate). That delay depends on the slews of the input pins
of the gate. Since these slews do not depend on N , we treat them as being encoded in
function dsource. If s is a primary input or latch output, the function dsource is usually
the constant zero function.
Delay along a source-sink path. With the above definitions we can define the delay
along the unique path A[s,t] from s to a sink t ∈ N\{s} as
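\[
\mathrm{delay}_{((A,\kappa),b)}(t) := d_{\mathrm{source}}(\mathrm{cap}(s))
+ \sum_{\substack{\zeta=(\nu,\omega) \in E(A_{[s,t]}),\\ \kappa(\zeta) \neq \circ}} d_{\mathrm{wire}}(\kappa(\zeta), \mathrm{slew}(\nu), \mathrm{cap}(\omega))
+ \sum_{\substack{\nu \in V(A_{[s,t]}),\\ b(\nu) \in L}} d_{\mathrm{repeater}}(b(\nu), \mathrm{inslew}(\nu), \mathrm{outcap}(\nu)).
\]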
In this notation we do not explicitly mention the dependence on the functions dwire,
drepeater, dsource, cap, outslew, and outslewsource.
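A small Python sketch shows how the three delay contributions add up along one source-sink path. The dictionary encoding and the toy delay functions are assumptions for illustration; the slews and capacitances are taken as already computed.

```python
def path_delay(path, wire, repeater, slew, inslew, cap, outcap,
               d_wire, d_repeater, d_source):
    """Delay from the source path[0] to the sink path[-1]."""
    total = d_source(cap[path[0]])
    for u, v in zip(path, path[1:]):
        e = wire.get((u, v))               # None models an edge not mapped to G
        if e is not None:
            total += d_wire(e, slew[u], cap[v])
    for v in path:
        if repeater.get(v) is not None:    # repeaters on the path add their delay
            total += d_repeater(repeater[v], inslew[v], outcap[v])
    return total


# Toy black-box delay functions, monotone in slew and capacitance (made up).
def toy_d_source(c): return c
def toy_d_wire(e, s, c): return e * c + 0.5 * s
def toy_d_repeater(l, s, c): return 1.0 + 0.5 * c + 0.25 * s

path = ["s", "v", "t"]
wire = {("s", "v"): 2.0, ("v", "t"): 1.0}
repeater = {"v": "buf"}
cap = {"s": 3.0, "v": 1.0, "t": 0.5}
outcap = {"v": 3.5}
slew = {"s": 4.0, "v": 6.5, "t": 7.0}
inslew = {"v": 6.0}
delay_t = path_delay(path, wire, repeater, slew, inslew, cap, outcap,
                     toy_d_wire, toy_d_repeater, toy_d_source)
print(delay_t)
```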
7.1.4 Problem Formulation
With the notation of the previous sections we can now give a formal definition of the overall
problem that we wish to solve in this chapter.
Minimum Cost Steiner Tree Buffering Problem
Instance: A repeater library L consisting of buffers and inverters.
A graph G with edge and placement costs c : E(G) ∪ V (G)→ R≥0.
A net N ⊆ V (G) with source s.
Sink polarities pol : N\{s} → {ident, invert}.
Delay costs λ : N\{s} → R≥0.
Power consumptions power : L→ R≥0 and a power price cpower ∈ R≥0.
Capacitance and slew violation penalties pencap and penslew.
A Steiner tree (A, κ) for N .
Timing functions
cap : (N\{s}) ∪ L → R≥0
caplim : {s} ∪ L → R≥0
slewlim : (N\{s}) ∪ L → R≥0
dwire : E(G)× R≥0 × R≥0 → R≥0
drepeater : L× R≥0 × R≥0 → R≥0
dsource : R≥0 → R≥0
outslew : (E(G) ∪ L)× R≥0 × R≥0 → R≥0
outslewsource : R≥0 → R≥0.
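Output: A buffered Steiner tree ((A′, κ′), b) for N in G such that (A′, κ′) is a
feasible modification of (A, κ) and for all t ∈ N\{s},
\[
\bigl|\{\nu \in V(A'_{[s,t]}) : b(\nu) \in L \text{ and } b(\nu) \text{ is an inverter}\}\bigr| \text{ is even} \iff \mathrm{pol}(t) = \mathrm{ident}.
\]
Our goal is to minimize
\[
\sum_{\substack{\zeta \in E(A'),\\ \kappa'(\zeta) \neq \circ}} c(\kappa'(\zeta))
+ \sum_{\nu \in V(A')} c(\kappa'(\nu)) \cdot \mathrm{size}(b(\nu))
+ \sum_{t \in N \setminus \{s\}} \lambda(t) \cdot \mathrm{delay}_{((A',\kappa'),b)}(t)
+ c_{\mathrm{power}} \cdot \sum_{\substack{\nu \in V(A'),\\ b(\nu) \in L}} \mathrm{power}(b(\nu))
+ \mathrm{pen}_{\mathrm{slew}} \cdot \mathrm{slewvio}(((A',\kappa'),b))
+ \mathrm{pen}_{\mathrm{cap}} \cdot \mathrm{capvio}(((A',\kappa'),b)).
\]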
In this definition we optimize static power only. Optimizing dynamic power is also
easily possible (see Section 2.5.6).
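The polarity condition in the output specification is a parity check on the inverters along each source-sink path. A minimal Python sketch, in which the list encoding of a path and the repeater names are assumptions:

```python
def polarity_ok(path_repeaters, inverters, pol):
    """Check the polarity condition for one sink.

    path_repeaters: repeater types b(v) met on the path from s to the sink
                    (nodes without a repeater are simply left out);
    inverters:      the subset of the library that inverts the signal;
    pol:            "ident" or "invert", the required polarity of the sink.
    """
    n_inverters = sum(1 for l in path_repeaters if l in inverters)
    return (n_inverters % 2 == 0) == (pol == "ident")


inverters = {"inv_small", "inv_big"}   # hypothetical library entries
print(polarity_ok(["inv_small", "buf", "inv_big"], inverters, "ident"))
print(polarity_ok(["inv_small"], inverters, "ident"))
```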
Dealing with capacitance and slew violations does not really fit into the resource sharing
model (Section 3.2). In principle one would like to define the feasible solutions for a net
customer as the set of (convex combinations of) buffered Steiner trees for the nets without
any violations.
However, this is not always possible:
• An input slew of the source gate could be large and introduce slew violations at the
subsequent sink pins in any buffered Steiner tree.
• An input capacitance of a sink pin could exceed the capacitance limit of the source
of N , and of any repeater.
In the definition of the Minimum Cost Steiner Tree Buffering Problem we therefore allow
capacitance and slew violations but charge a price for them. We can think of the penalties
pencap and penslew as large numbers with pencap > penslew. In particular, they should be
larger than the price of any resource, so that we would rather overuse resources
than introduce violations.
7.2 Previous Work
The particular problem formulation of Section 7.1.4 arises from the need of a cost-based
buffering algorithm that we can use as a block solver for the resource sharing algorithm
(Chapter 3) together with the Steiner tree algorithms from Chapter 6.
Most buffering algorithms known in the literature aim instead at maximizing the
worst slack
\[
\min_{t \in N \setminus \{s\}} \bigl\{ \mathrm{rat}(t) - \mathrm{at}(s) - \mathrm{delay}_{((A,\kappa),b)}(t) \bigr\},
\]
where rat(t) is the required arrival time at t and at(s) is the arrival time at s.
7.2.1 Buffering by Dynamic Programming
The most natural and commonly used algorithmic paradigm to achieve that goal is dynamic
programming. The first version of a dynamic programming algorithm for a buffering
problem has been developed by Van Ginneken [Van90]. His algorithm works in the case
that delays are independent of slews (which is the case for the Elmore delay model of
Section 4.2.1) and that L consists of one buffer only (and no inverter).
While traversing an initial tree (A, κ) in reverse topological order he computes sets of
non-dominated candidates that define solutions for the sub-tree of A rooted at any node in
V (A). Tree (A, κ) is not transformed during the algorithm but assumed to be subdivided
beforehand. Van Ginneken [Van90] shows that the total running time for computing all
these sets and hence of his algorithm is O(|V (A)|²).
In the following years there have been several extensions of Van Ginneken’s dynamic
programming approach. In 1996, Lillis et al. [LCL96] showed how to deal with a larger
repeater library L containing both buffers and inverters. They obtained a total running
time of O(|L|² · |V (A)|²). In the same paper they also proposed a way to minimize power
consumption during buffer insertion. Without further effort, these ideas can be used to
minimize costs for delay, wiring, and placement congestion.
The O(|L|² · |V (A)|²) running time of the algorithm by Lillis et al. has been reduced to
O(|L|² · |V (A)| · log(|V (A)|)) by Shi and Li [SL05] by introducing a new pruning technique.
Li and Shi [LS06] could get rid of the quadratic dependency on the library size. They
achieved a running time of O(|L| · |V (A)|²). Li et al. [LZS12] presented a variant that has
running time O(|L|² |V (A)| + |L| |V (A)| |N |). If repeater library and net have constant size,
this is a linear time algorithm.
Other extensions to Van Ginneken’s core algorithm are related to the delay model.
Alpert et al. [ADQ99] published a version that can be used for the Elmore delay model
with slew propagation and even for higher-order delay models. Chen and Menezes [CM99]
showed how to model noise effects.
One drawback of Van Ginneken style algorithms is that they do not make any transfor-
mations to the initial tree. A natural approach is to cut each wire of the input tree into
small and uniformly sized pieces. Choosing small pieces improves the quality of dynamic
programming algorithms at the cost of a higher running time. Alpert and Devgan [AD97]
presented a strategy for wire segmentation that yields a compromise between running time
and quality. Alpert et al. [AHQ04] showed how to choose potential repeater positions
based on the library and on properties of the pins of the net.
Dynamic programming can also be applied to compute a buffering of a given Steiner
tree that (approximately) minimizes static power consumption or other costs associated
with each repeater type. The algorithm of Hu et al. [Hu+07] minimizes a cost function
\[
\sum_{\nu \in V(A),\, b(\nu) \in L} \mathrm{power}(b(\nu)) \tag{7.1}
\]
under the constraint that the input slew at each sink and repeater does not exceed a
certain constant α. To compute slews, they use the simplified slew model by Kashyap et
al. [Kas+04].
The more natural task of finding a cheapest solution (w. r. t. cost function (7.1)) that
meets required arrival time constraints at every sink has been proved to be NP-hard by
Shi et al. [SLA04]. This is true even if the cost function only attains integral values and if
delays are measured with the version of the Elmore delay model in which slew effects are
ignored.
Hu et al. [HLA09] gave a fully polynomial time approximation scheme for the task
identified as NP-hard by Shi et al. [SLA04]. For ε > 0 they compute a solution with the
following properties.
• The total cost (7.1) is at most 1 + ε times the cost of a cheapest timing-feasible
solution.
• The Elmore delay along each source-sink path does not exceed 1 + ε times the largest
delay that meets the required arrival time constraint for that sink.
The running time of their algorithm is
\[
O\left( \frac{|N|^2 \cdot |V(A)|^2 \cdot |L|^2}{\varepsilon^3} + \frac{|N|^3 |L|^2}{\varepsilon} \right).
\]
Romen [Rom15] extended this algorithm to general costs in his master’s thesis at the
Research Institute for Discrete Mathematics, University of Bonn. This master’s thesis was
co-supervised by me. For ε > 0 the algorithm by Romen [Rom15] computes a buffering of
a given Steiner tree (A, κ) minimizing the weighted sum
\[
\sum_{\nu \in V(A),\, b(\nu) \in L} \bigl( c(\kappa(\nu)) \cdot \mathrm{size}(b(\nu)) + c_{\mathrm{power}} \cdot \mathrm{power}(b(\nu)) \bigr)
+ \sum_{t \in N \setminus \{s\}} \lambda(t) \cdot \mathrm{Elmore}_{(A,\kappa)}(t)
\]
up to a factor 1 + ε in time
\[
O\left( \log(|V(A)|) \cdot \left( \frac{|V(A)|^3 |N|}{\varepsilon^2} + \frac{|V(A)|^3 |L|}{\varepsilon} \right) \right).
\]
7.2.2 The Fast Buffering Algorithm
A fast and successful approach for buffering is due to Bartoschek et al. [Bar+09], [Bar14].
Unlike the algorithms cited in Section 7.2.1, their Fast Buffering Algorithm allows
modifications of the initial topology. Since the fast variant of the buffering algorithm presented
in Section 7.3 is an extension of that algorithm, we now take a closer look at it.
For a more detailed description we refer to [Bar14].
Roughly speaking, the algorithm of Bartoschek et al. [Bar+09], [Bar14] is a dynamic
program that processes one single candidate only. The key ingredient to determine the
most promising candidate is a careful library pre-processing.
For a given routing layer z, a wire code wc, and a value ξpower ∈ [0, 1], we pre-compute
a buffering of a long path on layer z using wire code wc. All repeaters in that solution
have the same type and are placed equidistantly. The buffering minimizes
ξpower · “delay of the buffered path” + (1− ξpower) · “power consumption of that path”.
Bartoschek et al. [Bar+09][Bar14] show how to compute such a solution.
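To make the objective of this pre-processing concrete, the following Python sketch performs a plain grid search over repeater types and spacings. It is not the method of [Bar+09], [Bar14]: it assumes a basic Elmore stage delay without slew effects, measures delay and power per unit length of the path, and uses a made-up two-entry library.

```python
def best_chain(library, r, c, xi_power, spacings):
    """Pick repeater type and spacing for an infinitely long buffered path.

    library:  name -> (drive resistance, input capacitance,
                       intrinsic delay, power), all hypothetical parameters;
    r, c:     wire resistance and capacitance per unit length for the fixed
              layer and wire code;
    spacings: candidate distances between consecutive repeaters.
    Returns (score, name, spacing) minimizing, per unit length,
    xi_power * delay + (1 - xi_power) * power.
    """
    best = None
    for name, (res, cin, t0, power) in library.items():
        for d in spacings:
            # Elmore delay of one stage: repeater driving a wire of length d
            # that ends at the input pin of the next repeater.
            stage = t0 + res * (c * d + cin) + r * d * (c * d / 2 + cin)
            score = (xi_power * stage + (1 - xi_power) * power) / d
            if best is None or score < best[0]:
                best = (score, name, d)
    return best


toy_library = {"small": (4.0, 1.0, 1.0, 1.0), "big": (1.0, 2.0, 1.0, 4.0)}
grid = [1.0, 2.0, 4.0]
print(best_chain(toy_library, 1.0, 1.0, 1.0, grid))   # delay only
print(best_chain(toy_library, 1.0, 1.0, 0.0, grid))   # power only
```

With ξpower = 1 the stronger repeater at a moderate spacing wins in this toy instance, while ξpower = 0 favors the weak repeater at the largest spacing, which is the trade-off the parameter is meant to control.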
We call a triple (z,wc, ξpower) a buffering mode and the repeater type inserted in the
corresponding repeater chain the default repeater of that buffering mode. The stationary slews of
the chains serve as slew targets during the main algorithm.
Let (A, κ) be an initial Steiner tree for a net N with source s in a 2-dimensional grid
graph. During the algorithm a set of clusters is propagated bottom-up along A. Formally,
a cluster C is a triple (S(C), P (C),M(C)). S(C) is the set of sink pins that correspond to
C. That set can contain sinks of N and input pins of previously inserted repeaters, and is
partitioned into (possibly empty) subsets S+(C) and S−(C) that differ by polarity. In the
case that both S+(C) and S−(C) are non-empty sets, we might want to merge both parts.
Due to polarity restrictions this can only be done by inserting an inverter driving either
S−(C) or S+(C). Position P (C) is a possible position for such an inverter. If S+(C) = ∅
or S−(C) = ∅, position P (C) is not needed and can be chosen arbitrarily. We say that a
cluster C with S+(C) ≠ ∅, S−(C) ≠ ∅ is in parallel mode. The object M(C) is a buffering
mode assigned to C. Additionally, we keep track of required arrival times, slew limits, and
slew targets at the sinks of all clusters.
Initially, we create clusters (S, κ(ν),M) for all ν ∈ V (A) with S = {ν} if ν ∈ N\{s}
and S = ∅ otherwise.
Buffering modes M are selected according to the criticality of ν. Bartoschek et al. [Bar+09]
[Bar14] proposed a Min-Cost-Flow -based approach to assign a buffering mode to the initial
clusters. The default slew of the chosen buffering mode serves as a slew target for the
initial cluster.
We process (A, κ) in reverse topological order until we reach source s. While we do so,
we move and merge clusters until there is only one remaining cluster at s.
We also modify (A, κ). Apart from the feasible transformations described in Sec-
tion 7.1.2, we replace parts of (A, κ) connecting a set S of sink pins and a Steiner node
ν ∈ V (A) by an approximately shortest rectilinear Steiner tree for S ∪ {ν} whenever we
assign a repeater to a node ν. For details on these transformations we refer to [Bar14].
Note that Fast Buffering operates in two dimensions and, formally, repeater insertion includes
insertion of via stacks to reach the placement layer. The different layer and wire code
characteristics are encoded in the buffering modes.
Move. Let ζ = (ν, ω) ∈ E(A) and let C be a cluster at ω. While processing ζ we move C
to position κ(ν). First, we extract a short Steiner tree for S+(C)∪{ω} (unless S+(C) = ∅),
and S−(C) ∪ {ω} (unless S−(C) = ∅) with root ω.
For a point p on the straight line segment between κ(ν) and κ(ω) we extend each
of these trees by two new vertices ν1, ν2, and edges (ν1, ν2), (ν2, ω). For both trees, the
position of both new vertices will be p and we assign the default repeater for buffering
mode M(C) to ν2. By backward propagation along these trees assuming the slew target at
the sources ν1 and ν2, we check if we meet slew limit and slew target (for M(C)) at these
sources. Using binary search we can approximate the point p∗ closest to κ(ν) for which
that is the case (respectively p∗ = κ(ω) if such a point does not exist).
If p∗ = κ(ν), we move C to position p∗, update timing values at C, and merge C with
the cluster at ν. The merge step will be described in the next paragraph.
If p∗ ≠ κ(ν), we insert a repeater. If both S+(C) and S−(C) are non-empty, we insert
an inverter at position P (C), either driving all sinks in S+(C) or in S−(C). We obtain a
new cluster with sinks S−(C) ∪ {χ} or S+(C) ∪ {χ}, where χ is the sink pin of the newly
inserted inverter. All sinks of the new cluster have the same polarity and we continue to
move it to position κ(ν). Among all 2|L| possibilities to insert a repeater in one side of the
parallel cluster C we select the one lexicographically minimizing
• the total capacitance violations,
• the total slew violations,
• the following convex combination for a parameter ξbuf ∈ [0, 1]:
(1− ξbuf) · power(“repeater”)− ξbuf · “resulting worst slack”. (7.2)
In the case that C is not in parallel mode, we insert a new repeater at an unblocked
position closest to p∗ and obtain a new cluster containing the repeater’s sink pin only. As
before, we select the repeater type minimizing (7.2).
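The lexicographic selection rule can be written compactly with tuple comparison. In this Python sketch the option records and their numbers are assumptions; each record stands for one evaluated way of inserting a repeater.

```python
def select_repeater(options, xi_buf):
    """Return the option that lexicographically minimizes
    (capacitance violations, slew violations, convex combination (7.2))."""
    def key(option):
        tradeoff = (1 - xi_buf) * option["power"] - xi_buf * option["slack"]
        return (option["capvio"], option["slewvio"], tradeoff)
    return min(options, key=key)


# Hypothetical evaluation results for three repeater choices.
options = [
    {"name": "A", "capvio": 0.0, "slewvio": 0.0, "power": 2.0, "slack": 3.0},
    {"name": "B", "capvio": 0.0, "slewvio": 0.0, "power": 1.0, "slack": 0.0},
    {"name": "C", "capvio": 1.0, "slewvio": 0.0, "power": 0.0, "slack": 10.0},
]
print(select_repeater(options, 0.5)["name"])   # A: better slack outweighs power
print(select_repeater(options, 0.0)["name"])   # B: only power counts
```

Option C is never chosen here although it has the best slack and power, because any capacitance violation is compared first.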
Merge. After we have moved a cluster C to its parent cluster C ′, we perform a merge.
We say that C ′ is the parent cluster of C if C is the cluster at a node ω ∈ V (A) and C ′
is the cluster at a node ν ∈ V (A) with (ν, ω) ∈ E(A).
If |S+(C)| · |S−(C ′)| = |S+(C ′)| · |S−(C)| = 0, we can just add the sinks in S(C) to
S(C ′). Otherwise, we select the solution minimizing (7.2) from a list of possible merge
configurations. Examples of merge configurations are
• merge clusters without inserting a further repeater,
• resolve parallel mode of C by inserting a repeater at position P (C),
• resolve parallel mode of C and C ′, and add a further repeater driving all sinks of the
resolved cluster C.
Bartoschek ([Bar14] page 76, Figure 6.1 and Figure 6.9) listed and described 15 reasonable
merge configurations and the situations in which they can be applied. Since we do not
need the explicit configurations in this thesis, we omit details here.
If C ′ becomes a parallel cluster during the merge step, we update P (C ′) to the current
position.
Connect root. After we have reached the source, we make sure that all polarity
constraints are met and connect the remaining sinks to the root pin. To do this, we enumerate all
possibilities to successively insert 0, 1, or 2 repeaters in the last remaining cluster. Inserting
a repeater into a cluster C is done by extracting short rectilinear Steiner trees connecting
the sinks of the sets S+(C) or S−(C) with a Steiner node at source position associated
with the repeater. The repeater’s input pin replaces the sinks driven by its output pin in
S(C).
Among the solutions for which all sinks that remain in the final cluster have ident
polarity, we select the one minimizing (7.2). When evaluating a solution we also take
timing and electrical violations into account that we obtain after propagating through the
source gate.
7.3 An Algorithm for Cost-Based Buffering
In this section we describe an algorithm for cost-based buffering that uses ideas from
the Fast Buffering Algorithm [Bar+09][Bar14] described in Section 7.2.2 and the classical
dynamic programming buffering algorithms [Van90][LCL96] listed in Section 7.2.1.
The algorithm maintains and propagates a set of candidates, buffering and modifying
an initial Steiner tree (A, κ) as it does so. There will be two flavors of the algorithm. In
a basic version we create a possibly exponential number of candidates, leading to good
solutions but high running times. By using elements of the Fast Buffering Algorithm we
are able to prune or avoid computation of almost all candidates and achieve that their
number is constant at each position. We show how to select the right candidates to gain
large speed-ups without too large a loss in quality. We refer to the second variant as the fast
version.
All results of this section concerning the basic version of the algorithm are joint work
with Rodion Permin who in particular implemented a variant of it [Per16]. Permin worked
on the cost-based buffering problem during his master’s thesis at the Research Institute
for Discrete Mathematics, University of Bonn, under my co-supervision.
Before we describe the algorithm in detail, we define the basic data structures and
operations such as propagation, repeater insertion, and pruning.
For this section we fix an instance for the Minimum Cost Steiner Tree Buffering Problem
and use the notation of the problem formulation given in Section 7.1.4. Furthermore, we
assume that the global routing graph G is equal to the graph Gfine described in Section 7.1.1
as this is the situation we have to solve in practice.
7.3.1 Candidates and Candidate Pairs
The algorithm maintains a set of candidates representing sub-trees in the final solution.
A candidate C is associated with the set sinks(C) ⊆ N\{s} of sinks in that sub-tree, and
a polarity pol(C) ∈ {ident, invert, undefined} that specifies if the path between s and the
root of the represented sub-tree must contain an even or an odd number of inverters to
achieve that all sinks in sinks(C) meet their polarity constraint. The first case is indicated
as pol(C) = ident and we have pol(C) = invert in the second case. We write C∅ for the
invalid candidate that represents the empty sub-tree with no sinks. We set sinks(C∅) = ∅
and pol(C∅) = undefined.
Candidates are created by bottom-up propagation along the initial Steiner tree (A, κ)
and are linked with A by candidate pairs.
[Figure 7.3: three panels showing a net with sinks t1 (invert), t2 (ident), and t3 (invert), Steiner nodes ν and ω, and a position x]
(a) Initial Steiner tree for a net with 3 sinks.
(b) Final buffered Steiner tree. For simplicity, we consider repeaters as points.
(c) Some candidate pairs CC1, ..., CC5 and candidates. Candidates C other than the invalid candidate C∅ are labeled (sinks(C), pol(C)).
node(CC1) = ω, p(CC1) = κ(ω), type: parallel
node(CC2) = ν, p(CC2) = κ(ν), type: parallel
node(CC3) = ν, p(CC3) = κ(ν), type: invert
node(CC4) = ν, p(CC4) = x, type: invert
node(CC5) = ν, p(CC5) = x, type: ident
ident(CC1) = ({t2}, ident), invert(CC1) = ({t1, t3}, invert)
ident(CC4) = C∅, invert(CC4) = ({t1, t2, t3}, invert)
ident(CC5) = ({t1, t2, t3}, ident), invert(CC5) = C∅
Figure 7.3: Candidate pairs representing a final solution. Together with backtrace information
indicated as gray arrows (e. g. “CC2 can be obtained from CC1 by inserting an inverter of type l ∈ L
at the ident part”) we can obtain the solution from the candidates.
A candidate pair CC is a 4-tuple
CC = (node( CC), p( CC), ident( CC), invert( CC)),
where node( CC) ∈ V (A), p( CC) ∈ V (G) is κ(node( CC)) or a point on the edge entering
node( CC), and ident( CC), invert( CC) are candidates.
As in the Fast Buffering Algorithm we allow modifications of the initial Steiner tree, in
particular to set a node into parallel mode (see Definition 7.1, Item 2). As a consequence,
the sub-tree of A rooted at ν can fall into two parts that differ by polarity in a possible
final solution. These two sub-trees are represented by candidates ident( CC) and invert( CC).
It is possible that one of these candidates is the invalid candidate C∅, but we require that
pol(ident(CC)) = ident if ident(CC) ≠ C∅ and pol(invert(CC)) = invert if invert(CC) ≠ C∅. By
this property we can define the type of a candidate pair:
Definition 7.2 (Type of a candidate pair) A candidate pair CC has type
• ident if ident(CC) ≠ C∅ and invert(CC) = C∅,
• invert if ident(CC) = C∅ and invert(CC) ≠ C∅,
• parallel if ident(CC) ≠ C∅ and invert(CC) ≠ C∅,
where C∅ denotes the invalid candidate.
Given candidate pairs and backtrace information we can re-construct the represented
solutions. An example of how candidates and candidate pairs represent possible final solutions
can be found in Figure 7.3. The solution shown in Figure 7.3(b) arises from the instance
shown in Figure 7.3(a) by buffering. Five candidate pairs are shown in yellow boxes in
Figure 7.3(c).
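The data carried by candidates and candidate pairs can be mirrored by two small record types. This Python sketch is an assumption about a possible representation: `None` plays the role of the invalid candidate, positions are placeholder coordinate triples, and the type of a pair follows Definition 7.2.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional, Tuple


@dataclass(frozen=True)
class Candidate:
    sinks: FrozenSet[str]
    pol: str                        # "ident" or "invert"
    cap: float = 0.0
    slewlim: float = float("inf")
    cost: float = 0.0


@dataclass(frozen=True)
class CandidatePair:
    node: str                       # node(CC), a vertex of A
    pos: Tuple[int, int, int]       # p(CC), a position in G
    ident: Optional[Candidate]      # None stands for the invalid candidate
    invert: Optional[Candidate]

    def __post_init__(self):
        # The two parts must carry the polarity their slot promises.
        if self.ident is not None and self.ident.pol != "ident":
            raise ValueError("ident part must have polarity ident")
        if self.invert is not None and self.invert.pol != "invert":
            raise ValueError("invert part must have polarity invert")

    def pair_type(self):
        if self.ident is not None and self.invert is not None:
            return "parallel"
        if self.ident is not None:
            return "ident"
        if self.invert is not None:
            return "invert"
        raise ValueError("both parts are invalid")


# CC1 and CC5 of Figure 7.3(c); the coordinates are placeholders.
cc1 = CandidatePair("omega", (0, 0, 0),
                    Candidate(frozenset({"t2"}), "ident"),
                    Candidate(frozenset({"t1", "t3"}), "invert"))
cc5 = CandidatePair("nu", (5, 0, 0),
                    Candidate(frozenset({"t1", "t2", "t3"}), "ident"), None)
print(cc1.pair_type(), cc5.pair_type())
```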
Quality of candidates. To be able to evaluate the quality of solutions we associate
each candidate C other than the invalid candidate C∅ with
cap(C): the capacitance at the root of the sub-tree represented by C,
slewlim(C): the maximum slew such that no slew violation is induced in the represented
sub-tree as long as the slew at its root is not larger than this value, and
cost(C): the cost of the sub-tree according to the objective function of Section 7.1.4.
While cap(C) can be computed easily, exact computation of slew limits is impossible
unless we know the slew of the sub-tree’s root r. Similarly, computation of delays between
r and the sinks in sinks(C), and of slew violations requires knowledge about the slews (see
Section 7.1.3). These delays and slew violations are needed to compute costs.
Instead of working with the exact values, we make use of the library pre-processing and
Min-Cost-Flow-based buffering mode assignment by Bartoschek [Bar14] (see Section 7.2.2).
This yields a target slew slewtarget at any node of the initial Steiner tree that we can use
as an estimate on the root’s slew.
If a candidate has a slew violation (in particular if slewlim(C) < 0), it will no longer
be possible to obtain candidates with a non-negative slew limit by propagating from it.
Therefore, we relax violated slew limits by setting them to the slew target after we have
paid the price for the violation by increasing cost(C).
We define cap( C) = cost( C) = 0 and slewlim( C) =∞ for the invalid candidate C.
Creation and propagation. Initially, we create a candidate C(t) with sinks(C(t)) = {t}
and pol(C(t)) = pol(t) contained in an initial candidate pair CC(t) with node( CC(t)) =
p( CC(t)) = t for each sink t ∈ N\{s}.
All other candidates and candidate pairs can be obtained by
• propagation along a wire or via,
• repeater insertion, or
• merge with another candidate or candidate pair.
Note that for all these operations the resulting candidates and candidate pairs as well
as all quality estimates can be obtained from their predecessors by a constant number of
invocations of the timing functions presented in Section 7.1.3.
7.3.2 Dominance
To obtain an acceptable running time we have to keep the number of candidate pairs small.
By the NP-hardness result of Section 4.5.1 that already holds in the case of 2-terminal
nets and for the basic version of the Elmore delay model, we certainly won’t be able to
bound the number of created candidate pairs by a polynomial unless we are willing to prune
candidate pairs that could possibly lead to optimum solutions (or unless P=NP). However,
in some cases, candidate pairs cannot lead to an optimum solution as they are dominated.
Definition 7.3 (Dominance of candidates) Let C and C ′ be candidates with sinks(C) =
sinks(C ′) and pol(C) = pol(C ′). C is dominated by C ′ if
slewlim(C) ≤ slewlim(C ′), cap(C) ≥ cap(C ′), and cost(C) ≥ cost(C ′).
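Definition 7.3 translates directly into a comparison of three numbers. The following Python sketch uses a named tuple as an assumed candidate representation and adds a simple quadratic-time filter that keeps only non-dominated candidates; the filter is an illustration, not the pruning routine of the thesis.

```python
from collections import namedtuple

Cand = namedtuple("Cand", "sinks pol slewlim cap cost")


def dominated(c, d):
    """True if candidate c is dominated by candidate d (Definition 7.3)."""
    return (c.sinks == d.sinks and c.pol == d.pol
            and c.slewlim <= d.slewlim and c.cap >= d.cap and c.cost >= d.cost)


def prune(candidates):
    """Keep one representative of every non-dominated candidate."""
    kept = []
    for c in candidates:
        if any(dominated(c, k) for k in kept):
            continue                              # c brings nothing new
        kept = [k for k in kept if not dominated(k, c)]
        kept.append(c)
    return kept


sinks = frozenset({"t1", "t2"})
a = Cand(sinks, "ident", slewlim=10.0, cap=2.0, cost=5.0)
b = Cand(sinks, "ident", slewlim=9.0, cap=3.0, cost=6.0)    # worse than a everywhere
c = Cand(sinks, "ident", slewlim=12.0, cap=4.0, cost=4.0)   # incomparable to a
print(prune([b, a, c]))
```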
Using Definition 7.3 we can define dominance of candidate pairs.
Definition 7.4 (Dominance of candidate pairs) A candidate pair CC is dominated by
a pair CC ′ if all of the following conditions hold:
• p(CC) = p(CC ′),
• node(CC) = node(CC ′),
• ident(CC) is dominated by ident(CC ′),
• invert(CC) is dominated by invert(CC ′).
Intuitively, if CC is dominated by CC′, implementing the solution encoded by
CC′ instead of CC would never lead to a worse solution. This is not completely true, since
slew limits and delay costs are computed using an estimate slewtarget of the input slew,
but we accept this inaccuracy in order to gain the running time benefit.
Even more, we will prune a candidate pair if it is almost-dominated by another pair.
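Definition 7.5 (Almost-domination) Let γcost > 1, γcap > 0, and γslewlim > 0 be
fixed constants. We say that a candidate C is almost-dominated by a candidate C′ if
sinks(C) = sinks(C′), pol(C) = pol(C′), and
\[
\left\lfloor \frac{\mathrm{slewlim}(C)}{\gamma_{\mathrm{slewlim}}} \right\rfloor \le \left\lfloor \frac{\mathrm{slewlim}(C')}{\gamma_{\mathrm{slewlim}}} \right\rfloor, \quad
\left\lceil \frac{\mathrm{cap}(C)}{\gamma_{\mathrm{cap}}} \right\rceil \ge \left\lceil \frac{\mathrm{cap}(C')}{\gamma_{\mathrm{cap}}} \right\rceil, \quad
\left\lceil \log_{\gamma_{\mathrm{cost}}}(\mathrm{cost}(C)) \right\rceil \ge \left\lceil \log_{\gamma_{\mathrm{cost}}}(\mathrm{cost}(C')) \right\rceil .
\]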
A candidate pair CC is almost-dominated by a pair CC ′ if
• p(CC) = p(CC′),
• node(CC) = node(CC′),
• ident(CC) is almost-dominated by ident(CC′),
• invert(CC) is almost-dominated by invert(CC′).
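The rounding in Definition 7.5 can be read as mapping each candidate to a triple of buckets and comparing the buckets. The following Python sketch illustrates this for single candidates. The names are hypothetical, the default values of the three constants are the ones quoted for the default setting in Section 8.1.3, and two conventions are assumptions of the sketch: cost 0 is mapped to the bucket −∞ and an infinite slew limit to the bucket ∞. Floating-point logarithms may misplace costs that lie exactly on a bucket boundary.

```python
import math
from typing import FrozenSet, NamedTuple, Tuple


class Candidate(NamedTuple):
    sinks: FrozenSet[str]
    pol: str
    slewlim: float
    cap: float
    cost: float


def buckets(c: Candidate, g_cost: float, g_cap: float, g_slewlim: float) -> Tuple[float, float, float]:
    """Rounded (slew limit, capacitance, cost) values used in Definition 7.5."""
    slew_b = math.floor(c.slewlim / g_slewlim) if math.isfinite(c.slewlim) else math.inf
    cap_b = math.ceil(c.cap / g_cap)
    # Assumption: cost 0 falls into the lowest possible bucket.
    cost_b = math.ceil(math.log(c.cost, g_cost)) if c.cost > 0 else -math.inf
    return slew_b, cap_b, cost_b


def almost_dominated(c: Candidate, d: Candidate,
                     g_cost: float = 1.1, g_cap: float = 1.0, g_slewlim: float = 1.0) -> bool:
    """True if c is almost-dominated by d."""
    if c.sinks != d.sinks or c.pol != d.pol:
        return False
    c_slew, c_cap, c_cost = buckets(c, g_cost, g_cap, g_slewlim)
    d_slew, d_cap, d_cost = buckets(d, g_cost, g_cap, g_slewlim)
    return c_slew <= d_slew and c_cap >= d_cap and c_cost >= d_cost


t = frozenset({"t"})
c1 = Candidate(t, "ident", 50.4, 3.2, 10.0)
c2 = Candidate(t, "ident", 50.1, 3.9, 10.4)   # same buckets as c1
c3 = Candidate(t, "ident", 50.4, 3.2, 12.0)   # higher cost bucket than c1
```

Here `c1` and `c2` almost-dominate each other although neither dominates the other in the sense of Definition 7.3, which is exactly the situation in which the order of creation has to break the tie.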
The rounding used in Definition 7.5 is similar to the rounding of costs into buckets used
in the FPTAS of Section 4.5.2. To obtain better running times we also round capacitances
and slew limits here.
During the algorithm we have to check if a new candidate pair is almost-dominated by
any previously computed candidate pair and to determine the previous candidate pairs
that are almost-dominated by the new pair. To do this fast, we sort the set of candidate
pairs for the same node and at the same position by cap(ident(·)) + cap(invert(·)). To
check if a new candidate pair CC is almost-dominated, we scan through all previous pairs
with total capacitance at most cap(ident(CC)) + cap(invert(CC)). The candidate pairs
almost-dominated by CC must have a total capacitance of at least this value. During
insertion of CC we make sure to maintain that sorting.
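The bookkeeping described above can be sketched with a list kept sorted by total capacitance. The Python fragment below is a simplified illustration and not the data structure of the implementation: a candidate pair is reduced to a tuple (total capacitance, cost), and exact dominance on these two values stands in for almost-domination. For the rounded comparison of Definition 7.5 one would presumably sort by a rounded capacitance key so that the restriction of the scan remains valid; this is an assumption of the sketch.

```python
import bisect
from typing import List, Tuple

Item = Tuple[float, float]  # (total capacitance, cost): stand-in for a candidate pair


def is_dominated(a: Item, b: Item) -> bool:
    """a is dominated by b if b has at most a's capacitance and at most a's cost."""
    return a[0] >= b[0] and a[1] >= b[1]


class SortedPairs:
    """Pairs for one node and position, sorted by total capacitance."""

    def __init__(self) -> None:
        self.items: List[Item] = []

    def insert(self, new: Item) -> bool:
        caps = [item[0] for item in self.items]
        # Only pairs with capacitance at most cap(new) can dominate new.
        hi = bisect.bisect_right(caps, new[0])
        if any(is_dominated(new, old) for old in self.items[:hi]):
            return False
        # Only pairs with capacitance at least cap(new) can be dominated by new.
        lo = bisect.bisect_left(caps, new[0])
        survivors = [old for old in self.items[lo:] if not is_dominated(old, new)]
        self.items = self.items[:lo] + survivors
        bisect.insort(self.items, new)  # keeps the list sorted by capacitance
        return True


store = SortedPairs()
results = [store.insert(p) for p in [(3.0, 10.0), (4.0, 12.0), (2.0, 11.0), (1.0, 5.0)]]
```

A new pair that ties with an existing one is rejected, so the earlier pair is never erased in favour of a later equivalent one.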
7.3.3 Infeasible Repeater Positions
Not all repeater types can be placed at all positions. In our setting G = Gfine (Section 7.1.1)
we assume that we are allowed to insert repeaters of any type in all positions on the lowest
layer and that we are not allowed to insert any repeater on higher layers. Blockages on the
placement layer can be modeled by infinite placement costs.
A natural operation for a buffering algorithm in this standardized setting is to create
stacked vias leading from a position u on a higher layer to the point u↓ on the lowest
layer that shares x- and y-coordinates with u, and to create stacked vias from u↓ to u after
having placed a repeater at u↓. This way we can realize repeater insertion at all positions
u ∈ V (G) as a sequence of via propagations and a repeater insertion although insertion
is forbidden at u. From a Steiner tree point of view, these operations result in a feasible
modification of Type 1 as described in Definition 7.1. For an illustration of a repeater
insertion on a higher layer see Figure 7.4(c). We are not allowed to insert the depicted
inverter at position κ(ν) directly but can reach κ(ν)↓ with stacked vias.
In a more general setting in which G ≠ Gfine, inserting repeaters at forbidden positions
can be modeled in a similar way by defining close points u↓ in which repeater insertion is
allowed.
Instance: An instance of the Minimum Cost Steiner Tree Buffering Problem.
Output: A buffered Steiner tree.
① Subdivide edges of (A, κ) as described in Section 7.3.5.
② Pre-process the library and assign buffering modes as in Section 7.2.2.
③ Create initial candidate pairs representing sub-trees containing the sinks only.
④ Sort V′ = {ν ∈ V (A) : |δ⁺_A(ν)| > 1} ∪ {s} in reverse topological ordering.
⑤ for ν ∈ V′ do
⑥     for each ζ ∈ δ⁺_A(ν) do
⑦         Pζ := maximal path in A starting with ζ without internal vertices in V′.
⑧         Apply the move step of Section 7.3.6 to Pζ.
⑨         CCζ := set of final candidate pairs at ν.
⑩     Apply the merge step of Section 7.3.8 to merge the sets CCζ (ζ ∈ δ⁺_A(ν)).
⑪ Select and apply the best candidate pair as in Section 7.3.9.
Algorithm 9: Overview of the basic version of the algorithm for cost-based buffering described
in Section 7.3.
7.3.4 Overview of the Algorithm
We can now start with the description of the algorithm. A high-level description is given
in this section and the details can be found in the subsequent sections.
After a pre-processing step (Section 7.3.5) and after having created initial candidate
pairs representing sub-trees consisting of the sinks only (Section 7.3.1) we process the
initial tree (A, κ) in reverse topological order. We use a Dijkstra-based [Dij59] move step to
propagate candidate pairs along maximal paths with the property that all internal nodes
have out-degree one (Section 7.3.6). At the Steiner nodes with out-degree at least two we
merge candidate pairs (Section 7.3.8). After we have propagated all candidate pairs to
the source of N we select a best candidate and obtain our final solution by backtracing
(Section 7.3.9). Algorithm 9 shows a schematic overview.
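To make the traversal order concrete, the following Python sketch computes the set V′ of branching nodes together with the source, a reverse topological ordering of V′, and the maximal paths along which the move step would be applied. It covers only this bookkeeping; the move and merge steps themselves are omitted. The representation of the arborescence as a dictionary `children` and all function names are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def traversal_plan(children: Dict[str, List[str]], source: str):
    """Return V' in reverse topological order and the maximal paths leaving each node of V'."""
    v_prime = {v for v, ch in children.items() if len(ch) > 1} | {source}

    # Iterative post-order DFS: every node is emitted after all of its descendants.
    order: List[str] = []
    stack: List[Tuple[str, bool]] = [(source, False)]
    while stack:
        v, expanded = stack.pop()
        if expanded:
            if v in v_prime:
                order.append(v)
            continue
        stack.append((v, True))
        for w in children.get(v, []):
            stack.append((w, False))

    # For every edge leaving a node of V', follow nodes of out-degree one
    # until a sink or another node of V' is reached.
    paths: Dict[Tuple[str, str], List[str]] = {}
    for v in order:
        for first in children.get(v, []):
            path, w = [v, first], first
            while w not in v_prime and children.get(w):
                w = children[w][0]
                path.append(w)
            paths[(v, first)] = path
    return order, paths


# Toy arborescence: s -> a, a -> {b, c}, b -> t1, c -> {t2, t3}.
children = {"s": ["a"], "a": ["b", "c"], "b": ["t1"], "c": ["t2", "t3"]}
order, paths = traversal_plan(children, "s")
```

In the toy instance the node c is processed before a, and a before s, so all candidate sets needed for a merge are available when the merge takes place.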
7.3.5 Pre-Processing
The algorithm starts by dividing all wiring edges of (A, κ) crossing a tile border by a
Steiner node placed in the tile with cheaper placement cost (see Figure 7.4). If both tiles
have the same placement costs, we arbitrarily choose one of the adjacent tiles. Moreover,
we subdivide the edges entering the sinks by a Steiner node at sink position. This enables
us to place a repeater directly before a sink, which is sometimes useful if sinks have a large
capacitance. By subdividing the edge leaving the source s of net N two times and placing
the new Steiner nodes at the source’s position, we will be able to obtain good results even if
s is the output pin of a small gate. Recall that this situation is accounted for by the root
connect step in the Fast Buffering Algorithm.
Figure 7.4: In the pre-processing step of the algorithm of Section 7.3 we subdivide all wiring
edges to ensure that there are points near all tile border crossings. Placement resource prices are
taken into account when crossing points are positioned.
(a) Edge in the initial Steiner tree crossing two tile borders (tiles A, B, C). The edge is a wiring
edge on a higher layer. Placing a repeater in tile A or B is cheap while the placement resource
price for tile C is large.
(b) We create Steiner nodes near each crossing point but make sure that the Steiner node ν near
the crossing point of tile B and C belongs to the tile with cheaper placement cost (B).
(c) Placing a repeater at position κ(ν) can be simulated by a sequence of via insertions and a
repeater insertion at point κ(ν)↓ on the placement layer directly below κ(ν). We pay the
placement resource price for tile B.
Additionally, we run the library pre-processing step and the buffering mode assignment of
the Fast Buffering Algorithm (Bartoschek et al. [Bar+09], Bartoschek [Bar14], Section 7.2.2).
As a result we obtain a slew target for each node of (A, κ) that we require for candidate
propagation. We also store the (stationary) output capacitance of the repeaters in the long
repeater chain computed during the library pre-processing for each buffering mode. This
yields a capacitance target of which we make use during the move and merge step of the
faster variant of the algorithm.
7.3.6 The Move Step
Let P be a maximal path in A with the property that none of its internal vertices has
out-degree larger than 1. Let ν, ω ∈ V (A) be its endpoints (i. e. P = A[ν,ω]).
We use a Dijkstra-like [Dij59] algorithm to propagate initial candidate pairs with node
ω and position κ(ω) along P to ν. If ω ∈ N\{s}, there is only one initial candidate
pair: The pair representing the sub-tree consisting of ω only. If ω is a Steiner node
with out-degree larger than one, the initial candidate pairs are a result of the merge
step (see Section 7.3.8). All these initial candidate pairs are stored in a heap with key
cost(ident( CC)) + cost(invert( CC)).
While the heap is non-empty we do the following. First, we erase a candidate pair
CC with minimum key from the heap. If node( CC) = ν, we mark it as a final candidate
pair. Otherwise, we create new candidate pairs by propagation along the edge e =
κ((ν, node( CC))), where (ν, node( CC)) is the edge entering node( CC) in A. We create two
types of new candidate pairs: candidate pairs that arise from CC by propagation along a
wire or a via without inserting new repeaters and candidate pairs that arise by repeater
insertion.
Figure 7.5: Propagation of candidate pairs with repeater insertion. We apply the move step of
Section 7.3.6 on the path from ν to ω. By binary search we look for position u “between” p(CC) and
κ(ν) such that we do not create an electrical violation. We create a new candidate with position u
that represents repeater insertion.
(a) Non-parallel mode. Creating a new candidate pair left of u would introduce an electrical
violation.
(b) Parallel mode. Creating a new candidate pair left of u would introduce an electrical violation.
The new candidate ends the parallel mode.
Propagation without repeater insertion. We propagate CC to position κ(ν). The
resulting candidate pair is added to the heap unless it is almost-dominated by a candidate
pair that is currently contained in the heap or that has been removed before. The new
candidate pair CC ′ has node( CC ′) = ν and p( CC ′) = κ(ν). Note that CC and CC ′ have the
same type.
Propagation with repeater insertion. When we create new candidate pairs by insertion
of further repeaters we have to distinguish whether CC is of parallel type or not.
If CC is not of parallel type, we do the following. For each repeater l ∈ L we look for
a point u on the straight path segment between p(CC) and κ(ν) if e is a wiring edge,
or in {p(CC), κ(ν)} if e is a via. We want to create a new candidate pair CC(u, l)
arising by propagation to u and insertion of a repeater of type l in the valid part as
described in Section 7.3.3. Position u is chosen such that the downstream capacitance of
the inserted repeater does not exceed the capacitance limit of l and such that the slew limit
of the new candidate is not smaller than the slew target. By binary search we can find u
as far away from p( CC) as possible. If no such point exists, we choose u = p( CC). This is
the case if the capacitance of the valid part of CC is already larger than the capacitance
limit of l or if a repeater of type l cannot drive CC at all without violating the slew limit.
We add CC(u, l) to the heap unless it is almost-dominated by a candidate pair contained in
the heap or already removed from it, and set node( CC(u, l)) = ν if u = κ(ν). An example
of non-parallel repeater insertion during the move-step can be found in Figure 7.5(a).
If CC is of parallel type, we create candidate pairs resolving the parallel mode. For each
inverter l ∈ L and each polarity pol ∈ {ident, invert} we use binary search to find the
point u as far away from p( CC) as possible such that after propagation to position u and
insertion of an inverter of type l in the pol-part of CC we obtain a candidate pair without
capacitance violations and for which no slew limit is smaller than the slew target. As before,
we choose u on the straight path segment between p( CC) and κ(ν) if e is a wiring edge and
in {p(CC), κ(ν)} if e is a via. If no such point u exists, we choose u := κ(ν). We
add the resulting candidate pair to the heap unless it is almost-dominated. An example of
parallel repeater insertion during the move-step can be found in Figure 7.5(b).
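The binary search for u on a wiring edge can be sketched as follows. The sketch assumes that feasibility is monotone along the segment, i.e. once a position is infeasible all positions farther from p(CC) are infeasible as well, and it represents positions by their distance from p(CC). The function name, the tolerance parameter and the toy capacitance model are illustrative only; in the algorithm the feasibility test would evaluate the capacitance limit and the slew limit through the timing functions of Section 7.1.3.

```python
from typing import Callable


def farthest_feasible(length: float, feasible: Callable[[float], bool], tol: float = 1e-6) -> float:
    """Largest distance d in [0, length] from p(CC) with feasible(d), up to tol.

    Returns 0.0 if not even d = 0 is feasible, which mirrors the convention
    u = p(CC) of the non-parallel case.
    """
    if not feasible(0.0):
        return 0.0
    if feasible(length):
        return length
    lo, hi = 0.0, length          # invariant: feasible(lo) and not feasible(hi)
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if feasible(mid):
            lo = mid
        else:
            hi = mid
    return lo


def capacitance_ok(cap_valid: float, cap_per_length: float, caplim: float) -> Callable[[float], bool]:
    """Toy feasibility test: candidate capacitance plus wire capacitance within the limit."""
    return lambda d: cap_valid + cap_per_length * d <= caplim


u_long = farthest_feasible(5.0, capacitance_ok(4.0, 2.0, 10.0))    # limit is reached at d = 3
u_short = farthest_feasible(2.0, capacitance_ok(4.0, 2.0, 10.0))   # whole segment is feasible
u_none = farthest_feasible(5.0, capacitance_ok(12.0, 2.0, 10.0))   # infeasible from the start
```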
When we compute the positions u we only take electrical violations into account.
Positions for which delays are optimum can be between p( CC) and u. In our practical
application, tile sizes are small compared to optimum distances between repeaters and
the error we make by computing u based on electrical violations only is also small. In
the presence of large tile sizes we can make use of the stationary slew slewtarget in the
pre-computed optimum repeater chain (Bartoschek et al. [Bar+09]) and compute u such
that CC(u, l) does not result in a slew larger than slewtarget at CC. This can be achieved by
reducing the slew limit of every valid part of CC to slewtarget.
Every time we add a candidate pair CC to the heap, we erase all candidate pairs CC′ (≠ CC)
that are almost-dominated by CC from the heap. Note that due to non-negativity of
the cost function, a candidate pair CC1 can only be almost-dominated by a pair CC2 that
is created after CC1 is extracted from the heap if CC1 almost-dominates CC2 as well.
When pruning almost-dominated candidates we have to make sure
that whenever CC1 and CC2 almost-dominate each other, we never erase the earlier candidate
pair.
Although almost-domination is already sufficient to prune most candidate pairs, their
number can still be exponential in general (polynomial runtime bounds can be given
under restrictions on the input, see Permin [Per16]). In order to
obtain acceptable running times in practice, Permin [Per16] developed and implemented
speed-up techniques such as future costs and caching of transitions that lead to
almost-dominated solutions. By increasing the values γcost, γcap, γslewlim we can obtain
further speed-ups, albeit at the cost of worse solutions.
In the next section we show how we can prune and avoid most candidate pairs and
thus, obtain a polynomial running time. These techniques are used in the fast version of
the algorithm.
7.3.7 Speed-up Techniques for the Move Step in the Fast Version
Candidate pruning. In the fast version we prune candidate pairs until the number of
pairs of each type and at each position of the form κ(ν) for ν ∈ V (P ) is bounded by a
constant kinternal.
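During pruning we make sure not to prune too many candidate pairs whose total
capacitance is below the target capacitance captarget (see Section 7.3.5), nor too many
whose total capacitance is below caplimit := max{caplim(l) : l ∈ L}.
Let $\mathcal{CC}$ be a set of candidate pairs with the same type and with the same position in
{κ(ν) : ν ∈ V (P )} and let
\[
\mathcal{CC}_1 := \{ CC \in \mathcal{CC} : \mathrm{cap}(\mathrm{ident}(CC)) + \mathrm{cap}(\mathrm{invert}(CC)) \le \mathrm{cap}_{\mathrm{target}} \},
\]
\[
\mathcal{CC}_2 := \{ CC \in \mathcal{CC} : \mathrm{cap}(\mathrm{ident}(CC)) + \mathrm{cap}(\mathrm{invert}(CC)) \le \mathrm{cap}_{\mathrm{limit}} \}.
\]
We define the cost of a candidate pair CC in the canonical way as cost(ident(CC)) +
cost(invert(CC)). First, we add the
$\min\{ |\mathcal{CC}_1|, \lceil (k_{\mathrm{internal}} - 1)/2 \rceil \}$
cheapest candidate pairs from $\mathcal{CC}_1$ to a set $\mathcal{CC}_{\mathrm{kept}}$ of kept candidates. Then, we add the
$\min\{ |\mathcal{CC}_2 \setminus \mathcal{CC}_{\mathrm{kept}}|, \lfloor (k_{\mathrm{internal}} - 1)/2 \rfloor \}$
cheapest candidate pairs from $\mathcal{CC}_2 \setminus \mathcal{CC}_{\mathrm{kept}}$ to $\mathcal{CC}_{\mathrm{kept}}$ and fill up with the
$k_{\mathrm{internal}} - |\mathcal{CC}_{\mathrm{kept}}|$ cheapest candidate pairs in $\mathcal{CC} \setminus \mathcal{CC}_{\mathrm{kept}}$.
Finally, we erase all candidate pairs in $\mathcal{CC} \setminus \mathcal{CC}_{\mathrm{kept}}$.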
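The three-stage selection can be written down in a few lines. In the following Python sketch a candidate pair is reduced to a tuple (total capacitance, total cost); the function name `prune` and this reduced representation are assumptions for illustration only.

```python
import math
from typing import List, Tuple

Pair = Tuple[float, float]  # (total capacitance, total cost) of a candidate pair


def prune(pairs: List[Pair], k_internal: int, cap_target: float, cap_limit: float) -> List[Pair]:
    """Keep at most k_internal pairs following the three stages described above."""
    by_cost = sorted(range(len(pairs)), key=lambda i: pairs[i][1])  # cheapest first
    kept: List[int] = []

    def take(indices: List[int], count: int) -> None:
        # Add the `count` cheapest not yet kept pairs among `indices`.
        for i in indices:
            if count <= 0:
                break
            if i not in kept:
                kept.append(i)
                count -= 1

    # Stage 1: cheapest pairs below the target capacitance.
    take([i for i in by_cost if pairs[i][0] <= cap_target], math.ceil((k_internal - 1) / 2))
    # Stage 2: cheapest remaining pairs below the largest capacitance limit.
    take([i for i in by_cost if pairs[i][0] <= cap_limit], (k_internal - 1) // 2)
    # Stage 3: fill up with the cheapest remaining pairs.
    take(by_cost, k_internal - len(kept))
    return [pairs[i] for i in kept]


pairs = [(1.0, 9.0), (2.0, 8.0), (3.0, 7.0), (6.0, 1.0),
         (7.0, 2.0), (8.0, 3.0), (20.0, 0.5), (21.0, 0.6)]
kept_pairs = prune(pairs, k_internal=5, cap_target=3.0, cap_limit=8.0)
```

With kinternal = 5 the first two stages reserve two slots each, so low-capacitance pairs survive even though the globally cheapest pairs in this toy instance have very large capacitance.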
The best speed-up due to pruning is obtained if we already prune during Dijkstra’s
algorithm. To avoid problems after pruning candidate pairs that are predecessors of pairs
that we do not prune, we can re-define the keys of the elements stored in a heap such that
we select candidate pairs CC lexicographically minimizing
\[
\bigl( \mathrm{node}(CC),\ \mathrm{cost}(\mathrm{ident}(CC)) + \mathrm{cost}(\mathrm{invert}(CC)) \bigr),
\]
where the sorting of the nodes is the reverse topological ordering of A.
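With a binary heap, such a lexicographic key is simply a tuple. The Python sketch below uses the standard `heapq` module; the dictionary `rank`, which maps each node to its index in the reverse topological ordering, the running counter used as a tie-breaker, and the string payloads are assumptions for illustration.

```python
import heapq
import itertools

# Index of each path node in the reverse topological ordering (assumed given).
rank = {"omega": 0, "mu": 1, "nu": 2}

heap = []
counter = itertools.count()  # tie-breaker, so that payloads are never compared


def push(node: str, cost: float, payload: str) -> None:
    """Insert a candidate pair with key (node rank, cost)."""
    heapq.heappush(heap, (rank[node], cost, next(counter), payload))


def pop() -> str:
    """Extract the candidate pair with lexicographically smallest key."""
    return heapq.heappop(heap)[3]


push("nu", 1.0, "pair at nu")
push("omega", 5.0, "expensive pair at omega")
push("omega", 2.0, "cheap pair at omega")
extraction_order = [pop() for _ in range(3)]
```

All pairs at one node are extracted before any pair at a later node, so pruning at a node cannot remove a predecessor of a pair that has already been created from it at the same node.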
Since we will actually prune most candidate pairs created during the move operation
we should also avoid creating them in the first place.
Avoiding candidate creation. Let CC be a candidate pair and let (ν, node( CC)) ∈ E(P ).
During the move step we count the number of pairs with node ν and at position κ(ν)
that we have extracted from the heap. If this number has reached 2 · kinternal, we will no
longer create further candidate pairs arising from CC. As soon as the number has reached
2 · kinternal we perform the pruning mentioned above.
In addition to this strict method, we can use characteristics of the input library to
avoid creation of too many candidates. To use this method we have to assume that we can
sort repeaters in L = {l1, . . . , l|L|} by their strength such that for a given candidate C we
can determine a smallest index i ∈ {1, . . . , |L|} ∪ {∞} such that
caplim(lj) ≥ cap(C) and outslew(lj, slewtarget, cap(C)) ≤ slewlim(C)
if and only if j ≥ i. In practice, such a sorting is usually possible. Let CC be a candidate
pair extracted from the heap containing a valid candidate C ∈ {ident( CC), invert( CC)}. If
the index i of the smallest feasible repeater for C is ∞, we avoid creating candidate pairs
arising from CC by propagation along an edge or by inserting a repeater driving C that is
not the largest possible repeater. If i < ∞, we avoid creating candidate pairs by inserting
repeaters with index < i that drive C.
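Under the monotonicity assumption above, the index i can be found by binary search over the sorted library. In the following Python sketch indices are 0-based, `None` plays the role of ∞, and the callable `outslew(j, cap)` is a hypothetical stand-in for the timing function outslew(lj, slewtarget, cap(C)); the toy library is invented for the example.

```python
from typing import Callable, Optional, Sequence


def smallest_feasible_index(caplim: Sequence[float],
                            outslew: Callable[[int, float], float],
                            cap: float, slewlim: float) -> Optional[int]:
    """Smallest 0-based index j such that repeater j can drive the candidate, or None.

    Assumes that feasibility is monotone in j: infeasible for j < i, feasible for j >= i.
    """
    lo, hi = 0, len(caplim)
    while lo < hi:
        mid = (lo + hi) // 2
        if caplim[mid] >= cap and outslew(mid, cap) <= slewlim:
            hi = mid          # mid is feasible, look for a weaker repeater
        else:
            lo = mid + 1      # mid is infeasible, only stronger repeaters remain
    return lo if lo < len(caplim) else None


# Toy library sorted by strength; stronger repeaters have larger limits and faster output slews.
strength = [1.0, 2.0, 4.0, 8.0]
caplim = [5.0, 10.0, 20.0, 40.0]


def toy_outslew(j: int, cap: float) -> float:
    return 10.0 + 20.0 * cap / strength[j]


i_medium = smallest_feasible_index(caplim, toy_outslew, cap=8.0, slewlim=60.0)
i_small = smallest_feasible_index(caplim, toy_outslew, cap=1.0, slewlim=60.0)
i_none = smallest_feasible_index(caplim, toy_outslew, cap=50.0, slewlim=60.0)
```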
Limiting the number of final candidates. Unless path P starts at the source s, the set of
final candidate pairs will be the input to a merge step (Section 7.3.8). Keeping the number
of them small usually has an even greater impact on the running time than bounding the
number of internal candidate pairs.
To decrease the set of final candidate pairs, we only keep the kfinal pairs with smallest
score for a constant kfinal:
Definition 7.6 (Score of a candidate pair) The score of a valid candidate pair CC is
\[
\mathrm{cost}(CC) + \lambda(CC) \cdot \min\{ d_{\mathrm{repeater}}(l, \mathrm{slewtarget}, \mathrm{cap}(CC)) : l \in L \},
\]
where λ(CC) is the sum of prices over all timing resources entering the sinks in
sinks(ident(CC)) ∪ sinks(invert(CC)), and
\[
\mathrm{cap}(CC) := \mathrm{cap}(\mathrm{ident}(CC)) + \mathrm{cap}(\mathrm{invert}(CC)), \qquad
\mathrm{cost}(CC) := \mathrm{cost}(\mathrm{ident}(CC)) + \mathrm{cost}(\mathrm{invert}(CC)).
\]
Intuitively, the score is designed to penalize candidate pairs with high capacitances. It
allows us to compare the timing benefit of a repeater insertion on the overall solution with
the additional cost incurred by the insertion.
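The effect of the score can be seen in a small numerical experiment. The Python sketch below replaces the repeater delay function drepeater by a hypothetical linear model (intrinsic delay plus resistance times load capacitance); this model, the library values and all names are assumptions and not part of the timing model of Section 7.1.3.

```python
import heapq
from typing import List, NamedTuple, Tuple


class Pair(NamedTuple):
    name: str
    cost: float  # cost(ident(CC)) + cost(invert(CC))
    cap: float   # cap(ident(CC)) + cap(invert(CC))
    lam: float   # sum of timing resource prices of the sinks, lambda(CC)


Repeater = Tuple[float, float]  # (intrinsic delay, output resistance), toy model


def score(pair: Pair, library: List[Repeater]) -> float:
    """Score in the sense of Definition 7.6 with a linear toy delay model."""
    best_delay = min(intrinsic + resistance * pair.cap for intrinsic, resistance in library)
    return pair.cost + pair.lam * best_delay


def keep_best(pairs: List[Pair], library: List[Repeater], k_final: int) -> List[Pair]:
    """Keep the k_final pairs with smallest score."""
    return heapq.nsmallest(k_final, pairs, key=lambda p: score(p, library))


library = [(5.0, 4.0), (8.0, 1.0)]


def winner(lam: float) -> str:
    unshielded = Pair("unshielded", cost=10.0, cap=8.0, lam=lam)
    shielded = Pair("shielded", cost=12.0, cap=2.0, lam=lam)   # paid for one more repeater
    return keep_best([unshielded, shielded], library, k_final=1)[0].name
```

For uncritical sinks (small λ) the cheaper unshielded pair wins, while for critical sinks (large λ) the pair that has already paid for a shielding repeater is preferred.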
At this point we note that it is not difficult to construct examples in which we prune
or avoid candidate pairs representing the optimum solution. This is not surprising as the
Minimum Cost Steiner Tree Buffering Problem is NP-hard (cf. Section 4.5.1).
7.3.8 The Merge Step
Let ν ∈ V (A) with δ⁺_A(ν) = {(ν, ωi) : i = 1, . . . , |δ⁺_A(ν)|}. As we proceed in reverse
topological ordering, we have already computed sets CCi of final candidate pairs obtained
from the move steps for the paths starting with the edges (ν, ωi) for i = 1, . . . , |δ⁺_A(ν)|.
Without loss of generality we assume that |δ⁺_A(ν)| = 2, as otherwise we can iteratively
merge the candidate pairs in CCi for i ≥ 3 with the candidate pairs resulting from the
previous merge steps.
In the basic version, the merge step consists just of merging all candidates in CC1 with
all candidates in CC2 and pruning almost-dominated candidates. Furthermore, in the
implementation of Permin [Per16], only the k candidate pairs with smallest cost (with
respect to cost(ident(.)) + cost(invert(.))) of each type are kept for an input parameter k.
All other pairs are erased. The remaining candidate pairs are used as initial candidates for
the move step along the path in A ending in ν.
In the fast version we use a more sophisticated merge step. First, we extend the sets
CC1 and CC2 by computing new candidate pairs obtained from the previous ones CC by
inserting a further repeater (where we use Section 7.3.3 if repeater insertion at position
κ(ν) is not allowed). More precisely, for each i = 1, 2 we extend CCi to a set CC′i as follows:
• For all candidate pairs CC of type pol ∈ {ident, invert} we add all candidate pairs
obtained from CC by inserting a repeater of type l ∈ L to CC ′i.
• For all candidate pairs CC of parallel type we compute all 2 · |{l ∈ L : l inverter}|
possible candidate pairs obtained from CC by resolving parallel mode.
After merging the candidate pairs in the extended set CC′1 with those in CC′2 we obtain a set CC of
merged pairs. As above, we extend CC by adding candidates modeling insertion of further
repeaters. Note that all merge configurations described by Bartoschek ([Bar14] page 76,
Figure 6.1 and Figure 6.9) can be obtained that way.
Among this set we select the kfinal candidate pairs with smallest score (see Definition 7.6)
for the constant kfinal that we already used at the end of the move step in the fast mode.
The set of initial candidates for the move step along the path in A ending in ν consists of
these pairs only.
As before, the score can penalize candidate pairs with high capacitances. Inserting a
repeater next to a merge point can be a good idea as it shields capacitances on a side branch
and thus decreases delay costs on the incoming path. On the other hand, candidate pairs
representing such a solution usually have higher costs as the price for repeater insertion is
already contained in the cost stored in their candidates. With the definition of the score
we can compare the timing benefit of a repeater insertion with the additional cost.
7.3.9 Choosing a Final Solution
After having arrived at the source we select the cheapest among all candidate pairs CC
with node(CC) = p(CC) = s for which invert(CC) is the invalid candidate. Unlike for the intermediate candidates for
which we had to make use of a slew estimate, we can evaluate the objective function for
these final candidate pairs exactly.

Chapter 8
BonnRouteBuffer: A Tool for Global
Buffering
In this chapter we put together the results of the previous chapters and obtain a tool for
timing-constrained global routing with buffered Steiner trees. Our tool BonnRouteBuffer
is part of the BonnTools suite [KRV07][Hel+11] developed at the Research Institute for
Discrete Mathematics, University of Bonn, in cooperation with IBM.
BonnRouteBuffer can be used in different parts of a physical design flow. In early
planning phases (floor planning), top-level nets have to be connected and buffered. Many
blockages make both global routing and buffering difficult. The blockage structure of an
example instance (U6 in Table 8.1) is shown in Figure 8.1.
After initial placement optimization with respect to linear delays, BonnRouteBuffer
can be used to manage the transition from an unbuffered netlist with optimized virtual
timing to a buffered one. This step requires buffering of all nets.
In Section 8.1 we summarize how BonnRouteBuffer is composed of the building
blocks developed in preceding chapters. In this section we also describe how to deal with
slew effects that do not fit into the resource sharing model directly.
In Section 8.3 we compare BonnRouteBuffer with an industrial buffering algorithm
that achieves congestion-awareness by buffering global wires. Our experiments show that
BonnRouteBuffer is superior to the industrial algorithm with respect to timing while
other metrics such as routability and power consumption are comparable.
8.1 Overview
8.1.1 BonnRouteBuffer as Part of BonnRouteGlobal
The starting point for BonnRouteBuffer is the timing-constrained global routing framework
presented in Chapter 3, which we implemented as an extension to BonnRouteGlobal [Mül09]
[Ahr+15]. By default we use 25 resource sharing iterations (p=25 in Algorithm 4 on
Page 39) and a congestion target of 95%. If we manage to meet all constraints modeled
within BonnRouteGlobal, we will still have routing resources left. These remaining
routing resources are important to compensate for inaccuracies of the global routing model
and to enable us to re-buffer the most critical nets or nets with electrical violations by a
congestion-unaware buffering algorithm in later parts of physical design.
Figure 8.1: Blockage structure of design U6.
The input netlist consists of all nets present after removing all repeaters (even the parity
inverters). As global routing graph we construct the coarse grid graph Gcoarse (Section 2.4.3)
with a default tile size of 70 detailed routing tracks. This graph is used during Steiner tree
construction. For repeater insertion we use the fine graph Gfine (Section 7.1.1). As delay
model we use the Elmore Delay Model with Slew Propagation described in Section 7.1.3.
As lower bound for source delays we use the delay dsource(min{cap(l) : l ∈ L}) along
a source gate driving the smallest repeater. This is indeed a lower bound if the input
capacitance of each sink pin does not fall below the input capacitance of the smallest
repeater (which is usually a correct assumption). As lower bound for the delay along
a wire we use linear delays on a straight path on an optimum layer and with optimum
wire code (see Section 6.5). The delay-per-length parameters have been computed by an
almost optimum buffering of a long repeater chain using a power-time tradeoff parameter
of ξpower = 0.8 per default (see Section 7.2.2). These wire delays are no strict lower bounds
on wire delays in a buffered netlist in general but turn out to be quite optimistic.
As upper bounds for gate delays we use the delay dsource(caplim(s)) through the driving
gate in a solution that just obeys the capacitance limit at the source. This is an upper
bound in an electrically correct solution. As upper bound for wire delays we use the same
strategy as for virtual buffering described in Section 6.5. In addition, we estimate the delay
impact of capacitances on side branches inside a Steiner tree for a net N as b · (|N | − 2).
Here, b is the bifurcation delay penalty introduced in Chapter 5. Bartoschek [Bar14]
(Section 4.1.6.) describes how to obtain the value for b as the maximum delay impact of a
default capacitance on a long repeater chain. Note that the number of bifurcations on a
source-sink path inside a topology for a net N is upper-bounded by |N | − 2.
8.1.2 Block Solver for Arrival Time Customers
The block solver for arrival time customers uses 3 iterations of Newton’s method described
in Section 3.6.3. Re-computation of all arrival times is repeated n = 15 times (Line ⑧ in
Algorithm 4 on Page 39).
8.1.3 Block Solver for Net Customers
The most complex part of BonnRouteBuffer is the block solver for net customers.
Initial topology computation. Let N be a net. We start by clustering N as in
Section 6.5 and connecting all clusters by a topology computed with the bicriteria
approximation algorithm of Theorem 5.8 together with optimizations described in Section 5.5.
The only difference to Section 6.5 is that we cluster sinks with the same polarity only, and
choose a positive bifurcation delay penalty b.
In the default setting we choose ε = 0.1 within the bicriteria algorithm. To compute
initial topologies for the algorithm of Theorem 5.8 we have to distinguish if we are in
the reach aware mode or not. If reach aware routing is active, we use the algorithm by
Bihler [Bih15]. As reach length we use the distance between two repeaters in a long repeater
chain computed with power time trade-off parameter ξpower = 0. If reach aware routing is
inactive, we compute an approximately shortest topology as in Section 5.6.
Topology embedding. Topologies are embedded by the embedding algorithm of Theorem 6.2
including all speed-up techniques described in Section 6.3. By default we use an
embedding tolerance of tol = 5 tiles. If reach aware routing is enabled, we optimize Steiner
trees by the strategy of Section 6.4.
Buffering. The embedded Steiner trees serve as initial Steiner trees for the algorithm of
Section 7.3. The default setting uses the fast mode with parameters γcost = 1.1, γcap = 1,
γslewlim = 1, kinternal = 5, and kfinal = 1. Alternatively, the algorithm can be configured
to use the basic variant or a hybrid mode that uses the basic version for all critical nets
and the fast mode for the remaining nets. To obtain a further speed-up we avoid buffering
of instances for which the initial Steiner tree without repeaters meets all polarity and
capacitance constraints, and for which all slews stay below the slew target.
8.1.4 Slew Updates
By definition of the Min-Max Resource Sharing Problem (Section 3.2), resource consumptions
of every block have to be independent of solutions for other customers. In our case,
consumptions from timing resources by buffered Steiner trees depend on the slew at input
pins of the source gate and hence on solutions for net customers of the preceding nets.
Instead of working with fixed input slews and taking inaccuracies of the timing model
into account, we update input slews for root circuits of succeeding instances after each
Steiner tree computation as follows. In the beginning, we set all input slews to the slew
target for the buffering mode corresponding to the lowest available routing layer, the
default wire code, and the currently chosen value for ξpower (0.8 in the default setting).
Let g be a gate with output pin s and having pin t as one of its input pins. After having
computed a buffered Steiner tree (At, κt) for the net Nt containing t, we evaluate the slew
by propagation along (At, κt) using a current input slew at the source gate at the root of
Nt. This slew serves as the new input slew during computation of a solution for the net
customer of the net with root s.
Although this approach is heuristic, it leads to a better accuracy of the timing model
and to better results.
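The update scheme can be sketched as a loop over the nets. The Python fragment below is a strongly simplified illustration: the callable `solve_net` stands for the computation and evaluation of a buffered Steiner tree, `driven_net` maps a sink pin t to the net rooted at the output pin s of its gate, and the sketch keeps a single input slew per net, so for gates with several input pins the most recent update wins. All of these names and simplifications are assumptions.

```python
from typing import Callable, Dict, List


def buffer_with_slew_updates(nets: List[str],
                             driven_net: Dict[str, str],
                             default_slew: float,
                             solve_net: Callable[[str, float], Dict[str, float]]) -> Dict[str, float]:
    """Process nets in the given order and forward sink slews to succeeding nets.

    Returns the input slew that was used for each net.
    """
    input_slew = {net: default_slew for net in nets}  # initially the slew target
    used: Dict[str, float] = {}
    for net in nets:
        used[net] = input_slew[net]
        sink_slews = solve_net(net, input_slew[net])   # slew at each sink pin of the net
        for pin, slew in sink_slews.items():
            successor = driven_net.get(pin)
            if successor is not None:
                input_slew[successor] = slew           # new input slew for the net behind the gate
    return used


# Toy instance: net n1 drives gate input "g.in"; the gate output is the root of n2.
sinks = {"n1": ["g.in"], "n2": ["out"]}


def toy_solve_net(net: str, slew_in: float) -> Dict[str, float]:
    # Hypothetical slew propagation: every sink sees 0.5 * input slew + 10.
    return {pin: 0.5 * slew_in + 10.0 for pin in sinks[net]}


used_slews = buffer_with_slew_updates(["n1", "n2"], {"g.in": "n2"}, 40.0, toy_solve_net)
```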
8.2 Layer and Wire Code Assignments
Except for later steps, the IBM optimization flow does not maintain global wires. Instead,
for each net, a short rectilinear Steiner tree is computed and evaluated using the properties
of the layer and wire code to which the net is assigned. The assignment of nets to layers
(and wire codes) is called layer assignment. Industrial layer assignment tools such as
CATALYST [Wei+13] and BonnLayerOpt iteratively assign the most critical nets to
higher layers and to wire codes mapping these layers to wider widths and spacings as long
as their timing improves and wiring congestion does not arise. Details on CATALYST
can be found in the paper of Wei et al. [Wei+13]. BonnLayerOpt is based on a Time-
Cost-Tradeoff formulation, see [Hel08], Chapter 6. Recall that we have already used layer
assignment algorithms in the reference run for the experiments in Section 6.5.
To perform the layer assignment step after BonnRouteBuffer we use BonnLayerOpt,
which is part of the BonnTools suite. BonnLayerOpt makes use of a (usually timing-
unaware) global router (e. g. BonnRouteGlobal) that re-routes all nets for which the
assignment has changed. During such a re-route, the global router avoids wiring planes
below the assigned layer and chooses wire widths and spacings according to the assigned
wire code. This way, the global router serves as congestion estimate.
By running such a layer assignment algorithm afterwards, we achieve that BonnRouteBuffer
outputs layer and wire code assignments instead of concrete global wires. Instead
of making the assignment based on short rectilinear Steiner trees, we can use the RC-aware
mode of BonnRouteGlobal [Hel+17] and compute Steiner trees trading-off congestion
for Elmore delays.
There are alternative approaches to output layer and wire code assignments at the end
of BonnRouteBuffer. The easiest possibility is to assign each net of the netlist after
buffering to the highest layer z for which at least a fraction of τ of the global wires for
this net use layers on plane z and above. A similar strategy can be used to obtain a wire
code assignment. The most serious drawback of this approach is that it is not clear how to
choose the parameter τ . If we choose small values for τ , we might create congestion while
too large values lead to timing degradations.
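The threshold rule is easy to state in code. The following Python sketch measures the share of the global wires of a net by wire length per layer; this choice of measure, the function name and the dictionary representation are assumptions made for illustration.

```python
from typing import Dict


def assign_layer(wire_length_by_layer: Dict[int, float], tau: float) -> int:
    """Highest layer z such that at least a fraction tau of the wire length lies on z or above."""
    total = sum(wire_length_by_layer.values())
    if total <= 0:
        raise ValueError("net has no global wires")
    above = 0.0
    for z in sorted(wire_length_by_layer, reverse=True):
        above += wire_length_by_layer[z]   # wire length on layer z and all higher layers
        if above >= tau * total:
            return z
    return min(wire_length_by_layer)       # only reached if tau > 1


usage = {1: 10.0, 3: 30.0, 5: 60.0}
z_half = assign_layer(usage, 0.5)
z_most = assign_layer(usage, 0.8)
z_all = assign_layer(usage, 1.0)
```

In the toy net a small τ assigns the net to layer 5 although 40% of its wiring runs below that layer, whereas τ = 1 forces the assignment down to the lowest layer used, which reflects the tradeoff between congestion and timing described above.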
A better method would be to compute layer and wire code assignments already during
resource sharing. This would require replacing net customers by customers consisting
of pairs (net, assignment). Solutions for the new type of customers are Steiner trees
that do not use wires in x- or y-direction on a layer below the assigned one. The wire
code assignment specifies widths and spacings of these wires. To be consistent with
timing engines that compute delays based on 2-dimensional Steiner trees by assuming
timing functions corresponding to the lowest assigned layer and the assigned wire code (cf.
Section 7.1.3), Steiner trees using edges on layers higher than the assigned one should not
see the timing benefit for doing so.
This approach would perfectly fit into the timing-constrained global routing framework.
Since there are usually only constantly many possible assignments, a naive block solver
would be to compute almost-cheapest Steiner trees for each assignment and select the pair
(Steiner tree, assignment) with minimum cost.
Implementation of such an approach would be a useful future project. Finding a block solver
for the new type of customers that is faster in practice and still admits an approximation
ratio is an interesting open problem.
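The naive block solver mentioned above amounts to a plain enumeration over the assignments. The following Python sketch only illustrates this enumeration; `tree_oracle` is a hypothetical callback standing in for an (almost-)cheapest Steiner tree computation restricted to one assignment, and none of the names correspond to actual BonnRouteBuffer interfaces.

```python
def naive_block_solver(net, assignments, tree_oracle):
    """Enumerate all assignments and keep the cheapest (tree, assignment) pair.

    tree_oracle(net, assignment) is assumed to return a pair (tree, cost),
    where tree respects the layer restriction of the assignment and cost is
    its cost with respect to the current resource prices.
    """
    best = None
    for assignment in assignments:
        tree, cost = tree_oracle(net, assignment)
        if best is None or cost < best[2]:
            best = (tree, assignment, cost)
    if best is None:
        raise ValueError("no assignment available")
    return best


# Toy oracle with made-up costs, for illustration only.
_toy_costs = {("n1", "low layers"): 5.0, ("n1", "high layers"): 3.0}

def toy_oracle(net, assignment):
    return ("tree(%s, %s)" % (net, assignment), _toy_costs[(net, assignment)])

best_tree, best_assignment, best_cost = naive_block_solver(
    "n1", ["low layers", "high layers"], toy_oracle)
```

The running time is the number of assignments times the running time of the oracle, which is why a faster block solver with a provable approximation ratio would be desirable.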
BonnRouteBuffer: A Tool for Global Buffering 149
8.3 Experimental Results
We ran the default version of BonnRouteBuffer on 12 unbuffered 14 nm designs.
The instances were provided by IBM. They do not contain initial plane- and wire code-
assignments and do not contain repeaters except for the parity inverters. We compare
our tool with BuffOpt, the default buffering algorithm in the IBM physical design
flow. BuffOpt achieves congestion-awareness by following the output of a global router.
In a first step, the IBM global router ROUGHROUTE computes a (timing-unaware)
global routing and outputs unbuffered Steiner trees for each net. Each of these Steiner
trees is then buffered sequentially by dynamic programming using a highly optimized
implementation of the algorithm of Lillis et al. [LCL96] including optimizations and speed-
up techniques described in Section 7.2.1. The global routing is updated incrementally
during buffering. Placement-awareness during the BuffOpt flow is achieved by running
placement legalization at several intermediate steps. In addition, BuffOpt performs local
gate sizing and Vt-optimization.
We ran two different post-optimization flows. The first flow performs placement
legalization and BonnLayerOpt only. In a second flow we run global gate-sizing and
Vt-optimization, placement legalization, and BonnLayerOpt.
The results are shown in Table 8.1. Rows entitled w/o gate sizing show results of
BuffOpt and BonnRouteBuffer with the first post-optimization flow. Rows entitled
with gate sizing show results of buffering and the second post-optimization flow.
We compare results by worst slack (wsl), sum of negative slacks (sns), number of
inserted repeaters (# rpt), power consumption (static power and dynamic power),
routing overflow (ol), wire length (wl), number of vias (vias), and the amount of electrical
violations (slew vio and load vio). We compare running times for the pure buffering
calls (wall time buffering) and for the whole flow (wall time total). All reported
timing values and electrical violations were computed with IBM EinsTimer based on
the netlist at the very end. For each net, EinsTimer computes a 2-dimensional timing-
unaware Steiner tree and propagates delays, slews, and capacitances along it assuming delay
parameters of the assigned layer and the assigned wire code. EinsTimer uses the Elmore
Delay Model with Slew Propagation. Static and dynamic power is computed with IBM
EinsPower. Routing overload is determined by a final call to the timing-unaware version of
BonnRouteGlobal, respecting all computed assignments. The reported running times
exclude running times for design loading and for computation of quality metrics.
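For readers unfamiliar with the two timing metrics, the following Python sketch shows how worst slack and sum of negative slacks can be derived from a list of slack values. The list of slacks is hypothetical, and the set of points over which the sum is accumulated (here simply all given values) is an assumption; the reported numbers were computed by the timing engine.

```python
def worst_slack(slacks_ps):
    """Worst slack (wsl): the minimum slack value, in ps."""
    return min(slacks_ps)


def sum_of_negative_slacks_ns(slacks_ps):
    """Sum of negative slacks (sns): sum of all slack values below zero,
    converted from ps to ns as in Table 8.1."""
    return sum(s for s in slacks_ps if s < 0) / 1000.0


# Hypothetical slack values in ps.
slacks = [-366, -120, 15, 40]
wsl = worst_slack(slacks)
sns = sum_of_negative_slacks_ns(slacks)
```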
Table 8.1 shows that BonnRouteBuffer did a better job from a timing point of view.
On U1, U2, U3, U4, U5, U6, U9, U11, and U12 the worst slack improvements are significant
with 50 ps and more for both post-optimization flows. Among all instances, U4 and U6
have the largest chip image and the most critical timing. Here, the solution computed by
BuffOpt has a timing far away from optimum. On instance U10 the BuffOpt-based flow
including gate sizing and Vt-assignment achieved a better timing than the corresponding
BonnRouteBuffer-based flow, but at the cost of a higher power consumption, worse
routability, and more electrical violations.
With respect to power consumption, neither of the buffering tools dominates the other.
On units U1, U4, and U6, BonnRouteBuffer consumed more power. This effect
can be explained by the exponential cost functions used in the resource sharing algorithm:
Prices for the critical timing resources dominate prices for the less critical power resource
and hence large power consumptions are tolerated to achieve small delays. On these units,
gate sizing and Vt-optimization could not help to achieve a significant power reduction.
Further post-processing with optimization tools that directly optimize objectives modeled
by dominated resources such as the power resource in later parts of physical design is
important.

Experiment       Postopt             wsl      sns    # rpt        area  static  dynamic    ol     wl   vias    slew vio  load vio  wall time  wall time
                                                                         power    power                                            buffering      total
                                    [ps]     [ns]                         [mW]     [mW]          [m]    [k]        [ps]      [fF]    [h:m:s]    [h:m:s]
U1 (21 385 nets, cycle time 240 ps)
BuffOpt          w/o gate sizing    -366     -831   29 105     642 374    27.5     27.3   137   3.64    491      43 281       224    0:25:24    1:00:53
BonnRouteBuffer  w/o gate sizing    -102     -239   15 302     923 478    37.4     31.0     7   3.54    374      81 113       244    0:27:35    1:14:13
BuffOpt          with gate sizing   -217     -425   29 105     654 351    34.9     27.7   102   3.64    480      25 443       465    0:25:24    1:03:47
BonnRouteBuffer  with gate sizing   -112     -225   15 302     790 770    41.7     28.3     6   3.54    370      71 641        94    0:27:35    1:15:45
U2 (25 044 nets, cycle time 184 ps)
BuffOpt          w/o gate sizing    -187     -222   10 412     341 483    22.8     10.6     0   0.40    256      10 487         0    0:15:19    0:19:08
BonnRouteBuffer  w/o gate sizing     -91     -149    5 260     351 833    20.9     11.0     0   0.40    227       4 038         0    0:03:54    0:07:30
BuffOpt          with gate sizing   -126     -141   10 412     338 468    24.0     10.5     0   0.40    256       7 482         0    0:15:19    0:19:45
BonnRouteBuffer  with gate sizing    -62      -84    5 260     325 602    19.9     10.5     0   0.40    230       1 244         0    0:03:54    0:08:40
U3 (32 442 nets, cycle time 184 ps)
BuffOpt          w/o gate sizing    -190      -48   14 603     379 066    12.9      8.9     0   0.64    305         184         4    0:09:01    0:12:20
BonnRouteBuffer  w/o gate sizing    -113      -24    6 808     423 062    13.9      9.8     0   0.60    270         258         0    0:04:23    0:07:26
BuffOpt          with gate sizing   -161      -25   14 603     372 914    10.1      8.8     0   0.64    305         356         0    0:09:01    0:13:11
BonnRouteBuffer  with gate sizing    -70      -23    6 808     341 446     5.4      8.2     0   0.61    282       2 628         0    0:04:23    0:08:58
U4 (37 370 nets, cycle time 250 ps)
BuffOpt          w/o gate sizing    -984  -32 498   32 845   4 404 323    42.2    103.6   693  17.92    900  19 157 031    21 955    0:44:12    2:25:49
BonnRouteBuffer  w/o gate sizing    -229   -6 118   65 376   6 290 657   119.3    139.0   362  17.62  1 419   1 566 067     3 002    1:10:01    1:59:43
BuffOpt          with gate sizing   -984  -32 404   32 845   4 403 702    42.8    103.5   468  17.96    888  19 038 134    21 033    0:44:12    2:55:02
BonnRouteBuffer  with gate sizing   -229   -6 174   65 376   6 292 440   119.0    138.6   327  17.63  1 423   1 570 497     3 063    1:10:01    2:09:17
U5 (37 917 nets, cycle time 184 ps)
BuffOpt          w/o gate sizing    -297      -57   13 448     330 210    12.5      7.3     0   0.48    361         940         0    0:11:15    0:14:36
BonnRouteBuffer  w/o gate sizing    -175      -41    5 525     351 521    10.5      8.2     0   0.49    327       1 771        13    0:04:47    0:08:24
BuffOpt          with gate sizing   -199      -41   13 448     318 526     8.4      6.7     0   0.49    362         679         0    0:11:15    0:16:18
BonnRouteBuffer  with gate sizing   -129      -38    5 525     296 971     6.2      6.8     0   0.49    337       5 432         0    0:04:47    0:10:28
U6 (38 066 nets, cycle time 250 ps)
BuffOpt          w/o gate sizing    -821   -4 474  157 957  13 793 175    59.8    228.6   105  37.11  2 293   1 835 626     5 375    2:35:35    2:50:01
BonnRouteBuffer  w/o gate sizing    -182   -1 457   75 014  15 800 355   152.1    230.1     5  35.29  1 414   9 037 388    22 409    1:58:50    2:37:12
BuffOpt          with gate sizing   -821   -4 025  157 957  13 871 591    70.1    229.7    97  37.10  2 279   1 696 970     5 369    2:35:35    3:06:09
BonnRouteBuffer  with gate sizing   -182   -1 458   75 014  15 800 456   152.0    229.9     7  35.29  1 414   9 037 100    22 381    1:58:50    2:44:07
U7 (44 423 nets, cycle time 184 ps)
BuffOpt          w/o gate sizing    -135     -138   21 521     572 226    33.0     13.1     0   0.96    501         367         7    0:13:18    0:18:54
BonnRouteBuffer  w/o gate sizing     -97      -87   12 030     647 049    34.5     14.7     0   0.99    453         857         0    0:07:10    0:12:54
BuffOpt          with gate sizing    -77      -57   21 521     574 285    30.8     13.0     0   0.96    499         390         0    0:13:18    0:20:31
BonnRouteBuffer  with gate sizing    -66      -59   12 030     561 504    25.0     13.2     0   1.00    462       1 170         0    0:07:10    0:14:16
U8 (49 192 nets, cycle time 264 ps)
BuffOpt          w/o gate sizing    -149      -78   27 298     634 202    17.7     14.7     0   1.15    532         563        25    0:17:50    0:23:13
BonnRouteBuffer  w/o gate sizing    -174     -134   17 682     736 019    16.0     16.4     0   1.11    474         266         0    0:07:11    0:11:56
BuffOpt          with gate sizing   -113      -37   27 298     617 292    15.0     14.3     0   1.15    534         707         0    0:17:50    0:24:51
BonnRouteBuffer  with gate sizing    -97      -67   17 682     599 321    12.1     14.4     0   1.12    488       1 632         3    0:07:11    0:14:19
U9 (93 783 nets, cycle time 264 ps)
BuffOpt          w/o gate sizing    -174     -151   40 734     998 787    26.0     24.8     0   1.49    978         281         0    0:26:43    0:34:56
BonnRouteBuffer  w/o gate sizing    -125     -225   22 416   1 097 076    20.1     26.8     0   1.45    862         220         0    0:12:01    0:19:11
BuffOpt          with gate sizing   -115      -43   40 734     983 069    18.3     23.9     0   1.49    978         353         0    0:26:43    0:38:17
BonnRouteBuffer  with gate sizing    -62      -55   22 416     940 349    13.9     23.4     0   1.44    880       2 557         0    0:12:01    0:23:32
U10 (141 175 nets, cycle time 240 ps)
BuffOpt          w/o gate sizing    -323   -1 039   38 427   1 224 151    40.4     13.4   589   2.47  1 450       2 245         0    0:29:21    0:38:04
BonnRouteBuffer  w/o gate sizing    -308   -1 107   18 389   1 273 621    25.6     14.3   265   2.37  1 345         914         6    0:14:27    0:22:30
BuffOpt          with gate sizing   -199     -636   38 427   1 261 029    51.7     13.9   546   2.48  1 445       2 405         0    0:29:21    0:41:54
BonnRouteBuffer  with gate sizing   -215     -609   18 389   1 185 880    44.4     13.3   234   2.36  1 344         526         0    0:14:27    0:24:10
U11 (156 464 nets, cycle time 264 ps)
BuffOpt          w/o gate sizing    -379   -1 995  121 017   2 635 955   100.7     51.6     5   1.82    584       3 815         6    1:15:55    1:34:28
BonnRouteBuffer  w/o gate sizing    -209   -1 495   63 233   2 767 163    70.8     54.5    95   5.38  2 065      30 072         0    0:27:24    0:59:22
BuffOpt          with gate sizing   -192   -1 019  121 017   2 630 576   129.2     51.4     2   1.24    405       3 399         0    1:15:55    1:40:49
BonnRouteBuffer  with gate sizing   -124     -800   63 233   2 455 688   113.3     49.9     0   1.74    524      22 872         0    0:27:24    0:56:04
U12 (361 665 nets, cycle time 184 ps)
BuffOpt          w/o gate sizing    -509     -727  131 544   4 632 645   116.8    143.8   219   9.93  4 075      62 892        19    1:56:56    2:51:45
BonnRouteBuffer  w/o gate sizing    -355     -545   82 410   5 208 123    92.5    153.9   109   9.54  3 662     284 718       205    0:56:14    3:09:21
BuffOpt          with gate sizing   -343     -337  131 544   4 524 806    92.8    139.9   208   9.94  4 092      66 906         7    1:56:56    3:13:20
BonnRouteBuffer  with gate sizing   -296     -325   82 410   4 393 723    66.6    132.6   128   9.52  3 814     243 635        66    0:56:14    2:29:02

Table 8.1: Comparison between BonnRouteBuffer and the industrial buffering algorithm BuffOpt.
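The dominance effect attributed to the exponential cost functions can be made concrete with a small numerical illustration. The Python sketch below uses a generic multiplicative-weights style price exp(ε · usage); the exact price update and scaling inside the resource sharing algorithm may differ, and the usage values and ε are made up for illustration.

```python
import math


def exponential_prices(usages, epsilon):
    """Generic exponential prices exp(epsilon * usage) for each resource.
    This is an illustrative stand-in, not the exact update rule of the
    resource sharing algorithm."""
    return {name: math.exp(epsilon * usage) for name, usage in usages.items()}


def price_shares(prices):
    """Normalize prices so that they sum to one."""
    total = sum(prices.values())
    return {name: price / total for name, price in prices.items()}


# Hypothetical usages: the timing resource is over-used, power is not.
usages = {"timing": 1.3, "power": 0.7}
shares = price_shares(exponential_prices(usages, epsilon=10.0))
```

Although the two usages differ only by a factor of less than two, the timing resource receives almost the entire price mass, so a block solver minimizing priced cost will accept a large power consumption in exchange for small delay improvements.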
On units U9, U10, U11, and U12, the solution computed by BonnRouteBuffer
consumes less power than the solution of BuffOpt and on U2, U3, U5, U7, and U8,
power consumptions do not differ much between the tools. After the application of gate
sizing and Vt-optimization, power consumptions on all these 9 units are smaller with the
BonnRouteBuffer-based flow.
On all instances except for U4, BuffOpt inserts more repeaters than BonnRouteBuffer
and seems to prefer the smallest repeaters. This approach has both advantages
and disadvantages for later optimization steps. By inserting many repeaters, smaller nets
arise and the probability that the design’s timing engine chooses unfavorable Steiner trees
for computation of delays and electrical violations decreases. The drawback of inserting
many repeaters is that subsequent optimization tools have less flexibility. Power reductions
achieved by gate sizing and Vt-optimization are smaller when run on the output of
BuffOpt, while timing improvements are comparable. During global routing, connecting
the many repeaters inserted by BuffOpt requires many vias.
On U1, U5, U6, U7, U11, and U12, BonnRouteBuffer produces more slew violations
than BuffOpt. On U6, the number of load violations is much larger, too. There are
several reasons for electrical violations. Sometimes, BonnLayerOpt cannot assign nets
to higher layers due to congestion issues and larger slew degradations on lower layers
create violations. On some nets, unfavorable routing topologies of the timing-unaware
2-dimensional Steiner trees computed by IBM EinsTimer induce slew violations that are
not present in the timing-driven Steiner trees computed by BonnRouteBuffer. The
largest amount of electrical violations is present in timing-uncritical nets. For these nets,
the Steiner trees given as input to the buffering algorithm from Chapter 7 use the lowest
layers. Some of these Steiner trees are not reach-aware and the buffering algorithm cannot
avoid electrical violations. Activating the reach-aware mode as described in Section 6.4 can
certainly reduce these problems but cannot completely avoid them. The main reason for
this is that the simple reach-aware model from Section 6.4 does not take pin capacitances
into account, which have a significant impact on large instances such as U6. There is an
ongoing project by a master's student who works at the Research Institute for Discrete
Mathematics, University of Bonn, under my co-supervision. By using a multi-label approach
he re-embeds topologies for such instances, also taking placement, slew, and capacitance
constraints into account. With this project we can hope for better solutions with fewer
electrical violations in future versions of BonnRouteBuffer.
Since BonnRouteBuffer is run with 16 threads, running times are smaller than
for BuffOpt which performs repeater insertion sequentially. The only units where
BonnRouteBuffer is still slower are U1 and U4.
Figures 8.2 and 8.3 show congestion, timing, and placement density of U4 and U8
respectively. Recall from Section 6.5 that in a congestion plot each edge of the global
routing graph is colored according to the fraction to which the corresponding congestion
resource is used and the timing histograms show the slack distribution of all gates, where
each gate is represented by its worst slack. Congestion plots and slack histograms shown
in Figures 8.2 and 8.3 visualize the state at the end of the flows including gate sizing
and Vt-assignment. The pictures at the bottom are placement density plots. Here, each
placement bin is shown in a color corresponding to its density. White and blue represent
bins with small density (below 10%), yellow shows a moderate usage of around 50%, and
the colors orange and red indicate that a placement bin is almost completely used.
The purple bins are used by 100% or even more. Examples of bins that are used
by exactly 100% are all bins in Figure 8.2 completely covered by a placement blockage.
Overfull bins can only occur before placement legalization. In Figures 8.2 and 8.3 we show
density plots directly after buffering (left) and at the end of the flow (right).

[Figure 8.2, each panel showing congestion after postopt, timing after postopt, density before postopt, and density after postopt]
(a) Results of BuffOpt on U4 (wsl: −984 ps, sns: −32 404 ns).
(b) Results of BonnRouteBuffer on U4 (wsl: −229 ps, sns: −6 174 ns).
Figure 8.2: Congestion, timing, and placement density of U4. BonnRouteBuffer has a larger
though feasible area consumption but achieves a much better timing. The congestion plot for
BuffOpt shows local hot-spots.

Figure 8.2 shows that BonnRouteBuffer consumes more area than BuffOpt. Gate
sizing could not reduce that amount. By paying the price of an increased area and power
consumption, we could obtain a much better timing. Figure 8.3 shows an example of a unit
for which gate sizing could reduce the amount of occupied area in the BonnRouteBuffer-
based flow. In contrast to the situation after buffering, density plots on the right hand
side of Figures 8.3(a) and 8.3(b) show that the BonnRouteBuffer-based flow consumes
less area than the BuffOpt-based flow where gate sizing could not lead to a significant
reduction of area consumption. From a timing point of view, neither solution is clearly
better: BonnRouteBuffer achieves a better worst slack while the sum of negative slacks
is better with BuffOpt.

[Figure 8.3, each panel showing congestion after postopt, timing after postopt, density before postopt, and density after postopt]
(a) Results of BuffOpt on U8 (wsl: −113 ps, sns: −37 ns).
(b) Results of BonnRouteBuffer on U8 (wsl: −97 ps, sns: −67 ns).
Figure 8.3: Congestion, timing, and placement density of U8. The solution of BonnRouteBuffer
is overpowered and consumes more area than necessary. Gate sizing can drastically
decrease the area consumption.

8.3.1 Comparison with the IBM Physical Design Flow
The results of Section 8.3 suggest that BonnRouteBuffer is well-suited as global
buffering tool in early phases of a chip design flow: On most instances the combination
of BonnRouteBuffer and the post-optimization flow including gate sizing and Vt-
optimization achieves better timing than the industrial buffering algorithm BuffOpt and
even uses less power and area.
In this section we compare BonnRouteBuffer with the IBM physical design flow.
This design flow uses many tools for timing optimization and power reduction and we
cannot hope that our simple BonnRouteBuffer-based optimization flows from the
previous section achieve equally good results on general instances.
We run a design similar to unit U6 directly within the IBM environment. This design
consists of repeater trees only and most optimization tools contained in the design flow
that are missing in our simple flows are not needed here. We demonstrate that on such an
instance a BonnRouteBuffer-based flow achieves even better results than the complex
and well-tuned design flow.
The design flow achieves the following result:
Worst slack:             -45 ps        Routing Overload:  320
Sum of negative slacks:  -0.3 ns       Wire length:       37.75 m
Number of repeaters:     158 398       Number of vias:    2 482 k
Area:                    12 093 604    Slew violations:   9 193 612 ps
Static power:            59.4 mW       Load violations:   1 425 160 fF
Dynamic power:           587.0 mW
The BonnRouteBuffer-based flow from Section 8.3 that includes gate sizing and Vt-
optimization could compute a solution with smaller static and dynamic power consumption
and less routing overflow:
Worst slack:             -68 ps        Routing Overload:  88
Sum of negative slacks:  -0.7 ns       Wire length:       29.05 m
Number of repeaters:     105 283       Number of vias:    1 650 k
Area:                    13 157 757    Slew violations:   3 881 047 ps
Static power:            26.8 mW       Load violations:   619 483 fF
Dynamic power:           550.4 mW
The solution computed by the design flow is almost timing-clean, and BonnRouteBuffer
achieved a worst slack that is smaller by roughly 20 ps. The output of the
design flow is more difficult to route and BonnRouteGlobal needed to make detours
to avoid congestion, which results in larger net length and more used vias. Both solutions
produce a huge amount of electrical violations in timing-uncritical nets.
We optimized the output of the BonnRouteBuffer-based flow further by re-buffering
the 0.5% most timing-critical nets with the fast buffering algorithm by Bartoschek et
al. [Bar+09][Bar14] (see Section 7.2.2). In this algorithm we used the configuration
ξpower = ξbuf = 1 that optimizes timing only and ignores the power consumption. After re-
buffering we legalized the placement, re-computed layer assignments with BonnLayerOpt,
and re-ran gate sizing and Vt-assignment. We obtain the following result:
Worst slack:             -47 ps        Routing Overload:  81
Sum of negative slacks:  -0.2 ns       Wire length:       29.06 m
Number of repeaters:     105 289       Number of vias:    1 657 k
Area:                    13 119 012    Slew violations:   3 879 779 ps
Static power:            23.2 mW       Load violations:   619 380 fF
Dynamic power:           548.9 mW
With this additional post-optimization we could achieve an equally good timing as the
IBM design flow. The impact on the other quality metrics is negligible and hence, the final
result of the BonnRouteBuffer-based flow is better than the result of the IBM design
flow with respect to power consumption, routability, and electrical violations. Congestion
and density plots of the two results are shown in Figure 8.4.
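The size of these differences can be checked with a few lines of arithmetic. The following Python sketch computes relative changes of the final BonnRouteBuffer-based result with respect to the IBM design flow, using only the numbers reported above; the dictionary keys and the helper name are chosen for illustration.

```python
# Values reported above for the IBM design flow and for the final
# BonnRouteBuffer-based flow (after re-buffering of the critical nets).
ibm_flow = {
    "static power [mW]": 59.4,
    "dynamic power [mW]": 587.0,
    "routing overload": 320,
    "wire length [m]": 37.75,
    "vias [k]": 2482,
    "slew violations [ps]": 9193612,
    "load violations [fF]": 1425160,
}
brb_flow = {
    "static power [mW]": 23.2,
    "dynamic power [mW]": 548.9,
    "routing overload": 81,
    "wire length [m]": 29.06,
    "vias [k]": 1657,
    "slew violations [ps]": 3879779,
    "load violations [fF]": 619380,
}


def relative_change(reference, value):
    """Relative change of value with respect to reference (negative = smaller)."""
    return (value - reference) / reference


changes = {key: relative_change(ibm_flow[key], brb_flow[key]) for key in ibm_flow}
for key, change in changes.items():
    print(f"{key}: {100 * change:+.1f}%")
```

All listed metrics decrease. Area is not included in this comparison; the reported area of the BonnRouteBuffer-based result (13 119 012) is larger than that of the design flow (12 093 604).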
Figure 8.4: Congestion plots and density plots of the instance from Section 8.3.1.
(a) IBM physical design flow. (b) BonnRouteBuffer-based flow that includes gate sizing,
Vt-optimization, and re-buffering of the 0.5% most timing-critical nets.
8.4 Conclusions and Future Work
With BonnRouteBuffer we have developed a theoretically-founded tool that combines
global routing with buffering. These central problems of any physical design flow have been
solved separately in the past. The experimental results presented in this chapter already
look quite promising: On most units, one single application of BonnRouteBuffer is
sufficient to build fast repeater trees for most repeater tree instances without creating
routing hot-spots and without creating placement instances that cannot be legalized
afterwards. Later optimization steps can now concentrate on detailed optimization such as
eliminating electrical violations, optimizing the most critical paths, and decreasing the
power consumption in uncritical parts of the chip.
Existing buffering tools such as BuffOpt sometimes fail to accomplish the task of
global buffering. Often, many iterations of timing optimization tools are necessary to
compute globally good solutions. Most repeater trees built in early iterations are ripped-up
and replaced during later steps. This leads to unnecessarily large running times. Simulating
global optimization with iteration of tools for detailed optimization is not always a good
idea.
Although BonnRouteBuffer already produces good results, there are still many
features missing that need to be implemented and are subject to ongoing or future research.
• Computation of the Steiner tree that is passed as input to buffering should be aware
of placement congestion. The buffering algorithm should always have a chance to
avoid placing repeaters inside placement hot-spots.
• Buffered Steiner trees for the most critical instances should be computed by a block
solver that does not separate Steiner tree computation and buffer insertion. Recall
that we presented such an algorithm for the basic version of the Elmore delay model
in Chapter 4.
• Lower and upper bounds for arrival times are currently estimates based on linear
delays. Bartoschek [Bar14] (page 46) showed that there is a good correlation between
linear delays and delays after buffering, but there are many outliers. Using better
bounds would improve results and would lead to faster convergence of arrival times
during the resource sharing part of BonnRouteBuffer.
• Resource prices within the first resource sharing iterations are often not good enough.
It can happen that long connections of eventually critical nets are routed on the
lowest layers because prices of the corresponding timing resources have not increased
enough yet. For such Steiner trees it is not necessary to spend the running time for
computing an actual buffering solution.
Another interesting open problem is the interaction between buffering and gate sizing.
On the one hand, tools for gate sizing need to get completely buffered nets as input to
perform their task successfully. On the other hand, sizes of the logic gates have a big impact
on buffering. Often, insertion of additional repeaters close to the source pin is necessary if
the source gate can only drive small capacitances. In many of these cases, re-sizing the
source gate leads to better solutions than inserting the additional repeaters. The ambitious
task of solving gate sizing, buffering, and global routing simultaneously should get high
priority in future research. Except for the work of Alpert et al. [Alp+04] who developed
an algorithm for simultaneous driver sizing and buffering in which delay effects of re-sizing
the source gate are modeled as a pre-computed delay penalty, not much is known about
this problem. A promising approach would be to combine the algorithm by Schorr [Sch15]
for resource-sharing-based gate sizing and Vt-optimization with the resource-sharing-based
timing-constrained global routing and buffering within BonnRouteBuffer.
Bibliography
[Ahr+15] Markus Ahrens, Michael Gester, Niko Klewinghaus, Dirk Müller, Sven Peyer,
Christian Schulte, and Gustavo Téllez (2015). Detailed Routing Algorithms for
Advanced Technology Nodes. IEEE Transactions on Computer-Aided Design
of Integrated Circuits and Systems 34.4, pp. 563–576.
[Alb+02] Christoph Albrecht, Andrew B Kahng, Ion Mandoiu, and Alexander Zelikovsky
(2002). Floorplan Evaluation with Timing-Driven Global Wireplanning, Pin
Assignment and Buffer/Wire Sizing. Proceedings of the 2002 Asia and South
Pacific Design Automation Conference. IEEE, p. 580.
[AD97] Charles J Alpert and Anirudh Devgan (1997). Wire Segmenting for Improved
Buffer Insertion. Proceedings of the 34th annual Design Automation Conference.
ACM, pp. 588–593.
[ADQ99] Charles J Alpert, Anirudh Devgan, and Stephen T Quay (1999). Buffer Inser-
tion With Accurate Gate and Interconnect Delay Computation. Proceedings of
the 36th annual ACM/IEEE Design Automation Conference. ACM, pp. 479–
484.
[AHQ04] Charles J Alpert, Miloš Hrkić, and Stephen T Quay (2004). A Fast Algorithm
for Identifying Good Buffer Insertion Candidate Locations. Proceedings of the
2004 International Symposium on Physical Design. ACM, pp. 47–52.
[Alp+95] Charles J Alpert, Te C Hu, J H Huang, Andrew B Kahng, and David R Karger
(1995). Prim-Dijkstra Tradeoffs for Improved Performance-Driven Routing Tree
Design. IEEE Transactions on Computer-Aided Design of Integrated Circuits
and Systems 14.7, pp. 890–896.
[Alp+04] Charles Alpert, Chris Chu, Gopal Gandham, Miloš Hrkić, Jiang Hu, Chan-
dramouli Kashyap, and Stephen Quay (2004). Simultaneous Driver Sizing and
Buffer Insertion Using a Delay Penalty Estimation Technique. IEEE Trans-
actions on Computer-Aided Design of Integrated Circuits and Systems 23.1,
pp. 136–141.
[Aro98] Sanjeev Arora (1998). Polynomial Time Approximation Schemes for Euclidean
Traveling Salesman and Other Geometric Problems. Journal of the ACM
(JACM) 45.5, pp. 753–782.
[Bar14] Christoph Bartoschek (2014). Fast Repeater Tree Construction. PhD thesis,
Research Institute for Discrete Mathematics, University of Bonn, Germany.
[Bar+10] Christoph Bartoschek, Stephan Held, Jens Maßberg, Dieter Rautenbach, and
Jens Vygen (2010). The Repeater Tree Construction Problem. Information
Processing Letters 110.24, pp. 1079–1083.
[Bar+09] Christoph Bartoschek, Stephan Held, Dieter Rautenbach, and Jens Vygen
(2009). Fast Buffering for Optimizing Worst Slack and Resource Consumption
in Repeater Trees. Proceedings of the 2009 International Symposium on Physical
Design. ACM, pp. 43–50.
[Bel58] Richard Bellman (1958). On a Routing Problem. Quarterly of Applied Mathe-
matics 16, pp. 87–90.
[Bih15] Tilmann Bihler (2015). Rektilineare Steinerbäume mit längen- und richtungs-
beschränkenden Blockaden. Bachelor’s thesis, Research Institute for Discrete
Mathematics, University of Bonn, Germany.
[BZ15] Marcus Brazil and Martin Zachariasen (2015). Optimal Interconnection Trees in
the Plane: Theory, Algorithms and Applications. Springer Publishing Company,
Incorporated.
[Byr+13] Jarosław Byrka, Fabrizio Grandoni, Thomas Rothvoß, and Laura Sanita (2013).
Steiner Tree Approximation via Iterative Randomized Rounding. Journal of
the ACM (JACM) 60.1, p. 6.
[CM99] Chung-Ping Chen and Noel Menezes (1999). Noise-aware Repeater Insertion
and Wire Sizing for On-chip Interconnect using Hierarchical Moment-Matching.
Proceedings of the 36th annual ACM/IEEE Design Automation Conference.
ACM, pp. 502–506.
[CW08] Chris Chu and Yiu-Chung Wong (2008). FLUTE: Fast Lookup Table Based
Rectilinear Steiner Minimal Tree Algorithm for VLSI Design. IEEE Trans-
actions on Computer-Aided Design of Integrated Circuits and Systems 27.1,
pp. 70–83.
[Chu+05] Julia Chuzhoy, Anupam Gupta, Joseph Naor, and Amitabh Sinha (2005).
On the Approximability of Network Design Problems. ACM Transactions on
Algorithms (TALG).
[CLZ93] Jason Cong, Kwok-Shing Leung, and Dian Zhou (1993). Performance-Driven
Interconnect Design Based on Distributed RC Delay Model. Proceedings of the
30th Conference on Design Automation. IEEE, pp. 606–611.
[Con+92] Jingsheng Cong, Andrew B Kahng, Gabriel Robins, Majid Sarrafzadeh, and
Chak-Kuen Wong (1992). Provably Good Performance-Driven Global Routing.
IEEE Transactions on Computer-Aided Design of Integrated Circuits and
Systems 11.6, pp. 739–752.
[CL94] Javier Córdova and Yann-hang Lee (1994). A Heuristic Algorithm for the
Rectilinear Steiner Arborescence Problem. Engineering Optimization.
[CW03] John F Croix and D F Wong (2003). Blade and Razor: Cell and Interconnect
Delay Analysis Using Current-Based Models. Proceedings of the 40th Design
Automation Conference. IEEE, pp. 386–389.
[Dij59] Edsger W Dijkstra (1959). A Note on Two Problems in Connexion with Graphs.
Numerische Mathematik 1.1, pp. 269–271.
[ES15] Michael Elkin and Shay Solomon (2015). Steiner Shallow-Light Trees are
Exponentially Lighter than Spanning Ones. SIAM Journal on Computing 44.4,
pp. 996–1025.
[Elm48] William C Elmore (1948). The Transient Response of Damped Linear Networks
with Particular Regard to Wideband Amplifiers. Journal of Applied Physics
19.1, pp. 55–63.
[Fen+06] Zhe Feng, Yu Hu, Tong Jing, Xianlong Hong, Xiaodong Hu, and Guiying
Yan (2006). An O(n log n) Algorithm for Obstacle-Avoiding Routing Tree
Construction in the λ-Geometry Plane. Proceedings of the 2006 International
Symposium on Physical Design. ACM, pp. 48–55.
[For56] Lester R Ford (1956). Network Flow Theory. DTIC Document.
[FT87] Michael L Fredman and Robert E Tarjan (1987). Fibonacci Heaps and their
Uses in Improved Network Optimization Algorithms. Journal of the ACM
(JACM) 34.3, pp. 596–615.
[GJ77] Michael R Garey and David S. Johnson (1977). The Rectilinear Steiner Tree
Problem is NP-Complete. SIAM Journal on Applied Mathematics 32.4, pp. 826–
834.
[GJ79] Michael R Garey and David S Johnson (1979). Computers and Intractability:
a Guide to the Theory of NP-Completeness. San Francisco, CA: Freeman.
[GK07] Naveen Garg and Jochen Koenemann (2007). Faster and Simpler Algorithms
for Multicommodity Flow and Other Fractional Packing Problems. SIAM
Journal on Computing 37.2, pp. 630–652.
[Ges+13] Michael Gester, Dirk Müller, Tim Nieberg, Christian Panten, Christian Schulte,
and Jens Vygen (2013). BonnRoute: Algorithms and Data Structures for
Fast and Good VLSI Routing. ACM Transactions on Design Automation of
Electronic Systems (TODAES) 18.2, p. 32.
[Goe+12] Michel X Goemans, Neil Olver, Thomas Rothvoß, and Rico Zenklusen (2012).
Matroids and Integrality Gaps for Hypergraphic Steiner Tree Relaxations.
Proceedings of the forty-fourth annual ACM symposium on Theory of computing.
ACM, pp. 1161–1176.
[GH05] Andrew V Goldberg and Chris Harrelson (2005). Computing the Shortest
Path: A∗ Search Meets Graph Theory. Proceedings of the Sixteenth Annual
ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and
Applied Mathematics, pp. 156–165.
[GK94] Michael D Grigoriadis and Leonid G Khachiyan (1994). Fast Approximation
Schemes for Convex Programs with many Blocks and Coupling Constraints.
SIAM Journal on Optimization 4.1, pp. 86–107.
[Häh15] Nicolai Hähnle (2015). Time-Cost Tradeoff and Steiner Tree Packing with
Multiplicative Weights. Technical report no. 1511115, Research Institute for
Discrete Mathematics, University of Bonn.
[Hel08] Stephan Held (2008). Timing Closure in Chip Design. PhD thesis, Research
Institute for Discrete Mathematics, University of Bonn, Germany.
[HHV15] Stephan Held, Stefan Hougardy, and Jens Vygen (2015). Chip Design. Prince-
ton Companion to Applied Mathematics (N. Higham, ed.) Princeton University
Press.
[Hel+11] Stephan Held, Bernhard Korte, Dieter Rautenbach, and Jens Vygen (2011).
Combinatorial Optimization in VLSI Design. Combinatorial Optimization-
Methods and Applications 31, pp. 33–96.
[Hel+17] Stephan Held, Dirk Müller, Daniel Rotter, Rudolf Scheifele, Vera Traub, and
Jens Vygen (2017). Global Routing with Timing Constraints. submitted.
[Hel+15] Stephan Held, Dirk Müller, Daniel Rotter, Vera Traub, and Jens Vygen (2015).
Global Routing with Inherent Static Timing Constraints. Proceedings of the
IEEE/ACM International Conference on Computer-Aided Design. IEEE Press,
pp. 102–109.
[HR13] Stephan Held and Daniel Rotter (2013). Shallow-Light Steiner Arborescences
with Vertex Delays. Integer Programming and Combinatorial Optimization.
Springer, pp. 229–241.
[HS14] Stephan Held and Sophie T Spirkl (2014). A Fast Algorithm for Rectilinear
Steiner Trees with Length Restrictions on Obstacles. Proceedings of the 2014
International Symposium on Physical Design. ACM, pp. 37–44.
[Hen16] Dorothee Henke (2016). Pfadsuche im Detailed Routing. Bachelor’s thesis,
Research Institute for Discrete Mathematics, University of Bonn, Germany.
[HSC82] Robert B Hitchcock, Gordon L Smith, and David D Cheng (1982). Timing
Analysis of Computer Hardware. IBM Journal of Research and Development
26.1, pp. 100–105.
[HS75] Dan Hoey and Michael I Shamos (1975). Closest-Point Problems. Proceedings
of the 16th IEEE Annual Symposium on Foundations of Computer Science.
Vol. 26, pp. 151–162.
[Hon+97] Xianlong Hong, Tianxiong Xue, Jin Huang, Chung-Kuan Cheng, and Ernest S
Kuh (1997). TIGER: an Efficient Timing-Driven Global Router for Gate Array
and Standard Cell Layout Design. IEEE Transactions on Computer-Aided
Design of Integrated Circuits and Systems 16.11, pp. 1323–1331.
[HL02] Miloš Hrkić and John Lillis (2002). S-Tree: a Technique for Buffered Routing
Tree Synthesis. Proceedings of the 39th Annual Design Automation Conference.
ACM, pp. 578–583.
[Hu+03] Jiang Hu, Charles J Alpert, Stephen T Quay, and Gopal Gandham (2003).
Buffer Insertion with Adaptive Blockage Avoidance. IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems 22.4, pp. 492–498.
[HS02] Jiang Hu and Sachin S Sapatnekar (2002). A Timing-Constrained Simultaneous
Global Routing Algorithm. IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems 21.9, pp. 1025–1036.
[Hu+07] Shiyan Hu, Charles J Alpert, Jiang Hu, Shrirang K Karandikar, Zhuo Li,
Weiping Shi, and Chin N Sze (2007). Fast Algorithms for Slew-Constrained
Minimum Cost Buffering. IEEE Transactions on Computer-Aided Design of
Integrated Circuits and Systems 26.11, pp. 2009–2022.
[HLA09] Shiyan Hu, Zhuo Li, and Charles J Alpert (2009). A Fully Polynomial Time
Approximation Scheme for Timing Driven Minimum Cost Buffer Insertion.
Proceedings of the 46th Annual Design Automation Conference. ACM,
pp. 424–429.
[Hua+93] Jin Huang, Xian-Long Hong, Chung-Kuan Cheng, and Ernest S Kuh (1993).
An Efficient Timing-Driven Global Routing Algorithm. Proceedings of the 30th
Conference on Design Automation. ACM, pp. 596–600.
[Huf52] David A Huffman (1952). A Method for the Construction of
Minimum-Redundancy Codes. Proceedings of the IRE 40.9, pp. 1098–1101.
[Hwa76] Frank K Hwang (1976). On Steiner Minimal Trees with Rectilinear Distance.
SIAM Journal on Applied Mathematics 30.1, pp. 104–114.
[JZ08] Klaus Jansen and Hu Zhang (2008). Approximation algorithms for general
packing problems and their application to the multicast congestion problem.
Mathematical Programming 114.1, pp. 183–206.
[Kas+04] Chandramouli V Kashyap, Charles J Alpert, Frank Liu, and Anirudh Devgan
(2004). Closed Form Expressions for Extending Step Delay and Slew Metrics
to Ramp Inputs for RC Trees. IEEE Transactions on Computer-Aided Design
of Integrated Circuits and Systems 23.4, pp. 509–516.
[KRY95] Samir Khuller, Balaji Raghavachari, and Neal Young (1995). Balancing
Minimum Spanning Trees and Shortest-Path Trees. Algorithmica 14.4,
pp. 305–321.
[Kie16] Annika K Kiefner (2016). Minimizing path lengths in rectilinear Steiner
minimum trees with fixed topology. Operations Research Letters 44,
pp. 835–838.
[KRV07] Bernhard Korte, Dieter Rautenbach, and Jens Vygen (2007). BonnTools:
Mathematical Innovation for Layout and Timing Closure of Systems on a Chip.
Proceedings of the IEEE 95.3, pp. 555–572.
[KV12] Bernhard Korte and Jens Vygen (2012). Combinatorial Optimization: Theory
and Algorithms. 5th edition. Springer.
[Kra49] Leon G Kraft (1949). A Device for Quantizing, Grouping and Coding
Amplitude Modulated Pulses. MA thesis. MIT, Cambridge.
[LS06] Zhuo Li and Weiping Shi (2006). An O(b · n²) Time Algorithm for Optimal
Buffer Insertion with b Buffer Types. IEEE Transactions on Computer-Aided
Design of Integrated Circuits and Systems 25.3, pp. 484–489.
[LZS12] Zhuo Li, Ying Zhou, and Weiping Shi (2012). O(mn) Time Algorithm for
Optimal Buffer Insertion of Nets With m Sinks. IEEE Transactions on
Computer-Aided Design of Integrated Circuits and Systems 31.3, pp. 437–441.
[LCL96] John Lillis, Chung-Kuan Cheng, and Ting-Ting Y Lin (1996). Optimal Wire
Sizing and Buffer Insertion for Low Power and a Generalized Delay Model.
IEEE Journal of Solid-State Circuits 31.3, pp. 437–447.
[Lin+08] Chung-Wei Lin, Szu-Yu Chen, Chi-Feng Li, Yao-Wen Chang, and Chia-Lin
Yang (2008). Obstacle-Avoiding Rectilinear Steiner Tree Construction Based on
Spanning Graphs. IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems 27.4, pp. 643–653.
[Liu+09] Chih-Hung Liu, Shih-Yi Yuan, Sy-Yen Kuo, and Szu-Chi Wang (2009).
High-Performance Obstacle-Avoiding Rectilinear Steiner Tree Construction. ACM
Transactions on Design Automation of Electronic Systems (TODAES) 14.3,
p. 45.
[LZM08] Jieyi Long, Hai Zhou, and Seda Ogrenci Memik (2008). An O(n log n)
Edge-Based Algorithm for Obstacle-Avoiding Rectilinear Steiner Tree Construction.
Proceedings of the 2008 International Symposium on Physical Design. ACM,
pp. 126–133.
[LR00] Bing Lu and Lu Ruan (2000). Polynomial Time Approximation Scheme for
the Rectilinear Steiner Arborescence Problem. Journal of Combinatorial
Optimization 4.3, pp. 357–363.
[Maß15] Jens Maßberg (2015). The rectilinear Steiner tree problem with given topology
and length restrictions. International Computing and Combinatorics
Conference. Springer, pp. 445–456.
[MV08] Jens Maßberg and Jens Vygen (2008). Approximation Algorithms for a Facility
Location Problem with Service Capacities. ACM Transactions on Algorithms
(TALG) 4.4, p. 50.
[Meh88] Kurt Mehlhorn (1988). A Faster Approximation Algorithm for the Steiner
Problem in Graphs. Information Processing Letters 27.3, pp. 125–128.
[MMP00] Adam Meyerson, Kamesh Munagala, and Serge Plotkin (2000). Cost-Distance:
Two Metric Network Design. Proceedings of the 41st Annual Symposium on
Foundations of Computer Science. IEEE, pp. 624–630.
[Moo59] Edward F Moore (1959). The Shortest Path Through a Maze. Proceedings of
the International Symposium on the Theory of Switching, Part II. Harvard
University Press, pp. 285–292.
[Mül06] Dirk Müller (2006). Optimizing Yield in Global Routing. Proceedings of
the IEEE/ACM International Conference on Computer-Aided Design. IEEE,
pp. 480–486.
[Mül09] Dirk Müller (2009). Fast Resource Sharing in VLSI Design. PhD thesis, Research
Institute for Discrete Mathematics, University of Bonn, Germany.
[MRV11] Dirk Müller, Klaus Radke, and Jens Vygen (2011). Faster min–max resource
sharing in theory and practice. Mathematical Programming Computation 3.1,
pp. 1–35.
[MP03] Matthias Müller-Hannemann and Sven Peyer (2003). Approximation of
Rectilinear Steiner Trees with Length Restrictions on Obstacles. Workshop on
Algorithms and Data Structures. Springer, pp. 207–218.
[Nic66] T Alastair J Nicholson (1966). Finding the shortest route between two points
in a network. The Computer Journal 9.3, pp. 275–280.
[OC96] Takumi Okamoto and Jason Cong (1996). Interconnect Layout Optimization
by Simultaneous Steiner Tree Construction and Buffer Insertion. Proceedings
of the Fifth ACM/SIGDA Physical Design Workshop. Citeseer, pp. 1–6.
[Per16] Rodion Permin (2016). A Near-Optimum Algorithm for Cost-Based Buffering.
Master’s thesis, Research Institute for Discrete Mathematics, University of
Bonn, Germany.
[Pri57] Robert C Prim (1957). Shortest connection networks and some generalizations.
Bell Labs Technical Journal 36.6, pp. 1389–1401.
[RT87] Prabhakar Raghavan and Clark D Thompson (1987). Randomized Rounding: a
Technique for Provably Good Algorithms and Algorithmic Proofs.
Combinatorica 7.4, pp. 365–374.
[Rao+92] Sailesh K Rao, P Sadayappan, Frank K Hwang, and Peter W Shor (1992). The
Rectilinear Steiner Arborescence Problem. Algorithmica 7.1-6, pp. 277–288.
[RP94] Curtis L Ratzlaff and Lawrence T Pillage (1994). RICE: Rapid Interconnect
Circuit Evaluation using AWE. IEEE Transactions on Computer-Aided Design
of Integrated Circuits and Systems 13.6, pp. 763–776.
[Roc16] Benjamin M Rockel (2016). Optimale Shallow-Light-Steiner-Arboreszenzen.
Bachelor’s thesis, Research Institute for Discrete Mathematics, University of
Bonn, Germany.
[Rom15] Daniel Romen (2015). Cost-Based Buffering for Multiple Resources.
Master’s thesis, Research Institute for Discrete Mathematics, University of Bonn,
Germany.
[Rot12] Daniel Rotter (2012). Light and Fast Repeater Tree Topologies. Master’s thesis,
Research Institute for Discrete Mathematics, University of Bonn, Germany.
[Sac15] Pietro Saccardi (2015). Global routing with exact pin positions. Master’s thesis,
Research Institute for Discrete Mathematics, University of Bonn, Germany.
[Sam+15] Radhamanjari Samanta, Adil I Erzin, Soumyendu Raha, Yuriy V Shamardin,
Ivan I Takhonov, and Vyacheslav V Zalyubovskiy (2015). A provably tight
delay-driven concurrently congestion mitigating global routing algorithm. Applied
Mathematics and Computation 255, pp. 92–104.
[Sap04] Sachin Sapatnekar (2004). Timing. Springer Science & Business Media.
[Sch14] Rudolf Scheifele (2014). Steiner trees with bounded RC-delay. Proceedings of
the International Workshop on Approximation and Online Algorithms (WAOA).
Springer, pp. 224–235.
[Sch15] Ulrike Schorr (2015). Algorithms for Circuit Sizing in VLSI Design. PhD thesis,
Research Institute for Discrete Mathematics, University of Bonn, Germany.
[Sch09] Werner Schwärzler (2009). On the Complexity of the Planar Edge-Disjoint
Paths Problem with Terminals on the Outer Boundary. Combinatorica 29.1,
pp. 121–126.
[SL05] Weiping Shi and Zhuo Li (2005). A Fast Algorithm for Optimal Buffer
Insertion. IEEE Transactions on Computer-Aided Design of Integrated Circuits
and Systems 24.6, pp. 879–891.
[SLA04] Weiping Shi, Zhuo Li, and Charles J Alpert (2004). Complexity Analysis
and Speedup Techniques for Optimal Buffer Insertion with Minimum Cost.
Proceedings of the 2004 Asia and South Pacific Design Automation Conference.
IEEE, pp. 609–614.
[SS05] Weiping Shi and Chen Su (2005). The Rectilinear Steiner Arborescence Problem
is NP-Complete. SIAM Journal on Computing 35.3, pp. 729–740.
[Tov84] Craig A Tovey (1984). A Simplified NP-Complete Satisfiability Problem.
Discrete Applied Mathematics 8.1, pp. 85–89.
[Van90] Lukas P P P Van Ginneken (1990). Buffer Placement in Distributed RC-Tree
Networks for Minimal Elmore Delay. Proceedings of the IEEE International
Symposium on Circuits and Systems. IEEE, pp. 865–868.
[Vyg04] Jens Vygen (2004). Near-Optimum Global Routing with Coupling, Delay
Bounds, and Power Consumption. Springer.
[Vyg16] Jens Vygen (2016). Chip Design. Lecture Notes. Preliminary version.
[Wei+13] Yaoguang Wei, Zhuo Li, Cliff Sze, Shiyan Hu, Charles J Alpert, and Sachin S
Sapatnekar (2013). CATALYST: Planning Layer Directives for Effective Design
Closure. Proceedings of the Conference on Design, Automation and Test in
Europe (DATE). IEEE, pp. 1873–1878.
[Yan+06] Jin-Tai Yan, Yen-Hsiang Chen, Chia-Fang Lee, and Ming-Ching Huang (2006).
Multilevel Timing-Constrained Full-Chip Routing in Hierarchical Quad-Grid
Model. Proceedings of the IEEE International Symposium on Circuits and
Systems (ISCAS).
[YL04] Jin-Tai Yan and Shun-Hua Lin (2004). Timing-Constrained Congestion-Driven
Global Routing. Proceedings of the 2004 Asia and South Pacific Design
Automation Conference. IEEE, pp. 683–686.
