Graph processing hardware accelerator for shortest path algorithms in nanometer very large-scale integration interconnect routing by Ch'ng, Heng Sun
  
GRAPH PROCESSING HARDWARE ACCELERATOR FOR SHORTEST PATH 
ALGORITHMS IN NANOMETER VERY LARGE-SCALE INTEGRATION 
INTERCONNECT ROUTING 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
CH’NG HENG SUN 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
UNIVERSITI TEKNOLOGI MALAYSIA
  
 
Graph Processing Hardware Accelerator for Shortest Path 
Algorithms in Nanometer Very Large-Scale Integration 
Interconnect Routing
2006/2007
CH’NG HENG SUN 
NO. 11, JALAN INDAH 7,  
TAMAN KURAU INDAH, 
34350 KUALA KURAU, PERAK. 
PROF. DR. MOHAMED KHALIL 
MOHD. HANI 
29 MAY 2007 29 MAY 2007 
  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
“I hereby declare that I have read this thesis and in my 
opinion this thesis is sufficient in terms of scope and quality for the 
award of the degree of Master of Engineering (Electrical)” 
 
 
 
 
 
Signature  : ___________________________________ 
Supervisor  : Prof. Dr. Mohamed Khalil Mohd. Hani 
Date   : 29 MAY 2007
PART A – Confirmation of Collaboration* 
 
It is hereby certified that this thesis research project was carried out through 
collaboration between ______________________ and _________________________ 
Certified by: 
Signature :………………………………………………… Date :………… 
Name  :………………………………………………… 
Position :………………………………………………… 
(Official stamp) 
 
* If the preparation of the thesis/project involved collaboration. 
 
PART B – For the Use of the Office of the Faculty of Electrical Engineering 
 
This thesis has been examined and approved by: 
 
Name and Address of  
External Examiner : Assoc. Prof. Dr. Abdul Rahman bin Ramli 
E013, Blok E, 
Fakulti Kejuruteraan, 
Universiti Putra Malaysia, 
43400 UPM Serdang, 
Selangor. 
 
Name and Address of  
Internal Examiner I : Prof. Dr. Abu Khari bin A’in 
Fakulti Kejuruteraan, 
Universiti Teknologi Malaysia, 
81310 UTM Skudai, 
Johor. 
 
Internal Examiner II :  
 
Name of other Supervisor : 
(if any) 
 
Certified by the Deputy Dean (Postgraduate Studies & Research) / Head of 
Department of the Postgraduate Studies Programme: 
Signature : ………………………………………..  Date :………………... 
Name  : ………………………………………..
GRAPH PROCESSING HARDWARE ACCELERATOR FOR SHORTEST PATH 
ALGORITHMS IN NANOMETER VERY LARGE-SCALE INTEGRATION 
INTERCONNECT ROUTING 
 
 
 
 
 
 
 
CH’NG HENG SUN 
 
 
 
 
 
 
A thesis submitted in fulfilment of the  
requirements for the award of the degree of  
Master of Engineering (Electrical) 
 
 
 
 
 
 
Faculty of Electrical Engineering 
Universiti Teknologi Malaysia 
 
 
 
 
 
 
MAY 2007
  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
I declare that this thesis entitled “Graph Processing Hardware Accelerator for 
Shortest Path Algorithms in Nanometer Very Large-Scale Integration Interconnect 
Routing” is the result of my own research except as cited in references. The thesis 
has not been accepted for any degree and is not concurrently submitted in 
candidature of any other degree. 
 
 
 
 
  Signature  : ______________________________ 
  Name of Candidate : CH’NG HENG SUN 
  Date   : 29 MAY 2007 
  
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
Specially dedicated to 
 my beloved family 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
  
 
ACKNOWLEDGEMENTS 
 
 
 
 
First and foremost, I would like to extend my deepest gratitude to Professor 
Dr. Mohamed Khalil bin Haji Mohd Hani for giving me the opportunity to explore 
new grounds in the computer-aided design of electronic systems without getting lost 
in the process. His constant encouragement, support and guidance were key to 
bringing this project to a fruitful completion. I have learnt and gained much in my 
two years with him, not only in the field of research, but also in the lessons of life. 
 
 
My sincerest appreciation goes out to all those who have contributed directly 
and indirectly to the completion of this research and thesis. Of particular mention are 
lecturer Encik Nasir Shaikh Husin, for his sincere guidance, and the VLSI-ECAD lab 
technicians, En. Zulkifli bin Che Embong and En. Khomarudden bin Mohd Khair 
Juhari, for creating a conducive learning and research environment in the lab. 
 
 
Many thanks are due to past and present members of our research group at 
the VLSI-ECAD lab. I am especially thankful to my colleagues Hau, Chew, Illiasaak and 
Shikin for providing a supportive and productive environment during the course of 
my stay at UTM. At the same time, the constant encouragement and camaraderie 
shared among all my friends on campus made life at UTM an enriching experience. 
 
 
Finally, I would like to express my love and appreciation to my family, who 
have shown unrelenting care and support throughout this challenging endeavour. 
 
 
 
  
 
ABSTRACT 
 
 
 
 
Graphs are pervasive data structures in computer science, and algorithms 
working with them are fundamental to the field. Many challenging problems in Very 
Large-Scale Integration (VLSI) physical design automation are modeled using 
graphs. The routing problems in VLSI physical design are, in essence, shortest path 
problems in special graphs. It has been shown that the performance of a graph-based 
shortest path algorithm can be severely affected by the performance of its priority 
queue. This thesis proposes a graph processing hardware accelerator for shortest path 
algorithms applied in nanometer VLSI interconnect routing problems. A custom 
Graph Processing Unit (GPU), in which a hardware priority queue accelerator is 
embedded, is designed and prototyped on a Field Programmable Gate Array (FPGA) 
based hardware platform. The proposed hardware priority queue accelerator is 
designed to be parameterizable and theoretically cascadable. It is also designed for 
high performance, and it exhibits constant run-time complexity for an INSERT (or 
EXTRACT) queue operation. In order to utilize the high performance 
hardware priority queue module, modifications have to be made to the graph-based 
shortest path algorithm. In hardware, the priority queue size is constrained by the 
available logic resources. Consequently, this thesis also proposes a hybrid 
software-hardware priority queue, which redirects priority queue entries to a software 
priority queue when the hardware priority queue module exceeds its queue size limit. 
For design validation and performance test purposes, a computationally expensive VLSI 
interconnect routing Computer Aided Design (CAD) module is developed. Results of 
the performance tests on the proposed hardware graph accelerator show that graph 
computations are significantly improved in terms of algorithm complexity and 
execution speed. 
 
  
 
ABSTRAK 
 
 
 
 
Graphs are pervasive data structures in computer science, and algorithms 
that work with them are at the core of the field. Most of the challenging problems 
in Very Large-Scale Integration (VLSI) physical design automation are modeled as 
graphs. Many wire routing problems in VLSI physical design involve finding 
shortest paths in special graphs. It has also been shown that the performance of a 
graph-based shortest path algorithm is influenced by the performance of its priority 
queue. This thesis proposes graph processing hardware to accelerate graph 
computation in shortest path problems. A Graph Processing Unit (GPU), in which a 
hardware priority queue accelerator module is embedded, is prototyped on 
reconfigurable Field Programmable Gate Array (FPGA) hardware. The hardware 
priority queue accelerator module is designed to be easily modified; it offers high 
performance and provides constant run-time complexity for each INSERT or 
EXTRACT operation. To make use of this high-performance hardware priority queue 
accelerator, modifications are also made to the graph algorithm. In hardware, the 
priority queue size is constrained by the available logic resources. This thesis also 
proposes a hybrid hardware-software priority queue accelerator, in which insertions 
into the hardware priority queue accelerator are redirected to software when the 
hardware priority queue accelerator cannot accommodate them. For design 
verification and performance testing, a VLSI wire routing Computer Aided Design 
(CAD) computation module is developed. The results of this thesis show that the 
proposed hardware accelerator speeds up graph computation, in terms of both 
algorithm complexity and execution time.
  
 
 
 
 
 
 
TABLE OF CONTENTS 
 
 
 
 
CHAPTER TITLE 
 
 DECLARATION 
 DEDICATION 
 ACKNOWLEDGEMENTS 
 ABSTRACT 
 ABSTRAK 
 TABLE OF CONTENTS 
 LIST OF TABLES 
 LIST OF FIGURES 
 LIST OF SYMBOLS 
 LIST OF APPENDICES 
 
1 INTRODUCTION 
 1.1 Background 
 1.2 Problem Statement 
 1.3 Objectives 
 1.4 Scope of Work 
 1.5 Previous Related Work 
   1.5.1 Hardware Maze Router and Graph Accelerator 
   1.5.2 Priority Queue Implementation 
 1.6 Significance of Research 
 1.7 Thesis Organization 
 1.8 Summary 
 
2 THEORY AND RESEARCH BACKGROUND 
 2.1 Graph 
 2.2 Graph-based Shortest Path Algorithm 
 2.3 Priority Queue 
 2.4 Priority Queue and Dijkstra’s Shortest Path Algorithm 
 2.5 Modeling of VLSI Interconnect Routing as a Shortest Path Problem 
 2.6 Summary 
 
3 PRIORITY QUEUE AND GRAPH-BASED SHORTEST PATH PROBLEM – 
  DESCRIPTIONS OF ALGORITHMS 
 3.1 Priority Queue and the Insertion Sort Algorithm 
   3.1.1 Insertion-Sort Priority Queue 
 3.2 Maze Routing with Buffered Elmore Delay Path Optimization 
 3.3 Simultaneous Maze Routing and Buffer Insertion (S-RABI) Algorithm 
   3.3.1 Initial Graph Pruning in S-RABI 
   3.3.2 Dijkstra’s Algorithm applied in S-RABI 
   3.3.3 S-RABI in maze routing with buffered interconnect delay optimization 
 3.4 Summary 
 
4 ALGORITHM MODIFICATIONS FOR HARDWARE MAPPING 
 4.1 Modification in graph algorithm to remove DECREASE-KEY operation 
 4.2 Modifications in Dijkstra’s and S-RABI algorithm 
 4.3 Modification of Insertion Sort Priority Queue 
 4.4 Summary 
 
5 THE GRAPH PROCESSING UNIT 
 5.1 Introduction 
 5.2 System Architecture of Graph Processing Unit (GPU) 
 5.3 Priority Queue Accelerator Module 
   5.3.1 Specification and Conceptual Design of hwPQ 
   5.3.2 Specification and Conceptual Design of Avalon Interface Unit 
 5.4 hwPQ Device Driver 
 5.5 Hybrid Hardware-Software Priority Queue (HybridPQ) 
 
6 DESIGN OF PRIORITY QUEUE ACCELERATOR MODULE 
 6.1 Hardware Priority Queue Unit (hwPQ) 
   6.1.1 The design of Processing Element – RTL Design 
 6.2 Pipelining in hwPQ 
   6.2.1 Data Hazards in the Pipeline 
 6.3 Timing Specifications of hwPQ 
 6.4 Avalon Interface Unit – Design Requirement 
 6.5 Avalon Interface Unit – RTL Design 
   6.5.1 Avalon Data Unit 
   6.5.2 Avalon Control Unit 
 
7 SIMULATION, HARDWARE TEST AND PERFORMANCE EVALUATION 
 7.1 Design Verification through Timing Simulation 
   7.1.1 Simulation of Priority Queue Accelerator Module 
 7.2 Hardware Test 
 7.3 Comparison with priority queue software implementation 
 7.4 Comparison with other priority queue hardware design 
 7.5 Performance Evaluation Platform 
 7.6 Performance of Priority Queue in Graph Computation 
   7.6.1 Worst Case Analysis 
   7.6.2 Practical Case Analysis 
 7.7 Summary 
 
8 CONCLUSIONS 
 8.1 Concluding Remarks 
 8.2 Recommendations for Future Work 
 
REFERENCES 
 
Appendices A - I 
   
 
  
 
LIST OF TABLES 
  
 
 
 
TABLE NO TITLE 
 
2.1 Run-time complexity for each operation among different heap data structures 
5.1 Avalon System Bus signal descriptions 
5.2 Memory-mapped Register descriptions 
6.1 IO Port Specifications of hwPQ 
7.1 Set of Test Vectors 
7.2 Resource Utilization and Performance of hwPQ 
7.3 Comparison in Run-Time Complexity 
7.4 Comparison in Number of Processor Cycles 
7.5 Speed Up Gain by Priority Queue Accelerator Module 
7.6 Comparison with other hardware implementations 
7.7 Number of elapsed clock cycles per operation 
8.1 Features of Hardware Priority Queue Unit (hwPQ) 
  
 
 
 
 
 
 
LIST OF FIGURES 
 
 
 
 
FIGURE NO TITLE 
 
1.1 System Architecture 
2.1 Two representations of an undirected graph 
2.2 Two representations of a directed graph 
2.3 A weighted graph 
2.4 Shortest Path and Shortest Unit Path 
2.5 Basic Operations of Priority Queue 
2.6 Simplest way to implement Priority Queue 
2.7 Priority Queue implemented as array or as heap 
2.8 Set, Graph, Tree and Heap 
2.9 Example of Binomial-Heap and Fibonacci-Heap 
2.10 Function RELAX ( ) 
2.11 Relaxation 
2.12 Dijkstra’s Shortest Path Algorithm 
2.13 Illustration of Dijkstra’s algorithm 
2.14 Illustration of the final execution result 
2.15 VLSI layout represented in grid-graph 
2.16 VLSI Routing as shortest unit path problem 
2.17 Parallel expansion in Lee’s algorithm 
2.18 VLSI Routing as shortest path (minimum-delay) problem 
3.1 Insertion-Sort Algorithm 
3.2 Insertion-Sort Priority Queue Algorithm 
3.3 Operations in Insertion-Sort Priority Queue 
3.4 A typical routing grid-graph 
3.5 Typical maze routing algorithm with buffered delay path optimization 
3.6 Elmore Delay Model 
3.7 Elmore Delay in hop-by-hop maze routing 
3.8 Elmore Delay for buffer insertion in hop-by-hop maze routing 
3.9 Graph pruning 
3.10 Hop-by-hop Dijkstra’s Algorithm 
3.11 Function Cost ( ) 
3.12 Function InsertCandidate ( ) 
3.13 Simultaneous Maze Routing and Buffer Insertion (S-RABI) 
4.1 DECREASE-KEY and Relaxation 
4.2 Function DECREASE-KEY ( ) 
4.3 INSERT in Relaxation 
4.4 EXTRACT in Relaxation 
4.5 Modification rules to remove DECREASE-KEY 
4.6 Modified Dijkstra’s Algorithm – without DECREASE-KEY 
4.7 Modified InsertCandidate ( ) 
4.8 Modified S-RABI Algorithm 
4.9 Further optimization to reduce overhead 
4.10 One-dimensional Systolic Array Architecture 
4.11 Execution of identical task-cycles for one operation 
4.12 Series of operations executed in pipeline 
4.13 Modified Insertion-Sort Priority Queue 
4.14 Example of INSERT_MOD operation 
4.15 INSERT_MOD in identical sub-tasks of Compare-and-Right-Shift 
5.1 NIOS II System Architecture 
5.2 Different layers of software components in NIOS II System 
5.3 Top-Level Architecture of Graph Processing Unit 
5.4 GPU – Software/Hardware System Partitioning 
5.5 Functional Block Diagram of Priority Queue Accelerator Module 
5.6 Top-Level Description of hwPQ 
5.7 Memory-mapped IO of Avalon Slave Peripheral 
5.8 Functional Block Diagram of Avalon Interface Unit 
5.9 Programming Model of Priority Queue Accelerator Module 
5.10 Device driver routine for INSERT operation 
5.11 Device driver routine for EXTRACT operation 
5.12 Device driver routine for PEEK operation 
5.13 Device driver routine for DELETE operation 
5.14 Software Abstraction Layer of HybridPQ 
5.15 Functional Block Diagram of HybridPQ 
5.16 INSERT control mechanism in HybridPQ 
5.17 EXTRACT control mechanism in HybridPQ 
5.18 Functions provided in HybridPQ 
6.1 Top-Level Functional Block Diagram of Priority Queue Accelerator Module 
6.2 Compare and right-shift tasks in an INSERT operation 
6.3 Left-shift tasks in an EXTRACT operation 
6.4 Hardware Priority Queue Unit 
6.5 INSERT operation in systolic array based hwPQ 
6.6 Execution of identical tasks for one operation 
6.7 Idle and left-shift tasks in EXTRACT 
6.8 RTL Architecture of Processing Element 
6.9 Communication between PEs 
6.10 Behavioral Description of PE 
6.11 RTL Control Sequence of PE 
6.12 Series of operations executed in pipeline 
6.13 Pipelined execution of multiple INSERT 
6.14 Pipelined execution of multiple EXTRACT 
6.15 Symbolic representation of PEs in hwPQ 
6.16 Example of INSERT followed by EXTRACT 
6.17 Example of INSERT → NOP → EXTRACT 
6.18 Several ways to insert idle state 
6.19 Hardware Priority Queue Unit (hwPQ) 
6.20 Timing Specification of hwPQ 
6.21 Communication rule for RESET operation 
6.22 Communication rule for INSERT operation 
6.23 Communication rule for EXTRACT operation 
6.24 Functional Block Diagram of Avalon Interface Unit 
6.25 Functional Block Diagram of Avalon Data Unit 
6.26 Behavioral Description of Avalon Data Unit 
6.27 Functional Block Diagram of Avalon Control Unit 
6.28 Behavioral Description of Avalon Control Unit 
6.29 Control Flowchart of Avalon Control Unit 
6.30 State Diagram of Avalon Control Unit 
7.1 Simulation of Priority Queue Accelerator Module 
7.2 Hardware Test Result 
7.3 Overview of demonstration prototype 
7.4 GUI of “VLSI Maze Routing DEMO” application 
7.5 TPQ VS Entire Graph Computation Run-Time 
7.6 Size of Priority Queue for Entire Graph Computation 
7.7 Dijkstra’s – Maximum Queue Size VS Graph Size 
7.8 S-RABI – Maximum Queue Size VS Graph Size 
7.9 Dijkstra’s – Total number of operations VS Graph Size 
7.10 S-RABI – Total number of operations VS Graph Size 
7.11 S-RABI (FHPQ): Number of operations VS Graph Size 
7.12 S-RABI (FHPQ): Total Cycle Elapsed for each operation 
7.13 Dijkstra’s – Speed up Gain of using HybridPQ 
7.14 S-RABI – Speed up gain of using HybridPQ 
7.15 S-RABI – FHPQ: Maximum Queue Size VS Graph Size 
7.16 S-RABI – HybridPQ: Maximum Queue Size VS Graph Size 
7.17 High Dense – S-RABI: Speed up gain of using HybridPQ 
7.18 Less Dense – S-RABI: Speed up gain of using HybridPQ 
7.19 S-RABI – HybridPQ: Speed up gain VS Maximum Queue Size 
7.20 Dijkstra’s – HybridPQ: Speed up Gain VS Maximum Queue Size 
  
 
LIST OF SYMBOLS 
 
 
 
 
 
API - Application Programming Interface 
ASIC - Application Specific Integrated Circuit 
CAD - Computer Aided Design 
EDA - Electronic Design Automation 
FPGA - Field Programmable Gate Array 
GUI - Graphical User Interface 
HDL - Hardware Description Language 
IDE - Integrated Development Environment 
I/O - Input/Output 
LE - Logic Element 
MHz - Megahertz 
PC - Personal Computer 
PE - Processing Element 
RAM - Random Access Memory 
RTL - Register Transfer Level 
SoC - System-on-Chip 
SOPC - System-on-Programmable-Chip 
UART - Universal Asynchronous Receiver Transmitter 
UTM - Universiti Teknologi Malaysia 
VHDL - Very High Speed Integrated Circuit Hardware Description Language 
VLSI  - Very Large Scale Integration 
 
  
 
LIST OF APPENDICES 
 
 
 
 
APPENDIX TITLE 
 
A Numerical Example of Dijkstra’s Algorithm 
B Numerical Example of hop-by-hop Dijkstra’s Algorithm 
C Numerical Example of S-RABI Algorithm 
D Numerical Example of the Insertion Sort Priority Queue Operation 
E Introduction to Altera Nios II Development System 
F VHDL Source Codes of Priority Queue Accelerator Module 
G C Source Code for hwPQ device driver and HybridPQ API 
H Sample Graphs for Performance Test and Evaluation 
I Design Verification – Simulation Waveform 
 
CHAPTER 1 
 
 
 
 
INTRODUCTION 
 
 
 
 
This thesis proposes a graph processing hardware accelerator for shortest path 
algorithms applied in nanometer VLSI interconnect routing problems. A custom 
Graph Processing Unit (GPU), in which a hardware priority queue accelerator 
module is embedded, is designed and prototyped on a reconfigurable FPGA-based 
hardware platform. The hardware priority queue accelerator off-loads and speeds up 
graph-based shortest path computations. For design validation and performance test 
purposes, a computationally expensive VLSI interconnect routing CAD module (or 
EDA sub-system) is developed to execute on the proposed GPU. This chapter 
introduces the background of the research, the problem statement, objectives, scope 
of work, previous related work and the significance of this research. The organization 
of the thesis is summarized at the end of the chapter. 
 
 
 
 
1.1 Background 
 
 
Graphs are pervasive data structures in computer science, and algorithms 
working with them are fundamental to the field. There are many graph algorithms, 
and the well-established ones include Depth-First Search, Breadth-First Search, 
Topological Sort, the Spanning Tree algorithm, Dijkstra’s algorithm, the Bellman-Ford 
algorithm and the Floyd-Warshall algorithm. Several of these are shortest path 
algorithms, or are closely related to them. For instance, Dijkstra’s algorithm is an 
extension of the Breadth-First Search algorithm: the former solves the shortest path 
problem on weighted graphs, while the latter solves the shortest unit path problem on 
unweighted graphs. The Bellman-Ford algorithm and Dijkstra’s algorithm both solve 
the single-source shortest path problem; the former handles graphs with negative 
edge weights, while the latter is restricted to graphs with non-negative edge weights. 
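
The distinction between the shortest unit path and the weighted shortest path can 
be illustrated with a minimal Python sketch. The function names and the sample 
graph below are hypothetical and chosen only for illustration; they are not taken 
from the thesis. The weighted search uses the standard-library binary heap as its 
priority queue. 

```python
import heapq
from collections import deque

def bfs_hops(adj, source):
    """Shortest unit path: fewest edges from source, edge weights ignored."""
    hops = {source: 0}
    frontier = deque([source])
    while frontier:
        u = frontier.popleft()
        for v, _weight in adj[u]:
            if v not in hops:
                hops[v] = hops[u] + 1
                frontier.append(v)
    return hops

def dijkstra(adj, source):
    """Shortest weighted path; assumes all edge weights are non-negative."""
    dist = {source: 0}
    heap = [(0, source)]              # priority queue of (distance, vertex)
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                  # outdated queue entry, skip it
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

# Illustrative directed graph: vertex -> list of (neighbour, weight).
graph = {
    "s": [("a", 7), ("b", 1)],
    "a": [("t", 1)],
    "b": [("c", 1)],
    "c": [("t", 1)],
    "t": [],
}
```

On this sample graph the path with the fewest edges to `t` goes through `a`, while 
the path with the lowest total weight goes through `b` and `c`. 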
 
 
Many interesting problems in VLSI physical design automation are modeled 
using graphs. Hence, VLSI electronic design automation (EDA) systems are based 
on graph algorithms. These algorithms include, among others, Min-Cut and Max-Cut 
algorithms for logic partitioning and placement, the Clock Skew Scheduling 
algorithm for useful-skew clock tree synthesis, the Minimum Steiner Tree algorithm 
and the Minimum Spanning Tree algorithm for critical/global interconnect network 
synthesis, and the Maze Routing algorithm for point-to-point interconnect routing. 
Many routing problems in VLSI physical design are, in essence, shortest path 
problems in special graphs. Shortest path problems, therefore, play a significant role 
in global and detailed routing algorithms (Sherwani, 1995). 
 
 
Real-world problems modeled as mathematical sets can be mapped into 
graphs, where the elements of a set are represented by vertices, and the relation 
between any two elements is represented by an edge. The run-time complexity and 
memory consumption of graph algorithms are expressed in terms of the number of 
vertices and edges. A graph searching algorithm can discover much about the 
structure of a graph. Searching a graph means systematically following the edges of 
the graph so as to visit its vertices. Many graph algorithms are organized as simple 
elaborations of basic graph searching algorithms (Cormen et al., 2001). Hence, the 
technique of searching in a graph is at the heart of these algorithms. In the graph 
searching process, priority queues are used to maintain the tentative search results, 
which can grow very large as the graph size increases. Consequently, the 
implementation of these priority queues can significantly affect the run-time and 
memory consumption of a graph algorithm (Skiena, 1997). 
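
How the priority queue grows with the graph can be observed with a small 
experiment. The following Python sketch is an assumption made for illustration, 
not the thesis's test setup: it runs a heap-based shortest path search from one 
corner of an n x n unit-weight grid-graph and records the largest number of 
entries held in the queue at any time. The function names are hypothetical. 

```python
import heapq

def grid_neighbors(n, r, c):
    """Yield the 4-connected neighbours of cell (r, c) inside an n x n grid."""
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        rr, cc = r + dr, c + dc
        if 0 <= rr < n and 0 <= cc < n:
            yield rr, cc

def search_with_queue_stats(n):
    """Search from corner (0, 0) of an n x n grid with unit edge weights.

    Returns (distance to the opposite corner, peak priority queue length).
    """
    dist = {(0, 0): 0}
    heap = [(0, (0, 0))]
    peak = 1
    while heap:
        d, (r, c) = heapq.heappop(heap)
        if d > dist[(r, c)]:
            continue                      # outdated entry
        for v in grid_neighbors(n, r, c):
            if d + 1 < dist.get(v, float("inf")):
                dist[v] = d + 1
                heapq.heappush(heap, (d + 1, v))
                peak = max(peak, len(heap))
    return dist[(n - 1, n - 1)], peak

for n in (4, 8, 16):
    print(n, search_with_queue_stats(n))
```

The peak queue length reported by this sketch increases with the grid size, which 
is the effect the paragraph above describes. 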
 
 
 
 
 
 
 
1.2 Problem Statement 
 
 
According to Moore’s Law, the number of transistors in an Integrated Circuit 
(IC) at minimum cost per transistor doubles roughly every 18 months. Achieving 
minimum cost per transistor entails enormous design effort and high non-recurring 
engineering (NRE) cost. Design complexity grows in proportion to the increase 
in transistor density, and circuit engineers consequently face tremendous design 
challenges. As physical design moves into the nanometer range of circuit integration, 
a combinatorial explosion of design issues arises, involving signal integrity, 
interconnect delay and lithography. These issues not only challenge effective design 
automation, but also heighten the need to contain NRE cost, which in turn increases 
the demand for Electronic Design Automation (EDA) tools. 
 
 
Conventional interconnect routing is rather straightforward, and hence does 
not pose too great a challenge to the development of algorithms. However, the 
continual miniaturization of technology has increased the influence of 
interconnect delay. According to the simple scaling rule (Bakoglu, 1990), when 
devices and interconnects are scaled down in all three dimensions by a factor of S, 
the intrinsic gate delay is reduced by a factor of S but the delay caused by 
interconnect increases by a factor of S². As devices operate at higher speeds, the 
interconnect delay becomes even more significant. As a result, interconnect delay has 
become the dominant factor affecting system performance. In many system designs 
targeting 0.35 µm to 0.5 µm technology, as much as 50% to 70% of the clock cycle is 
consumed by interconnect delay. This figure will continue to rise as the technology 
feature size decreases further (Cong et al., 1996). Consequently, the effect of 
interconnect delay can no longer be ignored in nanometer VLSI physical design. 
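
The scaling rule quoted above can be worked through numerically. The short Python 
sketch below takes the rule exactly as stated in the text (gate delay divided by 
S, interconnect delay multiplied by S²) and applies it to normalized delays of 
1.0; the starting values and the function name are illustrative assumptions, not 
measured data. 

```python
def scale(gate_delay, wire_delay, s):
    """Apply the quoted scaling rule for a scaling factor s."""
    return gate_delay / s, wire_delay * s ** 2

gate, wire = 1.0, 1.0   # normalized, illustrative starting delays
for s in (1, 2, 4):
    g, w = scale(gate, wire, s)
    print(f"S={s}: gate={g:.3f}, interconnect={w:.1f}, ratio={w / g:.0f}")
```

Under this rule the ratio of interconnect delay to gate delay grows as S³, which 
is why the interconnect term comes to dominate. 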
 
 
Many techniques are employed to reduce interconnect delay; among them, 
buffer insertion has been shown to be an effective approach (Ginneken, 1990). Hence, 
in contrast to conventional routing, which considers only wires, nanometer VLSI 
interconnect routing considers both buffer insertion and wire sizing along the 
interconnect path in order to achieve minimum interconnect delay. The 
complexity of nanometer interconnect routing is therefore greater and, in fact, grows 
exponentially when multiple buffer choices and wire sizes (at different metal layers, 
with different width and depth) are considered as potential interconnect candidates at 
each point along the interconnect path. 
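
The exponential growth can be made concrete with a simple counting model. The 
Python sketch below assumes, purely for illustration, that every point along a 
path independently selects one buffer option (with "no buffer" counted as an 
option) and one wire size; the function and parameter names are hypothetical. 

```python
def candidate_count(points, buffer_choices, wire_sizes):
    """Number of distinct buffer/wire assignments along one path.

    Assumes each point picks independently among `buffer_choices`
    buffer options and `wire_sizes` wire options.
    """
    return (buffer_choices * wire_sizes) ** points

for points in (5, 10, 20):
    print(points, candidate_count(points, buffer_choices=3, wire_sizes=4))
```

Each additional point multiplies the number of candidates by the number of 
per-point options, so exhaustive enumeration quickly becomes impractical. 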
 
 
In general, given a post-placement VLSI layout, there are restrictions on 
where buffers may be inserted. For instance, it may be possible to route wires over a 
pre-placed macro-cell, but it may not be possible to insert buffers in that region. In 
this case, the routing has not only to minimize the interconnect delay, but also to 
strive simultaneously for good buffer locations, manage buffer density and 
congestion, and perform wire sizing. Consequently, many researchers have proposed 
techniques for simultaneous maze routing with buffer insertion and wire sizing to 
solve the above interconnect routing problem. 
 
 
A number of interconnect routing algorithms have been proposed, with 
different strategies for buffer insertion (Chu and Wong, 1997; Chu and Wong, 1998; 
Chu and Wong, 1999; Dechu et al., 2004; Ginneken, 1990; Lai and Wong, 2002; 
Jagannathan et al., 2002; Nasir, 2005; Zhou et al., 2000). Most of these algorithms 
are formulated as graph-theoretic shortest path algorithms. Because many 
parameters and constraints are involved in VLSI interconnect routing, these 
algorithms are, essentially, multi-weighted, multi-constrained graph search algorithms. 
In graph search, the solution space and search results are maintained 
using priority queues. The choice of priority queue implementation, whether in 
hardware or software, significantly affects the run-time and memory 
consumption of the graph algorithms (Skiena, 1997). 
 
 
 
 
1.3 Objectives 
 
 
The overall objective of this thesis is to propose the design of a graph 
processing hardware accelerator for high-speed computation of graph-based 
algorithms. This objective is modularized into the following sub-objectives: 
 
1) To design a Graph Processing Unit (GPU) customized for high-speed 
computation of graph-based shortest path algorithms. 
 
2) To design a priority queue accelerator module to speed up priority queue 
operations on the above custom GPU. 
 
3) To verify the design and validate the effectiveness of accelerating, via 
hardware, priority queue operations in a graph algorithm. This is derived 
from performance validation studies on the application of the proposed GPU 
executing a compute-intensive VLSI interconnect routing algorithm. 
 
 
 
 
1.4 Scope of Work 
 
 
1) The Graph Processing Unit (GPU) is implemented on an FPGA-based embedded 
system hardware platform, using an Altera Stratix II development board. 
 
2) The priority queue accelerator module has the following features: 
a. It supports the two basic priority queue functions: (i) INSERT and (ii) 
EXTRACT. 
b. It is parameterizable, so that the implemented length of the priority queue 
can be adjusted based on the available logic resources. 
c. It is cascadable, such that further queue length extension is possible. 
d. It stores each queue entry in 64 bits: 32 bits for the priority value 
and 32 bits for the associated identifier. 
 
3) A hybrid hardware-software priority queue is developed. It avoids overflow 
at the hardware priority queue module. 
 
4) A demonstration application prototype is developed to evaluate the design. 
System validation and performance evaluation are derived by examining the 
graph-based shortest path algorithms on this application prototype. Note that: 
a. The test algorithm is S-RABI, the Simultaneous Maze Routing 
and Buffer Insertion algorithm proposed by Nasir et al. (2006). 
b. In order to utilize the hardware priority queue accelerator module 
effectively, the algorithms have to be modified. 
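
The 64-bit queue entry and the overflow-avoiding hybrid queue described in items 
2 and 3 can be sketched in software. The Python model below is only an 
illustration under stated assumptions: the class and function names are 
hypothetical, the bounded "hardware" part is modeled as a sorted list, and the 
EXTRACT policy (compare the heads of both parts) is an assumed one, not 
necessarily the control mechanism used in the thesis. 

```python
import bisect
import heapq

MASK32 = 0xFFFFFFFF

def pack_entry(priority, identifier):
    """Pack a 32-bit priority and a 32-bit identifier into one 64-bit word."""
    return ((priority & MASK32) << 32) | (identifier & MASK32)

def unpack_entry(word):
    """Split a 64-bit word back into (priority, identifier)."""
    return word >> 32, word & MASK32

class HybridPQ:
    """Min-priority queue: bounded 'hardware' part plus a software heap."""

    def __init__(self, hw_capacity):
        self.hw_capacity = hw_capacity
        self.hw = []   # sorted list standing in for the hardware queue
        self.sw = []   # binary heap standing in for the software queue

    def insert(self, priority, identifier):
        word = pack_entry(priority, identifier)
        if len(self.hw) < self.hw_capacity:
            bisect.insort(self.hw, word)
        else:
            heapq.heappush(self.sw, word)   # overflow is redirected to software

    def extract(self):
        """Remove and return the entry with the smallest priority value."""
        # Assumed policy: compare the heads of both parts, take the smaller.
        if self.hw and (not self.sw or self.hw[0] <= self.sw[0]):
            word = self.hw.pop(0)
        else:
            word = heapq.heappop(self.sw)   # raises IndexError if both are empty
        return unpack_entry(word)

pq = HybridPQ(hw_capacity=2)
for priority, identifier in ((5, 1), (3, 2), (4, 3), (1, 4)):
    pq.insert(priority, identifier)
order = [pq.extract() for _ in range(4)]
```

Because the priority occupies the upper 32 bits of the packed word, comparing 
packed words orders entries by priority first, and the identifier only breaks ties. 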
 
 
 
 
1.5 Previous Related Work 
 
 
The areas of hardware maze router design, generic graph accelerator design, 
and priority queue design have received significant attention over the years. In this 
section, this previous related work is reviewed and summarized. 
 
 
 
 
1.5.1 Hardware Maze Router and Graph Accelerator 
 
 
Maze routing is the most fundamental of the VLSI routing algorithms. 
Technically speaking, other routing problems can be decomposed 
into multiple sub-problems and solved with the maze routing algorithm. Many 
hardware maze routers have been proposed, and most of this work exploits the inherent 
parallelism of Lee’s algorithm (Lee, 1961). This includes the Full-Grid Maze Router, 
independently proposed by Nestor (2000), Keshk (1997), and Breuer and Shamsa (1981). 
The architecture accelerates Lee’s algorithm using N × N identical processing elements 
for a worst-case N × N grid-graph; thus, huge hardware resources are consumed. 
Another hardware maze router is the Wave-Front Machine, proposed by Sahni and 
Won (1987), and Suzuki et al. (1986). The Wave-Front Machine uses N 
processing elements and a status map for an N × N grid-graph. 
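
The wave expansion of Lee’s algorithm that these hardware routers parallelize can 
be sketched sequentially. The Python example below is a minimal software 
illustration with hypothetical names and a made-up grid; it labels cells 
wavefront by wavefront and then backtraces from the target. In hardware, the 
cells of one wavefront could be expanded at the same time. 

```python
STEPS = ((1, 0), (-1, 0), (0, 1), (0, -1))

def lee_route(grid, source, target):
    """Return a shortest unit path as a list of cells, or None if blocked.

    grid[r][c] == 1 marks a blocked cell; 0 marks a free cell.
    """
    rows, cols = len(grid), len(grid[0])
    label = {source: 0}
    wavefront = [source]
    while wavefront and target not in label:
        next_front = []
        for r, c in wavefront:            # one wavefront = one expansion step
            for dr, dc in STEPS:
                cell = (r + dr, c + dc)
                if (0 <= cell[0] < rows and 0 <= cell[1] < cols
                        and grid[cell[0]][cell[1]] == 0
                        and cell not in label):
                    label[cell] = label[(r, c)] + 1
                    next_front.append(cell)
        wavefront = next_front
    if target not in label:
        return None
    # Backtrace: repeatedly step to any neighbour whose label is one less.
    path = [target]
    while path[-1] != source:
        r, c = path[-1]
        for dr, dc in STEPS:
            prev = (r + dr, c + dc)
            if label.get(prev) == label[(r, c)] - 1:
                path.append(prev)
                break
    return path[::-1]

layout = [
    [0, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 0, 0],
]
route = lee_route(layout, (0, 0), (2, 0))
```

The row of blocked cells forces the route around the right-hand side of the grid. 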
 
 
A more flexible and practical design, the cellular architecture with Raster 
Pipeline Subarray (RPS), was proposed by Rutenbar (1984a, 1984b). Applying the 
raster scanning concept, the grid-graph is divided into smaller square regions that 
are fed into the RPS. For each square region, the RPS updates the status map. The 
architecture of the RPS is complex but constant for any input size. A systolic array 
implementation of the RPS was later proposed (Rutenbar and Atkins, 1988) for better 
handling of the pipelined data. 
 
 
The above full-custom maze routers are designed specifically for maze routing. 
Another approach to accelerating graph-based shortest path algorithms is via a 
generic graph accelerator. An unweighted graph represented as an adjacency matrix 
can be mapped onto a massively parallel hardware architecture in which each of the 
processing units is a simple bit-machine. The computation of bit-wise graph 
characteristics (reachability, transitive closure, and connected components) can 
then be accelerated. Huelsbergen (2000) proposed such an implementation in FPGA. 
Besides reachability, transitive closure and connected components, the computation 
of the shortest unit path can be accelerated as well. An improved version, the 
Hardware Graph Array (HAGAR), was proposed by Mencer et al. (2002); it uses RAM 
blocks rather than mere logic elements in the FPGA. The architectures proposed by 
Huelsbergen (2000) and Mencer et al. (2002) are actually quite similar to the 
Full-Grid Maze Router, except that the former target more generic 
applications rather than specific VLSI maze routing. 
 
 
In general, however, most graph problems are weighted. The Shortest Path 
Processor proposed by Nasir and Meador (1995, 1996) can be used to solve 
weighted-graph problems. It uses a square-array analog hardware architecture to 
benefit directly from the adjacency-matrix representation of the graph. The 
critical challenge of such an implementation lies in the accuracy of the D/A 
converter and the voltage comparator (both analog), on which the correctness of 
the result depends. An improved version called Loser-Take-All was later proposed; 
it uses a current-comparator instead of a voltage-comparator (Nasir and Meador, 
1999). In addition, a digital version was proposed to resolve the inaccuracy 
issues of the analog design (Rizal, 1999). Specifically for undirected weighted 
graph problems, a triangle-array was proposed by Nasir et al. (2002a, 2002b). The 
triangle-array saves about half of the logic resources consumed by the 
square-array implementation. 
 
 
All previous work on hardware maze routers and generic graph accelerators 
primarily exploits the inherent parallelism of the adjacency-matrix 
representation of a graph. The major problem with such designs is that they 
require huge logic resources: a generic graph accelerator uses Θ(V²) logic 
resources for a graph of |V| vertices, while a maze router uses Θ(V²) logic 
resources for a grid-graph of |V * V| vertices (see Section 2.1 for the definition 
of ‘Θ’). In contrast, the grid-graph for VLSI physical design is actually sparse; 
the adjacency-matrix representation is simply wasteful, besides being too 
inflexible to support other graph variants. 
 
 
The hardware maze routers and generic graph accelerators also require the 
entire graph as input at the initial stage, before proceeding to the shortest unit 
path computation. Nanometer VLSI routing, on the other hand, adopts a hop-by-hop 
approach during graph searching; information about the graph vertices is unknown 
prior to execution. Under this completely different scenario, the conventional 
maze routers and generic graph accelerators are not an option. 
 
 
In addition, the hardware maze routers and generic graph accelerators are 
designed to accelerate elementary graph algorithms, e.g. shortest unit path, 
transitive closure, connected-components, etc. Nanometer VLSI routing, however, 
has evolved not only into a shortest path problem but into a multi-weight 
multi-constraint shortest path problem. Arithmetic capability is needed in 
addition to complex data manipulation. This leaves no room for applying the 
primitive parallel hardware discussed above. New designs of hardware graph 
accelerators are needed. 
 
 
 
 
1.5.2 Priority Queue Implementation 
 
 
Due to the wide application of priority queues, much research effort has been 
devoted to achieving better priority queue implementations. In general, the 
research on priority queues can be categorized into: (i) various advanced data 
structures for priority queues, (ii) specific priority queue data structures with 
inherent parallelism, targeted at the Parallel Random Access Machine (PRAM) model, 
and (iii) full-custom hardware designs to accelerate array-based priority queues. 
 
 
Research in category (i) basically explores the various ‘heap’ structures (a 
variant of the ‘tree’ data structure) to obtain theoretically better run-time 
complexity for priority queue operations. Binary-Heap, Binomial-Heap and 
Fibonacci-Heap are some instances of priority queue implementations in this 
category. Research in category (ii) includes, among others, Parallel-Heap, 
Relaxed-Heap and Sloped-Heap. Priority queue implementations in these two 
categories are interesting from a software or parallel-software point of view; 
they are capable of improving run-time complexity at the expense of more memory 
consumption, but fail to address the severe constant overhead of memory data 
communication. In short, these heap-like structures are interesting in software 
but are not adaptable to high-speed hardware implementation (Jones, 1986). 
 
 
Research work in category (iii), full-custom hardware priority queue design, 
is driven by the demands of high-speed applications such as internet network 
routing and real-time applications. These hardware priority queues can achieve 
very high throughput and clocking frequency, and thus improve the performance of 
the priority queue in both run-time complexity and communication overhead. Work 
in category (iii) includes the Binary Trees of Comparator (BTC) by Picker and 
Fellman (1995), in which the organization of comparators mimics the Binary-Heap. 
New elements enter the BTC through the leaves and the highest priority element is 
extracted from the root of the BTC; BTC priority queue operations therefore take 
O(lg n) run-time. 
 
 
Ioannou (2000) proposed another variant of hardware priority queue, the 
Hardware Binary-Heap Priority Queue. The algorithm that maintains the Binary-Heap 
property is pipelined and executed on custom pipelined processing units, which 
results in constant O(1) run-time for both INSERT and EXTRACT priority queue 
operations. A similar implementation using Binary-Random-Access-Memory (BRAM) 
was proposed by Argon (2006). Note that adding each successive layer to a 
binary-tree doubles the total number of tree-nodes, so all these binary-tree 
based designs suffer from exponential expansion complexity with respect to the 
number of tree levels. 
 
 
Brown (1988) and Chao (1991) independently proposed the implementation of a 
hardware priority queue using a First-In-First-Out architecture, called the FIFO 
Priority Queue. For l levels of priority, l FIFO arrays are deployed; each stores 
the elements of one priority level. This implementation gives constant O(1) 
run-time, and the FIFO order among elements with the same priority is maintained. 
It shares the scaling disadvantage discussed above: if the desired number of 
priority-levels is large, a huge number of FIFO arrays is needed. For example, if 
a 32-bit priority-value is desired, then 4,294,967,296 FIFO arrays are needed. 
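
The FIFO Priority Queue idea can be sketched in software as follows. This is only an illustrative model under stated assumptions, not the design of Brown (1988) or Chao (1991): the class name `FifoPriorityQueue` is hypothetical, level 0 is assumed to be the highest priority, and the scan over levels stands in for whatever selection logic a hardware design would use.

```python
from collections import deque


class FifoPriorityQueue:
    """Illustrative software model: one FIFO per priority level."""

    def __init__(self, levels):
        # Assumption: level 0 is the highest priority.
        self.fifos = [deque() for _ in range(levels)]

    def insert(self, level, item):
        # Appending to the FIFO of that level preserves arrival order.
        self.fifos[level].append(item)

    def extract(self):
        # Return the oldest item of the first non-empty level.
        for fifo in self.fifos:
            if fifo:
                return fifo.popleft()
        raise IndexError("extract from empty priority queue")


pq = FifoPriorityQueue(levels=4)
pq.insert(2, "a")
pq.insert(0, "b")
pq.insert(2, "c")
pq.insert(0, "d")
order = [pq.extract() for _ in range(4)]
print(order)  # ['b', 'd', 'a', 'c']

# Number of FIFO arrays needed for a 32-bit priority-value.
fifo_count_32bit = 2 ** 32
print(fifo_count_32bit)  # 4294967296
```

The model makes the scaling problem visible: the number of FIFOs grows with the number of priority levels, not with the number of queued elements.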
 
 
The Shift Register and Systolic-Shift-Register implementations of priority 
queue (Toda et al., 1995; Moon et al., 2000) perform better than the above 
designs. The number of priority levels and the implemented worst-case priority 
queue size can be easily scaled. The designs deploy O(n) processing-elements 
arranged in a one-dimensional array, for constant O(1) INSERT and EXTRACT 
run-time complexity. The designs have the disadvantage of a severe bus loading 
effect, because all processing-elements are connected to the input data bus, 
which results in a low clocking frequency. 
 
 
 
 
1.6 Significance of Research 
 
 
This research is significant in that it tackles the issue of interconnect delay 
optimization in VLSI physical design, since interconnect delay now dominates gate 
delay in nanometer VLSI interconnect routing. Existing maze routers assume that 
interconnects contribute negligible delay, which is no longer correct. Nanometer 
VLSI routing algorithms now have to include strategies to handle the interconnect 
delay optimization problem, which include, among others, buffer insertion. 
Consequently, the algorithms are now more complex in that they are modeled using 
multi-weighted multi-constrained graphs. These graphs involve searching over 
millions of nodes, and hence the algorithms are now extremely compute-intensive. 
The need for hardware acceleration as proposed in this research is clear. The 
contributions of this research are as follows: 
 
1) A comprehensive design of a 32-bit, parameterizable hardware priority queue 
accelerator module to accelerate priority queue operations. The module is 
incorporated into a graph processing unit, GPU. Modifications to the graph 
algorithms are made such that the proposed design can be applied to other 
graph-based shortest path algorithms. 
  
2) A hybrid priority queue based on hardware-software co-design is also 
developed. This implementation introduces a simple yet efficient control 
mechanism to avoid overflow in the hardware priority queue module. 
 
3) An application demonstration prototype of a graph processing hardware 
accelerator is developed. It includes the front-end GUI on the host to generate 
a sample post-placement layout. Figure 1.1 gives the architecture of the 
proposed system. 
 
 
 
Figure 1.1: System Architecture 
(Block diagram. Blocks shown: the Host PC with the VLSI Maze Routing DEMO (GUI); 
a UART; and the Graph Processing Unit (GPU), comprising the NIOS II Processor, 
the System Bus and the Priority Queue Accelerator Module, the latter made up of 
the Avalon Interface Unit and the Hardware Priority Queue Unit. Software blocks 
shown: the Simultaneous Maze Routing and Buffer Insertion algorithm (S-RABI) and 
HybridPQ.) 
 
 
 
 
1.7 Thesis Organization 
 
 
The work in this thesis is organized into eight chapters. This first chapter 
presents the motivation and research objectives, followed by the research scope, 
previous related work and research contributions, before concluding with the 
thesis organization. 
 
 
 The second chapter provides brief summaries of the background literature 
and theory reviewed prior to engaging the mentioned scope of work. Several topics 
related to this research are reviewed to give an overall picture of the background 
knowledge involved.  
 
 
 Chapter Three discusses the priority queue algorithm which leads to our 
hardware design. Next, the Simultaneous Maze Routing and Buffer Insertion (S-
RABI) algorithm applied in nanometer VLSI routing module is presented. It entails 
the two underlying algorithms which form the S-RABI algorithm. 
 
 
 Chapter Four presents the algorithmic modifications to the S-RABI algorithm 
that are necessary for it to benefit from the limited but fast operations of the 
hardware priority queue. Next, the architecture chosen for the implementation of 
the hardware priority queue accelerator is described, followed by the 
modifications to the priority queue algorithm needed for better hardware 
implementation. 
 
 
Chapter Five explains the design of the Graph Processing Unit. First the top-
level description of GPU is given; followed by each of its sub-components: the NIOS 
II processor, the system bus, the bus interface and the priority queue accelerator 
module. Also in this chapter, the development of device driver and HybridPQ is 
discussed.  
 
 
 Chapter Six gives a detailed description of the design of the priority queue 
accelerator module. This includes the Hardware Priority Queue Unit and the bus 
interface module required by our target implementation platform. 
 
 
Chapter Seven describes the simulations and hardware tests that are performed 
on individual sub-modules, modules and the system for design verification and 
system validation. Performance evaluations of the designed priority queue 
accelerator module are discussed and comparisons with other implementations are 
made. This chapter also illustrates the top-level architecture of the nanometer 
VLSI routing module developed to be executable on the GPU, followed by a detailed 
analysis of the performance of the graph algorithm in the presence of the 
priority queue accelerator module. 
 
 
In the final chapter of the thesis, the research work is summarized and the 
deliverables of the research are stated. Suggestions for potential extensions and 
improvements to the design are also given. 
 
 
 
 
1.8 Summary 
 
 
In this chapter, an introduction was given to the background and motivation 
of the research. The need for a hardware implementation of a priority queue 
module to accelerate graph algorithms, particularly state-of-the-art nanometer 
VLSI interconnect routing, was discussed. Based on this, the scope of the project 
was identified and set so as to achieve the desired implementation. The following 
chapter discusses the literature relevant to the theory and research background. 
CHAPTER 2 
 
 
 
 
THEORY AND RESEARCH BACKGROUND 
 
 
 
 
This chapter elaborates the fundamental concepts pertaining to the 
background of this research. The chapter begins with graph theory, followed by 
discussions on a fundamental graph algorithm, the shortest path algorithm. Next, the 
concept of priority queue is presented, with comprehensive explanations of its 
influence on shortest path graph computations.  
 
 
 
 
2.1 Graph 
 
 
A graph G = (V, E) consists of |V| vertices (nodes) and |E| edges. Any discrete 
mathematical set can be represented as a graph, where each element of the set is 
represented by a vertex and the relation between any two elements is represented 
by an edge. There are two basic approaches to modeling a graph: as a collection 
of adjacency-lists or as an adjacency-matrix. The adjacency-list representation 
is usually preferred because it provides a compact way to represent sparse 
graphs, those for which |E| is much less than |V|². Most graph algorithms assume 
that the input graph is represented in adjacency-list form. An adjacency-matrix 
representation may be preferred, however, when the graph is dense, i.e. when |E| 
is close to |V|². Figures 2.1 and 2.2 show examples of undirected and directed 
graphs, in both adjacency-list and adjacency-matrix representations. 
 
 
Figure 2.1: Two representations of an undirected graph 
(a) An undirected graph G having five vertices and seven edges. 
(b) An adjacency-list representation of G: 
      1: 2, 5 
      2: 1, 5, 3, 4 
      3: 2, 4 
      4: 2, 5, 3 
      5: 4, 1, 2 
(c) An adjacency-matrix representation of G: 
         1  2  3  4  5 
      1  0  1  0  0  1 
      2  1  0  1  1  1 
      3  0  1  0  1  0 
      4  0  1  1  0  1 
      5  1  1  0  1  0 
 
 
 
Figure 2.2: Two representations of a directed graph 
(a) A directed graph G having six vertices and eight edges. 
(b) An adjacency-list representation of G: 
      1: 2, 4 
      2: 5 
      3: 6, 5 
      4: 2 
      5: 4 
      6: 6 
(c) An adjacency-matrix representation of G: 
         1  2  3  4  5  6 
      1  0  1  0  1  0  0 
      2  0  0  0  0  1  0 
      3  0  0  0  0  1  1 
      4  0  1  0  0  0  0 
      5  0  0  0  1  0  0 
      6  0  0  0  0  0  1 
 
 
The adjacency-list representation of a graph G = (V, E) consists of |V| 
adjacency-lists, one for each vertex in V. For each vertex u ∈ V, the 
adjacency-list Adj[u] contains all the vertices v such that there is an edge 
(u, v) ∈ E. If G is a directed graph, the sum of the lengths of all the 
adjacency-lists is |E|. If G is an undirected graph, the sum is 2|E|, since for 
an edge (u, v), u appears in v’s adjacency-list and v appears in u’s 
adjacency-list. For both directed and undirected graphs, the adjacency-list 
representation has the desirable property that the memory it requires is 
Θ(V + E). Note that an exact analysis of the complexity of an algorithm is 
usually not worth the effort of computing it. The symbol ‘Θ’ denotes an 
asymptotically tight bound, just as ‘O’ denotes an asymptotic upper bound and ‘Ω’ 
denotes an asymptotic lower bound; asymptotic notation is an approximate 
technique for analyzing the complexity of an algorithm (Cormen et al., 2001). 
For the adjacency-matrix representation of a graph G = (V, E), the vertices 
are numbered 1, 2, …, |V|. The adjacency-matrix representation of G then consists 
of a |V| x |V| matrix A = (a_ij) such that a_ij = 1 if there is an edge 
(i, j) ∈ E, and a_ij = 0 otherwise. The adjacency-matrix of a graph requires 
Θ(V²) memory, asymptotically more than the adjacency-list representation. One 
advantage of the adjacency-matrix representation is that it can tell quickly (in 
constant time) whether a given edge (u, v) is present in the graph. 
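
The two representations can be compared with a minimal sketch, using the undirected graph of Figure 2.1 (edges read from its adjacency-matrix). The variable names are illustrative only.

```python
# Undirected graph of Figure 2.1: five vertices, seven edges.
edges = [(1, 2), (1, 5), (2, 3), (2, 4), (2, 5), (3, 4), (4, 5)]
num_vertices = 5

# Adjacency-list: one list per vertex, Theta(V + E) storage.
adj_list = {u: [] for u in range(1, num_vertices + 1)}
for u, v in edges:
    adj_list[u].append(v)
    adj_list[v].append(u)  # undirected: each edge is stored in both lists

# Adjacency-matrix: |V| x |V| entries, Theta(V^2) storage.
adj_matrix = [[0] * num_vertices for _ in range(num_vertices)]
for u, v in edges:
    adj_matrix[u - 1][v - 1] = 1
    adj_matrix[v - 1][u - 1] = 1

# For an undirected graph the list lengths sum to 2|E|.
total_list_length = sum(len(neighbours) for neighbours in adj_list.values())
print(total_list_length)  # 14

# Edge lookup: one array access in the matrix, a list scan in the list.
print(adj_matrix[2 - 1][4 - 1] == 1)  # True, edge (2, 4) exists
print(3 in adj_list[1])               # False, no edge (1, 3)
```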
 
 
Graphs can be further classified as unweighted or weighted. The examples in 
Figures 2.1 and 2.2 are unweighted graphs, whereas Figure 2.3 illustrates a 
weighted graph. In a weighted graph, each edge has an associated weight, 
typically given by a weight function w: E → R. For example, let G = (V, E) be a 
weighted graph with weight function w. The weight w(u, v) of edge (u, v) ∈ E is 
simply stored with vertex v in u’s adjacency-list. The adjacency-list 
representation is quite robust in that it can be modified to support many other 
graph problems. In fact, most real-world problems are weighted graph problems. 
For example, Dijkstra’s algorithm finds the shortest path on a weighted graph. 
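
As a small sketch of storing w(u, v) alongside v in u’s adjacency-list, the weighted graph of Figure 2.3 can be built as below. The helper name `weight` and the use of infinity for a missing edge (as in the figure’s adjacency-matrix) are choices made for this illustration.

```python
import math

# Weighted undirected graph of Figure 2.3 as (u, v, w(u, v)) triples.
weighted_edges = [
    ("A", "B", 1), ("A", "E", 12), ("B", "C", 10), ("B", "D", 1),
    ("B", "E", 6), ("C", "D", 8), ("D", "E", 3),
]

# Each adjacency-list entry is a pair (neighbour, weight).
adj = {}
for u, v, w in weighted_edges:
    adj.setdefault(u, []).append((v, w))
    adj.setdefault(v, []).append((u, w))


def weight(u, v):
    """Return w(u, v), or infinity when there is no edge (u, v)."""
    for neighbour, w in adj[u]:
        if neighbour == v:
            return w
    return math.inf


print(adj["B"])          # [('A', 1), ('C', 10), ('D', 1), ('E', 6)]
print(weight("D", "E"))  # 3
print(weight("A", "C"))  # inf
```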
 
 
Figure 2.3: A weighted graph 
(a) A weighted graph G. 
(b) An adjacency-list representation of G (entries are vertex/weight): 
      A: B/1, E/12 
      B: A/1, E/6, C/10, D/1 
      C: B/10, D/8 
      D: B/1, E/3, C/8 
      E: D/3, A/12, B/6 
(c) An adjacency-matrix representation of G: 
          A    B    C    D    E 
      A   ∞    1    ∞    ∞   12 
      B   1    ∞   10    1    6 
      C   ∞   10    ∞    8    ∞ 
      D   ∞    1    8    ∞    3 
      E  12    6    ∞    3    ∞ 
 
 
 
 
 
2.2 Graph-based Shortest Path Algorithm 
 
 
The technique for searching a graph is the heart of all graph algorithms. 
Searching a graph means systematically following the edges of the graph so as to 
visit the vertices. There are two elementary graph searching algorithms: breadth-first 
search (BFS) and depth-first search (DFS). Other graph algorithms are organized as 
simple elaborations of either BFS or DFS. For example, Prim’s minimum-spanning-
tree (MST) algorithm and Dijkstra’s single-source shortest-paths algorithm use ideas 
similar to those in BFS. 
 
 
It should be noted that the shortest path is different from the shortest unit 
path; the former applies to weighted graphs while the latter applies to 
unweighted graphs. The BFS algorithm is a shortest unit path algorithm on an 
unweighted graph, while Dijkstra’s algorithm is the equivalent of BFS on a 
weighted graph. In Figure 2.4(a), the shortest unit path from vertex-A to 
vertex-E is simply the direct edge A → E, but in Figure 2.4(b), the shortest path 
from vertex-A to vertex-E follows vertex-A → vertex-B → vertex-D → vertex-E. 
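
The shortest unit path can be sketched with a plain BFS. The example below assumes that the unweighted graph of Figure 2.4(a) has the same edges as the weighted graph of Figure 2.3 with the weights ignored; the function name `shortest_unit_path` is illustrative.

```python
from collections import deque


def shortest_unit_path(adj, source, target):
    """BFS: return a path with the fewest edges, or None if unreachable."""
    parent = {source: None}
    frontier = deque([source])
    while frontier:
        u = frontier.popleft()
        if u == target:
            break
        for v in adj[u]:
            if v not in parent:  # first visit gives the fewest hops
                parent[v] = u
                frontier.append(v)
    if target not in parent:
        return None
    path = []
    node = target
    while node is not None:  # walk the parent links back to the source
        path.append(node)
        node = parent[node]
    return path[::-1]


# Assumed topology: edges of Figure 2.3 without their weights.
adj = {
    "A": ["B", "E"],
    "B": ["A", "C", "D", "E"],
    "C": ["B", "D"],
    "D": ["B", "C", "E"],
    "E": ["A", "B", "D"],
}
unit_path = shortest_unit_path(adj, "A", "E")
print(unit_path)  # ['A', 'E']
```

BFS counts every edge as one hop, which is why it returns the direct edge even though that edge has the largest weight in the weighted version of the graph.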
 
 
 
Figure 2.4: Shortest Path and Shortest Unit Path 
(a) Shortest unit path from vertex-A to vertex-E, on unweighted graph: A → E 
(b) Shortest path from vertex-A to vertex-E, on weighted graph: A → B → D → E 
(c) Shortest path from vertex-A to vertex-B, on weighted graph: A → B 
(d) Shortest path from vertex-B to vertex-D, on weighted graph: B → D 
(e) Shortest path from vertex-D to vertex-E, on weighted graph: D → E 
Shortest-path algorithms typically rely on the property that a shortest path 
between two vertices contains other shortest paths within it. For example, in 
Figure 2.4(b) the shortest path from A to E is A → B → D → E, and each of its 
sub-paths, e.g. A → B, B → D and D → E, is itself the shortest path between its 
two end vertices; see Figures 2.4(c), 2.4(d) and 2.4(e). The Edmonds-Karp 
maximum-flow algorithm also relies on this property. This optimal-substructure 
property is a hallmark of the applicability of both the dynamic-programming 
method and the greedy method. For instance, Dijkstra’s algorithm is a greedy 
algorithm, and the Floyd-Warshall all-pairs shortest-paths algorithm is a 
dynamic-programming algorithm. 
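
As a worked check of this property, the path weights can be computed from the edge weights of Figure 2.3 (the notation w(·) for the total weight of a path is used here for illustration only):

```latex
\begin{aligned}
w(A \to B \to D \to E) &= 1 + 1 + 3 = 5, \\
w(A \to B \to E)       &= 1 + 6 = 7, \\
w(A \to E)             &= 12, \\
w(B \to D \to E)       &= 1 + 3 = 4 \;<\; w(B \to E) = 6 .
\end{aligned}
```

The path A → B → D → E is cheaper than the alternatives from A to E listed here, and its sub-path B → D → E is in turn cheaper than the direct edge B → E.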
 
 
Given a weighted graph, a shortest path algorithm can be used to find the 
shortest-distance route connecting two vertices, in which case the edge-weights 
represent distances. The edge-weights can also be interpreted as metrics other 
than distance, such as time, cost, penalties, loss or any other quantity that 
accumulates along the path and that one wishes to minimize. In electronic circuit 
design, the edge-weights may represent physical wire-length, interconnect delay, 
cumulative resistance, capacitance or inductance. As a result, shortest path 
algorithms have very wide applications, which include Internet routing, 
Quality-of-Service (QoS) network routing, Printed-Circuit-Board (PCB) 
interconnect routing and VLSI interconnect routing. 
 
 
 
 
2.3 Priority Queue 
 
 
A priority queue, Q, is an abstract data structure that maintains a set of 
elements. Each element contains a priority-level and an associated-identifier. In 
a priority queue, all elements are arranged according to their priority-level. 
The associated-identifier contains other information about the element, or it is 
often a pointer referencing other information about the element. 
 
 
A priority queue has two basic operations: (i) INSERT (Q, x), and (ii) 
EXTRACT (Q). INSERT (Q, x) adds a new element x (which consists of a 
priority-level and an associated-identifier) to Q. EXTRACT (Q) removes the 
element with the highest priority-level. The performance of priority queue 
operations is measured in terms of n, where n is the total number of elements in 
the queue. Figure 2.5 provides more details on the definitions of these 
operations. 
 
 
As outlined in Figure 2.5, there are two variants of the EXTRACT operation, 
namely EXTRACT-MIN (Q) and EXTRACT-MAX (Q). Depending on the target application, 
either EXTRACT-MIN (Q) or EXTRACT-MAX (Q) is implemented. In software, an 
EXTRACT-MIN (Q) implementation is easily converted to EXTRACT-MAX (Q) (or 
vice-versa) by switching the sign of the comparison. In hardware, because the 
comparator is hardwired, this is not so straightforward. Nevertheless, the 
solution is simple: for positive priority-levels, taking the reciprocal reverses 
their order, so the element with the maximum priority-level has the minimum 
reciprocal, and vice-versa. Hence, for example, if a hardware priority queue 
provides INSERT (Q, x) and EXTRACT-MIN (Q) but the target application needs 
EXTRACT-MAX (Q), then simply invert the priority-level, i.e. use 
1/(priority-level), before inserting the element into Q. From here on, 
EXTRACT (Q) is used to refer to either EXTRACT-MIN (Q) or EXTRACT-MAX (Q). 
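
The idea of obtaining EXTRACT-MAX from a queue that only offers EXTRACT-MIN can be sketched as below. Note that this sketch swaps in a different order-reversing transform from the reciprocal described in the text: it assumes unsigned 32-bit integer priority-levels and uses the bitwise complement, which reverses the order without division. The names `KEY_MASK` and `invert_key` are hypothetical, and Python’s `heapq` merely stands in for a min-only priority queue.

```python
import heapq

KEY_BITS = 32                   # assumed key width
KEY_MASK = (1 << KEY_BITS) - 1  # 0xFFFFFFFF


def invert_key(priority):
    """Order-reversing transform: bitwise complement of an unsigned key."""
    return KEY_MASK - priority


# A queue that only supports INSERT and EXTRACT-MIN.
min_queue = []
for priority, ident in [(7, "x"), (42, "y"), (3, "z")]:
    heapq.heappush(min_queue, (invert_key(priority), ident))

# EXTRACT-MIN on inverted keys returns the element with the largest key.
inverted, ident = heapq.heappop(min_queue)
largest = (invert_key(inverted), ident)  # undo the transform
print(largest)  # (42, 'y')
```

Any strictly decreasing transform of the priority-level works in the same way; the choice between reciprocal, negation and complement depends on the number format of the key.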
 
 
Figure 2.5: Basic Operations of Priority Queue 
INSERT (Q, x) 
  - Insert new element x into queue Q; this increases the queue size by one, 
    n ← n + 1. Note that x contains two things, a priority-level and an 
    associated-identifier; Q is sorted based on the priority-levels, not the 
    associated-identifiers. 
  - Also known as the ENQUEUE operation. 
 
EXTRACT (Q) 
  - Remove and return the highest-priority element in Q; this reduces the 
    queue size by one, n ← n – 1. 
  - Also known as the DEQUEUE operation. 
  - The term EXTRACT-MAX is used if the highest-priority element refers to 
    the element with the largest priority-value. 
  - The term EXTRACT-MIN is used if the highest-priority element refers to 
    the element with the smallest priority-value. 
 
 
Depending on the target application, the priority-level is determined based on 
time-of-occurrence, level-of-importance, physical parameters, delay or latency, 
etc. In many advanced algorithms where items or tasks are processed in a 
particular order, the priority queue has proven to be very useful. For 
task-scheduling on a multi-threaded, shared-memory computer, a priority queue is 
used to schedule and keep track of the prioritized pending processor tasks or 
threads. In discrete-event simulation, a priority queue is used in which the 
items are pending event sets, each with an associated time-of-occurrence that 
serves as its priority. 
 
 
The simplest way to implement a priority queue is to keep an associative array 
that maps each priority to a list of the items (elements) having that priority. 
Referring to Figure 2.6, the priorities are held in a static array which stores 
the pointers to the lists of items assigned to each priority. Such an 
implementation is static: for example, if the allowed priority ranges from 1 to 
4,294,967,295 (32-bit), then an array of (4 Giga entries) * (size of one pointer, 
i.e. 32 bits = 4 bytes) is consumed, so that a total of 16 Gigabytes is needed 
just to construct the priority data structure. 
 
 
Figure 2.6: Simplest way to implement Priority Queue 
(Diagram: a static array indexed by Priority Level, 1 to 8. Each entry points to 
the List of Elements holding that priority, or to NIL when no element has that 
priority. Each element has an associated-identifier.) 
 
 
A more flexible and practical way to implement a priority queue is to use a 
dynamic array. In this case, the length of the array does not depend on the range 
of priorities. Referring to Figure 2.7(a), each INSERT (Q, x) extends the 
existing queue-length by one unit (n ← n + 1), appends the new element, and then 
sorts Q to maintain the priority order. The sorting during insertion takes O(n) 
worst-case run-time. For the extraction operation, the highest priority element 
is removed from the left end, and each remaining element is shifted left to fill 
the vacancy. Hence, EXTRACT (Q) takes O(n) time. Note that the figures show only 
the priority-level of each element; it is understood that each element also has 
an associated-identifier. 
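
The array-based priority queue can be sketched as follows, using the priority-levels of Figure 2.7(a). The class name `SortedArrayPQ` is illustrative, and `bisect.insort` is used here in place of the append-then-sort step described above; the element shifting it performs is still O(n) in the worst case.

```python
import bisect


class SortedArrayPQ:
    """Illustrative array-based priority queue, smallest priority first."""

    def __init__(self):
        self.items = []  # (priority-level, associated-identifier) pairs

    def insert(self, priority, ident):
        # Finds the position by binary search; shifting the tail is O(n).
        bisect.insort(self.items, (priority, ident))

    def extract_min(self):
        # Removing the left-most element shifts all others left: O(n).
        return self.items.pop(0)


pq = SortedArrayPQ()
for p in [16, 3, 38, 2, 25, 8, 12]:
    pq.insert(p, f"id{p}")

snapshot = [p for p, _ in pq.items]
print(snapshot)  # [2, 3, 8, 12, 16, 25, 38]

first = pq.extract_min()
print(first)     # (2, 'id2')
```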
 
 
 
Figure 2.7: Priority Queue implemented as array or as heap 
(a) Priority Queue, viewed as an array: 
      index, i    1   2   3   4   5   6   7 
      priority    2   3   8  12  16  25  38 
(b) Priority Queue, viewed as a heap: the same seven priority-levels arranged as 
    a binary tree, with the root at index 1. 
 
 
In Figure 2.7(b), the priority queue is implemented as a heap. Graph, tree and 
heap are related data structures. The definition of a graph has already been 
given. A tree is a special case of a graph: a connected, acyclic undirected 
graph, i.e. one in which no combination of edges forms a cycle. A heap is a 
special case of a tree in which all vertices are arranged in a certain sorted 
order (see Figure 2.8). Note that “heap” in our context refers to this sorted 
heap, not to the garbage-collected storage of an operating system. 
 
 
By making use of this more complex data structure, the heap implementation of 
a priority queue gives a theoretical improvement in run-time complexity by 
reducing the number of nodes that must be sorted during INSERT or EXTRACT. 
Referring to Figure 2.9, there has been considerable research on implementing 
priority queues with different heap data structures, e.g. Binary-Heap, 
Binomial-Heap, Fibonacci-Heap, Relaxed-Heap, Parallel-Heap, etc. Each 
implementation has to consider the trade-off among speed, memory consumption and 
required hardware platform. In addition to the basic operations INSERT and 
EXTRACT, a heap implementation of a priority queue can support further 
operations, such as DECREASE-KEY. The DECREASE-KEY operation is used to perform 
‘relaxation’ in shortest path algorithms. In the next section, we discuss the use 
of the INSERT, EXTRACT and DECREASE-KEY operations in graph-based shortest path 
computation. 
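
A Binary-Heap priority queue can be sketched with Python’s standard `heapq` module, which stores the heap in a list. The helper `is_min_heap` is written for this illustration; `heapq` offers no DECREASE-KEY operation, so only INSERT and EXTRACT-MIN are shown.

```python
import heapq


def is_min_heap(a):
    """Check the heap order: every parent is <= each of its children."""
    return all(a[(i - 1) // 2] <= a[i] for i in range(1, len(a)))


heap = []
for priority in [16, 3, 38, 2, 25, 8, 12]:
    heapq.heappush(heap, priority)  # INSERT: O(lg n) sift-up

heap_ok = is_min_heap(heap)
root = heap[0]                      # smallest priority sits at the root
print(heap_ok, root)                # True 2

# EXTRACT-MIN: O(lg n) each, only one root-to-leaf path is re-ordered.
extracted = [heapq.heappop(heap) for _ in range(3)]
print(extracted)                    # [2, 3, 8]
```

Unlike the sorted array, the heap is only partially ordered, which is what reduces the work per operation from O(n) to O(lg n).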
 
 
 
Figure 2.8: Set, Graph, Tree and Heap 
(a) Set of elements with no relation to each other. 
(b) Graph, consisting of vertices connected by edges. 
(c) Tree: no edges form cycles; all edges branch outward from the root. 
(d) Binary Tree: each node (vertex) has at most two child-nodes. 
(e) Binary Heap: all nodes are arranged in sorted order. The value of a 
    parent-node is always smaller than the values of its child-nodes. 
 
 
 
Figure 2.9: Example of Binomial-Heap and Fibonacci-Heap 
(a) Binomial-Heap: a number of sub-trees in defined topology. 
(b) Fibonacci-Heap: all nodes in totally disordered topology. It uses pointer 
    structure to hold the nodes. 
2.4 Priority Queue and Dijkstra’s Shortest Path Algorithm 
 
 
Priority queues have been used extensively in graph-based shortest path 
algorithms. Shortest path algorithms use a typical technique called ‘relaxation’. 
Consider a shortest path problem on a graph G = (V, E) with a weight function w. 
Then w(u, v) denotes the edge-weight from vertex u to v, where u precedes v. Each 
vertex v ∈ V maintains an attribute d[v], the ‘shortest path estimate’. With 
reference to Figure 2.11, relaxation works as follows: if the shortest path 
estimate at vertex v is larger than the sum of the shortest path estimate at 
vertex u and the weight from u to v, then the shortest path estimate at vertex v 
is updated to that sum and u is recorded as the predecessor of v (Figure 2.10, 
lines 1 to 3). 
 
 
Figure 2.10: Function RELAX ( ) 
RELAX (u, v, w) 
1  if d[v] > d[u] + w(u, v) 
2     then  d[v] ← d[u] + w(u, v) 
3           π[v] ← u 
 
 
 
Figure 2.11: Relaxation 
(a) Before RELAX: d[u] = 5, w(u, v) = 2, d[v] = 9. 
    if d[v] > d[u] + w(u, v) (i.e. 9 > 5 + 2 in this case) 
    then d[v] ← d[u] + w(u, v) (i.e. d[v] ← 7). 
    After RELAX: d[u] = 5, w(u, v) = 2, d[v] = 7. 
(b) Before RELAX: d[u] = 5, w(u, v) = 2, d[v] = 6. 
    if d[v] > d[u] + w(u, v) (FALSE, i.e. 6 > 5 + 2 does not hold) 
    then no update at d[v]. 
    After RELAX: d[u] = 5, w(u, v) = 2, d[v] = 6. 
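
The RELAX function of Figure 2.10 can be sketched in Python and checked against the two cases of Figure 2.11. The dictionaries `d` and `pi` and the returned flag are conveniences of this sketch, not part of the original pseudocode.

```python
def relax(u, v, w_uv, d, pi):
    """Relax edge (u, v) of weight w_uv; return True if d[v] was updated."""
    if d[v] > d[u] + w_uv:
        d[v] = d[u] + w_uv
        pi[v] = u
        return True
    return False


# Case (a) of Figure 2.11: 9 > 5 + 2, so d[v] becomes 7.
d_a = {"u": 5, "v": 9}
pi_a = {"u": None, "v": None}
updated_a = relax("u", "v", 2, d_a, pi_a)
print(updated_a, d_a["v"], pi_a["v"])  # True 7 u

# Case (b) of Figure 2.11: 6 > 5 + 2 is false, so d[v] stays 6.
d_b = {"u": 5, "v": 6}
pi_b = {"u": None, "v": None}
updated_b = relax("u", "v", 2, d_b, pi_b)
print(updated_b, d_b["v"], pi_b["v"])  # False 6 None
```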
 
Figure 2.12: Dijkstra’s Shortest Path Algorithm 
 0  DIJKSTRA(G, w, s){ 
 1    for (each vertex v ∈ V[G]){ 
 2      d[v] ← ∞ 
 3      π[v] ← NIL 
 4    } 
 5    d[s] ← 0 
 6    S ← Ø 
 7    for (each vertex v ∈ V[G]){ 
 8      INSERT(Q, v, d[v]) 
 9    } 
10    do{ 
11      (u, d[u]) ← EXTRACT-MIN(Q) 
12      S ← S ∪ {u} 
13      for (each vertex v ∈ Adj[u]){ 
14        if (d[v] > d[u] + w(u, v)){ 
15          d[v] ← d[u] + w(u, v) 
16          π[v] ← u 
17          DECREASE-KEY(Q, v, d[v]) 
18        } 
19      } 
20    }(while Q ≠ Ø) 
21  } 
 
 
To further explain the details of relaxation in a shortest path algorithm, we 
use Dijkstra’s single-source shortest path algorithm, given in Figure 2.12, as an 
example. Given a graph G = (V, E), V[G] denotes the set of vertices and W[G] 
denotes the set of edge-weights. We use s to denote the source-vertex. If u and v 
are adjacent vertices, then v ∈ Adj[u] or u ∈ Adj[v]. d[u] denotes the ‘shortest 
path estimate’ from s to u, while d[v] denotes the ‘shortest path estimate’ from 
s to v. Given that w(u, v) denotes the edge-weight from u to v, relaxing the edge 
(u, v) ensures that d[v] ≤ d[u] + w(u, v), with equality when the update takes 
place. S is the set of vertices whose final shortest path estimates from the 
source s have already been determined. The predecessor list π[v] is used to hold 
the preceding vertex of v. Upon completion of the algorithm, the shortest path 
from s to v can be traced by following π[v] backward to the source, and the 
shortest path length from s to each vertex is given by the final d[v]. 
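
A runnable sketch of the algorithm is given below, applied to the weighted graph of Figure 2.3. It departs from Figure 2.12 in two respects, both chosen for this illustration: only the source is inserted into the queue initially, and, since Python’s `heapq` has no DECREASE-KEY, a new entry is pushed on every successful relaxation and outdated entries are skipped when extracted. The function names are hypothetical.

```python
import heapq
import math


def dijkstra(adj, source):
    """Return (d, pi): shortest path estimates and predecessor list."""
    d = {v: math.inf for v in adj}
    pi = {v: None for v in adj}
    d[source] = 0
    visited = set()              # the set S of Figure 2.12
    queue = [(0, source)]        # (priority-level, associated-identifier)
    while queue:
        dist_u, u = heapq.heappop(queue)   # EXTRACT-MIN
        if u in visited:
            continue             # stale entry left by an earlier update
        visited.add(u)
        for v, w_uv in adj[u]:
            if d[v] > dist_u + w_uv:       # relaxation
                d[v] = dist_u + w_uv
                pi[v] = u
                heapq.heappush(queue, (d[v], v))  # replaces DECREASE-KEY
    return d, pi


def trace_path(pi, target):
    """Follow pi[] backward from target to the source."""
    path = []
    while target is not None:
        path.append(target)
        target = pi[target]
    return path[::-1]


# Weighted graph of Figure 2.3.
adj = {
    "A": [("B", 1), ("E", 12)],
    "B": [("A", 1), ("E", 6), ("C", 10), ("D", 1)],
    "C": [("B", 10), ("D", 8)],
    "D": [("B", 1), ("E", 3), ("C", 8)],
    "E": [("D", 3), ("A", 12), ("B", 6)],
}
d, pi = dijkstra(adj, "A")
print(d)                    # {'A': 0, 'B': 1, 'C': 10, 'D': 2, 'E': 5}
print(trace_path(pi, "E"))  # ['A', 'B', 'D', 'E']
```

The traced path agrees with the shortest path A → B → D → E shown in Figure 2.4(b).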
 
Let us illustrate the execution of Dijkstra’s algorithm using the example 
weighted graph in Figure 2.13(a). The data trace in the arrays d[v], π[v] and Q 
is illustrated in Figures 2.13(b) to 2.13(d). Figure 2.14 presents the result 
upon completion of the algorithm. 
 
 
 
Figure 2.13(a): Illustration of Dijkstra’s algorithm - Initialization 
1.    for (each vertex v ∈ V[G]){ 
2.      d[v] ← ∞ 
3.      π[v] ← NIL     // HERE WE INITIALIZE AS INFINITE ‘∞’ 
4.    }                // NOTE THE PRIORITY QUEUE, PQ, IS EMPTY. 
5.    d[s] ← 0         // TAKE ‘N1’ AS SOURCE NODE. 
6.    S ← Ø            // ‘VISITED-LIST’ IS EMPTY. 
 
(Graph drawing: vertices N1 to N6; edge weights 7, 2, 4, 1, 3, 5, 6.) 
Initially: 
           N1   N2   N3   N4   N5   N6 
  d[ ]      0    ∞    ∞    ∞    ∞    ∞ 
  π[ ]      ∞    ∞    ∞    ∞    ∞    ∞ 
  Q (priority-level over associated-identifier): empty, all slots ∞ 
 
 
In the initialization step of the algorithm (lines 1-6), the predecessor list 
π[v] is initialized to NIL and the shortest path estimate at each vertex, d[v], 
to infinity, except at the source, where d[s] = 0. Lines 7-9 construct the 
priority queue, Q, to contain all vertices in V. Note that each element in Q has 
the shortest path estimate d[v] as its priority-level and the vertex identity v 
as its associated-identifier. In the algorithm, Q is used to maintain the set of 
shortest path estimates at each vertex. The construction of the priority queue 
invokes |V| INSERT operations on Q. Figure 2.13(b) shows the priority queue after 
construction. 
 
 
 
Figure 2.13(b): Illustration of Dijkstra’s algorithm – Priority Queue Construction 
 
7.    for (each vertex v ∈ V[G]){   // CONSTRUCT THE PRIORITY QUEUE. 
8.      INSERT(Q, v, d[v]) 
9.    } 
 
           N1   N2   N3   N4   N5   N6 
  d[ ]      0    ∞    ∞    ∞    ∞    ∞ 
  π[ ]      ∞    ∞    ∞    ∞    ∞    ∞ 
 
  Q, priority-level:           0    ∞    ∞    ∞    ∞    ∞ 
  Q, associated-identifier:   N1   N2   N3   N4   N5   N6 
 
 
Each time through the while loop (line 11), the vertex with the smallest 
‘shortest path estimate’ is extracted (EXTRACT-MIN) from Q (Figure 2.13(c)). 
 
 
 
Figure 2.13(c): Illustration of Dijkstra’s algorithm - EXTRACT operation 
10.   do{
11.       (u, d[u]) ← EXTRACT-MIN(Q)   // THE HIGHEST PRIORITY IS AT N1 
12.       S ← S U {u}                  // INCLUDED IN ‘VISITED-LIST’ 
          : 
          : 
20.   }(while Q ≠ Ø) 
State after EXTRACT-MIN: 
Extracted entry: (0, N1) 
d[ ]:  N1 = 0; N2 to N6 = ∞ 
π[ ]:  N1 to N6 = ∞ (NIL) 
Q:     (∞, N2), (∞, N3), (∞, N4), (∞, N5), (∞, N6) 
Then lines 13-19 relax each edge (u, v) leaving u, updating the estimate d[v] and 
the predecessor π[v] when necessary (Figures 2.13(d) and 2.13(e)). Because Q 
maintains the set of shortest path estimates at each vertex, the corresponding 
Q-entry is updated with each change, and Q is then sorted (or consolidated) to 
maintain the priority order among its entries. This operation on Q is called 
DECREASE-KEY.  
 
 
 
Figure 2.13(d): Illustration of Dijkstra’s algorithm – Relaxation & DECREASE-KEY 
      do{ 
          : 
13.       for (each vertex v є Adj[u]){       // VISIT EACH ADJACENT-NODES 
14.           if (d[v] > d[u] + w(u, v)){     // RELAXATION at d[N2]. 
15.               d[v] ← d[u] + w(u, v) 
16.               π[v] ← u 
17.               DECREASE-KEY(Q, v, d[v])    // AT PQ. 
18.           } 
          } 
      }(while Q ≠ Ø) 
RELAXATION at N2: d[N2] > d[N1] + w(N1,N2), i.e. ∞ > (0 + 7), so update d[N2]. 
DECREASE-KEY at N2 (update, then sort): 
d[ ]:  N1 = 0, N2 = 7; N3 to N6 = ∞ 
π[ ]:  N2 = N1; all others = ∞ (NIL) 
Q:     (7, N2), (∞, N3), (∞, N4), (∞, N5), (∞, N6) 
 
Figure 2.13(e): Illustration of Dijkstra’s algorithm – Relaxation & DECREASE-KEY 
      do{ 
          : 
13.       for (each vertex v є Adj[u]){       // VISIT EACH ADJACENT-NODES 
14.           if (d[v] > d[u] + w(u, v)){     // RELAXATION at d[N4]. 
15.               d[v] ← d[u] + w(u, v) 
16.               π[v] ← u 
17.               DECREASE-KEY(Q, v, d[v])    // AT PQ. 
18.           } 
          } 
      }(while Q ≠ Ø) 
RELAXATION at N4: d[N4] > d[N1] + w(N1,N4), i.e. ∞ > (0 + 6), so update d[N4]. 
DECREASE-KEY at N4 (update, then sort): 
d[ ]:  N1 = 0, N2 = 7, N3 = ∞, N4 = 6, N5 = ∞, N6 = ∞ 
π[ ]:  N2 = N1, N4 = N1; all others = ∞ (NIL) 
Q after update:  (7, N2), (∞, N3), (6, N4), (∞, N5), (∞, N6) 
Q after sort:    (6, N4), (7, N2), (∞, N3), (∞, N5), (∞, N6) 
 
 
Note that EXTRACT-MIN is invoked exactly |V| times and DECREASE-KEY is invoked 
at most |E| times. The complete execution is given in Appendix A. 
Figure 2.14 gives the final execution result. 
 
Figure 2.14: Illustration of the final execution result 
do{
    (u, d[u]) ← EXTRACT-MIN(Q)       // THE HIGHEST PRIORITY IS AT N6 
    S ← S U {u}                      // INCLUDED IN ‘VISITED-LIST’ 
    for (each vertex v є Adj[u]){    // NO MORE ADJACENT NODES FOR N6 
        : 
    } 
}(while Q ≠ Ø)                       // PQ IS EMPTY. 
 
 
It is clear that the run-time complexity of Dijkstra’s algorithm (or any other 
shortest path algorithm for that matter) depends on the performance of the 
priority queue. Throughout the execution, INSERT and EXTRACT are each invoked |V| 
times while DECREASE-KEY is invoked at most |E| times. Hence, if the priority 
queue performs INSERT, EXTRACT and DECREASE-KEY in O(V) each (because the 
worst-case Q length is n = |V|), the run-time of Dijkstra’s algorithm is 
O(V² + V² + V·E) = O(V² + V·E), which is O(V²) for sparse graphs such as 
grid-graphs, where |E| = O(|V|). Referring to Table 2.1, a Binary-Heap performs 
INSERT, EXTRACT and DECREASE-KEY in O(lg V) each, so the run-time becomes 
O(V lg V + V lg V + E lg V) = O((V + E) lg V). With a Fibonacci-Heap, where INSERT 
and DECREASE-KEY are O(1) but EXTRACT is O(lg V), the run-time complexity of 
Dijkstra’s algorithm is O(V + V lg V + E) = O(V lg V + E). 
Figure 2.14 (continued): final state and result 
d[ ]:  N1 = 0, N2 = 7, N3 = 9, N4 = 6, N5 = 8, N6 = 12 
π[ ]:  N1 = ∞ (NIL), N2 = N1, N3 = N2, N4 = N1, N5 = N2, N6 = N5 
Q:     empty; the last entry extracted is (12, N6) 
RESULT 
TRACE-BACK d[ ] AND π[ ], THE SHORTEST PATH FROM N1 TO:- 
N2 is to follow the track N1 → N2,            with COST = 7; 
N3 is to follow the track N1 → N2 → N3,       with COST = 9; 
N4 is to follow the track N1 → N4,            with COST = 6; 
N5 is to follow the track N1 → N2 → N5,       with COST = 8; 
N6 is to follow the track N1 → N2 → N5 → N6,  with COST = 12. 
Table 2.1: Run-time complexity for each operation among different heap data 
structures; n denotes the number of elements in the heap 

Operation       Binary-Heap     Binomial-Heap   Fibonacci-Heap 
                (worst-case)    (worst-case)    (amortized) 
MAKE-HEAP       Θ(1)            Θ(1)            Θ(1) 
INSERT          Θ(lg n)         O(lg n)         Θ(1) 
MIN             Θ(1)            O(lg n)         Θ(1) 
EXTRACT-MIN     Θ(lg n)         Θ(lg n)         O(lg n) 
UNION           Θ(n)            O(lg n)         Θ(1) 
DECREASE-KEY    Θ(lg n)         Θ(lg n)         Θ(1) 
DELETE          Θ(lg n)         Θ(lg n)         O(lg n) 
 
 
 
 
2.5 Modeling of VLSI Interconnect Routing as a Shortest Path Problem 
 
 
In physical design automation, VLSI layouts are typically modeled as 
grid-graphs. Interconnect routing in a post-placement layout involves constructing 
connections between two (or more) electrical nodes. The term global-routing is used 
when more than two nodes are connected, while the term maze-routing is used when 
only two nodes are connected. Maze routing is a subset of global routing. In 
practice, a global routing is decomposed into multiple maze routings (Bakoglu, 1990; 
Wolf, 2002).  
 
 
Referring to Figure 2.15, a layout usually contains some obstacle regions where 
interconnects or buffers are prohibited. VLSI interconnect routing is usually 
treated as a shortest path problem. To discuss this concept further, consider the 
example layout shown in Figure 2.15, where we wish to connect source A to 
destination (or sink) B. Conventionally, the goal is to find a route that minimizes 
the total wire-length. Figure 2.16(a) shows the shortest route when all obstacles 
are avoided. Figure 2.16(b) gives the shortest route if only the wire obstacles are 
avoided. Conventional maze routing is essentially a shortest path problem.  
 
 
The classic Lee’s algorithm (Lee, 1961) for maze routing fully exploits the 
inherent parallelism of the shortest unit path problem in a grid-graph. Lee’s 
algorithm features parallel expansion for maze routing. As illustrated in 
Figure 2.17, the expansion begins at the source vertex, where all vertices adjacent 
to the source are marked as ‘1’. Then, all unmarked vertices adjacent to a vertex 
marked ‘1’ are marked as ‘2’, and so on. The expansion continues until the 
destination vertex is reached; the mark at the destination vertex gives the minimum 
wire-length from source to destination. 
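
The wave expansion described above can be sketched in software as a breadth-first search, which visits cells in the same order of marks as the parallel expansion, only sequentially. This is a minimal sketch rather than Lee's original formulation: the grid encoding (1 = obstacle, 0 = free), the function name `lee_wave_expansion` and the example layout are all assumptions made for illustration.

```python
from collections import deque

def lee_wave_expansion(grid, source, target):
    """Breadth-first wave expansion on a 4-connected grid.

    grid[r][c] == 1 marks an obstacle cell, 0 a free cell (assumed encoding).
    Returns the mark at the target (unit wire-length), or None if unreachable.
    """
    rows, cols = len(grid), len(grid[0])
    mark = {source: 0}            # expansion mark of every cell reached so far
    frontier = deque([source])
    while frontier:
        r, c = frontier.popleft()
        if (r, c) == target:
            return mark[(r, c)]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == 0 and (nr, nc) not in mark:
                mark[(nr, nc)] = mark[(r, c)] + 1   # next wavefront
                frontier.append((nr, nc))
    return None

# Hypothetical 4x5 layout, not taken from Figure 2.17.
layout = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 1, 0],
    [1, 1, 0, 0, 0],
]
print(lee_wave_expansion(layout, (0, 0), (3, 4)))   # 7
```

The returned mark equals the number of unit edges on a shortest obstacle-avoiding route, which is why the method applies only when every edge has the same weight.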
 
 
 
Figure 2.15: VLSI layout represented in grid-graph 
(Labels in figure: source A, destination B, buffer obstacles, wire obstacles.) 
 
 
 
Figure 2.16: VLSI Routing as shortest unit path problem 
(a) 
Shortest unit path, avoid all obstacles. 
Wire = 36 unit-length. 
(b)  
Shortest unit path, avoid wire obstacles. 
Wire = 24 unit-length. 
 
 
Figure 2.17: Parallel expansion in Lee’s algorithm 
(a) Problem: route source A to destination B avoiding obstacles. 
(b) 1st parallel expansion in Lee’s algorithm. 
(c) Destination B is reached, minimum wire-length = 8 unit. 
 
 
As VLSI physical design moves into the nanometer range, shrinking gate size has 
improved transistor switching speed, but shrinking interconnect size yields higher 
resistive delay. As a result, interconnect delay now dominates gate delay and has 
become the dominating factor in the performance of a system. In many system designs 
targeting 0.35 µm to 0.5 µm technology, as much as 50% to 70% of the clock cycle is 
consumed by interconnect delay (Cong et al., 1996). This figure will continue to 
rise as the technology feature size decreases further. 
 
 
Many techniques are employed to reduce interconnect delay; among them, 
buffer insertion has been shown to be an effective approach. New routing approaches 
involving buffer insertion and wire sizing have been proposed for nanometer VLSI 
interconnect design. These methods of routing with buffer insertion are formulated 
as shortest path problems, where the goal is to find a buffered minimum delay path 
between source and sink. In the presence of buffer obstacles, the shortest path is 
not necessarily the minimum delay path, so the conventional Lee’s algorithm is no 
longer applicable. A number of routing algorithms have been proposed for different 
buffer insertion approaches, each claiming to achieve better performance than the 
others in terms of buffer location, buffer density management, the minimum 
interconnect delay achieved, and the complexity of the algorithm itself (Chu and 
Wong, 1997; Chu and Wong, 1998; Chu and Wong, 1999; Dechu et al., 2004; Ginneken, 
1990; Lai and Wong, 2002; Jagannathan et al., 2002; Nasir, 2005; Zhou et al., 2000). 
Figure 2.18 illustrates some variants of these routing algorithms.  
 
Figure 2.18: VLSI Routing as shortest path (minimum-delay) problem 
(a) Shortest path length first, then insert buffers if allowed. Delay = 621.81ps. 
(b) Avoid all blocks, then insert buffers if allowed. Delay = 680.62ps. 
(c) Simultaneous Routing and Buffer Insertion. Delay = 521.73ps. 
 
 
 
 
2.6 Summary 
 
 
This chapter elaborates on the fundamental concepts pertaining to the 
background of this research. The chapter begins with graph theory, followed by a 
discussion of a fundamental graph algorithm, the shortest path algorithm. Next, the 
concept of the priority queue is presented, with a comprehensive explanation of its 
influence on shortest path graph computations. In the next chapter, the VLSI 
interconnect routing algorithms used to validate the proposed GPU are discussed in 
detail. These include Dijkstra’s algorithm, the Simultaneous Routing and Buffer 
Insertion (S-RABI) algorithm, and the priority queue. 
CHAPTER 3 
 
 
 
 
PRIORITY QUEUE AND GRAPH-BASED SHORTEST PATH PROBLEM 
- DESCRIPTIONS OF ALGORITHMS 
 
 
 
 
This chapter begins with a description of the basic sorting algorithm of the 
priority queue and reviews the relevant details of Elmore delay models. It also 
introduces the VLSI interconnect routing methodology, followed by the shortest path 
formulation of the Simultaneous Maze Routing and Buffer Insertion algorithm 
(S-RABI) that is applied in this thesis.  
 
 
 
 
3.1 Priority Queue and the Insertion Sort Algorithm 
 
 
Sections 2.3 and 2.4 of Chapter 2 discussed how the performance of the 
priority queue can severely affect the run-time of graph-based shortest path 
algorithms. By definition, a priority queue is an abstract data structure that 
maintains a set of elements (entries) arranged in order of their priority. When a 
new element is inserted into the priority queue, the queue is sorted to maintain 
the priority order. When the highest priority element is extracted, the queue is 
consolidated to maintain the priority order. The order of priority in the queue can 
be maintained using a sorting algorithm. 
 
 
Among the variety of sorting algorithms available, insertion-sort is a suitable 
method to sort a priority queue (Cormen et al., 2001). Insertion-sort sorts 
on-the-fly, that is, it sorts the array as it receives each new entry. This 
‘online’ behavior matches the INSERT mechanism of a priority queue very well. Most 
advanced sorting algorithms, such as quick-sort, heap-sort or merge-sort, are more 
effective in handling large lists, but insertion-sort has its advantages when 
implemented in hardware. 
 
 
First, it is relatively simple to implement in hardware. The lower run-time 
complexity of the advanced algorithms mentioned above is often traded off against a 
large constant factor, i.e. a more complex data structure for each entry, and 
therefore more memory consumption and severe data communication overhead.  
 
 
The second advantage of insertion-sort over the other sorting algorithms, for 
a priority queue applied in graph computation, is that it sorts in place. It 
requires only a constant amount, O(1), of extra temporary memory space, whereas the 
other advanced sorting algorithms demand up to an additional O(n) of temporary 
storage. Lastly, it sorts on-the-fly: sorting starts immediately when a new entry 
is received. Sorting algorithms that wait until all entries are received before 
they start sorting cannot be used to implement a hardware priority queue. 
 
 
 
 
3.1.1 Insertion-Sort Priority Queue 
 
 
 Insertion-Sort works the way many people sort a hand of playing cards. Start 
with left-hand empty and all cards face down on the table, remove one card at a time 
from table and insert it into the correct position in the left-hand. In order to find the 
correct position for a card, we compare it with each of the cards already in the hand, 
from right to left. At all times, the cards held in the left hand are sorted, and these 
cards were originally the top cards of the pile on the table (Cormen et al., 2001). 
Figure 3.1 gives the pseudo-code of Insertion-Sort algorithm. A numerical example 
which illustrates its execution is provided in Appendix D.1. 
 
 
 
Figure 3.1: Insertion-Sort Algorithm 
INSERTION-SORT(array A, int length) { 
    j ← 1; 
    // Enter Step-j 
    while (j < length) { 
        INSERT(A, j, A[j]);     
        j ← j + 1; 
    } 
} 
INSERT(array A, int length, key) { 
    i ← length - 1; 
    // Enter InnerLoop(i+1) 
    while (i ≥ 0 and A[i] > key) {  
        A[i + 1] ← A[i]; 
        i ← i - 1; 
    } 
    A[i + 1] ← key;   
} 
  
 
If the top-level loop of the Insertion-Sort algorithm is removed, the remaining 
INSERT(array A, int length, key) function is exactly the INSERT operation of a 
priority queue. Such an implementation is called an Insertion-Sort Priority Queue. 
Its INSERT operation begins at the last element; the new element is compared with 
the existing elements one by one. If an existing element has lower priority, it is 
right-shifted. The process continues until the correct position for the new element 
is found. At all times, array A is sorted and the highest priority element is at 
the left end. Hence, for the EXTRACT operation, the top-priority element is 
extracted from the left end, followed by a series of left-shifts on the remaining 
elements. Figure 3.2 gives the pseudo-code describing the Insertion-Sort Priority 
Queue. Figure 3.3 illustrates its execution. A numerical example which illustrates 
the execution is provided in Appendix D.2. 
 
INSERT(array A, int length, key) { 
    i ← length - 1; 
    // Enter InnerLoop(i+1) 
    while (i ≥ 0 and A[i] > key) { 
        A[i + 1] ← A[i]; 
        i ← i - 1; 
    } 
    A[i + 1] ← key;   
} 
EXTRACT-MIN(array A, int length) { 
    min-key ← A[0]; 
    k ← 0; 
    while ( k < length-1 ) {  
        A[k] ← A[k + 1]; 
        k ← k + 1; 
    } 
    length ← length - 1; 
    return(min-key);   
} 

Figure 3.2: Insertion-Sort Priority Queue Algorithm 
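
The two operations in Figure 3.2 can be exercised with a small software model. The following is a minimal sketch, assuming a Python list as the backing array and (priority, identifier) tuples as entries; the class name `InsertionSortPQ` and the `v…` identifiers are hypothetical and are not part of the hardware design. The keys are the ones shown in Figure 3.3(a).

```python
class InsertionSortPQ:
    """Array-backed priority queue kept sorted on every INSERT (min at index 0)."""

    def __init__(self):
        self.items = []  # (priority, identifier) pairs, ascending by priority

    def insert(self, priority, ident):
        self.items.append((priority, ident))      # open one slot at the right end
        i = len(self.items) - 2                   # last existing element
        while i >= 0 and self.items[i][0] > priority:
            self.items[i + 1] = self.items[i]     # right-shift lower-priority entry
            i -= 1
        self.items[i + 1] = (priority, ident)     # correct position found

    def extract_min(self):
        # list.pop(0) removes the left-end entry and left-shifts the rest.
        return self.items.pop(0)

    def __len__(self):
        return len(self.items)

pq = InsertionSortPQ()
for key in (12, 18, 19, 55, 9):
    pq.insert(key, f"v{key}")
print([p for p, _ in pq.items])   # [9, 12, 18, 19, 55]
print(pq.extract_min())           # (9, 'v9')
```

Both operations touch up to n entries in the worst case, which matches the O(n) bounds quoted for Figure 3.3.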
 
 
Figure 3.3: Operations in Insertion-Sort Priority Queue 
(a) INSERT operation, worst-case O(n) run-time complexity. The new element (9) is 
    compared with the existing elements 12 18 19 55, which are always in sorted 
    order; each is right-shifted in turn until the correct position for the new 
    element is found, giving 9 12 18 19 55. 
(b) EXTRACT-MIN operation, worst-case O(n) run-time complexity. 
 
 
 
 
3.2 Maze Routing with Buffered Elmore Delay Path Optimization 
 
 
VLSI interconnect design using the maze routing approach begins with a 
post-placement topology in which the buffer obstacles, wire obstacles, source and 
destination (sink) points are all given. The problem is then mapped onto a 
grid-graph, where the space between the source and destination is segmented to a 
given grid-resolution. A higher grid-resolution leads to a more accurate routing 
result. Each segment can be filled with any one of the buffer choices or wire sizes. 
Figure 3.4 shows a typical routing topology for a graph-based maze routing 
algorithm.  
 
 
 
Figure 3.4: A typical routing Grid-Graph 
(Labels in figure: source A, destination B.) 
Uniform edges, each representing a unit wire length. 
Vertex, where a buffer can be inserted. 
Free zone, where the placement of a buffer or wire is allowed. 
Buffer obstacles, which only wires (interconnects) are allowed to cross. 
Wire obstacles, which neither wires nor buffers are permitted to cross. 
 
 
The general flow of the maze routing algorithm with buffered delay path 
optimization is depicted in the flow chart in Figure 3.5. Beginning at the source, 
the vertices adjacent to the source are scanned. At each of those vertices, for 
each possible choice of buffer or wire, the interconnect-delay is computed 
on-the-fly, based on the Elmore Delay model. These tentative interconnect 
candidates are ‘inserted’ into or ‘relaxed’ at the priority queue. When all adjacent 
vertices have been scanned and all possible interconnect choices have been 
considered, the procedure moves to the vertex of the candidate with the minimum 
interconnect-delay; that is, the best candidate is extracted from the priority 
queue. All its adjacent vertices are then scanned, and interconnect candidates are 
inserted or relaxed at the priority queue. This continues until the destination 
vertex is reached.  
 
 
 
Figure 3.5: Typical maze routing algorithm with buffered delay path optimization 
1. Start at the source. 
2. Pick a neighbour. 
3. Pick an allowed interconnect type (either buffer choices or wire sizes). 
4. Compute the propagate-delay, resistance and capacitance. 
5. Dominated? If NO, INSERT or DECREASE-KEY at priority queue. If YES, skip this 
   candidate. 
6. Next interconnect type? If YES, go to step 3. 
7. Next neighbour? If YES, go to step 2. 
8. EXTRACT the highest priority entry from priority queue. 
9. Reach target? If YES, END. If NO, set the candidate-node as (intermediate) 
   source-node and go to step 2. 
 
As mentioned above, the interconnect delay is computed using the Elmore 
Delay model, which has remained popular since it was proposed by Elmore (1948). It 
has a simple analytical closed form, based on the first-order impulse response of 
resistive-capacitive delay. Any interconnect with minimum Elmore delay is 
guaranteed to have minimum actual delay. Figure 3.6 gives the delay of the 
interconnect candidates, given the propagate-delay of the previous vertex. The 
resistance-delay computation model is applied when the routing computation begins 
at the source and propagates towards the destination. The capacitance-delay 
computation model is used if the propagation originates from the destination. More 
complex models, e.g. the Fitted Elmore Delay from Seido et al. (2004), are also 
available if higher accuracy in interconnect delay estimation is desired. 
 
 
Figure 3.6: Elmore Delay model 
(a) Wire-only interconnect model (with resistance-delay pair): a wire of 
    resistance r with capacitance c/2 at each end, from (r’, t’) to (r”, t”). 
        r” = r’ + r 
        t” = t’ + c (r’ + r/2) 
(b) Wire-only interconnect model (with capacitance-delay pair): the same wire, 
    from (c’, t’) to (c”, t”). 
        c” = c’ + c 
        t” = t’ + r (c’ + c/2) 
(c) Buffered interconnect model (with resistance-delay pair): a buffer with input 
    capacitance cb, resistance rb and capacitance db/rb, from (r’, t’) to (r”, t”). 
        r” = rb 
        t” = t’ + r’cb + db 
(d) Buffered interconnect model (with capacitance-delay pair): the same buffer, 
    from (c’, t’) to (c”, t”). 
        c” = cb 
        t” = t’ + c’rb + db 
 
 
Figure 3.7 illustrates the propagate-delays at each successive vertex, from 
source to sink. To compute the route, a resistance-delay (r, t) pair is constructed 
recursively at each vertex, downstream from the source to the sink. In Figure 3.7, 
rw and cw can be extended to include the choices of wire widths available in the 
wire library; this is called maze routing with interconnect sizing. Figure 3.8 
illustrates the computation of the Elmore delay when a buffer is inserted at a 
vertex (vertex C is chosen in the illustration). 
 
 
 
(Circuit in figure: vertices A, B, C, D in series; each hop is a wire of resistance 
rw with capacitance cw/2 at each end.) 
Compute for (rD, tD): 
rD = rC + rw                          -- resistance from source to vertex D. 
tD = tC + rC·cw/2 + (rC + rw)·cw/2    -- cumulative delay from source to vertex D. 
   = tC + rC·cw + rw·cw/2 
Compute for (rC, tC): 
rC = rB + rw                          -- resistance from source to vertex C. 
tC = tB + rB·cw/2 + (rB + rw)·cw/2    -- cumulative delay from source to vertex C. 
   = tB + rB·cw + rw·cw/2 
Compute for (rB, tB): 
rB = rA + rw                          -- resistance from source to vertex B. 
tB = tA + rA·cw/2 + (rA + rw)·cw/2    -- cumulative delay from source to vertex B. 
   = tA + rA·cw + rw·cw/2 
Compute for (rA, tA): 
rA = 0;                               -- resistance seen at vertex A (source). 
tA = 0;                               -- cumulative delay at vertex A. 
In conclusion, each vertex n can be labeled as: 
r(n) = r(n-1) + rw                    -- resistance from source to vertex n. 
t(n) = t(n-1) + r(n-1)·cw + rw·cw/2   -- cumulative delay from source to vertex n. 
Figure 3.7: Elmore Delay in hop-by-hop maze routing 
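
As a worked step, the per-hop recurrence of Figure 3.7 can be unrolled into a closed form. The indexing below is an assumption made for this derivation only: vertex A is taken as n = 0 with values (r_0, t_0), and n counts the unbuffered hops of identical wire (r_w, c_w).

```latex
\begin{align*}
r_n &= r_0 + n\,r_w,\\
t_n &= t_0 + \sum_{k=0}^{n-1}\left(r_k\,c_w + \tfrac{1}{2}\,r_w c_w\right)\\
    &= t_0 + n\,r_0 c_w + \tfrac{1}{2}\,n(n-1)\,r_w c_w + \tfrac{1}{2}\,n\,r_w c_w\\
    &= t_0 + n\,r_0 c_w + \tfrac{1}{2}\,n^2\,r_w c_w .
\end{align*}
```

With r_0 = t_0 = 0 as in Figure 3.7, this gives t_n = n² r_w c_w / 2, so the delay of an unbuffered wire grows quadratically with the number of hops.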
 
(Circuit in figure: vertices A, B, C, D in series; each hop is a wire of resistance 
rw with capacitance cw/2 at each end. A buffer is inserted at vertex C, between 
point CP (buffer input) and point CQ (buffer output); it is modeled by input 
capacitance cb, resistance rb and capacitance db/rb.) 
Compute for (rC, tC): 
rC = rCQ = rb                          -- resistance from source to vertex C. 
tC = tCP + db                          -- cumulative delay from source to vertex C. 
   = tB + rB·cw + rw·cw/2 + cb(rB + rw) + db 
Compute for (rCQ, tCQ): 
rCQ = rb                               -- resistance from source to CQ. 
tCQ = tCP + rb(db/rb)                  -- cumulative delay from source to CQ. 
    = tCP + db 
Compute for (rCP, tCP): 
rCP = rB + rw                          -- resistance from source to CP. 
tCP = tB + rB·cw/2 + (rB + rw)(cw/2 + cb)   -- cumulative delay from source to CP. 
    = tB + rB·cw + rw·cw/2 + cb(rB + rw) 
Compute for (rB, tB): 
rB = rA + rw                           -- resistance from source to vertex B. 
tB = tA + rA·cw/2 + (rA + rw)·cw/2     -- cumulative delay from source to vertex B. 
   = tA + rA·cw + rw·cw/2 
Compute for (rA, tA): 
rA = 0;                                -- resistance seen at vertex A (source). 
tA = 0;                                -- cumulative delay at vertex A. 
Compute for (rD, tD): 
rD = rC + rw                           -- resistance from source to vertex D. 
tD = tC + rC·cw/2 + (rC + rw)·cw/2     -- cumulative delay from source to vertex D. 
   = tC + rC·cw + rw·cw/2 
In conclusion, the (r, t) pair only has to be computed with a different formula 
when a buffer is inserted at that particular vertex. The formula when a buffer is 
inserted at vertex n is: 
r(n) = rb                              -- resistance from source to n. 
t(n) = t(n-1) + r(n-1)·cw + rw·cw/2 + cb(r(n-1) + rw) + db 
                                       -- cumulative delay from source to n. 

• When a buffer is inserted, the effective resistance up to that node is reduced 
  to only rb, namely the buffer (driver) resistance. 
• Even though the delay up to the buffered node becomes larger, the smaller 
  resultant resistance at the buffered node yields smaller delay values at all the 
  following descendant nodes. 
Figure 3.8: Elmore Delay for buffer insertion in hop-by-hop maze routing 
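
The two update formulas of Figures 3.7 and 3.8 can be checked numerically. The following is a minimal sketch: the function names and all component values (rw, cw, rb, cb, db) are hypothetical and chosen only to make the arithmetic easy to follow, not taken from any buffer or wire library in this thesis.

```python
def wire_step(r_prev, t_prev, rw, cw):
    """One unbuffered hop (Figure 3.7): returns (r, t) at the next vertex."""
    return r_prev + rw, t_prev + r_prev * cw + rw * cw / 2

def buffered_step(r_prev, t_prev, rw, cw, rb, cb, db):
    """One hop with a buffer inserted at the arriving vertex (Figure 3.8)."""
    t = t_prev + r_prev * cw + rw * cw / 2 + cb * (r_prev + rw) + db
    return rb, t

def path_delay(n_hops, rw, cw, buffer_at=None, rb=0.0, cb=0.0, db=0.0):
    """Propagate (r, t) from the source (0, 0) over n_hops identical hops.

    buffer_at is the 1-based hop whose arriving vertex receives a buffer.
    """
    r, t = 0.0, 0.0
    for hop in range(1, n_hops + 1):
        if hop == buffer_at:
            r, t = buffered_step(r, t, rw, cw, rb, cb, db)
        else:
            r, t = wire_step(r, t, rw, cw)
    return r, t

# Hypothetical unit values; three hops correspond to A -> B -> C -> D.
print(path_delay(2, 1.0, 2.0))                                         # (2.0, 4.0) at C
print(path_delay(2, 1.0, 2.0, buffer_at=2, rb=0.5, cb=0.125, db=0.25)) # (0.5, 4.5) at C
print(path_delay(3, 1.0, 2.0))                                         # (3.0, 9.0) at D
print(path_delay(3, 1.0, 2.0, buffer_at=2, rb=0.5, cb=0.125, db=0.25)) # (1.5, 6.5) at D
```

With these values the buffer raises the delay at vertex C from 4.0 to 4.5 but lowers it at vertex D from 9.0 to 6.5, which is the behaviour described in the two bullet points of Figure 3.8.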
 
 
This hop-by-hop graph computation of the shortest path problem accumulates a 
huge amount of data. The computational complexity of the problem is NP. Besides the 
very long run-time, a computer could run out of memory before the problem is 
solved. Hence, various heuristics have been proposed to prune (reduce) the problem 
complexity for a near-exact solution. Typical solution-space pruning techniques 
include look-ahead, buffer prediction, delay estimation, multiple constraints, etc. 
In this thesis, maze routing with buffered delay interconnect optimization is 
modeled as a graph-based shortest path problem solved by the Simultaneous Routing 
and Buffer Insertion algorithm (S-RABI) proposed by Nasir et al. (2006). 
 
 
 
 
 
 
 
 
3.3 Simultaneous Maze Routing and Buffer Insertion (S-RABI) Algorithm 
 
 
The S-RABI algorithm proposed by Nasir et al. (2006) is based on an exact 
Quality-of-Service (QoS) routing algorithm proposed by Kuiper et al. (2004a, 
2004b), called SAMCRA, the Self-Adaptive Multiple Constraints Routing 
Algorithm. SAMCRA is a multi-constrained algorithm applied in QoS routing 
protocols on the Internet. The S-RABI algorithm formulates the simultaneous maze 
routing and buffer insertion problem as a graph-based shortest path model. 
Adapting the multi-constrained routing technique in SAMCRA to the VLSI layout maze 
routing environment yields an innovative multi-weight, multi-constrained shortest 
path routing algorithm in S-RABI.  
 
 
The S-RABI algorithm applied in VLSI maze routing succeeds in producing an 
exact minimum interconnect delay route in a VLSI layout topology. An exact minimum 
delay path is obtained when, at the end of the graph-based search, the 
propagate-delay and cumulative resistance of the entire interconnect route are at 
their minimum. Each graph edge is therefore multi-weighted with two parameters: 
resistance and delay. S-RABI applies several search pruning techniques to reduce 
the solution space of this computationally hard problem. The concepts behind the 
S-RABI techniques are based on graph theory, and the priority queue plays a 
critical role in its operation.  
 
 
 
 
3.3.1 Initial Graph Pruning in S-RABI 
 
 
The purpose of graph-pruning is to downsize the solution-space. One 
technique applied in S-RABI is to modify the wire/buffer obstacles such that vertices 
are removed from consideration, resulting in reduced number of vertices involved in 
the routing computation.  
 
 
Figure 3.9: Graph pruning, (a) before pruning, (b) after pruning. 
(Labels in figure: vertices 1, 20, 41, 71, 110, 121 and 140. Annotations in (a): 
dDistance[1][71] = 3, dDistance[2][71] = 11, dDistance[3][110] = 12; the path 
length via vertex 71 is 14.) 
 
 
Consider the example in Figure 3.9(a), where the dark-grey areas are wire 
obstacles and the light-grey areas are buffer obstacles. The source and destination 
are vertex-41 and vertex-110. In graph pruning, each vertex is examined. For 
example, the shortest path from the source to vertex-71 is 11 units, and the 
shortest path from the destination to vertex-71 is 3 units. Thus, if the source is 
connected to the destination via vertex-71, the path length is 11 + 3 = 14 units. 
Meanwhile, the shortest path between source and destination is only 12 units. 
Therefore, vertex-71 is converted into a wire obstacle and is never considered as 
part of a solution. Efficient graph pruning can downsize the initial graph 
significantly, since a typical post-placement layout is densely occupied with 
buffer and wire obstacles. 
The graph pruning in S-RABI involves three runs of Dijkstra’s algorithm. Note 
that the Dijkstra’s algorithm applied in S-RABI is based on a hop-by-hop approach, 
and hence differs slightly from the standard version discussed in Chapter 2. The 
major difference is that the standard Dijkstra’s algorithm, when initialized, 
constructs a priority queue of length |V| for a graph of |V| vertices, whereas the 
Dijkstra’s algorithm in S-RABI does not. The worst-case priority queue size needed 
for the hop-by-hop Dijkstra’s algorithm is clearly smaller than that of the 
conventional Dijkstra’s algorithm. From here on, the term “Dijkstra’s algorithm” 
refers to the “hop-by-hop Dijkstra’s algorithm”.  
 
 
3.3.2 Dijkstra’s Algorithm applied in S-RABI 
 
 
Figure 2.12 in Chapter 2 shows the original version of Dijkstra’s algorithm. 
This algorithm is modified as shown in Figure 3.10 to produce the hop-by-hop 
Dijkstra’s algorithm.  
 
 
Referring to Figure 3.10, the priority queue Q is initially empty. The 
algorithm begins by initializing d[v] at all vertices to infinity. d[v] = ∞ 
indicates that vertex v has not been visited and that no entry for v has been 
inserted into Q. The predecessor-list π[v] is initialized to NIL. The algorithm 
then visits the source s. The total path length from the source to itself is zero, 
hence d[s] = 0. The first entry is inserted into the priority queue Q with priority 
value d[s] = 0 and the vertex (source, s) as identifier. The identifier tells at 
which vertex this particular Q-entry was created. 
 
 
Figure 3.10: Hop-by-hop Dijkstra’s Algorithm 
DIJKSTRA(G, w, s){ 
    for (each vertex v є V[G]){  
        d[v] ← ∞ 
        π[v] ← NIL 
    } 
    d[s] ← 0 
    INSERT(Q, s, d[s]) 
    S ← Ø  
    do{ 
        (u, d[u]) ← EXTRACT-MIN(Q) 
        S ← S U {u} 
        for (each vertex v є Adj[u]){ 
            if (d[v] = ∞){ 
                d[v] ← d[u] + w(u, v) 
                π[v] ← u 
                INSERT(Q, v, d[v]) 
            } 
            elseif (d[v] > d[u] + w(u, v)){ 
                d[v] ← d[u] + w(u, v) 
                π[v] ← u 
                DECREASE-KEY(Q, v, d[v]) 
            } 
        } 
    }(while Q ≠ Ø) 
} 
 
 
Next, the top-priority element is extracted from Q, taking the identifier as 
the current-vertex u and the priority value as the total-path-length at the 
current-vertex, d[u]. This current-vertex is one of the nearest vertices to the 
source; hence it is included in set S. From u, its adjacency-list Adj[u] is 
scanned. For each vertex v є Adj[u], if d[v] is infinite, vertex v has never been 
visited before. Therefore d[v] is updated and a new entry is inserted into Q with 
priority value d[v] and identifier v. Otherwise, vertex v has been visited and 
there is a corresponding entry in Q (recognized through identifier v); if the 
existing d[v] is dominated (d[v] > d[u] + w(u, v)), then d[v] is updated, as is the 
corresponding entry in Q. The procedure is iterated until Q is empty. When Q is 
empty, all reachable vertices have been visited and the single-source-to-all-
destination shortest paths have been found. Each path can be traced back by 
dereferencing π[v] for v є S. A numerical example of Dijkstra’s computation is 
given in Appendix B. 
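
The hop-by-hop procedure above can be sketched in software as follows. Two points are assumptions of this sketch rather than part of Figure 3.10. First, Python's `heapq` has no DECREASE-KEY, so an improved estimate is pushed as a new entry and stale entries are skipped on extraction. Second, the example graph contains only the five edges that appear on the shortest-path tree reported in Figure 2.14, taken as directed edges.

```python
import heapq
import math

def hop_by_hop_dijkstra(adj, source):
    """adj maps each vertex to a list of (neighbour, weight) pairs."""
    d = {v: math.inf for v in adj}
    pi = {v: None for v in adj}
    d[source] = 0
    queue = [(0, source)]        # only the source is queued initially
    visited = set()              # the set S
    while queue:
        du, u = heapq.heappop(queue)
        if u in visited or du > d[u]:
            continue             # stale entry superseded by a smaller key
        visited.add(u)
        for v, w in adj[u]:
            if d[u] + w < d[v]:  # covers both d[v] = inf and relaxation
                d[v] = d[u] + w
                pi[v] = u
                heapq.heappush(queue, (d[v], v))
    return d, pi

def trace_back(pi, v):
    """Follow the predecessor list from v back to the source."""
    path = []
    while v is not None:
        path.append(v)
        v = pi[v]
    return path[::-1]

adj = {
    "N1": [("N2", 7), ("N4", 6)],
    "N2": [("N3", 2), ("N5", 1)],
    "N3": [],
    "N4": [],
    "N5": [("N6", 4)],
    "N6": [],
}
d, pi = hop_by_hop_dijkstra(adj, "N1")
print(d)                      # {'N1': 0, 'N2': 7, 'N3': 9, 'N4': 6, 'N5': 8, 'N6': 12}
print(trace_back(pi, "N6"))   # ['N1', 'N2', 'N5', 'N6']
```

The distances and the traced path agree with the result listed in Figure 2.14.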
 
 
 
 
3.3.3 S-RABI in maze routing with buffered interconnect delay optimization 
 
 
S-RABI extends the idea in the above hop-by-hop Dijkstra’s algorithm, to 
implement simultaneous interconnect routing with buffer insertion and wire sizing.  
 
 
Let us now formulate the S-RABI algorithm. The input is a graph G, where 
V[G] denotes the set of vertices. The vertices are classified into the 
buffer-obstacle area OB[G], the wire-obstacle area OW[G], and the non-obstacle area 
(OB[G] U OW[G])’; i.e. V[G] = OB[G] U OW[G] U (OB[G] U OW[G])’. The task is to 
construct a minimum delay path from source s to destination z, given a buffer 
library B and a wire library W. Consider any two adjacent vertices u and v 
(v є Adj[u]), where u precedes v. If v is not in the wire-obstacle area 
(v є OW[G]’), u can be connected to v with any of the wire sizes w[i] є W. If v is 
not in the buffer-obstacle area (v є OB[G]’), any of the buffer choices b[i] є B 
may be inserted at v.  
 
 
Let D(v) denote a weight candidate-dataset of 5 parameters; D(v[k]) = {u, uk, 
e, r, t}, where u denotes the vertex preceding vertex v, uk denotes the k-th index 
of D(u), and e є (B U W) denotes the choice of buffer/wire used. (r, t) is the 
resistance-delay pair at vertex v computed by function Cost( ) given in Figure 
3.11. Each vertex v has a list of these candidate-datasets, referred to as the 
v-list and denoted by L[v], i.e. L[v] = {D(v[k]), k ≥ 0 and k є integer}. The 
(r, t) resistance-delay pairs are used to maintain the dominance-property, which is 
evaluated by function InsertCandidate( ), given in Figure 3.12. The basic principle 
of the dominance-property is that, given two candidates A and B, candidate-A 
dominates candidate-B if and only if both rA < rB and tA < tB.  
 
 
During dominance-property maintenance, more than one of the candidate-datasets 
in an existing v-list might be dominated by the newly computed parameters. 
Consequently, an additional parameter, the status flag sf, needs to be included; 
so, D(v[k]) = {u, uk, e, r, t, sf}. The candidate-dataset is valid if sf = VALID; 
otherwise, sf = NON-VALID. By default, every newly created candidate-dataset has 
sf = VALID, whereas all dominated candidate-datasets are set to sf = NON-VALID 
during dominance-property maintenance. 
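
The dominance rule and the status flag can be illustrated with a reduced model of a v-list. This is a simplified sketch, not the InsertCandidate( ) function of Figure 3.12: it keeps only the (r, t) pair and the flag, always appends a surviving candidate instead of overwriting an invalid slot, and omits the priority queue and the estimated_delay pruning. The class and function names and the numeric values are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    r: float             # cumulative resistance at the vertex
    t: float             # cumulative delay at the vertex
    valid: bool = True   # plays the role of the status flag sf

def dominates(a, b):
    """Candidate a dominates b iff both its resistance and delay are smaller."""
    return a.r < b.r and a.t < b.t

def insert_candidate(v_list, new):
    """Add new to v_list unless it is dominated; returns True if it was kept."""
    for old in v_list:
        if old.valid and dominates(old, new):
            return False                 # new candidate is dominated: discard
    for old in v_list:
        if old.valid and dominates(new, old):
            old.valid = False            # mark every dominated candidate invalid
    v_list.append(new)
    return True

v_list = []
for r, t in [(3.0, 9.0), (1.5, 6.5), (1.0, 7.0), (2.0, 8.0)]:
    insert_candidate(v_list, Candidate(r, t))
print([(c.r, c.t) for c in v_list if c.valid])   # [(1.5, 6.5), (1.0, 7.0)]
```

The two surviving candidates do not dominate each other (one has the smaller resistance, the other the smaller delay), so both must be kept for later hops.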
 
 
 
Figure 3.11: Function Cost ( ) 
Cost(ru, tu, e) 
{ 
    if ( e є W ) { 
        rv = ru + rw[i]; 
        tv = tu + cw[i]*(ru + rw[i]/2); 
    } 

    if ( e є B ) { 
        rv = rb[i]; 
        tv = tu + tb[i] + ru*cb[i]; 
    } 

    return (rv, tv); 
} 
InsertCandidate(D(u[k]), v, rv, tv, e, L[v]){  
    // #PART_1: IDENTIFY THE CONTEXT STATE OF VERTEX. 
    if ( L[v] = NIL ) {    // v-list is empty. 
        CASE ← ‘LIST_EMPTY’; 
    }else{ 
        for each D(v[i]) є L[v] { 
            if (sf є D(v[i]) == VALID && rv > r є D(v[i]) && tv > t є D(v[i])){ 
                // this new candidate is dominated. 
                CASE ← ‘NEW_DOMINATED’; 
                Break;   // exit “for each D(v[i]) є L[v]” 
            } 
            elsif ((sf є D(v[i]) == VALID && rv < r є D(v[i]) && tv < t є D(v[i])) 
                   || (sf є D(v[i]) == VALID && t є D(v[i]) < estimated_delay)){ 
                // existing candidate-dataset is dominated, so mark invalid. 
                sf є D(v[i]) ← NON-VALID; 
                CASE ← ‘OLD_DOMINATED’; 
                Break;   // exit “for each D(v[i]) є L[v]” 
            } 
            else { 
                // neither new, nor old candidate-datasets dominate. 
                CASE ← ‘NONE_DOMINATED’; 
            } 
        }// end “for each D(v[i]) є L[v]” 
    } 

    // #PART_2: MANIPULATE V-LIST & PRIORITY QUEUE. 
    if ( CASE == ‘LIST_EMPTY’ ) { 
        D(v[0]) = {u, k, e, rv, tv, VALID } // D(v[k]) = {u, uk, e, r, t, sf } 
        L[v] ← L[v] U D(v[0]) 
        INSERT(Q, D(v[0]), tv є D(v[0])) 
    } 
    elsif ( CASE == ‘NEW_DOMINATED’ ) { 
        // do nothing. 
    } 
    elsif ( CASE == ‘OLD_DOMINATED’ ) { 
        for each D(v[i]) є L[v] { 
            if ( sf є D(v[i]) == NON-VALID ) { 
                // get the first invalid candidate-dataset and overwrite it. 
                // it’s ok to just leave the rest of invalid candidate-dataset. 
                D(v[i]) ← {u, k, e, rv, tv, VALID } 
                DECREASE-KEY(Q, D(v[i]), t є D(v[i])) 
                Break;   // exit “for each D(v[i]) є L[v]” 
            } 
        } 
    } 
    elsif (CASE == ‘NONE_DOMINATED’) { 
        // append to the v-list. 
        i = Length[L[v]] + 1 
        D(v[i]) = {u, k, e, rv, tv, VALID } 
        L[v] ← L[v] U D(v[i]) 
        INSERT(Q, D(v[i]), t є D(v[i])) 
    } 

    // #PART_3: UPDATE estimated_delay IF NECESSARY. 
    if (v = z) {  // reach the destination 
        if (estimated_delay > tv + rv*Cz) { 
            estimated_delay ← tv + rv*Cz         // update the value 
            estimated_end_candidate ← D(v[i])    // remember this candidate 
        } 
    } 
} 

Figure 3.12: Function InsertCandidate ( ) 
 
 53
 
Figure 3.13: Simultaneous Maze Routing and Buffer Insertion (S-RABI)
S-RABI(G, B, W, s, z){
    for (each vertex v є V[G]){
        L[v] ← NIL
    }
    estimated_delay ← ∞
    D(s[0]) = {NIL, NIL, NIL, Rs, 0, VALID}   // D(v[k]) = {u, uk, e, r, t, sf}
    L[s] ← L[s] U D(s[0])
    INSERT(Q, D(s[0]), t є D(s[0]))           // INSERT(Q, identifier, key)
    do{
        do{
            (D(u[k]), t є D(u[k])) ← EXTRACT-MIN(Q)
        }(while sf є D(u[k]) == NON_VALID)
        if (estimated_delay > t є D(u[k])) {
            for (each vertex v є Adj[u]) {
                if (v є OW[G]’) {   // if v is not wire-obstacle.
                    for each w[i] є W {
                        (rv, tv) ← Cost(r є D(u[k]), t є D(u[k]), w[i])
                        if (tv < estimated_delay)
                            { InsertCandidate(D(u[k]), v, rv, tv, w[i], L[v]) }
                    }// end wire trials
                    if (v є OB[G]’) {   // if v is not buffer-obstacle.
                        for each b[i] є B {
                            (rv, tv) ← Cost(r є D(u[k]), t є D(u[k]), b[i])
                            if (tv < estimated_delay)
                                { InsertCandidate(D(u[k]), v, rv, tv, b[i], L[v]) }
                        }
                    }// end buffer trials
                }
            }// end all adjacent-vertices
        }
    }(while Q ≠ Ø)
}
Let us now explain the working of the S-RABI algorithm, which is given in
Figure 3.13. Initially, the priority queue Q is empty. The algorithm begins by
initializing the v-list of every vertex to empty (L[v] ← NIL). The estimated
source-to-destination delay is initially set to infinity (estimated_delay ← ∞); as we
shall see later, this parameter plays a role in controlling the size of Q.
 
 
Starting from source s, a candidate-dataset is created, D(s[0]) = {NIL, NIL, NIL,
Rs, 0, VALID}. This is the first candidate-dataset at vertex s, hence the index k = 0,
i.e. D(v=s[k=0]). There is no vertex preceding s (i.e. u = NIL), therefore there is no
reference to the index of a candidate-dataset at a preceding vertex (i.e. uk = NIL) and
no interconnect prior to s (i.e. e = NIL). The driving-resistance at the source is Rs
(i.e. r = Rs), and the propagated-delay prior to s is zero (i.e. t = 0). This
candidate-dataset is added to the v-list at source s, i.e. L[s] ← L[s] U D(s[0]). It is
then inserted into Q with the propagated-delay as priority-value and (a pointer to) the
candidate-dataset as identifier. The identifier dereferences to the location of the
candidate-dataset, i.e. the vertex to which the candidate-dataset belongs and the index
at which it resides in the v-list of that vertex.
 
 
Consider now the program loop in which the graph traversal is performed. The
top-priority element is extracted from Q. With the identifier, the origin of the
candidate-dataset is known: it was created at vertex u with index k in the v-list of u,
i.e. D(u[k]). Each adjacent-vertex v is now scanned, and if v is not in a wire-obstacle
region, an available wire-size w[i] is picked, and the propagated-resistance rv and
propagated-delay tv are computed by the function Cost( ). The pair (rv, tv) is then used
in the dominance check in the InsertCandidate( ) function.
 
 
InsertCandidate( ) can be explained in three separate parts, namely
#PART_1, #PART_2 and #PART_3. In #PART_1, the context state of the vertex is
first determined. If vertex v has not been visited yet, indicated by an empty v-list at v
(i.e. L[v] = NIL), the context state is set to ‘LIST_EMPTY’. Otherwise, if v has
been visited before, there must be candidate-datasets in the v-list, so each
candidate-dataset is considered in turn. If the new candidate with (rv, tv) is dominated
by any of the existing candidate-datasets, the context state is set to ‘NEW_DOMINATED’,
meaning that the new one is dominated. Else, if the new candidate dominates an
existing candidate-dataset, the context state is set to ‘OLD_DOMINATED’, and that
existing candidate-dataset is set to NON_VALID. Note also that an existing
candidate-dataset in the list is set to NON_VALID if its propagated-delay has exceeded
the value of estimated_delay: t є D(v[i]) > estimated_delay. In this evaluation, once the
context state is identified as either ‘NEW_DOMINATED’ or ‘OLD_DOMINATED’,
#PART_1 of the algorithm is exited immediately. Lastly, if for all candidates in the
v-list neither the new candidate nor any existing candidate dominates the other, the
context state is set to ‘NONE_DOMINATED’.
 
 
Next, in #PART_2, a specific action is taken for the context state determined in
#PART_1. If the context state is ‘LIST_EMPTY’, the new candidate is inserted into
the v-list. The candidate-dataset is created: D(v[0]) = {u, k, e, rv, tv, VALID}, then
added to the v-list: L[v] ← L[v] U D(v[0]), and inserted into the priority queue Q. Else,
if the context state is ‘NEW_DOMINATED’, nothing is done, implying that the new
candidate is discarded. Else, if the context state is ‘OLD_DOMINATED’, either the new
candidate has dominated one of the existing candidate-datasets or there is a
candidate-dataset whose propagated-delay has exceeded estimated_delay and which has
therefore been marked invalid. Here, the NON_VALID candidate-dataset is overwritten with
the parameters of the new candidate: D(v[i]) ← {u, k, e, rv, tv, VALID}, and
DECREASE-KEY is invoked for Q relaxation. Lastly, if the context state is
‘NONE_DOMINATED’, neither the new nor any existing candidate dominates. A new
candidate-dataset is hence created, D(v[i]) = {u, k, e, rv, tv, VALID}, appended to the
list, L[v] ← L[v] U D(v[i]), and inserted into Q.
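
The v-list handling in #PART_1 and #PART_2 can be sketched as follows. This is a simplified illustration, not the thesis implementation: the Dataset class and both function names are hypothetical, the estimated_delay clause of #PART_1 is left out, and the priority queue calls (INSERT, DECREASE-KEY) are omitted so that only the list manipulation is shown.

```python
from dataclasses import dataclass


@dataclass
class Dataset:
    u: object           # preceding vertex
    uk: object          # index of the parent dataset in L[u]
    e: object           # wire/buffer choice used
    r: float            # propagated resistance
    t: float            # propagated delay
    valid: bool = True  # status flag sf


def classify(vlist, rv, tv):
    """PART_1: return the context state; mark the first dominated entry invalid."""
    if not vlist:
        return "LIST_EMPTY"
    for d in vlist:
        if not d.valid:
            continue
        if rv > d.r and tv > d.t:
            return "NEW_DOMINATED"
        if rv < d.r and tv < d.t:
            d.valid = False
            return "OLD_DOMINATED"
    return "NONE_DOMINATED"


def insert_candidate(vlist, new):
    """PART_2: update the v-list; return the index used, or None if discarded."""
    state = classify(vlist, new.r, new.t)
    if state == "NEW_DOMINATED":
        return None
    if state == "OLD_DOMINATED":
        for i, d in enumerate(vlist):
            if not d.valid:          # overwrite the first invalid slot
                vlist[i] = new
                return i
    vlist.append(new)                # LIST_EMPTY or NONE_DOMINATED
    return len(vlist) - 1


vl = []
print(insert_candidate(vl, Dataset("s", 0, "w1", 5.0, 10.0)))  # 0
print(insert_candidate(vl, Dataset("s", 0, "w2", 6.0, 12.0)))  # None
print(insert_candidate(vl, Dataset("s", 0, "w3", 3.0, 15.0)))  # 1
print(insert_candidate(vl, Dataset("s", 0, "b1", 4.0, 9.0)))   # 0
```

The last call dominates the first stored candidate, so its slot is reused, mirroring the overwrite step described for ‘OLD_DOMINATED’.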
 
 
Next, in #PART_3, if the vertex being visited is the destination z (i.e. if v = z),
one possible value of the source-to-destination delay is obtained. It is computed using
the formula tv + rv*Cz, where Cz represents the load-capacitance of the destination/sink.
In multi-weighted routing, however, as long as not all vertices have been visited and not
all possible interconnect-types have been tried, it is not certain that this
source-to-destination delay belongs to the exact minimum-delay path. Therefore, it is
called the “estimated-delay”, and the candidate which gives this estimated_delay is
remembered as the “estimated_end_candidate”. The estimated_delay parameter is powerful.
In S-RABI, if the tv returned from Cost( ) is greater than estimated_delay, the
interconnect-candidate is dropped immediately because it can never give a minimum-delay
better than estimated_delay. The use of the estimated_delay parameter eliminates
unnecessary expansion of tentative search results, and thus lightens the load on the
priority queue Q. Without estimated_delay, the number of candidates could grow in an
NP-like manner and the problem could become intractable.
 
 
When all wire candidates have been considered, if v is not in a buffer-obstacle
region, an available buffer-choice b[i] is picked, the propagated (rv, tv) is computed
by Cost( ), and dominance is checked in InsertCandidate( ). This is repeated for the
other buffer-choices. The process repeats for all vertices with all possible wire-sizes
and buffer-choices, with frequent tightening updates of estimated_delay, until the
priority queue Q is empty. When Q is empty, there is no other possible route, and
estimated_delay is the exact minimum source-to-destination delay based on the
Elmore Delay model. This exact minimum path can be traced back by dereferencing
{u, uk, e} є D(v[k]) from the estimated_end_candidate at vertex z backward to s.
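
The trace-back step can be illustrated with a short Python sketch. The dictionary layout of the v-lists, the function name and the sample data are assumptions made for the example; NIL is represented by None.

```python
def trace_back(L, end_vertex, end_index):
    """Follow the (u, uk) links from the end candidate back to the source.

    L maps each vertex to its v-list; every entry is a tuple
    (u, uk, e, r, t). The source entry has u = uk = e = None.
    Returns a list of (vertex, element used to reach it) from s to z.
    """
    path = []
    v, k = end_vertex, end_index
    while v is not None:
        u, uk, e = L[v][k][:3]
        path.append((v, e))
        v, k = u, uk
    path.reverse()
    return path


# Hypothetical v-lists for a three-vertex route s -> a -> z.
L = {
    "s": [(None, None, None, 10.0, 0.0)],
    "a": [("s", 0, "w1", 12.0, 5.5), ("s", 0, "w2", 11.0, 6.0)],
    "z": [("a", 1, "b1", 1.0, 9.0)],
}
print(trace_back(L, "z", 0))  # [('s', None), ('a', 'w2'), ('z', 'b1')]
```

Because each dataset stores both the preceding vertex and the index within that vertex's v-list, the walk recovers the specific candidate chain and not merely the sequence of vertices.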
 
 
A numerical example that illustrates the detailed working of S-RABI is given 
in Appendix C. 
 
 
 
 
3.4 Summary 
 
 
This chapter explains the S-RABI algorithm and the Insertion Sort priority queue
in detail. In the next chapter, the algorithmic modifications to S-RABI that are
necessary to benefit from a hardware priority queue providing only the INSERT and
EXTRACT functions are presented. The necessary algorithmic modification to
Insertion Sort is also presented in the next chapter, for a high-speed hardware
priority queue implementation.
 
CHAPTER 4 
 
 
 
 
ALGORITHM MODIFICATIONS FOR HARDWARE MAPPING 
 
 
 
 
This chapter presents the algorithmic modifications made on S-RABI, in 
order to utilize the functions of the proposed hardware priority queue. The chapter 
also provides a discussion on specific modifications on the Insertion-Sort Priority 
Queue algorithm so as to obtain a high performance hardware implementation. 
 
 
 
 
4.1 Modifications in graph algorithm to remove DECREASE-KEY 
operation 
 
 
Graph-based shortest path algorithms typically rely on a relaxation technique
to maintain the dominance-property during graph searching (Cormen et al., 2001).
Referring to Figure 4.1, during relaxation, the shortest path estimate d[v] is updated
(if necessary). As the priority queue (Q) is used to maintain the set of shortest path
estimates at each vertex, it is also updated with the changes. This latter operation on
Q is called DECREASE-KEY.


The DECREASE-KEY function, as shown in Figure 4.2, is made up of three
main steps. In the first step, Q is searched for the location of the shortest path
estimate of vertex v. In the second step, the change is confirmed and the value is
updated (if necessary). In the third step, Q is consolidated (or sorted) to maintain the
priority order.
 
Figure 4.1: DECREASE-KEY and Relaxation
[Figure: relaxation of edge (u, v) with d[u] = 5, w(u, v) = 4 and d[v] = 19.
RELAX: if d[v] > d[u] + w(u, v) then (i.e. 19 > 5 + 4) d[v] ← d[u] + w(u, v);
DECREASE-KEY(Q, v, d[v]). The d[ ] table for (a, b, c, …, v, x) changes from
(12, 17, 8, …, 19, 23) to (12, 17, 8, …, 9, 23). In Q (priority-level / identifier),
DECREASE-KEY searches for v in (8 c, 12 a, 17 b, 19 v, 23 x, …), replaces its
priority to give (8 c, 12 a, 17 b, 9 v, 23 x, …), and consolidates Q to
(8 c, 9 v, 12 a, 17 b, 23 x, …).]

 
Figure 4.2: Function DECREASE-KEY ( )
DECREASE-KEY ( Q, x, new_priority_of_x )
    Search element x in Q;
    If (new_priority_of_x dominates old_priority_of_x )
        old_priority_of_x ← new_priority_of_x;    // replace the priority value.
        consolidate();
 
 
As reported in the literature, existing hardware implementations of priority
queues have excluded the DECREASE-KEY function, that is, only INSERT and EXTRACT
are provided. A simple reason might be that the hardware priority queue designs
proposed by Toda et al. (1995), Moon et al. (2000), Chao (1991), Ioannou (2000)
and Argon (2006) are applied primarily to accelerate internet packet routing
algorithms, which do not involve relaxation. Another reason is that DECREASE-KEY
is a relatively more complex function than INSERT and EXTRACT. It involves
searching/scanning for an identifier and not just comparing priority-values.
 
 
Software implementation of Q can easily be manipulated to provide the 
additional DECREASE-KEY function with no added logic cost. In hardware priority 
queue implementations, besides consuming more logic resources, there is additional 
difficulty in synchronizing the more complex underlying operations in DECREASE-
KEY, which results in lower-speed performance. Hence, the hardware priority queue 
proposed in this thesis does not support the DECREASE-KEY operation. Instead, the 
graph algorithm is modified such that relaxation is achieved using only the two basic 
priority queue operations of INSERT and EXTRACT.  
 
 
Recall that the DECREASE-KEY function, based on the given identifier,
searches for the corresponding Q-entry, then replaces its priority value if necessary,
and then consolidates Q to maintain its priority order. Our modification is then rather
simple, and is illustrated in Figure 4.3. We do not search, but instead INSERT a new
entry into Q. Recall that INSERT adds the new entry into Q and consolidates Q to
maintain the priority order. This approach avoids the recursive search in
DECREASE-KEY, which involves locating the corresponding Q-entry and updating
the priority value.
 
 
 
Figure 4.3: INSERT in Relaxation
[Figure: relaxation of edge (u, v) with d[u] = 5, w(u, v) = 4 and d[v] = 19.
RELAX: if d[v] > d[u] + w(u, v) then (e.g. 19 > 5 + 4) d[v] ← d[u] + w(u, v), and
INSERT is invoked in place of DECREASE-KEY(Q, v, d[v]). The d[ ] table for
(a, b, c, …, v, x) changes from (12, 17, 8, …, 19, 23) to (12, 17, 8, …, 9, 23).
In PQ (priority-level / identifier), INSERT adds the entry (9 v) to
(8 c, 12 a, 17 b, 19 v, 23 x, …), and after consolidation PQ holds
(8 c, 9 v, 12 a, 17 b, 19 v, 23 x), where the old entry (19 v) is still present.]

Referring to Figure 4.3 again, because we INSERT a new entry when we are
supposed to update an existing (old) entry, Q now holds an invalid entry, i.e. that old
entry. Therefore, during EXTRACT, the validity of the extracted entry is checked by
comparing its priority value with the shortest path estimate maintained by the
graph algorithm; this is illustrated in Figure 4.4. The modifications are summarized
by the rules given in Figure 4.5.
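
The two rules can be tried out in software with a small Python sketch that reuses the numbers from Figures 4.3 and 4.4. Python's heapq (a software binary heap) merely stands in for the priority queue here; the function names are hypothetical.

```python
import heapq

# Shortest path estimates and queue contents before relaxing edge (u, v).
d = {"a": 12, "b": 17, "c": 8, "v": 19, "x": 23}
Q = [(key, name) for name, key in d.items()]
heapq.heapify(Q)


def relax_with_insert(Q, d, v, new_key):
    """Rule-1: record the better estimate and INSERT; the old entry stays in Q."""
    if new_key < d[v]:
        d[v] = new_key
        heapq.heappush(Q, (new_key, v))


def extract_valid(Q, d):
    """Rule-2: EXTRACT repeatedly until the key matches the current estimate."""
    while Q:
        key, name = heapq.heappop(Q)
        if d[name] == key:
            return name, key
        # otherwise the entry is stale (e.g. 9 != 19) and is discarded
    return None


relax_with_insert(Q, d, "v", 5 + 4)   # d[u] + w(u, v) = 9 < 19

order = []
while (entry := extract_valid(Q, d)) is not None:
    order.append(entry)
print(order)  # [('c', 8), ('v', 9), ('a', 12), ('b', 17), ('x', 23)]
```

The stale entry (19, v) is silently skipped when it reaches the head of the queue, so the caller never observes it.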
 
 
 
Figure 4.4: EXTRACT in Relaxation
[Figure: the d[ ] table for (a, b, c, …, v, x) holds (12, 17, 8, …, 9, 23). The entry
(19 v) is extracted from PQ, which still holds (23 x, …); the extracted priority
temp = 19 is compared with d[u] = 9.]
do{
    do{
        (u, temp) ← EXTRACT(Q)
    }(while d[u] ≠ temp)  // INVALID ENTRY !!! (9 ≠ 19), DISCARD!!!
    :
    :
}(while Q ≠ Ø)


Figure 4.5: Modification rules to remove DECREASE-KEY
Rule-1: If the DECREASE-KEY function is needed, replace it with the INSERT
function. This increases the Q size by one, but one of the entries
in Q is invalid.

Rule-2: During each EXTRACT, the extracted entry is validated. If
the entry is not valid, discard it. The process continues until
the first valid entry is returned.

4.2 Modifications in Dijkstra’s and S-RABI algorithm 
  
 
The modified version of Dijkstra’s algorithm is presented in Figure 4.6. The
algorithm now invokes only the INSERT and EXTRACT functions. The original
execution flow remains, except in the following situations: (i) when DECREASE-KEY
is needed, INSERT is invoked instead; this causes the Q-size to grow by one, and
leaves one entry in the queue which is no longer valid; and (ii) during
EXTRACT-MIN, the extracted queue-entry is validated; a non-valid entry is
discarded and the first valid entry is returned.
 
 
Figure 4.6: Modified Dijkstra’s Algorithm – without DECREASE-KEY
 0  DIJKSTRA_MODIFIED(G, w, s){
 1      for (each vertex v є V[G]){
 2          d[v] ← ∞
 3          π[v] ← NIL
 4      }
 5      d[s] ← 0
 6      INSERT(Q, s, d[s])
 7      S ← Ø
 8      do{
 9          do{
10              (u, temp) ← EXTRACT-MIN(Q)
11          }(while d[u] ≠ temp)
12          S ← S U {u}
13          for (each vertex v є Adj[u]){
14              if (d[v] = ∞){
15                  d[v] ← d[u] + w(u, v)
16                  π[v] ← u
17                  INSERT(Q, v, d[v])
18              }
19              elseif (d[v] > d[u] + w(u, v)){
20                  d[v] ← d[u] + w(u, v)
21                  π[v] ← u
22                  INSERT(Q, v, d[v])
23              }
24          }
25      }(while Q not empty)
26  }

Applying the same modification rules, the modified version of S-RABI is 
presented in Figures 4.7 and 4.8. Note that in Dijkstra’s algorithm, only one shortest 
path estimate d[v] is maintained at each vertex; whereas in S-RABI, a list of 
candidates is maintained at each vertex. Each of the candidate-datasets has a status-
flag, sf є D(u[k]), which is utilized to validate the extracted entry. 
 
 
Figure 4.7: Modified_InsertCandidate ( )
InsertCandidate_MODIFIED(D(u[k]), v, rv, tv, e, L[v]){
    // #PART_1: IDENTIFY THE CONTEXT STATE.
    if ( L[v] = NIL ) {    // v-list is empty.
        CASE ← ‘LIST_EMPTY’;
    }else{
        for each D(v[i]) є L[v] {
            if (sf є D(v[i]) == VALID && rv > r є D(v[i]) && tv > t є D(v[i])){
                // this new candidate is dominated.
                CASE ← ‘NEW_DOMINATED’;
                Break;   // exit “for each D(v[i]) є L[v]”
            }
            elsif ((sf є D(v[i]) == VALID && rv < r є D(v[i]) && tv < t є D(v[i]))
                   || (sf є D(v[i]) == VALID && t є D(v[i]) > estimated_delay)){
                // existing candidate is dominated, so mark invalid.
                sf є D(v[i]) ← NON_VALID;
                CASE ← ‘OLD_DOMINATED’;
                Break;   // exit “for each D(v[i]) є L[v]”
            }
            else {
                // neither new, nor old candidates dominate.
                CASE ← ‘NONE_DOMINATED’;
            }
        }// end “for each D(v[i]) є L[v]”
    }

    // #PART_2: MANIPULATE V-LIST & PRIORITY QUEUE.
    if ( CASE == ‘LIST_EMPTY’ ) {
        D(v[0]) = {u, k, e, rv, tv, VALID}   // D(v[k]) = {u, uk, e, r, t, sf}
        L[v] ← L[v] U D(v[0])
        INSERT(Q, D(v[0]), t є D(v[0]))
    }
    elsif ( CASE == ‘NEW_DOMINATED’ ) {
        // do nothing.
    }
    elsif ( CASE == ‘OLD_DOMINATED’ ) {
        // append to the v-list.
        i = Length[L[v]] + 1
        D(v[i]) = {u, k, e, rv, tv, VALID}
        L[v] ← L[v] U D(v[i])
        INSERT(Q, D(v[i]), t є D(v[i]))
    }
    elsif ( CASE == ‘NONE_DOMINATED’ ) {
        // append to the v-list.
        i = Length[L[v]] + 1
        D(v[i]) = {u, k, e, rv, tv, VALID}
        L[v] ← L[v] U D(v[i])
        INSERT(Q, D(v[i]), t є D(v[i]))
    }

    // #PART_3: UPDATE estimated_delay IF NECESSARY.
    if (v = z) {  // reach the destination
        if (estimated_delay > tv + rv*Cz) {
            estimated_delay ← tv + rv*Cz        // update the value
            estimated_end_candidate ← D(v[i])   // remember this candidate
        }
    }
}

Figure 4.8: Modified S-RABI Algorithm
S-RABI_MODIFIED(G, B, W, s, z){
    for (each vertex v є V[G]){
        L[v] ← NIL
    }
    estimated_delay ← ∞
    D(s[0]) = {NIL, NIL, NIL, Rs, 0, VALID}   // D(v[k]) = {u, uk, e, r, t, sf}
    L[s] ← L[s] U D(s[0])
    INSERT(Q, D(s[0]), t є D(s[0]))           // INSERT(Q, identifier, key)
    do{
        do{
            (D(u[k]), t є D(u[k])) ← EXTRACT-MIN(Q)
        }(while sf є D(u[k]) == NON_VALID)
        if (estimated_delay > t є D(u[k])) {
            for (each vertex v є Adj[u]) {
                if (v є OW[G]’) {   // if v is not wire-obstacle.
                    for each w[i] є W {
                        (rv, tv) ← Cost(r є D(u[k]), t є D(u[k]), w[i])
                        if (tv < estimated_delay)
                            { InsertCandidate_MODIFIED(D(u[k]), v, rv, tv, w[i], L[v]) }
                    }// end wire trials
                    if (v є OB[G]’) {   // if v is not buffer-obstacle.
                        for each b[i] є B {
                            (rv, tv) ← Cost(r є D(u[k]), t є D(u[k]), b[i])
                            if (tv < estimated_delay)
                                { InsertCandidate_MODIFIED(D(u[k]), v, rv, tv, b[i], L[v]) }
                        }
                    }// end buffer trials
                }
            }// end all adjacent-vertices
        }
    }(while Q ≠ Ø)
}

Let us now analyze the modified Dijkstra’s algorithm. The analysis of S-RABI is
very complex and is hence excluded here; interested readers are referred to Nasir et
al. (2006).


The modification of Dijkstra’s algorithm introduces a constant overhead for
validating the extracted entry during EXTRACT. This validation is equivalent to one
compare operation. Referring to lines 14-23 of Figure 4.6, regardless of which
condition holds, d[v] = ∞ or d[v] > d[u] + w(u, v), the work is the same, i.e.
relaxation and INSERT. Because d[v] = ∞ always satisfies d[v] > d[u] + w(u, v), the
two compares can be reduced to one: if (d[v] > d[u] + w(u, v)) then relax and INSERT
(see Figure 4.9). This saves one compare operation, which compensates for the one spent
on checking validity. The same optimization is applied in S-RABI.
 
 
 
Figure 4.9: Further optimization to reduce overhead
Before (lines 14-23 of Figure 4.6):
14  if (d[v] = ∞){
15      d[v] ← d[u] + w(u, v)
16      π[v] ← u
17      INSERT(Q, v, d[v])
18  }
19  elseif (d[v] > d[u] + w(u, v)){
20      d[v] ← d[u] + w(u, v)
21      π[v] ← u
22      INSERT(Q, v, d[v])
23  }

After:
    if (d[v] > d[u] + w(u, v)){
        d[v] ← d[u] + w(u, v)
        π[v] ← u
        INSERT(Q, v, d[v])
    }
 
In terms of run-time complexity, after modification the algorithm can utilize a
hardware priority queue which gives O(1) for EXTRACT and O(1) for INSERT but
provides no DECREASE-KEY. At any moment there can be invalid entries in Q, so
the worst-case Q-size is |E|. The hop-by-hop Dijkstra’s algorithm hence invokes
EXTRACT at most |E| times in the worst case, while INSERT is still invoked at most
|E| times. Since the hardware priority queue gives constant O(1) run-time complexity
for both the INSERT and EXTRACT operations regardless of the Q-size, the run-time
complexity of the modified hop-by-hop Dijkstra’s algorithm is O(E) + O(E) = O(2*E)
= O(E).
 
 
For a sparse graph, |E| << |V|^2. For VLSI routing in particular, the graph is a 2D
grid-graph with |V| = (number-of-columns) * (number-of-rows). Assume the grid-graph
is square, i.e. number-of-columns = number-of-rows. The total number of edges is then
exactly E = 2*(number-of-columns - 1)*(number-of-rows), or E ≈ 2V. Hence the
run-time complexity O(E) ≈ O(V).
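
The edge count can be checked numerically with a short Python sketch. It assumes a 4-connected grid in which every vertex is joined only to its horizontal and vertical neighbours; the function name and the grid size are chosen purely for illustration.

```python
def grid_counts(cols, rows):
    """Return (V, E) for a 4-connected 2D grid-graph."""
    v = cols * rows
    horizontal = (cols - 1) * rows   # edges between neighbours in a row
    vertical = cols * (rows - 1)     # edges between neighbours in a column
    return v, horizontal + vertical


v, e = grid_counts(100, 100)
print(v, e)    # 10000 19800
print(e / v)   # 1.98, i.e. E is close to 2V
```

For a square grid the two terms are equal, which gives the closed form E = 2*(number-of-columns - 1)*(number-of-rows) used above.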
 
 
Compared with Dijkstra’s algorithm running on a software priority queue, this
approach gives a speed-up by a factor of V over an insertion-sort priority queue,
whose run-time complexity is O(V^2). Compared with a Binary-Heap priority queue with
O((V + E) lg V) run-time complexity, or a Fibonacci-Heap priority queue with
O(V lg V + E) run-time complexity (O(V lg V) for this sparse graph), a theoretical
worst-case logarithmic improvement in run-time complexity can be achieved, in
addition to a large improvement in constant communication overhead.




4.3 Modification of Insertion Sort Priority Queue 
 
 
The Insertion Sort Priority Queue involves recursive compare and shift operations.
The massive parallelism of the Insertion Sort Priority Queue has been fully explored in
the (hardware implemented) Shift-Register Priority Queue (Moon et al., 2000; Toda et
al., 1995; Chao, 1991). That implementation assumes n processing elements (PEs) for a
Q-size of n, and completes INSERT and EXTRACT with O(1) run-time complexity,
regardless of Q-size. Both INSERT and EXTRACT complete in two clock-cycles, but the
loading effect of the shared bus results in a very low clock rate, and it is also
difficult to expand the Q-size further.
 
 
Another feasible parallel approach for recursive operations is a pipelined
hardware implementation in the form of a systolic array (Kung, 1980). The architecture
adopts groups of processing-elements, each connected to a small number of nearest
neighbours in a defined topology. Figure 4.10 illustrates a one-dimensional systolic
array. The data can flow in any direction, but the control must flow in one direction.
Systolic array architectures are able to implement algorithms that are recursive.
Conceptually, a recursive operation is divided into tasks which are distributed to a
group of processing-elements (PEs). Ideally, the task is the same in all PEs. Referring
to Figure 4.11, each PE performs one task on a data-item and then passes it to its
adjacent PE, and so on. The operation completes only after that data-item has been
processed by the series of PEs in the systolic array.
 
 
Figure 4.10: One-dimensional Systolic Array Architecture
[Figure: processing elements PE1, PE2, PE3, PE4, …, PEn connected in a line, with
Data In, Data Out and Control Signals.]

 
Figure 4.11: Execution of identical task-cycles for one operation
[Figure: OPERATION-1 enters the systolic array at PE1 and moves one PE to the right
on each task-cycle (PE1, PE2, PE3, PE4, …), reaching PEn after several task-cycles.]
 
 
After performing its task, each PE is released and is ready for a new task.
Therefore, a new operation can be invoked immediately after the first PE has finished
its task. Up to n operations can be invoked, one after another, completing in O(n)
worst-case run-time, i.e. O(1) run-time complexity per operation.
This is illustrated in Figure 4.12.
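
A rough timing model makes the O(n) bound concrete. The Python sketch below assumes that every PE takes exactly one task-cycle per task and that one new operation is issued on every task-cycle; these are simplifying assumptions for illustration, and the function name is hypothetical.

```python
def pipeline_schedule(num_ops, num_pes):
    """Return (issue_cycle, finish_cycle) for each operation.

    Operation j enters PE1 in task-cycle j and leaves PEn in
    task-cycle j + num_pes - 1 (cycles are counted from 0).
    """
    return [(op, op + num_pes - 1) for op in range(num_ops)]


schedule = pipeline_schedule(num_ops=4, num_pes=4)
print(schedule)                       # [(0, 3), (1, 4), (2, 5), (3, 6)]
total_cycles = schedule[-1][1] + 1
print(total_cycles)                   # 7 = num_ops + num_pes - 1
```

With n operations on n PEs the total is 2n - 1 task-cycles, so the cost per operation stays constant as n grows, whereas running the operations back to back without overlap would take n * n task-cycles.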
 

Figure 4.12: Series of operations executed in pipeline
[Figure: OPERATION-1 to OPERATION-5 are issued into PE1 of the systolic array on
successive task-cycles. On each task-cycle every earlier operation advances one PE to
the right, so that when OPERATION-5 enters PE1, operations 4, 3, 2 and 1 occupy PE2,
PE3, PE4 and the next PE respectively.]
 
 
Recall from Chapter 3 that, in Insertion-Sort Priority Queue operations, each
INSERT invokes recursive compares and right-shifts on successive queue entries.
The process starts from the last entry and continues until the correct position for the
new entry is found. EXTRACT, in turn, triggers recursive left-shifts on the entire
queue, beginning from the left-most entry and ending at the last entry. Both INSERT
and EXTRACT therefore perform recursive processing, but in opposite directions, and
this is prohibited in a systolic array architecture. To support pipelined execution in
a systolic array architecture, both the INSERT and EXTRACT functions must proceed in
the same direction. That is, assuming an array of PEs, either the left PE triggers an
operation/instruction on the processing-element to its right or vice versa, but not
both. Data-items, however, can flow in any direction (including backwards).
 
 
Note that the INSERT operation always begins at the last entry. The position of
this last entry varies with the current effective Q-size. If such an INSERT operation
is implemented, the new entry must be able to reach every location along the array.
Physically, a shared bus is needed to connect the input-port to all
processing-elements. The presence of a shared bus causes a severe loading effect, as
in the Shift-Register Priority Queue implementation.
 
 
Consider the modifications of the Insertion-Sort Priority Queue where both 
INSERT and EXTRACT proceed in same direction, i.e. from left to right; and new 
entries are all inserted into Q, from the left. The highest priority entry remains 
extracted from left.  
 
 
Figure 4.13: Modified Insertion-Sort Priority Queue
0  INSERT_MOD(array A, int length, key) {
1      for (i = 0; i ≤ length-1; i++) {
2          if ( A[i] > key ) { // swap
3              temp ← A[i];
4              A[i] ← key;
5              key ← temp;
6          }
7      }
8      A[length] ← key; length ← length + 1;
9  }

0  EXTRACT-MIN(array A, int length) {
1      min-key ← A[0];
2      for (i = 0; i < length-1; i++) {
3          A[i] ← A[i + 1];
4      }
5      length ← length - 1;
6      return(min-key);
   }

Figure 4.13 gives the modified Insertion Sort Priority Queue. INSERT has
been modified to INSERT_MOD, while EXTRACT remains unchanged. Referring to
lines 1-7 of INSERT_MOD in Figure 4.13, the new entry (key) is first compared with the
first entry in the queue (array A). The higher priority entry wins the comparison and
occupies the location. The lower priority entry loses and becomes ‘floating’, i.e. it
becomes the new entry for the right neighbour. The same process continues until the
end of the queue. Figure 4.14 illustrates the execution of the modified INSERT operation.
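
A direct Python transcription of Figure 4.13 may help in following the example. It is a software sketch only: a Python list replaces the array A and its length counter, and the smaller key is taken to be the higher priority.

```python
def insert_mod(A, key):
    """INSERT_MOD: scan from the left; the smaller key stays, the larger floats on."""
    for i in range(len(A)):
        if A[i] > key:
            A[i], key = key, A[i]   # swap: the displaced entry becomes the floating key
    A.append(key)                   # the floating key lands in the first empty location


def extract_min(A):
    """EXTRACT-MIN: remove the left-most entry; the rest shift one place left."""
    return A.pop(0)


q = [12, 18, 19, 55]
insert_mod(q, 9)
print(q)               # [9, 12, 18, 19, 55]
insert_mod(q, 20)
print(q)               # [9, 12, 18, 19, 20, 55]
print(extract_min(q))  # 9
print(q)               # [12, 18, 19, 20, 55]
```

Both functions touch the array strictly from left to right, which is the property that the systolic mapping relies on.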
 
 
Figure 4.14: Example of INSERT_MOD operation
[Figure: the new element 9 is inserted into an ordered queue holding 12, 18, 19, 55.
At each location the incoming element is compared with the stored element; the higher
priority element occupies the location and the other floats to the next location. The
elements 12, 18, 19 and 55 are displaced in turn, until the floating element 55 meets
the first empty location and stops. Finally, the queue is fully sorted:
9, 12, 18, 19, 55.]

With both INSERT_MOD and EXTRACT now proceeding in the same direction, a
systolic array implementation becomes feasible. For a queue of length n, the recursive
‘for’ loop in INSERT_MOD can be divided into n compare-and-right-shift tasks and
executed in pipelined systolic array.
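
The pipelined behaviour can be explored with a small cycle-level simulation in Python. This is an illustrative model under stated assumptions, not the hardware design of this thesis: each PE holds at most one key, performs one compare-and-swap per task-cycle, and passes the floating key to its right neighbour; a new key may enter the left-most PE on every task-cycle. The function name and the sample keys are hypothetical.

```python
def systolic_insert(num_pes, keys):
    """Simulate pipelined INSERT_MOD on a linear array of PEs.

    Returns (cells, cycles): the key stored in each PE (None = empty)
    and the number of task-cycles until the array is idle again.
    """
    cells = [None] * num_pes
    flow = [None] * num_pes          # flow[i] = key entering PE i this cycle
    pending = list(keys)
    cycles = 0
    while pending or any(k is not None for k in flow):
        if pending:
            flow[0] = pending.pop(0)         # issue one new INSERT per cycle
        nxt = [None] * num_pes
        for i, key in enumerate(flow):
            if key is None:
                continue
            if cells[i] is None:             # first empty location: key settles
                cells[i] = key
                continue
            if cells[i] > key:               # compare-and-swap
                cells[i], key = key, cells[i]
            if i + 1 == num_pes:
                raise OverflowError("queue is full")
            nxt[i + 1] = key                 # floating key moves right
        flow = nxt
        cycles += 1
    return cells, cycles


cells, cycles = systolic_insert(6, [9, 3, 7, 1])
print(cells)   # [1, 3, 7, 9, None, None]
print(cycles)  # 7
```

In this model the minimum key is already in the left-most PE one task-cycle after it is issued, even while earlier keys are still floating to the right, which is what allows operations to be issued back to back.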
 
 
 
 
4.4 Summary 
 
 
This chapter presents the modifications to the graph-based shortest path
algorithms, namely Dijkstra’s and the S-RABI algorithm, that are needed to benefit from
a hardware priority queue which provides only the INSERT and EXTRACT functions. The
necessary algorithmic modification to Insertion Sort, which allows mapping onto a high
performance systolic array architecture, is also presented. In the next chapter, the
implementation of the Graph Processing Unit is illustrated.
 
CHAPTER 5 
 
 
 
 
THE GRAPH PROCESSING UNIT 
 
 
 
 
 
This chapter describes the implementation of a Graph Processing Unit (GPU)
on an FPGA-based embedded system hardware platform built on the Altera NIOS II
platform. The design environment and basic development flow are first introduced,
followed by an overview of the top-level architecture of the GPU design.
Next, each sub-module (hardware and software) and its integration into the system
are discussed. The software-based sub-modules, i.e. the device driver and APIs, are
described in detail in this chapter. Finally, the hybrid priority queue (HybridPQ)
implementation is presented at the end of the chapter.
 
 
 
 
5.1 Introduction 
 
 
The key objective of this thesis work is to design and demonstrate the 
deployment of a hardware priority queue in the acceleration of the computation of 
graph-based shortest path algorithm when applied in deep-sub-micron VLSI 
interconnect routing. Implemented in an embedded processor-based hardware 
platform, the proposed hardware priority queue (hwPQ) functions as a co-processor 
to offload the compute-intensive priority queue operations from a general purpose 
processor. Due to the availability of development tools and rapid prototyping 
resources, the Altera NIOS II System-on-Programmable-Chip (SoPC) development 
system is chosen for our proof-of-concept design.  
Under the NIOS II SoPC development environment, NIOS II embedded 
processor serves as the general purpose processor, and other task specific peripherals 
such as custom design logics, various IO controllers and memory controllers are 
connected to NIOS II via Avalon System Bus.  
 
 
All peripherals designed around the Avalon System Bus follow a restricted
set of rules. For each individual peripheral, a customized hardware Avalon Interface
Unit and the corresponding software device driver must be developed (refer to Figure
5.1). Basically, the design of the Avalon Interface Unit is determined by the behavior
of the specific peripheral and by the selected Avalon communication protocol (Altera,
2005a). The development of the device driver is based on the Altera Hardware
Abstraction Layer Application Program Interface (HAL API) technology (refer to Figure
5.2) (Altera, 2004c). Appendix E provides the basic design flow of NIOS II system
development and also an example of a NIOS II system.
 
 
 
Figure 5.1: NIOS II System Architecture 
[Figure: the NIOS II Processor runs the Application (C-Program) and Device Driver_1, 
Device Driver_2, Device Driver_3, ... (software components development). Through the 
Avalon System Bus it connects to Peripheral_1, Peripheral_2, Peripheral_3 and other 
peripherals, each attached via its own Avalon Interface Unit (hardware components 
development).] 
 
 
 
Figure 5.2: Different layers of software components in NIOS II System 
[Figure: NIOS II Software Architecture. Layers shown: User Application (C-Program); 
User Software Routines / User API; Standard C Library and HAL API; Device Driver_1, 
Device Driver_2, Device Driver_3, ..., Device Driver_n; Various Hardware Peripherals.] 
 
 
 
 
5.2 System Architecture of Graph Processing Unit (GPU) 
 
 
Figure 5.3 below illustrates the top-level hardware architecture of the 
proposed Graph Processing Unit (GPU), which essentially consists of a general 
purpose processor (the NIOS II processor) and the proposed Priority Queue 
Accelerator Module.  
 
 
 
Figure 5.3: Top-Level Architecture of Graph Processing Unit 
[Figure: the Graph Processing Unit (GPU) comprises the General Processor (NIOS II) and 
the Priority Queue Accelerator Module, connected by the Avalon System Bus. The 
accelerator module contains the Hardware Priority Queue Unit (hwPQ) and an Avalon 
Interface Unit.] 

The Priority Queue Accelerator Module offloads priority queue operations 
from the NIOS II processor and accelerates them. All other graph functions are executed 
by the NIOS II processor. Referring to Figure 5.3, the Hardware Priority Queue Unit 
(referred to as “hwPQ”) in the accelerator module is a stand-alone priority queue 
core. It executes the priority queue operations INSERT and EXTRACT-MIN 
(hereafter referred to as “EXTRACT”).  
 
 
The hwPQ is designed to be highly parameterizable, with a scalable queue size. 
Note, however, that the queue size in an actual physical implementation 
also depends on the available logic resources (in this case, the number of logic 
elements in the FPGA device). In an ASIC implementation, the size depends on the die size. 
In short, a physical implementation has a finite queue size; therefore, if the graph 
computation needs a larger priority queue, some form of control mechanism is 
required. For instance, the graph computation can be terminated immediately, but this 
yields no result. Another approach is to discard low-priority queue entries. However, if 
low-priority entries are discarded while the computation is in progress, 
the final result might not be accurate, since a discarded entry might be 
the one that leads to the shortest path. 
 
 
Taking the discussion above into consideration, a Hybrid Hardware-Software 
Priority Queue (HybridPQ) is proposed in this thesis. HybridPQ integrates the fixed-size 
(hardware) Priority Queue Accelerator Module with a flexible-size software-based 
priority queue. A specific control mechanism is incorporated into HybridPQ 
such that the number of queue entries is not restricted by the implemented size of the 
(hardware) Priority Queue Accelerator Module, thus avoiding the situations described 
above.  
 
 
Figure 5.4 illustrates the software-hardware partitioning of the 
proposed GPU, hierarchically modularized into several layers. There are three 
software components, namely the user application, the HybridPQ API, and the device 
drivers, all of which are executed by the NIOS II processor.  
 
 
 
Figure 5.4: GPU – Software/Hardware System Partitioning 
[Figure: the Graph Processing Unit (GPU) is partitioned into software layers (Layer-5 to 
Layer-3, executed on the NIOS II Processor) and hardware layers (Layer-2 to Layer-0). 
Software blocks shown: User Application (Nano-scale VLSI Routing Module, SRABI); 
User API HybridPQ; HAL API and hwPQ_Device_Driver. Hardware blocks shown: Avalon 
System Bus; Priority Queue Accelerator Module with hwPQ_Avalon_Interface_Unit and 
Hardware Priority Queue Unit (hwPQ).] 
 
 
 
 
5.3 Priority Queue Accelerator Module 
 
 
The Priority Queue Accelerator Module shown in Figure 5.5 consists of two 
main blocks: (i) the Hardware Priority Queue Unit (hwPQ), and (ii) the 
corresponding Avalon Interface Unit. The Hardware Priority Queue Unit, hwPQ, 
accommodates up to n priority queue entries, and at the same time serves as a 
processing engine to sort the queue entries and maintain their priority-orders. It 
executes the priority queue operations: INSERT and EXTRACT. The Avalon 
Interface Unit handles data communication between hwPQ and the Avalon System 
Bus. 
 
 
Figure 5.5: Functional Block Diagram of Priority Queue Accelerator Module 
[Figure: the Avalon Interface Unit receives writedata, readdata, address, chipselect and 
CLK from the Avalon System Bus and exchanges Data In, Data Out and Instruction with 
the Hardware Priority Queue Unit (hwPQ).] 
 
 
As the designs of these modules are rather complex, only their conceptual 
design is introduced in this chapter; the detailed design 
and implementation are provided in Chapter 6. 
 
 
 
 
5.3.1 Specification and Conceptual Design of hwPQ 
 
 
Searching a graph, which is a systematic traversal of the vertices, produces a 
large number of tentative candidates, each with its own dataset and 
assigned priority value. From vertex to vertex, the candidates are continuously 
placed in the priority queue and sorted in accordance with their priority values. For 
each entry, the Hardware Priority Queue Unit (hwPQ) allocates one storage location for 
the priority value and one for the data. Each storage location is 32 bits wide. If the 
data exceeds 32 bits, a 32-bit identifier (i.e. a pointer) to the storage of the data is 
stored instead. 
 
 
Referring to Figure 5.6, hwPQ supports INSERT and EXTRACT. During 
INSERT, hwPQ receives a 32-bit priority value and a 32-bit identifier for the new 
entry. During EXTRACT, the extracted top-priority entry is in the same format. 
Therefore, hwPQ has two separate data ports, one for input data and the other for 
output data. Each data port is 64 bits wide: 32 bits for the priority value and 32 bits 
for the identifier. The priority value and identifier can thus be inserted into or 
extracted from hwPQ simultaneously, which yields a higher data throughput. 
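
The entry format described above can be sketched in C as follows. This is a minimal illustration only: the bit layout (priority value in bits 63:32, identifier in bits 31:0) follows Figure 5.6, while the type name pq_entry_t and the helpers entry_pack and entry_unpack are hypothetical names introduced here and are not part of the thesis code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for illustration; only the bit layout comes from the text:
 * bits 63:32 = priority value, bits 31:0 = identifier. */
typedef struct {
    uint32_t priority;
    uint32_t identifier;
} pq_entry_t;

/* Combine the two 32-bit fields into the 64-bit word seen at a hwPQ data port. */
static uint64_t entry_pack(pq_entry_t e)
{
    return ((uint64_t)e.priority << 32) | (uint64_t)e.identifier;
}

/* Split a 64-bit data-port word back into its two 32-bit fields. */
static pq_entry_t entry_unpack(uint64_t word)
{
    pq_entry_t e;
    e.priority   = (uint32_t)(word >> 32);
    e.identifier = (uint32_t)(word & 0xFFFFFFFFu);
    return e;
}
```

Because the priority value occupies the upper half of the word, both fields travel through the port together in a single 64-bit transfer.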
 
 
The control signals work as follows. For an INSERT operation, once a new 
entry is ready at the input data port, hwPQ latches the data on the strobe 
of a control signal. For an EXTRACT operation, hwPQ is designed such that the 
highest-priority entry is always ready at the output data port. Reading the output data 
port ‘returns’ the highest-priority entry. To ‘remove’ it, a control signal is strobed 
and hwPQ destroys that entry by replacing it with the next highest-priority 
entry. 
 
 
 
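Figure 5.6: Top-Level Description of hwPQ 
[Figure: the Hardware Priority Queue Unit (hwPQ) as a chain of processing elements PE1, 
PE2, PE3, PE4, ..., PEn driven by control signals. A new entry is inserted through a 
64-bit input port and the top-priority entry is extracted through a 64-bit output port. 
Each 64-bit entry holds the priority-value in bits 63:32 and the identifier in bits 31:0.] 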
 
 
There are n identical processing elements (PEs) in hwPQ to support a worst case 
of n priority queue entries. The design results in constant O(1) run-time 
complexity for both INSERT and EXTRACT operations. Moreover, the constant overhead 
is very low; either operation takes only two clock cycles to complete.  
 
 
 
 
5.3.2 Specification and Conceptual Design of Avalon Interface Unit 
 
 
In general, several Avalon communication protocols are available. In our 
implementation, the Avalon Slave Transfer mode is chosen, 
since hwPQ serves as a slave that offloads priority queue operations from the 
general-purpose NIOS II processor. NIOS II is the Avalon master peripheral, while all 
other devices, including hwPQ, the memory controller and the UART controller, are Avalon 
slave peripherals. 
 
 
The Avalon slave transfer mode presumes that all Avalon slave peripherals 
have memory-mapped IO ports, as depicted in Figure 5.7. Data transfer is conducted 
via specific data lines within the Avalon bus. The bus data lines and signals that are 
relevant to the Avalon slave transfer mode and require our consideration are tabulated in 
Table 5.1. 
 
 
Figure 5.7: Memory-mapped IO of Avalon Slave Peripheral 
[Figure: an Avalon slave peripheral consists of a slave core and memory-mapped IO ports 
(an address decoder and a register bank). It connects to the Avalon System Bus through 
the signals CLK, chipselect, address[31:0], writedata[31:0] and readdata[31:0].] 

Table 5.1: Avalon System Bus signal descriptions 
Signal      Direction (relative to   Direction (relative to   Width  Description 
            Avalon System Bus)       connected peripheral) 
CLK         output                   input                    1      Global synchronous system clock. 
chipselect  output                   input                    1      Peripheral-select signal. 
address     output                   input                    32     Address to memory-mapped IO. 
writedata   output                   input                    32     Input data to memory-mapped IO. 
readdata    input                    output                   32     Output data from memory-mapped IO. 
 
 
The functional block diagram of the Avalon Interface Unit is given in Figure 
5.8, and Table 5.2 describes the memory-mapped registers in the 
Avalon Interface Unit. The Avalon Interface Unit interfaces the 32-bit Avalon 
System Bus with our hwPQ, which has 64-bit input and output ports. A 64-bit bus would be 
preferable, but the existing development system limits the bus to 32 bits. 
 
 
 
Figure 5.8: Functional Block Diagram of Avalon Interface Unit 
[Figure: the Avalon Interface Unit consists of an Avalon Data Unit (avalonDU) and an 
Avalon Control Unit (avalonCU). Interface to the Avalon System Bus: readdata[31:0], 
writedata[31:0], address[4:2], chipselect, CLK. Interface to hwPQ: Output data[63:0], 
Input data[63:0], Controls[1:0], CLK. Registers: REG_INSERT_PRIORITY, 
REG_INSERT_IDENTIFIER, REG_TOP_PRIORITY, REG_TOP_IDENTIFIER, REG_OPMODE.] 
Table 5.2: Memory-mapped Register descriptions 
Register                Direction relative to   Direction relative   Width  Description 
                        Avalon System Bus       to hwPQ 
REG_TOP_PRIORITY        input                   output               32     Stores the priority value of top-priority entry. 
REG_TOP_IDENTIFIER      input                   output               32     Stores the identifier of top-priority entry. 
REG_INSERT_PRIORITY     output                  input                32     Stores the priority value of new entry. 
REG_INSERT_IDENTIFIER   output                  input                32     Stores the identifier of new entry. 
REG_OPMODE              output                  input                3      Stores the control mode to hwPQ. 
 
 
To INSERT a new entry into hwPQ, the NIOS II processor (via the device 
driver) sends the new entry in two separate bus transfer cycles: in one cycle the 
priority value is stored in REG_INSERT_PRIORITY, and in the other 
cycle the identifier is stored in REG_INSERT_IDENTIFIER. Next, the 
control mode corresponding to the INSERT operation at hwPQ is stored in 
REG_OPMODE. Then, upon receiving the corresponding control signal, hwPQ takes in 
the new entry. 
 
 
Recall that we have designed the hwPQ such that the highest priority entry is 
always ready at output data port. It is always stored separately in 
REG_TOP_PRIORITY and REG_TOP_IDENTIFIER registers. In order to 
EXTRACT the highest priority entry from hwPQ, NIOS II processor (via the device 
driver) will have to read these two registers, one after another. Once that is 
completed, the control mode corresponding to EXTRACT is sent to REG_OPMODE, 
and hwPQ removes (destroys) that top-priority entry by overwriting 
REG_TOP_PRIORITY and REG_TOP_IDENTIFIER with the next highest-priority 
entry. 
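
The register interface described above can be modelled in software as a small register bank selected by address bits [4:2]. The sketch below is a behavioural illustration under stated assumptions, not the VHDL of Appendix F: the register indices and the names aiu_t, aiu_decode, aiu_bus_write, aiu_bus_read and aiu_input_data are hypothetical, since the actual address map is not given in this chapter.

```c
#include <assert.h>
#include <stdint.h>

/* ASSUMED register indices (selected by address[4:2]); the real map may differ. */
enum {
    REG_INSERT_PRIORITY   = 0,
    REG_INSERT_IDENTIFIER = 1,
    REG_OPMODE            = 2,
    REG_TOP_PRIORITY      = 3,
    REG_TOP_IDENTIFIER    = 4,
    REG_COUNT             = 5
};

typedef struct {
    uint32_t reg[REG_COUNT];
} aiu_t;

/* Word-aligned decode: byte address bits [4:2] pick one 32-bit register. */
static unsigned aiu_decode(uint32_t byte_address)
{
    return (byte_address >> 2) & 0x7u;
}

/* Bus write: only the registers written by the processor accept data. */
static void aiu_bus_write(aiu_t *u, uint32_t byte_address, uint32_t data)
{
    unsigned idx = aiu_decode(byte_address);
    if (idx == REG_INSERT_PRIORITY || idx == REG_INSERT_IDENTIFIER ||
        idx == REG_OPMODE) {
        u->reg[idx] = data;
    }
}

/* Bus read: returns the selected register, or 0 for an unmapped index. */
static uint32_t aiu_bus_read(const aiu_t *u, uint32_t byte_address)
{
    unsigned idx = aiu_decode(byte_address);
    return (idx < REG_COUNT) ? u->reg[idx] : 0u;
}

/* 64-bit word presented to the hwPQ input port after two 32-bit bus writes. */
static uint64_t aiu_input_data(const aiu_t *u)
{
    return ((uint64_t)u->reg[REG_INSERT_PRIORITY] << 32) |
           (uint64_t)u->reg[REG_INSERT_IDENTIFIER];
}
```

The model makes the width conversion explicit: two 32-bit bus writes are needed before one 64-bit entry is available at the hwPQ input port.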
5.4 hwPQ Device Driver 
  
 
A device driver is a software subroutine, written in C and executed on the NIOS II 
processor, through which the processor interacts with a hardware peripheral. The design 
of the device driver follows the strict requirements of the Altera Hardware Abstraction 
Layer Application Program Interface (HAL API) technology (Altera, 2004b). The HAL 
API specifies standard C macros to initiate read/write operations between the NIOS II 
processor and memory-mapped Avalon slave peripherals. The use of the HAL API 
allows portability across a variety of Altera NIOS II development boards.  
 
 
‘hwPQ_Device_Driver’ is developed for NIOS II to interact with the Priority 
Queue Accelerator Module. hwPQ_Device_Driver consists of a number of 
successive C statements that correspond to the sequence of data transfers between the 
NIOS II processor and the Priority Queue Accelerator Module (the C code is provided 
in Appendix G). From the programming point of view, the Priority Queue Accelerator 
Module is seen by the NIOS II processor as a bank of registers, as shown in Figure 
5.9. A sequence of data transfers, each in the proper bus direction and targeting 
selected registers, triggers the priority queue operations on hwPQ.  
 
 
 
Figure 5.9: Programming Model of Priority Queue Accelerator Module 
[Figure: the NIOS II Processor runs hwPQ_Device_Driver, which provides the routines 
INSERT(...) and EXTRACT(...). Over writedata and readdata, the driver accesses the 
registers of the Priority Queue Accelerator Module: REG_TOP_IDENTIFIER, 
REG_TOP_PRIORITY, REG_OPMODE, REG_INSERT_IDENTIFIER and REG_INSERT_PRIORITY.] 
 
 
To invoke an INSERT operation on our Priority Queue Accelerator Module, 
NIOS II executes the driver routine in Figure 5.10. First, the priority value is written 
into REG_INSERT_PRIORITY (labelled REG_NEW_PRIORITY in Figure 5.10), followed by 
the identifier, which is written into REG_INSERT_IDENTIFIER (labelled 
REG_NEW_IDENTIFIER). The control code, or operation mode, equivalent to 
INSERT is then written into REG_OPMODE. Lastly, REG_OPMODE is 
overwritten with the NO_OPERATION mode. The reasons why the NO_OPERATION 
mode is needed are explained further in Chapter 6. 
 
 
 
Figure 5.10: Device driver routine for INSERT operation 
[Figure: four writedata transfers from the NIOS II Processor. Step-1: write the priority 
value to REG_NEW_PRIORITY. Step-2: write the identifier to REG_NEW_IDENTIFIER. 
Step-3: write “INSERT” to REG_OPMODE. Step-4: write “NO OPERATION” to REG_OPMODE.] 
 
 
 
Figure 5.11: Device driver routine for EXTRACT operation 
[Figure: Step-1: the NIOS II Processor reads REG_TOP_PRIORITY (readdata). Step-2: it 
reads REG_TOP_IDENTIFIER (readdata). Step-3: it writes “EXTRACT” to REG_OPMODE 
(writedata). Step-4: it writes “NO OPERATION” to REG_OPMODE (writedata).] 

To invoke the EXTRACT operation on our Priority Queue Accelerator 
Module, NIOS II executes the driver routine in Figure 5.11. First, the priority value 
is read from REG_TOP_PRIORITY, followed by the identifier, which is read from 
REG_TOP_IDENTIFIER. The control code, or operation mode, equivalent to 
EXTRACT is then written into REG_OPMODE. Lastly, REG_OPMODE is 
overwritten with the NO_OPERATION mode.  
 
 
Besides INSERT and EXTRACT, two additional priority queue functions are provided 
at the device driver level: PEEK and DELETE. A later section illustrates how these 
four priority queue functions (INSERT, EXTRACT, PEEK and DELETE) are utilized in 
HybridPQ. 
 
 
The PEEK function of a priority queue looks at the content of the top-priority 
entry without removing it. Recall that the top-priority entry is always ready at 
the outputs of the REG_TOP_PRIORITY and REG_TOP_IDENTIFIER registers. 
Therefore, PEEK is easily accomplished by reading these two registers, without writing 
any control mode to REG_OPMODE. Figure 5.12 illustrates this operation. 
 
 
 
Figure 5.12: Device driver routine for PEEK operation 
[Figure: Step-1: the NIOS II Processor reads REG_TOP_PRIORITY (readdata). Step-2: it 
reads REG_TOP_IDENTIFIER (readdata).] 
 
 
The DELETE function, on the other hand, removes the top-priority entry 
regardless of its content. Referring to Figure 5.13, DELETE 
corresponds to writing the EXTRACT operation mode to REG_OPMODE and then 
overwriting REG_OPMODE with NO_OPERATION. It can easily be seen that the 
EXTRACT operation is actually a combination of PEEK followed by DELETE. 
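
The four driver routines can be sketched in C as shown below. This is not the driver from Appendix G: to keep the example self-contained, the memory-mapped registers are replaced by a small software mock (reg_write and reg_read stand in for the HAL read/write macros), and the operation-mode codes, the register indices, the queue depth HWPQ_SIZE and the function signatures are all assumptions made for illustration. Only the order of register accesses follows Figures 5.10 to 5.13.

```c
#include <assert.h>
#include <stdint.h>

#define HWPQ_SIZE 4            /* ASSUMED depth of the mock queue */
#define INF 0xFFFFFFFFu        /* empty-slot marker */

enum { OP_NOP = 0, OP_INSERT = 1, OP_EXTRACT = 2 };   /* ASSUMED mode codes */
enum { REG_INSERT_PRIORITY, REG_INSERT_IDENTIFIER, REG_OPMODE,
       REG_TOP_PRIORITY, REG_TOP_IDENTIFIER, REG_COUNT };

/* ---- Mock of the accelerator module (stands in for the real hardware) ---- */
static uint32_t regs[REG_COUNT];
static uint32_t q_pri[HWPQ_SIZE], q_id[HWPQ_SIZE];

static void mock_reset(void)
{
    for (int i = 0; i < HWPQ_SIZE; i++) { q_pri[i] = INF; q_id[i] = INF; }
    for (int r = 0; r < REG_COUNT; r++) regs[r] = 0u;
    regs[REG_TOP_PRIORITY] = INF;
    regs[REG_TOP_IDENTIFIER] = INF;
}

static void reg_write(int r, uint32_t v)
{
    regs[r] = v;
    if (r != REG_OPMODE) return;
    if (v == OP_INSERT) {              /* compare and right-shift along the queue */
        uint32_t p = regs[REG_INSERT_PRIORITY], id = regs[REG_INSERT_IDENTIFIER];
        for (int i = 0; i < HWPQ_SIZE; i++) {
            if (p < q_pri[i]) {
                uint32_t tp = q_pri[i], ti = q_id[i];
                q_pri[i] = p; q_id[i] = id; p = tp; id = ti;
            }
        }
    } else if (v == OP_EXTRACT) {      /* left-shift, refill the tail with INF */
        for (int i = 0; i + 1 < HWPQ_SIZE; i++) {
            q_pri[i] = q_pri[i + 1]; q_id[i] = q_id[i + 1];
        }
        q_pri[HWPQ_SIZE - 1] = INF; q_id[HWPQ_SIZE - 1] = INF;
    }
    regs[REG_TOP_PRIORITY] = q_pri[0];  /* top entry is always readable */
    regs[REG_TOP_IDENTIFIER] = q_id[0];
}

static uint32_t reg_read(int r) { return regs[r]; }

/* ---- Driver routines: access order follows Figures 5.10 to 5.13 ---- */
static void hwPQ_insert(uint32_t priority, uint32_t identifier)
{
    reg_write(REG_INSERT_PRIORITY, priority);
    reg_write(REG_INSERT_IDENTIFIER, identifier);
    reg_write(REG_OPMODE, OP_INSERT);
    reg_write(REG_OPMODE, OP_NOP);
}

static void hwPQ_peek(uint32_t *priority, uint32_t *identifier)
{
    *priority = reg_read(REG_TOP_PRIORITY);
    *identifier = reg_read(REG_TOP_IDENTIFIER);
}

static void hwPQ_delete(void)
{
    reg_write(REG_OPMODE, OP_EXTRACT);
    reg_write(REG_OPMODE, OP_NOP);
}

static void hwPQ_extract(uint32_t *priority, uint32_t *identifier)
{
    hwPQ_peek(priority, identifier);   /* EXTRACT = PEEK followed by DELETE */
    hwPQ_delete();
}
```

Writing hwPQ_extract as PEEK followed by DELETE mirrors the observation in the text and keeps the register access sequence in one place.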
 
 
Figure 5.13: Device driver routine for DELETE operation 
[Figure: Step-1: the NIOS II Processor writes “EXTRACT” to REG_OPMODE (writedata). 
Step-2: it writes “NO OPERATION” to REG_OPMODE (writedata).] 
 
 
 
 
5.5 Hybrid Hardware-Software Priority Queue (HybridPQ) 
 
  
In a single-weighted graph problem, the worst-case size of the priority queue is 
known. For example, the worst-case priority queue size for Dijkstra’s algorithm is 
|V|, the total number of vertices. In this case, the queue size is deterministic. On the 
other hand, multi-weighted graph problems are NP-problems: the algorithms are 
non-polynomial in complexity, and the worst-case priority queue size is non-deterministic. 
For hardware development, however, the exact priority queue size must be fixed in 
advance because the available logic resources are limited. Consequently, if the 
hardware priority queue size is predetermined and fixed, overflow can occur. 
 
 
In software, a priority queue may be implemented with a number of advanced 
data structures, e.g. doubly-linked lists and pointer-based data structures. Such 
structures allow the size of the priority queue to grow or shrink 
throughout the computation. During INSERT, the processor can extend the priority 
queue through standard memory allocation. During EXTRACT, the processor can 
release the vacated memory location. In a software implementation, the size of the 
priority queue is therefore ‘self-adaptive’; it is constrained only by the available memory. 
 
 
Hence we propose the Hybrid Hardware-Software Priority 
Queue (HybridPQ), which combines the high-speed but fixed-size 
(hardware) Priority Queue Accelerator Module with a self-adaptive but reasonably fast 
software priority queue. The Fibonacci-Heap priority queue (FHPQ) is chosen as the 
target software priority queue implementation. The reason is simple: it is the software 
priority queue with the best asymptotic run-time considered in this work, with O(1) 
amortized run-time complexity for INSERT and O(lg n) amortized run-time complexity 
for EXTRACT. The HybridPQ is implemented as a software layer 
abstracted above the hwPQ device driver and the FHPQ software routines, as illustrated 
in Figure 5.14.  
 
 
 
Figure 5.14: Software Abstraction Layer of HybridPQ 
[Figure: within the Graph Processing Unit (GPU), the NIOS II Processor executes the User 
Application (Nano-scale VLSI Routing Module, SRABI), the User API HybridPQ, and beneath 
it FHPQ together with the HAL API and hwPQ Device Driver. The Avalon System Bus connects 
the processor to the Priority Queue Accelerator Module, which contains the hwPQ Avalon 
Interface Unit and the Hardware Priority Queue Unit (hwPQ).] 
 
 
A specific control mechanism is proposed and incorporated into HybridPQ 
such that, at the top level of abstraction, HybridPQ still supports the two basic 
priority queue operations, INSERT and EXTRACT (see Figure 5.15). The figure also 
shows the underlying functions of the hardware and software priority queues, 
which are utilized in the control mechanism of HybridPQ. 
 
Figure 5.15: Functional Block Diagram of HybridPQ 
[Figure: HybridPQ exposes HybridPQ_insert and HybridPQ_extract. The HybridPQ Control 
Mechanism (a C routine executed by NIOS II) calls hwPQ_insert, hwPQ_extract, hwPQ_peek 
and hwPQ_delete on the Hardware Priority Queue Unit (hwPQ) in the Priority Queue 
Accelerator Module, and FHPQ_insert, FHPQ_extract, FHPQ_peek and FHPQ_delete on FHPQ, 
the software priority queue that resides in RAM.] 
 
 
The control mechanism is simple. We have a fixed-size Hardware Priority 
Queue Unit (hwPQ) which accommodates up to k queue entries. The software priority 
queue, FHPQ, on the other hand, is limited in size only by the available memory. For an 
INSERT operation, if hwPQ is not fully occupied, the new element is 
inserted into hwPQ; otherwise the new element is inserted into FHPQ (see 
Figure 5.16). Both hwPQ and FHPQ have O(1) run-time complexity for the insert 
operation. Hence, HybridPQ still maintains O(1) run-time complexity for INSERT, 
although inserting a new entry into FHPQ has a larger constant run-time overhead 
than inserting into hwPQ. 
 
 
For an EXTRACT operation, the top-priority elements of hwPQ and FHPQ 
are compared, and only the higher-priority element of the two is returned (see 
Figure 5.17). Thus, if the extracted element originates from hwPQ, the run-time is 
still O(1); if it originates from FHPQ, the run-time is O(lg n). 
 
Figure 5.16: INSERT control mechanism in HybridPQ 
[Figure: flowchart of HybridPQ_insert. Decision “my_HWPQ full?”: if NO, call 
my_HWPQ_insert; if YES, call FHPQ_insert; then END.] 


Figure 5.17: EXTRACT control mechanism in HybridPQ 
[Figure: flowchart of HybridPQ_extract. Call my_HWPQ_peek and FHPQ_peek, then decide 
“Highest priority entry from my_HWPQ?”: if YES, return the highest priority entry from 
my_HWPQ and call my_HWPQ_delete; if NO, return the highest priority entry from FHPQ 
and call FHPQ_delete; then END.] 

 
 
hybridPQ_reset(Q) 
{ my_HWPQ_reset(Q); 
  FHPQ_create_heap(Q); 
  queueCount = 0; 
} 


hybridPQ_insert(Q, x) 
{ if (queueCount < length-of-my_HWPQ) then 
    my_HWPQ_insert(Q, x); 
    queueCount ++ ; 
  else 
    FHPQ_insert(Q, x); 
  end if; 
} 


hybridPQ_extract(Q) 
{ variable var_my_HWPQ_min ← my_HWPQ_peek(Q); 
  variable var_FHPQ_min ← FHPQ_peek(Q); 

  if (var_my_HWPQ_min < var_FHPQ_min) then 
    my_HWPQ_delete(Q); 
    queueCount -- ; 
    return (var_my_HWPQ_min); 
  else 
    FHPQ_delete(Q); 
    return (var_FHPQ_min); 
  end if; 
} 
Figure 5.18: Functions provided in HybridPQ  
 
 
The HybridPQ INSERT operation gives first priority to hwPQ. This ensures 
that queue entries fill up hwPQ first, making full use of the high-speed but fixed-size 
hwPQ. Furthermore, directing most entries into hwPQ increases the 
likelihood that a HybridPQ EXTRACT operation will extract an entry from hwPQ, 
giving O(1) run-time rather than the O(lg n) incurred when the extracted entry comes 
from FHPQ. The trade-offs of HybridPQ are that one additional counter is required to 
track the fill level of hwPQ, and that the comparisons performed during INSERT and 
EXTRACT add a constant overhead. Pseudo-code describing the control 
mechanism is given in Figure 5.18. 
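
The control mechanism of Figure 5.18 can be turned into a small runnable C sketch. It is an illustration under stated assumptions, not the Appendix G implementation: the hardware queue is replaced by a sorted array of HW_SIZE entries, and the Fibonacci heap is replaced by a fixed-capacity binary min-heap, chosen only to keep the example short (its INSERT is O(lg n) rather than O(1)). Entries carry a priority value only, and all names other than the hybridPQ_* routines are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define HW_SIZE 4              /* ASSUMED size k of the hardware queue stand-in */
#define SW_CAP  64             /* ASSUMED capacity of the software heap stand-in */
#define INF 0xFFFFFFFFu        /* returned by peek when a queue is empty */

static uint32_t hw[HW_SIZE];   /* sorted array standing in for hwPQ */
static uint32_t heap[SW_CAP];  /* binary min-heap standing in for FHPQ */
static int heap_n;
static int queueCount;         /* fill level of the hardware queue */

static void hw_insert(uint32_t x)          /* compare and right-shift */
{
    for (int i = 0; i < HW_SIZE; i++)
        if (x < hw[i]) { uint32_t t = hw[i]; hw[i] = x; x = t; }
}
static uint32_t hw_peek(void) { return hw[0]; }
static void hw_delete(void)                /* left-shift */
{
    for (int i = 0; i + 1 < HW_SIZE; i++) hw[i] = hw[i + 1];
    hw[HW_SIZE - 1] = INF;
}

static void sw_insert(uint32_t x)          /* sift-up */
{
    assert(heap_n < SW_CAP);
    int i = heap_n++;
    heap[i] = x;
    while (i > 0 && heap[(i - 1) / 2] > heap[i]) {
        uint32_t t = heap[i]; heap[i] = heap[(i - 1) / 2]; heap[(i - 1) / 2] = t;
        i = (i - 1) / 2;
    }
}
static uint32_t sw_peek(void) { return heap_n ? heap[0] : INF; }
static void sw_delete(void)                /* sift-down */
{
    if (heap_n == 0) return;
    heap[0] = heap[--heap_n];
    for (int i = 0;;) {
        int l = 2 * i + 1, r = l + 1, m = i;
        if (l < heap_n && heap[l] < heap[m]) m = l;
        if (r < heap_n && heap[r] < heap[m]) m = r;
        if (m == i) break;
        uint32_t t = heap[i]; heap[i] = heap[m]; heap[m] = t;
        i = m;
    }
}

static void hybridPQ_reset(void)
{
    for (int i = 0; i < HW_SIZE; i++) hw[i] = INF;
    heap_n = 0;
    queueCount = 0;
}

static void hybridPQ_insert(uint32_t x)
{
    if (queueCount < HW_SIZE) { hw_insert(x); queueCount++; }  /* hardware first */
    else                      { sw_insert(x); }                /* overflow to software */
}

static uint32_t hybridPQ_extract(void)
{
    uint32_t h = hw_peek(), s = sw_peek();
    if (h < s) { hw_delete(); queueCount--; return h; }
    sw_delete();               /* no effect when both queues are empty */
    return s;
}
```

With HW_SIZE set to 4, inserting six values places the first four in the hardware stand-in and the remaining two in the heap; successive extracts still return the values in priority order because each extract compares the two queue heads.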
 
 
The implementation of HybridPQ avoids the overflow condition in the hardware 
priority queue (hwPQ). Although the software priority queue (FHPQ) might 
overflow due to insufficient memory (RAM), this condition is handled by the operating 
system and is outside the scope of this work. In any case, just as the 
hardware hwPQ can be extended by cascading, software memory overflow can be 
solved by simply expanding the memory capacity. Expanding software 
memory is certainly easier and cheaper than expanding the hardware priority 
queue. 
 
 
Appendix G gives the C-code implementation of HybridPQ. 
CHAPTER 6 
 
 
 
 
DESIGN OF PRIORITY QUEUE ACCELERATOR MODULE 
 
 
 
 
This chapter details the complete design of the proposed Priority Queue 
Accelerator Module. As shown in Figure 6.1, the Priority Queue Accelerator Module 
consists of two main units: the custom-designed Hardware Priority Queue Unit (hwPQ) 
and its corresponding Avalon Interface Unit. The mapping of the pipelined 
insertion-sort priority queue algorithm onto a systolic array architecture, which yields 
a compact yet high-performance Hardware Priority Queue Unit, is discussed. The detailed 
design of the Avalon Interface Unit is also presented. The VHDL source code of the 
modules is given in Appendix F. 
 
 
 
 
Figure 6.1: Top-Level Functional Block Diagram of Priority Queue Accelerator 
Module 
[Figure: the Avalon Interface Unit receives writedata, readdata, address, chipselect and 
CLK from the Avalon System Bus and exchanges Data In, Data Out and Instruction with 
the Hardware Priority Queue Unit (hwPQ).] 
 
 
 
 
 
6.1 Hardware Priority Queue Unit (hwPQ) 
 
 
The Hardware Priority Queue Unit (hwPQ) accommodates up to n priority 
queue entries (Q-entries), and at the same time serves as a processing engine to sort the 
Q-entries and maintain their priority order. It executes the priority queue operations 
INSERT and EXTRACT. In our implementation of hwPQ, the Insertion-Sort 
Priority Queue is mapped onto a systolic array architecture. The goal is to obtain a 
high-speed and compact hardware priority queue unit. Pipelining requires that both 
INSERT and EXTRACT operations in the modified Insertion-Sort Priority Queue 
proceed in the same direction, from the left to the right of the array, as already 
explained in Chapter 4. Referring to Figures 6.2 and 6.3, each INSERT 
operation is a series of compare and right-shift tasks, while each EXTRACT 
operation is a series of left-shift tasks. The tasks are executed sequentially, one after 
another, from the left to the right of Q (i.e. propagating in one direction). This 
operation structure is suitable for pipelining.  
 
 
 
Figure 6.2: compare-and-right-shift tasks in an INSERT operation 
[Figure: a new element 9 is inserted into the ordered queue 12 18 19 55 ∞ ∞ ∞ ∞. 
Compare: the new element has higher priority than 12. Right-Shift: the lower-priority 
element 12 is right-shifted. Compare: element 18 has lower priority than 12. Right-Shift: 
element 18 is right-shifted. Compare: element 19 has lower priority than 18. Right-Shift: 
element 19 is right-shifted, and so on along the queue.] 
 
 
 
Figure 6.3: left-shift tasks in an EXTRACT operation 
 
 
Figure 6.4 shows the functional block diagram of the proposed Hardware 
Priority Queue Unit (hwPQ). It adopts a systolic array architecture. Every processing 
element (PE) in the systolic array is identical and contains a localized control unit, 
storage and a comparator. The PEs are designed to perform the tasks compare and 
right-shift, or left-shift. Each PE contains storage to hold a queue entry, thus 
eliminating the complexity of multiple PEs concurrently accessing one shared memory 
(as with RAM). Therefore, to support n queue entries (worst case), 
hwPQ needs n PEs.  
 
 
[Figure 6.3 content: starting from the queue 9 12 18 19 55 ∞ ∞ ∞, the top-priority 
entry 9 is read (extracted). After an idle step, successive left-shift steps move 12, 
18, 19 and 55 one position to the left in turn, leaving 12 18 19 55 ∞ ∞ ∞ ∞.] 
 
Figure 6.4: Hardware Priority Queue Unit 
[Figure: the Hardware Priority Queue Unit holds the priority queue Q (e.g. 12 18 19 55) 
in a chain of processing elements PE1, PE2, PE3, PE4, ..., PEn driven by control signals. 
A new element is inserted, and the top-priority element is extracted, at the PE1 end. 
Each PEx contains Storage A, Storage B, a comparator (inputs A and B) and a control 
unit, with input data and input control from PEx-1, output data to PEx-1, input data from 
PEx+1, and output data and output control to PEx+1.] 
As shown in Figure 6.4, each PE contains two storages, A and B. The 
existing queue entry is at A. Storage B is temporary storage holding the lower-
priority entry during a right-shift. The deployment of the PEs in an INSERT 
operation is illustrated in Figure 6.5.  
 
 
Figure 6.5: INSERT operation in systolic array based hwPQ 
[Figure: four snapshots of PE1, PE2 and PE3, each with storages A and B (1A/1B, 2A/2B, 
3A/3B). 
(a) PE1: compare phase. New entry is compared with existing entry at 1A. 
(b) PE1: right-shift phase. Higher-priority entry is kept at 1A, lower-priority entry is 
kept at 1B, the temporary storage. 
(c) PE2: compare phase. New entry is compared with existing entry at 2A. 
(d) PE2: right-shift phase. Higher-priority entry is kept at 2A, lower-priority entry is 
kept at 2B, the temporary storage.] 
 
 
Each INSERT (or EXTRACT) consists of a series of compare and 
right-shift (or left-shift) tasks executed at the processing elements. For a queue of 
size n, each operation incurs n tasks. Each task is performed in one task-cycle. 
Consider an INSERT instruction issued to PE1. PE1 completes its task within one 
task-cycle and triggers PE2 for the next task. The second task is likewise completed by 
PE2 within one task-cycle, followed by a trigger issued by PE2 to PE3, and so on. 
Figure 6.6 illustrates how an INSERT or EXTRACT operation propagates through the 
systolic array. 
 
 
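Figure 6.6: Execution of identical tasks for one operation 
[Figure: OPERATION-1 is issued to the Hardware Priority Queue Unit and its task (marked 
“1”) moves from PE1 to PE2 in the next task-cycle, and onward toward PEn after several 
task-cycles.] 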
 
 
Recall that a task in an INSERT operation consists of (a) compare and (b) 
right-shift, and each task in EXTRACT is a left-shift. The compare and right-shift 
at each PE require a minimum of two clock cycles to complete, whereas a left-shift 
needs only one clock cycle. To achieve synchronization among the 
PEs without using handshaking signals, we set the task-cycle of both INSERT and 
EXTRACT to two clock cycles. In EXTRACT, the task therefore consists of an idle phase 
followed by a left-shift, as shown in Figure 6.7. This natural synchronization among PEs 
results in a simple control unit design for each PE, since no handshaking is required. 
 
 
 
Figure 6.7: idle and left-shift tasks in EXTRACT 
[Figure: starting from the queue 9 12 18 19 55 ∞ ∞ ∞, the top-priority entry 9 is read 
(extracted). Each task-cycle consists of an idle phase followed by a left-shift: 12 is 
first shifted into the leftmost position, then 18 into the second position, and so on 
down the queue.] 
6.1.1 The design of Processing Element – RTL Design 
 
 
Figure 6.8 presents the RTL architecture of a PE in hwPQ. A PE is 
partitioned into the data path unit, DU and the control unit, CU. The DU consists of 
two 64-bit registers and one 32-bit unsigned comparator. The reset (rst) and clock 
(CLK) are global, and all other signals are locally connected to its left and right PE. 
Labels of signals connected to the left-PE (previous PE) are suffixed with a “P” (for 
Previous); while the labels of signals connected to the right-PE (next PE) are suffixed 
with “S” (implying Slave). 
 
 
 
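Figure 6.8: RTL Architecture of Processing Element 
[Figure: a Processing Element (PE) consists of a Data Path unit (DU) and a Control Unit 
(CU). The DU contains the registers regHold and regTemp (each 63:0, with rst and load 
inputs ldHold and ldTemp), two 2-to-1 multiplexers controlled by selHold and selTemp, and 
a comparator with inputs A and B (31:0) and output G. Data ports: writedataP, readdataP, 
writedataS, readdataS (63:0). The CU receives rst, CLK, writeP, readP and the comparator 
result, and drives selHold, ldHold, selTemp, ldTemp, writeS and readS.] 
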
With reference to Figure 6.9, each PE is triggered by the previous PE via two 
control signals: (i) writeP, which corresponds to a task invocation in an INSERT 
operation, and (ii) readP, which corresponds to a task invocation in an EXTRACT 
operation. When both control signals are deasserted, the PE is in the NO-OPERATION 
(NOP) state (i.e. it does nothing). When the current PE completes a task-cycle, it 
triggers the slave PE to begin its task-cycle, using the control signals writeS and 
readS. In short, output writeS of PEx is input writeP of PEx+1, and output readS of 
PEx is input readP of PEx+1, as shown in Figure 6.9. 
 
 
 
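Figure 6.9: Communication between PEs 
[Figure: the slave-side ports of PEx-1 (writedataS, readdataS, writeS, readS) connect to 
the previous-side ports of PEx (writedataP, readdataP, writeP, readP), and likewise from 
PEx to PEx+1; the data connections are 64 bits wide. Within the Hardware Priority Queue 
Unit (hwPQ), PE1, PE2, PE3, ..., PEn are chained in this way. The external ports 
writedataP, readdataP, writeP, readP, reset and CLK (control signals) enter at PE1, where 
new elements are inserted and the top-priority element is extracted. At PEn the 
readdataS input is tied to all ‘1’ and the remaining slave-side outputs are not 
connected (NC).] 
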
 
Each task-cycle of INSERT consists of two phases: (i) “Compare” and (ii) 
“Right-Shift”. When writeP is asserted on PEx, PEx compares the element released by 
PEx-1 (at writedataP) with its existing element (in regHold). Then, during 
“Right-Shift”, the higher-priority element is latched in regHold while the 
lower-priority element is latched in regTemp.  
 
 
Each task-cycle of EXTRACT consists of two phases: (i) “Idle” and (ii) 
“Left-Shift”. When readP is asserted on PEx, PEx is idle (does nothing) because the 
previous PE (PEx-1) is reading its element (in regHold) via readdataP. Then, during 
“Left-Shift”, PEx refills the vacated regHold by reading the regHold content of the 
slave PE (PEx+1). The process repeats from PE to PE until the last PE eventually 
latches in an infinite value (0xFFFFFFFF), at which point it terminates. The behavioural 
description of a PE is given in Figure 6.10. The corresponding RTL control sequence 
of a PE is given in Figure 6.11. 
 
 
 
Figure 6.10: Behavioural Description of PE 
[Figure: flowchart starting at the decision “Operation?”. 
If INSERT task-cycle: read the new element (the released element) from the previous PE 
and compare the priority of the current element with the new element; Right-Shift: store 
the element with higher priority and release the element with lower priority to the next 
PE; trigger the next PE for an INSERT task-cycle. 
If EXTRACT task-cycle: Idle: do nothing; Left-Shift: take the element from the next PE; 
trigger the next PE for an EXTRACT task-cycle. 
If NO_OPERATION (NOP) task-cycle: Idle: do nothing; trigger the next PE for a NOP 
task-cycle.] 

 
 
 
Asynchronous reset == ‘1’  -- RESET (global) 
  regTemp ← 0xFFFFFFFF; 
  regHold ← 0xFFFFFFFF; 
  State ← S0; 

WHEN State = S0; 
  IF writeP == ‘1’ THEN  -- INSERT task-cycle 
    IF (writedataP < regHold) THEN  -- if regHold lower priority 
      regTemp ← regHold;       -- release regHold 
      regHold ← writedataP;    -- replace with that new-element 
    ELSE                       -- else new-element lower priority 
      regTemp ← writedataP;    -- release that new-element 
    END IF; 
    State ← S1; 
  ELSIF readP == ‘1’ THEN  -- EXTRACT task-cycle 
    readdataP ← regHold;   -- **always connected 
    regHold ← readdataS;   -- replace regHold with element from next-PE. 
    State ← S2; 
  ELSE                     -- NO OPERATION 
    writeS ← ‘0’;          -- trigger next-PE for NO-OPERATION task-cycle 
    readS ← ‘0’; 
    State ← S0; 
  END IF; 

WHEN State = S1; 
  writeS ← ‘1’;   -- trigger next-PE for INSERT task-cycle 
  State ← S0; 

WHEN State = S2; 
  readS ← ‘1’;   -- trigger next-PE for EXTRACT task-cycle 
  State ← S0; 
Figure 6.11: RTL Control Sequence of PE 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
6.2 Pipelining in hwPQ 
 
 
Recall that each priority queue operation in hwPQ (INSERT or EXTRACT) is divided into a
series of tasks. Each task is performed by an individual PE within one task-cycle. Figure 6.12
shows that once PE1 completes its task-cycle and passes the job to PE2, a new operation can
be fed to PE1. Hence, a series of new operations can be issued to hwPQ in a pipelined manner.
Figure 6.13 illustrates multiple INSERT operations executed in the pipelined hwPQ, and
Figure 6.14 illustrates the pipelined execution of EXTRACT operations. For a systolic array
with n PEs, up to n operations can be issued in n consecutive task-cycles, so the host sees
O(1) run-time complexity for each INSERT or EXTRACT operation.
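The pipelined issue of operations can be illustrated with a short schedule calculation. This is a sketch under a simplifying assumption stated in the code: operation k is issued to PE1 in task-cycle k and advances by one PE per task-cycle. The function name `pipeline_snapshot` is hypothetical and not part of the thesis design.

```python
def pipeline_snapshot(task_cycle, n_pe, n_ops):
    """Return {PE index: operation number} for one task-cycle.

    Assumes operation k (1-based) is issued to PE1 in task-cycle k and
    moves one PE to the right in every following task-cycle.
    """
    snapshot = {}
    for op in range(1, n_ops + 1):
        pe = task_cycle - op + 1  # PE reached by this operation
        if 1 <= pe <= n_pe:
            snapshot[pe] = op
    return snapshot
```

In the fifth task-cycle of a five-operation sequence, PE1 handles operation 5 while operation 1 has already reached PE5. Under the same assumption the host issues one operation per task-cycle regardless of n, which is the O(1) behaviour stated above.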
 
 
 
[Figure: five snapshots of the Hardware Priority Queue Unit (PE1, PE2, PE3, PE4, ..., PEn),
one per task-cycle. OPERATION-1 enters PE1 in the first task-cycle; in each following
task-cycle every operation in flight moves one PE to the right and the next operation
(OPERATION-2 to OPERATION-5) enters PE1.]
Figure 6.12: Series of operations executed in pipeline
 
 
[Figure: Compare and Right-Shift phases for three successive INSERT operations on an empty
queue (all PEs holding ∞). The new elements 55, 18 and 9 are inserted in turn; in each
Right-Shift the higher-priority element stays in the PE and the lower-priority element moves
one PE to the right.]
Figure 6.13: Pipelined execution of multiple INSERT
 
 
 
 
[Figure: Idle and Left-Shift phases for three successive EXTRACT operations on a queue holding
9, 12, 18, 19 and 55 followed by ∞ entries. Each operation extracts the top-priority element
at PE1 while the remaining elements are left-shifted towards PE1.]
Figure 6.14: Pipelined execution of multiple EXTRACT
 
6.2.1 Data Hazards in the Pipeline 
 
 
In INSERT and EXTRACT operations, each task-cycle at a PE takes two clock cycles, while
NO-OPERATION (NOP) takes one clock cycle. Figure 6.15 gives the symbolic representation of a
PE and of hwPQ that is used in the discussion here. As mentioned earlier, INSERT has a
two-phase task-cycle: Compare and Right-Shift. During Compare, the new element is compared
with regHold; during Right-Shift, the higher-priority element is stored in regHold and the
lower-priority element is right-shifted. Similarly, EXTRACT has two phases at each PE: Idle
and Left-Shift. During Idle, regHold is extracted by the previous PE. During Left-Shift, the
PE extracts the regHold of the next PE.
 
 
[Figure: symbolic representation of a PE, drawn as two register storages (regTemp and regHold)
with the links “New element”, “Right-shifted element”, “Extracted element” and “Left-shifted
element”, and of an hwPQ with four PEs (PE1 to PE4) whose eight registers hold ∞1 to ∞8. A
global ‘reset’ initiates all register storages to contain the infinite value. This takes one
clock cycle.]
Figure 6.15: Symbolic representation of PEs in hwPQ
 
(a) Operation-1: INSERT takes two clock cycles.
Explanation:
At clock (1), PE1 compares the new element (55) with regHold. The other PEs perform no
operation.
At clock (2), PE1 latches in the higher-priority element and right-shifts the lower-priority
element. The other PEs perform no operation.

(b) Operation-2: INSERT takes two clock cycles.
Explanation:
At clock (3), PE1 and PE2 compare their respective new elements with their regHold (the new
element at PE1 is 18). The other PEs perform no operation.
At clock (4), PE1 and PE2 latch in the higher-priority element and right-shift the
lower-priority element. The other PEs perform no operation.

(c) Operation-3: EXTRACT takes two clock cycles.
Explanation:
At clock (5), the Idle stage of EXTRACT starts at PE1 and regHold of PE1 is extracted. PE2
and PE3 are in their Compare stage.
At clock (6), PE1 extracts the regHold content of PE2. At this time the regHold content of
PE2 is ambiguous (race condition); it can be “∞3” or “55”.

[Register snapshots of PE1 to PE4 for clocks (1) to (6) omitted.]
Figure 6.16: Example of INSERT followed by EXTRACT
 
 
Figure 6.16 illustrates the error condition when EXTRACT is invoked immediately after an
INSERT operation. When a sequence of INSERT operations is executed by hwPQ, each INSERT incurs
a Compare and a Right-Shift phase on every PE it reaches. As shown in Figure 6.16(a) and
6.16(b), in the interval between the Compare phase and the Right-Shift phase, the data in
regHold of a PE is not valid. If the previous PE tries to extract the content of regHold
during this interval, it may obtain invalid data. Figure 6.17 shows that if one idle cycle is
slotted between INSERT and EXTRACT, the EXTRACT obtains the correct data. The idle cycle is
produced by invoking NO-OPERATION (NOP) on hwPQ. There are several ways to slot an idle
cycle between INSERT and EXTRACT; see Figure 6.18.
 
(c2) Operation-3: NO-OPERATION takes one clock cycle.
Explanation:
At clock (5), PE2 and PE3 compare their new elements with regHold. The other PEs perform no
operation.

(d2) Operation-4: EXTRACT takes two clock cycles.
Explanation:
At clock (6), the Idle stage of EXTRACT starts at PE1 and regHold of PE1 is extracted. PE2
and PE3 latch in the higher-priority element and right-shift the lower-priority element.
At clock (7), PE1 extracts the regHold content of PE2; at this time PE2 is idle. PE3 and PE4
are in their Compare stage.

[Register snapshots of PE1 to PE4 for clocks (5) to (7) omitted.]
Figure 6.17: Example of INSERT → NOP → EXTRACT operations
 
 
 
1. INS → idle → EXT
2. INS → INS → idle → idle → EXT → EXT
3. INS → idle → EXT → INS → idle → EXT
4. INS → INS → idle → EXT → idle → EXT
5. INS → INS → INS → idle → idle → idle → EXT → EXT → EXT
6. INS → idle → EXT → INS → idle → EXT → INS → idle → EXT
7. INS → INS → INS → idle → EXT → idle → EXT → idle → EXT
:
:
and more combinations

Note:
INS: INSERT
EXT: EXTRACT
Figure 6.18: Several ways to insert idle state
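A host-side scheduler for the idle state can be sketched as follows. This sketch implements only the simplest reading of the rule, one idle (NOP) slot at every INSERT-to-EXTRACT boundary, which reproduces patterns 1, 3 and 6 of Figure 6.18; the other patterns in the figure place their idle slots differently and are not modelled. The function name `add_idle_cycles` and the string tokens are assumptions made for illustration.

```python
def add_idle_cycles(ops):
    """Insert an idle slot wherever an EXTRACT directly follows an INSERT.

    ops is a list of "INS" / "EXT" tokens; the result uses "idle" for the
    NO-OPERATION slot, following the notation of Figure 6.18.
    """
    scheduled = []
    for op in ops:
        if op == "EXT" and scheduled and scheduled[-1] == "INS":
            scheduled.append("idle")
        scheduled.append(op)
    return scheduled
```

For instance, `add_idle_cycles(["INS", "EXT", "INS", "EXT"])` yields pattern 3 of the figure.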
 
 
Alternatively, the control unit at each PE could be re-designed to extend each task-cycle to
three clock cycles. However, this approach lengthens every operation by 50% (three clock
cycles instead of two) and was therefore not adopted. In the next section, the timing
specifications of hwPQ are established; these include the assertion of an idle state between
INSERT and EXTRACT, which is applied in the design of the Avalon Interface Unit.
 
 
 
 
6.3 Timing Specifications of hwPQ 
 
 
In this section, the timing characteristics of hwPQ are summarized. The timing
characteristics are important as they affect the design of the Avalon Interface Unit.
 
 
The pipelined hwPQ is designed to complete each operation (INSERT or EXTRACT) within two
clock cycles. For an INSERT operation, one clock cycle of active-high on writeP starts the
operation on hwPQ, and the operation completes at the next clock cycle; the same applies to
readP for an EXTRACT operation. If a host of hwPQ fails to de-assert the control signal
(writeP or readP) before the rising edge of the third clock cycle, the hwPQ will execute an
extra operation. As a result, either invalid entries are inserted into hwPQ, or entries are
accidentally extracted and lost from hwPQ. Such a one-clock-cycle active-high pulse on a
control signal cannot be achieved with mere software control; it must be generated in
hardware, e.g. by a hardware step function generator.
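The effect of an over-long control pulse can be estimated with a one-line calculation. The sketch below assumes, following the RTL control sequence of Figure 6.11, that a PE samples writeP or readP only in state S0 and returns to S0 every second clock cycle, and that the pulse begins while PE1 is in S0. The function name `operations_triggered` is hypothetical.

```python
def operations_triggered(pulse_width_cycles):
    """Number of operations started by a control pulse of the given width.

    Assumes the control signal is sampled on every second clock cycle,
    starting with the first cycle of the pulse.
    """
    if pulse_width_cycles < 0:
        raise ValueError("pulse width must be non-negative")
    return (pulse_width_cycles + 1) // 2
```

A pulse of one or two clock cycles starts a single operation, while a pulse that is still high in the third clock cycle starts a second, unintended one.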
In addition, as illustrated in the previous section, a host of hwPQ must assert an idle state
between an INSERT and an EXTRACT operation. The assertion of such an idle state can be done
in several ways. Hence a standard protocol for using hwPQ is established and presented here,
followed by detailed design considerations for the Avalon Interface Unit to meet the protocol.
 
 
Figure 6.19 shows the top-level functional block diagram of hwPQ. The I/O port
specifications are given in Table 6.1. A new element is sent to hwPQ via the writedataP data
port. The highest-priority element is read from hwPQ through the readdataP data port. Both
writedataP and readdataP are 64 bits wide; the upper 32 bits store the element’s identifier
and the lower 32 bits store the priority-value.
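The 64-bit word format can be illustrated with a pair of packing helpers. This is a sketch only: the function names are hypothetical, and the field order (identifier in bits 63:32, priority-value in bits 31:0) follows the sentence above and is treated as an assumption here; the shifts would need to be swapped if the actual port layout differs.

```python
MASK32 = 0xFFFFFFFF


def pack_element(identifier, priority):
    """Pack two 32-bit fields into one 64-bit word (identifier in bits 63:32)."""
    if not (0 <= identifier <= MASK32 and 0 <= priority <= MASK32):
        raise ValueError("fields must fit in 32 bits")
    return (identifier << 32) | priority


def unpack_element(word):
    """Split a 64-bit word into (identifier, priority)."""
    return (word >> 32) & MASK32, word & MASK32
```

With this layout, identifier 0xAAAAAAAA and priority-value 0x38 form the word 0xAAAAAAAA00000038.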
 
 
During an INSERT operation, an active-high pulse on writeP starts the operation on hwPQ, and
the hwPQ takes the content of writedataP as the new element. During an EXTRACT operation, the
highest-priority element is first read at output port readdataP, followed by an active-high
pulse on readP to complete the operation. Both INSERT and EXTRACT operations on the hwPQ core
follow a strict set of timing specifications, illustrated in Figure 6.20(a) and Figure 6.20(b)
respectively.
 
 
The three control signals, reset, readP and writeP, are mutually exclusive. The reset is
asynchronous; a reset stroke initializes the hwPQ within one clock cycle. If none of the
three control signals is active, the hwPQ is in the NO-OPERATION (idle) condition.
 
 
 
Figure 6.19: Hardware Priority Queue Unit (hwPQ) 
 
 
Table 6.1: I/O Port Specifications of hwPQ
Signal      Direction            Category        Width  Description
            (relative to SAPQ)
writedataP  input                Data port       64     Input data port for identifier and priority-value.
readdataP   output               Data port       64     Output data port for identifier and priority-value.
writeP      input                Control signal  1      If writeP is asserted, operation INSERT is triggered in hwPQ.
readP       input                Control signal  1      If readP is asserted, operation EXTRACT-MIN is triggered in hwPQ.
reset       input                Control signal  1      Asynchronous reset
CLK         input                Clock           1      Synchronous clock
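[Figure 6.19 diagram: the Hardware Priority Queue Unit (hwPQ) drawn as the chain PE1, PE2,
PE3, ..., PEn with ports writedataP, readdataP, writeP, readP, reset and CLK; unused ports at
the end of the chain are marked NC or tied to all ‘1’. Port description for writedataP and
readdataP: one 64-bit word divided into bits 63:32 and bits 31:0, labelled priority-value and
identifier.]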
 
 
(a) INSERT operation

[Timing diagram over cycle-1 to cycle-3 for CLK, reset, writedataP (DATA VALID), writeP,
readP and readdataP (XXXXXXXX), annotated with intervals T1 to T3 and remarks R1 to R3.]

Timing Specification:
T1: MINimum ONE clock cycle, input data writedataP must be valid BEFORE writeP is asserted.
T2: MINimum ONE clock cycle, input data writedataP must remain valid AFTER writeP is asserted.
T3: MAXimum TWO clock cycles, within which writeP must be de-asserted.

Remark:
R1: input control signal, writeP is asserted after T1.
R2: input control signal, writeP must be de-asserted.
R3: recommended, writeP is de-asserted ONE clock cycle after it has been asserted.

Note:
hwPQ needs TWO clock cycles to complete the INSERT operation (R1 → R3).
 
 
(b) EXTRACT operation

[Timing diagram over cycle-1 to cycle-3 for CLK, reset, writedataP (XXXXXXXX), writeP, readP
and readdataP (DATA VALID), annotated with intervals T4 to T6 and remarks R4 to R6.]

Timing Specification:
T4: readdataP is always valid.
T5: less than one clock cycle after readP is asserted, the content of readdataP is destroyed.
T6: MAXimum TWO clock cycles, within which readP must be de-asserted.

Remark:
R4: input control signal, readP is asserted after readdataP has been read.
R5: input control signal, readP must be de-asserted.
R6: recommended, readP is de-asserted ONE clock cycle after it has been asserted.

Note:
hwPQ needs TWO clock cycles to complete the EXTRACT operation (R4 → R6).
Figure 6.20: Timing Specification of hwPQ
 
 
 
6.4 Avalon Interface Unit - Design Requirement 
 
 
The Avalon Interface Module, in conjunction with the device driver (in Appendix G), is
designed to drive hwPQ. The hwPQ device driver, executed on the Nios II embedded processor,
consists of C routines that control the interaction between the Avalon Interface Module and
hwPQ. The Avalon Interface Module bridges data between the bus and the hwPQ ports, and
generates the required stroke of exactly ONE clock cycle for each INSERT and EXTRACT
operation. The communication protocol for each operation is summarized in Figures 6.21, 6.22
and 6.23.
 
To invoke one RESET operation in hwPQ:
1. Nios II puts the RESET operation mode (indicated as “*opmode = reset” in the Device
Driver) to the Avalon Interface Unit.
o The Avalon Interface Unit generates ONLY ONE stroke of the “reset” input-control to
hwPQ.
2. Nios II puts the DO_NOTHING operation mode (indicated as “*opmode = do_nothing” in the
Device Driver) to the Avalon Interface Unit.
o The Avalon Interface Unit now knows that the previous operation (in this case, RESET) is
done. The Avalon Interface Unit is now ready for a new operation.

Figure 6.21: Communication protocol for RESET operation
 
 
 
To invoke one INSERT operation in hwPQ:
1. Nios II puts the ‘associate-pointer’ to the Avalon Interface Unit (from where it is
received directly by hwPQ).
2. Nios II puts the ‘priority-value’ to the Avalon Interface Unit (from where it is received
directly by hwPQ).
3. Nios II puts the INSERT operation mode (indicated as “*opmode = writeP” in the Device
Driver) to the Avalon Interface Unit.
o The Avalon Interface Unit generates ONLY ONE stroke of the “writeP” input-control
to hwPQ.
o The hwPQ then completes ONLY ONE insert operation.
4. Nios II puts the DO_NOTHING operation mode (indicated as “*opmode = do_nothing” in the
Device Driver) to the Avalon Interface Unit.
o The Avalon Interface Unit now knows that the previous operation (in this case, INSERT) is
done. The Avalon Interface Unit is now ready for a new operation.
Figure 6.22: Communication protocol for INSERT operation
To invoke one EXTRACT operation in hwPQ:
1. Nios II gets the ‘associate-pointer’ from the Avalon Interface Unit (where the data comes
directly from the hwPQ).
2. Nios II gets the ‘priority-value’ from the Avalon Interface Unit (where the data comes
directly from the hwPQ).
3. Nios II puts the EXTRACT-MIN operation mode (indicated as “*opmode = readP” in the
Device Driver) to the Avalon Interface Unit.
o The Avalon Interface Unit generates ONLY ONE stroke of the “readP” input-control to hwPQ.
o The hwPQ then completes ONLY ONE extract-min operation.
4. Nios II puts the DO_NOTHING operation mode (indicated as “*opmode = do_nothing” in the
Device Driver) to the Avalon Interface Unit.
o The Avalon Interface Unit now knows that the previous operation (in this case,
EXTRACT-MIN) is done. The Avalon Interface Unit is now ready for a new operation.
Figure 6.23: Communication protocol for EXTRACT operation
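The three protocols can be exercised against a software stand-in. The sketch below is not the device driver of Appendix G: `MockInterface` is a hypothetical mock of the Avalon Interface Unit together with hwPQ, backed by Python's `heapq`, and the helper names `pq_reset`, `pq_insert` and `pq_extract` are assumptions. The register names and the opmode bit positions are taken from the interface description in this chapter.

```python
import heapq

# opmode bits: bit 0 = extract, bit 1 = insert, bit 2 = reset
DO_NOTHING, OP_EXTRACT, OP_INSERT, OP_RESET = 0, 1, 2, 4
EMPTY = 0xFFFFFFFF  # value read from an empty queue


class MockInterface:
    """Stand-in for the Avalon Interface Unit plus hwPQ (illustration only)."""

    def __init__(self):
        self.heap = []
        self.new_identifier = 0
        self.new_priority = 0
        self.opmode = DO_NOTHING

    def write(self, reg, value):
        if reg == "REG_NEW_IDENTIFIER":
            self.new_identifier = value
        elif reg == "REG_NEW_PRIORITY":
            self.new_priority = value
        elif reg == "REG_OPMODE":
            # exactly one stroke per change from DO_NOTHING to an operation
            if self.opmode == DO_NOTHING and value != DO_NOTHING:
                self._stroke(value)
            self.opmode = value
        else:
            raise KeyError(reg)

    def read(self, reg):
        priority, identifier = self.heap[0] if self.heap else (EMPTY, EMPTY)
        if reg == "REG_TOP_PRIORITY":
            return priority
        if reg == "REG_TOP_IDENTIFIER":
            return identifier
        raise KeyError(reg)

    def _stroke(self, op):
        if op == OP_RESET:
            self.heap = []
        elif op == OP_INSERT:
            heapq.heappush(self.heap, (self.new_priority, self.new_identifier))
        elif op == OP_EXTRACT and self.heap:
            heapq.heappop(self.heap)


def pq_reset(dev):
    dev.write("REG_OPMODE", OP_RESET)
    dev.write("REG_OPMODE", DO_NOTHING)


def pq_insert(dev, identifier, priority):
    dev.write("REG_NEW_IDENTIFIER", identifier)
    dev.write("REG_NEW_PRIORITY", priority)
    dev.write("REG_OPMODE", OP_INSERT)
    dev.write("REG_OPMODE", DO_NOTHING)


def pq_extract(dev):
    identifier = dev.read("REG_TOP_IDENTIFIER")
    priority = dev.read("REG_TOP_PRIORITY")
    dev.write("REG_OPMODE", OP_EXTRACT)   # destroys the element just read
    dev.write("REG_OPMODE", DO_NOTHING)
    return identifier, priority
```

Each helper ends by writing DO_NOTHING, mirroring the last step of every protocol above, so that the next operation produces a fresh stroke.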
 
 
 
 
6.5 Avalon Interface Unit – RTL Design 
 
 
The Avalon Interface Unit is basically a register-based interface designed in accordance
with the Avalon Slave Transfer Protocol. It contains two major blocks: the Avalon Control
Unit (avalonCU) and the Avalon Data Unit (avalonDU). The top-level functional block diagram
of the Avalon Interface Module is given in Figure 6.24 below. The reference parameters, which
include REG_TOP_PRIORITY, REG_TOP_IDENTIFIER, REG_NEW_PRIORITY, REG_NEW_IDENTIFIER and
REG_OPMODE, are used by the corresponding device driver to communicate with this interface.
For example, instructions to hwPQ are sent by a software write to REG_OPMODE, and the
priority value of the top-priority element is obtained by a software read from
REG_TOP_PRIORITY.
 
 
The Avalon Control Unit is responsible for generating the ONE-CLOCK-CYCLE control signal
required to trigger the correct operation on hwPQ. The Avalon Data Unit registers all data
passing between the Avalon System Bus and hwPQ. It interfaces the 64-bit data port of hwPQ to
the 32-bit data port of the Avalon System Bus. When the host processor wants to trigger an
operation on hwPQ, the operation mode (opmode) is held stable in the Avalon Data Unit. The
operation mode is then interpreted by the Avalon Control Unit, and the corresponding control
signals are generated to trigger the operation on hwPQ.
 
 
 
Figure 6.24: Functional Block Diagram of Avalon Interface Unit 
 
 
 
 
6.5.1 Avalon Data Unit (avalonDU) 
 
 
The Avalon Data Unit (avalonDU) consists of a multiplexer, a demultiplexer and several
registers that hold data stable. Figure 6.25 gives the functional block diagram of the Avalon
Data Unit. Data from the system bus (Avalon Bus) is latched into the corresponding register
based on the address supplied through the address bus. For an INSERT operation, the new
element is transferred in two separate bus cycles, first the priority value, then the
identifier, followed by the instruction to avalonCU to generate an insert control signal to
hwPQ. For an EXTRACT operation, the priority value and the identifier of the top-priority
element are latched into the readdata register and then transferred to the host through the
Avalon Bus; the top-priority element is then destroyed by writing an instruction to avalonCU
to generate an extract signal to hwPQ. Figure 6.26 gives the behavioural description of its
operation.
[Figure 6.24 diagram: the Avalon Interface Unit contains the Avalon Control Unit (avalonCU)
and the Avalon Data Unit (avalonDU). Bus side: readdata[31:0], writedata[31:0], address[4:2],
chipselect and CLK. hwPQ side: readdataP[63:0], writedataP[63:0], readP, writeP, reset and
CLK. Registers shown: REG_TOP_PRIORITY, REG_TOP_IDENTIFIER, REG_INSERT_PRIORITY,
REG_INSERT_IDENTIFIE and REG_OPMODE.]
[Figure: the Avalon Data Unit (avalonDU) with inputs writedata[31:0], address[4:2],
chipselect, CLK, readdataP[31:0] and readdataP[63:32]; a MUX selecting the Insert_Priority,
Insert_Identifier and Opmode registers (addresses 010, 011 and 100); a DEMUX driving
readdata[31:0]; outputs writedataP[31:0] and writedataP[63:32]; and bits 2:0 of Opmode
driving op_extract, op_insert and op_reset.]
Figure 6.25: Functional Block Diagram of Avalon Data Unit

If (Avalon Bus) address[4:2] ==
“000”: (Avalon Bus) readdata ← (hwPQ) readdataP[31:0]
“001”: (Avalon Bus) readdata ← (hwPQ) readdataP[63:32]
“010”: (Avalon Bus) writedata → (hwPQ) writedataP[63:32]
“011”: (Avalon Bus) writedata → (hwPQ) writedataP[31:0]
“100”: (Avalon Bus) writedata[0] → (avalonCU) op_extract
       (Avalon Bus) writedata[1] → (avalonCU) op_insert
       (Avalon Bus) writedata[2] → (avalonCU) op_reset
Figure 6.26: Behavioural Description of Avalon Data Unit
6.5.2 Avalon Control Unit 
 
 
Figure 6.27 shows the functional block diagram of the Avalon Control Unit (avalonCU) while
Figure 6.28 gives the behavioural description. The avalonCU generates a ONE-CLOCK-CYCLE
stroke to trigger the corresponding operation on hwPQ. The inputs op_reset, op_insert and
op_extract are active high. If all these input signals are low, no operation is triggered on
hwPQ; the hwPQ is in idle condition. Figure 6.29 gives the flow chart which illustrates the
control mechanism. Figure 6.30 shows the state diagram of the Avalon Control Unit.
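The core idea of the control unit, turning a level held by software into a single one-clock-cycle stroke, can be sketched for one op signal as follows. This is a simplified model and not the six-state machine of Figure 6.30: a single flag replaces the states, only one input is handled, and the exact cycle in which the real avalonCU emits the stroke may differ. The function name `stroke_generator` is hypothetical.

```python
def stroke_generator(op_levels):
    """Convert a level signal (one sample per clock) into one-cycle strokes.

    A stroke is emitted in the first cycle in which the level is seen high;
    no further stroke is emitted until the level has returned low.
    """
    strokes = []
    armed = True  # True while waiting for the next request
    for level in op_levels:
        if armed and level:
            strokes.append(1)
            armed = False
        else:
            strokes.append(0)
            if not level:
                armed = True  # request released, ready for the next one
    return strokes
```

An opmode level held high for three clock cycles therefore produces one stroke, and the DO_NOTHING write that follows re-arms the generator.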
 
 
 
[Figure: the Avalon Control Unit (avalonCU) with inputs op_extract, op_insert, op_reset,
chipselect and CLK, and outputs hwPQ_readP, hwPQ_writeP and hwPQ_reset.]
Figure 6.27: Functional Block Diagram of Avalon Control Unit

if opmode ==
“op reset”: Generate stroke for (hwPQ) reset.
“op insert”: Generate stroke for (hwPQ) writeP.
“op extract”: Generate stroke for (hwPQ) readP.
Figure 6.28: Behavioural Description of Avalon Control Unit
 
 
[Flowchart: initially, or on asynchronous reset, hwPQ_reset = ‘0’, hwPQ_writeP = ‘0’ and
hwPQ_readP = ‘0’. The unit then tests op_reset, op_insert and op_extract. If op_reset = “1”,
hwPQ_reset is stroked high. If op_insert = “1”, hwPQ_writeP is stroked high and the unit
waits until op_insert = “0”. If op_extract = “1”, hwPQ_readP is stroked high and the unit
waits until op_extract = “0”.]
Figure 6.29: Control Flowchart of Avalon Control Unit

[State diagram with states S0 to S5. Conditions and outputs shown: if (op_reset == ‘1’),
hwPQ_reset ← ‘1’; if (op_insert == ‘1’), hwPQ_writeP ← ‘1’; if (op_extract == ‘1’),
hwPQ_readP ← ‘1’; if (op_insert == ‘0’) or (op_extract == ‘0’), all three outputs ← ‘0’.
In all remaining states hwPQ_reset, hwPQ_writeP and hwPQ_readP are ‘0’.]
Figure 6.30: State Diagram of Avalon Control Unit
CHAPTER 7 
 
 
 
 
SIMULATION, HARDWARE TEST AND PERFORMANCE EVALUATION 
 
 
 
 
This chapter describes the simulation and hardware tests that are performed on each
sub-module, each module and the system for design verification and system validation. The
performance of the designed priority queue accelerator module is evaluated and compared with
other implementations. This chapter also illustrates the top-level architecture of the
nanometer VLSI routing module developed to be executable on the GPU. A detailed analysis of
the performance of the graph algorithm in the presence of the priority queue accelerator
module is presented.
 
 
 
 
7.1 Design Verification through Timing Simulation 
 
 
All the sub-modules and combined modules are put through simulation tests. Through timing
simulation, we verify the functionality of the design and check whether it meets the timing
constraints. We apply a bottom-up simulation approach, i.e. we begin with the lowest-level
modules of the design hierarchy and work up to the top-level Priority Queue Accelerator
Module. The simulation results of the PE and hwPQ are provided in Appendix H. Here, we
provide the simulation results of the Priority Queue Accelerator Module.
 
 
 
 
 
 
 
7.1.1 Simulation of Priority Queue Accelerator Module 
 
 
The Priority Queue Accelerator Module integrates the Hardware Priority Queue Unit (hwPQ)
and the Avalon Interface Unit. The hwPQ uses n PEs for a worst-case priority queue size of n.
As the hwPQ is parameterizable, it is assumed that if a small design is functionally correct,
then a larger design will also be functionally correct.
 
 
Therefore, for waveform simulation purposes, the Priority Queue Accelerator Module with
hwPQ-4 (i.e. hwPQ with 4 PEs) is implemented. In order to demonstrate the functionality of
the priority queue operations, a series of INSERT and EXTRACT operations is invoked on hwPQ.
The set of test vectors used in simulation is given in Table 7.1. All possible sequences of
two operations are covered: INSERT-then-INSERT, EXTRACT-then-EXTRACT, INSERT-then-EXTRACT
and EXTRACT-then-INSERT. Figure 7.1 shows the simulation results.
 
 
Table 7.1: Set of Test Vectors 
No.  Operation  Identifier  Priority Value
1    INSERT     AAAAAAAA    00000038
2    INSERT     BBBBBBBB    00000053
3    INSERT     CCCCCCCC    00000018
4    INSERT     DDDDDDDD    00000009
5    EXTRACT    DDDDDDDD    00000009
6    EXTRACT    CCCCCCCC    00000018
7    EXTRACT    AAAAAAAA    00000038
8    INSERT     EEEEEEEE    00006522
9    EXTRACT    BBBBBBBB    00000053
10   INSERT     FFFFFFFF    00005866
11   EXTRACT    FFFFFFFF    00005866
12   EXTRACT    EEEEEEEE    00006522
13   EXTRACT    FFFFFFFF    FFFFFFFF
 
 
[Figure: simulation waveform annotated with the RESET operation on hwPQ followed by the
INSERT and EXTRACT operations of Table 7.1 (identifiers AAAAAAAA to FFFFFFFF with
priority-values 38, 53, 18, 9, 6522 and 5866). The annotations mark ONE cycle of latency at
avalonInterfaceUnit to complete each INSERT or EXTRACT, and ONE cycle of NO-OPERATION prior
to EXTRACT at hwPQ. The final EXTRACT returns identifier FFFFFFFF and priority-value
FFFFFFFF, annotated “hwPQ is empty!”.]
Figure 7.1: Simulation of Priority Queue Accelerator Module
 
 
 
 
7.2 Hardware Test 
 
 
After all the sub-modules and the top-level modules have been tested through timing
simulation and shown to be correct, the Graph Processing Unit is developed on an FPGA
development board by integrating the Nios II processor with the Priority Queue Accelerator
Module. This time, the test vectors are written in C code and executed on the GPU. Figure 7.2
below shows the console output from the hardware test.
 
 
Figure 7.2: Hardware Test Result 
 
 
The resource utilization and timing performance of the hwPQ depend on the number of logic
elements (LEs) available on the FPGA device, the architectural design of the LEs, and the
fabrication process technology. In this thesis, hwPQ is evaluated on five selected Altera
FPGA devices. The results are obtained from the reports generated by the Quartus II software
after synthesis and compilation. Independent of the FPGA device, the compilation shows that
each processing element (PE) in the Priority Queue Computation Unit consumes fewer than 170
LEs. The results are given in Table 7.2. The data for Stratix III is an estimate, assuming
170 LEs per PE (in hwPQ).
 
 
 
 
Table 7.2: Resource Utilization and Performance of hwPQ
Device                  Equivalent Logic  Process     Performance (million     Max. number of
                        Elements (LEs)    Technology  operations per second)   PEs (in hwPQ)
Cyclone II (EP2C35)     33,216            90 nm       87.5 M op/s @ 175 MHz    195
Stratix (EP1S40)        41,250            0.13 um     120 M op/s @ 240 MHz     240
Stratix II (EP2S60ES)   60,440            90 nm       125 M op/s @ 250 MHz     350
Stratix II (EP2S180)    179,400           90 nm       125 M op/s @ 250 MHz     1,055
Stratix III (EP3SL340)  338,000           65 nm       -                        2,000
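The performance column of Table 7.2 follows from the two-clock-cycle operation time, and the PE counts can be approximated from the 170 LEs-per-PE figure. The sketch below shows both calculations; the function names are hypothetical, and the PE estimate ignores any logic outside the PEs, so it is only a rough bound and not the method used to produce the table.

```python
CLOCKS_PER_OPERATION = 2  # INSERT or EXTRACT, per the hwPQ timing specification


def throughput_mops(f_max_mhz):
    """Peak throughput in million operations per second at the given clock."""
    return f_max_mhz / CLOCKS_PER_OPERATION


def estimated_pe_count(logic_elements, les_per_pe=170):
    """Rough number of PEs that fit, ignoring all logic outside the PEs."""
    return logic_elements // les_per_pe
```

At 175 MHz this gives 87.5 M op/s, matching the Cyclone II row. The PE estimate reproduces the Cyclone II (195) and Stratix II EP2S180 (1,055) entries exactly; the remaining rows of the table are close to, but not identical with, this simple estimate.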
 
 
It can be seen that a larger hwPQ can be obtained simply by using a higher-density FPGA
device. As the feature size of the process technology shrinks and die sizes increase, larger
and larger hwPQs can be obtained.
 
 
 
 
7.3 Comparison with priority queue software implementation 
 
 
The performance of the Priority Queue Accelerator Module is compared with several other
priority queue implementations. Many software priority queue implementations have their
performance stated in terms of run-time complexity. This, however, does not reflect the real
situation, as a large number of processor cycles can be hidden behind an apparently small
run-time complexity. Here, we use the GPU as our comparison platform.
 
 
Several software priority queues are implemented, including the Insertion-Sort Priority
Queue (ISPQ), the Binary-Heap Priority Queue (BHPQ) and the Fibonacci-Heap Priority Queue
(FHPQ). As the performance of a priority queue depends on the queue size n, we fix n = 200
for comparison purposes. Then, for each priority queue, the Nios II processor inserts 200
entries into the queue and then extracts all the inserted entries. The worst-case INSERT
occurs when the queue is full. The worst-case EXTRACT occurs when extracting the minimum
entry from a full queue. The worst-case run-time in terms of elapsed clock cycles is
recorded. In this way, a fair comparison can be made.
 
 
Table 7.3: Comparison in Run-Time Complexity
WORST CASE (number of clock cycles) (Queue size limited to n = 200)
Priority Queue                         RESET  INSERT   EXTRACT
Priority Queue Accelerator Module      O(1)   O(1)     O(1)
Insertion-Sort Priority Queue, ISPQ    O(n)   O(n)     O(1)
Binary-Heap Priority Queue, BHPQ       O(n)   O(lg n)  O(lg n)
Fibonacci-Heap Priority Queue, FHPQ    O(1)   O(1)     O(lg n)∗
 
 
Table 7.4: Comparison in Number of Processor Cycles
WORST CASE (number of clock cycles) (Queue size limited to n = 200)
Priority Queue                         RESET  INSERT  EXTRACT
Priority Queue Accelerator Module      27     110     131
Insertion-Sort Priority Queue, ISPQ    23525  63707   211
Binary-Heap Priority Queue, BHPQ       58172  4825    5925
Fibonacci-Heap Priority Queue, FHPQ    78     3492    340869
 
 
Table 7.5: Speed Up Gain by Priority Queue Accelerator Module
SPEED UP GAIN (compared to Priority Queue Accelerator Module) (Queue size limited to n = 200)
Priority Queue                         RESET  INSERT  EXTRACT
Priority Queue Accelerator Module      1      1       1
Insertion-Sort Priority Queue, ISPQ    871    579     2
Binary-Heap Priority Queue, BHPQ       2,155  44      45
Fibonacci-Heap Priority Queue, FHPQ    3      32      2602
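The speed-up gains in Table 7.5 are the ratios of the cycle counts in Table 7.4, rounded to the nearest integer. The short sketch below recomputes them; the dictionary keys and the function name `speed_up` are labels chosen here for illustration.

```python
# Worst-case processor cycles from Table 7.4: (RESET, INSERT, EXTRACT)
CYCLES = {
    "Accelerator": (27, 110, 131),
    "ISPQ": (23525, 63707, 211),
    "BHPQ": (58172, 4825, 5925),
    "FHPQ": (78, 3492, 340869),
}


def speed_up(baseline="Accelerator"):
    """Return {queue: (RESET, INSERT, EXTRACT) speed-up}, rounded to integers."""
    base = CYCLES[baseline]
    return {
        name: tuple(round(c / b) for c, b in zip(cycles, base))
        for name, cycles in CYCLES.items()
    }
```

For example, the BHPQ INSERT gain is 4825 / 110 ≈ 43.9, which rounds to the 44 reported in the table.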
 
 
Table 7.3 compares the priority queues in terms of run-time complexity, while Table 7.4
gives the worst-case number of elapsed clock cycles for each operation. The Priority Queue
Accelerator Module is actually running at lower than optimal speed (70 MHz, compared to the
maximum allowed 240 MHz), with narrower bus bandwidth (32-bit, compared to the allowed
64-bit) and with high redundancy in cycles per operation incurred by the host processor (27
cycles for RESET, 110 cycles for INSERT and 131 cycles for EXTRACT, compared to the actual
design: 1 cycle for RESET and 2 cycles each for INSERT and EXTRACT). Even so, the gain
achieved in terms of the number of elapsed processor cycles is still very significant.

∗ For FHPQ, the O(lg n) for EXT-MIN is worst-case amortized time; refer to Cormen et al.
(2001) for amortized analysis.
 
 
ISPQ and BHPQ are implemented as fixed-memory arrays, while FHPQ is implemented as a memory
heap created on-the-fly. From Table 7.4 and Table 7.5, it can be seen that ISPQ and BHPQ need
a large number of processor cycles to initialize the queue. For FHPQ, the cost of
initialization is constant and low.
 
 
Compared to BHPQ, the speed-up gain achieved by the Priority Queue Accelerator Module is
more than expected. Theoretically, one could expect a speed-up of lg n (in this case n = 200,
so log2 200 ≈ 7.6) from the Priority Queue Accelerator Module. This test shows that in the
real implementation, the Priority Queue Accelerator Module is 44 times faster than BHPQ in
INSERT and 45 times faster in EXTRACT. This is because a software priority queue suffers
severe memory communication overhead, where a large number of cycles are spent accessing the
queue data structure stored in memory. In the Priority Queue Accelerator Module, by contrast,
all queue entries are stored in registers, so the memory communication overhead is avoided.
 
 
Next, compared to ISPQ, the Priority Queue Accelerator Module obtains a 579 times speed-up
for worst-case INSERT and a 2 times speed-up for EXTRACT. Notice that the 579 times speed-up
is also greater than the theoretical complexity suggests. This is again due to the Priority
Queue Accelerator Module eliminating the memory communication overhead.
 
 
Lastly, a comparison is made between the theoretically most efficient priority
queue, FHPQ, and our Priority Queue Accelerator Module. The result is very
impressive. Although both FHPQ and the Priority Queue Accelerator Module claim
O(1) run-time complexity for INSERT, this implementation shows about a 32 times
gain using the Priority Queue Accelerator Module, because FHPQ consistently
spends a large number of cycles on pointer manipulation. Similarly, for the
EXTRACT operation, the worst case speed-up gain achieved by the Priority Queue
Accelerator Module is over 2600 times.
 
 
The worst case speed-up gains reported above could be even larger if the
priority queues were compared at a larger queue size. This is because the
run-time of the software priority queues grows with increasing queue size, while
the Priority Queue Accelerator Module has constant run-time complexity for any
queue size. The main drawback of the Priority Queue Accelerator Module is the
amount of logic consumed, whereas software implementations of priority queues
consume no logic, only space in random access memory. If speed is the top
priority, the logic consumption is a worthwhile trade-off.
 
 
 
 
7.4 Comparison with Other Priority Queue Hardware Designs
 
 
As mentioned in the review of related work, there are a number of hardware
priority queue processor designs. All of those designs, however, specifically
target internet packet routing applications and therefore seek full-custom ASIC
implementation. Although it is not the norm to compare an FPGA implementation
(our design) with a full-custom ASIC implementation, here we compare our
FPGA-based hwPQ design with these ASIC implementations.
 
 
The performance of a priority queue processor in internet routing is given in
terms of ‘Million operations per second’ (M op/s) and data throughput in ‘Giga
bits per second’ (Gbps). In internet routing, information is routed in packets.
Each packet consists of a priority-level and satellite-data. The ‘M op/s’ figure
measures how many packets the processor can handle in a second. For example, our
design completes each operation in 2 clock-cycles at a maximum clocking frequency
of 250 MHz, hence 125 M op/s (i.e. 250 MHz / 2 clock-cycles per operation = 125
M op/s). The satellite-data in an internet packet is 53 Bytes (424 bits), so the
data throughput rate in Gbps is obtained by multiplying the M op/s by 424 bits.
In our implementation there is no satellite-data, therefore we do not compare in
terms of Gbps. If we did, our implementation would give 125 M op/s * 424 bits =
53 Gbps, which far exceeds the 10 Gbps expected of the OC-192 internet network
protocol.
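
The two figures of merit above follow directly from the clock frequency, the
cycles per operation and the packet payload size. A minimal sketch of the
calculation is given below; the function and constant names are ours and are
not part of the design.

```python
def ops_per_second(clock_hz: float, cycles_per_op: int) -> float:
    """Operations per second for a unit needing a fixed number of cycles per operation."""
    return clock_hz / cycles_per_op


def throughput_gbps(ops: float, payload_bits: int) -> float:
    """Data throughput in Gbps if every operation carries payload_bits of satellite-data."""
    return ops * payload_bits / 1e9


PACKET_BITS = 53 * 8  # 53-byte satellite-data, i.e. 424 bits

ops_250 = ops_per_second(250e6, 2)             # hwPQ at its maximum clock
gbps_250 = throughput_gbps(ops_250, PACKET_BITS)

print(f"{ops_250 / 1e6:.0f} M op/s, {gbps_250:.0f} Gbps")
```

The same two functions can be applied to the 70 MHz operating point reported
in Table 7.6.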
 
 
Table 7.6: Comparison with other hardware implementations
Designs                                       Target Technology           Clk (MHz)  Priority Level  Performance (M op/s)
hwPQ                                          Stratix II EP2S60ES (FPGA)  250        32-bit          125
hwPQ                                          Stratix II EP2S60ES (FPGA)  70         32-bit          35
Ioannou (2000), Ioannou and Katevenis (2001)  0.18um CMOS (ASIC)          200        18-bit          100
Bhagwan et al (2000)                          0.35um CMOS (ASIC)          N/A        32-bit          35.56
Moon et al (2000)                             1.2um CMOS (ASIC)           40         8-bit           22.84
 
 
Referring to Table 7.6, our design (hwPQ) achieved a maximum clocking rate of
250 MHz in the Altera Stratix II EP2S60ES FPGA device. Each operation completes
in two clock cycles. We report our implementation at two different clocking
rates: 250 MHz and 70 MHz. The 70 MHz figure is given because the waveform
simulation is conducted at this frequency; the maximum running frequency of the
embedded NIOS II system module is also 70 MHz.
 
 
Ioannou (2000), Ioannou and Katevenis (2001), and Bhagwan and Lin (2000a,
2000b) independently proposed ASIC implementations which utilize O(n) memory
(i.e. RAM) and O(lg n) processing-elements. These implementations adopt
pipelined binary-heap management. Referring to Table 7.2, although our design is
implemented on an FPGA platform, it achieves similar performance compared to
these ASIC implementations. This is because the use of O(n) RAM in their designs
slowed down the performance. Normally, one can expect to achieve a higher
clocking rate when an FPGA-based design is migrated into ASIC, thus our hwPQ
could achieve even better results if it were migrated into ASIC.
 
Moon et al. (2002) proposed a design called Hybrid-Shift Priority Queue, which
is an improved version of the shift-register priority queue. It uses O(n)
processing-elements and achieves a maximum clocking rate of 40 MHz in ASIC
implementation. In comparison, our proposed design achieves a maximum clocking
rate of 250 MHz in FPGA implementation, even though our design supports a larger
range of priority-levels.
 
 
In short, we conclude that even though our implementation is in FPGA, our
design achieves very good performance compared with the ASIC implementations. We
do not compare the implementations in terms of exact hardware resource
utilization, i.e. the number of gates or the number of transistors, because the
high scalability of a priority queue processor allows it to be implemented in
any number of gates/transistors.
 
 
 
 
7.5 Performance Evaluation Platform 
 
 
In the following sections, a performance evaluation is presented to verify the
significance of accelerating priority queue operations in graph computation. A
demonstration application prototype is developed to validate the design. In this
prototype, the GPU is used to execute the graph-based shortest path algorithm
for VLSI routing. The algorithm is called S-RABI, for Simultaneous Maze Routing
and Buffer Insertion algorithm. Figure 7.3 gives an overview of the entire
demonstration prototype. Figure 7.4 illustrates the graphical user interface
developed, called “VLSI Maze Routing DEMO”, which is executed on the host PC to
allow generation of sample VLSI post-placement layouts and gives a graphical
presentation of the routing results returned from the execution of S-RABI on the
GPU.
 

[Block diagram. Labels shown: Host PC; VLSI Maze Routing DEMO (GUI); UART;
Graph Processing Unit (GPU); NIOS II Processor; Simultaneous Maze Routing and
Buffer Insertion algorithm (S-RABI); HybridPQ; System Bus; Priority Queue
Accelerator Module; Avalon Interface Unit; Hardware Priority Queue Unit.]

Figure 7.3: Overview of demonstration prototype
 
 
 
 
Figure 7.4: GUI of “VLSI Maze Routing DEMO” application 
 
 
 
 
 
 
 
 
7.6 Performance of Priority Queue in Graph Computation 
 
 
In this thesis, HybridPQ is implemented with a software priority queue, FHPQ,
and a hardware priority queue, hwPQ-250 (which supports up to 250 queue entries
in hardware). We compare the execution of the graph algorithm on the GPU under
two circumstances:
(i) The general purpose processor (NIOS II) with a software priority queue
(FHPQ); the algorithm can invoke INSERT, EXTRACT and DECREASE-KEY.
(ii) The general purpose processor (NIOS II) with HybridPQ. Recall that
HybridPQ supports only INSERT and EXTRACT. Therefore, the (modified)
graph-based shortest path algorithm invokes only these two queue operations.
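
To illustrate the idea of a hybrid queue that exposes only INSERT and EXTRACT,
a minimal software model is sketched below. It is not the thesis
implementation: the class name is hypothetical, Python's binary-heap `heapq`
stands in for both the hardware hwPQ and the software FHPQ, and the overflow
policy (new entries go to the software part whenever the bounded part is full)
is an assumption made for illustration.

```python
import heapq


class HybridPQSketch:
    """Toy hybrid queue: a bounded 'hardware' part plus an unbounded software part."""

    def __init__(self, hw_capacity=250):
        self.hw_capacity = hw_capacity
        self.hw = []  # stand-in for hwPQ (bounded)
        self.sw = []  # stand-in for the software overflow queue

    def insert(self, priority, identifier):
        # Assumed policy: fill the bounded part first, overflow into software.
        target = self.hw if len(self.hw) < self.hw_capacity else self.sw
        heapq.heappush(target, (priority, identifier))

    def extract(self):
        """Remove and return the (priority, identifier) pair with the smallest priority."""
        if not self.hw and not self.sw:
            raise IndexError("extract from empty queue")
        if not self.sw or (self.hw and self.hw[0] <= self.sw[0]):
            return heapq.heappop(self.hw)
        return heapq.heappop(self.sw)

    def __len__(self):
        return len(self.hw) + len(self.sw)


q = HybridPQSketch(hw_capacity=2)
for priority, identifier in [(5, 0), (3, 1), (1, 2), (4, 3)]:
    q.insert(priority, identifier)
order = [q.extract()[0] for _ in range(len(q))]
print(order)
```

With a capacity of two, the last two insertions overflow into the software
part, yet EXTRACT still returns entries in priority order because the minima
of both parts are compared.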
 
 
For the entire graph computation, total elapsed cycles spent by the Nios II 
processor to execute the priority queue operation is recorded and denoted as TPQ. We 
have TPQ for INSERT, TPQ for EXTRACT, and TPQ for ALL priority queue 
operations (i.e. the sum of TPQ for INSERT and TPQ for EXTRACT).  
 
 
Dijkstra’s algorithm is executed three times during graph pruning in S-RABI.
For both Dijkstra’s and S-RABI algorithms, the computations involve priority
queue operations and other operations (i.e. cost calculation, memory access,
etc.). Figure 7.5 illustrates that under circumstance (i), with the software
priority queue, most of the graph computation time is spent on priority queue
operations; i.e. almost 90% of the total computation run-time in Dijkstra’s
algorithm is spent on priority queue operations (TPQ/ENTIRE * 100% = 90%). For
S-RABI, almost 50% of the total computation run-time is spent on priority queue
operations, and the remaining 50% on interconnect delay calculation,
dominancy-check, etc. When the input problem size increases (i.e. the graph size
increases), the percentage of computation run-time spent on priority queue
operations also increases; the increase is particularly obvious in S-RABI. As
the execution of priority queue operations takes a significant portion of the
entire graph computation, our attempt in this research to speed up priority
queue operations in graph computation is a viable approach.

[Plot: TPQ/ENTIRE*100% VS Graph Size. x-axis: Graph Size (100 to 2025);
y-axis: TPQ/ENTIRE*100% (0 to 90); series: Dijkstra_1, S-RABI.]

Figure 7.5: TPQ VS Entire Graph Computation Run-Time
 
 
Referring to Figure 7.6, throughout the entire execution of algorithm, the 
queue size grows from the beginning of the algorithm, reaches a maximum queue 
size, and then declines to zero. The maximum queue size determines the worst case 
queue size one must provide for graph computation. For larger graph problem size, 
the maximum queue size is larger.  
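
The queue-size profile described above can be reproduced in software with a
shortest path routine that uses only INSERT and EXTRACT and records the largest
queue size reached. The sketch below is an assumption-laden illustration, not
the modified algorithm of this thesis: it uses the common technique of
re-inserting a vertex instead of DECREASE-KEY and skipping stale entries on
extraction, a unit-weight grid graph as input, and hypothetical function names.

```python
import heapq


def dijkstra_insert_extract_only(adj, source):
    """adj maps vertex -> list of (neighbour, weight). Returns (dist, max_queue_size)."""
    dist = {source: 0}
    queue = [(0, source)]
    max_queue_size = 1
    done = set()
    while queue:
        d, u = heapq.heappop(queue)  # EXTRACT
        if u in done:
            continue  # stale duplicate left behind by an earlier re-insert
        done.add(u)
        for v, w in adj[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(queue, (nd, v))  # INSERT in place of DECREASE-KEY
                max_queue_size = max(max_queue_size, len(queue))
    return dist, max_queue_size


def grid_graph(rows, cols):
    """Obstacle-free grid with unit edge weights (a stand-in for an empty layout)."""
    adj = {}
    for r in range(rows):
        for c in range(cols):
            nbrs = []
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    nbrs.append(((rr, cc), 1))
            adj[(r, c)] = nbrs
    return adj


dist, max_q = dijkstra_insert_extract_only(grid_graph(10, 10), (0, 0))
print("vertices reached:", len(dist), "maximum queue size:", max_q)
```

The recorded maximum is the quantity that a fixed-size hardware queue would
have to accommodate for that input.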
 
[Plot: Graph 50*50. x-axis: Time; y-axis: Queue Size (0 to 140).]

Figure 7.6: Size of Priority Queue for Entire Graph Computation
 
 
For both Dijkstra’s and S-RABI algorithms, we compare the execution of the
algorithm under circumstance (ii) with its execution under circumstance (i).
Both circumstances use Nios II as the general purpose processor, but the
implemented priority queue is different. As this thesis only targets
accelerating priority queue operations in graph computation (not other
operations, e.g. interconnect delay computation), we conduct the comparison in
terms of the speed up gain on the execution of priority queue operations.
7.6.1 Worst Case Analysis 
 
 
In this thesis, the worst case problem refers to input graphs with no obstacles
at all, i.e. an empty post-placement layout with no transistors. Although this
is an extreme case which will never occur, we give this worst case analysis to
explain the general characteristics one can expect with respect to input problem
size. Several worst case graph samples are generated using the GUI. Referring to
Figures 7.7 and 7.8, the ‘maximum queue size’ of Dijkstra’s and S-RABI increases
as the graph problem size increases. For Dijkstra’s algorithm, which is a
single-constraint routing algorithm, the increase of ‘maximum queue size’ versus
input problem size is in the range of tens. This is very small compared to the
multi-weighted S-RABI algorithm, for which the increase of ‘maximum queue size’
versus input problem size is in the range of hundreds.
 
[Plot: Dijkstra: Maximum Queue Size VS Graph Size. x-axis: Graph Size (100 to
4900); y-axis: Maximum Queue Size (0 to 100); series: FHPQ, HybridPQ.]

Figure 7.7: Dijkstra’s - Maximum Queue Size VS Graph Size
 
 
[Plot: S-RABI: Maximum Queue Size VS Graph Size. x-axis: Graph Size (100 to
4900); y-axis: Maximum Queue Size (0 to 1200); series: FHPQ, HybridPQ.]

Figure 7.8: S-RABI - Maximum Queue Size VS Graph Size

Referring to Figures 7.9 and 7.10, as the problem size increases (i.e. the total
number of vertices in the graph increases), the run-time of the graph algorithm
increases, hence the total number of priority queue operations invoked
throughout the entire graph computation also increases. Compared with the
increase of maximum queue size versus input problem size, the increase in the
total number of priority queue operations is exponential with respect to the
increase in input problem size. The increase is even more significant in
multi-weighted routing, i.e. S-RABI (in the range of ten-thousands), compared to
Dijkstra’s (in the range of thousands).
 
[Plot: Dijkstra: Total Number of Operations VS Graph Size. x-axis: Graph Size
(100 to 3600); y-axis: Total Number of Operations (0 to 8000); series: FHPQ,
HybridPQ.]

Figure 7.9: Dijkstra’s – Total number of operations VS Graph Size
 
 
[Plot: S-RABI: Total Number of Operations VS Graph Size. x-axis: Graph Size
(tick labels 1 to 11); y-axis: Total Number of Operations (0 to 70000); series:
FHPQ, HybridPQ.]

Figure 7.10: S-RABI – Total number of operations VS Graph Size
 
 
\[
\text{Speed Up Gain} = \frac{T_{PQ}\ \text{when using FHPQ}}{T_{PQ}\ \text{when using HybridPQ}}
\]
 
From here on, we define the ‘speed up gain’ by the above equation. For each
priority queue operation in graph computation, the speed up gain compares the
TPQ (of that operation, i.e. INSERT, EXTRACT, or ALL) when the processor uses
FHPQ (i.e. circumstance (i)) against the TPQ (of the same operation) when the
processor uses HybridPQ (i.e. circumstance (ii)); a gain greater than one means
that HybridPQ spends fewer cycles. Note that under circumstance (i), there are
three priority queue operations measured, i.e. INSERT, EXTRACT and
DECREASE-KEY. Referring to Figures 7.11 and 7.12, the number of DECREASE-KEY
operations and the total cycles elapsed on this operation are actually very
small compared to INSERT or EXTRACT. Hence for comparison purposes, we add the
TPQ of DECREASE-KEY to the TPQ of EXTRACT under circumstance (i), and compare
the sum with the total elapsed cycles of EXTRACT under circumstance (ii).
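
The bookkeeping described above can be sketched as follows. The cycle counts
in the example are made-up placeholder values chosen only to exercise the
functions; they are not measurements from this work, and the function names
are ours. The second helper expresses a gain as a percentage reduction in
elapsed cycles, the relation used later in Section 7.7.

```python
def speed_up_gain(t_pq_fhpq: float, t_pq_hybrid: float) -> float:
    """T_PQ under circumstance (i) divided by T_PQ under circumstance (ii)."""
    if t_pq_hybrid <= 0:
        raise ValueError("T_PQ for HybridPQ must be positive")
    return t_pq_fhpq / t_pq_hybrid


def cycle_reduction_percent(gain: float) -> float:
    """Percentage of elapsed cycles saved for a given speed up gain."""
    return (1 - 1 / gain) * 100


# Placeholder cycle counts (hypothetical, for illustration only).
fhpq = {"INSERT": 1_000, "EXTRACT": 8_000, "DECREASE-KEY": 200}
hybrid = {"INSERT": 250, "EXTRACT": 410}

gain_insert = speed_up_gain(fhpq["INSERT"], hybrid["INSERT"])
# DECREASE-KEY cycles of FHPQ are folded into EXTRACT before comparing.
gain_extract = speed_up_gain(fhpq["EXTRACT"] + fhpq["DECREASE-KEY"], hybrid["EXTRACT"])
gain_all = speed_up_gain(sum(fhpq.values()), sum(hybrid.values()))

print(gain_insert, gain_extract, round(gain_all, 2))
```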
 
[Plot: S-RABI (FHPQ): Number of Operation VS Graph Size. x-axis: Graph Size
(100 to 3600); y-axis: Number of Operation (0 to 30000); series: #INSERT,
#EXTRACT, #DECREASE.]

Figure 7.11: S-RABI (FHPQ): Number of operations VS Graph Size
 
[Plot: S-RABI (FHPQ): Total Cycle Elapsed for each operation VS Graph Size.
x-axis: Graph Size (100 to 3600); y-axis: Total Cycle Elapsed (0 to
2,500,000,000); series: #INSERT, #EXTRACT, #DEC_KEY.]

Figure 7.12: S-RABI (FHPQ): Total Cycle Elapsed for each operation
 
 
[Plot: Dijkstra: Speed Up Gain VS maximum queue size. x-axis: maximum queue
size (12 to 71); y-axis: Speed Up Gain (0 to 50); series: INSERT, EXTRACT, ALL.]

Figure 7.13: Dijkstra’s – Speed up Gain of using HybridPQ
 
 
In Figures 7.7 and 7.8, we have shown that the maximum queue size increases as
the graph size increases. For any input problem size, as long as the resulting
maximum queue size (during graph computation) does not exceed the underlying
hwPQ size of HybridPQ, a significant speed up gain is obtained. This can be seen
in Figure 7.13. Although the INSERT operation has O(1) run-time complexity for
both HybridPQ and FHPQ, in Dijkstra’s execution INSERT is sped up almost 5-fold
with HybridPQ (circumstance (ii)) compared to software FHPQ (circumstance (i)).
Meanwhile, for the EXTRACT operation, FHPQ gives O(lg n) run-time complexity
while HybridPQ gives constant O(1) run-time complexity, as long as the maximum
queue size does not exceed the size of hwPQ in HybridPQ. Therefore, a
logarithmic speed up for the EXTRACT operation can be obtained when using
HybridPQ, and the speed up gain increases as the problem size increases. For ALL
priority queue operations (i.e. TPQ for INSERT + TPQ for EXTRACT), the speed up
gain is about half the speed up gain of the EXTRACT operation. This is because,
from the run-time complexity point of view, the speed up gain of HybridPQ over
FHPQ comes basically from the EXTRACT operation.
 
 
Figure 7.14 illustrates the speed up gain obtained by using HybridPQ in the
S-RABI algorithm. As in the analysis of Dijkstra’s execution, the speed up gain
increases as the problem size increases. This holds up to the point where the
problem yields a maximum queue size larger than the size of hwPQ (in HybridPQ,
which is hwPQ-250); beyond it, HybridPQ directs queue entries into its
underlying software FHPQ, and the speed up gain therefore decreases. When the
maximum queue size greatly exceeds the size of hwPQ, HybridPQ no longer gives a
speed up gain compared to software FHPQ. Note that this thesis implemented
hwPQ-250 (size = 250), for which the maximum speed up gain is about 23 times
(when the maximum queue size is 250).
 
 
[Plot: S-RABI: Speed Up Gain VS maximum queue size. x-axis: maximum queue size
(96 to 843); y-axis: Speed Up Gain (0 to 45); series: INSERT, EXTRACT, ALL.]

Figure 7.14: S-RABI – Speed up gain of using HybridPQ
 
 
 
 
 
7.6.2 Practical Case Analysis 
 
 
In practical routing problems, the post-placement layout contains a number of
wire obstacles and buffer obstacles. The initial input problem size can be
reduced by excluding the obstacle regions, via graph pruning. Hence for
practical case analysis, sample graphs of various sizes are generated with
randomly placed obstacles. To further investigate the effect on speed up gain
when the density of placement/obstacles varies, the sample graphs are further
categorized into high-dense graphs and less-dense graphs. Here, we assume that
if 30% to 50% of the total vertices in the graph (problem) are obstacles, the
sample is classified as a less-dense graph, whereas if 50% to 80% of the total
vertices are obstacles, the sample is classified as a high-dense graph.
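
The classification rule can be written down as a small helper. This is a
sketch only: the function name is ours, and since the text leaves the 50%
boundary ambiguous, assigning exactly 50% to the high-dense class is an
assumption. Samples outside the 30% to 80% band (such as the obstacle-free
worst case) are reported as unclassified.

```python
def classify_density(obstacle_vertices: int, total_vertices: int) -> str:
    """Classify a sample graph by the fraction of its vertices that are obstacles."""
    if total_vertices <= 0:
        raise ValueError("total_vertices must be positive")
    ratio = obstacle_vertices / total_vertices
    if 0.30 <= ratio < 0.50:
        return "less-dense"
    if 0.50 <= ratio <= 0.80:  # assumption: exactly 50% counts as high-dense
        return "high-dense"
    return "unclassified"


print(classify_density(40, 100), classify_density(60, 100), classify_density(0, 100))
```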
 
 
Due to the effect of graph pruning, which removes the vertices in obstacle
regions, both high-dense and less-dense graphs give a lower maximum queue size
compared to the worst case graph problems. Referring to Figures 7.15 and 7.16,
the maximum queue size is reduced dramatically by the presence of obstacle
regions in both high-dense and less-dense graphs.
 
 
[Plot: S-RABI - FHPQ: Maximum Queue Size VS Graph Size. x-axis: Graph Size (100
to 3600); y-axis: Maximum Queue Size (0 to 600); series: WORST CASE, HIGH DENSE,
LESS DENSE.]

Figure 7.15: S-RABI – FHPQ: Maximum Queue Size vs Graph Size
 
 
[Plot: S-RABI - HybridPQ: Maximum Queue Size VS Graph Size. x-axis: Graph Size
(100 to 3600); y-axis: Maximum Queue Size (0 to 900); series: WORST CASE, HIGH
DENSE, LESS DENSE.]

Figure 7.16: S-RABI – HybridPQ: Maximum Queue Size vs Graph Size
 
  
Figures 7.17 and 7.18 illustrate the speed up gain of using HybridPQ
(circumstance (ii)) in graph computation compared to FHPQ (circumstance (i)),
with respect to the initial input graph size. The speed up gain obtained for the
INSERT operation remains at a constant value, as in the worst case analysis. The
speed up gain for the EXTRACT operation seems to increase as the graph size
increases, but the relationship between the speed up gain and the initial input
graph size is unclear.
 
 
[Plot: High Dense - S-RABI: Speed Up Gain of HybridPQ over FHPQ. x-axis: Graph
Size (100 to 6400); y-axis: Speed Up Gain (0 to 50); series: INSERT, EXTRACT,
ALL.]

Figure 7.17: High Dense – S-RABI: Speed up gain of using HybridPQ
 
 
[Plot: Less Dense - S-RABI: Speed Up Gain of HybridPQ over FHPQ. x-axis: Graph
Size (100 to 3600); y-axis: Speed Up Gain (0 to 45); series: INSERT, EXTRACT,
ALL.]

Figure 7.18: Less Dense – S-RABI: Speed up gain of using HybridPQ
 
 
Thus, for all graph samples, independent of the density of obstacles (i.e.
whether high dense or less dense), a scatter graph showing speed up gain versus
maximum queue size is plotted. The maximum queue size is directly proportional
to the effective problem size after pruning. Figure 7.19 shows the scatter
graph; the explanation given in the worst case analysis applies here as well.
For S-RABI, the speed up gain of the INSERT operation is constant. For EXTRACT,
the speed up gain increases as the maximum queue size of the problem increases.
 
 
"HybridPQ" - S-RABI: Speed Up Gain VS max queue size
[Scatter plot. x-axis: max queue size (0 to 450); y-axis: Speed Up Gain (0 to
50); series: INSERT, EXTRACT, ALL.]

Figure 7.19: S-RABI – HybridPQ: Speed up gain vs Maximum Queue Size

Recall that Dijkstra’s algorithm is executed in graph pruning. For both
high-dense and less-dense graphs, the relationship between speed up gain and
initial input graph size is also unclear. Hence, we apply the same scatter graph
analysis as for S-RABI. Referring to Figure 7.20, the relationship between speed
up gain and maximum queue size is similar to the case of S-RABI.
 
 
[Scatter plot: HybridPQ - Dijkstra: Speed Up Gain VS max queue size. x-axis:
max queue size (0 to 45); y-axis: Speed Up Gain (0 to 40); series: INSERT,
EXTRACT, ALL.]

Figure 7.20: Dijkstra’s – HybridPQ: Speed up Gain VS Maximum Queue Size
 
 
 
 
7.7 Summary 
 
 
To conclude the above analysis: compared to software FHPQ, the deployment of
HybridPQ, which consists of an underlying hwPQ, achieves at most approximately
23 times speed up gain in overall priority queue operations during graph
computation; approximately 45 times speed up for the EXTRACT operation and 5
times speed up for the INSERT operation. The speed up gain will be higher for
larger graph sizes, provided a larger hwPQ can be deployed. Note that the 23
times speed up gain is equivalent to a 96% reduction in total elapsed processor
cycles in executing the priority queue operations, i.e. (1 – 1/23)*100% = 96%.
 
 
From our detailed analysis (see Table 7.7), for HybridPQ each INSERT operation
takes on average 486 clock cycles while each EXTRACT operation takes 644 clock
cycles. If the API abstraction layer of HybridPQ is bypassed, meaning the
priority queue function is invoked directly through the device driver, each
INSERT on hwPQ takes on average 347 clock cycles while each EXTRACT takes 189
clock cycles. This is despite our design effort in modeling the Priority Queue
Accelerator Module, where each INSERT should take 5 clock cycles and each
EXTRACT should take 6 clock cycles. This scenario is due to several unavoidable
factors:
 
(1) The targeted NIOS II development environment employs multiple processors (or
coprocessors) that share a common bus (the Avalon System Bus) for data
communication. Data transfer through a shared bus architecture is slow. For
instance, inserting a new element into the Priority Queue Accelerator Module
takes four ‘bus write cycles’: (i) fetch the priority value, (ii) fetch the
identifier, (iii) fetch the INSERT operation mode, and finally (iv) fetch the
DO_NOTHING operation mode. Each of the bus cycles takes a number of clock
cycles to complete.

(2) The development of device drivers for user peripherals (in this case, the
Priority Queue Accelerator Module) on NIOS II is based on the Altera
Hardware Abstraction Layer (HAL) technology. The abstraction layer adds
overhead to the communication and slows down the data transfer between
NIOS II and the Priority Queue Accelerator Module.

(3) Device drivers are generally better written in assembly code to optimize
the communication between devices. In NIOS II development, the device driver
for user peripherals is written in C code, which is not optimized. This also
slows down the communication between NIOS II and the user peripheral.
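
The four bus write cycles listed in factor (1) can be modelled with a mock bus
that simply records each write. Everything named in the sketch is hypothetical:
the register names, the mode codes and the `MockAvalonBus` class are stand-ins
for illustration and do not reproduce the actual register map or the Altera
HAL calls used by the device driver.

```python
class MockAvalonBus:
    """Records writes; stands in for the memory-mapped interface to the accelerator."""

    def __init__(self):
        self.writes = []

    def write(self, register, value):
        self.writes.append((register, value))


# Hypothetical register names and operation-mode codes.
REG_PRIORITY, REG_IDENTIFIER, REG_MODE = "PRIORITY", "IDENTIFIER", "MODE"
MODE_DO_NOTHING, MODE_INSERT = 0, 1


def hwpq_insert(bus, priority, identifier):
    """One INSERT as seen from the processor: four separate bus write cycles."""
    bus.write(REG_PRIORITY, priority)      # (i) priority value
    bus.write(REG_IDENTIFIER, identifier)  # (ii) identifier
    bus.write(REG_MODE, MODE_INSERT)       # (iii) INSERT operation mode
    bus.write(REG_MODE, MODE_DO_NOTHING)   # (iv) DO_NOTHING operation mode


bus = MockAvalonBus()
hwpq_insert(bus, priority=17, identifier=42)
print(len(bus.writes), "bus writes for one INSERT")
```

Since each write is a full bus transaction, the per-operation cost seen by the
processor is a multiple of the bus cycle cost rather than the two clock cycles
of hwPQ itself.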
 
  
For all the above reasons, no improvement can be made unless we migrate the
design to another development platform. Some possible improvements are suggested
in the last chapter. If the improvements are made and the hwPQ is fully utilized
at two clock cycles per operation, the speed up gain will be enormous:
approximately another 250-fold on top of the current speed up of INSERT and
300-fold on top of the current speed up of EXTRACT. This means that in the ideal
condition, a speed up gain of about 1250 (250*5) for INSERT and 13,500 (300*45)
for EXTRACT can be obtained.
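
The additional factors quoted above can be checked against the per-layer cycle
counts of Table 7.7. The sketch below recomputes them; the dictionary layout
and variable names are ours. The exact ratios come out as 243 and 322, which
the text rounds to 250 and 300.

```python
# Clock cycles per operation at each layer, taken from Table 7.7.
cycles = {
    "HybridPQ API": {"INSERT": 486, "EXTRACT": 644},
    "Device Driver": {"INSERT": 347, "EXTRACT": 189},
    "Avalon Interface Unit": {"INSERT": 5, "EXTRACT": 6},
    "hwPQ": {"INSERT": 2, "EXTRACT": 2},
}

# Current speed up gains over FHPQ, as reported in this section.
current_gain = {"INSERT": 5, "EXTRACT": 45}

extra_factor = {}
ideal_gain = {}
for op in ("INSERT", "EXTRACT"):
    # Further gain if the API-level cost dropped to the bare hwPQ cost.
    extra_factor[op] = cycles["HybridPQ API"][op] / cycles["hwPQ"][op]
    ideal_gain[op] = extra_factor[op] * current_gain[op]

print(extra_factor, ideal_gain)
```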
 
 
Table 7.7: Number of elapsed clock cycles per operation
                         Number of clock cycles per operation
Layers                   INSERT     EXTRACT
HybridPQ API             486        644
Device Driver            347        189
Avalon Interface Unit    5          6
hwPQ                     2          2
 
CHAPTER 8 
 
 
 
 
CONCLUSIONS 
 
 
 
 
8.1 Concluding Remarks 
 
 
Nanometer VLSI interconnect routing problems are typically modeled as
graph-based shortest path problems, and the corresponding algorithms are highly
compute-intensive. This thesis has proposed a graph processing hardware
accelerator that can speed up the execution of these computationally expensive
graph-based shortest path algorithms by implementing the priority queue in
hardware. Thus, a custom Graph Processing Unit (GPU), in which a hardware
priority queue accelerator module is embedded, has been designed and prototyped
on a reconfigurable FPGA-based hardware platform. Constraints of the hardware
priority queue accelerator module make it necessary to modify the graph-based
shortest path algorithms before the queue accelerator can be deployed to
off-load and speed up the priority queue operations during graph computation. In
addition, the overflow issue of the fixed-size hardware priority queue is
addressed and solved with the proposed hybrid hardware-software priority queue
(HybridPQ).
 
 
The top-level Priority Queue Accelerator Module consists of the Hardware
Priority Queue Unit (hwPQ) and the Avalon™ Interface module. The hwPQ is custom
designed to be highly parameterizable and theoretically cascadable for various
queue sizes. The features of the Hardware Priority Queue Unit are summarized in
Table 8.1 below.
 
 
Table 8.1: Features of Hardware Priority Queue Unit (hwPQ)
Specification         Description
Function              A Priority Queue Processor Core
Operations supported  1. INSERT
                      2. EXTRACT-MIN
                      3. PEEK or MIN
                      4. DELETE-MIN
Performance           Constantly two clock cycles per operation.
Hardware Platform     Altera Stratix II EP2S60F672C5ES FPGA Device (Nios II
                      Professional Development Board)
Maximum Clock         250 MHz
Data Range            The priority-value and the associated-identifier are
                      32-bit each.
Architecture          One dimensional systolic array architecture. Employs n
                      PEs for n priority queue entries.
 
 
The Priority Queue Accelerator Module is modeled in VHDL, through a
hierarchical modular design approach. Adopting the embedded system architecture
and SoPC technology, the Priority Queue Accelerator Module is integrated with
the NIOS II embedded processor to produce the Graph Processing Unit (GPU). The
GPU is synthesized for implementation on the Altera Stratix II EP2S60F672C5ES
FPGA device, where the Priority Queue Accelerator Module off-loads and
accelerates priority queue operations from the GPU. The entire development also
includes two more components, the embedded device drivers and the HybridPQ APIs.
The device drivers are used by NIOS II to trigger operations on the Priority
Queue Accelerator Module. The HybridPQ API implements the hybrid
hardware-software priority queue, which avoids possible queue overflow on the
Priority Queue Accelerator Module. Both the device drivers and the HybridPQ API
are developed using the embedded C language.
 
 
For design validation and performance verification purposes, the S-RABI
algorithm for nanometer VLSI interconnect routing is implemented as an embedded
software module executed by the GPU. Also, a Graphical User Interface (GUI)
which allows generation of sample post-placement layouts is developed and
executed on the host PC. The GUI sends the sample routing problems to be
executed by the embedded S-RABI module on the GPU, to find the minimum
interconnect delay path. The path is sent back and displayed on the GUI.
 
 
The performance of the S-RABI algorithm using the software priority queue
(FHPQ) is compared with the performance of the modified S-RABI algorithm using
HybridPQ (which incorporates the Priority Queue Accelerator Module), both
executed on the GPU. When the input problem is large enough and the capacity of
HybridPQ is fully utilized, about 23 times speed up in total processor cycles
elapsed on priority queue operations is achieved, which in turn means about 96%
run-time reduction in priority queue operations.
 
 
 
 
 
8.2 Recommendation for Future Work 
 
 
In the interest of improving the design of the Graph Processing Unit in this 
work, several potential extensions are suggested for future undertakings: 
 
• VLSI implementation of Hardware Priority Queue Unit (hwPQ)

If a higher speed yet lower logic cost design is required, then this option
would be suitable. In the FPGA prototype, each PE in hwPQ utilizes fewer than
170 logic-elements (LEs). In a 0.25um process, each LE is equivalent to only 7
gates in ASIC implementation. Hence it is strongly believed that a very large
hwPQ can be obtained in current multi-million gate fabrication processes. The
current implementation of hwPQ on the FPGA device has already achieved a high
clocking rate. It is believed that an even higher clocking rate can be achieved
through full-custom ASIC implementation of hwPQ.
 
 
• Incorporating Decrease-Key function in hwPQ

The implemented size of the priority queue must scale well with the input graph
size. Thus HybridPQ is introduced to avoid overflow in the fixed-size hardware
priority queue, which in turn made the Decrease-Key function redundant. For
fixed graph problems, the worst case priority queue size is known, so the
required size of hwPQ can be determined prior to graph computation. In this
case, Decrease-Key can be implemented on hwPQ by adding more steering logic or
by using two comparators per PE, at the cost of more logic consumed per PE and
more clock cycles spent per operation.
 
 
 
• Migrate to higher performance system bus

The availability of the Avalon system bus accelerates SoC prototyping efforts.
However, the Avalon bus is 32 bits wide while hwPQ is designed with 64-bit
interfaces; hence, the hwPQ is not running at its maximum throughput in the
Avalon bus environment. Besides, data communication through the Avalon bus is
rather slow. Although the Priority Queue Accelerator Module is designed to
handle each priority queue operation in 5 or 6 clock cycles, the actual
execution via the Avalon bus takes a few hundred clock cycles to complete each
operation. Hence, exploring a better system bus design would ease the
communication bottleneck between the hwPQ and the general purpose processor.
 
 
 
• Migrate to higher performance general purpose processor

In this thesis, the design of hwPQ achieves a maximum clocking rate of 240 MHz
on the corresponding FPGA device. In order to synchronize with the speed of the
softcore NIOS II processor, which is only 70 MHz, the hwPQ is tuned down to run
at 70 MHz. Hence, hardcore processors such as the ARM embedded processor can be
exploited to replace the current NIOS II processor, as they have higher system
clock frequencies, and this will provide a major boost in performance. The ARM
embedded processor is supported by the industry-standard AMBA™ high-performance
bus (AHB), which allows the processor stripe to operate at up to 200 MHz.
 
 
• Improve the device driver and API

The device driver that communicates with the Priority Queue Accelerator Module
is developed as an embedded C program, based on the Altera Nios II Hardware
Abstraction Layer (HAL). The device driver can be further improved by writing
the controls in assembly language or by bypassing the use of HAL.
 
REFERENCES 
 
 
 
 
Altera Corporation (2003a). Introduction to Quartus II. Altera Corporation. 
 
Altera Corporation (2003b). SOPC Builder Data Sheet. Altera Corporation. 
 
Altera Corporation (2004a). Nios II Hardware Development Tutorial. Altera Corporation. 
 
Altera Corporation (2004b). Nios II Processor Reference Handbook. Altera Corporation. 
 
Altera Corporation (2004c). Nios II Software Developer’s Handbook. Altera Corporation. 
 
Altera Corporation (2005a). Avalon Interface Specification. Altera Corporation. 
 
Alpert, C. J., Hu, J., Sapatnekar, S. S. and Villarrubia, P. G.  (2001). A Practical Methodology 
for Early Buffer and Wire Resource Allocation. IEEE/ACM Design Automation Conference, 
Las Vegas. 2001. Nevada, United States: IEEE/ACM, 189-195. 
 
Alpert, C. J., Hrkic, M. and Quay, S. T. (2004). A Fast Algorithm for Identifying Good Buffer 
Insertion Candidates Locations. ACM International Symposium on Physical Design (ISPD’04). 
Phoenix, Arizona, USA: ACM, 47-52. 
 
Argon, J. A. (2006). Real-Time Scheduling Support for Hybrid CPU/FPGA SoCs. University 
of Kansas, United States of America: Master Degree Thesis. 
 
Auletta, V., Das, S. K., Vivo, A. D., Pinotto, M. C., Scarano, V. (2002). Optimal Tree Access 
by Elementary and Composite Templates in Parallel Memory Systems. IEEE Transactions on 
Parallel and Distributed Systems. 13(4): 399-411. 
  
Bakoglu, H. B. (1990). Circuits, Interconnects, and Packaging for VLSI. Reading MA: 
Addison-Wesley. 
 
Bhagwan, R., Lin, B. (2000a). Fast and Scalable Priority Queue Architecture for High-Speed 
Network Switches. IEEE Annual Conference on Computer Communication (INFOCOM 2000) 
Tel Aviv, Israel: IEEE, vol. 2, 538-547. 
 
Bhagwan, R., Lin, B. (2000b). Design of a High-Speed Packet Switch with Fine-Grained 
Quality-of-Service Guarantees. IEEE International Conference on Communication (ICC 
2000). New Orleans, USA: IEEE, vol. 3, 1430-1434. 
 
Breuer, M. and Shamsa, K. (1981). A Hardware Router. Journal of Digital Systems. 4(4): 393-
408. 
 
Brodal, G. S., Zaroliagis, C. D. and Traff, J. L. (1997). A Parallel Priority Data Structure with 
Applications. The 11th International Parallel Processing Symposium, Geneva, Switzerland. 
689-693. 
 
Brown, R. (1988). Calendar Queues: A Fast O(1) Priority Queue Implementation for the 
Simulation Event Set Problem. Communications of the ACM. October 1988. 31(10): 1220-
1227. 
 
Chao, J. (1991). A Novel Architecture for Queue Management in the ATM Network. IEEE 
Journal on Selected Areas in Communication. 9(7): 1110-1118. 
 
Chu, C. C. N., Wong, D. F. (1997). A new approach to buffer insertion and wire sizing. 
Proceeding IEEE International Conference on Computer Aided Design 1997. San Jose, 
California: IEEE, 614-621. 
 
Chu, C. C. N., Wong, D. F. (1998). A Polynomial Time Optimal Algorithm for Simultaneous 
Buffer and Wire Sizing. Proc. Design Automation & Test 1998. Europe. 479-485. 
 
  
Chu, C. C. N., Wong, D. F. (1999). A Quadratic Programming Approach to Simultaneous 
Buffer Insertion/Sizing and Wire Sizing. IEEE Transactions on Computer Aided Design of 
Integrated Circuits and Systems. 18(6): 787-798. 
 
Cong, J., Kong, T. and Pan, D. Z. (1999). Buffer Block Planning for Interconnect-Driven 
Floorplanning. IEEE/ACM International Conference on Computer Aided Design 1997. San 
Jose, California: IEEE, 358-363. 
 
Cong, J., Lei, H., Koh, C-K., Madden, P. H., (1996). Performance Optimization of VLSI 
Interconnect Layout. Technical Report, Dept. of Computer Science, University of California, 
L.A., 1-99. 
 
Cormen, T. H., Leiserson, C. E., Rivest, R. L. and Stein, C. (2001). Introduction to Algorithms. 
2nd Edition. The MIT Press, McGraw-Hill Book Company. 
 
Das, S. K., Sarkar, F. and Pinotti, M.C. (1996a). Distributed Priority Queues on Hypercube 
Architectures. IEEE, Proceedings of the 16-th International Conference on Distributed 
Computing Systems 1996. Hong Kong: IEEE, 620-627. 
 
Das, S. K., Sarkar, F. and Pinotti, M.C. (1996b). Optimal and Load Balanced Mapping of 
Parallel Priority Queues on Hypercubes. IEEE Transactions on Parallel and Distributed 
Systems. 555-564. 
 
Driscoll, J. R., Gabow, H. N., Shrairman, R. and Tarjan, R. E. (1998). Relaxed Heaps: An 
Alternative to Fibonacci-Heaps with Applications to Parallel Computation. Communications of 
the ACM. 31(11): 1343-1354. 
 
Dechu, S., Shen, Z. C., Chu, C. C. N. (2004). An Efficient Routing Tree Construction 
Algorithm with Buffer Insertion, Wire Sizing and Obstacle Consideration. Proceedings of the 
ASP-DAC 2004. Yokohama, Japan. 
 
Elmore, W. C. (1948). The transient response of damped linear networks with particular 
regard to wide-band amplifiers. Journal of Applied Physics. 19(1): 55-63. 
 
  
Ginneken, L. P. P. P. V. (1990). Buffer Placement in Distributed RC-Tree Networks for 
Minimal Elmore Delay. Proc. International Symposium of Circuits and Systems 1990. 865-
868. 
 
Gupta, A. K. and Phoutiou, A. G. (1994). Load Balanced Priority Queue Implementations on 
Distributed Memory Machine. ACM - Lecture Notes in Computer Science, July 1994. vol. 817, 
pp. 689-700. 
 
Huelsbergen, L. (2000). A Representation for Dynamic Graphs in Reconfigurable Hardware 
and its Application to Fundamental Graph Algorithms. Proc. ACM/SIGDA International 
Symposium on Field Programmable Gate Arrays 2000. Monterey, CA, USA:ACM, 105-115. 
 
Ioannou, A. D. (2000). An ASIC Core for Pipelined Heap Management to Support Scheduling 
in High Speed Networks. Technical Report FORTH-ICS/TR-278 October 2000. Computer 
Architecture and VLSI Systems Laboratory (CRAV), Institute of Computer Science (ICS), 
Foundation for Science and Technology – Hellas (FORTH), University of Crete, Greece. 
Master Degree Thesis. 
 
Ioannou, A. D. and Katevenis, M. (2001). Pipelined Heap (Priority Queue) Management for 
Advanced Scheduling in High-Speed Networks. IEEE International Conference on 
Communications (ICC 2001). Helsinki, Finland: IEEE, vol. 7, 2043-2047. 
 
Jagannathan, A., Hur, S-W. and Lillis, J. (2002). A Fast Algorithm for Context-Aware Buffer 
Insertion. ACM Trans. On Design Automation of Electronic Systems, January 2002. 7(1): 173-
188. 
 
JohnsonBaugh, R. and Schaefer, M. (2004). ALGORITHMS, Pearson Prentice Hall, 2004. 
 
Jones, D. (1986). An Empirical Comparison of Priority-Queue and Event-Set Implementations. 
Communications of the ACM, April 1986. 29(4): 300-311. 
 
Khalil M, Koay K H, (1999). VHDL Module Generator: A Rapid-prototyping Design Entry 
Tool for Digital ASICs. Jurnal Teknologi UTM, December. 31:45-61. 
 
  
Keshk, H., Mori, S., Nakashima, H., Tomita, S. (1996). Amon2: A parallel wire routing 
algorithm on a torus network parallel computer. Proceedings of the 10th international 
conference on Supercomputing, January 1996. 197-204. 
 
Kuiper, F. A. and Mieghem, P. V. (2004a). Concepts of Exact QoS Routing Algorithms. 
ACM/IEEE Trans. on Computer Networking (TON), 2004. 12(5): 851-864. 
 
Kuiper, F. A. and Mieghem, P. V. (2004b). Quality-of-Service Routing in the Internet: Theory, 
Complexity, and Algorithms. Delft University of Technology, Netherlands: PhD Thesis. 
 
Kung, H. T. (1980). The Structure of Parallel Algorithms. Advances in Computers. 19: 65-112. 
Academic Press, Inc. 
 
Lai, M. and Wong, D. F. (2002). Maze routing with buffer insertion and wire sizing. IEEE 
Transaction on Computer-Aided Design of Integrated Circuits and Systems, Oct 2002. 21: 
1205-1209. 
 
Lavoie, P. and Savaria, Y. (1994). A Systolic Architecture for Fast Stack Sequential Decoders. 
IEEE Transaction on Communication, Feb./Mar./Apr. 1994. 42(2/3/4): 324-334. 
 
Lee, C. Y., (1961). An Algorithm for Path Connections and Its Applications. IRE Transactions 
on Electronic Computers, 1961. 
 
Leiserson, C. E. (1979). Systolic Priority Queue. Proceeding Caltech Conference of VLSI. Jan. 
1979. Caltech, Pasadena, California. 200-214. 
 
Meador, J. L. (1995). Spatiotemporal Neural Networks for Shortest Path Optimization. Proc. 
IEEE International Symposium on Circuits and Systems (ISCAS95). Seattle, Washington, USA. 
II801-II804. 
  
Mencer, O., Huang, Z. and Huelsbergen, L. (2002). HAGAR: Multi-Context Hardware Graph 
Accelerators. 12th International Conference of Field Programmable Logic and Applications 
2002. France. 
 
  
Moon, S. W., Rexford, J., Shin, K. G. (2002). Scalable Hardware Priority Queue Architectures 
for High-Speed Packet Switches. IEEE Transaction on Computers, Nov. 2000. 49(11). 
 
Nasir, S., Meador, J. L. (1995). Mixed Signal Neural Circuits for Shortest Path Computation. 
Proc. IEEE Conference on Signals, System and Computers 1995. California, USA. II876-
II880. 
 
Nasir, S., Meador, J. L. (1996). Spatiotemporal Neural Networks for Link-State Routing 
Protocols. Proc. IEEE International Symposium on Circuits and Systems (ISCAS96). Atlanta, 
Georgia. III547-III550. 
 
Nasir, S., Meador, J. L. (1999). A High Precision Current Copying Loser-Take-All Circuit. 
Proc. World Engineering Congress (WEC99). Malaysia. EE177-179. 
  
Nasir, S., Khalil, M., Teoh, G. S. (2002a). Implementation of Recurrent Neural Network for 
Shortest Path Calculation in Network Routing. Proc. IEEE International Symposium on 
Parallel Architectures, Algorithms and Networks (ISPAN), 2002, Manila, Philippines. 313-317. 
 
Nasir, S., Khalil, M., Teoh, G. S. (2002b). Design and Implementation of a Shortest Path 
Processor for Network Routing. Proc. 2nd World Engineering Congress (WEC ‘02), Malaysia. 
EE175-179. 
 
Nasir, S., Khalil, M. (2005). Multi-Constrained Routing Algorithm for Minimizing 
Interconnect Wire Delay. Universiti Teknologi Malaysia: Ph.D. Research Proposal. 
 
Nasir, S., Khalil, M., Ch’ng, H. S. (2006). Simultaneous Maze Routing and Buffer Insertion, 
VLSI-ECAD Research Laboratory, Universiti Teknologi Malaysia: Technical Report VLSI-
ECAD-TR-NSH-021-06. 
 
Nestor, J. A. (2002). A New Look at Hardware Maze Routing. Proceedings of the 12th 
ACM Great Lakes Symposium on VLSI. April 18-19, 2002. New York, USA. 142-147. 
 
  
Picker, D. and Fellman, R. (1995). A VLSI Priority Packet Queue with Inheritance and 
Overwrite. IEEE Transaction on Very Large Scale Integration Systems, June 1995. 3(2): 245-
252. 
 
Prasad, S. and Deo, N. (1992). Parallel Heap: Improved and Simplified. Proc. IEEE 6th 
International Parallel Processing Symposium 1992. California, USA: IEEE, 448-451. 
 
Prasad, S. and Sawart, S.I. (1995). Parallel Heap: A Practical Priority Queue for Fine-to-
Medium-Grained Applications on Small Multiprocessors. Proc.7th IEEE Symposium on 
Parallel and Distributed Processing 1995. Santa Barbara, CA: IEEE, 328-335 
 
Ranade, A., Cheng, S., Deprit, E., Jones, J. and Shih, S. (1994). Parallelism and Locality in 
Priority Queues. Proceeding 6th IEEE Symposium on Parallel and Distributed Processing. Oct 
1994. Dallas. (99): 490-496. 
 
Rizal, K. G. (1999). Shortest Path Processor Using FPGA. Universiti Teknologi Malaysia: 
Bachelor Degree Thesis. 
 
Rutenbar, R. A. (1984a). A Class of Cellular Computer Architectures to Support Physical 
Design Automation. Univ. of Michigan, Computing Res. Lab.: Ph.D. dissertation, CRL-TR-35-
84. 
 
Rutenbar, R. A., Mudge, T. N. and Atkins, D. E. (1984b). A Class of Cellular Architectures to 
Support Physical Design Automation, IEEE Trans. Computer-Aided Design, Oct. 1984. vol. 
CAD-3: 264-278. 
 
Rutenbar, R. A. and Atkins, D. E. (1988). Systolic Routing Hardware: Performance Evaluation 
and Optimization. IEEE Transaction on Computer-Aided Design, Mar. 1988. vol. 7, 397-410. 
 
Sahni, S. and Won, Y. (1987). A Hardware Accelerator for Maze Routing. Proceeding on 
Design Automation Conference 1987. Miami, Florida:ACM/IEEE, 800-806. 
 
  
Saxena, P., Menezes, N., Cocchini, P. and Kirkpatrick, D. A. (2003). The Scaling Challenge: 
Can Correct-by-Construction Design Help?. Proceeding International Symposium on Physical 
Design 2003. San Diego, CA:ACM/SIGDA 51-58. 
 
Seido, A. I. A., Nowak, B. and Chu, C. (2004). Fitted Elmore Delay, A Simple and Accurate 
Model. IEEE Trans. on VLSI, July 2004. 12(7): 691-696. 
 
Sherwani, N. (1995). Algorithms for VLSI Physical Design Automation, 2nd Edition. Intel 
Corporation: Kluwer Academic Publishers, Toppan Company (S) Pte. Ltd. 
 
Skiena, S. S. (1997). The Algorithm Design Manual. New York: Springer-Verlag. 
 
Suzuki, K., Matsunaga, Y., Tachibana, M. and Ohtsuki, T. (1986). A hardware maze router 
with application to interactive rip-up and reroute, IEEE Trans. Computer-Aided Design. Oct. 
1986. vol. 5, 466-476. 
 
Tommiska, M. and Skytt, J. (2001). Dijkstra’s Shortest Path Routing Algorithm in 
Reconfigurable Hardware. 11th International Conference of Field Programmable Logic and 
Applications. Monterey, CA. 653-657. 
 
Toda, K., Nishida, K., Takahashi, E., Michell, N. and Tamaguchi, Y. (1995). Design and 
Implementation of a Priority Forwarding Router Chip for Real-Time Interconnect Networks. 
Int. J. Mini and Microcomputers 1995. 17(1): 42-51. 
 
Wolf, W. (2002). Modern VLSI Design: System-on-Chip Design, 3/E, Chapter 3, Prentice Hall. 
 
Zhang, W. and Korf, R. E. (1992). Parallel Heap Operations on EREW PRAM: Summary of 
Results. Proc. 6th IEEE International Parallel Processing Symposium, 1992. Beverly Hills, 
CA, USA:IEEE, 315-318. 
 
Zhou, H., Wong, D. F., Liu, I-M. and Aziz, A. (2000). Simultaneous Routing and Buffer 
Insertion with Restrictions on Buffer Locations. IEEE Trans. Computer-Aided Design of 
Integrated Circuits and Systems, July 2000. vol. 19, 819-824. 
 
APPENDIX A 
 
 
 
 
NUMERICAL EXAMPLE OF DIJKSTRA’S ALGORITHM 
 
 
 
 
 This appendix presents a numerical example illustrating the computation of 
Dijkstra’s algorithm. The algorithm is defined in Chapter 2. 
 
Example graph (edge weights are shown on the edges): 
 
  N1 --7-- N2 --2-- N3 
  |        |        | 
  6        1        3 
  |        |        | 
  N4 --4-- N5 --5-- N6 
 
for (each vertex v ∈ V[G]){ 
  d[v] ← ∞ 
  π[v] ← NIL // HERE WE INITIALIZE AS INFINITE ‘∞’ 
}    // NOTE THAT THE PRIORITY QUEUE, PQ, IS EMPTY. 
d[s] ← 0   // TAKE ‘N1’ AS SOURCE NODE. 
S ← Ø   // ‘VISITED-LIST’ IS EMPTY. 
 
         N1   N2   N3   N4   N5   N6 
  d[ ]    0    ∞    ∞    ∞    ∞    ∞ 
  π[ ]    ∞    ∞    ∞    ∞    ∞    ∞ 
 
  Q: 
     Priority-level:           ∞    ∞    ∞    ∞    ∞    ∞ 
     Associated-identifier:    ∞    ∞    ∞    ∞    ∞    ∞ 
 
Initial stage. 
 
 
for (each vertex v ∈ V[G]){ // CONSTRUCT THE PRIORITY QUEUE. 
  INSERT(Q, v, d[v]) 
} 
 
         N1   N2   N3   N4   N5   N6 
  d[ ]    0    ∞    ∞    ∞    ∞    ∞ 
  π[ ]    ∞    ∞    ∞    ∞    ∞    ∞ 
 
  Q: 
     Priority-level:           0    ∞    ∞    ∞    ∞    ∞ 
     Associated-identifier:   N1   N2   N3   N4   N5   N6 
 
(a) Construct the entire priority queue, Q. 
 
do{ 
  (u, d[u]) ← EXTRACT-MIN(Q) // THE HIGHEST PRIORITY IS AT N1 
  S ← S ∪ {u}   // INCLUDED IN ‘VISITED-LIST’ 
  : 
}(while Q ≠ Ø) 
 
         N1   N2   N3   N4   N5   N6 
  d[ ]    0    ∞    ∞    ∞    ∞    ∞ 
  π[ ]    ∞    ∞    ∞    ∞    ∞    ∞ 
 
  Extracted from Q: priority-level 0, identifier N1 
  Q: 
     Priority-level:           ∞    ∞    ∞    ∞    ∞    ∞ 
     Associated-identifier:   N2   N3   N4   N5   N6    ∞ 
 
(b) Extract the highest priority entry from Q, now at N1. 
 
do{ 
  : 
  for (each vertex v ∈ Adj[u]){  // VISIT EACH ADJACENT NODE 
    if (d[v] > d[u] + w(u, v)){ // RELAXATION AT d[N2] AND d[N4]. 
      d[v] ← d[u] + w(u, v) 
      π[v] ← u 
      DECREASE-KEY(Q, v, d[v]) // AT PQ. 
    } 
  } 
}(while Q ≠ Ø) 
 
RELAXATION at N2: d[N2] > d[N1] + w(N1,N2), i.e. ∞ > (0 + 7), so update d[N2]. 
 
         N1   N2   N3   N4   N5   N6 
  d[ ]    0    7    ∞    ∞    ∞    ∞ 
  π[ ]    ∞   N1    ∞    ∞    ∞    ∞ 
 
  Q after DECREASE-KEY at N2: 
     Priority-level:           7    ∞    ∞    ∞    ∞    ∞ 
     Associated-identifier:   N2   N3   N4   N5   N6    ∞ 
  Q after CONSOLIDATE/SORT: 
     Priority-level:           7    ∞    ∞    ∞    ∞    ∞ 
     Associated-identifier:   N2   N3   N4   N5   N6    ∞ 
 
RELAXATION at N4: d[N4] > d[N1] + w(N1,N4), i.e. ∞ > (0 + 6), so update d[N4]. 
 
         N1   N2   N3   N4   N5   N6 
  d[ ]    0    7    ∞    6    ∞    ∞ 
  π[ ]    ∞   N1    ∞   N1    ∞    ∞ 
 
  Q after DECREASE-KEY at N4: 
     Priority-level:           7    ∞    6    ∞    ∞    ∞ 
     Associated-identifier:   N2   N3   N4   N5   N6    ∞ 
  Q after CONSOLIDATE/SORT: 
     Priority-level:           6    7    ∞    ∞    ∞    ∞ 
     Associated-identifier:   N4   N2   N3   N5   N6    ∞ 
 
(c) Vertices adjacent to N1 are relaxed, so DECREASE-KEY is applied to Q. 
 
do{ 
  (u, d[u]) ← EXTRACT-MIN(Q) // THE HIGHEST PRIORITY IS AT N4 
  S ← S ∪ {u}   // INCLUDED IN ‘VISITED-LIST’ 
  : 
}(while Q ≠ Ø) 
 
         N1   N2   N3   N4   N5   N6 
  d[ ]    0    7    ∞    6    ∞    ∞ 
  π[ ]    ∞   N1    ∞   N1    ∞    ∞ 
 
  Extracted from Q: priority-level 6, identifier N4 
  Q: 
     Priority-level:           7    ∞    ∞    ∞    ∞    ∞ 
     Associated-identifier:   N2   N3   N5   N6    ∞    ∞ 
 
(d) Extract the highest priority entry from Q, now at N4. 
 
do{ 
  : 
  for (each vertex v ∈ Adj[u]){  // VISIT EACH ADJACENT NODE 
    if (d[v] > d[u] + w(u, v)){ // RELAXATION AT d[N5]. 
      d[v] ← d[u] + w(u, v) 
      π[v] ← u 
      DECREASE-KEY(Q, v, d[v]) // AT PQ. 
    } 
  } 
}(while Q ≠ Ø) 
 
RELAXATION at N5: d[N5] > d[N4] + w(N4,N5), i.e. ∞ > (6 + 4), so update d[N5]. 
 
         N1   N2   N3   N4   N5   N6 
  d[ ]    0    7    ∞    6   10    ∞ 
  π[ ]    ∞   N1    ∞   N1   N4    ∞ 
 
  Q after DECREASE-KEY at N5: 
     Priority-level:           7    ∞   10    ∞    ∞    ∞ 
     Associated-identifier:   N2   N3   N5   N6    ∞    ∞ 
  Q after CONSOLIDATE/SORT: 
     Priority-level:           7   10    ∞    ∞    ∞    ∞ 
     Associated-identifier:   N2   N5   N3   N6    ∞    ∞ 
 
(e) Vertices adjacent to N4 are relaxed, so DECREASE-KEY is applied to Q. 
 
do{ 
  (u, d[u]) ← EXTRACT-MIN(Q) // THE HIGHEST PRIORITY IS AT N2 
  S ← S ∪ {u}   // INCLUDED IN ‘VISITED-LIST’ 
  : 
}(while Q ≠ Ø) 
 
         N1   N2   N3   N4   N5   N6 
  d[ ]    0    7    ∞    6   10    ∞ 
  π[ ]    ∞   N1    ∞   N1   N4    ∞ 
 
  Extracted from Q: priority-level 7, identifier N2 
  Q: 
     Priority-level:          10    ∞    ∞    ∞    ∞    ∞ 
     Associated-identifier:   N5   N3   N6    ∞    ∞    ∞ 
 
(f) Extract the highest priority entry from Q, now at N2. 
 
do{ 
  : 
  for (each vertex v ∈ Adj[u]){  // VISIT EACH ADJACENT NODE 
    if (d[v] > d[u] + w(u, v)){ // RELAXATION AT d[N3] AND d[N5]. 
      d[v] ← d[u] + w(u, v) 
      π[v] ← u 
      DECREASE-KEY(Q, v, d[v]) // AT PQ. 
    } 
  } 
}(while Q ≠ Ø) 
 
RELAXATION at N3: d[N3] > d[N2] + w(N2,N3), i.e. ∞ > (7 + 2), so update d[N3]. 
 
         N1   N2   N3   N4   N5   N6 
  d[ ]    0    7    9    6   10    ∞ 
  π[ ]    ∞   N1   N2   N1   N4    ∞ 
 
  Q after DECREASE-KEY at N3: 
     Priority-level:          10    9    ∞    ∞    ∞    ∞ 
     Associated-identifier:   N5   N3   N6    ∞    ∞    ∞ 
  Q after CONSOLIDATE/SORT: 
     Priority-level:           9   10    ∞    ∞    ∞    ∞ 
     Associated-identifier:   N3   N5   N6    ∞    ∞    ∞ 
 
(g) Vertices adjacent to N2 are relaxed, so DECREASE-KEY is applied to Q. 
 
RELAXATION at N5: d[N5] > d[N2] + w(N2,N5), i.e. 10 > (7 + 1), so update d[N5]. 
 
         N1   N2   N3   N4   N5   N6 
  d[ ]    0    7    9    6    8    ∞ 
  π[ ]    ∞   N1   N2   N1   N2    ∞ 
 
  Q after DECREASE-KEY at N5: 
     Priority-level:           9    8    ∞    ∞    ∞    ∞ 
     Associated-identifier:   N3   N5   N6    ∞    ∞    ∞ 
  Q after CONSOLIDATE/SORT: 
     Priority-level:           8    9    ∞    ∞    ∞    ∞ 
     Associated-identifier:   N5   N3   N6    ∞    ∞    ∞ 
 
do{ 
  (u, d[u]) ← EXTRACT-MIN(Q) // THE HIGHEST PRIORITY IS AT N5 
  S ← S ∪ {u}   // INCLUDED IN ‘VISITED-LIST’ 
  : 
}(while Q ≠ Ø) 
 
         N1   N2   N3   N4   N5   N6 
  d[ ]    0    7    9    6    8    ∞ 
  π[ ]    ∞   N1   N2   N1   N2    ∞ 
 
  Extracted from Q: priority-level 8, identifier N5 
  Q: 
     Priority-level:           9    ∞    ∞    ∞    ∞    ∞ 
     Associated-identifier:   N3   N6    ∞    ∞    ∞    ∞ 
 
(h) Extract the highest priority entry from Q, now at N5. 
 
do{ 
  : 
  for (each vertex v ∈ Adj[u]){  // VISIT EACH ADJACENT NODE 
    if (d[v] > d[u] + w(u, v)){ // RELAXATION AT d[N6]. 
      d[v] ← d[u] + w(u, v) 
      π[v] ← u 
      DECREASE-KEY(Q, v, d[v]) // AT PQ. 
    } 
  } 
}(while Q ≠ Ø) 
 
RELAXATION at N6: d[N6] > d[N5] + w(N5,N6), i.e. ∞ > (8 + 5), so update d[N6]. 
 
         N1   N2   N3   N4   N5   N6 
  d[ ]    0    7    9    6    8   13 
  π[ ]    ∞   N1   N2   N1   N2   N5 
 
  Q after DECREASE-KEY at N6: 
     Priority-level:           9   13    ∞    ∞    ∞    ∞ 
     Associated-identifier:   N3   N6    ∞    ∞    ∞    ∞ 
  Q after CONSOLIDATE/SORT: 
     Priority-level:           9   13    ∞    ∞    ∞    ∞ 
     Associated-identifier:   N3   N6    ∞    ∞    ∞    ∞ 
 
(i) Vertices adjacent to N5 are relaxed, so DECREASE-KEY is applied to Q. 
 
do{ 
  (u, d[u]) ← EXTRACT-MIN(Q) // THE HIGHEST PRIORITY IS AT N3 
  S ← S ∪ {u}   // INCLUDED IN ‘VISITED-LIST’ 
  : 
}(while Q ≠ Ø) 
 
         N1   N2   N3   N4   N5   N6 
  d[ ]    0    7    9    6    8   13 
  π[ ]    ∞   N1   N2   N1   N2   N5 
 
  Extracted from Q: priority-level 9, identifier N3 
  Q: 
     Priority-level:          13    ∞    ∞    ∞    ∞    ∞ 
     Associated-identifier:   N6    ∞    ∞    ∞    ∞    ∞ 
 
(j) Extract the highest priority entry from Q, now at N3. 
 
do{ 
  : 
  for (each vertex v ∈ Adj[u]){  // VISIT EACH ADJACENT NODE 
    if (d[v] > d[u] + w(u, v)){ // RELAXATION AT d[N6]. 
      d[v] ← d[u] + w(u, v) 
      π[v] ← u 
      DECREASE-KEY(Q, v, d[v]) // AT PQ. 
    } 
  } 
}(while Q ≠ Ø) 
 
RELAXATION at N6: d[N6] > d[N3] + w(N3,N6), i.e. 13 > (9 + 3), so update d[N6] and set π[N6] to N3. 
 
         N1   N2   N3   N4   N5   N6 
  d[ ]    0    7    9    6    8   12 
  π[ ]    ∞   N1   N2   N1   N2   N3 
 
  Q after DECREASE-KEY at N6: 
     Priority-level:          12    ∞    ∞    ∞    ∞    ∞ 
     Associated-identifier:   N6    ∞    ∞    ∞    ∞    ∞ 
  Q after CONSOLIDATE/SORT: 
     Priority-level:          12    ∞    ∞    ∞    ∞    ∞ 
     Associated-identifier:   N6    ∞    ∞    ∞    ∞    ∞ 
 
(k) Vertices adjacent to N3 are relaxed, so DECREASE-KEY is applied to Q. 
 
 
 
 
 
 
 
do{ 
  (u, d[u]) ← EXTRACT-MIN(Q) // THE HIGHEST PRIORITY IS AT N6 
  S ← S ∪ {u}   // INCLUDED IN ‘VISITED-LIST’ 
  for (each vertex v ∈ Adj[u]){  // NO MORE ADJACENT NODES FOR N6 
    : 
  } 
}(while Q ≠ Ø)   // PQ IS EMPTY. 
 
         N1   N2   N3   N4   N5   N6 
  d[ ]    0    7    9    6    8   12 
  π[ ]    ∞   N1   N2   N1   N2   N3 
 
  Extracted from Q: priority-level 12, identifier N6 
  Q: 
     Priority-level:           ∞    ∞    ∞    ∞    ∞    ∞ 
     Associated-identifier:    ∞    ∞    ∞    ∞    ∞    ∞ 
 
RESULT 
TRACE-BACK d[ ] AND π[ ], THE SHORTEST PATH FROM N1 TO: 
N2 is to follow the track N1 → N2,   with COST = 7; 
N3 is to follow the track N1 → N2 → N3,  with COST = 9; 
N4 is to follow the track N1 → N4,   with COST = 6; 
N5 is to follow the track N1 → N2 → N5,  with COST = 8; 
N6 is to follow the track N1 → N2 → N3 → N6,  with COST = 12. 
 
(l) Extract the highest priority entry from Q, now at N6. All vertices are visited. Exit. 
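 
The walkthrough above can be checked with a short Python sketch. This is a minimal software stand-in, not the hwPQ: the queue is a plain dictionary scanned linearly for its minimum, the edges are assumed to be undirected, and the names `EDGES`, `dijkstra` and `trace_back` are illustrative.

```python
import math

# Example graph of this appendix; treating edges as undirected is an assumption.
EDGES = [("N1", "N2", 7), ("N2", "N3", 2), ("N1", "N4", 6), ("N2", "N5", 1),
         ("N3", "N6", 3), ("N4", "N5", 4), ("N5", "N6", 5)]


def build_adjacency(edges):
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))
    return adj


def dijkstra(adj, source):
    d = {v: math.inf for v in adj}
    pi = {v: None for v in adj}
    d[source] = 0
    q = dict(d)                    # INSERT every vertex with key d[v]
    while q:
        u = min(q, key=q.get)      # EXTRACT-MIN by linear scan
        del q[u]
        for v, w in adj[u]:
            # Relax only vertices still in Q (visited ones are already final).
            if v in q and d[v] > d[u] + w:
                d[v] = d[u] + w
                pi[v] = u
                q[v] = d[v]        # DECREASE-KEY
    return d, pi


def trace_back(pi, target):
    path = []
    while target is not None:
        path.append(target)
        target = pi[target]
    return path[::-1]


d, pi = dijkstra(build_adjacency(EDGES), "N1")
print(d)
print(trace_back(pi, "N6"))
```

Running it reproduces the final d[ ] row of the example and traces the cost-12 route to N6 through N3.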
APPENDIX B 
 
 
 
 
NUMERICAL EXAMPLE OF  
HOP-BY-HOP DIJKSTRA’S ALGORITHM 
 
 
 
 
 This appendix presents a numerical example demonstrating the 
computation of the hop-by-hop Dijkstra’s algorithm. The algorithm definition is given in 
Chapter 3. 
 
Example graph (edge weights are shown on the edges): 
 
  N1 --7-- N2 --2-- N3 
  |        |        | 
  6        1        3 
  |        |        | 
  N4 --4-- N5 --5-- N6 
 
for (each vertex v ∈ V[G]){ 
  d[v] ← ∞ 
  π[v] ← NIL // HERE WE INITIALIZE AS INFINITE ‘∞’ 
}    // NOTE THAT THE PRIORITY QUEUE, PQ, IS EMPTY. 
 
         N1   N2   N3   N4   N5   N6 
  d[ ]    ∞    ∞    ∞    ∞    ∞    ∞ 
  π[ ]    ∞    ∞    ∞    ∞    ∞    ∞ 
 
  Q: 
     Priority-level:           ∞    ∞    ∞    ∞    ∞    ∞ 
     Associated-identifier:    ∞    ∞    ∞    ∞    ∞    ∞ 
 
Initial stage, Q is empty. 
 
d[s] ← 0  // TAKE ‘N1’ AS SOURCE NODE. 
INSERT(Q, s, d[s]) // INSERT ENTRY INTO PQ. 
S ← Ø  // ‘VISITED-LIST’ IS EMPTY. 
 
         N1   N2   N3   N4   N5   N6 
  d[ ]    0    ∞    ∞    ∞    ∞    ∞ 
  π[ ]    ∞    ∞    ∞    ∞    ∞    ∞ 
 
  Q: 
     Priority-level:           0    ∞    ∞    ∞    ∞    ∞ 
     Associated-identifier:   N1    ∞    ∞    ∞    ∞    ∞ 
 
(a) At the source node, i.e. N1, insert into Q. 
 
do{ 
  (u, d[u]) ← EXTRACT-MIN(Q) // THE HIGHEST PRIORITY IS AT N1 
  S ← S ∪ {u}   // INCLUDED IN ‘VISITED-LIST’ 
  : 
}(while Q ≠ Ø) 
 
         N1   N2   N3   N4   N5   N6 
  d[ ]    0    ∞    ∞    ∞    ∞    ∞ 
  π[ ]    ∞    ∞    ∞    ∞    ∞    ∞ 
 
  Extracted from Q: priority-level 0, identifier N1 
  Q: 
     Priority-level:           ∞    ∞    ∞    ∞    ∞    ∞ 
     Associated-identifier:    ∞    ∞    ∞    ∞    ∞    ∞ 
 
(b) Now enter the main loop; extract the highest priority entry from Q. 
 
do{ 
  : 
  for (each vertex v ∈ Adj[u]){  // VISIT EACH ADJACENT NODE 
    if (d[v] = ∞){  // IF NOT VISITED YET 
      d[v] ← d[u] + w(u, v) // COMPUTE THE COST 
      π[v] ← u  // UPDATE PRECEDENCE-LIST 
      INSERT(Q, v, d[v]) // INSERT INTO PQ 
    } 
    elseif (d[v] > d[u] + w(u, v)){ 
      : 
    } 
  } 
}(while Q ≠ Ø) 
 
         N1   N2   N3   N4   N5   N6 
  d[ ]    0    7    ∞    6    ∞    ∞ 
  π[ ]    ∞   N1    ∞   N1    ∞    ∞ 
 
  Q: 
     Priority-level:           6    7    ∞    ∞    ∞    ∞ 
     Associated-identifier:   N4   N2    ∞    ∞    ∞    ∞ 
 
(c) Visit the vertices adjacent to N1, insert Q-entries. 
 
do{ 
  (u, d[u]) ← EXTRACT-MIN(Q) // THE HIGHEST PRIORITY IS AT N4 
  S ← S ∪ {u}   // INCLUDED IN ‘VISITED-LIST’ 
  : 
}(while Q ≠ Ø) 
 
         N1   N2   N3   N4   N5   N6 
  d[ ]    0    7    ∞    6    ∞    ∞ 
  π[ ]    ∞   N1    ∞   N1    ∞    ∞ 
 
  Extracted from Q: priority-level 6, identifier N4 
  Q: 
     Priority-level:           7    ∞    ∞    ∞    ∞    ∞ 
     Associated-identifier:   N2    ∞    ∞    ∞    ∞    ∞ 
 
(d) Extract the highest priority entry from Q, now at N4. 
 
do{ 
  : 
  for (each vertex v ∈ Adj[u]){  // VISIT EACH ADJACENT NODE 
    if (d[v] = ∞){  // IF NOT VISITED YET 
      d[v] ← d[u] + w(u, v) // THE COST = 6 + 4 = 10 
      π[v] ← u  // UPDATE PRECEDENCE-LIST 
      INSERT(Q, v, d[v]) // INSERT INTO PQ 
    } 
    elseif (d[v] > d[u] + w(u, v)){ 
      : 
    } 
  } 
}(while Q ≠ Ø) 
 
         N1   N2   N3   N4   N5   N6 
  d[ ]    0    7    ∞    6   10    ∞ 
  π[ ]    ∞   N1    ∞   N1   N4    ∞ 
 
  Q: 
     Priority-level:           7   10    ∞    ∞    ∞    ∞ 
     Associated-identifier:   N2   N5    ∞    ∞    ∞    ∞ 
 
(e) Visit the vertices adjacent to N4, insert Q-entries. 
 
do{ 
  (u, d[u]) ← EXTRACT-MIN(Q) // THE HIGHEST PRIORITY IS AT N2 
  S ← S ∪ {u}   // INCLUDED IN ‘VISITED-LIST’ 
  : 
}(while Q ≠ Ø) 
 
         N1   N2   N3   N4   N5   N6 
  d[ ]    0    7    ∞    6   10    ∞ 
  π[ ]    ∞   N1    ∞   N1   N4    ∞ 
 
  Extracted from Q: priority-level 7, identifier N2 
  Q: 
     Priority-level:          10    ∞    ∞    ∞    ∞    ∞ 
     Associated-identifier:   N5    ∞    ∞    ∞    ∞    ∞ 
 
(f) Extract the highest priority entry from Q, now at N2. 
 
do{ 
  : 
  for (each vertex v ∈ Adj[u]){  // VISIT EACH ADJACENT NODE 
    if (d[v] = ∞){  // IF NOT VISITED YET 
      d[v] ← d[u] + w(u, v) // COST = 7 + 2 = 9 
      π[v] ← u  // UPDATE PRECEDENCE-LIST 
      INSERT(Q, v, d[v]) // INSERT INTO PQ 
    } 
    elseif (d[v] > d[u] + w(u, v)){ 
      : 
    } 
  } 
}(while Q ≠ Ø) 
 
         N1   N2   N3   N4   N5   N6 
  d[ ]    0    7    9    6   10    ∞ 
  π[ ]    ∞   N1   N2   N1   N4    ∞ 
 
  Q: 
     Priority-level:           9   10    ∞    ∞    ∞    ∞ 
     Associated-identifier:   N3   N5    ∞    ∞    ∞    ∞ 
 
(g) Visit the vertices adjacent to N2, insert Q-entries. 
 
do{ 
  : 
  for (each vertex v ∈ Adj[u]){  // VISIT EACH ADJACENT NODE 
    if (d[v] = ∞){ 
      : 
    } 
    elseif (d[v] > d[u] + w(u, v)){ // ELSE ( 10 > 7 + 1 ) 
      d[v] ← d[u] + w(u, v)  // UPDATE 
      π[v] ← u   // UPDATE 
      DECREASE-KEY(Q, v, d[v]) // UPDATE 
    } 
  } 
}(while Q ≠ Ø) 
 
         N1   N2   N3   N4   N5   N6 
  d[ ]    0    7    9    6    8    ∞ 
  π[ ]    ∞   N1   N2   N1   N2    ∞ 
 
  Q: 
     Priority-level:           8    9    ∞    ∞    ∞    ∞ 
     Associated-identifier:   N5   N3    ∞    ∞    ∞    ∞ 
 
(h) Visit the vertices adjacent to N2. Note that N5 was reached before, so DECREASE-KEY is applied to Q (if needed). 
 
do{ 
  (u, d[u]) ← EXTRACT-MIN(Q) // THE HIGHEST PRIORITY IS AT N5 
  S ← S ∪ {u}   // INCLUDED IN ‘VISITED-LIST’ 
  : 
}(while Q ≠ Ø) 
 
         N1   N2   N3   N4   N5   N6 
  d[ ]    0    7    9    6    8    ∞ 
  π[ ]    ∞   N1   N2   N1   N2    ∞ 
 
  Extracted from Q: priority-level 8, identifier N5 
  Q: 
     Priority-level:           9    ∞    ∞    ∞    ∞    ∞ 
     Associated-identifier:   N3    ∞    ∞    ∞    ∞    ∞ 
 
(i) Extract the highest priority entry from Q, now at N5. 
 
do{ 
  : 
  for (each vertex v ∈ Adj[u]){  // VISIT EACH ADJACENT NODE 
    if (d[v] = ∞){  // IF NOT VISITED YET 
      d[v] ← d[u] + w(u, v) // COST = 8 + 5 = 13 
      π[v] ← u  // UPDATE PRECEDENCE-LIST 
      INSERT(Q, v, d[v]) // INSERT INTO PQ 
    } 
    elseif (d[v] > d[u] + w(u, v)){ 
      : 
    } 
  } 
}(while Q ≠ Ø) 
 
         N1   N2   N3   N4   N5   N6 
  d[ ]    0    7    9    6    8   13 
  π[ ]    ∞   N1   N2   N1   N2   N5 
 
  Q: 
     Priority-level:           9   13    ∞    ∞    ∞    ∞ 
     Associated-identifier:   N3   N6    ∞    ∞    ∞    ∞ 
 
(j) Visit the vertices adjacent to N5, insert Q-entries. 
 
do{ 
  (u, d[u]) ← EXTRACT-MIN(Q) // THE HIGHEST PRIORITY IS AT N3 
  S ← S ∪ {u}   // INCLUDED IN ‘VISITED-LIST’ 
  : 
}(while Q ≠ Ø) 
 
         N1   N2   N3   N4   N5   N6 
  d[ ]    0    7    9    6    8   13 
  π[ ]    ∞   N1   N2   N1   N2   N5 
 
  Extracted from Q: priority-level 9, identifier N3 
  Q: 
     Priority-level:          13    ∞    ∞    ∞    ∞    ∞ 
     Associated-identifier:   N6    ∞    ∞    ∞    ∞    ∞ 
 
(k) Extract the highest priority entry from Q, now at N3. 
 
do{ 
  : 
  for (each vertex v ∈ Adj[u]){  // VISIT EACH ADJACENT NODE 
    if (d[v] = ∞){ 
      : 
    } 
    elseif (d[v] > d[u] + w(u, v)){ // ELSE ( 13 > 9 + 3 ) 
      d[v] ← d[u] + w(u, v)  // UPDATE 
      π[v] ← u   // UPDATE 
      DECREASE-KEY(Q, v, d[v]) // UPDATE 
    } 
  } 
}(while Q ≠ Ø) 
 
         N1   N2   N3   N4   N5   N6 
  d[ ]    0    7    9    6    8   12 
  π[ ]    ∞   N1   N2   N1   N2   N3 
 
  Q: 
     Priority-level:          12    ∞    ∞    ∞    ∞    ∞ 
     Associated-identifier:   N6    ∞    ∞    ∞    ∞    ∞ 
 
(l) Visit the vertices adjacent to N3. N6 was reached before, so DECREASE-KEY is applied to Q. 
 
do{ 
  (u, d[u]) ← EXTRACT-MIN(Q) // THE HIGHEST PRIORITY IS AT N6 
  S ← S ∪ {u}   // INCLUDED IN ‘VISITED-LIST’ 
  for (each vertex v ∈ Adj[u]){  // NO MORE ADJACENT NODES FOR N6 
    : 
  } 
}(while Q ≠ Ø)   // PQ IS EMPTY. 
 
         N1   N2   N3   N4   N5   N6 
  d[ ]    0    7    9    6    8   12 
  π[ ]    ∞   N1   N2   N1   N2   N3 
 
  Extracted from Q: priority-level 12, identifier N6 
  Q: 
     Priority-level:           ∞    ∞    ∞    ∞    ∞    ∞ 
     Associated-identifier:    ∞    ∞    ∞    ∞    ∞    ∞ 
 
RESULT 
TRACE-BACK d[ ] AND π[ ], THE SHORTEST PATH FROM N1 TO: 
N2 is to follow the track N1 → N2,   with COST = 7; 
N3 is to follow the track N1 → N2 → N3,  with COST = 9; 
N4 is to follow the track N1 → N4,   with COST = 6; 
N5 is to follow the track N1 → N2 → N5,  with COST = 8; 
N6 is to follow the track N1 → N2 → N3 → N6,  with COST = 12. 
 
(m) Extract the highest priority entry from Q, now at N6. All vertices are visited. Exit. 
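 
The hop-by-hop variant can be sketched in Python in the same spirit. A vertex enters the queue only when it is first reached, so the queue stays much shorter than in Appendix A. The dictionary-based queue, the undirected edges and the `max_queue_len` counter are assumptions made for illustration only.

```python
import math

# Example graph of this appendix; treating edges as undirected is an assumption.
EDGES = [("N1", "N2", 7), ("N2", "N3", 2), ("N1", "N4", 6), ("N2", "N5", 1),
         ("N3", "N6", 3), ("N4", "N5", 4), ("N5", "N6", 5)]


def hop_by_hop_dijkstra(edges, source):
    adj = {}
    for u, v, w in edges:
        adj.setdefault(u, []).append((v, w))
        adj.setdefault(v, []).append((u, w))

    d = {v: math.inf for v in adj}
    pi = {v: None for v in adj}
    d[source] = 0
    q = {source: 0}                  # only the source is inserted up front
    max_queue_len = 1                # illustrative counter, not part of the algorithm
    while q:
        u = min(q, key=q.get)        # EXTRACT-MIN by linear scan
        del q[u]
        for v, w in adj[u]:
            if d[v] == math.inf:     # not reached yet: INSERT
                d[v] = d[u] + w
                pi[v] = u
                q[v] = d[v]
            elif d[v] > d[u] + w:    # reached before: DECREASE-KEY
                d[v] = d[u] + w
                pi[v] = u
                q[v] = d[v]
            max_queue_len = max(max_queue_len, len(q))
    return d, pi, max_queue_len


d, pi, max_queue_len = hop_by_hop_dijkstra(EDGES, "N1")
print(d, pi["N6"], max_queue_len)
```

On this graph the queue never holds more than two entries, whereas the Appendix A version starts with all six vertices queued.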
APPENDIX C 
 
 
 
 
NUMERICAL EXAMPLE OF S-RABI ALGORITHM 
 
 
 
 
 This appendix presents a numerical example illustrating the computation of the 
Simultaneous Maze Routing and Buffer Insertion (S-RABI) algorithm proposed by 
Nasir et al. (2006). The algorithm definition is given in Chapter 3.  
 
 
For illustration purposes, assume a simple obstacle-free graph 
in which all vertices are arranged in one row (i.e. a one-dimensional graph): 
 
  N1 ---- N2 ---- N3 ---- N4 
 
Recall that each vertex v keeps a v-list L[v] of candidate-datasets D(v[k]), where k ≥ 0 is an integer. 
For example, L[N1] holds D(N1[0]), D(N1[1]), D(N1[2]), ... 
Each candidate-dataset contains: 
1. u  : the previous vertex. 
2. uk : index of the corresponding candidate-dataset in the previous vertex. 
3. e  : type of interconnect. 
4. r  : propagated resistance. 
5. t  : propagated delay. 
6. sf : status flag. 
 
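 
One way to picture a candidate-dataset and its (u, uk) back-links is the Python sketch below. The class and function names are hypothetical, the field values are the ones assumed later in this appendix, and the tags "w" and "b" are taken to mean wire and buffer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Candidate:
    u: Optional[str]    # previous vertex (None at the source)
    uk: Optional[int]   # index of the parent candidate-dataset in L[u]
    e: Optional[str]    # interconnect type: "w" (wire) or "b" (buffer)
    r: float            # propagated resistance
    t: float            # propagated delay
    valid: bool = True  # status flag sf


def trace_back(v_lists, vertex, k):
    """Follow the (u, uk) links from D(vertex[k]) back to the source."""
    hops = []
    while vertex is not None:
        cand = v_lists[vertex][k]
        hops.append((vertex, k, cand.e))
        vertex, k = cand.u, cand.uk
    return hops[::-1]


v_lists = {
    "N1": [Candidate(None, None, None, 5, 0)],
    "N2": [Candidate("N1", 0, "w", 10, 2), Candidate("N1", 0, "b", 9, 3)],
    "N3": [Candidate("N2", 0, "w", 14, 8)],
}
route = trace_back(v_lists, "N3", 0)
print(route)
```

Because every candidate-dataset records both the previous vertex and the index within that vertex's v-list, a route can be recovered even when a vertex holds several competing candidates.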
 
for (each vertex v ∈ V[G]){  
  L[v] ← NIL  // initially, all v-lists are empty. 
}    // note that the priority queue, Q, is also empty. 
estimated_delay ← ∞  // the estimated end-to-end delay is initialized to infinity. 
 
  Q  Priority-level:  ∞  ∞  ∞  ∞  ∞  ∞ 
     Identifier:      ∞  ∞  ∞  ∞  ∞  ∞ 
  estimated_delay         = ∞ 
  estimated_end_candidate = ? 
 
  v-list  candidate  u    uk   e    r    t    sf 
  L[N1]   (empty) 
  L[N2]   (empty) 
  L[N3]   (empty) 
  L[N4]   (empty) 
 
Initial stage: the v-list at each vertex is empty, and Q is also empty. 
 
D(N1[0]) = {NIL, NIL, NIL, Rs, 0, VALID} // 1st candidate at source, N1; assume Rs = r = 5, t = 0. 
L[N1] ← L[N1] ∪ D(N1[0])  // insert into v-list at source, L[N1]. 
INSERT(Q, D(N1[0]), t ∈ D(N1[0])) // insert into PQ 
 
  Q  Priority-level:  0         ∞  ∞  ∞  ∞  ∞ 
     Identifier:      D(N1[0])  ∞  ∞  ∞  ∞  ∞ 
  estimated_delay         = ∞ 
  estimated_end_candidate = ? 
 
  v-list  candidate  u    uk   e    r    t    sf 
  L[N1]   D(N1[0])   NIL  NIL  NIL  5    0    V 
  L[N2]   (empty) 
  L[N3]   (empty) 
  L[N4]   (empty) 
 
(a) The first candidate-dataset is created at the source, N1, and inserted into Q.  
 
do{ 
  do{ 
    (D(u[k]), t ∈ D(u[k])) ← EXTRACT-MIN(Q) // extract the highest priority entry. 
  }(while sf ∈ D(u[k]) == NON_VALID)   // continue if entry is invalid. 
  :     // candidate D(N1[0]) is valid. 
  : 
} 
 
  Extracted from Q: priority-level 0, identifier D(N1[0]) 
  Q  Priority-level:  ∞  ∞  ∞  ∞  ∞  ∞ 
     Identifier:      ∞  ∞  ∞  ∞  ∞  ∞ 
  estimated_delay         = ∞ 
  estimated_end_candidate = ? 
 
  v-list  candidate  u    uk   e    r    t    sf 
  L[N1]   D(N1[0])   NIL  NIL  NIL  5    0    V 
  L[N2]   (empty) 
  L[N3]   (empty) 
  L[N4]   (empty) 
 
(b) Extract the highest priority candidate from Q.  
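 
The inner do-while loop above keeps extracting until a valid entry appears, so stale entries can remain in the queue and are simply skipped when they surface. A minimal Python sketch of that idea follows, using the standard `heapq` module in place of the hwPQ. The candidate labels and the explicit invalidation are purely illustrative, and the empty-queue check is an addition that the pseudocode does not show.

```python
import heapq


def extract_min_valid(queue, valid):
    """Pop entries until one whose candidate is still marked valid is found."""
    while queue:
        t, name = heapq.heappop(queue)
        if valid.get(name, False):
            return t, name
    return None  # queue exhausted without finding a valid entry


queue = []
valid = {}
for t, name in [(3, "D(N2[1])"), (2, "D(N2[0])"), (8, "D(N3[0])")]:
    heapq.heappush(queue, (t, name))
    valid[name] = True

valid["D(N2[0])"] = False  # hypothetical: pretend this candidate was invalidated

result = extract_min_valid(queue, valid)
print(result)  # (3, 'D(N2[1])')
```

Flagging an entry as invalid avoids a delete-from-the-middle operation on the queue; the cost is an occasional wasted extraction.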
 
if (estimated_delay > t ∈ D(u[k])) { 
  for (each vertex v ∈ Adj[u]) { 
    if (v ∈ OW[G]’) { // if v is not a wire-obstacle. 
      for each w ∈ W { 
        (rv, tv) ← Cost(r ∈ D(u[k]), t ∈ D(u[k]), w[i]) // assume r = 10, t = 2; 
        if (tv < estimated_delay)   // 2 < ∞ 
          {  InsertCandidate(D(u[k]), v, rv, tv, w[i], L[v])  } // insert into PQ, details skipped 
        : 
 
  Q  Priority-level:  2         ∞  ∞  ∞  ∞  ∞ 
     Identifier:      D(N2[0])  ∞  ∞  ∞  ∞  ∞ 
  estimated_delay         = ∞ 
  estimated_end_candidate = ? 
 
  v-list  candidate  u    uk   e    r    t    sf 
  L[N1]   D(N1[0])   NIL  NIL  NIL  5    0    V 
  L[N2]   D(N2[0])   N1   0    w    10   2    V 
  L[N3]   (empty) 
  L[N4]   (empty) 
 
(c) Scan the neighbour-vertex and try the wire-candidate; the candidate is not dominated, so insert it into Q.  
 
if (v ∈ OB[G]’) { // if v is not a buffer-obstacle. 
  for each b ∈ B { 
    (rv, tv) ← Cost(r ∈ D(u[k]), t ∈ D(u[k]), b[i])  // assume r = 9, t = 3; 
    if (tv < estimated_delay)   // 3 < ∞; 
      {  InsertCandidate(D(u[k]), v, rv, tv, b[i], L[v])  } // insert into PQ, details skipped; 
  } 
} // end buffer trials 
: 
 
  Q  Priority-level:  2         3         ∞  ∞  ∞  ∞ 
     Identifier:      D(N2[0])  D(N2[1])  ∞  ∞  ∞  ∞ 
  estimated_delay         = ∞ 
  estimated_end_candidate = ? 
 
  v-list  candidate  u    uk   e    r    t    sf 
  L[N1]   D(N1[0])   NIL  NIL  NIL  5    0    V 
  L[N2]   D(N2[0])   N1   0    w    10   2    V 
          D(N2[1])   N1   0    b    9    3    V 
  L[N3]   (empty) 
  L[N4]   (empty) 
 
(d) Scan the neighbour-vertex and try buffer insertion; the candidate is not dominated, so insert it into Q.  
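 
The details of InsertCandidate are skipped above. As a hedged illustration only, the Python sketch below assumes the common pruning rule that a candidate is dominated when another candidate in the same v-list is no worse in both resistance r and delay t. This rule and the function names are assumptions, not the definition from Chapter 3, and the sketch does not model invalidating older candidates.

```python
def dominates(a, b):
    """Assumed rule: candidate a = (r, t) dominates b if it is no worse in both."""
    return a[0] <= b[0] and a[1] <= b[1]


def insert_if_not_dominated(v_list, new):
    """Append `new` to the v-list unless an existing candidate dominates it."""
    if any(dominates(old, new) for old in v_list):
        return False
    v_list.append(new)
    return True


l_n2 = []
kept_wire = insert_if_not_dominated(l_n2, (10, 2))    # wire candidate, r = 10, t = 2
kept_buffer = insert_if_not_dominated(l_n2, (9, 3))   # buffer candidate, r = 9, t = 3
kept_extra = insert_if_not_dominated(l_n2, (12, 4))   # hypothetical third candidate
print(kept_wire, kept_buffer, kept_extra, l_n2)
```

Under this assumed rule both candidates of the example survive at N2, since the buffer candidate has lower resistance while the wire candidate has lower delay; the hypothetical third candidate is rejected because it is worse on both counts.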
 
do{ 
  do{ 
    (D(u[k]), t ∈ D(u[k])) ← EXTRACT-MIN(Q) // extract the highest priority entry. 
  }(while sf ∈ D(u[k]) == NON_VALID)   // continue if entry is invalid. 
  :     // candidate D(N2[0]) is valid. 
  : 
} 
 
  Extracted from Q: priority-level 2, identifier D(N2[0]) 
  Q  Priority-level:  3         ∞  ∞  ∞  ∞  ∞ 
     Identifier:      D(N2[1])  ∞  ∞  ∞  ∞  ∞ 
  estimated_delay         = ∞ 
  estimated_end_candidate = ? 
 
  v-list  candidate  u    uk   e    r    t    sf 
  L[N1]   D(N1[0])   NIL  NIL  NIL  5    0    V 
  L[N2]   D(N2[0])   N1   0    w    10   2    V 
          D(N2[1])   N1   0    b    9    3    V 
  L[N3]   (empty) 
  L[N4]   (empty) 
 
(e) All interconnect choices have been tried; now extract the highest priority candidate from Q.  
 
if (estimated_delay > t ∈ D(u[k])) { 
  for (each vertex v ∈ Adj[u]) { 
    if (v ∈ OW[G]’) { // if v is not a wire-obstacle. 
      for each w ∈ W { 
        (rv, tv) ← Cost(r ∈ D(u[k]), t ∈ D(u[k]), w[i]) // assume r = 14, t = 8; 
        if (tv < estimated_delay)   // 8 < ∞ 
          {  InsertCandidate(D(u[k]), v, rv, tv, w[i], L[v])  } // insert into PQ, details skipped 
        : 
 
  Q  Priority-level:  3         8         ∞  ∞  ∞  ∞ 
     Identifier:      D(N2[1])  D(N3[0])  ∞  ∞  ∞  ∞ 
  estimated_delay         = ∞ 
  estimated_end_candidate = ? 
 
  v-list  candidate  u    uk   e    r    t    sf 
  L[N1]   D(N1[0])   NIL  NIL  NIL  5    0    V 
  L[N2]   D(N2[0])   N1   0    w    10   2    V 
          D(N2[1])   N1   0    b    9    3    V 
  L[N3]   D(N3[0])   N2   0    w    14   8    V 
  L[N4]   (empty) 
 
(f) Scan the neighbour-vertex and try the wire-candidate; the candidate is not dominated, so insert it into Q.  
 
if (v ∈ OB[G]’) {   // if v is not a buffer-obstacle.
    for each b ∈ B {
        (rv, tv) ← Cost(r ∈ D(u[k]), t ∈ D(u[k]), b[i])   // assume r = 3, t = 20;
        if (tv < estimated_delay)   // 20 < ∞;
            { InsertCandidate(D(u[k]), v, rv, tv, b[i], L[v]) }   // insert into PQ, details skipped;
    }
}   // end buffer trials
:

Q keys:    3 8 20 ∞ ∞ ∞
Q entries: D(N2[1]) D(N3[0]) D(N3[1]) ∞ ∞ ∞
estimated_delay = ∞
estimated_end_candidate = ?
Candidate records (u, uk, e, r, t, sf):
L[N1]: D(N1[0]) = (NIL, NIL, NIL, 5, 0, V)
L[N2]: D(N2[0]) = (N1, 0, w, 10, 2, V), D(N2[1]) = (N1, 0, b, 9, 3, V)
L[N3]: D(N3[0]) = (N2, 0, w, 14, 8, V), D(N3[1]) = (N2, 0, b, 3, 20, V)
L[N4]: (empty)
(g) Scan neighbour-vertex, try with buffer-insertion, candidate not dominated, insert into Q.
 
do {
    do {
        (D(u[k]), t ∈ D(u[k])) ← EXTRACT-MIN(Q)   // extract the highest-priority entry.
    } while (sf ∈ D(u[k]) == NON_VALID)          // continue if the entry is invalid.
    :   // candidate D(N2[1]) is valid.
    :
}

Extracted: key 3, D(N2[1])
Q keys:    8 20 ∞ ∞ ∞ ∞
Q entries: D(N3[0]) D(N3[1]) ∞ ∞ ∞ ∞
estimated_delay = ∞
estimated_end_candidate = ?
Candidate records (u, uk, e, r, t, sf):
L[N1]: D(N1[0]) = (NIL, NIL, NIL, 5, 0, V)
L[N2]: D(N2[0]) = (N1, 0, w, 10, 2, V), D(N2[1]) = (N1, 0, b, 9, 3, V)
L[N3]: D(N3[0]) = (N2, 0, w, 14, 8, V), D(N3[1]) = (N2, 0, b, 3, 20, V)
L[N4]: (empty)
(h) All interconnect choices have been tried. Now extract the highest-priority candidate from Q.
 
if (estimated_delay > t ∈ D(u[k])) {
    for (each vertex v ∈ Adj[u]) {
        if (v ∈ OW[G]’) {   // if v is not a wire-obstacle.
            for each w ∈ W {
                (rv, tv) ← Cost(r ∈ D(u[k]), t ∈ D(u[k]), w[i])   // assume r = 12, t = 8;
                if (tv < estimated_delay)   // 8 < ∞
                    { InsertCandidate(D(u[k]), v, rv, tv, w[i], L[v]) }   // Decrease-Key, details skipped.
                :

Q keys:    8 20 ∞ ∞ ∞ ∞
Q entries: D(N3[0]) D(N3[1]) ∞ ∞ ∞ ∞
estimated_delay = ∞
estimated_end_candidate = ?
Candidate records (u, uk, e, r, t, sf):
L[N1]: D(N1[0]) = (NIL, NIL, NIL, 5, 0, V)
L[N2]: D(N2[0]) = (N1, 0, w, 10, 2, V), D(N2[1]) = (N1, 0, b, 9, 3, V)
L[N3]: D(N3[0]) = (N2, 1, w, 12, 8, V), D(N3[1]) = (N2, 0, b, 3, 20, V)
L[N4]: (empty)
(i) Scan neighbour-vertex, try with wire-candidate, check dominance; the new candidate
dominates the existing D(N3[0]), so Decrease-Key at Q.
 
if (v ∈ OB[G]’) {   // if v is not a buffer-obstacle.
    for each b ∈ B {
        (rv, tv) ← Cost(r ∈ D(u[k]), t ∈ D(u[k]), b[i])   // assume r = 15, t = 7;
        if (tv < estimated_delay)   // 7 < ∞;
            { InsertCandidate(D(u[k]), v, rv, tv, b[i], L[v]) }   // insert into PQ, details skipped;
    }
}   // end buffer trials
:

Q keys:    7 8 20 ∞ ∞ ∞
Q entries: D(N3[2]) D(N3[0]) D(N3[1]) ∞ ∞ ∞
estimated_delay = ∞
estimated_end_candidate = ?
Candidate records (u, uk, e, r, t, sf):
L[N1]: D(N1[0]) = (NIL, NIL, NIL, 5, 0, V)
L[N2]: D(N2[0]) = (N1, 0, w, 10, 2, V), D(N2[1]) = (N1, 0, b, 9, 3, V)
L[N3]: D(N3[0]) = (N2, 1, w, 12, 8, V), D(N3[1]) = (N2, 0, b, 3, 20, V), D(N3[2]) = (N2, 1, b, 15, 7, V)
L[N4]: (empty)
(j) Scan neighbour-vertex, try with buffer-insertion, candidate not dominated, insert into Q.
 
do {
    do {
        (D(u[k]), t ∈ D(u[k])) ← EXTRACT-MIN(Q)   // extract the highest-priority entry.
    } while (sf ∈ D(u[k]) == NON_VALID)          // continue if the entry is invalid.
    :   // candidate D(N3[2]) is valid.
    :
}

Extracted: key 7, D(N3[2])
Q keys:    8 20 ∞ ∞ ∞ ∞
Q entries: D(N3[0]) D(N3[1]) ∞ ∞ ∞ ∞
estimated_delay = ∞
estimated_end_candidate = ?
Candidate records (u, uk, e, r, t, sf):
L[N1]: D(N1[0]) = (NIL, NIL, NIL, 5, 0, V)
L[N2]: D(N2[0]) = (N1, 0, w, 10, 2, V), D(N2[1]) = (N1, 0, b, 9, 3, V)
L[N3]: D(N3[0]) = (N2, 1, w, 12, 8, V), D(N3[1]) = (N2, 0, b, 3, 20, V), D(N3[2]) = (N2, 1, b, 15, 7, V)
L[N4]: (empty)
(k) All interconnect choices have been tried. Now extract the highest-priority candidate from Q.
 
if (estimated_delay > t ∈ D(u[k])) {
    for (each vertex v ∈ Adj[u]) {
        if (v ∈ OW[G]’) {   // if v is not a wire-obstacle.
            for each w ∈ W {
                (rv, tv) ← Cost(r ∈ D(u[k]), t ∈ D(u[k]), w[i])   // assume r = 22, t = 22;
                if (tv < estimated_delay)   // 22 < ∞
                    { InsertCandidate(D(u[k]), v, rv, tv, w[i], L[v]) }   // insert into PQ, details skipped.
                :

Q keys:    8 20 22 ∞ ∞ ∞
Q entries: D(N3[0]) D(N3[1]) D(N4[0]) ∞ ∞ ∞
estimated_delay = 22
estimated_end_candidate = D(N4[0])
Candidate records (u, uk, e, r, t, sf):
L[N1]: D(N1[0]) = (NIL, NIL, NIL, 5, 0, V)
L[N2]: D(N2[0]) = (N1, 0, w, 10, 2, V), D(N2[1]) = (N1, 0, b, 9, 3, V)
L[N3]: D(N3[0]) = (N2, 1, w, 12, 8, V), D(N3[1]) = (N2, 0, b, 3, 20, V), D(N3[2]) = (N2, 1, b, 15, 7, V)
L[N4]: D(N4[0]) = (N3, 0, w, 22, 22, V)
(l) Scan neighbour-vertex, try with wire-candidate, candidate not dominated, insert into Q.
 
if (v ∈ OB[G]’) {   // if v is not a buffer-obstacle.
    for each b ∈ B {
        (rv, tv) ← Cost(r ∈ D(u[k]), t ∈ D(u[k]), b[i])   // assume r = 18, t = 20
        if (tv < estimated_delay)   // 20 < 22;
            { InsertCandidate(D(u[k]), v, rv, tv, b[i], L[v]) }   // Decrease-Key, details skipped;
    }
}   // end buffer trials
:

Q keys:    8 20 20 ∞ ∞ ∞
Q entries: D(N3[0]) D(N3[1]) D(N4[0]) ∞ ∞ ∞
estimated_delay = 20 (updated from 22)
estimated_end_candidate = D(N4[0])
Candidate records (u, uk, e, r, t, sf):
L[N1]: D(N1[0]) = (NIL, NIL, NIL, 5, 0, V)
L[N2]: D(N2[0]) = (N1, 0, w, 10, 2, V), D(N2[1]) = (N1, 0, b, 9, 3, V)
L[N3]: D(N3[0]) = (N2, 1, w, 12, 8, V), D(N3[1]) = (N2, 0, b, 3, 20, V), D(N3[2]) = (N2, 1, b, 15, 7, V)
L[N4]: D(N4[0]) = (N3, 0, b, 18, 20, V)
(m) Scan neighbour-vertex, try buffer-insertion; the new candidate dominates D(N4[0]), so
Decrease-Key at Q.
 
do {
    do {
        (D(u[k]), t ∈ D(u[k])) ← EXTRACT-MIN(Q)   // extract the highest-priority entry.
    } while (sf ∈ D(u[k]) == NON_VALID)          // continue if the entry is invalid.
    :   // candidate D(N3[0]) is valid.
    :
}

Extracted: key 8, D(N3[0])
Q keys:    20 20 ∞ ∞ ∞ ∞
Q entries: D(N3[1]) D(N4[0]) ∞ ∞ ∞ ∞
estimated_delay = 20
estimated_end_candidate = D(N4[0])
Candidate records (u, uk, e, r, t, sf):
L[N1]: D(N1[0]) = (NIL, NIL, NIL, 5, 0, V)
L[N2]: D(N2[0]) = (N1, 0, w, 10, 2, V), D(N2[1]) = (N1, 0, b, 9, 3, V)
L[N3]: D(N3[0]) = (N2, 1, w, 12, 8, V), D(N3[1]) = (N2, 0, b, 3, 20, V), D(N3[2]) = (N2, 1, b, 15, 7, V)
L[N4]: D(N4[0]) = (N3, 0, b, 18, 20, V)
(n) All interconnect choices have been tried. Now extract the highest-priority candidate from Q.
 
if (estimated_delay > t ∈ D(u[k])) {
    for (each vertex v ∈ Adj[u]) {
        if (v ∈ OW[G]’) {   // if v is not a wire-obstacle.
            for each w ∈ W {
                (rv, tv) ← Cost(r ∈ D(u[k]), t ∈ D(u[k]), w[i])   // assume r = 13, t = 28;
                if (tv < estimated_delay)   // 28 > estimated_delay, so the test fails
                    { InsertCandidate(D(u[k]), v, rv, tv, w[i], L[v]) }   // not entered.
                :

Q keys:    20 20 ∞ ∞ ∞ ∞
Q entries: D(N3[1]) D(N4[0]) ∞ ∞ ∞ ∞
estimated_delay = 20
estimated_end_candidate = D(N4[0])
Candidate records (u, uk, e, r, t, sf):
L[N1]: D(N1[0]) = (NIL, NIL, NIL, 5, 0, V)
L[N2]: D(N2[0]) = (N1, 0, w, 10, 2, V), D(N2[1]) = (N1, 0, b, 9, 3, V)
L[N3]: D(N3[0]) = (N2, 1, w, 12, 8, V), D(N3[1]) = (N2, 0, b, 3, 20, V), D(N3[2]) = (N2, 1, b, 15, 7, V)
L[N4]: D(N4[0]) = (N3, 0, b, 18, 20, V)
(o) Scan neighbour-vertex, try wire-candidate; the computed propagation delay >
estimated_delay, so discard it.
 
if (v ∈ OB[G]’) {   // if v is not a buffer-obstacle.
    for each b ∈ B {
        (rv, tv) ← Cost(r ∈ D(u[k]), t ∈ D(u[k]), b[i])   // assume r = 18, t = 18
        if (tv < estimated_delay)   // 18 < estimated_delay;
            { InsertCandidate(D(u[k]), v, rv, tv, b[i], L[v]) }   // Decrease-Key, update estimated_;
    }
}   // end buffer trials
:

Q keys:    18 20 ∞ ∞ ∞ ∞
Q entries: D(N4[0]) D(N3[1]) ∞ ∞ ∞ ∞
estimated_delay = 18 (updated from 20)
estimated_end_candidate = D(N4[0])
Candidate records (u, uk, e, r, t, sf):
L[N1]: D(N1[0]) = (NIL, NIL, NIL, 5, 0, V)
L[N2]: D(N2[0]) = (N1, 0, w, 10, 2, V), D(N2[1]) = (N1, 0, b, 9, 3, V)
L[N3]: D(N3[0]) = (N2, 1, w, 12, 8, V), D(N3[1]) = (N2, 0, b, 3, 20, V), D(N3[2]) = (N2, 1, b, 15, 7, V)
L[N4]: D(N4[0]) = (N3, 2, b, 18, 18, V)
(p) Scan neighbour-vertex, try buffer-insertion; the new candidate dominates D(N4[0]), so
Decrease-Key at Q.
 
do {
    do {
        (D(u[k]), t ∈ D(u[k])) ← EXTRACT-MIN(Q)   // extract the highest-priority entry.
    } while (sf ∈ D(u[k]) == NON_VALID)          // continue if the entry is invalid.
    if (estimated_delay > t ∈ D(u[k])) {         // not entered, because t = estimated_delay
    :
    :

Extracted: key 18, D(N4[0])
Q keys:    20 ∞ ∞ ∞ ∞ ∞
Q entries: D(N3[1]) ∞ ∞ ∞ ∞ ∞
estimated_delay = 18
estimated_end_candidate = D(N4[0])
Candidate records (u, uk, e, r, t, sf):
L[N1]: D(N1[0]) = (NIL, NIL, NIL, 5, 0, V)
L[N2]: D(N2[0]) = (N1, 0, w, 10, 2, V), D(N2[1]) = (N1, 0, b, 9, 3, V)
L[N3]: D(N3[0]) = (N2, 1, w, 12, 8, V), D(N3[1]) = (N2, 0, b, 3, 20, V), D(N3[2]) = (N2, 1, b, 15, 7, V)
L[N4]: D(N4[0]) = (N3, 2, b, 18, 18, V)
(q) All interconnect choices have been tried. Now extract the highest-priority candidate from Q.
Do not continue, because t of the candidate = estimated_delay.
 
do {
    do {
        (D(u[k]), t ∈ D(u[k])) ← EXTRACT-MIN(Q)   // extract the highest-priority entry.
    } while (sf ∈ D(u[k]) == NON_VALID)          // continue if the entry is invalid.
    if (estimated_delay > t ∈ D(u[k])) {         // not entered, because t = 20 > estimated_delay
    :
    :

Extracted: key 20, D(N3[1])
Q keys:    ∞ ∞ ∞ ∞ ∞ ∞
Q entries: ∞ ∞ ∞ ∞ ∞ ∞
estimated_delay = 18
estimated_end_candidate = D(N4[0])
Candidate records (u, uk, e, r, t, sf):
L[N1]: D(N1[0]) = (NIL, NIL, NIL, 5, 0, V)
L[N2]: D(N2[0]) = (N1, 0, w, 10, 2, V), D(N2[1]) = (N1, 0, b, 9, 3, V)
L[N3]: D(N3[0]) = (N2, 1, w, 12, 8, V), D(N3[1]) = (N2, 0, b, 3, 20, V), D(N3[2]) = (N2, 1, b, 15, 7, V)
L[N4]: D(N4[0]) = (N3, 2, b, 18, 18, V)
(r) Extract the highest-priority candidate from Q. Discard it because t of the candidate >
estimated_delay.
 
do {
    do {
        (D(u[k]), t ∈ D(u[k])) ← EXTRACT-MIN(Q)
    } while (sf ∈ D(u[k]) == NON_VALID)
    :
} while (Q ≠ Ø)   // now the PQ is empty.

Q keys:    ∞ ∞ ∞ ∞ ∞ ∞
Q entries: ∞ ∞ ∞ ∞ ∞ ∞
estimated_delay = 18
estimated_end_candidate = D(N4[0])
Candidate records (u, uk, e, r, t, sf):
L[N1]: D(N1[0]) = (NIL, NIL, NIL, 5, 0, V)
L[N2]: D(N2[0]) = (N1, 0, w, 10, 2, V), D(N2[1]) = (N1, 0, b, 9, 3, V)
L[N3]: D(N3[0]) = (N2, 1, w, 12, 8, V), D(N3[1]) = (N2, 0, b, 3, 20, V), D(N3[2]) = (N2, 1, b, 15, 7, V)
L[N4]: D(N4[0]) = (N3, 2, b, 18, 18, V)
(s) Q is empty; exit the loop.
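
The loop control traced in steps (q) to (s) can be summarised with a small C sketch. It is a simplified illustration, not the thesis implementation: the names `struct entry` and `count_expanded` are hypothetical, the queue is modelled as an array already sorted by ascending t, and the queue and estimated_delay are assumed not to change while the queue is drained.

```c
#include <assert.h>
#include <stddef.h>

#define NON_VALID 0
#define VALID     1

/* Hypothetical queue entry: t is the key (delay), sf is the status flag. */
struct entry {
    int t;
    int sf;
};

/*
 * Walk a queue sorted by ascending t and count the entries that would be
 * expanded. Invalid entries are skipped (inner do-while of the pseudo-code),
 * and an entry is expanded only when estimated_delay > t.
 */
size_t count_expanded(const struct entry *q, size_t n, int estimated_delay)
{
    size_t expanded = 0;

    assert(q != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {      /* each iteration models one EXTRACT-MIN */
        if (q[i].sf == NON_VALID)
            continue;                     /* stale entry, extract again */
        if (estimated_delay > q[i].t)
            expanded++;                   /* neighbours of this candidate would be scanned */
    }
    return expanded;                      /* the loop ends when Q is empty */
}
```

With the queue of step (p), keys 18 and 20 and estimated_delay = 18, no entry is expanded, which corresponds to steps (q) and (r).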
 
 
 
 
 
TRACE-BACK, BEGINNING AT THE ESTIMATED_END_CANDIDATE D(N4[0]):
D(N4[0]) → D(N3[0]) → D(N2[1]) → D(N1[0]),
with end-to-end delay = 18.

estimated_delay = 18
estimated_end_candidate = D(N4[0])
Candidate records (u, uk, e, r, t, sf):
L[N1]: D(N1[0]) = (NIL, NIL, NIL, 5, 0, V)
L[N2]: D(N2[0]) = (N1, 0, w, 10, 2, V), D(N2[1]) = (N1, 0, b, 9, 3, V)
L[N3]: D(N3[0]) = (N2, 1, w, 12, 8, V), D(N3[1]) = (N2, 0, b, 3, 20, V), D(N3[2]) = (N2, 1, b, 15, 7, V)
L[N4]: D(N4[0]) = (N3, 2, b, 18, 18, V)
The resulting path can be traced back.
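
The dominance check used in steps (i), (m) and (p) can be illustrated with the following C sketch. It rests on an assumption that is not stated in this appendix: a candidate is taken to dominate another when it is no worse in both r and t, which agrees with the values shown in those steps. The names `struct cand_list`, `dominates` and `insert_candidate` are hypothetical, and the sketch overwrites only the first dominated record in a list.

```c
#include <assert.h>

#define MAX_CAND 8   /* hypothetical capacity of one list L[v] */

/* One candidate record, using the field names shown in the figures. */
struct candidate {
    int  u;    /* parent vertex */
    int  uk;   /* index of the parent candidate in L[u] */
    char e;    /* 'w' = wire, 'b' = buffer */
    int  r;
    int  t;    /* delay, used as the queue key */
};

struct cand_list {
    struct candidate c[MAX_CAND];
    int n;
};

/* Assumed rule: a dominates b when a is no worse in both r and t. */
static int dominates(const struct candidate *a, const struct candidate *b)
{
    return a->r <= b->r && a->t <= b->t;
}

/*
 * Returns the slot used by nc, or -1 when nc is dominated and dropped.
 * Overwriting a dominated slot corresponds to the Decrease-Key case.
 */
int insert_candidate(struct cand_list *l, struct candidate nc)
{
    for (int i = 0; i < l->n; i++) {
        if (dominates(&l->c[i], &nc))
            return -1;                 /* an existing candidate is at least as good */
    }
    for (int i = 0; i < l->n; i++) {
        if (dominates(&nc, &l->c[i])) {
            l->c[i] = nc;              /* replace the dominated candidate in place */
            return i;
        }
    }
    assert(l->n < MAX_CAND);
    l->c[l->n] = nc;                   /* not dominated: append as a new candidate */
    return l->n++;
}
```

Applied to L[N3], inserting (r, t) = (14, 8) and (3, 20) fills slots 0 and 1, (12, 8) then overwrites slot 0 as in step (i), and (15, 7) is appended as slot 2 as in step (j).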
APPENDIX D 
 
 
 
 
NUMERICAL EXAMPLE OF  
THE INSERTION SORT PRIORITY QUEUE OPERATION 
 
 
 
 
 Appendix D.1 presents a numerical example illustrating the Insertion Sort
algorithm. Appendix D.2 illustrates the operation of the Insertion Sort Priority
Queue. The algorithm definition is given in Chapter 3.
 
 
 
 
D.1 Numerical Example Illustrates Insertion-Sort Algorithm 
 
 
Given an array of data, sort the array using the Insertion-Sort algorithm.
Elements are labelled E0 E1 E2 E3 E4 E5 E6 E7 from left to right.

Initial array:
    18 9 3 12 22 2 78 1

Step-1, Inner-Loop-1: E1 < E0, swap E1 and E0.
    9 18 3 12 22 2 78 1

Step-2, Inner-Loop-2: E2 < E1, swap E2 and E1.
    9 3 18 12 22 2 78 1
Step-2, Inner-Loop-1: E1 < E0, swap E1 and E0.
    3 9 18 12 22 2 78 1

Step-3, Inner-Loop-3: E3 < E2, swap E3 and E2.
    3 9 12 18 22 2 78 1
Step-3, Inner-Loop-2: E2 > E1, end-loop.

Step-4, Inner-Loop-4: E4 > E3, end-loop.
    3 9 12 18 22 2 78 1

Step-5, Inner-Loop-5 down to 1.
Step-5, Inner-Loop-5: E5 < E4, swap.
    3 9 12 18 2 22 78 1
Step-5, Inner-Loop-4: E4 < E3, swap.
    3 9 12 2 18 22 78 1
Step-5, Inner-Loop-3: E3 < E2, swap.
    3 9 2 12 18 22 78 1
Step-5, Inner-Loop-2: E2 < E1, swap.
    3 2 9 12 18 22 78 1
Step-5, Inner-Loop-1: E1 < E0, swap.
    2 3 9 12 18 22 78 1

Step-6, Inner-Loop-6: E6 > E5, end-loop.
    2 3 9 12 18 22 78 1

Step-7, Inner-Loop-7 down to 1.
Step-7, Inner-Loop-7: E7 < E6, swap.
    2 3 9 12 18 22 1 78
Step-7, Inner-Loop-6: E6 < E5, swap.
    2 3 9 12 18 1 22 78
Step-7, Inner-Loop-5: E5 < E4, swap.
    2 3 9 12 1 18 22 78
Step-7, Inner-Loop-4: E4 < E3, swap.
    2 3 9 1 12 18 22 78
Step-7, Inner-Loop-3: E3 < E2, swap.
    2 3 1 9 12 18 22 78
Step-7, Inner-Loop-2: E2 < E1, swap.
    2 1 3 9 12 18 22 78
Step-7, Inner-Loop-1: E1 < E0, swap.
    1 2 3 9 12 18 22 78




D.2 Numerical Example Illustrates Insertion-Sort Priority Queue Operation


Initially, the array is empty. The leftmost element has the highest priority.

INSERT a new element (“55”) into the queue.
This fills the first empty location from the left.
There are no other elements, hence no comparison is needed.
    55
 
INSERT a new element (“18”) into the queue.
This fills the first empty location from the left (right after “55”).
Comparison is invoked to sort the queue.
    55 18
    18 55

INSERT a new element (“9”) into the queue.
This fills the first empty location from the left (right after “55”).
Comparison is invoked to sort the queue.
    18 55 9
    18 9 55
    9 18 55

INSERT a new element (“19”) into the queue.
This fills the first empty location from the left (right after “55”).
Comparison is invoked to sort the queue.
    9 18 55 19
    9 18 19 55

INSERT a new element (“12”) into the queue.
This fills the first empty location from the left (right after “55”).
Comparison is invoked to sort the queue.
    9 18 19 55 12
    9 18 19 12 55
    9 18 12 19 55
    9 12 18 19 55

INSERT a new element (“82”) into the queue.
This fills the first empty location from the left (right after “55”).
Comparison is invoked to sort the queue.
    9 12 18 19 55 82

INSERT a new element (“95”) into the queue.
This fills the first empty location from the left (right after “82”).
Comparison is invoked to sort the queue.
    9 12 18 19 55 82 95

INSERT a new element (“1”) into the queue.
This fills the first empty location from the left (right after “95”).
Comparison is invoked to sort the queue.
    9 12 18 19 55 82 95 1
    9 12 18 19 55 82 1 95
    9 12 18 19 55 1 82 95
    9 12 18 19 1 55 82 95
    9 12 18 1 19 55 82 95
    9 12 1 18 19 55 82 95
    9 1 12 18 19 55 82 95
    1 9 12 18 19 55 82 95

EXTRACT removes the highest-priority element (the leftmost element),
which invokes a right-shift throughout the queue to refill the empty location.
    Before: 1 9 12 18 19 55 82 95
    After:  9 12 18 19 55 82 95      (extracted: 1)

EXTRACT removes the highest-priority element (the leftmost element),
which invokes a right-shift throughout the queue to refill the empty location.
    Before: 9 12 18 19 55 82 95
    After:  12 18 19 55 82 95        (extracted: 9)

EXTRACT removes the highest-priority element (the leftmost element),
which invokes a right-shift throughout the queue to refill the empty location.
    Before: 12 18 19 55 82 95
    After:  18 19 55 82 95           (extracted: 12)
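
The INSERT and EXTRACT operations illustrated above can be sketched in C as follows. This is a minimal software model under stated assumptions, not the hardware design: the names `struct array_pq`, `pq_reset`, `pq_insert` and `pq_extract` and the capacity of 8 are hypothetical, and index 0 is taken to be the leftmost (highest-priority) location. The loop inside `pq_insert` is the inner loop of the Insertion-Sort example in D.1.

```c
#include <assert.h>

#define PQ_CAPACITY 8   /* hypothetical queue length */

struct array_pq {
    int key[PQ_CAPACITY];
    int count;          /* number of occupied locations */
};

void pq_reset(struct array_pq *q)
{
    q->count = 0;
}

/*
 * INSERT: place the new key in the first empty location, then swap it
 * toward the head while it is smaller than its left neighbour.
 */
void pq_insert(struct array_pq *q, int key)
{
    assert(q->count < PQ_CAPACITY);
    int i = q->count++;
    q->key[i] = key;
    while (i > 0 && q->key[i] < q->key[i - 1]) {
        int tmp = q->key[i - 1];
        q->key[i - 1] = q->key[i];
        q->key[i] = tmp;
        i--;
    }
}

/*
 * EXTRACT: return the key at the head and move every remaining key one
 * location toward the head to refill the empty location.
 */
int pq_extract(struct array_pq *q)
{
    assert(q->count > 0);
    int min = q->key[0];
    for (int i = 1; i < q->count; i++)
        q->key[i - 1] = q->key[i];
    q->count--;
    return min;
}
```

Inserting 55, 18, 9, 19, 12, 82, 95 and 1 in that order and then extracting three times returns 1, 9 and 12, and leaves 18 at the head of the queue, as in the example.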
APPENDIX E 
 
 
 
 
INTRODUCTION TO ALTERA NIOS II DEVELOPMENT SYSTEM 
 
 
 
 
 This appendix illustrates the basic design flow of a NIOS II system, together
with an example system. Refer to Altera (2003a, 2003b, 2004a, 2004b, 2004c, 2005a)
for detailed information.
 
 
 
 
E.1 An example of NIOS II System 
 
 
 
 
E.2 Basic Design Flow 
 
 
Hardware Development

User peripheral development:
• Use Quartus II software for design entry and design compilation; then verify the
functionality through waveform simulation.
• Preferably use VHDL for design modelling.
Avalon Interface Unit development:
• Use Quartus II software for design entry and design compilation; then verify the
functionality through waveform simulation.
• Preferably use VHDL for design modelling.
NIOS II System development:
• Use ALTERA SoPC Builder to integrate all peripherals, the NIOS II embedded
processor and the Avalon Bus.

Software Development

Device driver development:
• Use NIOS II IDE software.
• Target Altera HAL API technology.
User API development:
• Use NIOS II IDE software.
• Target Altera HAL API technology.
User Application development:
• Use NIOS II IDE software.
• Use Std. ANSI C/C++ language.
• Target Altera HAL API technology.

The design specifications of the Avalon Interface Unit and the Device Driver are
related to each other.

On FPGA Dev. Board:
• NIOS II System design validation and performance verification.
• Download hardware designs to the FPGA dev. board using Quartus II Programmer, then
• Download software to the FPGA dev. board using NIOS II IDE.
APPENDIX F 
 
 
 
 
VHDL SOURCE CODES OF 
 PRIORITY QUEUE ACCELERATOR MODULE 
 
 
 
 
 This appendix presents the VHDL source code of the Priority Queue
Accelerator Module and its sub-modules. First, the design hierarchy is presented.
This is followed by the VHDL code, starting from the top module. The full code is
not given here but can be obtained from the author or the supervisor. From here on
in this appendix, hwPQ is referred to as SAPQ, the Avalon Interface Unit is
referred to as SAPQ_Avalon_Interface, and the Priority Queue Accelerator Module is
referred to as SAPQ_Coprocessor.
 
 
 
 
F.1 Design Hierarchy of SAPQ_CoProcessor 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
F.2 VHDL Code of SAPQ_CoProcessor


-- avalon_SystolicPQ
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;

entity SAPQ_Coprocessor is
 port ( CLK         : in STD_LOGIC;
        chipselect  : in STD_LOGIC;
        write       : in STD_LOGIC;
        address     : in STD_LOGIC_VECTOR(2 downto 0);
        writedata   : in STD_LOGIC_VECTOR(31 downto 0);
        readdata    : out STD_LOGIC_VECTOR(31 downto 0)
 );
end SAPQ_Coprocessor;

architecture SAPQ_Coprocessor_arch of SAPQ_Coprocessor is
  :
 -- VHDL Module Generator component declarations
 component SAPQ
 :
 component SAPQ_Avalon_Interface
  :
begin
 -- VHDL Module Generator component instantiations

 U_SAPQ_Avalon_Interface: SAPQ_Avalon_Interface
  port map ( :
    :
 U_SAPQ: SAPQ
  port map ( :
:
end SAPQ_Coprocessor_arch;


F.3 VHDL Code of SAPQ Core


library IEEE; 
use IEEE.std_logic_1164.all; 
use IEEE.std_logic_arith.all; 
use IEEE.std_logic_unsigned.all; 
 
entity SAPQ is 
 GENERIC (n: INTEGER := 4); -- default SAPQ-4 
 port ( rst   : in STD_LOGIC; 
  writeP   : in STD_LOGIC; 
  readP   : in STD_LOGIC; 
  writeS   : out STD_LOGIC; 
  readS   : out STD_LOGIC; 
  writedataP  : in STD_LOGIC_VECTOR(63 downto 0); 
  readdataP  : out STD_LOGIC_VECTOR(63 downto 0); 
  writedataS  : out STD_LOGIC_VECTOR(63 downto 0); 
  readdataS  : in STD_LOGIC_VECTOR(63 downto 0); 
  CLK   : in STD_LOGIC 
 ); 
end SAPQ; 
 
architecture SAPQ_arch of SAPQ is 
  : 
 component PE 
  : 
begin 
 -- VHDL Module Generator component instantiations 
  :  
 FOR i IN 0 TO (n-1) GENERATE 
 U_PE: PE 
: 
 END GENERATE; 
: 
end SAPQ_arch; 
F.4 VHDL Code of Processing Element (PE)


library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;
use IEEE.std_logic_unsigned.all;

entity PE is
 port (
  rst         : in STD_LOGIC;
  writeP      : in STD_LOGIC;
  readP       : in STD_LOGIC;
  writeS      : out STD_LOGIC;
  readS       : out STD_LOGIC;
  writedataP  : in STD_LOGIC_VECTOR(63 downto 0);
  readdataP   : out STD_LOGIC_VECTOR(63 downto 0);
  writedataS  : out STD_LOGIC_VECTOR(63 downto 0);
  readdataS   : in STD_LOGIC_VECTOR(63 downto 0);
  CLK         : in STD_LOGIC
 );
end PE;

architecture PE_arch of PE is
  :
 -- VHDL Module Generator component declarations
 component CU
  :
 component DU
  :
begin
 -- VHDL Module Generator component instantiations
 U_CU: CU
  port map (
  :
 U_DU: DU
  port map (
  :
end PE_arch;


F.5 VHDL Code of Control Unit (CU)


-- Synthesizable VHDL generated by VHDL Module Generator 
-- 
-- CU 
library IEEE; 
use IEEE.std_logic_1164.all; 
use IEEE.std_logic_arith.all; 
 
entity CU is 
 port ( writeP  : in STD_LOGIC; 
  readP  : in STD_LOGIC; 
  readS  : out STD_LOGIC; 
  writeS  : out STD_LOGIC; 
  selHold  : out STD_LOGIC; 
  ldHold  : out STD_LOGIC; 
  selTemp  : out STD_LOGIC; 
  ldTemp  : out STD_LOGIC; 
  BGA  : in STD_LOGIC; 
  rst  : in STD_LOGIC; 
  CLK  : in STD_LOGIC 
 ); 
end CU; 
 
 
architecture CU_arch of CU is 
 : 
 :       
end CU_arch; 
F.6 VHDL Code of Datapath Unit (DU)


-- Synthesizable VHDL generated by VHDL Module Generator
--
-- DU
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;
use IEEE.std_logic_unsigned.all;

entity DU is
 port ( writedataP  : in STD_LOGIC_VECTOR(63 downto 0);
        readdataP   : out STD_LOGIC_VECTOR(63 downto 0);
        writedataS  : out STD_LOGIC_VECTOR(63 downto 0);
        readdataS   : in STD_LOGIC_VECTOR(63 downto 0);
        BGA         : out STD_LOGIC;
        selHold     : in STD_LOGIC;
        ldHold      : in STD_LOGIC;
        selTemp     : in STD_LOGIC;
        ldTemp      : in STD_LOGIC;
        rst         : in STD_LOGIC;
        CLK         : in STD_LOGIC
 );
end DU;

architecture DU_arch of DU is
 :
:
end DU_arch;


F.7 VHDL Code of SAPQ Avalon Interface


-- avalonInterfaceUnit 
library IEEE; 
use IEEE.std_logic_1164.all; 
use IEEE.std_logic_arith.all; 
 
entity SAPQ_Avalon_Interface is 
 port ( CLK   : in STD_LOGIC; 
  chipselect   : in STD_LOGIC; 
  address   : in STD_LOGIC_VECTOR(2 downto 0);  
  writedata   : in STD_LOGIC_VECTOR(31 downto 0); 
  readdata   : out STD_LOGIC_VECTOR(31 downto 0); 
  writedataP  : out STD_LOGIC_VECTOR(63 downto 0); 
  readdataP   : in STD_LOGIC_VECTOR(63 downto 0); 
  SAPQ_reset  : out STD_LOGIC; 
  SAPQ_writeP  : out STD_LOGIC; 
  SAPQ_readP  : out STD_LOGIC 
 ); 
end SAPQ_Avalon_Interface; 
 
architecture SAPQ_Avalon_Interface_arch of SAPQ_Avalon_Interface is 
 : 
 component avalonCU 
 : 
 component avalonDU 
begin 
 : 
end SAPQ_Avalon_Interface_arch; 
F.8 VHDL Code of SAPQ avalonCU


-- avalonCU
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;

entity avalonCU is
 port ( CLK            : in STD_LOGIC;
        avalon_reset   : in STD_LOGIC;
        avalon_writeP  : in STD_LOGIC;
        avalon_readP   : in STD_LOGIC;
        SAPQ_reset     : out STD_LOGIC;
        SAPQ_writeP    : out STD_LOGIC;
        SAPQ_readP     : out STD_LOGIC
 );
end avalonCU;

architecture avalonCU_arch of avalonCU is
 :
 :
end avalonCU_arch;


F.9 VHDL Code of SAPQ avalonDU


-- avalonDU 
library IEEE; 
use IEEE.std_logic_1164.all; 
use IEEE.std_logic_arith.all; 
 
entity avalonDU is 
 port ( CLK   : in STD_LOGIC; 
  chipselect   : in STD_LOGIC; 
  address   : in STD_LOGIC_VECTOR(2 downto 0); 
  writedata   : in STD_LOGIC_VECTOR(31 downto 0); 
  readdata   : out STD_LOGIC_VECTOR(31 downto 0); 
  writedataP  : out STD_LOGIC_VECTOR(63 downto 0); 
  readdataP  : in STD_LOGIC_VECTOR(63 downto 0); 
  avalon_reset  : out STD_LOGIC; 
  avalon_writeP  : out STD_LOGIC; 
  avalon_readP  : out STD_LOGIC 
 ); 
end avalonDU; 
 
architecture avalonDU_arch of avalonDU is 
: 
: 
end avalonDU_arch; 
 
 APPENDIX G 
 
 
 
C SOURCE CODE FOR  
hwPQ DEVICE DRIVER AND HYBRIDPQ API  
 
 
 
 
 This appendix presents the C source code that makes up the hwPQ device
driver and the HybridPQ API. In this appendix, hwPQ is referred to as SAPQ and the
Priority Queue Accelerator Module is referred to as SAPQ_Coprocessor. The appendix
begins with “system.h”, the hardware abstraction layer (HAL) file generated by SoPC
Builder. “system.h” defines the address offset and device name of SAPQ_Coprocessor.
The device name is dereferenced by the SAPQ device driver (SAPQ.h and SAPQ.c). The
underlying software priority queue, the Fibonacci-Heap Priority Queue (FHPQ), is
also included in this appendix (Fibonacci.h and Fibonacci.c). Finally, the HybridPQ
API is given in HybridPQ.h and HybridPQ.c. The full source code is not given for
some files because of the page constraint of the thesis. However, the complete code
can be obtained from the author or the supervisor.
 
 
 
G.1 system.h 
 
 
 
 
 
 
 
 
 
 
 
 
/* 
 * Machine generated for a CPU named "cpu" as defined in: 
 * 
C:\altera\qdesigns41\SAPQ\my2S60ES_nios2_SAPQ_PerformanceCounter\software\Demo_M
azeRouting_syslib\../../std_2s60ES.ptf 
 * 
 */ 
 
#ifndef __SYSTEM_H_ 
#define __SYSTEM_H_ 
 
/* 
DO NOT MODIFY THIS FILE 
*/ 
 
/*
 * mySAPQ_0 configuration
 */

#define MYSAPQ_0_NAME "/dev/mySAPQ_0"
#define MYSAPQ_0_TYPE "altera_avalon_user_defined_interface"
#define MYSAPQ_0_BASE 0x021208E0
#define MYSAPQ_0_IMPORTED_WAIT 0
#define MYSAPQ_0_NIOS_GEN_WAITS 1
#define MYSAPQ_0_SIMULATE_IMPORTED_HDL 1
#define MYSAPQ_0_PORT_TYPE "Avalon Slave"
#define MYSAPQ_0_HDL_IMPORT 1
#define MYSAPQ_0_TIMING_UNITS "cycles"
#define MYSAPQ_0_UNIT_MULTIPLIER 14.285714285714286
#define MYSAPQ_0_SETUP_VALUE 1
#define MYSAPQ_0_HOLD_VALUE 0
#define MYSAPQ_0_WAIT_VALUE 0
#define MYSAPQ_0_ADDRESS_WIDTH 32
#define MYSAPQ_0_MODULE_LIST ""
#define MYSAPQ_0_SHOW_STREAMING 0
#define MYSAPQ_0_SHOW_LATENCY 0
#define MYSAPQ_0_TECHNOLOGY "User Logic"
#define MYSAPQ_0_FILE_COUNT 1
#define MYSAPQ_0_PORT_COUNT 6
#define MYSAPQ_0_COMPONENT_DESC "my_SAPQ"
#define MYSAPQ_0_MODULE_NAME "SAPQ_Coprocessor"

 :
#endif /* __SYSTEM_H_ */



G.2 SAPQ.h


/* MODULE:      SAPQ.h
 * DESCRIPTION: header file of "SAPQ.c"
 * PLATFORM:    NIOS II SYSTEM MODULE (TARGETED 2S60ES)
 * HARDWARE:    MYSAPQ_0
 *
 * AUTHOR:  CH'NG HENG SUN
 * DATE:    26 SEPT 2006
 */

#ifndef _SAPQ_h_
#define _SAPQ_h_

//system.h generated from SOPC builder
#include "system.h"
//ALTERA HAL PIO data transfer macro utility
#include "altera_avalon_pio_regs.h"

  // operation codes written to REG_OPMODE
  #define reset       0x4
  #define writeP      0x2
  #define readP       0x1
  #define do_nothing  0x0

  // register offsets relative to MYSAPQ_0_BASE
  #define REG_EXTRACT_PRIORITY  0x0
  #define REG_EXTRACT_POINTER   0x1
  #define REG_INSERT_PRIORITY   0x2
  #define REG_INSERT_POINTER    0x3
  #define REG_OPMODE            0x4

void SAPQ_reset(void);
void SAPQ_insert(int i_priority, int i_pointer);
void SAPQ_extract(int *i_priority, int *i_pointer);
void SAPQ_peek(int *i_priority, int *i_pointer);
void SAPQ_delete(void);

#endif

G.3 SAPQ.c 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
/* MODULE:      SAPQ.c 
 * DESCRIPTION: SAPQ_Coprocessor device driver. 
 * PLATFORM:    NIOS II SYSTEM MODULE (TARGETED 2S60ES) 
 * HARDWARE:    MYSAPQ_0 
 *  
 * AUTHOR:  CH'NG HENG SUN 
 * DATE:    26 SEPT 2006 
 */ 
 
#include "SAPQ.h" 
 
//********************************************************************* 
void SAPQ_reset(void) 
// Description: Reset SAPQ 
// Input:       (none) 
// Output:      (none) 
//********************************************************************* 
{ 
  IOWR(MYSAPQ_0_BASE, REG_OPMODE, reset); 
  IOWR(MYSAPQ_0_BASE, REG_OPMODE, do_nothing); 
} 
 
//********************************************************************* 
void SAPQ_insert(int i_priority, int i_pointer) 
// Description: Insert an entry into SAPQ 
// Input      : i_priority, i_pointer 
// Output     : (none) 
//********************************************************************* 
{ 
  IOWR(MYSAPQ_0_BASE, REG_INSERT_POINTER, i_pointer); 
  IOWR(MYSAPQ_0_BASE, REG_INSERT_PRIORITY, i_priority); 
  IOWR(MYSAPQ_0_BASE, REG_OPMODE, writeP); 
  IOWR(MYSAPQ_0_BASE, REG_OPMODE, do_nothing); 
} 
 
//********************************************************************* 
void SAPQ_extract(int *i_priority, int *i_pointer) 
// Description: Extract the MINIMUM entry from SAPQ 
// Input      : (none) 
// Output     : *i_priority, *i_pointer 
//********************************************************************* 
{ 
  *i_priority=IORD(MYSAPQ_0_BASE, REG_EXTRACT_PRIORITY);  
  *i_pointer =IORD(MYSAPQ_0_BASE, REG_EXTRACT_POINTER);  
  IOWR(MYSAPQ_0_BASE, REG_OPMODE, readP); 
  IOWR(MYSAPQ_0_BASE, REG_OPMODE, do_nothing); 
} 
 
//********************************************************************* 
void SAPQ_peek(int *i_priority, int *i_pointer) 
// Description: Peek the MINIMUM entry from SAPQ 
// Input      : (none) 
// Output     : *i_priority, *i_pointer 
//***************************************************************************** 
{ 
  *i_priority=IORD(MYSAPQ_0_BASE, REG_EXTRACT_PRIORITY);  
  *i_pointer =IORD(MYSAPQ_0_BASE, REG_EXTRACT_POINTER); 
} 
 
//***************************************************************************** 
void SAPQ_delete(void)
// Description: Delete the MINIMUM entry from SAPQ, following a SAPQ_peek
// Input      : (none)
// Output     : (none) 
//***************************************************************************** 
{ 
  IOWR(MYSAPQ_0_BASE, REG_OPMODE, readP); 
  IOWR(MYSAPQ_0_BASE, REG_OPMODE, do_nothing); 
} 
 
  
G.4 fiblist_struct.h


/*
 * Define Data Structure in Fibonacci-Heap
 * see Cormen: Fibonacci Heap.
 */

#ifndef __FIBLIST_STRUCT_H__
#define __FIBLIST_STRUCT_H__

#ifndef KOSONG
#define KOSONG 0
#endif

struct fiblist {
 int key;
 int id;
 int degree;
 int mark;
 struct fiblist *parent, *child, *left, *right;
};

#endif


G.5 Fibonacci.h


#ifndef __FIBONACCI_H__ 
#define __FIBONACCI_H__ 
 
 
// HEADER FILE 
#include <stdlib.h> 
#include <string.h> 
#include <stdio.h> 
#include <math.h> 
#include "fiblist_struct.h" 
#include "nrutil.h" 
 
// KOSONG defined in "fiblist_struct.h"
 
 
// FUNCTION PROTOTYPE FOR EXACT FIBONACCI-HEAP 
void FHPQ_create_heap(struct fiblist **heap, int *num); 
struct fiblist *FHPQ_insert(struct fiblist **heap, int *num, int key, int id); 
struct fiblist *FHPQ_extract_min(struct fiblist **heap, int *num); 
struct fiblist *FHPQ_min(struct fiblist **heap, int *num); 
void FHPQ_decrease_key(struct fiblist **heap, struct fiblist *x, int k);  
void consolidate(struct fiblist **heap, int *num); 
void cut(struct fiblist **heap, struct fiblist *x, struct fiblist *y); 
void cascading_cut(struct fiblist **heap, struct fiblist *y); 
#endif 
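
As an illustration of how the `fiblist` fields are used, the following C sketch inserts a node into the circular root list of a Fibonacci heap, following the textbook (Cormen) procedure. It is not the FHPQ source, which is not reproduced in full in this appendix: the function name `sketch_insert` is hypothetical, `NULL` is used in place of `KOSONG`, and `*heap` is assumed to point to the minimum root, as the comment on the `heap` variable in HybridPQ.c suggests.

```c
#include <assert.h>
#include <stdlib.h>

struct fiblist {
    int key;
    int id;
    int degree;
    int mark;
    struct fiblist *parent, *child, *left, *right;
};

/*
 * Hypothetical sketch: add a new single-node tree to the root list and
 * update the minimum pointer. Returns the new node, or NULL if the
 * allocation fails.
 */
struct fiblist *sketch_insert(struct fiblist **heap, int *num, int key, int id)
{
    assert(heap != NULL && num != NULL);

    struct fiblist *x = malloc(sizeof *x);
    if (x == NULL)
        return NULL;

    x->key = key;
    x->id = id;
    x->degree = 0;
    x->mark = 0;
    x->parent = NULL;
    x->child = NULL;

    if (*heap == NULL) {
        /* Empty heap: x forms a one-element circular list. */
        x->left = x;
        x->right = x;
        *heap = x;
    } else {
        /* Splice x into the root list, to the right of the current minimum. */
        x->left = *heap;
        x->right = (*heap)->right;
        (*heap)->right->left = x;
        (*heap)->right = x;
        if (x->key < (*heap)->key)
            *heap = x;              /* x becomes the new minimum */
    }
    (*num)++;
    return x;
}
```

Insertion does no restructuring; in the textbook algorithm the trees are only merged later, when the minimum is extracted and the heap is consolidated.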
 
  
G.6 Fibonacci.c


// HEADER FILE
#include "Fibonacci.h"

//*********************************************************************
void FHPQ_create_heap(struct fiblist **heap, int *num)
{
    // CREATE NEW FIBONACCI-HEAP
 :
}

//*********************************************************************
struct fiblist *FHPQ_insert(struct fiblist **heap, int *num, int key, int id)
{
    // INSERT NEW ELEMENT INTO FIBONACCI-HEAP
 :
}

//*********************************************************************
struct fiblist *FHPQ_extract_min(struct fiblist **heap, int *num)
{
    // EXTRACT THE HIGHEST PRIORITY ELEMENT
    // (i.e. THE ELEMENT WITH MINIMUM KEY VALUE),
    // THEN CONSOLIDATE THE HEAP.
 :
}

//*********************************************************************
void FHPQ_decrease_key(struct fiblist **heap, struct fiblist *x, int k)
{
    // SEARCH FOR THE ELEMENT,
 :
    // IF (new_key < old_key), MEANING new_key HAS HIGHER PRIORITY,
    //     REPLACE old_key ← new_key,
    //     CONSOLIDATE THE HEAP.
    // END IF
 :
 :
}

//*********************************************************************
struct fiblist *FHPQ_min(struct fiblist **heap, int *num)
{
    // PEEK ON THE HIGHEST PRIORITY ELEMENT
 :
}


G.7 HybridPQ.h


#ifndef _HybridPQ_H_ 
#define _HybridPQ_H_ 
 
// HEADER FILE 
#include "fiblist_struct.h" 
#include "fibonacci.h" 
#include "SAPQ.h" 
 
// SPECIFY THE LENGTH OF HARDWARE SAPQ 
#define LENGTH_OF_SAPQ  250 
 
// FUNCTION PROTOTYPE 
void HybridPQ_reset(void); 
void HybridPQ_insert(int i_priority, int i_pointer); 
void HybridPQ_extract(int *i_priority, int *i_pointer); 
 
#endif 
  
G.8 HybridPQ.c 

// HEADER FILE
#include "HybridPQ.h"


// SOME PRIVATE VARIABLES
struct fiblist *helppo;   // pointer to the newly inserted (or extracted) element.
struct fiblist *heap;     // pointer to the minimum element of FHPQ.
int num;                  // holds the number of elements in FHPQ.
int queueCount;           // watchdog to guard the length of SAPQ.


//**********************************************************************
void HybridPQ_reset(void)
// Description: initialise the hybridPQ module.
//**********************************************************************
{
  SAPQ_reset();                  // reset HW SAPQ.
  FHPQ_create_heap(&heap, &num); // reset SW FHPQ.
  queueCount = 0;                // reset watchdog counter.
}


//**********************************************************************
void HybridPQ_insert(int i_priority, int i_pointer)
// Insert a new element into hybridPQ.
//**********************************************************************
{
  if (queueCount < LENGTH_OF_SAPQ)
  {
    SAPQ_insert(i_priority, i_pointer);   // room left in HW SAPQ.
    queueCount++;
  } else {
    helppo = FHPQ_insert(&heap, &num, i_priority, i_pointer); // overflow to SW FHPQ.
  }
}


//**********************************************************************
void HybridPQ_extract(int *i_priority, int *i_pointer)
// Extract the minimum entry of hybridPQ.
//**********************************************************************
{
  int temp_priority;
  int temp_pointer;

  SAPQ_peek(&temp_priority, &temp_pointer);   // peek HW SAPQ

  // peek SW FHPQ & compare; the FHPQ may be empty (heap == KOSONG),
  // in which case heap->key must not be dereferenced.
  if ((heap == KOSONG) || (temp_priority < heap->key))
  {
    *i_priority = temp_priority;   // higher priority at SAPQ
    *i_pointer  = temp_pointer;
    SAPQ_delete();                 // delete from SAPQ.
    queueCount--;
  } else {                         // higher priority at FHPQ
    helppo = FHPQ_extract_min(&heap, &num);   // delete from FHPQ.
    *i_pointer  = helppo->id;
    *i_priority = helppo->key;
  }
}
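
The insert and extract policy of the listing above can be exercised without the hardware by a small software model. The sketch below is an illustration under stated assumptions: both queues are replaced by plain arrays with a linear minimum search (stand-ins for the hardware SAPQ and the Fibonacci heap, not implementations of them), and all names (hybrid_reset, hybrid_insert, hybrid_extract, SAPQ_LEN, SW_MAX) are hypothetical. SAPQ_LEN is set to 2 only so that the overflow path is reached quickly.

```c
#define SAPQ_LEN 2    /* tiny "hardware" capacity so the overflow path is exercised */
#define SW_MAX   16   /* capacity of the software stand-in queue */

typedef struct { int priority; int pointer; } entry;

static entry sapq[SAPQ_LEN]; static int sapq_count;  /* stand-in for HW SAPQ */
static entry swpq[SW_MAX];   static int swpq_count;  /* stand-in for SW FHPQ */

void hybrid_reset(void) { sapq_count = 0; swpq_count = 0; }

/* Index of the minimum-priority entry, or -1 if the array is empty. */
int min_index(const entry *a, int n)
{
    int best = -1;
    for (int i = 0; i < n; i++)
        if (best < 0 || a[i].priority < a[best].priority)
            best = i;
    return best;
}

/* Fill the hardware queue first; overflow into the software queue.
   Returns 1 on success, 0 if both queues are full. */
int hybrid_insert(int priority, int pointer)
{
    entry e = { priority, pointer };
    if (sapq_count < SAPQ_LEN) { sapq[sapq_count++] = e; return 1; }
    if (swpq_count < SW_MAX)   { swpq[swpq_count++] = e; return 1; }
    return 0;
}

/* Compare the two minima and remove the smaller one.
   Returns 1 on success, 0 if the hybrid queue is empty. */
int hybrid_extract(int *priority, int *pointer)
{
    int h = min_index(sapq, sapq_count);
    int s = min_index(swpq, swpq_count);
    entry *src;
    int *count;
    int idx;

    if (h < 0 && s < 0)
        return 0;
    if (s < 0 || (h >= 0 && sapq[h].priority < swpq[s].priority)) {
        src = sapq; count = &sapq_count; idx = h;   /* minimum is in the SAPQ */
    } else {
        src = swpq; count = &swpq_count; idx = s;   /* minimum is in the FHPQ */
    }
    *priority = src[idx].priority;
    *pointer  = src[idx].pointer;
    src[idx] = src[--*count];   /* fill the hole with the last entry */
    return 1;
}
```

Because every extract compares the minima of both queues, the overall minimum is returned even when a high-priority element was pushed to the software queue after the hardware queue filled up.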
 APPENDIX H 
 
 
 
SAMPLE GRAPHS FOR PERFORMANCE TEST AND EVALUATION 
 
 
 
 
 This appendix presents some of the sample graphs used in performance testing and 
evaluation. 
 
 
 
 
H.1 HIGH DENSE GRAPHS 
 
 
Figure 1. Sample of 20*20 (400 nodes) 
 
  
 
Figure 2. Sample of 60*60 (3,600 nodes) 
 
 
Figure 3. Sample of 70*70 (4,900 nodes) 
  
H.2 LESS DENSE GRAPHS 
 
 
Figure 4. Sample of 20*20 (400 nodes) 
 
 
 
Figure 7. Sample of 60*60 (3,600 nodes) 
APPENDIX I 
 
 
 
 
DESIGN VERIFICATION - SIMULATION WAVEFORMS 
 
 
 
 
 This appendix contains the simulation waveforms obtained during the 
verification of the designed modules. It presents the simulation of the processing 
element, the Hardware Priority Queue Unit (hwPQ), and the Avalon Interface Unit. 
Simulation of the Priority Queue Accelerator Module is discussed in Chapter 7. 
 
 
 
I.1 Simulation of Processing Element (PE) 
 
 
 As explained in Chapter 6, each PE is an autonomous processing unit. It 
responds to the assertion of reset, writeP and readP. It has two storage locations: 
one holds the higher-priority element and the other holds the lower-priority element. 
The higher-priority element is accessible at output port readdataP, while the 
lower-priority element is accessible at output port writedataS. Initially the PE is 
reset and both storage locations contain the infinite value (0xFFFFFFFFFFFFFFFF), 
see Figure I.1. Recall that, for each element, the lower 32 bits hold the actual 
priority value whereas the upper 32 bits hold the identifier associated with that 
priority value. In the following discussion, the identifier is omitted. 
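
The element layout described above can be written down as a few helper functions. This is a minimal sketch for illustration; the names pe_pack, pe_priority, pe_id and PE_EMPTY are hypothetical and do not appear in the design files.

```c
#include <stdint.h>

/* The "infinite" value held by a PE after reset. */
#define PE_EMPTY UINT64_C(0xFFFFFFFFFFFFFFFF)

/* Pack a 32-bit identifier (upper half) and a 32-bit priority value
   (lower half) into one 64-bit queue element. */
uint64_t pe_pack(uint32_t id, uint32_t priority)
{
    return ((uint64_t)id << 32) | (uint64_t)priority;
}

/* Lower 32 bits: the priority value. */
uint32_t pe_priority(uint64_t element)
{
    return (uint32_t)(element & UINT64_C(0xFFFFFFFF));
}

/* Upper 32 bits: the identifier. */
uint32_t pe_id(uint64_t element)
{
    return (uint32_t)(element >> 32);
}
```

For example, identifier 0xAAAAAAAA with priority value 0x38 packs to 0xAAAAAAAA00000038.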
 
 
Figure I.1: Simulation Waveform of INSERT operation on PE 
 
 
For an INSERT operation, writeP is asserted and the new element is presented at 
input port writedataP. The new element (at writedataP) is compared with the old 
element (at readdataP); a smaller value means a higher priority. As illustrated in 
operation-1, 0x00000038 has higher priority than 0xFFFFFFFF, so 0x00000038 is 
stored (at readdataP) while 0xFFFFFFFF is passed to the next PE (at writedataS). In 
operation-2, the old element 0x00000038 has higher priority than the new element 
0x00000053, so it remains (at readdataP) while 0x00000053 is passed to the next PE 
(at writedataS). In operation-3, the new element 0x00000018 has higher priority than 
the old element 0x00000038, so 0x00000018 is stored (at readdataP) while 0x00000038 
is passed to the next PE (at writedataS). For every INSERT operation, the next PE is 
notified with one cycle of the writeS control signal. 
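
The compare-and-forward behaviour of one PE during INSERT can be modelled in a few lines. This is a behavioural sketch only: clock cycles and the writeS handshake are not modelled, the identifier is omitted as in the discussion above, and the treatment of equal priorities (the new element is forwarded) is an assumption, since the text does not state it. The type pe_model and the functions pe_reset and pe_insert are hypothetical names; the field names mirror the port names used above.

```c
#include <stdint.h>

typedef struct {
    uint32_t readdataP;   /* stored higher-priority element (smaller value) */
    uint32_t writedataS;  /* lower-priority element passed to the next PE */
} pe_model;

/* After reset both values are "infinite". */
void pe_reset(pe_model *pe)
{
    pe->readdataP  = 0xFFFFFFFFu;
    pe->writedataS = 0xFFFFFFFFu;
}

/* INSERT: keep the higher-priority element, forward the other one. */
void pe_insert(pe_model *pe, uint32_t writedataP)
{
    if (writedataP < pe->readdataP) {
        pe->writedataS = pe->readdataP;  /* old element moves on */
        pe->readdataP  = writedataP;     /* new element is stored */
    } else {
        pe->writedataS = writedataP;     /* new element moves on */
    }
}
```

Calling pe_insert with 0x38, 0x53 and 0x18 in turn after pe_reset reproduces operation-1 to operation-3: the stored value becomes 0x38, stays 0x38, then becomes 0x18, while the forwarded value is 0xFFFFFFFF, 0x53 and 0x38 respectively.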
 
 
 
Figure I.2: Simulation Waveform of EXTRACT operation on PE 
 
 
For an EXTRACT operation, readP is asserted and the higher-priority element is 
available at output port readdataP. The vacated storage is then filled with the 
element from the next PE (via readdataS). See operation-4 in Figure I.2: element 
0x00000018 is extracted (at readdataP), and the vacancy is filled with 0xFFFFFFFF 
from the next PE (via readdataS). Like the INSERT operation, the EXTRACT operation 
notifies the next PE with one cycle of the readS control signal. Both INSERT and 
EXTRACT take two clock cycles at each PE. 
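
The EXTRACT step at a single PE can be sketched in the same behavioural style. Timing and the readS handshake are not modelled, and pe_model and pe_extract are hypothetical names used only for this illustration.

```c
#include <stdint.h>

#define PE_INFINITE 0xFFFFFFFFu   /* value of an empty storage location */

typedef struct {
    uint32_t readdataP;   /* stored higher-priority element */
} pe_model;

/* EXTRACT: hand out the stored element and refill the vacancy with the
   value offered by the next PE on readdataS. */
uint32_t pe_extract(pe_model *pe, uint32_t readdataS)
{
    uint32_t top = pe->readdataP;
    pe->readdataP = readdataS;
    return top;
}
```

With the PE holding 0x00000018 and the next PE offering 0xFFFFFFFF, the call returns 0x00000018 and leaves 0xFFFFFFFF stored, as in operation-4.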
 
 
 
 
I.2 Simulation of Hardware Priority Queue Unit (hwPQ) 
 
 
 The Hardware Priority Queue Unit (hwPQ) consists of n identical PEs to 
support a worst-case priority queue size of n. In this work, the design targets the 
Altera Stratix II EP2S60F672C3ES FPGA device, on which a Priority Queue Computation 
Unit of 250 PEs is implemented (hwPQ-250). The number can be higher on other FPGA 
devices. On the latest Altera FPGA device at the time of writing (Stratix III), the 
implementation can have up to 2000 PEs (thus, hwPQ-2000). However, a full hwPQ-250 
takes a very long time to simulate, e.g. inserting 250 elements in sequence and then 
extracting all 250 elements in sequence. As the hwPQ is a parameterized design built 
from identical PEs, it is reasonable to assume that if a small design (hwPQ-4) is 
functionally correct, then a large design (hwPQ-250) will also be functionally 
correct. 
 
 
Therefore, for illustration in the simulation waveforms, hwPQ-4 is implemented. 
To demonstrate the functionality of the priority queue, a series of INSERT and 
EXTRACT operations is invoked on the hwPQ. The set of test vectors used in the 
simulation is given in Table I.1. All four possible orderings of two consecutive 
operations are covered: INSERT-then-INSERT, EXTRACT-then-EXTRACT, 
INSERT-then-EXTRACT, and EXTRACT-then-INSERT. 
 
 
 
 
 
 
 
Table I.1: Set of test vectors used to simulate hwPQ-4 

Operation   Type      Identifier   Priority Value 
1           INSERT    AAAAAAAA     00000038 
2           INSERT    BBBBBBBB     00000053 
3           INSERT    CCCCCCCC     00000018 
4           INSERT    DDDDDDDD     00000009 
5           EXTRACT   DDDDDDDD     00000009 
6           EXTRACT   CCCCCCCC     00000018 
7           EXTRACT   AAAAAAAA     00000038 
8           INSERT    EEEEEEEE     00006522 
9           EXTRACT   BBBBBBBB     00000053 
10          INSERT    FFFFFFFF     00005866 
11          EXTRACT   FFFFFFFF     00005866 
12          EXTRACT   EEEEEEEE     00006522 
13          EXTRACT   FFFFFFFF     FFFFFFFF 
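
The expected results in Table I.1 can be cross-checked with a small software model of hwPQ-4. The sketch below is a behavioural approximation under stated assumptions: the chain of PEs is modelled as an array that is kept sorted by letting each insert ripple through it, cycle timing and the NO-OPERATION protocol are ignored, and an element that falls off the end of a full queue is silently dropped. The names hwpq_model, hwpq_reset, hwpq_insert and hwpq_extract are hypothetical.

```c
#include <stdint.h>

#define HWPQ_N     4                              /* number of PEs (hwPQ-4) */
#define HWPQ_EMPTY UINT64_C(0xFFFFFFFFFFFFFFFF)   /* "infinite" element */

typedef struct { uint64_t cell[HWPQ_N]; } hwpq_model;

/* Priority value = lower 32 bits of an element. */
uint32_t prio(uint64_t element) { return (uint32_t)element; }

void hwpq_reset(hwpq_model *q)
{
    for (int i = 0; i < HWPQ_N; i++)
        q->cell[i] = HWPQ_EMPTY;
}

/* INSERT: each PE keeps the higher-priority element and forwards the other. */
void hwpq_insert(hwpq_model *q, uint32_t id, uint32_t priority)
{
    uint64_t carry = ((uint64_t)id << 32) | priority;
    for (int i = 0; i < HWPQ_N; i++) {
        if (prio(carry) < prio(q->cell[i])) {
            uint64_t t = q->cell[i];
            q->cell[i] = carry;
            carry = t;
        }
    }
    /* carry is dropped here (it is HWPQ_EMPTY unless the queue was full) */
}

/* EXTRACT: return the top element; every PE refills from its successor. */
uint64_t hwpq_extract(hwpq_model *q)
{
    uint64_t top = q->cell[0];
    for (int i = 0; i < HWPQ_N - 1; i++)
        q->cell[i] = q->cell[i + 1];
    q->cell[HWPQ_N - 1] = HWPQ_EMPTY;
    return top;
}
```

Applying the thirteen operations of Table I.1 to this model yields the identifiers and priority values listed for each EXTRACT, including 0xFFFFFFFFFFFFFFFF for the final EXTRACT on the empty queue.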
 
 
 
Figure I.3: Simulation Waveform of hwPQ 
(The waveform is annotated with the INSERT and EXTRACT operations of Table I.1 in 
sequence, with interleaved NO-OPERATION cycles. After the final EXTRACT the data 
port shows FFFFFFFFFFFFFFFF, i.e. the hwPQ is empty.) 

 When the communication protocol in Chapter 6 is strictly followed, each 
EXTRACT is preceded by a NO-OPERATION; otherwise, the operation fails. 
Figure I.3 shows the simulation waveform with the proper assertion of 
NO-OPERATION. 
 
 
 
 
I.3 Simulation of hwPQ_Avalon_Interface_Unit 
 
 
 The hwPQ_Avalon_Interface_Unit (avalonInterfaceUnit) is designed to 
interface data communication between the Avalon system bus and the hwPQ. It 
provides memory-mapped registers that hold input data to the hwPQ and hold output 
data from the hwPQ while awaiting the system bus fetch cycle. In addition, it 
generates the one-clock-cycle active-high writeP and readP pulses to the hwPQ, 
ensuring that the hwPQ carries out exactly the requested operation. Figure I.4 
shows the simulation waveform of avalonInterfaceUnit triggering a RESET operation 
on the hwPQ. First, operation mode “reset” (0x4) is written into 
avalonInterfaceUnit; the corresponding control signal, hwPQ_reset, is then 
generated to trigger the RESET operation on the hwPQ. Next, the operation mode 
must be cleared so that the hwPQ ends the operation (i.e. NO-OPERATION on hwPQ). 
This is done by writing operation mode “do_nothing” (0x0) to avalonInterfaceUnit. 
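
The relation between the operation mode register and the control signals can be summarised as a decode function. This is a sketch of the mapping only: the operation mode codes (0x0, 0x1, 0x2, 0x4) are taken from this appendix, but the one-clock-cycle pulse generation is not modelled, and the names hwpq_ctrl, decode_opmode and the OPMODE_ constants are hypothetical.

```c
#include <stdint.h>

/* Operation mode codes as used in this appendix. */
enum {
    OPMODE_DO_NOTHING = 0x0,
    OPMODE_EXTRACT    = 0x1,
    OPMODE_INSERT     = 0x2,
    OPMODE_RESET      = 0x4
};

/* Control signals driven towards the hwPQ. */
typedef struct {
    int hwPQ_reset;
    int hwPQ_writeP;
    int hwPQ_readP;
} hwpq_ctrl;

/* Map the content of REG_OPMODE to the control signal to be asserted. */
hwpq_ctrl decode_opmode(uint32_t reg_opmode)
{
    hwpq_ctrl c = { 0, 0, 0 };
    switch (reg_opmode) {
    case OPMODE_RESET:   c.hwPQ_reset  = 1; break;
    case OPMODE_INSERT:  c.hwPQ_writeP = 1; break;
    case OPMODE_EXTRACT: c.hwPQ_readP  = 1; break;
    default:             break;   /* do_nothing: all signals de-asserted */
    }
    return c;
}
```

A RESET therefore corresponds to decoding 0x4 (hwPQ_reset asserted) followed by decoding 0x0 (all signals de-asserted).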
 
 
Figure I.4: Simulation of avalonInterfaceUnit: RESET on hwPQ 
Annotations in Figure I.4: 
When address = 100₂, REG_OPMODE ← writedata. 
Here, REG_OPMODE = 0x4, meaning assert hwPQ_reset. 

When address = 100₂, REG_OPMODE ← writedata. 
Here, REG_OPMODE = 0x0, meaning NO-OPERATION (so de-assert hwPQ_reset). 

Note on the signal names: 
hwPQ_reset = SAPQ_reset 
hwPQ_writeP = SAPQ_writeP 
hwPQ_readP = SAPQ_readP 

Figure I.5 shows the simulation waveform of avalonInterfaceUnit triggering an 
EXTRACT operation on the hwPQ. First, readdataP[31:0], i.e. the TOP priority value 
from the hwPQ, is copied onto the Avalon readdata bus. Then readdataP[63:32], i.e. 
the identifier associated with that TOP priority value, is copied onto the Avalon 
readdata bus. This completes a PEEK operation; for an EXTRACT, the hwPQ must 
additionally be told to discard that TOP priority element. Therefore, operation 
mode “extract” (0x1) is written into avalonInterfaceUnit, which generates exactly 
ONE clock cycle of hwPQ_readP to trigger one EXTRACT operation on the hwPQ. As 
before, the operation mode must then be cleared so that the hwPQ ends the operation 
(i.e. NO-OPERATION on hwPQ). This is done by writing operation mode “do_nothing” 
(0x0) to avalonInterfaceUnit. 
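
From the software side, the EXTRACT sequence amounts to two reads followed by two writes. The sketch below runs that sequence against a mock of the interface unit instead of real hardware. The register offsets (000, 001 and 100 in binary) follow the annotations of Figure I.5 as read here; the mock itself and the names mock_unit, bus_read, bus_write and drv_extract are hypothetical.

```c
#include <stdint.h>

#define ADDR_RDATA_LO 0x0u   /* binary 000: readdataP[31:0], priority value */
#define ADDR_RDATA_HI 0x1u   /* binary 001: readdataP[63:32], identifier */
#define ADDR_OPMODE   0x4u   /* binary 100: REG_OPMODE */

/* Mock of the interface unit: holds the top element and counts extract requests. */
typedef struct {
    uint64_t readdataP;      /* top element presented by the mock hwPQ */
    uint32_t reg_opmode;     /* last value written to REG_OPMODE */
    int      extract_pulses; /* how many times mode "extract" was written */
} mock_unit;

uint32_t bus_read(const mock_unit *u, uint32_t addr)
{
    if (addr == ADDR_RDATA_LO) return (uint32_t)u->readdataP;
    if (addr == ADDR_RDATA_HI) return (uint32_t)(u->readdataP >> 32);
    return 0;
}

void bus_write(mock_unit *u, uint32_t addr, uint32_t data)
{
    if (addr != ADDR_OPMODE)
        return;
    u->reg_opmode = data;
    if (data == 0x1u)
        u->extract_pulses++;   /* stands in for one hwPQ_readP pulse */
}

/* PEEK the top element, then request EXTRACT and clear the operation mode. */
void drv_extract(mock_unit *u, uint32_t *priority, uint32_t *id)
{
    *priority = bus_read(u, ADDR_RDATA_LO);
    *id       = bus_read(u, ADDR_RDATA_HI);
    bus_write(u, ADDR_OPMODE, 0x1u);   /* "extract" */
    bus_write(u, ADDR_OPMODE, 0x0u);   /* "do_nothing" */
}
```

Omitting the two writes turns the same sequence into a PEEK, since the top element is read but never discarded.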
 
 
 
Figure I.5: Simulation of avalonInterfaceUnit: EXTRACT on hwPQ 
Annotations in Figure I.5: 
When address = 000₂, readdata ← readdataP[31:0]. 
When address = 001₂, readdata ← readdataP[63:32]. 

When address = 100₂, REG_OPMODE ← writedata. 
Here, REG_OPMODE = 0x1, meaning assert hwPQ_readP. 

When address = 100₂, REG_OPMODE ← writedata. 
Here, REG_OPMODE = 0x0, meaning NO-OPERATION (so de-assert hwPQ_readP). 

 
Figure I.6: Simulation of avalonInterfaceUnit: INSERT on hwPQ 
 
 
Figure I.6 shows the simulation waveform of avalonInterfaceUnit triggering an 
INSERT operation on the hwPQ. First, the priority value of the new element is copied 
into writedataP[31:0], i.e. the lower 32 bits of the hwPQ input data port. Then, the 
identifier of the new element is copied into writedataP[63:32], i.e. the upper 32 
bits of the hwPQ input data port. Next, operation mode “insert” (0x2) is written 
into avalonInterfaceUnit, which generates exactly ONE clock cycle of hwPQ_writeP to 
trigger one INSERT operation on the hwPQ. As in the other operations, the operation 
mode is then cleared by writing operation mode “do_nothing” (0x0) to 
avalonInterfaceUnit. 
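
The INSERT sequence can be sketched in the same way as four register writes against a mock of the interface unit. The offsets used for the two halves of writedataP (010 and 001 in binary) are taken from the annotations of Figure I.6 as they appear in this appendix and should be treated as assumptions; the mock and the names mock_unit, bus_write and drv_insert are hypothetical.

```c
#include <stdint.h>

#define ADDR_WDATA_LO 0x2u   /* binary 010: writedataP[31:0], priority value */
#define ADDR_WDATA_HI 0x1u   /* binary 001: writedataP[63:32], as annotated */
#define ADDR_OPMODE   0x4u   /* binary 100: REG_OPMODE */
#define OPMODE_INSERT 0x2u

/* Mock of the interface unit: assembles writedataP and records inserts. */
typedef struct {
    uint64_t writedataP;     /* input data register built from two writes */
    uint32_t reg_opmode;     /* last value written to REG_OPMODE */
    uint64_t last_inserted;  /* element seen when "insert" was requested */
    int      insert_pulses;  /* how many times mode "insert" was written */
} mock_unit;

void bus_write(mock_unit *u, uint32_t addr, uint32_t data)
{
    if (addr == ADDR_WDATA_LO) {
        u->writedataP = (u->writedataP & UINT64_C(0xFFFFFFFF00000000)) | data;
    } else if (addr == ADDR_WDATA_HI) {
        u->writedataP = (u->writedataP & UINT64_C(0x00000000FFFFFFFF))
                      | ((uint64_t)data << 32);
    } else if (addr == ADDR_OPMODE) {
        u->reg_opmode = data;
        if (data == OPMODE_INSERT) {   /* stands in for one hwPQ_writeP pulse */
            u->last_inserted = u->writedataP;
            u->insert_pulses++;
        }
    }
}

/* Load both halves of the element, request INSERT, then clear the mode. */
void drv_insert(mock_unit *u, uint32_t priority, uint32_t id)
{
    bus_write(u, ADDR_WDATA_LO, priority);
    bus_write(u, ADDR_WDATA_HI, id);
    bus_write(u, ADDR_OPMODE, OPMODE_INSERT);
    bus_write(u, ADDR_OPMODE, 0x0u);   /* "do_nothing" */
}
```

Both data words must be written before the operation mode, because the element is latched at the moment the insert request is seen.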
 
Annotations in Figure I.6: 
When address = 010₂, writedataP[31:0] ← writedata. 
When address = 001₂, writedataP[63:32] ← writedata. 

When address = 100₂, REG_OPMODE ← writedata. 
Here, REG_OPMODE = 0x2, meaning assert hwPQ_writeP. 

When address = 100₂, REG_OPMODE ← writedata. 
Here, REG_OPMODE = 0x0, meaning NO-OPERATION (so de-assert hwPQ_writeP). 
