Circuit design and analysis for on-FPGA communication systems by Mak, Terrence Sui-Tung
CIRCUIT DESIGN AND ANALYSIS FOR
ON-FPGA COMMUNICATION SYSTEMS
TERRENCE SUI-TUNG MAK
A Thesis submitted for the degree of Doctor of Philosophy and for the Diploma of
Membership of the Imperial College
Department of Electrical and Electronic Engineering
Imperial College of Science, Technology and Medicine
October 2009
Abstract
On-chip communication systems have emerged as a prominent subject in Very-Large-Scale-Integration
(VLSI) design, as the trend of technology scaling favours logic more than interconnects.
Interconnects often dictate system performance, and research into new methodologies and system
architectures that deliver high-performance communication services across the chip is therefore
essential. The interconnect challenge is exacerbated in the Field-Programmable Gate Array (FPGA),
a type of ASIC whose hardware can be programmed post-fabrication. Communication across an FPGA
will deteriorate as a result of interconnect scaling. The programmable fabrics, switches and the
specific routing architecture also introduce additional latency and bandwidth degradation,
further hindering intra-chip communication performance.
Past research efforts mainly focused on optimizing logic elements and functional units in FPGAs.
Communication over programmable interconnect has received little attention and is inadequately
understood. This thesis is among the first to study on-chip communication systems built on top of
programmable fabrics, and it proposes methodologies to maximize interconnect throughput. There
are three major contributions in this thesis: (i) an analysis of on-chip interconnect fringing,
which degrades the bandwidth of communication channels due to routing congestion in
reconfigurable architectures; (ii) a new analogue wave signalling scheme that significantly
improves interconnect throughput by exploiting the fundamental electrical characteristics of the
reconfigurable interconnect structures, and which can potentially mitigate the interconnect
scaling challenges; and (iii) a novel Dynamic Programming (DP)-network that provides adaptive
routing in network-on-chip (NoC) systems. The DP-network architecture performs run-time
optimization for route planning and dynamic routing, which effectively utilizes the in-silicon
bandwidth. This thesis explores a new horizon in reconfigurable system design, in which new
methodologies and concepts are proposed to enhance on-FPGA communication throughput, which is
of vital importance in new technology processes.
Acknowledgements
The research presented in this thesis was carried out under the supervision of Peter Y. K. Cheung
and Wayne Luk.
I am heartily thankful to Peter, who was the first to introduce me to the fields of both
network-on-chip and asynchronous circuits. His guidance and support throughout my work have been
invaluable. In particular, his constant encouragement, positive feedback and inspirational
annotations to my research have been most memorable. I also want to thank Peter for sending me to
work as an intern at Sun Microsystems Laboratories in California; the benefit of that
experience goes a long way.
I would also like to express my gratitude to Wayne, for his persistent and invaluable suggestions
and comments on my research. His kind interest in my professional and academic
development has been extremely helpful.
I would also like to thank Pete Sedcole. It was truly a privilege to work with and learn from
someone with such remarkable technical skills.
I would also like to thank Prof. Alex Yakovlev. The interesting discussions with Alex on several
occasions have been very inspirational and have led to joint publications and a joint research
grant proposal. His dedication to innovation and boundless enthusiasm have been motivating.
The Croucher Foundation Scholarship provided financial support for this research.
I would like to thank my fiancée, Ophelia Tsui, for her outstanding and constant support
throughout my study. Our “4-year English holiday” left a beautiful reminiscence.
I would also like to thank my parents, Wai-Man Mak and Hung-Ping Leung, for their love and
support.
Contents
1 Introduction 17
1.1 Motivation and Objective . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1.2 Overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
1.3 Statement of Originality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
2 Background 23
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2.2 On-Chip Communication Models . . . . . . . . . . . . . . . . . . . . . . . . . . 24
2.2.1 Point-to-Point Interconnect . . . . . . . . . . . . . . . . . . . . . . . . . 25
2.2.2 Bus Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
2.2.3 Network-on-Chip (NoC) . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3 Architecture Taxonomy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
2.3.1 Soft Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29
2.3.2 Hard Architectures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
2.3.3 Interface and Communication Controller . . . . . . . . . . . . . . . . . 36
2.3.4 GALS-Based Architecture . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.4 Design Criteria . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.4.1 Quantitative Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
2.4.2 Qualitative Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
2.5 Design Methods . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.5.1 Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
2.5.2 Routing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
2.5.3 Deadlock Avoidance . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.5.4 Summary of On-Chip Routing Schemes . . . . . . . . . . . . . . . . . . 46
2.5.5 Globally Asynchronous Locally Synchronous (GALS) . . . . . . . . . . 46
2.6 High-Performance Intra-Chip Signalling . . . . . . . . . . . . . . . . . . . . . . 51
2.6.1 The Princeton Approach . . . . . . . . . . . . . . . . . . . . . . . . . . 51
2.6.2 The GIT Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
2.6.3 The KAIST Approach . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
2.6.4 The Newcastle Approach . . . . . . . . . . . . . . . . . . . . . . . . . . 53
2.6.5 The Cambridge Approach . . . . . . . . . . . . . . . . . . . . . . . . . 55
2.6.6 A Summary of High-Performance Intra-Chip Signalling . . . . . . . . . 56
2.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
3 Fringed Interconnects and Bandwidth Degradation in Communication Links 59
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
3.1.1 Degraded Bandwidth in Communication Links . . . . . . . . . . . . . . 59
3.1.2 Interconnection Length Prediction . . . . . . . . . . . . . . . . . . . . . 60
3.2 A General Communication Link Model . . . . . . . . . . . . . . . . . . . . . . 62
3.2.1 An Island-Based FPGA Model . . . . . . . . . . . . . . . . . . . . . . . 62
3.2.2 An Interconnections Model . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.2.3 Fringed Interconnections . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.2.4 Notations and Indexes . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.3 A Stochastic Model for Interconnections Length Prediction . . . . . . . . . . . . 68
3.3.1 Average Interconnections Length . . . . . . . . . . . . . . . . . . . . . 68
3.3.2 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.3.3 Variance of the Interconnections Length . . . . . . . . . . . . . . . . . . 78
3.4 Results for the Stochastic Modelling . . . . . . . . . . . . . . . . . . . . . . . . 79
3.4.1 Interconnection Lengths Prediction . . . . . . . . . . . . . . . . . . . . 79
3.4.2 Interconnection Delay Prediction . . . . . . . . . . . . . . . . . . . . . 82
3.4.3 Bandwidth Degradation . . . . . . . . . . . . . . . . . . . . . . . . . . 83
3.5 Model Assumptions and Future Work . . . . . . . . . . . . . . . . . . . . . . . 86
3.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
4 Wave-Pipelined Intra-Chip Signalling in FPGAs 89
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
4.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
4.3 Global Interconnect Model for FPGAs . . . . . . . . . . . . . . . . . . . . . . . 91
4.3.1 Multiple-Stage Model . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.3.2 Waveform Approximation at Each Stage . . . . . . . . . . . . . . . . . . 95
4.3.3 Delay of Long Lines in FPGA . . . . . . . . . . . . . . . . . . . . . . . 96
4.4 Wave-Pipelined Signalling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
4.4.1 Throughput of Wave-Pipelined Links . . . . . . . . . . . . . . . . . . . 98
4.4.2 Register-Based Pipelining Link . . . . . . . . . . . . . . . . . . . . . . 103
4.5 Wave-Pipelining Circuit Design in FPGAs . . . . . . . . . . . . . . . . . . . . . 105
4.5.1 Phase Adaptation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
4.5.2 Multi-Phase Oversampling . . . . . . . . . . . . . . . . . . . . . . . . . 108
4.5.3 Bit Error Rate Computation . . . . . . . . . . . . . . . . . . . . . . . . 109
4.5.4 Register Pipelining: Synchronous Versus Asynchronous . . . . . . . . . 112
4.5.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
4.6 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
4.6.1 Comparing Analytical Results with SPICE Simulations . . . . . . . . . . 114
4.6.2 Comparing Wave-Pipelining with Delay-Based Signalling . . . . . . . . 116
4.6.3 Evaluation of Wave-Pipelining in a Real FPGA . . . . . . . . . . . . . . 119
4.7 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 124
5 A DP-Network for On-Chip Dynamic Routing 127
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
5.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5.2.1 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
5.2.2 Shortest Path Computation . . . . . . . . . . . . . . . . . . . . . . . . . 130
5.3 Shortest Path Computation Using DP-Network . . . . . . . . . . . . . . . . . . 132
5.3.1 Discrete and Continuous-Time Formulations . . . . . . . . . . . . . . . 134
5.3.2 Convergence of the Network . . . . . . . . . . . . . . . . . . . . . . . . 135
5.3.3 Numerical Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
5.3.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 140
5.4 NoC Routing with DP-Network . . . . . . . . . . . . . . . . . . . . . . . . . . 141
5.4.1 Routing Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
5.4.2 Optimality and Memory Trade-off . . . . . . . . . . . . . . . . . . . . . 144
5.5 Results and Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
5.5.1 Simulation Environment . . . . . . . . . . . . . . . . . . . . . . . . . . 148
5.5.2 Results for Average Packet Delay . . . . . . . . . . . . . . . . . . . . . 149
5.5.3 Results for k-Step Look Ahead . . . . . . . . . . . . . . . . . . . . . . 152
5.5.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 153
5.6 Hardware Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
5.6.1 Router Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 155
5.6.2 Results of FPGA Implementations . . . . . . . . . . . . . . . . . . . . . 157
5.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
6 Conclusion and Future Work 161
6.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 161
6.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
A A Simple Approximation of Interconnections Length 165
A.1 Average Interconnections Length . . . . . . . . . . . . . . . . . . . . . . . . . . 165
A.2 Mathematical Derivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 166
A.2.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
A.3 Results for the Simple Approximation . . . . . . . . . . . . . . . . . . . . . . . 168
B Power Consumptions for Long Interconnections 171
C Wave-Pipelined Interconnect Testing 175
D Interconnect Design for Maximizing Throughput 178
D.1 Throughput Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
D.2 Throughput Maximizing Buffer Design . . . . . . . . . . . . . . . . . . . . . . 180
D.2.1 Cascaded Buffers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 180
D.2.2 Transistor Sizing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181
D.3 Results and Discussions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
D.3.1 Throughput Comparison . . . . . . . . . . . . . . . . . . . . . . . . . . 183
D.3.2 Power Dissipation Comparison . . . . . . . . . . . . . . . . . . . . . . . 184
D.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
Bibliography 201
List of Figures
1.1 Projected relative delay for local and global wires and for logic gates in technolo-
gies of the near future. [EKRZ02]. . . . . . . . . . . . . . . . . . . . . . . . . . 18
2.1 (a) Classification of on-chip communication architectures. (b) Classification of
architectures is based on how many processing units share the channel logically
and how many processing units are connected to the channel physically. . . . . . . 25
2.2 The taxonomy of on-FPGA communication architectures. . . . . . . . . . . . . . 28
2.3 (a) CoreConnect Bus architecture [IBM99]. (b) Sonic custom bus architecture
[SCCL04]. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
2.4 Different topological structures of Network-on-Chips (a) Mesh, (b) Torus, (c)
Folded-torus, (d) Ring, (e) Tree and (f) Irregular . . . . . . . . . . . . . . . . . . 31
2.5 (a) Conventional interconnect architecture, in which LUTs are connected through
wires with programmable switches. (b) Bundled wire, which can save the hard-
ware area by reducing number of programmable switches. . . . . . . . . . . . . 34
2.6 Interconnect topological structures: (a) Manhattan-style, (b) Fat-tree and (c) Mesh
of tree (MoT) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
2.7 (a) Adaptive System-on-Chip router and (b) a SIMPPL controller . . . . . . . . . 36
2.8 (a) Communication between two clock domains with the asynchronous handshaking
protocol. (b) The FPGA-GALS architecture, in which units of FPGAs communicate
with each other through handshaking asynchronous communication channels. 37
2.9 The Wave-pipelined Wave Front Serializer (WAFT) . . . . . . . . . . . . . . . . 53
2.10 The Phase Encoding Communication System . . . . . . . . . . . . . . . . . . . 54
3.1 The ideal and degraded bandwidths against bit-width of a communication link in
FPGAs. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.2 The basic logical architecture of an island-style FPGA. . . . . . . . . . . . . . . 63
3.3 A typical communication link. (a) An abstract view of the link, which provides
S-bit interconnections for connecting two modular blocks. (b) A global routing
model for the link. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
3.4 Routing result of a 256-bit link for connecting two modular blocks which are 50
tiles apart. The occurrences of interconnections with different delays in the link are
shown in the upper panel. Among these interconnections, the number of wires of each
type (short, hex, and long) used to construct the interconnections
is shown in the lower panel. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
3.5 (a) An abstract view of both direct and fringed interconnections in a link after the
placement and routing steps. (b) A typical example of a fringed interconnection. 66
3.6 An illustration of the notations and modeling for the local routing of interconnec-
tion within a modular block. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
3.7 Distribution of the parameter D, which is the distance an interconnection traversed
before connecting to a long wire. . . . . . . . . . . . . . . . . . . . . . . . . . . 71
3.8 Illustration of parameters for computing the X(j). (a) Direct interconnection.
(b) Fringed interconnection. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
3.9 An example of computing the Manhattan distance within a modular block. Each
number in the block represents the Manhattan distance to connect the particular
tile to the point at M1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
3.10 Plots of number of interconnections, Y (j), and average interconnection length,
X(j). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
3.12 Experimental and theoretical results for interconnection lengths of a 32-bit link. . 80
3.13 Experimental and theoretical results for interconnection lengths of a 64-bit link. . 80
3.14 Experimental and theoretical results for interconnection lengths of a 128-bit link. 80
3.15 Experimental and theoretical results for interconnection lengths of a 128-bit link. 80
3.15 Linear relationship between interconnections delay and length. . . . . . . . . . . 84
3.17 Bandwidth comparison between the ideal and degraded link with length L=20 tiles. 85
3.18 Bandwidth comparison between the ideal and degraded link with length L=30 tiles. 85
3.19 Bandwidth comparison between the ideal and degraded link with length L=50 tiles. 85
3.20 Bandwidth comparison between the ideal and degraded link with length L=80 tiles. 85
4.1 A model of a typical global interconnection in FPGAs. (a) The schematic of an
interconnection comprising short and long wires connected through
switching points. (b) Circuit model of the corresponding interconnection. (c)
Switch-level RC circuit for the interconnection as a chain of segments driven by
drivers. (d) Parasitic capacitance model for the interconnects. The capacitance,
Cs, is the sum of the capacitance to ground planes and the coupling capacitances. 93
4.2 Signal transmission using (a) delay-based and (b) wave-pipelined schemes. . . . 98
4.3 Signal transmission using (a) Stimulus pulse at the source, (b) waveform of the
pulse observed at the output of the first stage and (c) attenuated waveform observed
at the output of the last stage. . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
4.4 Schematic of transmitter and receiver for the phase adaptation design. . . . . . . 105
4.5 Design of a phase adaptor. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
4.6 Implementation of Delay Element in FPGA. . . . . . . . . . . . . . . . . . . . 107
4.7 The design of an oversampling receiver. . . . . . . . . . . . . . . . . . . . . . . 108
4.8 An illustration of data sampling and moving windows for data recovery. . . . . . 109
4.9 Timing diagram for error computation. . . . . . . . . . . . . . . . . . . . . . . . 110
4.10 4-phase bundled-data asynchronous communication link (Figure adapted from
[SF01]). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
4.11 Circuit model of a typical global interconnection in FPGA. . . . . . . . . . . . . 114
4.12 Analytical model versus SPICE simulation for an interconnection comprising
Single and Double lines only. (a) Interconnect wave-pipelining throughput. (b)
Delay-based throughput. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
4.13 Analytical model versus SPICE simulation for an interconnection comprising
Single, Double, Hex and Long lines. (a) Interconnect wave-pipelining throughput.
(b) Delay-based throughput. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
4.14 Throughput comparison between delay-based and wave-pipelined signalling for
different interconnection length. . . . . . . . . . . . . . . . . . . . . . . . . . . 117
4.15 Throughput comparison between delay-based and wave-pipelined signalling for
different technology nodes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 117
4.16 Power improvement of wave-pipelined signalling over register pipelining. . . . . 118
4.17 Delay improvement of wave-pipelined signalling over register pipelining. . . . . 119
4.18 A typical communication link for data transfer between two RAMs in FPGAs . 120
4.19 Throughput for the communication links. . . . . . . . . . . . . . . . . . . . . . 122
4.20 Energy consumption (Energy per bit transfer) for the communication links. . . . 123
4.21 Area for the communication links. . . . . . . . . . . . . . . . . . . . . . . . . . 123
4.22 Bit-Error-Rate (BER) estimates for wave-pipelined link with different throughput
and bit-width. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
5.1 Unit interconnection in a general DP-network where 1 ≤ i, j, k ≤ n; k ≠ i, j. . . 133
5.2 Convergence of a DP-network for the 10-node array random walk problem. . . . 138
5.3 The convergence of the expected costs of the nodes from the 10×10 mesh network. 139
5.4 The Root Mean Square (RMS) error of the DP-network for computing shortest
paths in a 10×10 mesh network with different λ values. . . . . . . . . . . . . . 140
5.5 An example of a 3 by 3 mesh network coupled with a DP-network. . . . . . . . 141
5.6 Comparing two routing strategies, where the shaded area represents the nodes
covered in routing table and vs is the source and vt is the destination. (a) Optimal
decision can be made at vs (b) Since Vu(s) ≤ Vw(s), vu is selected as transition
node in the sub-optimal path to vt. . . . . . . . . . . . . . . . . . . . . . . . . . 145
5.7 Theoretical estimates for the approximation error of the KSLA approach with re-
spect to optimal DP values and the routing table size for the corresponding k val-
ues. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
5.8 Random traffic with hot-spot at the center. . . . . . . . . . . . . . . . . . . . . 149
5.9 Random traffic with hot-spot at the corner. . . . . . . . . . . . . . . . . . . . . 150
5.10 Transpose traffic. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
5.11 Butterfly traffic. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
5.12 Comparison between k-step look ahead, Odd-Even using a NoP selection and the
XY routing approaches. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
5.13 Design of a router that comprises an optimal path computation unit. . . . . . . 155
5.14 Design of the DP computational unit. . . . . . . . . . . . . . . . . . . . . . . . 156
5.15 The data flow routine for the k-step look ahead algorithm. . . . . . . . . . . . . 156
5.16 Convergence of DP-network in an FPGA implementation. The y-axis is the
root-mean-error of the optimal values obtained from the value outputs at each
computational unit. The x-axis is the clock cycle. The period of the clock cycle is not
specified here and varies with different FPGA devices. . . . . . . . . . . . . . . 158
A.1 The realization of a communication link to a typical island-based FPGA. . . . . . 166
A.2 Theoretical and the experimental interconnections delay. . . . . . . . . . . . . . 170
B.1 Power dissipation of 5 MCNC benchmark circuits for different interconnection
lengths. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
C.1 Failure rate detection circuit [WSC07] for global interconnection. . . . . . . . . 175
C.2 Measurement results corresponding to the testing circuit in Fig. C.1. . . . . . . . 176
D.1 Waveform at the far end of each segment . . . . . . . . . . . . . . . . . . . . . . 180
D.2 Diagram of the cascaded buffers . . . . . . . . . . . . . . . . . . . . . . . . . . 181
D.3 Charge/discharge time variation in the cases of using different buffers . . . . . . 182
D.4 Diagram of the transistor sized repeaters . . . . . . . . . . . . . . . . . . . . . . 182
D.5 Throughput comparison among the three approaches . . . . . . . . . . . . . . . 183
D.6 The throughput of the interconnect with transistor-sizing (a) is more sensitive to
the NMOS width than using cascaded buffers (b). . . . . . . . . . . . . . . . . . 184
D.7 Maximum dynamic power of an interconnect against throughput at 1V VDD . . . 185
List of Tables
2.1 Principal characteristics and classifications of on-FPGA communication architec-
tures. P2P: Point-to-point interconnects; RTR: Run-Time Reconfiguration; NoC:
Network-on-Chip; GALS: Globally Asynchronous Locally Synchronous . . . . . 39
2.2 Summary of the quantitative performance measurements for different architectures 41
2.3 A comparison of routing algorithms for on-chip network. . . . . . . . . . . . . . 47
2.4 A comparison of high-performance intra-chip signalling methods. . . . . . . . . 57
3.1 Notations used in the interconnection model . . . . . . . . . . . . . . . . . . . . 67
3.2 Comparison of experimental and theoretical interconnections length of high-
bandwidth communication links . . . . . . . . . . . . . . . . . . . . . . . . . . 81
3.3 Comparison of experimental and theoretical Interconnections Delay of high-
bandwidth communication links . . . . . . . . . . . . . . . . . . . . . . . . . . 83
4.1 Notations used in this chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.2 Comparison between register-based pipelining and wave-pipelining based on the
theoretical analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
4.3 A brief summary and comparison between the different on-chip communication
approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
4.4 Implementation of Communication Links (64-bit) in FPGAs. . . . . . . . . . . . 121
4.5 Standard deviation of interconnection lengths for different bit-width in a link . . 121
4.6 Parameters used in the Bit-Error-Rate analysis . . . . . . . . . . . . . . . . . . . 125
5.1 Notations used in this chapter . . . . . . . . . . . . . . . . . . . . . . . . . . . 131
5.2 Convergence analysis of a DP-network for different network topologies. . . . . . 144
5.3 Comparisons for packet injection rates between five routing algorithms. . . . . . 153
5.4 Hardware area results for the DP-networks. Results are obtained by using a Xilinx
Virtex-4 XC4VLX40 FPGA . . . . . . . . . . . . . . . . . . . . . . . . . . . . 159
5.5 Hardware area results for the XY, DP and KSLA routers. Results are obtained
based on a Xilinx Virtex-4 XC4VLX40 FPGA . . . . . . . . . . . . . . . . . . . 160
A.1 Comparison of experimental and theoretical interconnections delay . . . . . . . . 169
B.1 Distribution and power consumption percentages for interconnections with lengths
30 and 50 tiles in 20 MCNC benchmark circuits. . . . . . . . . . . . . . . . . . 173
B.2 Power consumption (W) for different interconnections of MCNC benchmark cir-
cuits. Note that some circuits do not have interconnections of a specific length. . 174
List of Publications
The work in this thesis has resulted in the following publications:
• Terrence Mak, P. Sedcole, Peter Y.K. Cheung and W. Luk, “On-FPGA Communication
Architectures and Design Factors”, in Proc. of 16th IEEE International Conference on
Field Programmable Logic and Applications, Madrid, 2006
• Terrence Mak, P. Sedcole, Peter Y.K. Cheung, W. Luk and K.P. Lam, “A Hybrid
Analogue-Digital Routing Network for NoC Dynamic Routing”, in Proc. of the First IEEE
International Symposium on Networks-on-Chips, Princeton, 2007
• Terrence Mak, P. Sedcole, P. Y. K. Cheung and W. Luk, “Average interconnection delay
estimation for on-FPGA communication links”, Electronics Letters, Vol. 43, No. 17, pp
918-920, 2007.
• Terrence Mak, Pete Sedcole, Peter Y.K. Cheung and Wayne Luk, “Interconnection Lengths
and Delays Estimation for Communication Links in FPGAs”, in Proc. ACM International
Workshop on System Level Interconnect Prediction, Newcastle, 2008
• Terrence Mak, Crescenzo D’Alessandro, Pete Sedcole, Peter Y.K. Cheung, Alex Yakovlev
and Wayne Luk, “Global Interconnections in FPGAs: Modeling and Performance Analy-
sis”, in Proc. ACM International Workshop on System Level Interconnect Prediction, New-
castle, 2008
• Terrence Mak, Pete Sedcole, Peter Y.K. Cheung and Wayne Luk, “Wave-Pipelined Sig-
nalling for On-FPGA Communication”, in Proc. IEEE Field Programmable Technology,
Taipei, 2008.
• Li Wang, Terrence Mak, Pete Sedcole, Peter Y.K. Cheung, “Throughput Maximization
for Wave-Pipelined Interconnects Using Cascaded Buffers and Transistor Sizing”, in Proc.
IEEE ISCAS, Taipei, 2009.
• Terrence Mak, Peter Y.K. Cheung, Wayne Luk and Kai-Pui Lam, “A DP-Network for Opti-
mal Dynamic Routing in Network-on-Chip”, in Proc. ACM/IEEE International Conference
on Hardware/Software Codesign and System Synthesis, Grenoble, 2009
Chapter 1
Introduction
1.1 Motivation and Objective
Interconnect performance is rapidly deteriorating with the continuous scaling of technology
processes. As predicted by the International Technology Roadmap for Semiconductors (ITRS) in
Fig. 1.1, there is a significant performance gap between interconnection RC delay and gate delay,
and this gap will increase exponentially (9:1 with the 65 nm technology according to the ITRS
2005 report [EKRZ02]). The gap will continue to grow even with the help of new interconnect
materials and aggressive interconnect optimization [Con01]. Furthermore, because of the tightly
packed wires, capacitances attributed to interconnect parasitics also increase drastically.
As a result, interconnect will dictate performance as well as power in circuits and systems.
Figure 1.1: Projected relative delay for local and global wires and for logic gates in technologies
of the near future [EKRZ02].
The interconnect challenge is exacerbated in the Field-Programmable Gate Array (FPGA), which
is a type of ASIC whose hardware can be programmed post-fabrication. With its unique
reconfigurable architecture, interconnections in an FPGA are generally constructed from segments
of interconnect fabric and programmable switches. Programmability introduces great flexibility
post-fabrication and speeds up the system development cycle, but at the cost of slower performance
and higher energy dissipation when compared to an ASIC counterpart. This is because, in ASIC
design, interconnect performance can be optimized via various circuit techniques, such as buffer
insertion. In FPGAs, specific circuits or routing netlists do not exist at design time,
as the FPGA architecture aims to deliver a programmable platform for realizing general circuits.
The programmable switches introduce additional loads on the interconnections and thus degrade
interconnect performance. This is especially the case for global interconnections, which span
a large physical distance across the chip. Long interconnections with multiple concatenated wire
segments result in inferior performance. Although more components and modules can be fitted
into a single FPGA, an intra-FPGA communication system built from these long interconnections
would become a bottleneck to overall system performance.
There is therefore a need to explore new design methodologies and architectures to improve
on-FPGA communication bandwidth; such methods could potentially mitigate the
interconnect deterioration challenges in the future. Recently, numerous communication-centric
circuits and architectures targeting ASIC design have been proposed [BB05, DT01, MWM04, LKK+05,
MB06]. These techniques pursue an alternative design objective that uses hardware
resources to increase on-chip communication performance. Because of the unique structure
of FPGAs, which allows only limited modification of the underlying electrical circuits, implementing
such alternative design methodologies is challenging.
In-depth understanding of the reconfigurable architecture is of vital importance and provides an
opportunity to study the exploitation of the underlying FPGA architecture for intra-chip commu-
nication purposes. The primary objective of this thesis is the exploration of novel design methods
that maximize the throughput performance in on-chip communication systems, and in particular
how these methods may be adapted to incorporate and exploit reconfigurable architectures, such
as in FPGAs. This thesis proposes a throughput-centric design paradigm for on-FPGA communi-
cation systems.
In particular, this thesis argues that the fundamental interconnect structure is inefficient for
implementing chip-wide or global communication systems and results in bandwidth degradation.
However, the underlying electronic structure of the FPGA provides an opportunity to exploit
the analogue characteristics of wires and thus substantially increase the interconnect throughput.
This contrasts with the conventional synchronous digital signalling scheme used in FPGAs. The
throughput improvement from analogue signalling provides an opportunity to mitigate the
interconnect challenges.
This thesis also argues that, with the ever-increasing complexity of on-chip communication
systems, on-chip traffic is becoming more difficult to predict at design time. Optimization
and synthesis of the communication architecture at design time may therefore fail to fully utilize
the hardware resources or to adapt dynamically to the traffic. Incorporating intelligent plasticity
with “on-the-fly” optimization in the on-chip communication system will enable a substantial
improvement in communication bandwidth and efficiency. A novel Dynamic Programming (DP)-
network is proposed to provide real-time on-chip optimization for solving the optimal path
planning and dynamic routing problems. With this optimal routing scheme, in-silicon bandwidth
can be fully exploited and utilized.
1.2 Overview
Chapter 2 gives an overview of communication architectures implemented on Field-Programmable
Gate Arrays, and provides a taxonomy of architectures. Design factors impacting the system
architecture and metrics for evaluating the architectures are presented. Furthermore, techniques
and methodologies from various research groups for on-chip communication architectures are
reviewed.
Chapter 3 begins with an analysis of a general design pattern for communication links, and studies
the relationship between bit-width and bandwidth when implementing such links in FPGAs. Due to
the specific constraints of the FPGA routing architecture and switching fabrics, the communication
bandwidth does not scale with the bit-width as expected. An analytical study using
a stochastic model that captures the essence of the architecture and routing therefore provides a
comprehensive investigation of the impact of the programmable architecture on the communication link.
Chapter 4 presents an electrical model for FPGA interconnects which captures the essence of pro-
grammable fabrics and architecture. Based on this model, delay and throughput of interconnect
are derived. This model facilitates comparative studies of different design methodologies for
communication links in FPGAs. Lastly, a novel link design methodology that enables analogue wave
pipelining and substantially improves throughput performance is proposed. The method
exploits the intrinsic interconnect circuit of the FPGA to realize analogue wave signalling across
a long link.
Chapter 5 presents a novel Dynamic Programming (DP)-network to realize dynamic routing in a
Network-on-Chip system. The parallel computation and network convergence properties of the
DP-network are studied and exemplified with two numerical examples. Methodologies and
architectures to integrate the DP-network into a NoC to enable dynamic routing are discussed. A new
routing algorithm, namely k-step look ahead (KSLA), that can substantially reduce the routing
table size, is also introduced. The performance of the DP-network based dynamic routing is eval-
uated and compared with other conventional routing algorithms on different on-chip communica-
tion system benchmarks. The hardware overhead with FPGA hardware implementation is studied
and discussed.
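The shortest-path computation referred to above can be illustrated with a generic dynamic programming recursion. The sketch below is not the DP-network of this thesis; it is a minimal software illustration, assuming a small directed graph with hypothetical link costs, in which every node repeatedly updates its cost-to-destination from the values held by its neighbours in the previous sweep (the Bellman equation) until no value changes. The function name, the graph and the costs are all assumptions made for illustration.

```python
import math
from typing import Dict, Tuple

# Hypothetical graph type: graph[node][neighbour] = non-negative link cost.
Graph = Dict[str, Dict[str, float]]


def dp_shortest_paths(graph: Graph, dest: str) -> Tuple[Dict[str, float], Dict[str, str]]:
    """Synchronous Bellman updates: cost[n] = min over links (link_cost + cost[neighbour])."""
    cost = {node: math.inf for node in graph}
    cost[dest] = 0.0
    next_hop: Dict[str, str] = {}
    # At most |V| - 1 sweeps are needed for non-negative link costs.
    for _ in range(len(graph) - 1):
        new_cost = dict(cost)  # every node reads the previous sweep's values
        changed = False
        for node, links in graph.items():
            if node == dest:
                continue
            for neighbour, link_cost in links.items():
                candidate = link_cost + cost[neighbour]
                if candidate < new_cost[node]:
                    new_cost[node] = candidate
                    next_hop[node] = neighbour
                    changed = True
        cost = new_cost
        if not changed:
            break  # values have converged
    return cost, next_hop


# Illustrative four-node example (costs are made up).
example_graph: Graph = {
    "A": {"B": 1, "C": 4},
    "B": {"C": 1, "D": 5},
    "C": {"D": 1},
    "D": {},
}

if __name__ == "__main__":
    costs, hops = dp_shortest_paths(example_graph, "D")
    print(costs, hops)
```

Because each sweep reads only the previous sweep's values, all node updates within a sweep are independent of one another, which is the property that makes this kind of recursion amenable to a parallel realisation.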
1.3 Statement of Originality
I have published much of the work in this thesis in conference proceedings and a journal. The
material has been summarized in three major areas of contribution, which are covered in separate
chapters of this thesis. The introduction of each chapter describes the rationale and contributions
in further detail. The main contributions of this thesis are as follows:
• Surveying up-to-date FPGA-based communication architectures, such as FPGA-based
Network-on-Chip and run-time reconfiguration bus architectures. Also, design criteria,
such as quantitative performance metrics and qualitative design factors, are proposed and
discussed. The previously published FPGA-based communication architectures are also
evaluated against these design factors. [MSCL06a].
• Presenting a new method, and an analytical expression, for estimating average interconnection
delay. This method is directly applicable to predicting the average delay of communication
links implemented in FPGAs. [MSCL07].
• Proposing a stochastic model for predicting the interconnection lengths and delays of com-
munication links in FPGAs. The model is readily applicable to floorplanning, as it depends
on the parameters such as area dimensions of the connected modules, the Manhattan dis-
tance between them, the number of connections and the number of long lines in a channel.
Also, interconnect fringing, a special phenomenon of routing in FPGAs, is characterized.
[MSCL08a]
• Presenting an analytical model for the electrical characteristics and signal waveforms in
FPGA interconnections. Based on this model, signalling throughput and propagation de-
lay using both delay-based and wave-pipelined approaches can be derived. Based on the
analytical study and SPICE simulations, it is found that the wave-pipelined signalling can
substantially outperform other conventional on-chip signalling schemes. [MDS+08a]
• Proposing two novel source-synchronous circuits to realize wave-pipelined signalling in
FPGAs. The throughput, performance, bit-energy and hardware area are studied using a
Xilinx FPGA device. It is found that the wave-pipelining link can achieve up to 5.66 times
improvement in throughput when compared to the delay-based synchronous link. Also,
13% improvement in power consumption can be achieved by using the source-synchronous
on-chip signalling when compared to the synchronous register-pipelining links. [MSCL08b]
• Presenting two new interconnect design techniques to maximize wave-pipelining through-
put, as opposed to the traditional delay-centric design, in communication links. The theo-
retical models are developed and the performance and power dissipation are evaluated using
SPICE simulations. With these new design techniques, 185% throughput improvement in
interconnects can be achieved. [WMSC09]
• Proposing a novel Dynamic Programming (DP)-network to enable real-time dynamic rout-
ing in NoCs. The DP-network provides a parallel architecture for shortest path computation
and enables optimal dynamic routing. The new dynamic routing schemes are evaluated us-
ing a SystemC-based cycle accurate simulator with various benchmarks. It is shown that the
DP-network outperforms both the deterministic and adaptive routing algorithms in average
delay on all benchmark scenarios by 22.3%. [MCLL09]
• Presenting a hybrid analogue-digital routing network to realize efficient dynamic routing in
NoCs. The digital network provides accurate real-time traffic estimation and the analogue
network provides real-time shortest path computation with extremely low power consump-
tion. The proposed digital architecture is evaluated using a cycle accurate simulator and the
analogue design is evaluated using SPICE simulation. The results demonstrate the effectiveness
of the hybrid analogue-digital design, with a significant improvement in latency over
static routing for random hot-spot traffic and a significant reduction in power consumption.
[MSC+07]
Chapter 2
Background
2.1 Introduction
On-FPGA communication is important to provide high-bandwidth and reliable data transfer be-
tween processing elements. Recently, many new coarse-grained reconfigurable architectures have
been developed, such as the Multi-Platform FPGA [Xil05a, Xil05b] and Field-Programmable
System-on-Chip [Alt05], along with the rapid advancement in technology scaling. These architectures
embed pre-fabricated modules, including microprocessors, DSP units and memory, into the
fine-grained programmable fabric, which enables the implementation of a complete complex system
in a single FPGA. Significant improvements in speed, area and hardware configuration time
[Xil05a, BHUH06] are therefore anticipated. As a result, the development of communication
architectures is motivated; these architectures will be used as an interconnect backbone to integrate
coarse-grained components, ideally providing a plug-and-play style of modularity to facilitate
system integration.
As the semiconductor fabrication technology continues to scale, it has been predicted that the per-
formance and energy consumption characteristics of global interconnect are rapidly deteriorating
[HMH01]. A new motivation to research into high-performance communication architecture is
therefore emerging. Poor wire scaling will result in a significant performance gap between in-
terconnects and logic gates, and the gap is predicted to continue to expand even with the help of
new interconnect materials and aggressive interconnect optimization [Con01]. Moreover, as the
number of embedded components increases, the communication bandwidth requirement also in-
creases drastically. Thus, it is pivotal to advance on-chip communication architectures to tackle
the performance bottleneck and technology scaling challenges.
There are many notable efforts devoted to addressing the on-FPGA communication issues, and
a considerable number of different architectures have been proposed. In this chapter a survey of
on-FPGA communication architectures is presented. Although general on-chip communication
architectures will be presented as background information, the main focus here is on the most
recent FPGA-based communication architectures. Notable design factors are identified and used
to evaluate the FPGA-based communication architectures. The contributions of this chapter are as
follows:
• Providing a comprehensive survey of FPGA-based communication architectures and a tax-
onomy to classify architectures based on their structures and function (Section 2.2-2.3).
• Identifying notable design factors, such as quantitative performance metrics and qualitative
design factors, and examining different FPGA-based communication architectures based on
these design factors (Section 2.4).
• Summarising and comparing notable contributions on high-performance signalling, which is
important and closely related to the throughput performance of any communication
architecture (Section 2.6).
2.2 On-Chip Communication Models
A taxonomy of on-chip communication architectures is shown in Fig. 2.1(a). In general, com-
munication architectures can be categorized into three main classes. The classification is widely
accepted and these architectures follow a chronological order. The classification of architectures
can be visualised by plotting the degree of logical channel sharing versus the degree of physical
interconnection (see Fig. 2.1(b)). The degree of logical channel sharing represents how
many processing units can use a channel, or have messages that pass through it. The
degree of physical interconnection refers to the number of processing units that are physically
connected to the channel. For point-to-point interconnects, most of the channels are designed for
dedicated communication. Therefore, these channels are neither shared nor physically connected
to by other processing units. Many physical interconnections are typically found in the bus
architecture, and these interconnections provide the same logical channel sharing. For
Network-on-Chip, the packet-switched communication protocol greatly reduces the number of physical
interconnections while retaining a high degree of logical channel sharing. A more detailed
discussion of the classifications is presented in the following sections.
[Figure 2.1 labels. (a) On-Chip Communication: Point-to-point Interconnect (Custom, Uniform); Bus (Hierarchical shared bus, Split bus / custom segmented bus); Network-on-Chip (Homogeneous, Heterogeneous). (b) Axis: degree of logical channel sharing; plotted classes: point-to-point interconnects, bus, hierarchical bus, segmented bus, Network-on-Chip, heterogeneous network.]
Figure 2.1: (a) Classification of on-chip communication architectures. (b) Classification of
architectures based on how many processing units share the channel logically and how many
processing units are connected to the channel physically.
2.2.1 Point-to-Point Interconnect
In a point-to-point interconnect architecture, pairs of processing units communicate directly over
dedicated physically wired connections. Because of its simplicity, point-to-point interconnect
has been widely adopted in many applications. Custom interconnect, sometimes referred to as
ad-hoc interconnect, simply means connecting processing elements by wires when necessary. On
the other hand, uniform interconnect often has well defined interconnect topology, which can
be precisely specified by equations or graphs. Typical examples are systolic arrays and neural
networks.
Simplicity is the major advantage of the point-to-point interconnect architecture. A communica-
tion channel between processing elements is simply a set of wires. As the channels are not shared
by other processing elements, the latency and performance are deterministic. However, the most
significant drawback is that the number of wires required grows rapidly as the number of channels
increases. Routing of wires may become intractably difficult [Con01, AR95]. Moreover, a
point-to-point scheme suffers from low wire utilization for low-bandwidth channels and from a high
hardware overhead, as a dedicated interface is required for each channel.
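The growth in wire count can be made concrete with a small calculation. The following is a minimal sketch under stated assumptions, not a model taken from the cited work: it assumes the worst case in which every pair of processing units is given its own dedicated channel, that each channel has the same width, and that control wires are ignored. The function names are hypothetical.

```python
def point_to_point_wires(num_units: int, channel_width: int) -> int:
    """Data wires needed if every pair of units has its own dedicated channel."""
    num_channels = num_units * (num_units - 1) // 2  # one channel per unordered pair
    return num_channels * channel_width


def shared_bus_wires(channel_width: int) -> int:
    """Data wires of a single shared bus; independent of the number of units."""
    return channel_width


if __name__ == "__main__":
    # Compare wire counts for 32-bit channels as the number of units grows.
    for units in (4, 8, 16, 32):
        print(units, point_to_point_wires(units, 32), shared_bus_wires(32))
```

Under these assumptions the point-to-point wire count grows quadratically with the number of processing units, whereas the data width of a single shared channel stays constant.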
2.2.2 Bus Architectures
For bus architectures, long wires are grouped together to form a single physical communication
channel which is shared among different logical channels. An arbitration mechanism is used to
control sharing of the bus. This architecture significantly reduces the total length of wires required
and also reduces hardware area necessary for interfaces, communication and control. In addition,
a bus provides a generic infrastructure backbone for interconnecting processing elements.
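As an illustration of how an arbitration mechanism can control sharing of a bus, the sketch below models a round-robin arbiter in software. Round-robin is chosen here purely as an assumption for illustration; the text above does not prescribe any particular arbitration policy, and the class and method names are hypothetical.

```python
from typing import List, Optional


class RoundRobinArbiter:
    """Grants the shared bus to one requesting master per cycle, in rotating order."""

    def __init__(self, num_masters: int) -> None:
        self.num_masters = num_masters
        # Start so that master 0 has the highest priority in the first cycle.
        self.last_granted = num_masters - 1

    def grant(self, requests: List[bool]) -> Optional[int]:
        """Return the index of the master granted the bus, or None if nobody requests."""
        for offset in range(1, self.num_masters + 1):
            candidate = (self.last_granted + offset) % self.num_masters
            if requests[candidate]:
                self.last_granted = candidate
                return candidate
        return None


if __name__ == "__main__":
    arbiter = RoundRobinArbiter(3)
    # Masters 0 and 1 request continuously; master 2 stays idle.
    for _ in range(4):
        print(arbiter.grant([True, True, False]))
```

Rotating the priority after each grant prevents any single master from monopolising the bus, at the cost of not distinguishing between urgent and non-urgent requests.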
A hierarchical shared bus defines a segmented bus architecture. Bus segments are connected via
a bridge which may buffer data. Protocols and structures can be varied in different bus segments
and each segment may be dedicated to specific functions, such as providing high-performance or
low-power communication. Partitioning can further allow optimization of local bus architecture
and communication performance. For example, AMBA (Advanced Microcontroller Bus
Architecture) is a bus architecture developed by ARM which is designed for efficient realisation of
complex system-on-a-chip (SoC) designs [ARM99]. Two bus segments are specified
in AMBA: the Advanced High-performance Bus (AHB) for high-performance communication
and the Advanced Peripheral Bus (APB) for low-power applications. The overall bandwidth
can be further improved using multi-layer interconnect, or a crossbar switch [ARM01], in which
multiple parallel channels are available to support parallel data transmission.
In a hierarchical bus architecture there is restricted scope to modify or customise the architec-
ture structure and protocol. An alternative approach is to use a split-bus. This refers to a set of
custom-design segmented buses. Bus segments can be interconnected in an ad-hoc manner, or
based on a systematic approach. In addition, several emerging split-bus architectures, which
provide multiple segments with adaptive topologies, have been proposed [Sec04, RM04, RSM01].
Enhancements in bandwidth and flexibility were reported.
Despite the advantages, shared bus architectures suffer from power and performance scalability
limitations. Long bus wires are increasingly unfavourable in nanometer process technologies. In
addition, when handling high-complexity designs with many IP cores, partitioning and distributing
cores across bus segments is often an ad-hoc task.
2.2.3 Network-on-Chip (NoC)
The Network-on-Chip (NoC) is an architecture inspired by data communication networks, such as
LANs and WANs, with inter-processor communication supported by a packet-switched network
[DT01, BB05]. The basic concept of a NoC is to communicate across the chip in a way similar to
how messages are transmitted over the Internet. Communication is achieved by sending message
packets between blocks using an on-chip packet-switched network. It is believed that the techniques
and architectures of the Internet could be borrowed to resolve the on-chip communication
challenges. It has been reported that the NoC architecture can overcome the long-wire disadvantages
of bus architectures, as on-chip switches are connected in a regular topology on a point-to-point
basis, so long wires can be eliminated from the architecture. Also, the architecture is decoupled
into transaction and physical layers; this layered architecture enables independent optimization
of each layer.
There are two types of NoC: homogeneous and heterogeneous. This refers to whether the net-
work topology is arranged in a regular network or an ad-hoc interconnection network. Different
topological arrangements of switches and processing elements result in different performance. A
typical example of a homogeneous network uses a mesh-based architecture [KJS+02, Tay02]. A
similar structure was also adopted in other implementations [SsA04, MBV+02]. One concern is
that functional components are usually of different sizes. Mapping these into regular blocks of
fixed sizes can introduce area wastage or larger chip sizes. In [BB04, BJM+05], a heterogeneous
network topology was proposed. It was reported that the heterogeneous architecture was more
area efficient.
2.3 Architecture Taxonomy
Studies targeting on-FPGA communication architectures can be classified as shown in Fig. 2.2.
The approaches are divided into five classes. The first class of architectures is built on top of the
reconfigurable fabric and utilizes the programmable resources for different applications. Most of
the architectures proposed are specified logically, and are thus referred to as Soft communication
architectures. The second class of architectures aims to modify and improve the FPGA hardware
architecture, and is referred to as Hard communication architectures. Interconnect architecture
design is conventionally an important issue in FPGA architecture design. It is considered here as a
subset of on-FPGA communication architecture, simply because bit-level interconnect is the most
primitive way of communicating. The third class is interface and controller design, which has
been introduced recently to tackle the control and interfacing issues between embedded
microprocessors and coarse-grained modules. The fourth class is the Globally Asynchronous
Locally Synchronous (GALS) architecture, which provides a prominent solution for many of the
communication-related problems introduced by technology scaling. In some systems, the FPGA is
embedded as a coarse-grained module; the communication architectures within such systems are
summarized in the fifth class. The details of each class are presented in the following sub-sections.
Figure 2.2: The taxonomy of on-FPGA communication architectures.
2.3.1 Soft Architectures
FPGA architectures provide a high density of pre-fabricated wires and programmable switches that
can be used to realize different communication architectures. Full exploitation of the available on-
FPGA communication bandwidth often requires system-level design effort to judiciously estimate
the interconnection utilization or implement complicated on-chip communication protocols. It
should be noted that the on-FPGA communication architectures discussed in this section inherit
only the logical characteristics of the communication architectures in the previous section.
Point-to-Point Interconnect
An early FPGA-based systolic array architecture, proposed by Dick [Dic96], provides an example
of this architecture with a structural and systematic topology. Processing elements are connected
in a well defined structure with data being pumped through. In [Dic96], the Discrete Fourier
Transform is mapped to a systolic array of processing elements. The well-defined interconnection
between the processing elements enables effective parallel computation and control. The design
was implemented using a Xilinx XC4010 FPGA running at a clock frequency of 15.3 MHz. By
using a more recent FPGA, the clock frequency of the design could be improved. In [YKLL03], a
systolic array design for gene sequencing running at a clock frequency of 202 MHz was reported.
[Figure 2.3 labels. (a) High-performance processor, high-bandwidth on-chip RAM, high-bandwidth external memory interface, DMA bus master, master ports, slave ports, PLB (128-bit), bridge, OPB (32-bit), OPB arbiter, UART, PIO, smart card I/O, synchronous serial port; high speed, low power. (b) PEs, arbiter, Sonic bus, chain bus, bridge; data, message; data, control and interrupts; microprocessors and peripherals.]
Figure 2.3: (a) CoreConnect Bus architecture [IBM99]. (b) Sonic custom bus architecture
[SCCL04].
Bus
The CoreConnect Bus architecture [IBM99], developed by IBM, has been adopted by Xilinx for
both hard-core and soft-core processors in their Virtex-II Pro and Virtex-4 FPGAs. There is a
close resemblance between CoreConnect and AMBA. To enhance the bus performance, CoreConnect
defines two distinct bus segments, the Processor Local Bus (PLB) and the On-chip Peripheral
Bus (OPB), which are connected by a bridge. The PLB is a high-speed and high-bandwidth bus.
Usually, processors, on-chip RAM and other high-bandwidth devices are attached to the PLB,
so that it provides high-performance communication between the processors. The OPB is a
low-performance and low-power peripheral bus. It is used for asynchronous interfaces and
general-purpose peripheral components. The architecture is shown in Fig. 2.3(a). As an alternative, a bus
architecture developed by Altera [Alt03], namely Avalon, provides a simpler solution for system
component integration on reconfigurable logic devices.
Figure 2.4: Different topological structures of Network-on-Chips: (a) Mesh, (b) Torus, (c) Folded-torus,
(d) Ring, (e) Tree and (f) Irregular.
In [CLC05], IBM CoreConnect was used as the communication backbone between the embedded
MicroBlaze soft-processor and custom FPGA-based hardware logic in a cryptography applica-
tion. The bus architecture is regarded as the only communication channel between the processing
elements.
In [SCCL04], Sedcole et al. proposed a structured methodology for rapid development of FPGA-
based systems (see Fig. 2.3(b)). Systems comprise two types of buses: a global bus providing com-
munication between all processing elements and a chain bus providing communication between
adjacent processing elements. Extra logic is required for the arbitration of the global bus. Com-
munication and computation tasks were separated by implementing a router at each processing
element. The separation of communication and computation enhances the possibility of system
architecture reuse for other applications. The computation modules can be reconfigured at run-time,
so as to enhance the flexibility of system integration at a late stage of the system development
cycle. The proposed methodology was exemplified with an architectural instance of Sonic-on-a-Chip
and was realized using a Xilinx Virtex-II Pro FPGA. It was reported that the system can be
executed at 50 MHz with a peak throughput of 59.3 megabits per second.
Reconfigurable Crossbar Switch
In [KLLY05], a crossbar switch architecture was proposed to interconnect multiple processing
elements. The proposed crossbar switch architecture incorporates an additional bus which can be
scheduled to provide extra bandwidth dynamically. This hybrid architecture scheme is motivated
by the fact that IP cores implement different functions with corresponding differences in required
operating frequency and bandwidth. The additional bus can compensate for the extra bandwidth
requirement of specific IP cores. The authors of [KLLY05] showed that the architecture can provide
the same bandwidth at an 18% lower operating frequency.
Network-on-Chip
In line with the recently proposed Network-on-Chip (NoC) architecture, several FPGA-based NoC
implementations have been reported [ABD+05, BMN+05, ZKS04, SBKV05]. In [ABD+05],
a dynamic network on chip (DyNoC) for coarse-grain programmable fabrics was proposed. It
is an extension of a 1-D shared bus architecture to a 2-D network interconnect architecture.
The authors pointed out that, with dynamic reconfiguration, a newly reconfigured module may
become an obstacle that blocks packet transmission. A new routing algorithm was proposed so
that the network-on-chip architecture can be realized with run-time reconfiguration. It should
be noted that on-chip switches consume a large area of the design. For a Xilinx Virtex-II 1000
device, 21% and 46% of logic resources were devoted to on-chip switches with wordlengths of
32 bits and 64 bits respectively. The operating frequency of the system was around 70-77 MHz.
A topology-adaptable communication network design was proposed in [BMN+05]. Network generation
is supported by an operating system. The design of switches and routers is critical to
the overall system performance and area. The paper reported that the number of slices required
grows quadratically with the number of input signals. Furthermore, significant hardware resources,
including logic slices and memory, are required solely for network-based communication. Typically,
for a 9-switch network, around 3000 slices are required, which corresponds to around 14%
of the overall resources of a Virtex-II Pro FPGA. The resources required are doubled for a 16-switch
network. It appears that optimization of the area consumption is important for the effective
realization of NoC architectures. Also, it was reported that the network can run at 50 MHz with a
16-bit wordlength, which results in a data rate of 100 megabytes per second.
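The figures quoted above can be cross-checked with simple arithmetic. The sketch below is a worked calculation only, assuming that one word is transferred per clock cycle and that one megabyte is 10^6 bytes; the function names are hypothetical, and the device size it derives is an estimate implied by the quoted percentages rather than a datasheet value.

```python
def link_data_rate_mbytes(clock_mhz: float, word_bits: int) -> float:
    """Peak data rate in megabytes per second, assuming one word per clock cycle."""
    return clock_mhz * word_bits / 8


def implied_device_slices(slices_used: int, fraction_of_device: float) -> float:
    """Total slice count implied by a usage figure and its share of the device."""
    return slices_used / fraction_of_device


if __name__ == "__main__":
    # 50 MHz x 16 bits = 800 Mbit/s, i.e. 100 megabytes per second.
    print(link_data_rate_mbytes(50, 16))
    # About 3000 slices at about 14% implies a device of roughly 21,000 slices.
    print(round(implied_device_slices(3000, 0.14)))
    # Doubling the slice usage for 16 switches doubles the share to about 28%.
    print(2 * 3000 / implied_device_slices(3000, 0.14))
```

On these assumptions the 16-switch network would occupy roughly 28% of the same device, which underlines why area optimization matters for NoC realization.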
A multiple layered network transmission protocol, as well as the design of routers and switches,
were reported in [ZKS04]. However, these come with a significant overhead on hardware slice
consumption. The overhead is attributed to the complicated switch design and routing algorithm
implementation. For example, Kapre et al. investigated the design trade-offs of FPGA overlay
networks with time-multiplexed and packet-switched protocols [KMdL+06].
Run-Time Reconfiguration Architectures
In [KPR04], a linear architecture for a System-on-Programmable-Chip was proposed. The system
is initialized with an AMBA bus backbone and is physically floorplanned to be one dimensional.
Processing elements can be attached to and removed from the AMBA bus at run-time. The 1-D
placement has the advantage of a predictable structure. Moreover, the modules in this approach
are not required to be fixed in size. The design consists of run-time configurable bridges, which
can be used to bridge between modules. Also, it was found that the aspect ratio (the ratio of the
length to the width of a module) has a great impact on the operating frequency of the system.
2.3.2 Hard Architectures
“Hard” interconnect architecture refers to interconnections that are physically wired in a chip.
These interconnections have a great impact on the overall on-chip communication system
performance. In this sub-section, two types of hard architecture, namely interconnect and
coarse-grained architectures, will be introduced.
Configurable Interconnects
More than 80% of the FPGA hardware is attributed to the programmable interconnects.
Since all of the “soft” architectures are built on top of the interconnections, significant
improvement in performance can be expected from innovations and advancements in interconnect
architectures.
Figure 2.5: (a) Conventional interconnect architecture, in which LUTs are connected through
wires with programmable switches. (b) Bundled wire, which can save hardware area by reducing
the number of programmable switches.
[Interconnect architectures] Programmable interconnects are fundamental constructs for
on-FPGA communication architecture. Communication between two look-up tables (LUTs) can
be achieved by connecting the output of one table to the input of the other with physical wires.
SRAM-based switches connect the wires so that the LUTs can communicate with each other (Fig.
2.5(a)). Ye and Rose [YR05] observed that structured components are regularly found in datapath
circuits, especially when mapping datapath circuits into FPGAs. By taking advantage of
datapath regularity, programmable interconnections are bundled together to reduce the number of
configuration bits required (Fig. 2.5(b)).
[Interconnect topological structures] The topological structure of the interconnect is critical to
hardware area efficiency, and different topological structures have been proposed. The
Manhattan-style topology in Fig. 2.6(a) is adopted by Xilinx and Altera FPGAs. Alternative
architectures, such as Fat-tree and Mesh-of-Tree (MoT), provide a similar general routing capability for
different designs but differ in detailed topology and use of hierarchy [DeH04] (see Fig. 2.6(b) and (c)).
Figure 2.6: Interconnect topological structures: (a) Manhattan-style, (b) Fat-tree and (c) Mesh of
tree (MoT)
[Interconnect enhancements] In addition, several contributions improve
interconnect efficiency by modifying the bit-level interconnect architecture. For example,
Sivaswamy et al. [SWA+05, WSA+06] argued that the flexibility in routing provided by the
FPGA switches comes at a great performance cost. The system power consumption, area and
circuit delays can be reduced significantly by hard-wiring some switches, while maintaining
reasonable FPGA routing versatility. By studying the routing structures of a number of benchmark
circuits, patterns of hardwired junctions between horizontal and vertical segments can be identified
and introduced into the switch boxes. Furthermore, there has been a continuous effort to improve
the FPGA routing architecture and to tackle the interconnect overhead by reducing the routing
capacity [LAB+05], or decreasing the number of hubs to neighborhood logic [Xil06c]. Moreover,
pipelined FPGAs are explored in [SCEH04], and further interconnect
enhancements for high-speed PLD architectures are reported in [HCK+02].
In [CFH+05], Cong et al. proposed a regular distributed register microarchitecture, which of-
fers regular point-to-point communication between islands of computation resources. Each island
contains a Finite State Machine, register memory and arithmetic units. Algorithms or arithmetic
operations can be partitioned and mapped to the islands, which are then directly connected using
wires. More importantly, the proposed architecture provides direct support for multicycle commu-
nication. Since propagating and synchronizing a single clock signal to the whole chip is predicted
to be a difficult challenge in future technology, multicycle communication takes into consideration
the propagation delay of signal wires. A prototype of the architecture has been realized using an
Altera Stratix FPGA. A 44% improvement in clock period was reported for data-flow-intensive
examples and a 28% improvement in clock period for control-flow examples. One concern with
the multicycle architecture is the extra wiring required to handle many simultaneous data
transfers. The authors of [CFH+05] also suggested that interconnect pipelining and sharing may
be effective in reducing the wire overhead.
Figure 2.7: (a) Adaptive System-on-Chip router and (b) a SIMPPL controller
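The idea behind multicycle communication can be illustrated with a minimal sketch: a transfer whose wire delay exceeds one clock period is simply budgeted several cycles. The function name and the delay values below are hypothetical and are not taken from [CFH+05].

```python
import math


def transfer_cycles(wire_delay_ns: float, clock_period_ns: float) -> int:
    """Whole clock cycles that must be budgeted for one transfer over a wire."""
    if wire_delay_ns <= 0 or clock_period_ns <= 0:
        raise ValueError("delay and period must be positive")
    # A transfer that does not fit in one period is allowed to span several.
    return math.ceil(wire_delay_ns / clock_period_ns)


if __name__ == "__main__":
    # Hypothetical inter-island wire delays (ns) at an assumed 4 ns clock period.
    for delay in (3.0, 7.5, 12.0):
        print(delay, "ns ->", transfer_cycles(delay, 4.0), "cycle(s)")
```

Under these assumed numbers the local clock period is no longer bounded by the longest inter-island wire, which is the motivation given above for multicycle communication.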
2.3.3 Interface and Communication Controller
Communication architectures are often regarded as interfaces between processing elements. In
[SC05], Shannon and Chow proposed a point-to-point interconnection architecture for rapid system
development. Communication between processing elements is achieved by Systems Integrating
Modules with Predefined Physical Links (SIMPPL), as shown in Fig. 2.7. Several modules
are connected on a point-to-point basis to form a generic computing system. The mechanism for
the physical transfer of data across a link is provided so that the designer can focus on the meaning
of the data transfers, rather than how to connect the wires. The SIMPPL model greatly facilitates
the speed and ease of hardware development and it was reported that hardware systems can be
developed in just a few hours. In addition, asynchronous FIFOs were implemented as the interface
between programmable switch and processing element, such that the communication architecture
is more adaptable to IP cores with different operating frequencies. The proposed methodology
was exemplified with multimedia applications of video streaming and video camera snapshot.
The design was implemented on a Xilinx Virtex-II FPGA and the SIMPPL controller was able to run
at 50 MHz.
Figure 2.8: (a) Communication between two clock domains with the asynchronous handshaking
protocol. (b) The FPGA-GALS architecture, in which FPGA units communicate with each other
through asynchronous handshaking communication channels.
2.3.4 GALS-Based Architecture
In [RC03], Royal and Cheung proposed an island-based Globally Asynchronous Locally Syn-
chronous (GALS) architecture as shown in Fig. 2.8. FPGA programmable fabrics are partitioned
into islands, each of which is synchronised using a local clock. The global clock signal can be
removed, thus avoiding the risk of clock skew and reducing clock power dissipation. However,
communication between multiple clock islands may result in metastability, which may cause a
functional failure of the system. To avoid metastability, a pausable-clock scheme is employed.
A pausable-clock wrapper is introduced to each of the islands, and the clock of the corresponding
island can be stopped and restarted so as to ensure that data are registered without metastability.
The FPGA-based GALS architecture is effective in avoiding the clock skew problem and can
potentially obtain higher computational speeds. More importantly, different coarse-grain modules
can be integrated easily in the GALS framework without concern for clocking problems. In
other words, the island-based GALS architecture provides high composability for Platform-FPGAs
[Mol97].
In [Sec04], Seceleanu suggested a segmented bus platform supporting communication using mul-
tiple clock frequencies. The bus was partitioned into segments of different operating frequencies.
A synchronizer was introduced into each bus segment. A central arbiter monitors the sharing
of global buses. The proposed architecture was realized using an Altera FPGA. A high system
frequency can be achieved: 124 MHz was reported. In addition, the number of clock cycles required
was reduced from 3000 to 2880 in their experiment on data item transmission between
processing elements.
2.4 Design Criteria
Communication architecture design is a complex multi-objective optimization problem that can be
evaluated based on a number of criteria and evaluation metrics. The fundamental characteristics
and classification of on-FPGA communication architectures are summarized in Table 2.1.
2.4.1 Quantitative Metrics
• Throughput is one of the most important design criteria for an on-chip communication
system. It is defined as the number of bits transmitted per unit time.
• Hardware area is an important criterion for circuit design. It can be defined in terms of the number
of transistors, logic elements, memory and wire length. In particular, for FPGA devices, the number of
logic slices is commonly used as an area measurement.
• Interconnect utilization is the amount or percentage of time that the wire is carrying
information.
• Power is the measurement of energy consumption in interconnect wires per unit time.
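The throughput and utilization metrics above can be made concrete with a short sketch. It assumes a link that transfers one word of `bit_width` bits per clock cycle; the function names and the example figures are illustrative only and do not reproduce any published measurement.

```python
def peak_throughput_mbyte_s(freq_mhz: float, bit_width: int) -> float:
    """Peak throughput of a link that moves bit_width bits every clock cycle."""
    return freq_mhz * bit_width / 8.0


def utilization(busy_cycles: int, total_cycles: int) -> float:
    """Fraction of cycles in which the wire carries information."""
    if total_cycles <= 0:
        raise ValueError("total_cycles must be positive")
    return busy_cycles / total_cycles


def effective_throughput_mbyte_s(freq_mhz, bit_width, busy_cycles, total_cycles):
    """Peak throughput scaled by the measured utilization."""
    peak = peak_throughput_mbyte_s(freq_mhz, bit_width)
    return peak * utilization(busy_cycles, total_cycles)


if __name__ == "__main__":
    # Assumed example: a 16-bit link clocked at 50 MHz, busy 750 of 1000 cycles.
    print(peak_throughput_mbyte_s(50, 16))
    print(effective_throughput_mbyte_s(50, 16, 750, 1000))
```

With the assumed 16-bit, 50 MHz link the peak value is 100 MByte/s, and a 75% utilization lowers the effective figure to 75 MByte/s.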
Design                    Scalability       Compatibility   Reusability  Routing complexity  Design space       Scheduling
P2P                       logical           interface       ad-hoc       complex             ad-hoc/systematic  static
Island-style              logical/physical  interface/GALS  ad-hoc       complex/modular     ad-hoc             static
Hierarchical              logical           interface       protocol     modular             ad-hoc             dynamic
Run-Time Reconfiguration  logical           interface       separation   modular             ad-hoc             dynamic
Segmentation              logical/physical  GALS            separation   modular             ad-hoc             dynamic
Crossbar                  logical/physical  interface       separation   modular             systematic         dynamic
Network-on-Chip           logical/physical  interface/GALS  separation   modular             systematic         static/dynamic
Table 2.1: Principal characteristics and classifications of on-FPGA communication architectures.
P2P: Point-to-point interconnects; RTR: Run-Time Reconfiguration; NoC: Network-on-Chip;
GALS: Globally Asynchronous Locally Synchronous
Based on the published results, the quantitative performance measurements for different
architectures are summarised in Table 2.2. IBM CoreConnect provides a large range of throughput,
and its maximum throughput significantly outperforms those of the other architectures. Although the
operating frequency of a communication architecture is largely a function of the chosen FPGA
technology, the frequency of the segmented-bus architecture is moderately higher than those of
the other architectures. On the other hand, the area consumption of the NoC architectures is
substantial, yet there is only marginal improvement in the performance metrics. These include the
operating frequency, which ranges from 40 to 64 MHz, and the throughput, which ranges from 39.8
to 310.4 MByte/s.
2.4.2 Qualitative Metrics
Scalability
A logically scalable architecture implies that, as more processing elements are added to the
communication fabric, the communication bandwidth and peripheral parameters should increase
proportionally. In architectures such as a shared bus, the peak bandwidth is fixed. When more processing
elements are added to the shared bus, the overall communication performance decreases
accordingly [BB05]. In contrast, Network-on-Chip architectures are logically scalable: the total
bandwidth of the network can be increased when new processing elements are added by increasing the
number of network switches [BB05].
Physical scalability refers to how well a communication architecture adapts to technology scaling
and new process technologies. Architectures comprising many long wires will not adapt well to
nanometer process technologies. This criterion usually also refers to whether the design is tolerant to
physical faults.
Compatibility
Pre-fabricated or pre-designed IP cores usually have specific operating conditions, such as the
maximum operating frequency. To integrate IP cores effectively, the interconnect backbone
requires robust and versatile interfaces. Globally Asynchronous Locally Synchronous (GALS)
wrappers provide a flexible interface for a wide range of operating frequencies, which can greatly
enhance the composability of different IP cores. In contrast, for an interconnect that does not provide
a GALS framework, it is necessary to design a specific interface for each of the IP cores, and
a much longer development cycle is therefore required. Point-to-point GALS architectures [RC03],
segmented buses [Sec04] and some of the NoC architectures implement the GALS framework.
Design                         FPGA (model)        Number of components  Throughput    Frequency  Bit-width  Throughput per  Year
                                                   (modularity)          (MByte/s)     (MHz)      (bit)      unit area
CoreConnect                    Xilinx Virtex2-Pro  flexible              264-1600      50-100     32-128     1.5             1999
Run-time Reconfigure Bus       Xilinx Virtex2-Pro  5-18                  59.3          50         16         not provided    2004
Segmented-Bus                  Altera Stratix      2-4                   43.1          124        2-32       not provided    2004
RMBoC                          Xilinx Virtex2      16                    2.1-2.4       85-96      16         0.001-0.002     2005
Interconnect Network           Xilinx Virtex       2×2                   310.4         40         16         0.1             2002
Network-on-Chip                Xilinx Virtex2-Pro  3×3                   100           50         16         0.02            2005
Reconfigurable System-on-Chip  Altera Stratix-II   not provided          not provided  55.8-64    8, 16      not provided    2004
LiPaR                          Xilinx Virtex2-Pro  3×3                   39.8          33.3       16         0.008           2005
Table 2.2: Summary of the quantitative performance measurements for different architectures
Reusability
Increased productivity is achieved through the reuse of system architectures and IP cores. System-level
timing is pre-determined, so one can effectively avoid the iterative design and verification cycles
usually necessary to achieve timing closure [SCCL04]. Reusability takes account of architectural
modularization, abstraction and the separation of communication and computation. For
architectures such as point-to-point interconnect, the interface and interconnection topologies are devoted
to a specific application, so it is almost impossible to reuse the hardware. Among the surveyed
architectures, the structured design approach [SCCL04] explicitly addresses the separation of
communication and computation. Furthermore, shared buses and network-on-chip
architectures are also highly reusable, simply because of their generic modularity and abstraction
of communication.
Routing Complexity
The routability of multiprocessor network topologies on an FPGA depends on the complexity
of the network. Some existing commercial FPGAs do not support complex applications
[SSC06].
The architectural and system design will eventually be realized on a physical FPGA device. FPGA
routing is a critical step, which translates the higher-level communication models into lower-level
physical interconnects by programming switches. As the FPGA physical interconnect is a
limited resource, it is intractably difficult to optimize the programmable switches and physical
interconnect for a complex architecture. Routing may therefore take a long time
and leave routing constraints unsatisfied. It has been advocated that modular design
would ease the tedium of layout for both microprocessor design and reconfigurable computing
[Con01, SCCL04]. Among the surveyed architectures, shared buses and NoC architectures are
generally designed with modular abstraction. For the island-based architecture, the modularity and
routing complexity depend on the granularity of the “islands”.
2.5 Design Methods
On-FPGA communication systems require specific design methods to maximize the system
performance or to provide adequate bandwidth. In the following, design methods addressing
the communication architectures are discussed.
2.5.1 Synthesis
Design Space Exploration
Communication architectures provide a large range of architectural parameters and options for
the designer to consider. Due to the unique requirements of different applications, there exists a
problem of searching for the optimal architecture from a huge design space. For example, the NoC
topology has a significant effect on network latency, throughput, area, fault-tolerance and power
consumption. Therefore, the design of NoC topology plays an important role in routing strategy
and mapping of the cores to the network nodes [OHM05, OM05]. Although very little research
focusing on FPGA-based communication design space exploration is found in the literature, it is
considered an important issue when dealing with communication architecture design. Choosing
appropriate architectures and parameters is often performed on an ad-hoc basis. For hierarchical
buses, decisions for mapping tasks and modules to bus segments are usually based on designer
experience. This ad-hoc design is limited when dealing with large-scale problems. In particular,
automating part of the communication architecture exploration task can be invaluable in finding
the optimal or near-optimal settings of design parameters.
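As a hedged illustration of what automating part of the exploration could look like, the sketch below exhaustively enumerates a tiny design space and keeps the smallest design that meets a throughput target. The parameter ranges, the area model and all names are assumptions made for this example; they do not come from the surveyed literature.

```python
from itertools import product

# Hypothetical design space: every name and number here is illustrative only.
BIT_WIDTHS = (8, 16, 32)
FREQS_MHZ = (50, 100)
AREA_PER_BIT = 40  # assumed slices needed per bit of link width


def evaluate(bit_width, freq_mhz):
    """Toy cost model: returns (peak throughput in MByte/s, area in slices)."""
    throughput = freq_mhz * bit_width / 8.0
    area = AREA_PER_BIT * bit_width
    return throughput, area


def explore(min_throughput, max_area):
    """Return the feasible (area, bit_width, freq_mhz) with least area, or None."""
    feasible = []
    for bit_width, freq_mhz in product(BIT_WIDTHS, FREQS_MHZ):
        throughput, area = evaluate(bit_width, freq_mhz)
        if throughput >= min_throughput and area <= max_area:
            feasible.append((area, bit_width, freq_mhz))
    return min(feasible) if feasible else None


if __name__ == "__main__":
    print(explore(min_throughput=150, max_area=1000))
```

Exhaustive enumeration is only practical for very small spaces such as this one; for realistic parameter counts a heuristic or model-based search would be needed, which is the motivation for systematic exploration methods.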
Exploring Network Architectures
Since the embedded communication network is prefabricated and immersed into the FPGA fab-
ric, it is important to determine the network topological structure. Similar to inter-FPGA systems
design [HBE99], the topological structure is fundamental to the on-chip system processing and
computational performance as well as the resources utilization. Determining the proper network
structure is a difficult problem and the design is further complicated by the enormous programma-
bility and flexibility provided by the programmable fabric. Data traffic is difficult to predict with
configurable hardware and the possible software contents. A proper formulation and a systematic
methodology for exploration and optimization of the embedded communication network topology
are required.
2.5.2 Routing
When communication channels are shared by several processing units, arbitration is required to
control and avoid conflicts of requests from processing units. Design of arbitration units has a
great impact on the overall system performance. There are two types of scheduling: static routing
and dynamic routing.
Static Routing
The overall delay of communication has a great impact on the overall system performance. Static
routing usually refers to the assignment of communication channels at design time. In other words,
the routing will not change according to the real-time traffic of the communication network. For
example, the routing of wires in the multi-cycle architecture is defined at synthesis-time. The pro-
grammable interconnect switches in FPGA architectures are also a kind of static routing resources.
There is an interesting example in network-on-chip scheduling. In [Tay02, LLST04, SC04], in-
struction memory and a programmable controller are embedded in the switches. Routing and
arbitration are programmed and scheduled at compile time by a software compiler, so the
overall latency of packet transmission is deterministic.
Dynamic Routing
The routing can be modified based on the real-time traffic information. For example, in [BJM+05],
the design of routing-table-based shortest path algorithms for minimizing the overall latency is
proposed. Although there is an advantage of simplicity for the routing-table method, extra latency
is attributed to the routing logic which may introduce an overhead in terms of power and speed to
the communication architecture.
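A routing table of the kind mentioned above can be sketched as a map from destination to next hop. The example below builds such a table with a breadth-first search over hop counts; it is a simplified stand-in and not the algorithm of [BJM+05], and in a dynamic scheme the table would have to be recomputed (or the link costs weighted) as traffic conditions change. The topology and names are hypothetical.

```python
from collections import deque


def build_routing_table(adjacency, source):
    """Map each reachable destination to the first hop on a shortest path."""
    next_hop = {}
    visited = {source}
    queue = deque()
    # Direct neighbours are reached through themselves.
    for neighbour in adjacency[source]:
        if neighbour not in visited:
            visited.add(neighbour)
            next_hop[neighbour] = neighbour
            queue.append(neighbour)
    # Every other node inherits the first hop of the node that discovered it.
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in visited:
                visited.add(neighbour)
                next_hop[neighbour] = next_hop[node]
                queue.append(neighbour)
    return next_hop


if __name__ == "__main__":
    # Hypothetical 2x2 mesh: 0-1 on top, 2-3 below, with vertical links 0-2 and 1-3.
    mesh = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
    print(build_routing_table(mesh, 0))
```

A table lookup of this form is simple, but every router must store and consult it, which is the source of the latency and power overhead noted above.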
Dynamic arbitration mechanisms follow a set of arbitration policies. Typical policies, such as
FIFO (First-In-First-Out), priority queue, round robin and simple random roulette-wheel
approaches, are found in most of the existing bus architectures [ARM99, IBM99]. Furthermore,
stochastic arbitration policies were proposed recently [TKDD05]. The bus arbitration is
formulated as a Markov Decision Process, such that expected power utilization and efficiency can be
optimized from the perspective of stochastic processes.
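Of the policies listed, round robin is the easiest to illustrate. The following behavioural sketch (a software model, not a hardware description, with a hypothetical class name) grants the bus to the first requester found after the one served most recently, so that no requester can be starved.

```python
class RoundRobinArbiter:
    """Behavioural model of a round-robin bus arbiter."""

    def __init__(self, num_requesters):
        if num_requesters <= 0:
            raise ValueError("need at least one requester")
        self.n = num_requesters
        # Start so that requester 0 has the highest priority on the first grant.
        self.last = self.n - 1

    def grant(self, requests):
        """requests: sequence of bools. Returns the granted index or None."""
        for offset in range(1, self.n + 1):
            index = (self.last + offset) % self.n
            if requests[index]:
                self.last = index  # the winner gets the lowest priority next time
                return index
        return None


if __name__ == "__main__":
    arbiter = RoundRobinArbiter(3)
    for _ in range(3):
        print(arbiter.grant([True, True, False]))
```

When requesters 0 and 1 both keep requesting, the grants alternate between them, in contrast to a fixed-priority policy under which requester 1 would never be served.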
2.5.3 Deadlock Avoidance
Deadlock occurs in an interconnection network when a group of packets is unable to make progress
because they are waiting on one another to release resources, usually buffers or channels [DT04].
If a sequence of waiting packets forms a cycle, the network is deadlocked. Deadlock is
catastrophic to a routing network. When some resources are occupied by deadlocked packets, other
packets block on these resources and network operation is paralyzed. It is therefore important to design a
deadlock-free routing algorithm. Dynamic routing schemes are particularly susceptible to deadlock,
simply because there is a probability of establishing a dependency cycle in the network. To prevent
this, networks must either use deadlock avoidance methodologies, which guarantee that a network
cannot deadlock, or other less common schemes, such as deadlock recovery.
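The cycle condition described above can be checked mechanically. The sketch below represents the waiting relation as a graph (each packet maps to the packets whose resources it is waiting for) and uses a depth-first search to look for a cycle. The data structure and names are assumptions for illustration; a real network would derive this relation from buffer and channel ownership.

```python
def has_deadlock(waits_for):
    """Return True if the wait-for graph contains a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    colour = {}

    def visit(node):
        colour[node] = GREY
        for successor in waits_for.get(node, ()):
            state = colour.get(successor, WHITE)
            if state == GREY:
                return True  # back edge: a cycle of waiting packets
            if state == WHITE and visit(successor):
                return True
        colour[node] = BLACK
        return False

    return any(colour.get(n, WHITE) == WHITE and visit(n) for n in waits_for)


if __name__ == "__main__":
    # Hypothetical packets A, B and C waiting on one another in a ring.
    print(has_deadlock({"A": ["B"], "B": ["C"], "C": ["A"]}))
    # The same packets waiting in a chain that ends at C.
    print(has_deadlock({"A": ["B"], "B": ["C"], "C": []}))
```

Deadlock avoidance schemes aim to make the cyclic case impossible by construction, whereas a check of this kind is closer to what a deadlock recovery scheme would need at run time.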
There are multiple deadlock avoidance schemes available, and most of them are general
enough to be adopted easily by any routing algorithm. Some of these schemes are listed below:
Restricted physical path (or turn model [GN92]) In a network-on-chip, deadlock is created
when the routing requests form a loop in the network. One way of structuring the resources to
accommodate all possible routes is to restrict the routing function. Placing an appropriate
restriction on routing can remove enough dependencies between packet routing paths that the
resulting dependency graph is acyclic.
Virtual channel [Dal92] The idea behind maintaining deadlock freedom despite a cyclic channel
dependence graph is to provide an escape path for each packet in a potential cycle. As long
as the escape path is deadlock-free, packets can move freely throughout the network.
Distance classes [DT04] Resources are grouped into numbered classes and their allocation is
restricted so that packets acquire resources from classes in ascending or descending order.
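The simplest restricted routing function is dimension-order (XY) routing on a 2-D mesh: a packet first travels along X until its column matches the destination and only then along Y. Because a packet never turns from the Y dimension back into the X dimension, the channel dependencies cannot close into a loop. The sketch below is a minimal illustration with hypothetical function names and coordinates given as (x, y) tuples.

```python
def xy_next_hop(current, destination):
    """Next router on the XY route, or None if already at the destination."""
    (cx, cy), (dx, dy) = current, destination
    if cx != dx:  # finish all movement in X first
        return (cx + (1 if dx > cx else -1), cy)
    if cy != dy:  # only then move in Y
        return (cx, cy + (1 if dy > cy else -1))
    return None


def xy_route(source, destination):
    """Full list of routers visited from source to destination."""
    path = [source]
    while path[-1] != destination:
        path.append(xy_next_hop(path[-1], destination))
    return path


if __name__ == "__main__":
    print(xy_route((0, 0), (2, 1)))
```

The price of this restriction is that every source and destination pair has exactly one route, so the scheme cannot steer traffic around a congested region.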
2.5.4 Summary of On-Chip Routing Schemes
The major routing schemes for on-chip network are summarized in Table 2.3.
2.5.5 Globally Asynchronous Locally Synchronous (GALS)
Globally Asynchronous Locally Synchronous (GALS) designs have emerged in recent years
as a way to retain the simple and powerful abstraction of synchronous design methodology while
avoiding the problematic global clock reference, by employing asynchronous communication
between synchronous local modules. In a GALS design, an asynchronous external interface is
combined with an internally clocked circuit. The Locally Synchronous (LS) blocks communicate using
asynchronous handshaking, so there is no need to consider global clock propagation. Ideally, if
the LS modules operate correctly according to the handshaking protocol, they can be freely
interconnected, providing excellent modularity and compatibility. In addition, the internal timing
of each LS module is independent of the circuits outside the module, as modules are
interfaced asynchronously. Thus, each module can execute at its own clock speed. Hence, a GALS
system provides an excellent framework for System-on-Chip or Network-on-Chip designs, which
allows modules to be reused and recombined to shorten the design cycle.
Implementation of GALS requires asynchronous interface wrappers to hold synchronous modules
together. This interface facilitates high-speed communication between synchronous modules
operating at different clock frequencies. However, the design of this interface is not trivial. There is a
risk of metastability at the interface which is not present in conventional self-timed asynchronous
logic. Metastability is a condition where the voltage level of a signal is at an intermediate level,
neither zero nor one, which may persist for an indefinite period of time but with exponentially
decreasing probability. This problem occurs in a synchronous module mainly during the
registration of incoming data from an asynchronous path: the data and clock signals are not
synchronized, so the data may not be registered correctly. Metastability also occurs in
communication between synchronous modules operating in different clock domains. Metastability can result
in failure of communication between modules. Generally, there are two approaches to tackle
metastability:
1. Adjustment of each synchronous module’s local clock and, where necessary, pausing
the clock, to avoid synchronization failure.
2. Brute-force synchronization of communication signals to each module’s free-running clock
with an acceptable probability of synchronization failure.
Reference                Routing method           Design objective                    Static/Dynamic   Deadlock avoidance               Remarks
[DT01]                   dimension-order (XY)     simplistic                          static           turn-model                       fast and small overhead but may cause network congestion
[Chi92]                  odd-even (OE)            minimize delay                      dynamic          turn-model                       efficient but may cause livelock
[BDM07]                  flooding (stochastic)    fault tolerance                     dynamic          no deadlock avoidance discussed  large overhead
[HM04]                   DyAD                     minimize delay                      partial dynamic  turn-model                       obtain a balanced performance between XY and Odd-Even
This thesis (Chapter 5)  DP-network               minimize delay and fault tolerance  dynamic          turn-model                       compute optimal path on-the-fly based on real-time traffic
[ACPP08]                 Neighbors-on-Path (NoP)  minimize delay                      dynamic          turn-model                       local search hill climbing
[GBLM07]                 XY-YX and bypass logics  fault tolerance                     dynamic          turn-model                       reduce the unreachable area by alternating between XY and YX
Table 2.3: A comparison of routing algorithms for on-chip network.
In addition, there are other approaches to communication across clock domains. For example,
mesochronous communication is a scheme in which the clocks run at the same frequency but
with unknown phases. Synchronization for mesochronous communication is relatively simple and
provides higher reliability for cross-clock-domain communication.
Stretchable Clock
Metastability can be avoided by using a technique called stretchable clock (or pausable clock). The
technique guarantees that communication signals will never violate setup and hold time constraints
with respect to the local clock. The idea is to temporarily stop the clock in the receiver when there
is an asynchronous request signal, so that the receiver circuit will not enter the metastable
state. The clock is restarted shortly after the setup time criterion is fulfilled. Although this
method is robust and incurs little overhead in communication delay, it generally involves designing
a special local clocking circuit, which is challenging for digital designers.
Chapiro first studied the GALS architecture and employed the pausable clock scheme in the
communication interface in the early 1980s [Cha84]. In [RMCF88], the architecture of Q-modules was
proposed, in which the clocks stretch over periods of metastability instead. More complex interfacing
schemes with judicious design of local clock generators abound [BC97, MTC+00, YD99]. To improve
the clocking speed of locally synchronous modules, and to scale up the module size, Mekie et
al. proposed a partial handshaking protocol that allows LS modules to run at gigahertz frequencies
and above [MCS04]. However, the proposed circuit had a non-zero probability of failure, which tends
to violate the original motivation of the pausable clocking scheme. Gürkaynak et al.
implemented a cipher algorithm with a GALS architecture and a pausable clock
scheme [GOV+03].
Brute-Force Synchronization
The double-latching scheme is a well-known approach that reduces the probability of synchronization
failure to an acceptable level by resynchronizing the communication signal with back-to-back latches
[Kin07]. A natural extension of this approach, known as pipeline synchronization,
increases the number of latches. The brute-force approach is easier to design and implement. The
major drawback is the long communication latency, as multiple latches are added to increase
the Mean Time Between Failures (MTBF).
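The trade-off between latency and MTBF can be illustrated with the first-order synchronizer model that is widely used in the literature, MTBF = exp(t_r / tau) / (T_w * f_clk * f_data), where t_r is the time allowed for metastability to resolve, tau is the resolution time constant of the latch, T_w is the metastability window, and f_clk and f_data are the clock and data transition rates. This formula is stated here as an assumption and is not taken from [Kin07]; the device parameters in the example are hypothetical.

```python
import math


def synchronizer_mtbf_s(resolve_time_s, tau_s, window_s, f_clk_hz, f_data_hz):
    """First-order MTBF estimate (in seconds) for a brute-force synchronizer."""
    return math.exp(resolve_time_s / tau_s) / (window_s * f_clk_hz * f_data_hz)


if __name__ == "__main__":
    # Hypothetical parameters: tau = 50 ps, T_w = 100 ps, 100 MHz clock, 10 MHz data.
    tau, window, f_clk, f_data = 50e-12, 100e-12, 100e6, 10e6
    period = 1.0 / f_clk
    # Assume each extra latch stage adds about one clock period of resolution time.
    for stages in (1, 2):
        mtbf = synchronizer_mtbf_s(stages * period, tau, window, f_clk, f_data)
        print(stages, "stage(s):", mtbf, "s")
```

Under this model each additional stage multiplies the MTBF by exp(T_clk / tau) while adding one clock cycle of latency, which is the trade-off described above.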
Implementation
An implementation of GALS on an FPGA platform was reported in [NSN+06], in which a Reed-Solomon
decoder was implemented using the GALS technique, including a
synchronous island wrapper. Arbiters are realized using on-chip LUTs, which can efficiently minimize
the probability of metastability. In [HBH05], Heath et al. argued that when data are transmitted and
received by two LS modules at a constant and exactly equal rate, an LS module is able to finish processing one
input data word before receiving the next one. A deterministic GALS methodology called
Synchro-Tokens, whose LS module wrappers employ token rings between pairs of communicating blocks,
was used. The local clock is controlled by the LS module wrapper such that the system behaviour
is deterministic. An implementation of an adaptively-pipelined mixed synchronous-asynchronous
digital FIR filter is reported in [STR+02].
A GALS system with dynamic voltage and frequency scaling can use the slowest frequency possible
to accomplish a task with minimal power consumption. Globally Asynchronous, Locally Dynamic
Systems (GALDS), in which the mechanism for implementing dynamic voltage scaling in each
synchronous domain is left to the designer, were proposed in [CZ05]. A bidirectional
asynchronous FIFO with dynamic clock generation was also proposed to communicate between
independently clocked synchronous blocks.
In [EBE04], a partial scan approach is presented that achieves a test coverage of 99.5% on the CHAIN
network-on-chip interconnect fabric, which is used as an example. The approach provides an
effective way to test fabricated asynchronous interconnects.
Mekie et al. proposed a high-level reasoning abstraction for GALS interfacing in [CMS03]. A
notion of abstract time is used as a formalism to model GALS systematically. Thus,
interfacing schemes can be evaluated at a relatively abstract level under the abstract timing
architecture. In [DGS04], design trade-offs are examined: arbitrated clocks may be used for locally delayed
input and output ports, thus facilitating high data rates. A review of GALS and asynchronous
interconnect networks is reported in [AFE+05].
Desynchronization
Desynchronization appears as an alternative to the GALS technique. The datapath remains
essentially synchronous and its clocks are locally generated, thus sharing some similarity with the
GALS concept. It prevents metastability completely by using handshakes between local
modules. The most important motivation for this approach is that an asynchronous architecture can be
generated from synchronous designs automatically [BCK+04, SL03]. The approach benefits greatly
from mature synchronous CAD tools, and asynchronous circuits can be synthesized
without extra effort. Approximately the same area, power and performance could be obtained
from a desynchronized circuit, as reported in [BCK+04]. Even complex designs, such as the DLX
microprocessor, can be desynchronized.
The key idea of desynchronization is that the clock distribution tree of a traditional synchronous
circuit is replaced by a local synchronization mechanism, built out of very simple standard
handshaking circuits. The idea was discussed in the past as phased logic in [LH96], but a formal
method with proof of correctness for different implementation protocols was studied in
[BCK+04]. Since desynchronization builds on these theoretical foundations of formal methods,
it provides an automatic synthesis method to derive medium-grained asynchronous circuits.
Assuming the initial design was implemented with edge-triggered flip-flops, desynchronization
comprises the following steps:
1. Conversion of the flip-flop-based synchronous circuit into a latch-based one
2. Generation of matched delays for combinational logic
3. Implementation of local controllers, which serve as local clocks.
Desynchronization is an interesting approach to address the design complexity of asynchronous
circuits.
2.6 High-Performance Intra-Chip Signalling
Research on intra-chip signalling focuses on providing a high-throughput and low-power solution
for signal transmission inside a chip. Since conventional signalling is based on a delay-based
approach, its throughput is dictated by the impedance of wires, which deteriorates with technology
scaling. New approaches, such as wave-pipelining, take a radically different approach by exploiting
the intrinsic wire characteristics, allowing several bits to be in transit simultaneously. Even though
significantly higher throughput can be achieved using the wave-pipelined approach, the analogue
asynchrony characteristics call for a dedicated transceiver to be designed. Other research has
focused on low-power and reliable link designs, which are also important topics within the
on-chip communication system area. In the following, some research work on link design is
highlighted.
2.6.1 The Princeton Approach
Xu and Wolf, from Princeton University, published an early paper on using wave-pipelining for
communication links in Network-on-Chip [XW02]. The authors propose pipelining data bits along
the communication link itself, in the same way as data is pipelined through register stages but
applied at a lower level. A simulation study of a 10 mm global interconnection was performed, and
the energy and area efficiencies were demonstrated. In particular, a 1.7 times speed-up was found
in comparison with conventional delay-based signalling. Although the study was carried out in an
old technology, a 0.25 µm aluminium process, it revealed the benefits of wave-pipelining.
The same authors published a second paper describing a structured communication link design
technique, the wave-pipelined interconnect, for networks-on-chip [XW03]. This paper shows in detail
various techniques to save power and area, and to achieve high performance using relatively
old technology. Circuit techniques such as interleaved lines and misaligned repeaters were used to
reduce the effect of crosstalk. The authors designed and simulated a 10 mm, 16-bit wave-pipelined
interconnection in a 0.25 µm technology, and achieved 3.45 GHz and 55.2 Gbps performance. Although
the potential benefits of wave-pipelined signalling have been demonstrated through design
simulation, the intrinsic characteristics of the wire and the theoretical performance limits of
a wave-pipelined link are still unclear.
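As a quick sanity check on the figures quoted from [XW03], the aggregate throughput of a parallel link can be computed as the number of lines times the per-line rate. The sketch below assumes one bit per line per cycle; the function name is illustrative and does not come from the cited papers.

```python
def aggregate_throughput_gbps(bit_width: int, line_rate_ghz: float) -> float:
    """Aggregate throughput of a parallel link in Gbps.

    Assumes every line carries one bit per cycle, so the link delivers
    bit_width * line_rate_ghz gigabits per second.
    """
    return bit_width * line_rate_ghz


# 16-bit link at 3.45 GHz per line, the figures quoted above for [XW03].
link_gbps = aggregate_throughput_gbps(16, 3.45)
print(f"{link_gbps:.1f} Gbps")
```

Under this assumption the 16-bit, 3.45 GHz link is consistent with the reported 55.2 Gbps.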
2.6.2 The GIT Approach
Deodhar and Davis, from Georgia Institute of Technology, published an interesting analytical
method to analyze and predict the throughput of interconnects based on the electrical properties of
wires and repeaters [VD05]. This method applies a distributed RC model to capture the waveform
characteristics of a buffered line, so that the pulse width and attenuation of the waveform throughout
a long interconnection can be predicted analytically.
The method is simple and effective, and it gives a clearer picture of the signalling mechanism
that executes wave-pipelining along the line. However, several basic assumptions were made
throughout the analysis, which limit the application of the theory and prohibit the prediction
of throughput for interconnect structures that differ from those of traditional ASICs, such as FPGAs.
Joshi, Deodhar and Davis, from the same research group, further extended the wave-pipelining
design and applied it to multiplexed routing in general circuits [JDD06]. Because a wave-pipelined
link provides higher throughput, the number of interconnects can be reduced. The paper demonstrates
a multiplexed line that replaces two conventional delay-based signalling lines. Theoretically,
the number of global interconnects can be halved by adopting this approach. However,
a new design flow and associated CAD tools are needed in order to introduce these multiplexed
links.
2.6.3 The KAIST Approach
The research group at Korea Advanced Institute of Science and Technology (KAIST) developed
several comprehensive Network-on-Chip systems to demonstrate the feasibility of on-chip
communication networks. The most celebrated test chip, Prototype of On-Chip Network (PROTON)
[MB06], implemented an on-chip network running at 800 MHz and used a 4:1 on-chip serialization link.
The architecture is low-power because low-voltage-swing signalling is applied.
Figure 2.9: The Wave-pipelined Wave Front Serializer (WAFT)
The novel WAFT-SERDES circuit proposed in [LKK+05] presented an interesting topology to
realize on-chip serialization. The circuit is simple and needs no complex control circuitry.
Serialization is achieved by a chain of carefully calibrated delay elements and multiplexers.
In contrast to a register-based serializer, WAFT has the advantage of simplicity: it uses fewer
transistors and consumes less power. The circuit schematic is shown in Fig. 2.9. When the data is
stored at the multiplexers, the pilot signal drives the multiplexers to forward the data through the
delay chain. The delay elements separate the data bits in time to avoid inter-symbol interference.
The data is then transmitted to IOUT as a serialized data train. The receiver has a similar design, and
data are sampled based on the arrival time of the pilot pulse.
The circuit greatly simplifies the design of SERDES and has demonstrated an 800 MHz signalling
speed. However, this approach introduces a potential hazard of transmission failure because it is
vulnerable to process variation and on-chip noise. The accuracy of transmission depends heavily on
the matching of delay elements between the transmitter and receiver, which makes the link design
unreliable.
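The sensitivity to delay matching can be illustrated with a toy timing model. This is only a behavioural sketch under stated assumptions, not the WAFT circuit from [LKK+05]: the function name, the picosecond values, and the rule that the receiver samples in the middle of its own delay slot are all hypothetical.

```python
def sample_serial_stream(bits, tx_delay_ps, rx_delay_ps):
    """Return the bits a receiver recovers from a delay-chain serializer.

    Toy model (an assumption, not the WAFT circuit itself):
    - the transmitter places bit k on the wire during
      [k * tx_delay_ps, (k + 1) * tx_delay_ps) after the pilot pulse;
    - the receiver samples bit k at k * rx_delay_ps + rx_delay_ps // 2.
    A sample that falls after the last bit slot is reported as None.
    """
    recovered = []
    for k in range(len(bits)):
        t_sample = k * rx_delay_ps + rx_delay_ps // 2
        slot = t_sample // tx_delay_ps  # which transmitted bit is on the wire
        recovered.append(bits[slot] if slot < len(bits) else None)
    return recovered


data = [1, 0, 1, 1, 0, 0, 1, 0]
matched = sample_serial_stream(data, tx_delay_ps=1000, rx_delay_ps=1000)
skewed = sample_serial_stream(data, tx_delay_ps=1000, rx_delay_ps=1200)
print(matched == data)  # delays match: every bit is sampled in its own slot
print(skewed == data)   # delays differ: sampling drifts into later slots
```

In this model the sampling error grows with the bit index, so a mismatch that is harmless for the first few bits eventually corrupts the stream.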
2.6.4 The Newcastle Approach
Alongside the proposals that target high-throughput intra-chip signalling, other
proposals have been published that advocate reliable data transfer across a chip. Effects
such as crosstalk, power noise and interconnect defects all impact the performance of
high-speed signalling. Also, some of the high-speed approaches, such as wave-pipelining, are
more vulnerable to dynamic noise. Therefore, new signalling schemes that focus on signalling
reliability have been proposed. One example is the phase encoding approach from Newcastle University.
Figure 2.10: The Phase Encoding Communication System
D’Alessandro, Shang, Bystrov and Yakovlev, from Newcastle University, proposed a new form of
intra-chip signalling called phase encoding. Phase encoding was first proposed in [DSBY05] as an
implementation strategy for high-speed serial links and later an FPGA-based implementation was
proposed [MDS+08b]. The concept employs the order of events on a pair of wires to indicate the
bit value. The last-occurring event, in turn, indicates the completion of transmission of a single bit,
enabling clock recovery (or, in a self-timed/asynchronous domain, recovery of a request signal).
The scheme allows the use of both rising and falling edges for transmission, allowing a natural
multiplexing of two channels onto the same link. Fig. 2.10 shows a conceptual arrangement of
a dual-channel link. This can be exploited to increase the bit-rate of the link, while maintaining
a reduced operating frequency. The time separation between the edges is immaterial and only
needs to satisfy the setup/hold condition of the receiver’s phase detector in order to ensure correct
operation.
Phase encoding can be used with wave-pipelining if both wires are pipelined by the same amount,
that is, if the delay introduced by the delay elements matches on both wires. This limitation is
significant if simple buffers are used: process variations may introduce significant skew between
the two wires, effectively preventing reliable transmission. In particular, wire performance
deteriorates due to process variation, such that if the net variation on one wire is greater than the
variation on the other wire, the phase relationship will be corrupted. However, dedicated buffers
can be used, which allow more reliable signalling at the expense of absolute latency. In
[DMBY07] various repeater designs for phase-encoded links are proposed, which
can be successfully implemented on FPGA architectures. Phase encoding provides an alternative
for on-chip signalling which alleviates the concern of crosstalk due to the dual wiring.
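The order-of-events idea, and its sensitivity to skew between the two wires, can be sketched in a few lines. This is a minimal behavioural model rather than the circuit of [DSBY05]: the mapping of bit values to event order, the function names and the time units are assumptions made for illustration.

```python
def encode(bit, t_start, separation):
    """Return event times (t_a, t_b) on wires A and B for one bit.

    Assumed mapping (hypothetical): bit 1 -> A switches first, bit 0 -> B first.
    """
    if bit:
        return t_start, t_start + separation
    return t_start + separation, t_start


def decode(t_a, t_b):
    """Recover (bit, t_done) from the arrival times on the two wires."""
    bit = 1 if t_a < t_b else 0
    t_done = max(t_a, t_b)  # the last event marks completion of the bit
    return bit, t_done


def transmit(bits, period, separation, delay_a, delay_b):
    """Send bits over a wire pair whose wires add delay_a and delay_b."""
    received = []
    for n, bit in enumerate(bits):
        t_a, t_b = encode(bit, n * period, separation)
        received.append(decode(t_a + delay_a, t_b + delay_b)[0])
    return received


payload = [1, 0, 0, 1]
balanced = transmit(payload, period=100, separation=10, delay_a=50, delay_b=50)
skewed = transmit(payload, period=100, separation=10, delay_a=50, delay_b=65)
print(balanced, skewed)
```

With equal wire delays the event order, and hence every bit, is preserved. Once the skew between the wires exceeds the edge separation, the order of events for one of the bit values is inverted and that value can no longer be received.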
2.6.5 The Cambridge Approach
Hollis and Moore, from the Computer Architecture Group of Cambridge University, proposed
an interesting high-speed pulse-based interconnect architecture for on-chip network communication
[HM06]. The architecture, named RasP, is based on the GasP FIFO developed at Sun
Microsystems, and requires significantly fewer wires for transmission because of the embedded
serialization and deserialization logic. Furthermore, it is based on an asynchronous handshaking
protocol that allows communication across clock domains. Unlike synchronous schemes that require a
source-synchronous clock reference, the RasP scheme is reliable and efficient. Compared with the
CHAIN interconnect [BF02], by Bainbridge et al. from Manchester University, RasP requires
significantly less wire area and is faster.
Francis, Moore and Mullins, from the same research group, proposed a time-division multiplexed
(TDM) wiring scheme for FPGAs [FMM08]. Their idea is to introduce novel fine-grained
TDM units into the FPGA routing network so that the pre-fabricated wires in FPGAs are shared
between different signals. An advantage of this technique is the reduction of wire usage in an FPGA
circuit and, thus, a potential reduction in FPGA physical area and power consumption. Results in
[FMM08] show that the amount of FPGA configuration wiring can be reduced by more than 72%.
This implies the possibility of a significant reduction in FPGA silicon area. However, additional
logic area is required to implement the TDM unit at each routing switch, and the design of the TDM
unit may also need to be optimized through further analytical studies. Nonetheless, this is a novel and
interesting approach to tackling FPGA routing congestion and alleviating the interconnect
complexity in FPGA architecture design.
2.6.6 A Summary of High-Performance Intra-Chip Signalling
The different approaches to on-chip signalling and communication are listed
and summarized for comparative purposes in Table 2.4. Each approach is classified according to
the defining features described below:
• Is the proposed signalling method analyzed through analytic or simulation-based means?
• What technology process is used?
• What kind of scheme is employed for cross-clock domain communication?
• What is the throughput achieved?
2.7 Summary
This chapter has provided a background to previous, concurrent, and ongoing work in the field
of architectural design and circuit techniques for on-chip communication systems. A survey of
the architectures has been given, and a taxonomy has been presented that classifies them by
structure, such as “soft” and “hard” communication architectures, and by functionality. It
has been noted that there is a gap between the ever-increasing communication demand from the
multiple on-chip modules or cores and the available bandwidth provided by the existing hardware
architectures.
The major design criteria for on-chip communication systems have been briefly reviewed, and a
detailed survey and summary have been made of the existing published approaches to the problems
of on-chip signalling and routing. Signalling and routing are the two key concepts in
designing an efficient communication system: the on-chip communication bandwidth is determined
by the signalling throughput, whereas routing determines the utilization of the overall bandwidth.
FPGAs provide a flexible platform to realise hardware architectures and systems. Due to the
complex programmable interconnect architecture, circuit design and optimization for on-FPGA
signalling and routing are challenging, and have not been studied. Directly adapting on-chip
communication architectures that were proposed for ASIC design can result in significant hardware cost
and limited improvement in communication bandwidth. This thesis provides an in-depth analytical
study of the FPGA interconnect architecture. Novel signalling techniques and routing methodologies
that can exploit the programmable interconnect fabrics are proposed, and these techniques can
substantially enhance on-FPGA communication performance.
Table 2.4: A comparison of high-performance intra-chip signalling methods.
Approach                    Proposed method         Analytic/Simulation                           Technology  Hand-shake protocol          Throughput (a)
Princeton [XW02, XW03]      wave-pipelining         simulation (Cadence)                          250nm       source synchronous           3.45GHz on 10mm
GIT [VD05, JDD06]           wave-pipelining         analytical and simulation                     130nm       time division multiplexing   2.5GHz on 1cm
KAIST [LKK+05]              wave-pipelining         simulation (HSPICE)                           130nm       WAFT                         800MHz
Newcastle [DSBY05, DMBY07]  phase encoding          simulation (HSPICE)                           90nm        asynchronous clock recovery  800MHz
This thesis (Chapter 4)     wave-pipelining         analytical, simulation and real FPGA testing  90nm        source synchronous           1.3GHz
Cambridge [HM06]            pulse-based signalling  simulation (Cadence)                          90nm        asynchronous hand-shake      1.2GHz
(a) The throughput of on-chip signalling is affected by various factors, such as buffer design,
interconnection length, voltage, etc., as will be discussed in Chapter 5. The values presented here
are for reference only and not for comparative purposes.
Chapter 3
Fringed Interconnects and Bandwidth
Degradation in Communication Links
3.1 Introduction
On-chip communication systems, such as hierarchical buses and networks-on-chip, always comprise
multiple instances of high-performance communication links. These links are fundamental
to the overall system and serve as a backbone interconnecting modular blocks, embedded
processors, DSP modules and memories. They are also responsible for delivering bandwidth as
on-chip communication channels. The bandwidth of these links determines the inter-modular
communication speed and is critical to the overall system performance.
3.1.1 Degraded Bandwidth in Communication Links
The basic realization of a communication link uses parallel lines or buses. In a
synchronous system, ideally, the bandwidth of a bus is proportional to the bit-width. Mathematically,
the bandwidth is given by β = S · α, where S is the bit-width and α is the throughput of a
single line. However, this assumption may not be valid, especially when the lines are long and
are implemented in a complex programmable architecture. The interconnections in a communication
link have different lengths. An increase in bit-width can potentially create routing congestion
and, thus, adversely affect the delay and bandwidth of the link. An example is shown in Fig. 3.1,
which sketches a general observation of the bandwidths of parallel lines for different bit-widths.
The dashed line represents the ideal bandwidth and the solid line represents the degraded bandwidth
after place-and-route in a reconfigurable architecture. The bandwidth degradation increases
with bit-width. This would eventually hamper the overall performance of communication systems
and diminish the utilization of hardware resources for bandwidth delivery. This chapter aims
to investigate the cause of bandwidth degradation and to understand the impact of the underlying
reconfigurable architecture on the implementation of high-performance links.
[Figure 3.1 plot; vertical axis: Bandwidth (bits/second)]
Figure 3.1: The ideal and degraded bandwidths against bit-width of a communication link in
FPGAs.
3.1.2 Interconnection Length Prediction
The delay and bandwidth of a link are determined by interconnection length. An estimate of the
interconnect length can be used to predict the bandwidth of parallel lines and provides an in-depth
understanding of the bandwidth degradation issues. Although there are a number of methods to estimate
interconnection lengths and their distribution for general circuits [Str01, BB03, MCSB06], a method
to accurately characterise and predict communication link bandwidth is still missing. The specific
difference in assumptions between “generic logic” and communication links is that the routing
endpoints for generic logic are generally evenly distributed over an area, whereas communication
links involve S wires that are routed in parallel between two endpoint regions.
A heuristic approach based on the bounding box method has been proposed in [BB03]. This method
can provide an approximation for generic logic delay in FPGAs. In [MCSB06], an estimate of the
delay distribution is inferred by studying a number of large design examples and applying
statistical methods. However, all of these approaches target delay prediction for generic logic,
and a linear relationship between bandwidth and bit-width is still assumed. To accurately
predict the bandwidth of communication links, a more detailed routing model is required. The early work
by Brown [BRV93], which proposed an analytical model to predict the routability of a particular FPGA
architecture for a given channel width, is particularly interesting. Although the analytical model
presented in [BRV93] aims to compute the circuit routability and to evaluate the FPGA architecture,
it provides a solid framework for extension to communication link modeling and
bandwidth prediction.
In [MSCL07], a simple method to approximate the interconnection length in a communication link is
presented (see Appendix A). This approach greatly simplifies the routing configuration by
regarding the utilization of long interconnects in the routing channels as a constant
and, thus, provides a simple and fast approximation for the interconnection lengths.
In this chapter, a methodology to model communication links and to predict the length of the
interconnections and, subsequently, the link bandwidth is presented. An analytical model and
expressions for the mean and variance of the interconnection length of a link are rigorously derived.
The model is versatile, enabling the exploration of a wide range of communication link designs, and is
readily applicable to investigating the bandwidth degradation issue for different architectures.
FPGA architecture parameters, such as channel width and number of tracks in the channel, are also
captured in this model and, thus, the theory could be extended to study alternative FPGA routing
architectures. This model improves on the simple approximation in [MSCL07] by considering a more
realistic and comprehensive routing architecture. Furthermore, a general phenomenon, which is
termed here “interconnect fringing”, has been identified in communication link implementation.
Fringing introduces additional delay to a link because of the dispersed routing of the “fringed”
interconnections. The proposed model also effectively captures the fringing phenomenon in order
to provide a more accurate delay estimate.
The contributions of this chapter are as follows:
• A model of communication links in FPGAs is presented. This model captures the reconfigurable
architecture and the basic routing mechanics. Interconnect fringing, a special phenomenon
of routing in FPGAs, is also characterized. (Section 3.2)
• A comprehensive model with detailed routing and architecture modelling is proposed to
compute both the average and variance of the interconnection lengths of a communication link.
This model provides an accurate estimate of wire channel utilization so that the bandwidth
degradation can be rigorously characterized. (Section 3.3)
• The interconnection length and delay generated from the stochastic model are evaluated
using FPGA CAD tools. Also, the bandwidth degradation of communication links is discussed
and analyzed with the aid of the proposed model. (Section 3.4)
3.2 A General Communication Link Model
3.2.1 An Island-Based FPGA Model
All FPGA devices have a similar basic logical structure, comprising an array of logic blocks
surrounded by routing channels which are connected via switch boxes, as illustrated in Fig. 3.2.
The routing channels contain wire segments which, in modern FPGAs, span across a variable
number of tiles. Wire segments are connected together to form signal paths between logic blocks
by configuring the switch blocks. The cost of the interconnect flexibility is primarily reduced
performance, due to the increasingly significant wire delays and also the area consumed by routing
channels and switch boxes.
Configuring an FPGA involves connecting together wire segments and programming logic blocks
and other heterogeneous configurable logic. Wire segments are connected using transmission gates
or NMOS pass-gates controlled by configuration bits. Each configuration bit is usually stored in
an SRAM cell, although non-volatile technologies, such as Flash (e.g. ProASIC devices
[Act02] from Actel and LatticeXP devices [Lat09] from Lattice Semiconductor) and one-time-programmable
antifuse (e.g. Axcelerator [Act09] from Actel), are also available.
Figure 3.2: The basic logical architecture of an island-style FPGA.
3.2.2 An Interconnections Model
A communication link can be modeled as a set of parallel interconnections that connect two re-
gions which are separated physically by a significant distance on the chip. Consider two modular
blocks, which are constructed from FPGA programmable logic, and which are connected by a link
(see Fig. 3.3(a)). Following the conventional tile-based model [BRM99], each modular block is
realized in a rectangular array of tiles. Each tile comprises a logic block, connection points and
a switch block. It is assumed that all local signals within the modules are successfully
routed along the shortest path. The dimension of the modular block is defined by a placement
constraint during implementation. It is assumed that there are channels of programmable inter-
connect available for realizing any connections between the two modules. An example is shown
in Fig. 3.3(b) for the channels and a typical interconnection.
Figure 3.3: A typical communication link. (a) An abstract view of the link, which provides S-bit
interconnections for connecting two modular blocks. (b) A global routing model for the link.
In many commercial and state-of-the-art FPGAs [Xil06c, Alt05], different types of interconnects
are available in the routing channel for realising different interconnections and nets. These
heterogeneous interconnects are versatile and effective in accommodating the different requirements
of a design. Interestingly, although there are different types of interconnect available,
long wires are most frequently used to realize the interconnections between two separated modules,
while other types of interconnects are used for interconnections within the modular block, as
can be observed from Fig. 3.4. This is easily explained by the fact that long wires are
efficient for long-distance routing.
The experimental result depicted in Fig. 3.4 shows the utilization of different types of interconnects.
This is the result for a 256-bit link connecting two modular blocks which are 50 tiles
apart, where each tile contains two look-up tables. Sufficient time was allowed to place and route the
design. The upper graph shows the statistics of interconnection delay for the communication link.
The corresponding statistics for the different types of interconnect are shown in the lower
graph. It clearly shows that the long wires contribute most of the signal propagation delay. A
closer look at the routed circuit reveals that only long wires are used for the global routing in the
channels that traverse between the two modules. Other wires are used as local interconnections to
connect the logic blocks to the long wires. Therefore, it is important for the model to capture the
long wire utilization in the channels, whereas a simplified treatment of the routing within the
modular block is adequate.
[Figure 3.4 plot; two panels; horizontal axis: Delay (ns), 2.5 to 4.5; vertical axis: Occurrences; legend: Long, Hex, Short]
Figure 3.4: Routing result of a 256-bit link connecting two modular blocks which are 50 tiles
apart. The occurrences of interconnections for different delays in the link are shown in the upper
panel. Among these interconnections, the number of different types (short, hex, and long) of wires
that are used to construct the interconnections is shown in the lower panel.
3.2.3 Fringed Interconnections
The number of long wires in each channel is limited. If the number of signals to be routed between
modules is greater than the capacity of long wires in the channels that the module overlaps, long
wires in channels that are further away will be used for some signals. The average interconnection
length increases because of the extra distance traversed by some signals to the closest
available long wire. Interconnections that are displaced into distant channels are designated
fringed interconnections. Fig. 3.5(a) shows an abstracted view of fringing when realizing a link
in an FPGA. The fringed interconnections will generally be longer than the direct interconnections.
Fig. 3.5(b) shows a typical example of a fringed interconnection. When the long
wires at channels 1, 2 and 3 are all occupied, although these channels provide the shortest paths for the
interconnection, the interconnection will traverse to channel 4, which has available long wires.
Thus, it is important for the stochastic model to be able to capture the fringing effect.
Figure 3.5: (a) An abstract view of both direct and fringed interconnections in a link after the
placement and routing steps. (b) A typical example of a fringed interconnection.
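The fringing behaviour described in this subsection can be illustrated with a toy allocation in which signals fill the nearest long-wire channels first and spill into more distant channels once the near ones are full. This is a deliberately simplified sketch, not the place-and-route behaviour of any CAD tool; the function name, the zero-based channel indexing and the parameter values are assumptions.

```python
def fill_channels(num_signals, capacity, num_direct, num_channels):
    """Greedily assign signals to long-wire channels, nearest channel first.

    Channels 0 .. num_direct-1 are the ones the module overlaps (direct);
    higher indices are progressively further away (fringed).
    Returns (usage_per_channel, direct_count, fringed_count).
    """
    if num_signals > capacity * num_channels:
        raise ValueError("not enough long wires for all signals")
    usage = []
    remaining = num_signals
    for _ in range(num_channels):
        used = min(capacity, remaining)  # a channel holds at most `capacity`
        usage.append(used)
        remaining -= used
    direct = sum(usage[:num_direct])
    return usage, direct, num_signals - direct


# Three overlapped channels with four long wires each, thirteen signals:
# the thirteenth signal is displaced into the fourth channel.
usage, direct, fringed = fill_channels(
    num_signals=13, capacity=4, num_direct=3, num_channels=6
)
print(usage, direct, fringed)
```

The example mirrors the situation of Fig. 3.5(b): once the first three channels are occupied, the next interconnection has to use the fourth.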
3.2.4 Notations and Indexes
The notation used in the derivation of the theoretical model is listed in Table 3.1. (i, k) denotes
row i and column k of the placement area, where i = 1, 2, ..., u, k = 1, 2, ..., v, and u and v
are the number of rows and columns of the placement area, respectively (see Fig. 3.6). Index j
denotes the j-th channel, where j = 1, 2, ..., N. Since symmetric routing on the two
sides of a link is assumed, as shown in Fig. 3.4, the channels are indexed from the middle of the
placement area. The channel j = 1 corresponds to row u/2 + 1 of the placement area. In other
words, the relationship between i and j can be expressed as i = j + u/2.
Figure 3.6: An illustration of the notations and modeling for the local routing of interconnection
within a modular block.
Table 3.1: Notations used in the interconnection model
u          Width of the placement area
v          Length of the placement area
L          Distance (tiles) between the two placement areas
W          Capacity of channels (long wires)
S          Total number of interconnections to be routed
$\xi_l$    Length of the l-th interconnection
$X_{i,j}$  Length of the i-th interconnection at channel j
$Y_j$      Number of interconnections at channel j
R          Average length of all interconnections
N          Total number of channels available
D          Distance between the placement and the long wire channel
3.3 A Stochastic Model for Interconnections Length Prediction
In this section, the derivation of the expected length and variance of interconnections is presented.
A stochastic model to estimate the mean and variance of the interconnect length, which are subsequently
used to compute the delay and bandwidth, is also derived. The major steps in the derivation
are to evaluate the expected utilization and the expected length of interconnection that traverses
through each routing channel. Based on these estimations, the overall interconnection length and
variance can be computed. The proposed model which characterizes the result of FPGA routing as
a stochastic process is similar to that used in earlier research on evaluation of FPGA architecture
[BRV93] and modelling of circuit interconnection lengths [ES81].
3.3.1 Average Interconnections Length
Suppose that the average interconnection length is $R$ and the total number of interconnections is
$S$. Let $\xi_l$ be a random variable that denotes the length of the $l$-th interconnection, $l = 1, 2, \ldots, S$;
then the average length of interconnections becomes $R = \sum_{l=1}^{S} \xi_l / S$. By taking the expectation,
$E[R]$ becomes
\[
E[R] = \frac{1}{S}\, E\left[ \sum_{l=1}^{S} \xi_l \right] \tag{3.1}
\]
To connect the average interconnection length to the physical mapping parameters, such as the
interconnection channels, two important random variables need to be defined:
(1) $X_{i,j}$ = length of the $i$-th interconnection that goes through channel $j$.
(2) $Y_j$ = total number of interconnections at channel $j$.
$X_{i,j}$ and $Y_j$ are assumed to be independent, as the lengths of the interconnections and the
utilization of interconnects at each channel are usually considered separately in routing algorithms.
Suppose $N$ channels are available for realizing these global communication links, each
channel has $W$ tracks and $NW \gg S$. Assuming that the routing and placement are symmetric
for the two sides of a link, as depicted in Fig. 3.3, the total interconnection length becomes
$\sum_{l=1}^{S} \xi_l = 2 \sum_{j=1}^{N} \sum_{i=1}^{Y_j} X_{i,j}$, given that $\sum_{j=1}^{N} Y_j = S/2$ and
$Y_j \le W$, $j = 1, 2, \ldots, N$. Thus, $E[R]$ becomes
\begin{align}
E[R] &= \frac{1}{S}\, E\left[\, 2 \sum_{j=1}^{N} \sum_{i=1}^{Y_j} X_{i,j} \;\middle|\; \sum_{j=1}^{N} Y_j = S/2,\; Y_j \le W \right] \tag{3.2} \\
     &= \frac{2}{S} \sum_{j=1}^{N} E\left[\, \sum_{i=1}^{Y_j} X_{i,j} \;\middle|\; \sum_{j=1}^{N} Y_j = S/2,\; Y_j \le W \right] \tag{3.3}
\end{align}
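By applying the identity$^{1}$ $E[\sum_{i=1}^{Y} X_i] = E[Y]E[X]$, where $Y$ is assumed to be a random variable
that is independent of the $X_i$'s and $X_i$ is independent of $X_j$ for $i \neq j$, the expected average
interconnection length becomes:
\begin{align}
E[R] &= \frac{2}{S} \sum_{j=1}^{N} E[X_j]\, E\left[\, Y_j \;\middle|\; \sum_{j=1}^{N} Y_j = S/2,\; Y_j \le W \right] \tag{3.4} \\
     &= \frac{2}{S} \sum_{j=1}^{N} X(j)\, Y(j) \tag{3.5}
\end{align}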
In the following sections, the derivation for Y (j) and X(j) will be presented.
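The random-sum identity used to obtain Eq. (3.4) can be checked numerically, and the final form of Eq. (3.5) is straightforward to evaluate once X(j) and Y(j) are known. The sketch below is illustrative only: the distributions chosen for Y and X_i, the function names and the per-channel values are assumptions, not quantities from the thesis.

```python
import random


def random_sum_mean(trials, seed=0):
    """Monte Carlo estimate of E[sum_{i=1}^{Y} X_i].

    Illustrative distributions (assumptions): Y is uniform on {0, ..., 10}
    and the X_i are i.i.d. uniform on [0, 2], independent of Y.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        y = rng.randint(0, 10)
        total += sum(rng.uniform(0.0, 2.0) for _ in range(y))
    return total / trials


def expected_average_length(S, X, Y):
    """Evaluate (2 / S) * sum_j X(j) * Y(j), the form of Eq. (3.5)."""
    return 2.0 / S * sum(x * y for x, y in zip(X, Y))


estimate = random_sum_mean(100_000)
print(estimate)  # should be close to E[Y] * E[X] = 5 * 1

# Hypothetical per-channel values: mean lengths X(j) and mean counts Y(j),
# chosen so that the counts sum to S / 2.
R = expected_average_length(S=16, X=[10.0, 12.0, 14.0], Y=[4.0, 3.0, 1.0])
print(R)
```

In the second call the more distant channels carry longer interconnections, so any signal displaced into them raises the average length.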
Derivation of Y (j)
Y(j) is the expected number of interconnections in channel j. Supposing that every interconnection
between the two modular blocks goes through one of the channels, Y(j) can be modeled as the number of
interconnections that enter channel j. This assumption simplifies the process of interconnect
generation and is not intended to model the exact routing mechanics; rather, it provides a
simple analytical model for the results of routing algorithms. Furthermore, this assumption yields
an optimistic routing, as a shortest path is assumed for connecting two tiles through the routing
channel. More sophisticated models can be built on this work to provide more detailed
characterizations based on different routing assumptions.
1 See proof in [Ros02].
An example is depicted in Fig. 3.6. An interconnection is generated at tile (i, k) and traverses to
join the channel j. The interconnection traverses through the shortest path to the nearest available
channel, following the direction of the arrow in the figure. It is assumed that there are always
switches and connection points available in the module for the interconnection to traverse locally.
However, when all tracks in the global interconnection channel are used up, the interconnection
has to traverse further to find an available track to pass through the channel.
Let $Q^{j}_{(i,k)}$ be the indicator variable for the event that an interconnection is generated at tile $(i, k)$
and traverses to connect to one of the long wires at channel $j$. It is assumed that interconnections
originate at tile $(i, k)$ with probability $\kappa$, and that an interconnection originating at $(i, k)$
moves toward channel $j$ with probability $\theta$. Following [ES81] in modelling the interconnection length,
let $D$ be the distance that the interconnection traverses before stopping at a long
wire channel; $D$ is assumed to have a geometric distribution with parameter $\lambda_h$, as the geometric
distribution is a discrete version of the exponential distribution. This assumption has been verified
by experiment. A 256-bit link connecting two modular blocks which are 50 tiles apart was
placed and routed using the Xilinx ISE design environment for a Virtex-4 (XC4VLX200) FPGA. The
routing results were inspected using the Xilinx FPGA Editor, and the
distribution of the distance $D$ was obtained by counting the number of tiles that an
interconnection traverses before joining the long-range routing track.
Fig. 3.7 shows the distribution of the distance that interconnections travel vertically before connecting
to a long wire. The probability mass function (pmf) of $D$ is denoted $h_{\lambda_h, d(i,j)}$, where
$d(i, j) = |j + u/2 - i|$ is the distance between row $i$ and channel $j$. Thus, the probability that
$Q^{j}_{(i,k)} = 1$ is $z_{ij} = \kappa\theta\, h_{\lambda_h, d(i,j)} = \kappa\theta (1 - \lambda_h)^{d(i,j)} \lambda_h$.
The total number of interconnections at channel j becomes
$Y_j = \sum_{i}^{u} \sum_{k}^{v} Q^{j}_{(i,k)}$, which counts the interconnections generated from all
tiles that converge to channel j, given that the total number of interconnections is S/2 and
$Y_j \le W$. In the following, it is shown that $Y_j$ can be approximated by a Poisson distribution
with parameter $\lambda_j = v \sum_{i}^{u} z_{ij}$.
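The definitions of $z_{ij}$ and $\lambda_j$ above can be evaluated directly. The following is a minimal sketch, not the implementation used in this work; the function names and all numerical values (u, v, κ, θ, λh) are hypothetical and chosen only for illustration.

```python
def geometric_pmf(d, lam_h):
    """h(d) = (1 - lam_h)^d * lam_h, the geometric pmf of the distance D."""
    return (1.0 - lam_h) ** d * lam_h


def z(i, j, u, kappa, theta, lam_h):
    """z_ij = kappa * theta * h(d(i, j)), with d(i, j) = |j + u/2 - i|."""
    d = abs(j + u / 2 - i)
    return kappa * theta * geometric_pmf(d, lam_h)


def channel_rate(j, u, v, kappa, theta, lam_h):
    """lambda_j = v * sum_{i=1}^{u} z_ij (Poisson parameter of channel j)."""
    return v * sum(z(i, j, u, kappa, theta, lam_h) for i in range(1, u + 1))


# Hypothetical parameters: a block of u = 8 rows and v = 4 columns.
u, v = 8, 4
kappa, theta, lam_h = 0.5, 0.5, 0.4
rates = [channel_rate(j, u, v, kappa, theta, lam_h) for j in range(1, 6)]
```

With these illustrative values the rates decrease with j, which reflects that channels further from the rows of the block are reached with lower probability.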
Lemma 1. For any interconnection that is generated at tile (i, k) and joins a long wire track
at channel j with probability $z_{ij}$, the number of interconnections at channel j, $Y'_j$, can be
approximated by a Poisson distribution with parameter $\lambda_j = v \sum_{i}^{u} z_{ij}$, where
i = 1, 2, ..., u and j = 1, 2, ..., N.
Figure 3.7: Distribution of the parameter D, which is the distance an interconnection traversed
before connecting to a long wire. [Plot: probability against distance (tile); series: Exp, Theo.]
Proof. Let $J_i^j$ be the total number of interconnections generated from row i and joining tracks at
channel j, so that $J_i^j = \sum_{k}^{v} Q^{j}_{(i,k)}$. Therefore

$$P\{J_i^j = n\} = \binom{v}{n} (z_{ij})^n (1 - z_{ij})^{v-n} \tag{3.6}$$

That is, $J_i^j$ has a binomial distribution with mean $v z_{ij}$ and variance $v z_{ij}(1 - z_{ij})$.
This binomial distribution can be approximated by a Poisson distribution (see footnote 2) with
parameter $\lambda_{ij} = v z_{ij}$.
Since $Y_j = \sum_{i}^{u} J_i^j$, each $J_i^j$ is a Poisson variable with parameter $\lambda_{ij}$,
and the sum of independent Poisson variables is again Poisson, $Y_j$ becomes a Poisson variable with
parameter $\lambda_j = \sum_{i}^{u} \lambda_{ij} = v \sum_{i}^{u} z_{ij}$.
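The Poisson approximation of the binomial distribution in Eq. 3.6 can be checked numerically. The sketch below is illustrative only; the values of v and z_ij are hypothetical, and only the first few outcomes are compared to keep the factorials small.

```python
from math import comb, exp, factorial


def binomial_pmf(n, v, z_ij):
    """Eq. 3.6: probability that n of the v tiles in a row join channel j."""
    return comb(v, n) * z_ij ** n * (1.0 - z_ij) ** (v - n)


def poisson_pmf(n, lam):
    """Poisson pmf with parameter lam."""
    return exp(-lam) * lam ** n / factorial(n)


# Hypothetical values: many tiles per row, small joining probability.
v, z_ij = 200, 0.01
lam_ij = v * z_ij  # Poisson parameter lambda_ij = v * z_ij

# Largest pointwise gap between the two pmfs over n = 0..29.
max_gap = max(
    abs(binomial_pmf(n, v, z_ij) - poisson_pmf(n, lam_ij)) for n in range(30)
)
```

For these values the two pmfs differ by roughly 0.001 at most; the agreement is weaker when v is small or z_ij is large.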
There are N independent variables Yj , j = 1, 2, ..., N , which have Poisson distribution with pa-
rameters λj , j = 1, 2, ..., N . The probability mass function for Yj is therefore
2 Approximating binomial probabilities is one of the important applications of the Poisson
distribution. It can easily be seen that, when λ = np is held constant, where n and p are the
binomial parameters, the binomial pmf converges to a Poisson pmf as n → ∞.
$$P\Big\{Y_j = y \,\Big|\, \sum_{j=1}^{N} Y_j = S/2,\ Y_j \le W\Big\}
= \frac{P\{Y_j = y,\ \sum_{j}^{N} Y_j = S/2,\ Y_j \le W\}}{P\{\sum_{j}^{N} Y_j = S/2,\ Y_j \le W\}} \tag{3.7}$$
Let $Z_j$ denote $Y_j$ restricted to $Y_j \le W$, for all j = 1, 2, ..., N. The probability mass
function of $Z_j$ becomes a truncated Poisson, as $Y_j$ has a Poisson distribution with parameter
$\lambda_j$. Therefore,

$$P\{Z_j = y\} = P\{Y_j = y \mid Y_j \le W\}
= \frac{\lambda_j^{y} / y!}{\sum_{i=0}^{W} \lambda_j^{i} / i!}, \quad y \le W \tag{3.8}$$
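Now, Eq. 3.7 can be further simplified by replacing $Y_j$, subject to $Y_j \le W$, with $Z_j$ and
using the independence of the $Z_j$, as follows,

$$
\begin{aligned}
P\Big\{Y_j = y \,\Big|\, \sum_{j=1}^{N} Y_j = S/2,\ Y_j \le W\Big\}
&= \frac{P\{Z_j = y,\ \sum_{j}^{N} Z_j = S/2\}}{P\{\sum_{j}^{N} Z_j = S/2\}} \\
&= \frac{P\{Z_j = y\}\, P\{\sum_{k \neq j}^{N} Z_k = S/2 - y\}}{P\{\sum_{j}^{N} Z_j = S/2\}}
\end{aligned}
\tag{3.9}
$$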
The pmfs of the sums of these variables are therefore needed. Let $q'$ be the pmf of
$\sum_{j=1}^{N} Z_j$ and $q_j$ be the pmf of $\sum_{k \neq j}^{N} Z_k$. Both $q'$ and $q_j$ can be
readily computed by convolutions as follows,

$$q'(y) = P\Big\{\sum_{j}^{N} Z_j = y\Big\} = Z_1 * Z_2 * \cdots * Z_N \tag{3.10}$$
and also
$$q_j(y) = P\Big\{\sum_{k \neq j}^{N} Z_k = y\Big\} = Z_1 * \cdots * Z_{j-1} * Z_{j+1} * \cdots * Z_N \tag{3.11}$$
where $(*)$ denotes the convolution (see footnote 3) of probability mass functions. Substituting
Eqs. 3.10, 3.11 and 3.8 into Eq. 3.9 yields

$$P\Big\{Y_j = y \,\Big|\, \sum_{j=1}^{N} Y_j = S/2,\ Y_j \le W\Big\}
= \frac{\lambda_j^{y}\, q_j(S/2 - y)}{y!\; q'(S/2) \sum_{i=0}^{W} \frac{\lambda_j^{i}}{i!}} \tag{3.12}$$
Hence, the expected value of $Y_j$, where y = 0, 1, 2, ..., W, can be computed as

$$Y(j) = E\Big[Y_j \,\Big|\, \sum_{j=1}^{N} Y_j = S/2,\ Y_j \le W\Big]
= \sum_{y=0}^{W} \frac{y\, \lambda_j^{y}\, q_j(S/2 - y)}{y!\; q'(S/2) \sum_{i=0}^{W} \frac{\lambda_j^{i}}{i!}} \tag{3.13}$$
Given the total number of bits in the link (S), the equation expresses the number of interconnec-
tions travelling via channel j limited by the number of available wires (W ) in the channel.
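Eq. 3.13 can be evaluated numerically by building the truncated Poisson pmfs of Eq. 3.8 and convolving them as in Eqs. 3.10 and 3.11. The following is a minimal sketch under stated assumptions: the function names are hypothetical, the λj values, W and S/2 are illustrative only, and the direct convolution is written for clarity rather than speed.

```python
from math import factorial


def truncated_poisson_pmf(lam, W):
    """Eq. 3.8: pmf of Z_j on the support 0..W."""
    weights = [lam ** y / factorial(y) for y in range(W + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def convolve(f, g):
    """Discrete convolution of two pmfs given as lists indexed from 0."""
    out = [0.0] * (len(f) + len(g) - 1)
    for a, fa in enumerate(f):
        for b, gb in enumerate(g):
            out[a + b] += fa * gb
    return out


def expected_interconnections(lams, W, half_S):
    """Eq. 3.13 for every channel; lams holds lambda_1..lambda_N."""
    pmfs = [truncated_poisson_pmf(lam, W) for lam in lams]
    q_all = [1.0]  # q' in Eq. 3.10
    for p in pmfs:
        q_all = convolve(q_all, p)
    result = []
    for j, p_j in enumerate(pmfs):
        q_j = [1.0]  # q_j in Eq. 3.11 (all channels except j)
        for k, p in enumerate(pmfs):
            if k != j:
                q_j = convolve(q_j, p)
        numerator = sum(
            y * p_j[y] * q_j[half_S - y]
            for y in range(W + 1)
            if 0 <= half_S - y < len(q_j)
        )
        result.append(numerator / q_all[half_S])
    return result


# Hypothetical example: N = 4 channels, W = 4 tracks each, S/2 = 10.
expected = expected_interconnections([3.0, 2.0, 1.0, 0.5], W=4, half_S=10)
```

The per-channel expectations add up to S/2 and none exceeds W, which is a convenient sanity check on any implementation of Eq. 3.13.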
Approximation of Y (j)
Consider a case where W, the channel width, is arbitrarily large or is adjustable. Then, the
expression for Y can be further simplified. $Y_j$ has a Poisson distribution with mean $\lambda_j$
and, by the additive property of Poisson random variables, $\sum_{j}^{N} Y_j$ has a Poisson
distribution with parameter $\sum_{j}^{N} \lambda_j$; then
3 The convolution of f and g, where f and g are pmfs of independent (not necessarily identically
distributed) variables, is defined as $(f * g)(n) = \sum_{k}^{n} f(k)\, g(n - k)$.
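$$
\begin{aligned}
P\Big\{Y_j = y \,\Big|\, \sum_{j=1}^{N} Y_j = S/2\Big\}
&= \frac{P\{Y_j = y,\ \sum_{k \neq j}^{N} Y_k = S/2 - y\}}{P\{\sum_{j}^{N} Y_j = S/2\}} \\
&= \frac{e^{-\lambda_j} \lambda_j^{y}}{y!} \cdot
   \frac{e^{-\sum_{k \neq j}^{N} \lambda_k} \big(\sum_{k \neq j}^{N} \lambda_k\big)^{S/2 - y}}{(S/2 - y)!} \cdot
   \left[\frac{e^{-\sum_{j}^{N} \lambda_j} \big(\sum_{j}^{N} \lambda_j\big)^{S/2}}{(S/2)!}\right]^{-1} \\
&= \frac{(S/2)!}{y!\,(S/2 - y)!} \cdot
   \frac{\lambda_j^{y} \big(\sum_{k \neq j}^{N} \lambda_k\big)^{S/2 - y}}{\big(\sum_{j}^{N} \lambda_j\big)^{S/2}} \\
&= \binom{S/2}{y} \left(\frac{\lambda_j}{\sum_{j}^{N} \lambda_j}\right)^{y}
   \left(\frac{\sum_{k \neq j}^{N} \lambda_k}{\sum_{j}^{N} \lambda_j}\right)^{S/2 - y}
\end{aligned}
\tag{3.14}
$$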
In other words, the conditional distribution of $Y_j$, given that $\sum_{j}^{N} Y_j = S/2$, is a
binomial distribution with parameters n = S/2 and $p = \lambda_j / \sum_{j}^{N} \lambda_j$. Hence,

$$Y(j) \approx E\Big[Y_j \,\Big|\, \sum_{j=1}^{N} Y_j = S/2\Big] = \frac{S \lambda_j}{2 \sum_{j}^{N} \lambda_j} \tag{3.15}$$
It is interesting to interpret the distribution of interconnections in routing channels as a binomial
distribution, assuming that the channel width is arbitrarily large. It gives the probability as shown
in Eq. 3.14 of getting exactly y successes in S/2 trials in generating interconnection across the
two channels.
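The binomial form of Eq. 3.14 and the closed-form mean of Eq. 3.15 can be cross-checked with a few lines of code. This is an illustrative sketch; the function names and the λj values are hypothetical.

```python
from math import comb


def conditional_pmf(y, lam_j, lam_total, half_S):
    """Eq. 3.14: binomial pmf with n = S/2 and p = lam_j / lam_total."""
    p = lam_j / lam_total
    return comb(half_S, y) * p ** y * (1.0 - p) ** (half_S - y)


def approx_expected_interconnections(lams, S):
    """Eq. 3.15: Y(j) ~ S * lambda_j / (2 * sum of all lambda)."""
    total = sum(lams)
    return [S * lam / (2.0 * total) for lam in lams]


# Hypothetical rates for N = 4 channels and a link of S = 20 bits.
lams = [3.0, 2.0, 1.0, 0.5]
S = 20
approx = approx_expected_interconnections(lams, S)

# Mean of the binomial pmf for the first channel, computed by summation.
mean_first = sum(
    y * conditional_pmf(y, lams[0], sum(lams), S // 2) for y in range(S // 2 + 1)
)
```

Because the channel width is not bounded in this approximation, an individual Y(j) may exceed the number of tracks that a real channel provides.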
Derivation of X(j)
The length of an interconnection at channel j in the communication link, $X_j$, can be considered
as a sum of three segments, as illustrated in Fig. 3.8. For net (a), the length is the sum of the
length from the tile in the left area to M1 (Λ), the distance between M1 and M2 (L), and the length
from M2 to the tile in the right area (Λ′). Similarly, for net (b), the length is the sum of three
segments. Thus, $X_j = L + \Lambda_j + \Lambda'_j$ and

$$X(j) = E[X_j] = L + 2E[\Lambda_j] \tag{3.16}$$
where E[Λj ] = E[Λ′j ].
Figure 3.8: Illustration of parameters for computing X(j). (a) Direct interconnection. (b)
Fringed interconnection.
The pmf $z_{ij}$ defined earlier gives the probability that an interconnection from any tile at row
i connects to a track at channel j. By assuming a uniform distribution of the placement, a pmf
$\zeta_{i,j}$ for an interconnection at channel j being connected from any given tile at row i can
be obtained as follows,

$$\zeta_{i,j} = \frac{z_{ij}}{v \sum_{i}^{u} z_{ij}} \tag{3.17}$$
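A short numerical check shows that Eq. 3.17 defines a proper pmf over the u × v tiles of a block. The sketch below is illustrative; the function name and the z_ij values for the fixed channel j are hypothetical.

```python
def zeta_column(z_col, v):
    """Eq. 3.17 for a fixed channel j; z_col[i - 1] holds z_ij for row i."""
    norm = v * sum(z_col)
    return [z_ij / norm for z_ij in z_col]


# Hypothetical z_ij values for u = 4 rows, and v = 5 tiles per row.
v = 5
zeta_col = zeta_column([0.01, 0.02, 0.04, 0.02], v)

# Each row contributes v tiles with the same probability zeta_ij.
total_probability = v * sum(zeta_col)
```

Summed over all tiles of the block, the probabilities add up to one, as required for the expectations in the following equations.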
Thus, all possible interconnection lengths can be readily explored by analyzing the Manhattan
distance shown in Fig. 3.9. $E[\Lambda_j]$ is defined for two different cases, j ≤ u/2 and j > u/2,
for direct and fringed interconnections, denoted by $E[\Lambda^{j}_{j \le u/2}]$ and
$E[\Lambda^{j}_{j > u/2}]$ respectively. To compute $E[\Lambda^{j}_{j \le u/2}]$, the constraint area
is clearly defined (in Fig. 3.9). The expected value is the sum over all possibilities in these
three panels.
Figure 3.9: An example of computing the Manhattan distance within a modular block. Each
number in the block represents the Manhattan distance to connect the particular tile to the point at
M1.
Hence,
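$$E[\Lambda^{j}_{j \le u/2}] = \sum_{l=0}^{u/2+j-1} \sum_{k=1}^{v} (k + l)\, \zeta_{u/2+j-l,\,j}
+ \sum_{k=1}^{v} k\, \zeta_{u/2+j,\,j}
+ \sum_{l=1}^{u/2-j} \sum_{k=1}^{v} (k + l)\, \zeta_{u/2+j+l,\,j} \tag{3.18}$$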
For the case of $E[\Lambda^{j}_{j > u/2}]$,

$$E[\Lambda^{j}_{j > u/2}] = \sum_{k=1}^{v} \sum_{l=1}^{u} (k + l + j - u/2 - 2)\, \zeta_{u-l+1,\,j} \tag{3.19}$$
Figure 3.10: Plots of number of interconnections, Y(j), and average interconnection length, X(j).
[Upper panel: expected number against channel; lower panel: expected length against channel;
series: Approximation, W=24, W=18; annotation: fringed interconnections.]
3.3.2 Summary
The expected number of interconnections that go through channel j is Y(j) and the expected
length of an interconnection at channel j is X(j). Fig. 3.10 shows the plot of Y(j) (upper graph)
and X(j) (lower graph) as a function of j. The channel is indexed by its distance from the center of
the placement area. Three cases of Y(j) are shown. The approximation case assumes that the number
of long wires per channel W is very large, so that the number of interconnections at each
channel can be approximated by the simple expression in Eq. 3.15. For the finite channel capacities
W = 24 and W = 18, the number of interconnections in the channel is given by
Eq. 3.13. Clearly, when the number of available tracks in a channel is smaller (W = 18), more
channels are required to fit all interconnections, as depicted in the figure. Interconnection lengths
can be modelled by Eq. 3.16. Interconnections in a channel that is further away from the placement
area are longer. Specifically, for the direct interconnection, the length is modelled by
Eq. 3.18 and for the fringed interconnection, it is modelled by Eq. 3.19. Based on X(j) and Y(j),
the overall average and variance of interconnection lengths can be derived, from which the delay
can be computed.
3.3.3 Variance of the Interconnections Length
It is also important to compute the variance of interconnection lengths VAR(ξ). For the average
interconnection length R, its variance can be computed as

$$\mathrm{VAR}(R) = \mathrm{VAR}\Big(\sum_{l=1}^{S} \xi_l\Big) \Big/ S^2 \tag{3.20}$$
$$= \mathrm{VAR}(\xi) / S \tag{3.21}$$
Eq. 3.21 gives the relationships between R and ξ: the variance of ξ is S times larger than VAR(R),
the variance of the average length. From Eq. 3.2, the variance can be computed as follows,
$$\mathrm{VAR}(\xi) = \sum_{j=1}^{N} \mathrm{VAR}\Big(2 \sum_{i=1}^{Y_j} X_{i,j} \,\Big|\, \sum_{j=1}^{N} Y_j = S/2,\ Y_j \le W\Big) \Big/ S \tag{3.22}$$
By using the identity
$\mathrm{VAR}\big(\sum_{i=1}^{Y} X_i\big) = E[Y]\,\mathrm{VAR}(X) + (E[X])^2\,\mathrm{VAR}(Y)$
(see footnote 4), where Y is assumed to be a random variable that is independent of the $X_i$'s and
$X_i$ is independent of $X_j$ for i ≠ j, the variance becomes:
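$$
\begin{aligned}
\mathrm{VAR}(\xi) &= \frac{4}{S} \sum_{j=1}^{N} \Big\{
E\Big[Y_j \,\Big|\, \sum_{j=1}^{N} Y_j = S/2,\ Y_j \le W\Big]\, \mathrm{VAR}(X_j)
+ (E[X_j])^2\, \mathrm{VAR}\Big(Y_j \,\Big|\, \sum_{j=1}^{N} Y_j = S/2,\ Y_j \le W\Big) \Big\} \\
&= \frac{4}{S} \sum_{j=1}^{N} \Big\{
Y(j)\, \mathrm{VAR}(X_j)
+ (X(j))^2\, \mathrm{VAR}\Big(Y_j \,\Big|\, \sum_{j=1}^{N} Y_j = S/2,\ Y_j \le W\Big) \Big\}
\end{aligned}
\tag{3.23}
$$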
4 See proof in [Ros02].
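To compute $\mathrm{VAR}\big(Y_j \mid \sum_{j}^{N} Y_j = S/2,\ Y_j \le W\big)$, the pmf from Eq. 3.12
can be used. For $\mathrm{VAR}(X_j)$, recall that $E[X_j] = L + 2E[\Lambda]$. Thus,

$$\mathrm{VAR}(X_j) = 4\,\mathrm{VAR}(\Lambda) \tag{3.24}$$

By using the identity $\mathrm{VAR}(\Lambda) = E[\Lambda^2] - (E[\Lambda])^2$ and by modifying
Eqs. 3.18-3.19, $\mathrm{VAR}(X_j)$ can be obtained.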
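The random-sum variance identity used to obtain Eq. 3.23 can be verified on a small example by enumerating every outcome. The sketch below is illustrative only; the function names and the two toy pmfs are hypothetical and unrelated to the measured data.

```python
from itertools import product


def mean_var(pmf):
    """Mean and variance of a pmf given as {value: probability}."""
    mean = sum(x * p for x, p in pmf.items())
    var = sum((x - mean) ** 2 * p for x, p in pmf.items())
    return mean, var


def random_sum_variance_exact(y_pmf, x_pmf):
    """Variance of sum_{i=1}^{Y} X_i by brute-force enumeration."""
    dist = {}
    for y, p_y in y_pmf.items():
        for xs in product(x_pmf, repeat=y):  # every sequence of y draws
            p = p_y
            for x in xs:
                p *= x_pmf[x]
            dist[sum(xs)] = dist.get(sum(xs), 0.0) + p
    return mean_var(dist)[1]


def random_sum_variance_identity(y_pmf, x_pmf):
    """E[Y] VAR(X) + (E[X])^2 VAR(Y), with Y independent of the X_i."""
    e_y, var_y = mean_var(y_pmf)
    e_x, var_x = mean_var(x_pmf)
    return e_y * var_x + e_x ** 2 * var_y


# Toy pmfs: Y is the number of nets, X the length of a single net.
y_pmf = {0: 0.2, 1: 0.5, 2: 0.3}
x_pmf = {1: 0.4, 3: 0.6}
exact = random_sum_variance_exact(y_pmf, x_pmf)
identity = random_sum_variance_identity(y_pmf, x_pmf)
```

Both routes give the same variance, which relies on Y being independent of the X_i, as assumed in the derivation.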
3.4 Results for the Stochastic Modelling
3.4.1 Interconnection Lengths Prediction
In this section, the theoretical and experimental results for interconnection length prediction
are compared. There is no readily available tool to extract the interconnection length of a net
after the placement and routing of a design. However, Xilinx ISE [Xil06a] provides a tool, namely
ncd2xdl, that can convert the ncd file, which stores the placement and routing information, into a
human-readable xdl format. Interconnects and switches for each net are described in the xdl file.
A program was developed to parse the xdl file and extract the types of interconnect used in
realizing the design. For example, in the xdl file, “LH24” represents a horizontal wire that spans
24 tiles and “W6BEG0” represents a wire that spans 6 tiles. Similarly, all the different types of
wires that realize the interconnections can be extracted and the length of the interconnections can
be obtained. A more detailed document about the naming and specifications of the interconnects can
be found in [Ste02].
There are several approaches to implementing a link. One conventional approach is to implement
two FIFOs (first-in-first-out blocks) as the master and slave blocks, each with a data width
that matches the link bandwidth. In the placement of the design, placement constraints such as the
width of the FIFO area and the distance between the two FIFOs can be specified. In our experiment,
the Xilinx ISE 7.0 design tools are used and the placement constraints are specified in a “ucf”
file. A Xilinx Virtex-4 (XC4VLX200) FPGA, which is large enough to demonstrate the effect of
fringing, was used.
Figure 3.12: Experimental and theoretical results for interconnection lengths of a 32-bit link.
[Plot: probability against length (tile); series: Theoretical, Experimental.]
Figure 3.13: Experimental and theoretical results for interconnection lengths of a 64-bit link.
[Plot: probability against length (tile); series: Theoretical, Experimental.]
Figure 3.14: Experimental and theoretical results for interconnection lengths of a 128-bit link.
[Plot: probability against length (tile); series: Theoretical, Experimental.]
Figure 3.15: Experimental and theoretical results for interconnection lengths of a 256-bit link.
[Plot: probability against length (tile); series: Theoretical, Experimental.]
In order to compare the experimental and theoretical results, the interconnection lengths are
assumed to have a Gaussian distribution. The expected length and variance from the theoretical
results are taken as the mean and variance of this Gaussian distribution. The resulting
distribution model of the interconnection lengths of a link is compared with the
experimental results. Also, the parameter λD is set to 2.0, which is the parameter of a Poisson
distribution for the distance that an interconnection travels before connecting to the long wire.
This value was obtained from the experimental results shown in Fig. 3.7.
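A Gaussian model built from a predicted mean and standard deviation can be evaluated with the error function. The sketch below is illustrative; the function name is hypothetical, and the example values are borrowed from the 32-bit theoretical entry of Table 3.2, reading σ′ as a standard deviation in tiles, which is an assumption.

```python
from math import erf, sqrt


def gaussian_cdf(x, mean, std):
    """P{length <= x} for a Gaussian with the given mean and std."""
    return 0.5 * (1.0 + erf((x - mean) / (std * sqrt(2.0))))


# Assumed example values: predicted mean 53.6 tiles, predicted std 5.6 tiles.
mean_len, std_len = 53.6, 5.6
p_below_mean = gaussian_cdf(mean_len, mean_len, std_len)
p_below_3sigma = gaussian_cdf(mean_len + 3.0 * std_len, mean_len, std_len)
```

Under this model half of the interconnections are shorter than the predicted mean and almost all are shorter than the mean plus three standard deviations.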
Fig. 3.12–3.15 compare the theoretical and experimental results for links with
32, 64, 128 and 256 bits. The results show that the interconnection lengths can
be well approximated by a Gaussian distribution and that the theoretical model fits the
experimental results closely. Specifically, the model provides a better fit for links with
a larger number of bits, i.e., the 256-bit link is predicted better than the 128-bit and 32-bit
links. The predicted length is slightly shorter than the experimental length; for the 32-bit and
128-bit links in particular, the actual length is longer than the predicted one. A
more comprehensive comparison between the experimental and theoretical results is shown in
Table 3.2. The theoretical mean is slightly less than the experimental mean in
many examples. This is also the case for the variance. Nevertheless, the overall prediction is
accurate, with an average relative error of 6.3% for the mean and 24% for the variance.
Table 3.2: Comparison of experimental and theoretical interconnection lengths of high-bandwidth
communication links

Bit (S)  Length (L)  Width (u)  µ (Exp)  µ′ (Theo)  |µ′ − µ|/µ  σ (Exp)  σ′ (Theo)  |σ′ − σ|/σ
32       50          8          55.6     53.6       0.036       9.8      5.6        0.43
64       50          8          55.6     56.4       0.014       9.8      7.4        0.24
64       50          20         56       52.0       0.071       10.9     6.3        0.42
128      50          8          66.9     61.1       0.087       9.3      8.3        0.11
128      50          10         65.3     58.4       0.11        9.18     7.3        0.21
128      50          20         61.0     55.7       0.087       7.3      6.8        0.068
256      50          10         68.7     66.4       0.033       11.3     11.4       8.8e-3
256      50          20         62.9     58.8       0.065       12.0     6.4        0.46
Average                                             6.28%                           24%
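The relative-error column for the mean in Table 3.2 can be recomputed from the µ and µ′ columns. The sketch below simply transcribes those two columns; the variable names are hypothetical.

```python
# (experimental mean, theoretical mean) pairs transcribed from Table 3.2.
rows = [
    (55.6, 53.6),
    (55.6, 56.4),
    (56.0, 52.0),
    (66.9, 61.1),
    (65.3, 58.4),
    (61.0, 55.7),
    (68.7, 66.4),
    (62.9, 58.8),
]

# Relative error |mu' - mu| / mu for each link configuration.
errors = [abs(theo - exp) / exp for exp, theo in rows]
mean_error = sum(errors) / len(errors)
```

The unrounded average is close to the tabulated 6.28%; the small difference comes from averaging the rounded per-row entries in the table.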
In general, the theoretical model gives accurate interconnection length predictions, with an
average error of less than 7%. However, the predicted interconnection length is
shorter than the experimental length in most of the cases. Thus, the prediction is relatively
optimistic. The main reason is the simplifying assumption about local routing within the modular
block, as interconnections may traverse extra distance rather than the shortest path before
connecting to long wires. Several examples were found in which routing within the modular block
(local routing) does not follow the shortest path. This can be attributed to the limitations
of the routing architecture and the limited number of pins in the switches. Incorporating a more
detailed routing model, especially for local routing, such as the model proposed in [BRV93], may
improve the prediction accuracy. This is regarded as future work.
3.4.2 Interconnection Delay Prediction
Interconnection length and delay are highly correlated. Because the long wires and switch points
are buffered, the delay has a linear relationship with the line length [MSCL07]. An effective
approach to estimating delay is therefore to predict the average interconnection length and then
determine the interconnection delay using a linear model. The delay is
given by the linear model $\alpha \bar{R} + \beta$, where α and β can be measured for any specific
FPGA. The offset β corresponds to the buffer delay of the line.
The experimental results give α = 0.017 and β = 0.922. With the linearity assumption, the
length and delay can be related through a Gaussian distribution as

$$\mu_\Delta = \alpha E[\xi] + \beta \tag{3.25}$$
$$\sigma_\Delta = \alpha^2\, \mathrm{VAR}(\xi) \tag{3.26}$$

where $\mu_\Delta$ and $\sigma_\Delta$ are the theoretical mean and variance of the delay.
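The linear length-delay model of Eqs. 3.25 and 3.26 is straightforward to evaluate. The sketch below uses the fitted α and β quoted above; the function names are hypothetical, and the example inputs assume that the σ columns of Tables 3.2 and 3.3 are standard deviations.

```python
ALPHA = 0.017  # fitted slope from the text
BETA = 0.922   # fitted offset from the text (buffer delay of the line)


def delay_mean(mean_length):
    """Eq. 3.25: mean delay from the expected interconnection length."""
    return ALPHA * mean_length + BETA


def delay_variance(var_length):
    """Eq. 3.26: delay variance from the length variance."""
    return ALPHA ** 2 * var_length


# Assumed example: predicted length 53.6 tiles with standard deviation 5.6.
mu_delay = delay_mean(53.6)
std_delay = delay_variance(5.6 ** 2) ** 0.5
```

With these inputs the model gives a mean delay of about 1.83 and a standard deviation of about 0.095, in line with the theoretical entries for the 32-bit link in Table 3.3.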
There are more complicated models for predicting the delay based on interconnection length. For
example, the resistance and capacitance of wires can be captured in the delay equations. Also,
interconnections in FPGAs consist of multiple segments of short wires, which are buffered
and interconnected by switches. In the multi-stage delay model [MDS+08a], the delay of a long
line is the sum of the delays of each short buffered segment. These models can be applied to further
enhance the accuracy of the delay prediction and to cover different specifications. Nevertheless,
the simple linear length-delay model provides a basic example for predicting the link delay.
Table 3.3: Comparison of experimental and theoretical interconnection delays of high-bandwidth
communication links

Bit (S)  Length (L)  Width (u)  µ (Exp)  µ′ (Theo)  |µ′ − µ|/µ (×10⁻²)  σ (Exp)  σ′ (Theo)  |σ′ − σ|/σ (×10⁻²)
32       50          8          1.76     1.83       3.90                0.14     0.095      32
64       50          8          2.02     1.88       6.90                0.17     0.13       24
64       50          20         1.87     1.8        3.70                0.15     0.11       27
128      50          8          1.95     1.96       0.51                0.145    0.14       3.4
128      50          10         1.94     1.98       2.0                 0.17     0.12       30
128      50          20         1.93     1.86       3.60                0.15     0.12       20
256      50          10         2.1      2.2        4.80                0.19     0.24       26
256      50          20         2.02     1.9        5.90                0.17     0.16       5.9
Average                                             3.9%                                    21%
The comparisons between theoretical and experimental results are detailed in Table 3.3. In general,
the analytical model provides a good approximation of the mean and variance of the delays. The
prediction error is 3.9% for the mean delay and 21% for the variance. Note that the prediction
error for delay is actually smaller than the prediction error for lengths, even though the delay is
computed from the length prediction. This could be attributed to the simple linear model for
the length-delay relationship, as there are two sources of systematic error that cancel each other
out: the length prediction is optimistic, and the delay estimation is pessimistic. From Fig. 3.15,
it can be seen that there is a small variation of delay for a fixed length. Thus, if the
interconnect delays are smaller than those given by the linear prediction model, the optimistic
length prediction can potentially increase the delay prediction accuracy. This also suggests that a
more complicated length-delay model can potentially improve the accuracy. A model that includes the
delay from switch points and differentiates the delays of local and global interconnects can also
potentially enhance the prediction accuracy. This is also regarded as future work.
3.4.3 Bandwidth Degradation
Consider a link based on k parallel lines, in which the bit rate of every line is set by the
reciprocal of the delay of the slowest line. The delay of the lines is denoted by $D_k$ and thus the
bandwidth of the link with k lines becomes $k / \max_{\forall k}\{D_k\}$. Because of the fringed
interconnects, degradation in bandwidth can be expected. The bandwidth degradation can be defined
as the ratio of the degraded bandwidth to the ideal bandwidth:

$$\text{Bandwidth Degradation} = \frac{\text{Bandwidth of } k \text{ lines}}{k \times \text{Bandwidth of one line}} \tag{3.27}$$
$$= \frac{k / \max_{\forall k}\{D_k\}}{k / D_1} \tag{3.28}$$
$$= \frac{D_1}{\max_{\forall k}\{D_k\}} \tag{3.29}$$

Figure 3.15: Linear relationship between interconnections delay and length.
[Plot: delay (ns) against length (tile), with a linear fit.]
The expression is based on the assumption that the links span the same distance across the chip
and that the aspect ratios of the module placements are the same. Although only the mean
and variance of the lengths and delays are obtained from the analysis in the previous section, the
bandwidth degradation can still be approximated. For a Gaussian distribution, 99.7% of the values
lie within three standard deviations of the mean, so almost all line delays are below
$\mu_{D_k} + 3\sigma_{D_k}$, where µ is the mean and σ is the standard deviation.
Figure 3.17: Bandwidth comparison between the ideal and degraded link with length L=20 tiles.
[Plot: bandwidth (Mbps) against bit-width; series: Ideal, Degraded.]
Figure 3.18: Bandwidth comparison between the ideal and degraded link with length L=30 tiles.
[Plot: bandwidth (Mbps) against bit-width; series: Ideal, Degraded.]
Figure 3.19: Bandwidth comparison between the ideal and degraded link with length L=50 tiles.
[Plot: bandwidth (Mbps) against bit-width.]
Figure 3.20: Bandwidth comparison between the ideal and degraded link with length L=80 tiles.
[Plot: bandwidth (Mbps) against bit-width; series: Ideal, Degraded.]
Therefore, the bandwidth degradation can be approximated as follows,

$$\text{Bandwidth Degradation} \approx \frac{D_1}{\mu_{D_k} + 3\sigma_{D_k}} \tag{3.30}$$
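Eqs. 3.29 and 3.30 can be compared on a small example. The sketch below is illustrative only; the function names and the delay values are hypothetical and are not measurements from this work.

```python
def degradation_exact(d1, delays):
    """Eq. 3.29: D_1 / max_k D_k for a list of per-line delays."""
    return d1 / max(delays)


def degradation_approx(d1, mean_delay, std_delay):
    """Eq. 3.30: D_1 / (mu + 3 sigma), using only the delay statistics."""
    return d1 / (mean_delay + 3.0 * std_delay)


# Hypothetical delays (same unit throughout): one ideal line and three
# lines of a fringed link.
d1 = 1.0
ratio_exact = degradation_exact(d1, [1.0, 1.1, 1.25])
ratio_approx = degradation_approx(d1, mean_delay=1.1, std_delay=0.05)
```

Both expressions return the fraction of the ideal bandwidth that the link retains, so a value of 0.8 means that the degraded link delivers 80% of the ideal bandwidth.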
Fig. 3.17–3.20 compare the degraded and ideal bandwidths for different bit-widths. The
ideal bandwidth assumes no bandwidth degradation and therefore increases
linearly with bit-width. Because of interconnect fringing, the bandwidth degrades with increasing
bit-width. For the case of 256 bits, the ideal bandwidth is 244 Mbits per second and the degraded
bandwidth is 199 Mbits per second, so the ideal bandwidth is 22.6% higher than the degraded one. It
can also be observed that for longer links, the degradation in bandwidth is smaller. This is because
the impact of interconnect fringing on shorter links is relatively larger than on long links.
3.5 Model Assumptions and Future Work
Prediction of interconnection length is a challenging problem. It is further complicated by a
programmable interconnection architecture, such as that of FPGAs. This chapter proposes a
stochastic model to predict the lengths and delays of a communication link in FPGAs. Several
simplifications of the interconnection model and assumptions were made in the derivations. The
assumptions used to establish the interconnection length prediction model are listed as follows,
• The interconnection model in this chapter only considers a communication link between
two modules or regions.
• It is assumed that only long lines are used for the global routings and local routings are
constructed using short interconnects.
• The connected modules or regions are separated by a significant distance, which is charac-
terized by L, on the chip.
• It is assumed that there is no congestion and that no other circuits consume or compete for the
long lines in the channels.
The stochastic model can be further enhanced by relaxing the assumptions and considering more
complex interconnection models. For example, an interconnection model for communication links
with multiple sources or destinations can be studied, but this requires a more complex model of the
routing of wires in different directions or dimensions. Nevertheless, the stochastic model in this
chapter can be extended to capture the utilization of long lines in a multi-modular scenario.
Assuming that only long lines are used for the global routing is reasonable, as it has been
shown in Fig. 3.4 that the majority of the global routing is constructed using long lines. A more
comprehensive model can be developed to consider detailed routing using both short and long
interconnects. Congestion and routing within modules or processors can also be considered
to improve the accuracy of the model.
Only two dimensional routings are considered in this analysis. In physical realization, intercon-
nections are realized using multi-levels of wires and metal vias. The interconnection length and
delay performance would be greatly affected by the actual physical planning and implementation
of metal wires. Therefore, both the accuracy and reliability of the interconnect predictions can be
greatly improved by considering the physical parameters and characteristics.
3.6 Conclusion
This chapter presents a new stochastic model to predict interconnection lengths of communication
links in FPGAs. Based on a stochastic inter-module routing model, the expected length and variance
of interconnections have been rigorously derived and, thus, delay can be computed based on the
length estimate. The theoretical results are compared with experimental results of lengths and
delays, which are obtained from implementations of link circuits in an FPGA. The stochastic model
provides an accurate prediction of length with an average error of 6.3%. Results also show that
the proposed model produces reliable predictions of delay and therefore the methodology can
be applied to early stage planning and design optimization for communication links. Moreover,
this chapter presents an interesting phenomenon which is termed “interconnection fringing”. The
fringing effect is attributed to the competition for routing resources in a communication link; it
lengthens interconnections and, therefore, increases the delay.
Most interestingly, the results demonstrate that a long parallel link implemented in an FPGA is not
efficient, especially with a large bit-width. Because of the constrained routing architecture,
interconnections are inevitably fringed, which reduces the bandwidth of the link. The degradation
in bandwidth can be as high as 22%. This suggests that using parallel links to implement
high-bandwidth links in FPGAs is inefficient and alternative approaches should be considered in
order to achieve a more bandwidth-efficient implementation.
Chapter 4
Wave-Pipelined Intra-Chip Signalling in FPGAs
4.1 Introduction
The interconnect challenge is exacerbated in FPGAs by their reconfigurable interconnect
architecture. Interconnections in FPGAs are generally constructed from segments of interconnect
fabric, which are slower and dissipate more energy than ASIC custom designs (see Appendix B for a
study of the power dissipation of long interconnections in FPGAs). This is especially the case for
global interconnections, which span a large physical distance across the chip.
Although more components and silicon can be fitted into a single chip, across-chip communication
would introduce a significant performance hindrance and alternative signaling techniques may
be needed to mitigate the interconnect challenge.
A long interconnection in an FPGA is a composite of multiple segments of programmable
interconnects and switches. Despite the irregular and idiosyncratic nature of FPGA long
interconnections, buffers are embedded at the switches to speed up signal propagation in modern
FPGA architectures, such as Xilinx Virtex-IV [Xil05a] and Altera Stratix-II [Alt05]. With the help
of these buffers, the long interconnections can be modeled as multiple stages of RC transmission
line, which facilitates the realization of wave-pipelined signaling [MDS+08a]. In wave-pipelined
signaling, multiple bits are allowed to traverse the line simultaneously, so a significant
throughput improvement can be obtained. For many complex applications, bandwidth is the main
concern for communication between modules or processors. A conventional interconnect has its
bandwidth dictated by the RC time constant (or characteristic impedance) of the wire, which limits
the data throughput. Using wave-pipelined signaling, as will be shown later in this chapter,
the data throughput can exceed the limit imposed by the RC time constant.
In this chapter, a pulsed wave signaling (or wave-pipelining) design strategy is presented to miti-
gate the interconnect challenge in FPGAs. The contributions of this chapter are:
(1) Propose a new interconnection model for global routing in FPGAs. The new model gener-
alizes the irregular interconnection circuits as multiple buffered interconnect stages which can be
applied in FPGA architectural evaluation and prediction for interconnect performance for different
technology processes (see Section 4.3).
(2) Derive the fundamental delays and throughput for FPGA interconnects. The expressions can be
applied to analyze delay, throughput and power consumptions for different signalling strategies,
such as delay-based signalling, wave signalling, register pipelined link. This analytical model
provides the basis for studying performance benefits and shortcomings for different signalling
approaches in FPGAs (see Section 4.4).
(3) Propose two circuit designs to implement analogue wave signalling in FPGAs. These two ap-
proaches improve the signalling throughput using phase-adaptation and oversampling techniques.
The new methods are evaluated through actual implementations on a Xilinx FPGA device. Trade-
offs between power, speed and area are studied and comparisons with conventional synchronous
and asynchronous pipelining techniques are also investigated (see Section 4.5).
4.2 Related Work
Wave-pipelined logic circuits were originally proposed in 1969 [Cot69] and new techniques and
applications have been continuously developed throughout the years [BCKL98]. The original
idea was to allow combinational logic to process new data before the previous data reached the
registers, so that less of the combinational logic sits idle. Wave-pipelining thus implies the
simultaneous existence of multiple data bits in a combinational logic data path. A realization of
wave-pipelining on FPGAs has been reported by Boemo et al. in [BSJ96], where a wave-pipelined
multiplier circuit was realized by constructing a well-structured network topology of LUTs. Various
techniques for implementing wave-pipelined circuits on FPGAs can be found in [LV05]. However, due
to the variability of logic delay and the design complexity, wave-pipelining is difficult to put
into practice.
On the other hand, interconnects with repeaters are fairly regular circuits. Recently, the focus of
wave-pipelining has shifted from the logic to the interconnection circuits and a number of intercon-
nect wave-pipelining designs for ASIC have been proposed [VD05, JLD07, LKK+05, DPL+07]
in order to achieve a higher throughput of interconnections. On the wave-pipelined interconnect,
a new data bit is sent before the previous data bit reaches the sink, thus the maximum bit rate is
not limited by the total wire delay. Instead the minimum data pulsewidth that can be sustained
on the wave-pipelined interconnects, which is smaller than the interconnect latency, determines
the maximum interconnect throughput. As a result, there can be a significant enhancement in the
interconnect throughput through simultaneous presence of multiple bits on the interconnect.
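The difference between the two signalling styles described above can be illustrated with simple arithmetic. The sketch below is an idealization, not a circuit model: it assumes that a conventional line carries one bit per end-to-end latency and that a wave-pipelined line carries one bit per minimum sustainable pulse width; the function names and the timing values are hypothetical.

```python
def conventional_bit_rate(latency_ns):
    """One bit per end-to-end wire delay, in Gbit/s for ns inputs."""
    return 1.0 / latency_ns


def wave_pipelined_bit_rate(min_pulse_width_ns):
    """One bit per minimum sustainable pulse width, in Gbit/s."""
    return 1.0 / min_pulse_width_ns


def bits_in_flight(latency_ns, min_pulse_width_ns):
    """Number of bits simultaneously present on the line."""
    return latency_ns / min_pulse_width_ns


# Hypothetical timing: 2.0 ns wire latency, 0.5 ns minimum pulse width.
rate_conventional = conventional_bit_rate(2.0)
rate_wave = wave_pipelined_bit_rate(0.5)
in_flight = bits_in_flight(2.0, 0.5)
```

In this idealized case four bits travel along the line at once and the throughput is four times that of the conventional line, while the latency of each bit is unchanged.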
While these new signaling techniques are promising to improve the interconnect throughput, it is
more challenging to implement analogue wave signaling in FPGA than in its ASIC counterparts.
Since the FPGA architecture is a pre-fabricated digital platform with limited flexibility, it prohibits
a straightforward realization of analogue circuits or modifications of the physical architecture.
Also, specially designed interconnects and analogue-digital interfacing circuits are required for
realizing the wave-pipelined links.
The work presented in this chapter aims to exploit the interconnect throughput in FPGAs and to
develop reliable high throughput on-chip communication links. To our knowledge, this work is
the first to propose using wave-pipelined signaling for on-FPGA communication and present new
design methodologies for its realisation in FPGAs.
4.3 Global Interconnect Model for FPGAs
In order to exploit the interconnect throughput in FPGAs, it is important to develop a detailed
model to characterize the electrical properties and predict the throughput performance. Previous
FPGA-based interconnect models, such as in [KBV93] and [BRM99], focus on delay estimation and
architectural exploration. These models employ a simplified electrical characterisation and
cannot be extended to analyse the wave-pipelined throughput. In this section, a new FPGA-specific
global interconnect model is presented. This model generalises FPGA interconnections
into multiple stages of buffered line and provides an analytical solution to study wave-pipelined
signalling.

Table 4.1: Notations used in this chapter
$R_{di}$     Input resistance of the driver at the i-th segment
$C_{di}$     Load and intrinsic capacitances of the driver at the i-th segment
$R_{si}$     Thevenin equivalent resistance of the i-th segment
$C_{si}$     Thevenin equivalent capacitance of the i-th segment
$k_i$        The approximation coefficient for the i-th segment
$\sigma_i$   The time constant for the i-th segment
$v_i$        The voltage of the i-th segment
$t$          Time
$\gamma_i$   The discount factor for the i-th segment
$T_n$        Propagation delay for the n-segment interconnect
$\Gamma_n$   Throughput for the n-segment interconnect
4.3.1 Multiple-Stage Model
Consider a typical island-style FPGA architecture [BRM99, CCP06], which comprises a 2D array
of logic blocks or slices that can be interconnected via programmable routing. A global
interconnection comprises a combination of programmable interconnects that physically spans
a long distance from the source to the sink. The interconnection in FPGAs is much more
complicated than a conventional long line found in an ASIC design. Fig. 4.1(a) shows a typical example
of an FPGA interconnection. It starts from a source, which is the output of a logic block, and is
connected to a short wire segment via an interconnect switch. There are a number of ways to realize
the interconnect switch. The simplest way is to use a pass transistor [LLTY04, BRM99]. Other
approaches, such as stages of multiplexers [LLM06] and transmission gates, can achieve the
same functionality with higher speed. Although a pass transistor is shown in the figure, it can be
replaced by other logic, such as transmission gates, and the closed-form analytical solution remains the
same with different resistance and capacitance values. Most modern FPGAs have wire segments
of different lengths that can be used to implement interconnections with different requirements.
For example, in Xilinx Virtex-4 series FPGAs, there are four different types of wire segments,
with lengths of 1, 3, 6 (Hex-line) and 24 units. Through programmable interconnect
switches, wire segments can be connected to form a long range route. Since constructing a long
range interconnection by aggregating multiple short wire segments increases energy dissipation,
long wire segments are frequently used to realise global interconnection. Typically these
long wire segments span a long distance (for example 24 tiles in a Xilinx Virtex-4 [Xil05a]).
In addition, different interconnect routing architectures have been proposed. Examples are the
tapped branching lines from long wires, known as early-turns [LLM06], which
provide flexible routing to other channels. Also, various short-cuts for the interconnect switches
are proposed in [LLTY04], such as buffers inserted into long wires to reduce delay.
Figure 4.1: A model of a typical global interconnection in FPGAs. (a) Schematic of an
interconnection comprising short and long wires connected through switching
points. (b) Circuit model of the corresponding interconnection. (c) Switch-level RC circuit for
the interconnection as a chain of segments driven by drivers. (d) Parasitic capacitance model for
the interconnects. The capacitance, $C^s$, is the sum of the capacitance to ground planes and the
coupling capacitances.
The interconnection depicted in Fig. 4.1(a) can be generalized as a network of resistance and
capacitance (RC) pairs, with buffers in between. This is depicted in Fig. 4.1(b). A simple lumped
model can be used to model the short interconnect segment. More complicated models, such
as the π and T models [Bak90], can provide a better approximation. The
switching logic and routing branches can be modelled by RC circuits as in [LWH+05], and
equivalent transistor parasitics can be found by characterizing circuits. To further generalise the
model, the whole interconnection can be divided into n stages. Each stage is driven either by
a driver at a programmable switch or by a buffer in the long wire. By Thevenin's theorem,
the combination of RC pairs at the i-th stage can be summarized into $R^s_i$ and $C^s_i$, respectively.
Following [Bak90, Sak93], the driver or buffer is modelled by a switch-level RC circuit and, thus,
the input resistance and load capacitance of the driver are denoted as $R^d_i$ and $C^d_i$, respectively.
Note that the parasitic capacitance, $C^s$, is a composite of both the coupling capacitance and the
capacitance to the ground plane, as in Fig. 4.1(d). Therefore, the total capacitance, $C^s$, is the sum
of the four different capacitances in this case.
The model generalises the complex interconnection into a simple multiple-stage link, which can be
characterised by the RC pairs of the buffers and interconnects. By resolving the waveform at each of
the stages, the whole interconnection can be studied and evaluated analytically. In the following
subsection, a simple closed-form approximation is reviewed and applied to approximate the
waveform at each of the stages.
4.3.2 Waveform Approximation at Each Stage
The waveform at each stage of the interconnection can be modelled by RC lines. The step response
of RC lines is well studied in the literature [Bak90]. In general, two types of model can be used,
namely the distributed and the lumped models. The lumped RC model is a pessimistic model for
an RC line whereas the distributed RC model is more realistic. It is well known that at t = 1.0RC the
output is charged up to only 63% of Vdd in the lumped model, versus 90% of Vdd in the
distributed model [Bak90].
The two RC models for buffered lines are presented in the following. Let $v_i$ denote the
voltage at the far end of the i-th stage as a fraction of the supply voltage, so that
$v_i = V_i(t)/V_{DD}$. The response of a lumped RC network is
$$v_i(t) = 1 - e^{-t/(R^s_i C^s_i)} \qquad (4.1)$$
The response of the distributed network is harder to calculate and no closed-form solution
exists. Only approximations, such as the formula presented below, are available
[Bak90]:
$$v_i(t) = \begin{cases} 2\,\mathrm{erfc}\left(\sqrt{\dfrac{R^s_i C^s_i}{4t}}\right) & t \ll R^s_i C^s_i \\ 1.0 - 1.366\,e^{-2.5359\,t/(R^s_i C^s_i)} + 0.366\,e^{-9.4641\,t/(R^s_i C^s_i)} & t \gg R^s_i C^s_i \end{cases} \qquad (4.2)$$
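The 63% versus 90% figures quoted above can be checked numerically. The following is a minimal Python sketch of Eqs. (4.1) and (4.2) with time expressed in units of RC. The function names and the switch-over point between the two branches of Eq. (4.2) are assumptions of this sketch; the text only states t ≪ RC and t ≫ RC.

```python
import math

def lumped_step(t_over_rc: float) -> float:
    """Normalised step response of a lumped RC stage, Eq. (4.1)."""
    return 1.0 - math.exp(-t_over_rc)

def distributed_step(t_over_rc: float, switch_over: float = 0.2) -> float:
    """Approximate normalised step response of a distributed RC line, Eq. (4.2).

    switch_over selects between the small-t and large-t branches; its value
    is an assumption of this sketch, not taken from the text.
    """
    if t_over_rc <= 0.0:
        return 0.0
    if t_over_rc < switch_over:
        # RC / (4 t) written in terms of t / RC
        return 2.0 * math.erfc(math.sqrt(1.0 / (4.0 * t_over_rc)))
    return (1.0
            - 1.366 * math.exp(-2.5359 * t_over_rc)
            + 0.366 * math.exp(-9.4641 * t_over_rc))

for x in (0.5, 1.0, 2.0):
    print(f"t = {x:.1f} RC: lumped {lumped_step(x):.3f}, "
          f"distributed {distributed_step(x):.3f}")
```

At t = 1.0RC the lumped model gives about 0.63 and the distributed approximation about 0.89, in line with the values cited from [Bak90].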
These equations are difficult to use for ordinary circuit analysis. An alternative approximation
given by Sakurai in [Sak93] provides a closed-form expression for a distributed RC line, which is
easier to apply to interconnect modelling. In Sakurai's approximation, the output of an RC
line for a step input is given by
$$v_i(t) = 1 - \sum_{j=1}^{\infty} k_{i,j}\,e^{-t/\sigma_{i,j}} \approx 1 - k_{i,1}\,e^{-t/\sigma_{i,1}} \qquad (4.3)$$
To simplify the expression, the index j can be dropped, so that it becomes
$$v_i(t) = 1 - k_i\,e^{-t/\sigma_i} \qquad (4.4)$$
where $k_i$ is a coefficient and $\sigma_i$ is the time constant for charging or discharging the buffered line
[Sak93]:
$$\sigma_i = R^d_i C^d_i + R^d_i C^s_i + R^s_i C^d_i + 0.4\,R^s_i C^s_i \qquad (4.5)$$
and
$$k_i = 1.01\,\frac{R^d_i C^s_i + R^s_i C^d_i + R^s_i C^s_i}{R^d_i C^s_i + R^s_i C^d_i + \frac{\pi}{4} R^s_i C^s_i} \qquad (4.6)$$
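As a small illustration of Eqs. (4.5) and (4.6), the sketch below computes the time constant and coefficient of one buffered stage in Python. The function name and the numeric stage values are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from the thesis.

```python
import math

def sakurai_params(r_d, c_d, r_s, c_s):
    """Return (sigma, k) of one buffered stage following Eqs. (4.5)-(4.6).

    r_d, c_d: driver resistance (Ohm) and capacitance (F)
    r_s, c_s: lumped wire resistance (Ohm) and capacitance (F) of the stage
    """
    sigma = r_d * c_d + r_d * c_s + r_s * c_d + 0.4 * r_s * c_s
    num = r_d * c_s + r_s * c_d + r_s * c_s
    den = r_d * c_s + r_s * c_d + (math.pi / 4.0) * r_s * c_s
    return sigma, 1.01 * num / den

# Hypothetical stage values, for illustration only (not from the text).
sigma, k = sakurai_params(r_d=100.0, c_d=100e-15, r_s=100.0, c_s=100e-15)
print(f"sigma = {sigma * 1e12:.1f} ps, k = {k:.4f}")
```

Because the wire term enters the numerator of k with weight 1 and the denominator with weight π/4, k stays slightly above 1.01 and approaches 1.01 as the driver terms dominate.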
A more general expression that suits FPGA interconnects, applicable to stages that contain a pass
transistor, can also be provided. It is well known that a pass transistor implemented with
an NMOS device is not effective at pulling a node to VDD. When the pass transistor pulls a node
high, the output only charges up to $V_{DD} - V_{Tn}$ [RCN03]. Therefore, let $\gamma_i$ be the discount factor
for the i-th stage, with $\gamma_i = (V_{DD} - V_{Tn})/V_{DD}$ for i ≥ 2 and $\gamma_i = 1$ for i = 1. Note that pass
transistors have been abandoned in many commercial FPGA architectures. Instead, multiplexers and
transmission gates are used in modern architectures for switch and connection block design. For
these cases, simply let the discount factor $\gamma_i = 1$. Then, the general approximation for the rise
and fall of $v_i$ at the i-th stage becomes
$$\text{(Rise)}\quad v_i = \gamma_i - \gamma_i k_i\,e^{-t_i/\sigma_i} \qquad (4.7)$$
$$\text{(Fall)}\quad v_i = \gamma_i k_i\,e^{-t_i/\sigma_i} \qquad (4.8)$$
The complex interconnection in an FPGA has thus been generalized as a series of buffered segments.
Methodologies that were used to study global interconnects in ASICs can be applied here to
analyse the overall delay and throughput of FPGA interconnects. With this model, the delay
and throughput of the overall link can be derived, as discussed in the following section.
4.3.3 Delay of Long Lines in FPGA
With the interconnect modelled as a multiple-stage link, the delay of the link can be derived
analytically by resolving the waveform at each of the stages. Consider an interconnection that is
partitioned into n stages of buffered line. The Thevenin equivalent resistance and capacitance at
the i-th stage are denoted as $R^s_i$ and $C^s_i$, and the input resistance and load capacitance of the
driver are denoted as $R^d_i$ and $C^d_i$, respectively.
It is assumed that the delay in each stage is given by the 50% rise time (or fall time)1 of the
interconnect segment. The delay of the driver at the i-th stage is denoted as $\delta_i$. In particular,
$\delta_i$ also includes the delay of the logic at the interconnect switch. Thus, $\delta_i$ can
be unique to each stage, as the interconnect switches can all be different, subject to the routing
of the interconnection. The time $t_i$ for the i-th stage to reach $v_i$ can be derived from Eq. (4.7) as
$$t_i = \sigma_i \ln\left(\frac{\gamma_i k_i}{\gamma_i - v_i}\right) \qquad (4.9)$$
where $t_i$ is the time required for the output of the i-th stage to reach a fraction $v_i$ of the full-scale
voltage. Alternative delay models are available in the literature, such as the Elmore delay [Elm48].
The Elmore delay model is a lumped RC network model which provides a pessimistic estimate.
The Bakoglu delay model [Bak90] is very similar to the equation derived here based on Sakurai's
model. The total delay for the voltage at the n-th stage to reach $v_n$ is given by
$$T^{Delay}_n = \sigma_n \ln\left(\frac{\gamma_n k_n}{\gamma_n - v_n}\right) + \sum_{i=1}^{n-1} \sigma_i \ln\left(\frac{\gamma_i k_i}{\gamma_i - 0.5}\right) + \sum_{i=1}^{n} \delta_i \qquad (4.10)$$
1 It is common to use rise time to refer to both rise and fall times. This nomenclature will be used in the rest of the
chapter.
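The delay-based throughput $\Gamma^{Delay}_n$ for an interconnection with n stages is given by the inverse of
the propagation delay as
$$\Gamma^{Delay}_n = \frac{1}{\sigma_n \ln\left(\frac{\gamma_n k_n}{\gamma_n - v_n}\right) + \sum_{i=1}^{n-1} \sigma_i \ln\left(\frac{\gamma_i k_i}{\gamma_i - 0.5}\right) + \sum_{i=1}^{n} \delta_i} \qquad (4.11)$$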
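The delay model of Eqs. (4.9) to (4.11) can be evaluated directly once the per-stage parameters are known. The following Python sketch assumes the parameters are supplied as lists ordered from stage 1 to stage n. The function names and the 3-stage numbers are hypothetical and serve only as an illustration.

```python
import math

def stage_time(sigma, k, v, gamma=1.0):
    """Time for one stage to rise to fraction v of the supply, Eq. (4.9)."""
    return sigma * math.log(gamma * k / (gamma - v))

def total_delay(sigmas, ks, deltas, v_n, gammas=None):
    """Propagation delay of an n-stage link, Eq. (4.10)."""
    n = len(sigmas)
    gammas = gammas or [1.0] * n
    # Stages 1..n-1 are timed at their 50% point, the last stage at v_n.
    t = sum(stage_time(sigmas[i], ks[i], 0.5, gammas[i]) for i in range(n - 1))
    t += stage_time(sigmas[-1], ks[-1], v_n, gammas[-1])
    return t + sum(deltas)

# Hypothetical 3-stage link: sigma = 0.23 ns, k = 1.10, 50 ps driver delay.
T = total_delay([0.23e-9] * 3, [1.10] * 3, [50e-12] * 3, v_n=0.9)
throughput = 1.0 / T  # Eq. (4.11)
print(f"T_delay = {T * 1e9:.3f} ns, throughput = {throughput / 1e9:.3f} Gbit/s")
```

Note that with a pass-transistor stage (γ < 1) the 50% threshold requires γ > 0.5, otherwise the logarithm in Eq. (4.9) is undefined.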
From the equations above, delay has a linear relationship with the time constant σ. With technology
scaling, the coefficient k converges to 1 and δ decreases with the reduction of gate delay. However,
the time constant σ will increase significantly in new technology processes, because of the
explosive growth of wire resistance [HMH01].
4.4 Wave-Pipelined Signalling
Despite the large delay of long interconnections in FPGAs, it has been shown that the intrinsic
interconnect architecture benefits the realisation of wave-pipelining, which allows multiple bits to
traverse the line simultaneously [MDS+08a]. The pre-fabricated buffers inserted into
the switches and long interconnects help to preserve the pulse waveform as it propagates
along the lines. The overall throughput (or data rate) of the line can therefore be significantly
increased. In the following section, the fundamental wave throughput and power dissipation
models are presented.
4.4.1 Throughput of Wave-Pipelined Links
Figure 4.2: Signal transmission using (a) delay-based and (b) wave-pipelined schemes.
Conventional delay-based signalling uses a synchronous mechanism in which a new bit of data is
transmitted from the source only after the last bit of data has arrived at the sink. Therefore, the
throughput of the interconnect for this approach is the reciprocal of the interconnect delay, as
shown in Eq. 4.11. Fig. 4.2(a) shows the mechanism of delay-based signalling. A new data bit, as
illustrated by the falling edge in the figure, is transmitted only when the previous data bit
has arrived at the sink.
The throughput of an interconnect can be increased by increasing the data injection rate. The
source injects a new data bit into the link even when the previous data bit has not yet arrived at the
sink (as shown in Fig. 4.2(b)). In this way, the throughput of the link can be greatly improved, as
multiple bits traverse the line simultaneously. However, there is an upper bound
on the throughput below which data bits do not interfere with or corrupt each other along
the line and data is transmitted reliably. In particular, when a pulse propagates along
a line, its amplitude attenuates. This is because the charging and discharging times of an
interconnect differ: the discharging time is shorter, which
results in pulse attenuation [WMSC09]. For example, Fig. 4.3(a) shows a stimulus waveform that
is injected into the interconnect. This input is transformed into the waveform in Fig. 4.3(b),
which is observed at the end of the first interconnect stage. It has a smaller pulse width, and the pulse
amplitude is reduced to $v_1$, where $v_1 < V_{dd}$. When the pulse propagates through n stages, the
pulse width and amplitude are further reduced, as shown in Fig. 4.3(c). It is important to ensure
that the amplitude of the pulse at the sink is large enough2, so that valid data can be registered at
the output.
Mathematically, this can be determined by evaluating the maximum data rate, which is the
reciprocal of the minimum pulse width at the source such that the amplitude of the pulse at the end
of the n-th segment can be pulled up to $v_n$. If the data rate is higher than the maximum data rate,
the amplitude of the pulse at the n-th segment will be less than $v_n$, in which case the pulse may not be
registered correctly and signalling may fail.
The minimum pulse width for an interconnection with n stages is denoted as $T^{Wave}_n$, and $v_i$ is
the amplitude of the pulse at the i-th stage of the link. Using the buffered interconnect model
in Eq. 4.9, the time required for the voltage at the first segment to charge up to $v_1$ becomes
$$T^{Wave}_n = \sigma_1 \ln\left(\frac{\gamma_1 k_1}{\gamma_1 - v_1}\right) + \delta_1 \qquad (4.12)$$
where $\delta_1$ is the delay of the buffer at the first stage. Therefore, the maximum data rate (or throughput)
$\Gamma^{Wave}_n$ for an interconnection is given by
$$\Gamma^{Wave}_n = \frac{1}{\sigma_1 \ln\left(\frac{\gamma_1 k_1}{\gamma_1 - v_1}\right) + \delta_1} \qquad (4.13)$$
Note that v1 is an unknown variable and can only be computed backward from vn. The computa-
tion of v1 is presented as follows.
Computing v1 backward from vn
In [VD05], a recursive relationship between $v_{n-1}$ and $v_n$ is given by
$$v_{n-1} = \frac{1}{2 - v_n} \qquad (4.14)$$
This expression is obtained under the assumption that all stages are equal, so that a
technology-independent recursive relationship between $v_i$ and $v_{i-1}$ can be derived [VD05]. However, it is not
applicable to the FPGA interconnect model, as each stage has its own parameters
and the stages are bound to differ. Thus, a more general expression for the relationship between
$v_i$ and $v_{i-1}$ is required.
2 It has been suggested in [VD05] that $v_n$ equal 90% of Vdd, a pessimistic assumption that ensures reliable
data transfer and that a transistor switch stays fully turned off until the waveform reaches this level.
Figure 4.3: Pulse propagation along the link: (a) stimulus pulse at the source, (b) waveform of the pulse
observed at the output of the first stage and (c) attenuated waveform observed at the output of the
last stage.
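Since the time for the (i−1)-th segment to reach 50% of the supply voltage is
$T^{Delay}_i - T^{Delay}_{i-1} - \delta_i$, the following expression can be obtained:
$$v_{i-1}\gamma_{i-1}k_{i-1}\,e^{-(T^{Delay}_i - T^{Delay}_{i-1} - \delta_i)/\sigma_{i-1}} = 0.5 \qquad (4.15)$$
By substituting Eq. (4.10) into Eq. (4.15), it can be further reduced to
$$v_{i-1}\gamma_{i-1}k_{i-1}\,e^{-\frac{\sigma_i}{\sigma_{i-1}}\ln\left(\frac{\gamma_i k_i}{\gamma_i - v_i}\right) - \ln\left(\frac{\gamma_{i-1} - v_{i-1}}{\gamma_{i-1} - 0.5}\right)} = 0.5 \qquad (4.16)$$
which leads to the recursive relationship between $v_i$ and $v_{i-1}$,
$$v_{i-1} = \frac{\gamma_{i-1}}{\gamma_{i-1}k_{i-1}(2\gamma_{i-1} - 1)\left(\frac{\gamma_i - v_i}{\gamma_i k_i}\right)^{\sigma_i/\sigma_{i-1}} + 1} \qquad (4.17)$$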
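The algebra between Eq. (4.16) and Eq. (4.17) is short and can be spelled out as follows. The shorthand $A_i$ is introduced here only for this derivation and is not part of the original notation.

```latex
\begin{align*}
\text{Let } A_i &= \left(\frac{\gamma_i - v_i}{\gamma_i k_i}\right)^{\sigma_i/\sigma_{i-1}}
  \quad\text{so that Eq. (4.16) reads} \\
v_{i-1}\,\gamma_{i-1}k_{i-1}\,A_i\,
  \frac{\gamma_{i-1}-0.5}{\gamma_{i-1}-v_{i-1}} &= 0.5 \\
v_{i-1}\,\gamma_{i-1}k_{i-1}\,(2\gamma_{i-1}-1)\,A_i &= \gamma_{i-1}-v_{i-1} \\
v_{i-1}\left[\gamma_{i-1}k_{i-1}\,(2\gamma_{i-1}-1)\,A_i + 1\right] &= \gamma_{i-1} \\
v_{i-1} &= \frac{\gamma_{i-1}}{\gamma_{i-1}k_{i-1}\,(2\gamma_{i-1}-1)\,A_i + 1}
\end{align*}
```

With $\gamma = 1$ and identical stages, $A_i = (1 - v_i)/k$ and the last line collapses to $v_{i-1} = 1/(2 - v_i)$.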
Eq. (4.17) reduces to Eq. (4.14) by letting $k_{i-1} = k_i$, $\sigma_{i-1} = \sigma_i$ and $\gamma_i = 1$ for
i = 1, 2, ..., n. As can be seen, Eq. (4.17) includes the parameters $\sigma_i$ and $k_i$ to provide a
specific model of each stage in a long FPGA interconnection. Note that $v_{i-1}$ in Eq. (4.17)
does not depend on $\delta_i$, the delay of the buffer at stage i. This is a counter-intuitive result:
it implies that the amplitude recursion, and hence the throughput, is independent of the delay of these
buffers. An example of the interconnect throughput calculation is presented in the following.
Example 4.1. Consider an interconnection with three identical buffered segments of length 0.12
mm. Let the resistance $R^s_i$ = 489 Ohm/mm and the capacitance $C^s_i$ = 187 fF/mm, where i = 1, 2, 3.
Also, let the output resistance of the buffer $R^d_i$ = 245 Ohm and the output load of the buffer
$C^d_i$ = 201 fF. The time constant σ = 0.23 × 10^−9 s and coefficient k = 1.10 can be obtained
using Eqs. 4.5–4.6. Suppose the signal at the end of the interconnection is registered as
HIGH when the voltage is greater than or equal to 90% of Vdd. Let $v_3$ = 0.9 and $\gamma_i$ = 1 for i = 1, 2, 3.
Using Eq. 4.17, the voltage at the end of the second buffered stage becomes $v_2$ = 0.909.
Similarly, the voltage at the first stage becomes $v_1$ = 0.9167, which implies that the first segment
has to be driven up to 91.67% of Vdd so that the signal can be captured at the sink. Finally,
using Eqs. 4.12–4.13, the minimum pulse width equals 0.643 ns and the maximum throughput
becomes 1.56 Gbit/s.
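The backward computation in Example 4.1 can be sketched in Python as below. The function names are hypothetical. The values of σ and k are the ones quoted in the example, and the buffer delay δ1 is not stated there, so the last line evaluates only the charging term of Eq. (4.12).

```python
import math

def previous_amplitude(v_i, sigma_i, k_i, gamma_i, sigma_prev, k_prev, gamma_prev):
    """One backward step of Eq. (4.17): amplitude required at stage i-1."""
    a = ((gamma_i - v_i) / (gamma_i * k_i)) ** (sigma_i / sigma_prev)
    return gamma_prev / (gamma_prev * k_prev * (2.0 * gamma_prev - 1.0) * a + 1.0)

def first_stage_amplitude(v_n, sigmas, ks, gammas):
    """Walk from the last stage back to stage 1 and return v_1."""
    v = v_n
    for i in range(len(sigmas) - 1, 0, -1):
        v = previous_amplitude(v, sigmas[i], ks[i], gammas[i],
                               sigmas[i - 1], ks[i - 1], gammas[i - 1])
    return v

sigma, k = 0.23e-9, 1.10  # values quoted in Example 4.1
v2 = previous_amplitude(0.9, sigma, k, 1.0, sigma, k, 1.0)
v1 = first_stage_amplitude(0.9, [sigma] * 3, [k] * 3, [1.0] * 3)
charge_time = sigma * math.log(k / (1.0 - v1))  # Eq. (4.12) without delta_1
print(f"v2 = {v2:.4f}, v1 = {v1:.4f}, charge time = {charge_time * 1e9:.3f} ns")
```

For identical stages with γ = 1 the recursion is the same as Eq. (4.14), so $v_2 = 1/(2 - 0.9)$ and $v_1 = 1/(2 - v_2)$. Adding the first-stage buffer delay δ1 to the charging time gives the minimum pulse width of Eq. (4.12).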
Note that the performance of an interconnection can also be affected by various other on-chip
noise sources, such as coupling capacitance to neighbouring wires, crosstalk, DC and jitter noise.
Inter-wire coupling or crosstalk is one of the error sources in the circuit and has been widely studied.
Wave-pipelining is thought to be susceptible to crosstalk noise because of its use
of dynamic circuits and extremely high data throughput, which is about 2-4 Gbit/s per wire.
There are a number of techniques to mitigate the impact of crosstalk. Many of these techniques
require special physical design, such as shielding [Wal00], interleaved buffer structures [XW02]
or the surfing circuit proposed by Greenstreet et al. in [TLG09]. These techniques are well
suited to the ASIC domain.
For FPGA implementation, crosstalk is not the biggest limiting factor for realizing
wave-pipelining. The clocking frequency of an FPGA is slow enough that the impact of crosstalk is
not significant. Note that neither the physical layout nor the circuits can be modified in FPGAs.
Under the principle of keeping the existing FPGA interconnect architecture, the proposed model
in this chapter provides a reasonable estimate of the fundamental throughput achievable with
wave-pipelining in FPGAs.
From the discussion above, it can be seen that the relationship between throughput and the electrical
parameters of an interconnect is complex. The important result of this analysis is that throughput
depends not merely on the time constant σ but also on the time constant ratio $\sigma_i/\sigma_{i-1}$ between the
i-th and (i−1)-th stages, as this ratio determines the voltage $v_{i-1}$ at the previous stage. It means that a
wave-pipelined link is less sensitive to wire scaling; rather, the interconnect lengths of the
segments determine the fundamental throughput of the link. This indicates an important opportunity
to mitigate the large RC delay challenge arising from technology scaling.
Power
There are no registers along the interconnection in a wave-pipelined link. Therefore, the power
consumption of a wave-pipelined link equals the power needed to drive the signals along the
interconnection only. Let $f_{sw}$ be the average bit toggling frequency of the line, and let the capacitive
load at each stage equal the sum of the wire capacitance $C^s_i$ and the buffer output load $C^d_i$. The dynamic
power consumption of a wave-pipelined link with an n-stage buffered interconnect can be computed
using the following equation:
$$P^{Wave} = 0.5\,V_{dd}^2\,f_{sw}\sum_{i=1}^{n}\left(C^s_i + C^d_i\right) \qquad (4.18)$$
Note that this power expression only captures the power
consumption of the data lines. Wave-pipelined signalling requires additional circuits for the
source-synchronous clock reference and data sampling, which consume additional power, as
will be discussed in Section 4.5.
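Eq. (4.18) is a plain sum over the stage capacitances, which the following Python sketch makes explicit. The function name and all numeric values are hypothetical and used only for illustration.

```python
def wave_link_power(v_dd, f_sw, stage_caps):
    """Dynamic power of a wave-pipelined link following Eq. (4.18).

    stage_caps is a list of (C_s, C_d) pairs in farads, one per stage.
    """
    total_cap = sum(c_s + c_d for c_s, c_d in stage_caps)
    return 0.5 * v_dd ** 2 * f_sw * total_cap

# Hypothetical link: 10 stages, 100 fF wire and 200 fF driver load per stage,
# 1.0 V supply and 1 GHz average toggling frequency.
p_wave = wave_link_power(v_dd=1.0, f_sw=1e9, stage_caps=[(100e-15, 200e-15)] * 10)
print(f"P_wave = {p_wave * 1e3:.2f} mW")
```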
4.4.2 Register-Based Pipelining Link
Apart from wave-pipelining, inserting registers into a long line is a well-known technique [CW97]
for improving circuit throughput. A long interconnection can be partitioned into shorter segments,
each driven by a register. The throughput therefore increases, but at the expense of higher
power dissipation and larger latency.
Throughput
The throughput of a register-pipelined link can be approximated as follows. Suppose
k registers are inserted into the long line, so that the line is divided equally into k + 1
short segments. Unlike the wave-pipelined link, extra interconnects are required to route the
line into logic blocks and to reconnect the output of the logic blocks back to the routing tracks.
Furthermore, there is an extra propagation delay attributed to the register itself. Assuming the
same structure and network topology for each slice in an FPGA, the additional delay added to the line
by one register is denoted by $\zeta^{Reg}$. The overall delay of the link becomes
$$D^{Reg}_k = (k + 1)\,T_k + k\,\zeta^{Reg} \qquad (4.19)$$
where $T_k$ is the delay of each short line segment and $\zeta^{Reg}$ is the delay of each register. The
throughput of the link is then the reciprocal of the combined delay of one segment and one register, so
the overall throughput is
$$\Gamma^{Reg}_k = \frac{1}{T_k + \zeta^{Reg}} \qquad (4.20)$$
Power
The overall power of the link is similar to that of the wave-pipelined link but with additional capacitance,
$C^{Reg}$, which is introduced by the registers (and local interconnects in the logic blocks). The power
consumption expression is as follows:
$$P^{Reg} = 0.5\,V_{dd}^2\,f_{sw}\left[k\,C^{Reg} + \sum_{i=1}^{n}\left(C^s_i + C^d_i\right)\right] \qquad (4.21)$$
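The three register-link expressions, Eqs. (4.19) to (4.21), can be collected into one small Python sketch. The function name and every numeric value below are hypothetical and chosen only to illustrate the formulas.

```python
def register_link(k, t_seg, zeta_reg, v_dd, f_sw, stage_caps, c_reg):
    """Latency, throughput and power of a link with k inserted registers."""
    latency = (k + 1) * t_seg + k * zeta_reg                  # Eq. (4.19)
    throughput = 1.0 / (t_seg + zeta_reg)                     # Eq. (4.20)
    wire_cap = sum(c_s + c_d for c_s, c_d in stage_caps)
    power = 0.5 * v_dd ** 2 * f_sw * (k * c_reg + wire_cap)   # Eq. (4.21)
    return latency, throughput, power

# Hypothetical link: 5 registers, 0.5 ns per segment, 0.2 ns per register.
lat, thr, pwr = register_link(k=5, t_seg=0.5e-9, zeta_reg=0.2e-9, v_dd=1.0,
                              f_sw=1e9, stage_caps=[(100e-15, 200e-15)] * 6,
                              c_reg=50e-15)
print(f"latency = {lat * 1e9:.2f} ns, throughput = {thr / 1e9:.2f} Gbit/s, "
      f"power = {pwr * 1e3:.3f} mW")
```

The sketch shows the trade-off stated in the text: each added register shortens the segment delay that limits throughput, but adds its own delay to the latency and its own capacitance to the power.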
Table 4.2: Comparison between register-based pipelining and wave-pipelining based on the
theoretical analysis
                    Register-based pipelining    Wave-pipelining
Technology          65 nm                        65 nm
Length              6.45 mm (75 tiles)           6.45 mm (75 tiles)
Registers/Stages    5                            14
Throughput          1.4 Gb/s                     1.4 Gb/s
Latency             6.12 ns                      4.12 ns
Power               10.08 mW                     7.98 mW
Table 4.2 compares the throughput and power performance of register-based and wave-pipelined
signalling. The two links are identical, with the same programmable configuration and electrical
parameters. Using the same technology, interconnect length and throughput, as shown in the
table, the power dissipation and latency results are obtained from Eqs. 4.13-4.21. The register
link requires 26% more power and has a 49% longer delay than the wave-pipelined
link. These results demonstrate a significant benefit of the wave-pipelined link for global
interconnection. However, the register-pipelined link is much more straightforward to implement
in a synchronous system such as an FPGA. The wave-pipelined link applies analogue signalling for
data transmission, which requires new designs of on-chip communication components; these
techniques are presented in the following section. Note that the above power evaluation
neglects the synchronisation costs and the need to send a source-synchronous request signal,
which may introduce additional power for the communication link. However, various
techniques and circuits can be applied to realize the synchronization circuit, with different
power consumptions and trade-offs. The power analysis in this chapter provides a general
estimate and comparison of the two pipelining approaches.
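The percentages quoted for Table 4.2 follow directly from the latency and power rows, as the short Python check below shows. The dictionary layout is only a convenience of this sketch.

```python
# Latency and power rows of Table 4.2.
register_link_row = {"latency_ns": 6.12, "power_mw": 10.08}
wave_link_row = {"latency_ns": 4.12, "power_mw": 7.98}

extra_power = register_link_row["power_mw"] / wave_link_row["power_mw"] - 1.0
extra_latency = register_link_row["latency_ns"] / wave_link_row["latency_ns"] - 1.0
print(f"register link: {extra_power:.1%} more power, "
      f"{extra_latency:.1%} longer latency")
```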
Figure 4.4: Schematic of transmitter and receiver for the phase adaptation design.
(Labels in the figure: Transmitter, Receiver, Phase adaptor, Clock divider, Reg, Progr. DE, Data, clk_snd, clk_rcv, adjust, clk, ref.)
Summary
The maximum data rate (throughput) of a wave-pipelined link has been derived. It is interesting
to find that the throughput is independent of buffers or interconnect RC delay in a wave-pipelined
link. In particular, the interconnects in FPGAs present the characteristics required for realizing
wave-pipelined signalling and, thus, the wave-pipelining approach provides an opportunity to mitigate
the delay problem attributed to the large RC time constant in FPGAs. Analytical results also show
that the wave-pipelined link is more energy efficient and has smaller latency than the
register-based link. However, the register-pipelined link provides greater safety and immunity
to on-chip noise than the wave-pipelined link, because of the analogue nature of the
wave-pipelined link. Further evaluation of wave-pipelined links in FPGAs in terms of performance, area, energy
utilization and reliability will be discussed in the following section.
4.5 Wave-Pipelining Circuit Design in FPGAs
4.5.1 Phase Adaptation
Wave-pipelined signalling is analogue in nature and requires interfacing circuits to communicate
with its digital synchronous counterparts. Typically, a sampling circuit, which can be a register, is
used at the receiver end of a wave-pipelined link. Suppose the data rate of the link is already known;
the challenge is then to synchronize the incoming data with the receiver clock. This can be achieved
by simply adding delay elements in the clocking path of the receiver to match the skew of the
incoming data. However, when the skew is unknown or unpredictable, it is difficult to match the
delay. For an FPGA design, the clock skew is unknown before the place-and-route of the circuit.
Therefore, an intelligent circuit is required to adaptively adjust the delay to match and synchronize
with the incoming data from the line.
Figure 4.5: Design of a phase adaptor.
Fig. 4.4 shows the schematic of a phase adaptation circuit for a wave-pipelined link. Data is
transmitted source-synchronously with a clock signal generated by the transmitter. The clock
signal provides a reference for adjusting the receiver in order to register the incoming data
at the correct time. Utilising a delay-locked loop or phase-locked loop, the received data can
be sampled with a variable amount of delay relative to the transmitter clock signal, as shown in
Fig. 4.4. Fig. 4.5 shows the phase detector circuit implemented in this design. The XOR gate
performs phase difference detection and the counter output is used to adjust the programmable
delay element in order to modify the phase of the local clock. Note that there is a chance of
generating a glitch on en when clk and ref change simultaneously, and the counter may start to
count even when there is no phase difference between clk and ref. This can be resolved by
adding an AND gate between en and the up or down outputs to ensure that the counter only
starts to count when a phase difference is detected.
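The adaptation loop can be illustrated with a very coarse behavioural model in Python. This is only a sketch under several assumptions that are not in the text: both clocks are ideal 50% duty-cycle square waves of the same period, the counter only counts up, the delay element has a uniform step, and lock is declared when the XOR output is high for less than a tolerance fraction of the period. All names and numbers are hypothetical.

```python
def xor_high_fraction(phase_ps, period_ps):
    """Fraction of a period during which clk XOR ref is high.

    Assumes two 50% duty-cycle square waves offset by phase_ps.
    """
    phi = (phase_ps % period_ps) / period_ps
    return 2.0 * min(phi, 1.0 - phi)

def adapt_delay(skew_ps, period_ps, step_ps=50.0, max_steps=64, tol=0.1):
    """Step the delay code up until the XOR output is high for less than tol.

    Returns the delay code at lock, or None if the delay range is exhausted.
    """
    for code in range(max_steps + 1):
        residual = skew_ps + code * step_ps
        if xor_high_fraction(residual, period_ps) < tol:
            return code
    return None

code = adapt_delay(skew_ps=320.0, period_ps=1000.0)
print(f"locked at delay code {code}")
```

In this model the loop settles when the accumulated delay brings the local clock to within a few percent of a period of the reference, which mirrors the role of the counter and programmable delay element described above.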
Figure 4.6: Implementation of Delay Element in FPGA.
The phase detector shown in Fig. 4.5 identifies the differences in phase between two input clock
signals. The design is a digital adaptation of the phase detector design in a conventional Phase-
Lock-Loop (PLL), except that the output of the XOR-gate becomes an enabling signal to the
counter to adjust the delay. The output of the counter will adjust the programmable delay element
in order to modify the phase of the local clock.
The design of an accurate, high-resolution programmable delay element is important for the phase
adaptation circuit. This is a challenging task for FPGA designers, as there are no dedicated circuits
available for implementing delay elements3. In our design, a chain of LUTs is used for the delay circuit. Fig. 4.6
shows the design of a delay element using LUTs in an FPGA. The circuit can provide an arbitrary
delay by selecting the appropriate combination of LUTs. Each LUT implements two inverters,
and thus the resolution of the delay circuit can be around 50 ps. Alternative methods, such as
using interconnects, which may provide delay elements with higher resolution, can also be used to
realize the delay circuit in an FPGA. However, it is challenging to manage interconnects and routing
in an FPGA with existing tools and FPGA software.
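Selecting a delay from the LUT chain amounts to quantising the target delay to the LUT resolution. The Python sketch below assumes a uniform 50 ps delay per LUT, as quoted above, and a hypothetical chain length; in a real device the per-LUT delay varies with placement and routing.

```python
LUT_DELAY_PS = 50.0  # approximate per-LUT resolution quoted in the text

def taps_for_delay(target_ps: float, max_taps: int) -> int:
    """Number of chained LUTs whose total delay is closest to target_ps,
    clamped to the length of the available chain."""
    taps = round(target_ps / LUT_DELAY_PS)
    return max(0, min(max_taps, taps))

# Hypothetical 32-LUT chain.
n = taps_for_delay(330.0, max_taps=32)
print(f"{n} LUTs give about {n * LUT_DELAY_PS:.0f} ps")
```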
Another issue is the alignment of the data path with the transmitter clock. If the skew of the data is
large, the register may fail to capture the data. When the data link has a large bit-width, the variance
of the interconnection lengths becomes significant and thus increases the skew. The study in [MSCL08a]
shows that the interconnection length can increase drastically because of the limited routing
resources in an FPGA. Interconnects can also be routed manually to minimize the variance in
interconnection lengths. Developing specific routing algorithms that
minimize interconnect variance in communication links is regarded as future work.
3 For some recent FPGA models, delay elements are implemented in the I/O modules for inter-chip
communication circuits.
Figure 4.7: The design of an oversampling receiver.
(Blocks in the figure: Clock generator, Multi-phase sampling (two instances), Edge detection, Accumulate and compare. Signals: clk_rcv, ref, Data_in, Data_out, clk0-clkn.)
4.5.2 Multi-Phase Oversampling
An oversampling receiver captures the incoming data multiple times within one clock cycle using
multi-phase clocks, and must then decide which sample to output as the data. The schematic of a
multi-phase oversampling circuit is presented in Fig. 4.7. The clock generator produces a number
of clock outputs at different phases (clk0 to clkn). These clock signals are used to sample the
incoming data at different time instants within a clock cycle. The sampling process is realised
within the multi-phase sampling block. The edge detection block analyzes the registered data and,
by using an XOR gate, locates the bit transition points. The accumulate-and-compare block decides
which sampled data is valid and outputs it for later processing. The valid data point is usually
located between two transition edges. Fig. 4.8 shows a typical example of data recovery using
oversampling. In this case, three samples are obtained by the multi-phase clocks. The bit
transitions between samples s3 and s1 are detected by the XOR gate. The statistics of the bit
transitions are accumulated over the moving window, which is four phases in this case. Thus, the
data sample at s2 is selected as the valid data.
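The selection procedure described above can be illustrated with a minimal software sketch. It is not the circuit of Fig. 4.7: the function names, the flat list of 0/1 samples and the rule of picking the phase in the middle of the stable region are assumptions made here for illustration only.

```python
from collections import Counter


def detect_edges(samples):
    """XOR each sample with its successor; a 1 marks a bit transition."""
    return [a ^ b for a, b in zip(samples, samples[1:])]


def select_phase(samples, n_phases, window_periods):
    """Return the sampling phase farthest from the most frequent edge position.

    samples        -- flat list of 0/1 values, n_phases samples per clock period
    window_periods -- length of the moving window, in clock periods
    Returns None when no transition was seen inside the window.
    """
    window = samples[-(n_phases * window_periods + 1):]
    offset = len(samples) - len(window)  # absolute index of window[0]
    counts = Counter()
    for i, edge in enumerate(detect_edges(window)):
        if edge:
            # The transition lies just before absolute sample index offset + i + 1.
            counts[(offset + i + 1) % n_phases] += 1
    if not counts:
        return None
    edge_phase = counts.most_common(1)[0][0]
    # The data is stable from edge_phase onwards; pick the middle of that region.
    return (edge_phase + (n_phases - 1) // 2) % n_phases


def recover_bits(samples, n_phases, phase):
    """Keep one sample per clock period, taken at the selected phase."""
    return samples[phase::n_phases]


if __name__ == "__main__":
    bits = [0, 1, 1, 0, 1, 0, 0, 1]
    stream = [b for b in bits for _ in range(3)]  # three samples (s1, s2, s3) per bit
    phase = select_phase(stream, n_phases=3, window_periods=4)
    print(phase, recover_bits(stream, 3, phase))
```

With three phases and transitions falling between s3 and s1, the sketch selects index 1, which corresponds to s2 in the example of Fig. 4.8.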
Figure 4.8: An illustration of data sampling and moving windows for data recovery.
Oversampling requires a much higher clocking frequency in order to sample the data
multiple times within a clock period. An effective way to achieve this is to use delay elements:
inserting delay elements into the clocking path produces multiple clocking edges within a clock
period. Normally, edge detection and valid data selection would be applied to every single
incoming data line. If such a circuit were implemented for a communication link with a large
bit-width, a large hardware area would be required. A source synchronous approach can be applied
here to reduce the hardware area, as edge detection and valid data selection are then applied
only to the reference signal. The output of the accumulate-and-compare block is used to
select the valid data for all other lines. As a result, the area overhead of oversampling can be
greatly reduced.
4.5.3 Bit Error Rate Computation
Wave-pipelining can be susceptible to crosstalk and to the various dynamic noise sources present in
a chip. Although there are a number of techniques to mitigate the impact of noise, these techniques
generally require special physical design or modification of the routing, which is impractical in
FPGAs. However, it is reasonable to characterize the impact of the various noise sources, including
power supply noise, clock jitter and data-dependent noise, from a high-level perspective, so that
communication reliability can be evaluated in terms of the bit error rate.
In the source synchronous circuit, unbounded Gaussian noise sources account for a significant
fraction of the signalling noise. There is always some probability that a Gaussian noise source will
exceed the margin, so a probabilistic analysis is required to compute the bit error rate and
evaluate the reliability of the wave-pipelined communication link. The network noise margin,
VM, is the amount of voltage margin available to tolerate noise. In general, the error sources
can be described as amplitude error and phase error, as depicted in Fig. 4.9. Noise sources are
usually approximated as Gaussian sources, with a normal distribution, and described by their
standard deviation or root mean square value.
Figure 4.9: Timing diagram for error computation.
The throughput of a single line is limited by its minimum pulse width, as discussed
in Section 4.4. When the communication link comprises multiple lines forming a bus, pulses may
arrive at the receiver at different times because of on-chip dynamic noise and data-dependent
crosstalk. The skew between the data path and the reference clock can cause the wrong data to be
sampled. Whether the communication link is susceptible to such errors depends on the variance
(σk) of the interconnection lengths in the bus. A receiver with lower throughput is needed to
compensate for a large skew and to provide a large enough margin to register the correct samples.
This is in contrast to delay-based or register-based pipelining, whose throughput depends on
the longest interconnection length in the bus.
Furthermore, skew error also contributes to the bit error. Sampling away from the peak of the
waveform results in a sampled value that is less than the peak signal, namely Vdd−Vφ. The slewing
portion of the signal effectively translates phase noise into amplitude noise; as a result, timing
noise is translated into signal noise. For clarity of analysis, a triangular waveform is considered,
so that the phase noise translates linearly into amplitude noise. Thus, the phase noise
becomes
\[ g(\sigma_\phi) = V_{dd} \left| \frac{2\sigma_\phi}{T} \right| \qquad (4.22) \]
Jitter has a similar effect to static phase error, except that the error depends on the
phase-noise characteristics and its probability distribution.
The root-mean-square (RMS) voltage of the i-th amplitude noise source is denoted by VAi and
that of the i-th phase noise source by Vφi. The standard deviation of the amplitude noise is
denoted by σA. Multiple Gaussian noise sources can be modelled as a single combined source
with RMS voltage VR by summing the variances of the individual sources and taking the square root:
\[ V_R = \sqrt{\sum_i \left( \sigma_{A_i}^2 + g(\sigma_{\phi_i})^2 \right)} \qquad (4.23) \]
The effective voltage signal-to-noise ratio is computed from VM and VR as [BS79]
\[ VSNR = \frac{V_M}{V_R} \qquad (4.24) \]
Since the noise is Gaussian, the probability density function of the noise voltage is given by
\[ p(x) = \frac{1}{V_R \sqrt{2\pi}} \exp\left( -\frac{x^2}{2V_R^2} \right) \qquad (4.25) \]
where VR is the standard deviation of the combined noise. The probability that the noise exceeds
the noise margin VM, the threshold voltage, is given by the error probability
\[ P(x > V_M) = 1 - \operatorname{erf}(V_M) \qquad (4.26) \]
and, thus, the upper bound of the probability gives the Bit Error Rate [BS79],
\[ P(\text{error}) < \frac{V_R}{V_M \sqrt{2\pi}} \exp\left( -\frac{V_M^2}{2V_R^2} \right) \qquad (4.27) \]
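As a rough numerical illustration of Eqs. 4.22 to 4.27, consider a single amplitude noise source and a single phase noise source. The values used below (σA = 15 mV, σφ = 50 ps, T = 2.5 ns, Vdd = 1.2 V, VM = 0.6 V) are assumed purely for illustration, and T is taken here to be the bit period, which is an interpretation rather than a definition given in the text.

```latex
\begin{align*}
g(\sigma_\phi) &= V_{dd}\left|\frac{2\sigma_\phi}{T}\right|
  = 1.2 \times \frac{2 \times 50\,\mathrm{ps}}{2500\,\mathrm{ps}} = 0.048\,\mathrm{V} \\
V_R &= \sqrt{\sigma_A^2 + g(\sigma_\phi)^2}
  = \sqrt{0.015^2 + 0.048^2} \approx 0.0503\,\mathrm{V} \\
VSNR &= \frac{V_M}{V_R} \approx \frac{0.6}{0.0503} \approx 11.9 \\
P(\text{error}) &< \frac{V_R}{V_M\sqrt{2\pi}} \exp\left(-\frac{V_M^2}{2V_R^2}\right)
  \approx 0.0334 \times e^{-71.2} \approx 4 \times 10^{-33}
\end{align*}
```

Under these assumed values the phase noise term dominates the combined noise, which suggests why skew and jitter matter more as the bit period shrinks.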
[Figure 4.10 diagram: three pipeline stages, each with a register (Reg) on the Data path and a
control element (C), connected by Req and Ack handshake lines.]
Figure 4.10: 4-phase bundled-data asynchronous communication link (Figure adapted from
[SF01]).
4.5.4 Register Pipelining: Synchronous Versus Asynchronous
In a conventional synchronous system, register pipelining can be easily realized by inserting
registers into the data path to increase the data throughput. The same technique can be applied to
communication link design: registers can be inserted into long interconnections to increase the
throughput of the link. However, this approach assumes a single synchronous clock domain shared by
the two communicating ends. Such an assumption is not scalable, as large systems generally contain
multiple clock domains. Furthermore, if the clock spans a large area, issues such as clock skew and
jitter become problematic.
Therefore, an asynchronous design is required in such a scenario to provide more reliable
data transfer between modules by replacing clock circuits with handshake circuits. A number of
asynchronous designs have been proposed and thoroughly studied in the last two decades. However,
asynchronous communication links have yet to be explored in FPGAs.
One of the most popular asynchronous handshake protocols is the 4-phase bundled-data approach
[SF01]. Each stage has a register that pipelines the data and is enabled by the handshaking
logic (see Fig. 4.10). The drawback of the 4-phase protocol is that the register is written only
at the rising edge of the request signal. The speed of the pipeline can be improved by introducing
a protocol that operates on both the rising and falling edges of the control signals.
A 2-phase bundled-data protocol, originated by Sutherland [Sut89], provides such an improvement.
Each stage has two registers: one is written at the rising edge and the other at the falling edge,
so that both registers operate within one cycle. A change of state of the request and
acknowledgement signals indicates the read and write processes of the registers.
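The difference between the two handshakes can be seen by listing the signal events needed per data transfer. The sketch below is an abstract event trace, not a circuit model; the function names and the representation of events as (signal, level) pairs are assumptions made for illustration.

```python
def four_phase(n_transfers):
    """Event trace for the 4-phase (return-to-zero) handshake.

    Each transfer needs Req rise, Ack rise, Req fall, Ack fall.
    """
    req = ack = 0
    events = []
    for _ in range(n_transfers):
        for _ in range(2):  # active phase, then return-to-zero phase
            req ^= 1
            events.append(("Req", req))
            ack ^= 1
            events.append(("Ack", ack))
    return events


def two_phase(n_transfers):
    """Event trace for the 2-phase (transition signalling) handshake.

    Every Req transition, rising or falling, announces new data.
    """
    req = ack = 0
    events = []
    for _ in range(n_transfers):
        req ^= 1
        events.append(("Req", req))
        ack ^= 1
        events.append(("Ack", ack))
    return events


def captures_four_phase(events):
    """Data is written only on rising Req edges."""
    return sum(1 for sig, level in events if sig == "Req" and level == 1)


def captures_two_phase(events):
    """Data is written on every Req transition."""
    return sum(1 for sig, _ in events if sig == "Req")


if __name__ == "__main__":
    print(len(four_phase(3)), len(two_phase(3)))  # control events for 3 transfers
```

For the same number of transfers the 2-phase trace contains half as many control events, which is the source of the speed-up discussed above.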
The 4-phase and 2-phase bundled-data pipelines are the conventional asynchronous approaches when
edge-triggered D-FFs are used in the data path. Often transparent D-latches are used instead, and
the control path can be modified to keep these latches transparent most of the time so that data
can pass through [DPL+07]. However, the realization of such an asynchronous link is rather complex
and requires complicated ASIC circuit techniques, which are not straightforward or viable in FPGA
implementations. Therefore, the 4-phase and 2-phase bundled-data pipelines are employed in this
chapter as the asynchronous link reference.
Although register-based pipelining is reliable, several sources can introduce errors into the
link, especially when operating at very high speed. The most obvious source of error arises when
the data signal travels faster than the control signal, so that a register captures undesired data.
An effective way to resolve this problem is to add delay to the request and acknowledgement
lines to guarantee that the control signals lag behind the data. In an FPGA, this can be achieved by
adding delay elements, each a chain of LUTs, to the control path. However, this comes at
the expense of throughput.
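The delay-padding remedy can be sketched as a simple sizing calculation: how many LUT stages must be added to the control path so that the request arrives a given margin after the data. The function name, the parameters and every delay value below are hypothetical; real figures would come from the timing report of the target device.

```python
import math


def lut_stages_needed(data_delay_ns, control_delay_ns, margin_ns, lut_delay_ns):
    """Number of LUT delay stages to add to the control path.

    The control signal should arrive at least margin_ns after the data.
    All arguments are hypothetical delays in nanoseconds.
    """
    shortfall = data_delay_ns + margin_ns - control_delay_ns
    if shortfall <= 0:
        return 0  # the control path already lags the data by the margin
    return math.ceil(shortfall / lut_delay_ns)


if __name__ == "__main__":
    stages = lut_stages_needed(data_delay_ns=3.0, control_delay_ns=2.2,
                               margin_ns=0.2, lut_delay_ns=0.4)
    print(stages, "LUT stages,", stages * 0.4, "ns added to the control path")
```

Every added stage lengthens the handshake cycle, which is the throughput cost noted above.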
4.5.5 Summary
The five on-chip communication strategies are summarized and compared in Table 4.3. They
are compared on six attributes, such as data rate and reliability. In general, the asynchronous
and wave-pipelining approaches are able to cross clock domains. Also, the wave-pipelining
approach can achieve the highest data rate. However, in terms of design complexity, power and
hardware area consumption, the synchronous approach is preferable. More detailed quantitative
studies and results comparing the different approaches are presented and discussed in the
following section.
Table 4.3: A brief summary and comparison between the different on-chip communication approaches
                      Register pipelining                     Wave pipelining
                      Async.      Async.      Synchronous     Phase        Oversampling
                      (4-phase)   (2-phase)                   adaptation
Cross clock regions   Yes         Yes         No              Yes          Yes
Data rate             Slow        Moderate    Fast            Fast         Fast
Hardware area         Moderate    Large       Small           Moderate     Large
Power consumption     Low         High        High            Moderate     High
Reliability           High        High        High            Moderate     Moderate
Design complexity     Moderate    Moderate    Easy            Complex      Moderate
4.6 Experimental Results
4.6.1 Comparing Analytical Results with SPICE Simulations
[Figure 4.11 diagram: a chain of stages (Stage 1, Stage 2, Stage 3 to Stage n-1, Stage n), each
built from a switch and an interconnect segment.]
Figure 4.11: Circuit model of a typical global interconnection in FPGA.
A SPICE model of a global interconnection comprising buffers, interconnect segments and
switches is shown in Fig. 4.11. The wave-pipelined, delay-based and register-based pipelining
approaches are compared in the Cadence Virtuoso design environment and simulated with the SPECTRE
simulator. The interconnection circuits are modelled as distributed RC networks, and the technology
parameters are based on the Predictive Technology Model (PTM; see footnote 4). Also, interconnects
of single, double, length-3 and length-6 are considered in the experiments to construct long
interconnections.
Comparisons between the analytical model and SPICE simulations are reported in the following
paragraphs. Fig. 4.12 compares the analytical results with the SPICE simulation results for
interconnections of different lengths, measured in tiles. Interconnects, in this case, are
constructed solely from single and double lines. Stages of multiplexers connect
these short wire segments. Results for interconnects constructed from single,
double, hex (6 tiles) and long (24 tiles) lines are shown in Fig. 4.13.
4 http://www.eas.asu.edu/~ptm/
[Figure 4.12 plots: panels (a) and (b), throughput (bps, scale x 10^9, 0 to 5.5) versus tile
(0.12 mm/tile, 0 to 40); curves: SPICE and Analytical.]
Figure 4.12: Analytical model versus SPICE simulation for interconnections comprising Single
and Double lines only. (a) Interconnect wave-pipelining throughput. (b) Delay-based throughput.
The same trend can be found in the analytical and SPICE simulation results. The accuracy
of the analytical models is evaluated using the relative errors, which are the discrepancies
between the analytical prediction and the SPICE simulation. The average relative errors in
predicting the throughput are 38.8% and 51.9% for the wave-pipelined and delay-based approaches,
respectively. The analytical results are generally in line with the experimental results. The errors
can be attributed to inaccuracy in estimating the on-resistance of the drivers. It is found that
these resistances vary according to the input waveform. The error can be significant, even though
the resistances are carefully calibrated in the experiments. Similarly, the same trend between
the analytical and SPICE results is found for the more complex interconnect configuration
in Fig. 4.13. The average relative errors are 30.7% and 41.7% for the wave-pipelined and
delay-based approaches, respectively. There are two spikes in Fig. 4.13, at tiles 8
and 24. These are caused by the specific short-cut interconnects within the FPGAs.
[Figure 4.13 plots: panels (a) and (b), throughput (bps, scale x 10^9, 0 to 5.5) versus tile
(0.12 mm/tile, 0 to 40); curves: SPICE and Analytical.]
Figure 4.13: Analytical model versus SPICE simulation for interconnections comprising Single,
Double, Hex and Long lines. (a) Interconnect wave-pipelining throughput. (b) Delay-based
throughput.
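The accuracy figure quoted above is an average of per-point relative errors. A minimal sketch of that calculation follows; normalising by the SPICE value and taking absolute values are assumptions (the text does not state the exact definition), and the sample throughputs are made up for illustration.

```python
def average_relative_error(predicted, reference):
    """Mean of |predicted - reference| / |reference| over all data points.

    predicted -- analytical model values
    reference -- simulation values used as ground truth (assumed non-zero)
    """
    if len(predicted) != len(reference) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - r) / abs(r) for p, r in zip(predicted, reference))
    return total / len(reference)


if __name__ == "__main__":
    # Hypothetical throughputs in Gbit/s, not taken from Fig. 4.12 or Fig. 4.13.
    analytical = [4.8, 3.9, 3.1, 2.6]
    spice = [4.0, 3.0, 2.5, 2.0]
    print(f"{100 * average_relative_error(analytical, spice):.1f}%")
```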
The analysis provides a reliable prediction of the throughput of both wave-pipelined
and delay-based signalling. The prediction accurately captures the throughput trends across
different interconnection lengths. The errors in the prediction can be attributed to the inaccurate
evaluation of the resistances and capacitances of the interconnect drivers. Most importantly, the
results show a significant performance gap between the two signalling approaches, which is studied
in the following section.
4.6.2 Comparing Wave-Pipelining with Delay-Based Signalling
Fig. 4.14 compares the throughput of the delay-based and wave-pipelined approaches for
different interconnection lengths. In general, throughput decreases with increasing interconnection
length. The throughput of the delay-based approach decreases by 85% as the length increases
from 32 to 150 tiles. In contrast, the wave-pipelined throughput decreases by only 35%. Also, the
wave-pipelined approach consistently outperforms the delay-based approach. With a 32-tile
interconnect, the wave-pipelined approach outperforms the delay-based approach by 21%. For a longer
link, such as 150 tiles, the throughput difference is 566%.
[Figure 4.14 plot: throughput (Gbit/s, 0.0 to 2.0) versus length (tiles, 20 to 160); curves:
Delay-based and Wave-pipelining; annotated gaps of 21% and 566%.]
Figure 4.14: Throughput comparison between delay-based and wave-pipelined signalling for
different interconnection lengths.
[Figure 4.15 plot: throughput (Gbit/s, 0.5 to 2.5) versus technology node (nm: 130, 90, 65, 45, 32);
curves: Delay-based and Wave-pipelining; annotated gaps of 303% and 216%.]
Figure 4.15: Throughput comparison between delay-based and wave-pipelined signalling for
different technology nodes.
[Figure 4.16 plot: power (mW) and improvement (%) versus technology node (130, 90, 65, 45, 32);
curves: Synchronous (Reg=3) and Wave-pipelining.]
Figure 4.16: Power improvement of wave-pipelined signalling over register pipelining.
Fig. 4.15 compares the throughput of the two approaches for different technologies. As the
technology continues to scale, both signalling approaches gain in throughput.
This is because the number of tiles is fixed and the length of each tile shrinks with
technology scaling. Although the performance gap between the two approaches
narrows (from 303% to 216%) over the next few technology generations, wave-pipelining
still outperforms the delay-based approach by a large margin.
It is well known that interconnection throughput can be increased by inserting registers into a
long line. However, this comes at the expense of the power consumption and latency of the link. A
comparison between the register-based and wave-pipelined approaches is presented in the
following. Specifically, for a typical long interconnection with a length of 72 tiles, three
registers are required to provide the same throughput as the wave-pipelined link.
Fig. 4.16 shows the improvement obtained by using the wave-pipelined link instead of the
register-based link for different technologies. Note that the interconnect length is kept the same
for all technology nodes. The interconnect metal wire dimensions, such as width, spacing and
thickness, vary with technology according to the predictive technology model. Consequently,
the metal wire resistance and capacitance also vary across technology nodes [bib09].
As the technology scales, the power consumption decreases for both approaches. The
power improvement from using the wave-pipelined approach increases from 7% to 13%. The results of
the latency comparison are shown in Fig. 4.17. The latency of both approaches decreases as the
technology scales down. Also, the delay difference between the wave-pipelined and register-pipelined
approaches increases from 20% to 35%. In summary, both approaches benefit from technology
scaling, and wave-pipelining appears to be the better signalling scheme in terms of
power dissipation and delay compared with its register-pipelined counterpart.
[Figure 4.17 plot: delay (ns) and improvement (%) versus technology node (130, 90, 65, 45, 32);
curves: Synchronous (Reg=3) and Wave-pipelining.]
Figure 4.17: Delay improvement of wave-pipelined signalling over register pipelining.
4.6.3 Evaluation of Wave-Pipelining in a Real FPGA
Consider a typical communication link, which comprises a transmitter, a receiver and two RAMs,
as shown in Fig. 4.18. The register bank applies only to the register-pipelining schemes. This
section presents the performance of such a communication link under different
design strategies. A Xilinx Virtex-4 XC4V-LX200 FPGA is used. The transmitter
and receiver are placed at the two corners in order to obtain a long signal path. The circuits are
designed in VHDL and synthesized using the Xilinx ISE tools. The designs have also been verified
by real chip testing on a Xilinx ML400 board with a Virtex-4 FPGA. The wave-pipelining test
circuit is presented in Appendix C.
[Figure 4.18 diagram: RAM, Tx, register bank, Rx and RAM connected in a chain; the link between Tx
and Rx is labelled with its bit-width.]
Figure 4.18: A typical communication link for data transfer between two RAMs in FPGAs
The implementation results for a link with five different designs are presented in Table 4.4.
Compared with a simple synchronous link design, it is not surprising that the asynchronous links
are relatively slow, at 2.25 Gbit/s (4-phase) and 3.69 Gbit/s (2-phase) versus 8 Gbit/s for the
synchronous link. Although the handshaking protocol is relatively simple, the signal path requires
multiple register stages and interconnects, which are constrained by the synchronous architecture
of FPGAs. However, the asynchronous approaches are the most energy efficient when compared with the
synchronous or wave-pipelining approaches. This is because the asynchronous approach does not
require clocking signals and thus reduces the energy dissipated in the clock net. In contrast, the
phase adaptation and oversampling approaches with wave-pipelined signalling achieve a larger
throughput than the synchronous approach (48% and 100% larger, respectively). Moreover, the
throughput obtained from the actual FPGA implementation is far less than the theoretical prediction
(see Table 4.4). This is due to the clocking frequency limitation of the transmitter and receiver
logic, despite the fact that the interconnects can achieve a significantly higher throughput.
Further, there are area and power overheads for the wave-pipelining schemes when compared with the
synchronous approach. These trade-offs are presented in the following.
Fig. 4.19 compares the throughput achieved by four different implementation strategies of
a communication link with different bit-widths. In general, throughput increases monotonically
for all approaches. The register-pipelined link with three registers inserted provides the highest
throughput, closely followed by the wave-pipelined link with the oversampling receiver, which
doubles the throughput of a simple synchronous link. The wave-pipelined link with phase adaptation
also outperforms the simple synchronous link. Note that, for the synchronous link, the throughput
against bit-width is not linear but curved as the bit-width increases. This is because the long
wires in each channel are limited. As the bit-width of a communication link increases,
interconnections must traverse further to reach the long wires that are still available. This
increases the average interconnection length [MSCL08a]. The wave-pipelining approaches are affected
not by the average interconnection length but by its variance. The interconnect length variance
also increases moderately with the bit-width (see Table 4.5), and it can be observed that the curve
for phase adaptation is also not linear. The oversampling approach is less affected by either the
interconnection length or its variance.
Table 4.4: Implementation of Communication Links (64-bit) in FPGAs.
                    Async.      Async.      Sync.    Phase        Oversampling
                    (4-phase)   (2-phase)            adaptation
Length (tile)       100         100         100      100          100
Freq. (MHz)         35.1        57.6        125      185          250
Area (Slice)        67          204         80       99           221
Power (mW)          4.25        10.53       28.3     40.6         62
Γ (Gb/s)            2.25        3.69        8        11.8         16
Power/Γ (pJ/bit)    1.8         2.9         3.5      3.4          3.8
Table 4.5: Standard deviation of interconnection lengths for different bit-widths in a link
Bit-width    Standard deviation
1            0
4            0.05
8            0.5
16           0.55
32           0.58
48           0.77
64           1.73
[Figure 4.19 plot: throughput (Gbit/s, 0 to 18) versus bit-width (0 to 70); curves: Synchronous,
Phase adapt, Oversampling, Synchronous (Reg=3).]
Figure 4.19: Throughput for the communication links.
The energy efficiencies of the four approaches are compared using the bit-energy metric, which
quantifies the energy required to transmit one bit of information. Bit energy can be
computed from the power and throughput as follows,
\[ \text{Bit-Energy} = \frac{\text{Energy}/\text{Time}}{\text{Bit}/\text{Time}} = \frac{\text{Power}}{\text{Throughput}} \qquad (4.28) \]
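Eq. 4.28 can be checked against the figures in Table 4.4 with a short sketch. The frequency, power and Power/Γ values are copied from that table; computing throughput as frequency times bit-width (one bit per line per clock cycle) is an assumption that happens to be consistent with the tabulated Γ row, and the function names are illustrative.

```python
BIT_WIDTH = 64

# name: (frequency in MHz, power in mW, tabulated Power/throughput in pJ/bit)
LINKS = {
    "Async. (4-phase)": (35.1, 4.25, 1.8),
    "Async. (2-phase)": (57.6, 10.53, 2.9),
    "Sync.": (125.0, 28.3, 3.5),
    "Phase adaptation": (185.0, 40.6, 3.4),
    "Oversampling": (250.0, 62.0, 3.8),
}


def throughput_gbps(freq_mhz, bit_width):
    """Throughput in Gbit/s, assuming one bit per line per clock cycle."""
    return freq_mhz * 1e6 * bit_width / 1e9


def bit_energy_pj(power_mw, gbps):
    """Eq. 4.28: 1 mW / (1 Gbit/s) = 1e-3 W / 1e9 bit/s = 1 pJ/bit."""
    return power_mw / gbps


if __name__ == "__main__":
    for name, (freq, power, tabulated) in LINKS.items():
        gbps = throughput_gbps(freq, BIT_WIDTH)
        energy = bit_energy_pj(power, gbps)
        print(f"{name:18s} {gbps:6.2f} Gbit/s  {energy:5.2f} pJ/bit  (table: {tabulated})")
```

The computed bit energies agree with the Power/Γ row of Table 4.4 to within about 0.1 pJ/bit.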
Comparisons of the energy efficiencies of the four approaches are shown in Fig. 4.20. A smaller
bit-energy implies better energy efficiency. The register-pipelined link has a relatively
constant bit-energy across bit-widths. For the synchronous link, the energy efficiency decreases
slightly with increasing bit-width, because the average interconnection length
increases with the bit-width and extra energy is required for signalling over longer
interconnections. The bit-energy of the two wave-pipelining approaches decreases with increasing
bit-width, because the power overhead of the source synchronous signalling is effectively
compensated by larger bit-widths. The oversampling and phase adaptation approaches outperform the
register-pipelined approach at bit-widths of 32 and 16, respectively. The wave-pipelining approach
is therefore energy efficient when the link has a large bit-width: when the high throughput of
analogue wave signalling outweighs the power overhead, better energy efficiency for on-FPGA
communication is obtained.
[Figure 4.20 plot: energy (pJ/bit, axis ticks 2, 4, 8, 16, 32, 64) versus bit-width (0 to 70);
curves: Synchronous, Phase adapt, Oversampling, Synchronous (Reg=3).]
Figure 4.20: Energy consumption (Energy per bit transfer) for the communication links.
[Figure 4.21 plot: logic area (tile, 0 to 240) versus bit-width (0 to 70); curves: Synchronous,
Phase adapt, Oversampling, Synchronous (Reg=3).]
Figure 4.21: Area for the communication links.
The areas of the four approaches are compared in Fig. 4.21. The synchronous
approach consumes the least area of the four. The phase adaptation approach consumes
more area than the register-pipelined approach when the bit-width is smaller than 8 bits, but
less area for larger bit-widths. For a link with a bit-width larger than 32 bits, the oversampling
approach consumes on average 16.6% more area than the register-pipelining approach, while the
register-pipelining link consumes 39.4% more area than the phase adaptation link.
The bit error rate (BER) can be estimated from the operating frequency of the communication
link using Eq. 4.27. The standard deviations of the noise sources, including both static and
dynamic noise, are the inputs to the expression. In [TLG09], the range of noise sources
obtained from HSPICE simulations is reported. These values can be used in Eq. 4.27 to obtain
an approximation of the BER and provide a reliability guideline. Furthermore, one important source
of noise is the static skew, which is caused by the varying interconnect lengths in the
communication link. The standard deviation of the static skew can be obtained by direct measurement
of the interconnection lengths of the circuits, such as the results in Table 4.5. The other noise
sources are shown in Table 4.6. The Vdd of the circuit and the noise margin are assumed to be 1.2 V
and 0.6 V, respectively.
The BER results are shown in Fig. 4.22. As the link bit-width increases, the reliability of the link
decreases. The 64-bit link can achieve a maximum operating frequency of 300 MHz
and can thus provide a communication throughput of 19.2 Gbps. Links with a smaller bit-width can
operate at a higher frequency. For example, an 8-bit link can achieve 400 MHz and can thus provide
a data rate of 3.2 Gbps.
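The estimation procedure can be sketched by feeding the parameters of Table 4.6 into Eqs. 4.22, 4.23 and 4.27. Several choices below are assumptions rather than statements from the text: T is taken as the clock period at the operating frequency, the unshielded crosstalk figure is used, and the jitter, crosstalk and static skew are all treated as independent phase noise sources. The sketch is not expected to reproduce Fig. 4.22 exactly.

```python
import math

VDD = 1.2                # supply voltage (V), Table 4.6
VM = 0.6                 # noise margin (V), Table 4.6
DC_NOISE_V = 0.015       # amplitude noise sigma (V), Table 4.6
JITTER_S = 50e-12        # jitter and dynamic skew sigma (s), Table 4.6
CROSSTALK_S = 12e-12     # crosstalk without shielding (s), Table 4.6
STATIC_SKEW_S = {8: 16.9e-12, 32: 19.6e-12, 48: 26.0e-12, 64: 58.0e-12}


def phase_to_amplitude(sigma_phi, period):
    """Eq. 4.22: phase noise mapped to amplitude noise for a triangular wave."""
    return VDD * abs(2.0 * sigma_phi / period)


def combined_noise(bit_width, freq_hz):
    """Eq. 4.23: RMS voltage of all noise sources combined."""
    period = 1.0 / freq_hz  # assumption: T is the clock period
    phase_sources = [JITTER_S, CROSSTALK_S, STATIC_SKEW_S[bit_width]]
    total = DC_NOISE_V ** 2
    total += sum(phase_to_amplitude(s, period) ** 2 for s in phase_sources)
    return math.sqrt(total)


def log10_ber_bound(bit_width, freq_hz):
    """log10 of the Eq. 4.27 bound, evaluated in the log domain to avoid underflow."""
    vr = combined_noise(bit_width, freq_hz)
    ln_bound = math.log(vr / (VM * math.sqrt(2.0 * math.pi))) - VM ** 2 / (2.0 * vr ** 2)
    return ln_bound / math.log(10.0)


if __name__ == "__main__":
    for width in sorted(STATIC_SKEW_S):
        print(width, round(log10_ber_bound(width, 300e6), 1))
```

Under these assumptions the bound worsens both with bit-width (through the static skew) and with operating frequency (through the shorter period), which matches the qualitative trend described above.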
Noise source                     σ
Crosstalk (with shielding)       1.74 ps
Crosstalk (without shielding)    12 ps
DC noise                         15 mV
Jitter & dynamic skew            50 ps
Skew (static)                    16.9 ps (8-bit)
                                 19.6 ps (32-bit)
                                 26.0 ps (48-bit)
                                 58.0 ps (64-bit)
Vdd                              1.2 V
VM                               0.6 V
Table 4.6: Parameters used in the Bit-Error-Rate analysis
[Figure 4.22 plot: log10(BER) (-35 to -5) versus throughput (bps, scale x 10^8, 2 to 10); curves:
8-bit, 32-bit, 48-bit, 64-bit.]
Figure 4.22: Bit-Error-Rate (BER) estimates for wave-pipelined link with different throughput
and bit-width.
4.7 Conclusion and Future Work
The ever-increasing interconnection delay presents a difficulty for high-bandwidth communication
in FPGAs. Wave-pipelined signalling allows multiple bits to traverse the line simultaneously
and can thus substantially increase the interconnection throughput. A novel FPGA interconnect
model for throughput analysis has been presented. The model captures the important electrical
characteristics of the interconnects and is able to accurately predict the wave-pipelining
throughput of the link. In addition, the new signalling scheme requires novel designs of
transmitter and receiver circuits in order to sample the high-speed analogue signals correctly. Two
new on-chip communication circuits are presented, which can significantly improve the communication
throughput and energy efficiency. In particular, for communication links with large bit-widths, the
oversampling and phase adaptation approaches outperform register pipelining in terms of energy
efficiency. It is also interesting to observe that the throughput of a wave-pipelined link depends
on the skew of the parallel lines rather than on their delays.
The new FPGA-based signalling scheme introduces a throughput-centric paradigm and opens interesting
problems for further investigation into new FPGA interconnect architectures, such as the
design and optimization of throughput-centric interconnects (see Appendix D) and the modification
of FPGA interconnect fabrics for wave-pipelining as in [TLG09]. In addition, the development of
synthesis tools or routing algorithms that incorporate wave-pipelined links is important and will
be investigated in the future. Ideally, the wave-pipelined link can be encapsulated as a functional
module and abstracted for use in a high-level design environment, so that the timing and reliability
issues of on-chip communication are not a concern for hardware designers.
Chapter 5
A DP-Network for On-Chip Dynamic
Routing
5.1 Introduction
Because of the rapid advancement in process technology, the complexity of FPGAs and their
programmable logic resources are increasing drastically. In particular, coarse-grained prefabricated
modules, such as memory blocks and DSP units, are embedded into FPGA chips and provide
significant improvements in speed and area as well as hardware configuration time. The
heterogeneous FPGA architecture is an effective system solution for complex applications and for
product development under time-critical constraints. However, as the number of embedded components
increases, the communication bandwidth between them becomes a fundamental
factor in overall system performance. As a result, advanced communication architectures are
required for better performance and scalability [MSCL06b].
Recently, Network-on-Chip (NoC) has been proposed as a promising solution to the increasingly
complicated on-chip communication challenges [BB05, KJS+02]. Such architectures consist of a
network of regular tiles, where each tile can be an implementation of a general-purpose processor,
DSP block, memory block, embedded reconfiguration module, etc. Communication among
these tile-based modules follows a packet-switched or circuit-switched scheme in which messages are
transmitted among the processing elements. The NoC architecture would be an ideal solution for
providing effective on-FPGA communication in heterogeneous architectures [GBHW08].
In such a NoC environment, the routing of flits (or packets) becomes a critical issue that
determines the inter-processor communication performance. Routing provides a protocol for moving
data through the NoC infrastructure and also determines the path of data transport. The selection
of the communication path greatly affects the latency of packets transmitted from the source
to the destination and can therefore have a significant impact on the overall traffic flow in the
network. An intelligent routing mechanism is required to utilize the communication bandwidth and
minimize the transport latency.
Dynamic routing (or adaptive routing) has been widely used in computer and data network design.
Using on-line communication patterns and real-time information, dynamic routing can
effectively avoid traffic hot spots or faulty components, and can reduce the possibility of
deadlock. Several partially adaptive routing algorithms have been proposed in the context of NoC,
and evaluations of their performance have been reported. For example, implementations of wormhole
adaptive Odd-Even routing were described in [Chi92, SZBR07, Hu05]. In [ACPP08], a minimal
routing mechanism with partially adaptive protocols was proposed. However, implementing
adaptive routing in a network-on-chip (NoC) system is not trivial and is further complicated by
the requirements of deadlock freedom and real-time optimal decision making. Also, the previously
proposed adaptive approaches exploit only local traffic information, which leads to only a moderate
improvement in packet latency and traffic load balancing. Optimal path planning and routing
adaptation are rarely studied.
In this Chapter, a novel adaptive optimal routing architecture for NoC is presented. The architec-
ture employs a dynamic programming (DP) network to provide optimal path planning based on
the real-time network status. The DP-network architecture provides a simple and efficient real-
ization of optimal shortest path computation for NoC dynamics. Thus, optimal path planning and
dynamic routing can be readily realized. Also, a new routing strategy called k-step Look Ahead
(KSLA) that is based on the DP-network is introduced. This new strategy can substantially reduce
the routing table storage while maintaining a high quality of adaptation, which leads to a scalable
adaptive routing solution with minimal hardware overhead. The contributions of this chapter are
as follows:
1. A Dynamic Programming (DP) network for shortest path computation is proposed. The
characteristics of the DP network, such as discrete and continuous-time formulations, con-
vergence and numerical examples are discussed. (Section 5.3)
2. A DP-network architecture to realize real-time dynamic routing in a Network-on-Chip is
introduced. Routing mechanics and a routing table updating strategy with the aid of the
DP-network are presented. The network scalability and deadlock issues are also studied.
(Section 5.4)
3. A k-Step Look Ahead (KSLA) routing strategy, which significantly reduces the routing
table, is proposed. This approach provides a trade-off between the routing optimality and
memory consumption. (Section 5.4.2)
4. The performance and merits of the DP-network are investigated through experimental studies
and comparisons with other popular routing schemes, such as XY and Odd-Even, under different
traffic benchmarks. The hardware overhead of the DP-network is also studied using a
Xilinx FPGA device. (Section 5.5)
5.2 Background
5.2.1 Related Work
Routing strategies can be categorized into deterministic and adaptive. In a deterministic rout-
ing strategy, source and destination determine the traversal path. Popular deterministic routing
schemes for NoC are source routing and XY routing, which are also referred to as 2D dimension
order routing [DT04]. In source routing, the source core specifies the route to the destination. In
XY routing, the packet follows the rows first, then moves along the columns toward the destina-
tion or vice versa. XY routing can be implemented using simple algorithmic routing logic, but is
limited to regular network topologies.
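The algorithmic logic behind XY routing can be sketched in a few lines. The following is a minimal illustration, not the router implementation used in this thesis: the function names, the direction labels and the coordinate convention (x grows eastwards, y grows northwards) are all assumptions made for the example.

```python
def xy_next_hop(cur, dst):
    """Next-hop direction under XY (dimension-order) routing.

    cur and dst are (x, y) tile coordinates; x grows eastwards and y grows
    northwards (a convention assumed for this sketch only).
    """
    (cx, cy), (dx, dy) = cur, dst
    if cx != dx:                       # first remove the X offset
        return "E" if dx > cx else "W"
    if cy != dy:                       # then remove the Y offset
        return "N" if dy > cy else "S"
    return "LOCAL"                     # the packet has arrived


def xy_route(src, dst):
    """Sequence of directions a packet takes from src to dst."""
    step = {"E": (1, 0), "W": (-1, 0), "N": (0, 1), "S": (0, -1)}
    pos, hops = src, []
    while True:
        direction = xy_next_hop(pos, dst)
        if direction == "LOCAL":
            return hops
        hops.append(direction)
        pos = (pos[0] + step[direction][0], pos[1] + step[direction][1])
```

Because the decision depends only on the current and destination coordinates, no routing table is needed, which is also why the scheme does not carry over to irregular topologies.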
In an adaptive routing strategy, the traversal path is decided on a per-hop basis. Adaptive schemes
involve dynamic arbitration and next-hop selection mechanisms, e.g. based on local link congestion.
There are several adaptive routing algorithms that have been proposed within the context
of NoC [WAHS06]. For example, a methodology that focuses on deadlock-free adaptive routing
has been proposed in [PHK06] which provides a framework to design routing tables that can out-
perform the turn-model based deadlock-free routing algorithm. Other schemes, such as adaptive
Odd-Even [SZBR07, Hu05] and adaptive selection Node-on-Path (NoP) [ACPP08] also provide
routing adaptability but only exploit local traffic conditions or the status of neighbours. There
is a great potential to improve the communication efficiency by considering the global traffic at
run-time using adaptive routing with on-demand shortest path computation.
Minimal cost (or shortest path) computation is fundamental among different dynamic routing
strategies. The basic idea is that the routing algorithm always chooses the least congested path
towards the destination through optimal path planning. The least congested route can be found
based on the shortest path computation where the path cost is obtained at run-time. Since the
traffic intensity and conditions are changing at run-time, the dynamic routing algorithm should
be able to discover the congestions and perform shortest path computation at the same time. A
novel DP-network architecture that provides real-time shortest path computation and optimal path
planning is proposed in this chapter. The background of shortest path computation and the parallel
computation architecture are described below.
5.2.2 Shortest Path Computation
Dynamic programming (DP) is a powerful mathematical technique for making a sequence of in-
terrelated decisions. Bellman formalized the term DP and used it to describe the process of solving
problems where one needs to find the best decision one after another [Bel58]. It provides a system-
atic procedure for determining the optimal combination of decisions which takes much less time
than naive methods. In contrast to other optimization techniques, such as linear programming,
dynamic programming does not provide a standard mathematical formulation of the algorithm.
Rather, dynamic programming is a general type of approach to problem solving. It restates an
optimization problem in recursive form, which is known as a Bellman equation. The Bellman
equation for optimal value function V (·) is unique and can be defined as the solution to the recur-
sive equation [Bel58, BT89].
Shortest Path Problem Formulations
The shortest path problem can be described as follows. Given a directed graph $G = (V, A)$ with
$n = |V|$ nodes and $m = |A|$ edges, a cost $C_{u,v}$ is associated with each edge $u \to v \in A$.
The edge cost can be defined according to the application. In this chapter the cost is
defined as the number of flits or packets in each buffer. The total cost of a path
$p = \langle v_0, v_1, \ldots, v_k \rangle$ is the sum of the costs of its constituent edges:
\[ \mathrm{Cost}(p) = \sum_{i=1}^{k} C_{i-1,i} \]
The shortest path of $G$ from one node to another is then defined as any path $p$ between the two
nodes whose cost $\mathrm{Cost}(p)$ is the minimum over all such paths. The notation used in this
chapter is listed in Table 5.1.

Table 5.1: Notation used in this chapter
Symbol            Description
$V$               A set of nodes in network $G$
$A$               A set of edges in network $G$
$C_{u,v}$         Cost of the edge $u \to v$
$n$               Number of nodes in the network
$v_i$             The $i$-th node in the network
$d_t(s)$          Shortest path cost from $s$ to $t$
$V_t(s)$          The DP-value for node $v_s$
$V_t^{*}(s)$      The optimal cost-to-go for node $v_s$
$W_t^{k}(s)$      The expected cost-to-go for node $v_s$ with $k$-step look ahead
$P^{k}_{i,j}$     The set of paths from $v_i$ to $v_j$, all of which have $k$ edges
$\mu_t(s)$        The decision variable (routing direction) at node $v_s$
$\mathcal{N}(i)$  A set of neighbour nodes that can be directly accessed from node $i$
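The shortest path problem can be formally stated as a linear optimization problem. Suppose
node $t$ is the destination node and the aim is to compute the shortest path cost $d_t(v)$, $\forall v \in V$. To
express this as a linear program, the constraint $d_t(v) \le d_t(u) + C_{u,v}$ denotes that the
cost of the shortest path from any node $v$ to destination $t$ is less than or equal to the cost of the
shortest path from node $u$ plus the cost of the direct edge between $u$ and $v$. The destination vertex $t$
receives the value $d_t(t) = 0$. Every feasible $d_t(v)$ is then a lower bound on the true shortest path
cost, so the objective pushes each $d_t(v)$ up to the largest value that the constraints allow. Thus, the
following linear programming formulation can be obtained:
\[
\begin{aligned}
\text{maximize} \quad & \sum_{\forall v \in V} d_t(v) \\
\text{subject to} \quad & d_t(v) \le d_t(u) + C_{u,v}, \quad \forall u, v \in V \\
& d_t(t) = 0
\end{aligned}
\]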
The above formulation yields the shortest path from every node in $V$ to the destination $v_t$, which is
known as the multiple-source single-destination shortest path problem. A linear programming (LP)
problem can be solved readily using any standard LP solver. Alternatively, the
shortest path problem can be stated in the form of Bellman equations, which define a recursive
procedure in step $k$ and lead to a simple parallel architecture that speeds up the computation. To
find the cost of the shortest path from $v_s$ to $v_t$, the notion of a DP value is required, which is the
expected cost from $v_s$ to $v_t$. This expected cost is updated recursively based on the previous
estimates until it satisfies the optimality criterion. This algorithm is known as dynamic programming.
We denote the DP value for $v_s$ to $v_t$ at the $k$-th iteration as $V_t^{(k)}(s)$, and $V_t^{*}(s)$ is the optimal DP
value, which is equal to $d_t(s)$:
\[ V_t^{(k)}(v) = \min_{\forall u \in V} \left\{ V_t^{(k-1)}(u) + C_{u,v} \right\} \tag{5.1} \]
where $V_t(t) = 0$. If the recursion is expanded from $v_{s_0}$ to $v_{s_k}$, the DP-value can be expressed
as the cost of the path from $v_{s_0}$ to $v_{s_k}$,
\[ V_k^{*}(s_0) = \min_{\{v_{s_0}, v_{s_1}, \ldots, v_{s_k}\} \in P^{k}_{s_0,s_k}} \left\{ \sum_{i=1}^{k} C_{i-1,i} \right\} \tag{5.2} \]
where $v_{s_k} = v_t$ and $P^{k}_{i,j}$ is the set of paths from $v_i$ to $v_j$, all of which have $k$ edges. In addition,
the optimal decision at each node $v_i$ that leads to the shortest path can be readily obtained from
the argument of the minimum operator in the Bellman equation as follows:
\[ \mu_t(v) = \arg\min_{\forall u \in V} \left\{ V_t^{*}(u) + C_{u,v} \right\} \tag{5.3} \]
Both the linear programming (LP) and dynamic programming (DP) approaches can yield the opti-
mal solution for shortest path problems. But the DP approach presents an opportunity for solving
the problem using a parallel architecture and can greatly improve the computational speed.
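The recursion of Eq. 5.1 and the decision rule of Eq. 5.3 can be illustrated with a short software sketch. It is only an illustration under stated assumptions: the function name, the dictionary layout `cost[v][u]` (the cost of the hop between node `v` and its neighbour `u`) and the 4-node example graph are hypothetical and do not come from the thesis.

```python
import math


def bellman_iteration(cost, dest):
    """Synchronous iteration of Eq. 5.1 for a single destination.

    cost[v][u] is the cost of the hop between node v and its neighbour u
    (the role played by C_{u,v}); this layout is an assumption of the sketch.
    Returns (V, mu): the cost-to-go and the next-hop decision of every node.
    """
    V = {v: math.inf for v in cost}
    V[dest] = 0.0
    for _ in range(len(cost)):  # |V| sweeps are enough for positive costs
        new_V = {
            v: 0.0 if v == dest
            else min((V[u] + c for u, c in cost[v].items()), default=math.inf)
            for v in cost
        }
        if new_V == V:  # no value changed, so the recursion has converged
            break
        V = new_V
    # Eq. 5.3: the decision is the argument of the minimum at each node.
    mu = {
        v: min(cost[v], key=lambda u: V[u] + cost[v][u])
        for v in cost
        if v != dest and cost[v]
    }
    return V, mu


# Hypothetical 4-node example with destination "t".
example_cost = {
    "a": {"b": 1, "c": 4},
    "b": {"a": 1, "c": 1, "t": 5},
    "c": {"a": 4, "b": 1, "t": 1},
    "t": {},
}
V, mu = bellman_iteration(example_cost, "t")
```

In this example node `a` reaches `t` through `b` and `c` at a total cost of 3, although the direct hops `a`-`c` and `b`-`t` exist. Each sweep only reads the values of the previous sweep, so the updates of all nodes within a sweep are independent of one another, which is what makes a parallel realization possible.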
5.3 Shortest Path Computation Using DP-Network
Mapping Bellman recursive dynamic programming to a parallel computation platform can be re-
alized with the introduction of a DP-network architecture. The network has a parallel architecture,
and can be used to derive dynamic programming solution through the simultaneous propagation of
successive inferences. Originally, it provides an efficient platform for checking data inconsistency
due to results from different inference paths [LT96]. In [LT96], with close resemblance to the de-
terministic type dynamic programming formulation on closed semiring, Lam and Tong introduced
DP-network to solve a set of graph optimization problems with an asynchronous and continuous-
time computational framework. This new class of inference network is inherently stable in all
cases and has been shown to be robust, with an arbitrarily fast convergence rate [LT96]. A similar
parallel computational network for dynamic programming has also been proposed in [BT89].
That network was proven to converge to the optimal solution even when operated asynchronously.
A DP-network is formed by the interconnection of self-contained computational units. Fig. 5.1
shows the structure of a unit and the connections in a general inference network. Each unit
represents a binary relation $(i, j)$ between two objects $i$ and $j$ and is denoted by $U(i, j)$. At
each unit, there are $N$ sites that carry out the inference operations defined by the site function.
The value of the corresponding relationship between $i$ and $j$ is then determined by resolving the
conflict among all of the site outputs. In essence, if $S_k(i, j)$ represents the site output at the $k$-th
site and $g(i, j)$ stands for the unit output of unit $(i, j)$, then
\[ S_k(i, j) = g(i, k) \otimes g(k, j) \tag{5.4} \]
\[ g(i, j) = \bigoplus_{\forall k} S_k(i, j) \tag{5.5} \]
where $\otimes$ is the inference operator for the site function (which is usually the same at all of the sites)
and $\oplus$ is the conflict-resolution operator for the unit function. The computational unit $U(i, j)$ is
the unit which resolves the binary relation $(i, j)$.
Figure 5.1: Unit interconnection in a general DP-network where $1 \le i, j, k \le n$; $k \ne i, j$.
The shortest path problem can be mapped to the DP-network. In the original problem graph,
each node refers to a processor unit. In the DP-network, however, each computational unit $U(i, j)$
represents a binary relation, i.e. the expected distance between nodes $i$ and $j$. When the network has
converged, the solution of the problem is found at the output of each computational unit.
In general, if there are $m$ nodes in the original graph, then the DP-network (based on the Bellman
equation) for a destination $j$ will have $m - 1$ functional units $U(i)$, where $i = 1, 2, \ldots, j-1, j+1, \ldots, m$.
Provided the interconnection network has a fixed topology, the multiple-source multiple-destination
(MSMD) solutions can be obtained by applying the DP-network $m$ times, computing the shortest
paths for $m$ different destinations.
Let $g(i, k) = C_{i,k}$ and $g(k, j) = V_t(k)$. The architecture of the DP-network can then be defined as
follows. A DP-network for the shortest path problem is obtained from the general network structure
by substituting “+” for $\otimes$ and “min” for $\oplus$:
\[ S_k(i, j) = g(i, k) + g(k, j) \tag{5.6} \]
\[ g(i, j) = \min_{\forall k} S_k(i, j) \tag{5.7} \]
The computational units are interconnected so that the network resembles the structure of the shortest
path problem: each unit represents a node and each interconnection represents an edge. With this
realization, a network that converges to the optimal solution can be readily implemented in a
distributed manner. This network architecture also has the advantages of simplicity and parallelism,
which present a great opportunity for on-chip routing and optimization.
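The substitution of “+” and “min” into the general site and unit functions (Eqs. 5.4 to 5.7) can be sketched as follows. This is an illustrative software model only, not the hardware architecture of the thesis. The function names and the 4-node example are hypothetical, and the sketch additionally assumes that each unit keeps its current output among the candidates resolved by the unit function, so that direct edges are retained.

```python
import math


def dp_network_step(g, otimes, oplus):
    """One synchronous update of every unit U(i, j).

    Site function (Eq. 5.4): S_k(i, j) = g(i, k) (x) g(k, j).
    Unit function (Eq. 5.5): g(i, j) = (+) over all site outputs.
    Assumption of this sketch: the current g(i, j) is kept as a candidate.
    """
    n = len(g)
    new_g = [row[:] for row in g]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            sites = [otimes(g[i][k], g[k][j]) for k in range(n) if k not in (i, j)]
            new_g[i][j] = oplus([g[i][j]] + sites)
    return new_g


def run_dp_network(g, otimes, oplus, max_rounds=100):
    """Repeat the update until no unit output changes."""
    for _ in range(max_rounds):
        new_g = dp_network_step(g, otimes, oplus)
        if new_g == g:
            break
        g = new_g
    return g


INF = math.inf
# Hypothetical symmetric 4-node graph: a chain 0-1-2-3 plus a costly link 0-3.
edge_cost = [
    [0, 1, INF, 5],
    [1, 0, 1, INF],
    [INF, 1, 0, 1],
    [5, INF, 1, 0],
]
# Shortest path instance (Eqs. 5.6 and 5.7): (x) is "+", (+) is "min".
dist = run_dp_network(edge_cost, otimes=lambda a, b: a + b, oplus=min)
```

Passing the two operators as arguments mirrors the way the general inference network is specialised: only `otimes` and `oplus` change between problem classes, while the interconnection pattern stays the same.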
5.3.1 Discrete and Continuous-Time Formulations
The recursive formulation of the Bellman equation only specifies the mechanism to update value
Vt(s), as can be found from the classical Value Iteration algorithm [CLR01]. Therefore, the pri-
ority and order of the updating process are not relevant and the value Vt(s) can be computed
asynchronously. This provides an opportunity to design a distributed computation system that realizes
the DP-network with distributed computational units and without synchronous control. Furthermore,
the asynchronous property can be further exploited to consider a continuous-time framework of
DP-network, as opposed to the discrete-time DP-network. The continuous-time formulation pro-
vides an analytical framework to study the network properties, such as network convergence. In
the following, both the discrete and continuous-time formulations are discussed.
Continuous-Time Formulation
Assume that the min operator requires an infinitesimal time, $\delta t$, for evaluation. The output of
the operator at time $t + \delta t$ can then be expressed as:
\[ g_{t+\delta t}(i) = \min_{\forall k \in \mathcal{N}(i)} \{ V_t(k) + C_{i,k} \} \tag{5.8} \]
Suppose each computational unit $U(i)$ behaves dynamically as a first-order system. The whole
network can then be described by a set of differential equations
\[ \frac{dg_t(i)}{dt} = -\lambda_i g_t(i) + \lambda_i \min_{\forall k \in \mathcal{N}(i)} \{ V_t(k) + C_{i,k} \}, \quad \forall i \tag{5.9} \]
where $\lambda_i$ is the system pole for unit $U(i)$, which controls the rate at which $g_t(i)$ may change. If
$\lambda_i = 0$, then $|dg_t(i)/dt| = 0$, $g_t(i)$ becomes a constant, and the unit is said to be fully constrained,
with a fixed memory. A memoryless unit with $\lambda_i = \infty$, in contrast, has infinite power to
change because $|dg_t(i)/dt|$ can be made arbitrarily large. The units are interconnected based
on $\mathcal{N}(i)$, which defines the set of nodes adjacent to unit $U(i)$. Therefore, $V_t(k)$ is the output of
unit $U(k)$, which is a unit adjacent to $U(i)$ in $\mathcal{N}(i)$.
Discrete-Time Formulation
The discrete formulation can be obtained from Eq. 5.9. Letting $\delta t = 1$, the system of differential
equations, Eq. 5.9, becomes
\[ g_{t+1}(i) = \lambda_i \min_{\forall k \in \mathcal{N}(i)} \{ V_t(k) + C_{i,k} \}, \quad \forall i \tag{5.10} \]
where $\lambda_i$ defines the discount factor which controls the convergence speed of the system, as will
be shown in a later section.
5.3.2 Convergence of the Network
There are two important considerations in using a DP-network. Firstly, will the network always
converge to the desired solution? Secondly, what are the parameters or conditions that affect the
convergence rate of the network? The answer to the first question is “yes”, because
it follows directly from Bellman’s principle of optimality that the constituent
expected values of an optimal solution are themselves optimal. The local minimization based on the
Bellman equation performed at each distinct unit is, in fact, driving the network to a globally optimal
state, which is the desired solution. To measure the “distance” of the network from this global minimum, and in
line with Hopfield’s energy modelling in [Hop84], a computational energy $E(t)$ can be defined as
the squared error by which the system deviates from the optimal solution. From Eq. 5.9, the
energy function for the continuous-time ordinary differential equation can be stated as
\[ E(t) = \sum_{\forall i} \left( -\lambda_i g_t(i) + \lambda_i \min_{\forall k \in \mathcal{N}(i)} \{ V_t(k) + C_{i,k} \} \right)^2 \tag{5.11} \]
where E(t) = 0 when the network has converged. To determine the convergence rate of the
network, an explicit expression for dE(t)/dt has to be evaluated. By differentiating the energy
function in Eq. 5.11, the following expression is obtained.
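\[ \frac{dE(t)}{dt} = \frac{dE(t)}{dg_t(i)} \cdot \frac{dg_t(i)}{dt} = \sum_{\forall i} \left[ \frac{d}{dg_t(i)} \left( -\lambda_i g_t(i) + \lambda_i \min_{\forall k \in \mathcal{N}(i)} \{ V_t(k) + C_{i,k} \} \right)^2 \cdot \frac{dg_t(i)}{dt} \right] \tag{5.12} \]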
By evaluating the first term in Eq. 5.12, the following expression is obtained.
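\begin{align}
\frac{dE(t)}{dt} &= \sum_{\forall i} \left[ -2\lambda_i \left( -\lambda_i g_t(i) + \lambda_i \min_{\forall k \in \mathcal{N}(i)} \{ V_t(k) + C_{i,k} \} \right) \cdot \frac{dg_t(i)}{dt} \right] \tag{5.13} \\
&= \sum_{\forall i} \left[ -2\lambda_i \left( \frac{dg_t(i)}{dt} \right)^2 \right] \tag{5.14}
\end{align}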
Note that in order to establish the above expression, it is assumed that the output $g_t(i)$ of a unit does
not feed back to the unit itself. Thus, the current node $v_i$ is not in its own set of neighbour
nodes, $\mathcal{N}(i)$. The factors $\lambda_i$ and $(dg_t(i)/dt)^2$ that make up the sum on the right-hand side of
Eq. 5.14 are all non-negative. In other words, the energy function $E(t)$ defined in Eq. 5.11 is a monotonically
non-increasing function of time, as
\[ \frac{dE(t)}{dt} \le 0 \tag{5.15} \]
From the definition in Eq. 5.11, note that the function $E(t)$ is bounded below. The time evolution of
the continuous DP-network model described by the system of first-order differential equations in
Eq. 5.9 represents a trajectory in the state space, which seeks out the minima of the energy
function $E(t)$ and comes to a stop at such a fixed point. From Eq. 5.14, note that the derivative
$dE(t)/dt$ vanishes only at the point that satisfies the Bellman optimality criterion,
\[ \frac{dg_t(i)}{dt} = 0, \quad \forall i \tag{5.16} \]
In other words, $E(\cdot)$ is a Lyapunov function.
5.3.3 Numerical Examples
Example 5.1. Computing expected costs in a 10-node array
A 10-state random walk problem can be solved by a 10-unit continuous-time DP-network. The 10
states are indexed by $S_i$, $i = 1, 2, \ldots, 10$. The outputs of the 10 units of the network, signifying
the expected costs to the destination, are described by a vector $x_t = \{ g_t(S_i) \mid i = 1, 2, \ldots, 10 \}$,
whose entries have the semantic meaning of the expected cost $V_{S_i}$, $i = 1, 2, \ldots, 10$. The transition cost
is defined as $C_{i,i+1} = 1$ and $C_{i+1,i} = 1$ for all $i = 1, 2, \ldots, 9$, and $C_{i,j} = \infty$ for all $j \ne i + 1$
and $j \ne i - 1$. The continuous-time DP-network can be modeled by a set of differential equations
on the 10 nodes $S_i$. The expected costs $V_{S_i}$ evolve as a first-order lag controlled by $\lambda$ (a system-
or implementation-related parameter that does not depend on the problem). $\gamma$ is an implementation-related
parameter, defined as the discount factor for the multistage cost, and is equal to 0.9 in this example.
\[ \frac{dV_{S_1}}{dt} = -\lambda V_{S_1} + \lambda \left( C_{1,2} + \gamma V_{S_2} \right) \tag{5.17} \]
\[ \frac{dV_{S_i}}{dt} = -\lambda V_{S_i} + \lambda \min\{ C_{i,i-1} + \gamma V_{S_{i-1}},\; C_{i,i+1} + \gamma V_{S_{i+1}} \}, \quad \forall i = 2, 3, \ldots, 9 \tag{5.18} \]
\[ V_{S_{10}} = 0 \tag{5.19} \]
Eq. 5.17 describes $V_S$ of the boundary node $S_1$, which has a single “right” action. The nodes $S_i$,
$i = 2, 3, \ldots, 9$, have both left and right actions and follow equations of the form of Eq. 5.18.
The value of the destination node, $V_{S_{10}}$, is defined to be zero as in Eq. 5.19.
Given arbitrary positive initial values of $V_{S_i}$, the converged values of the respective differential
equations (Eqs. 5.17-5.19) can be verified to be identical to the optimal values governed by the
Bellman equations. Fig. 5.2 shows the convergence results obtained by using the Matlab ODE solver (see footnote 1)
for the differential equations. The converged values are found to be [6.10, 5.67, 5.20, 4.68, 4.10,
3.44, 2.71, 1.90, 1.00, 0]. The results agree with those computed using
the well-known Bellman-Ford algorithm for shortest path problems. Also note that node $S_9$ is
the quickest to converge whereas $S_1$ is the slowest. This is because the expected cost of each node
depends on that of its neighbour, and information takes the longest time to propagate from $S_{10}$ to $S_1$.
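The behaviour of Example 5.1 can be reproduced approximately with a forward-Euler integration of the first-order lag equations. The sketch below is not the Matlab ode45 setup used for Fig. 5.2. The step size, the time horizon, the initial value and the function names are assumptions, and the boundary node $S_1$ is assumed to pay the unit transition cost like every other node.

```python
def simulate_chain(lam=1.0, gamma=0.9, dt=0.05, t_end=200.0, n=10, v0=5.0):
    """Forward-Euler integration of the 10-node chain (first-order lag units).

    All values start at v0 (arbitrary positive value, an assumption); the last
    node is the destination and is held at zero as in Eq. 5.19.
    """
    V = [v0] * n
    V[-1] = 0.0
    for _ in range(round(t_end / dt)):
        target = [0.0] * n
        target[0] = 1.0 + gamma * V[1]            # S_1 can only move right
        for i in range(1, n - 1):                 # S_2 .. S_9: left or right
            target[i] = min(1.0 + gamma * V[i - 1], 1.0 + gamma * V[i + 1])
        V = [v + dt * lam * (tg - v) for v, tg in zip(V, target)]
        V[-1] = 0.0                               # destination stays at zero
    return V


def bellman_fixed_point(gamma=0.9, n=10):
    """Closed-form fixed point V_i = 1 + gamma * V_{i+1}, with V_n = 0."""
    V = [0.0] * n
    for i in range(n - 2, -1, -1):
        V[i] = 1.0 + gamma * V[i + 1]
    return V


simulated = simulate_chain()
reference = bellman_fixed_point()
```

The fixed point gives $V_{S_9} = 1$, $V_{S_8} = 1.9$ and $V_{S_7} = 2.71$, in agreement with the converged values listed above. For nodes far from the destination the fixed point of this sketch differs from the listed values in the second decimal place, which may be a consequence of the solver tolerance or integration horizon behind the reported figures. That explanation is a conjecture, not something stated in the source.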
Figure 5.2: Convergence of a DP-network for the 10-node array random walk problem. (Plot of the expected cost $V_S$ of nodes $S_1$ to $S_{10}$ against time.)
Example 5.2. Computing expected costs in a 10× 10 mesh
Consider a 100-node network with a 10 by 10 mesh interconnection. Each node connects to
at most 4 adjacent nodes, while nodes at the edge connect to three and nodes at a corner
connect to two. The nodes are arranged as a perfect square. Every transition incurs a cost of
1 and the destination node at the center has an expected cost of zero.
Footnote 1: The differential equation solver is ode45, which is provided in Matlab. ode45 is based on an
explicit Runge-Kutta formula, the Dormand-Prince pair.
Similar to Example 5.1, the continuous-time DP-network can be modeled by 100 differential equations
on the 100 nodes, $S_{i,j}$, $\forall i, j = 1, 2, \ldots, 10$. The expected costs $V_{S_{i,j}}$, $\forall i, j = 1, 2, \ldots, 10$, evolve
as a first-order lag controlled by $\lambda$.
Figure 5.3: The convergence of the expected costs of the nodes from the 10×10 mesh network. Panels: (a) t=1, (b) t=5, (c) t=10, (d) t=20.
Let $\lambda = 0.5$ and let the destination node be $S_{5,5}$. The values of the expected cost are shown in Fig. 5.3.
At time $t = 1$, the expected costs are randomly initialized and $V_{S_{5,5}} = 0$, as $S_{5,5}$ is the destination
node. The intermediate results are shown in the figure, and the network settles to the optimal solution
at $t = 20$. The convergence of the DP-network in a two-dimensional mesh
depends on $\lambda$, which in this example is equal to 0.5. By increasing $\lambda$, the time needed for the
network to settle decreases. Even when $\lambda$ is large (say $\lambda = 0.9$), the network still converges
to the optimal solution.

Figure 5.4: The Root Mean Square (RMS) error of the DP-network for computing shortest paths
in a 10×10 mesh network with different $\lambda$ values ($\lambda$ = 0.3, 0.5, 0.7, 0.9), plotted against time.

Fig. 5.4 shows the convergence of the network for different values of $\lambda$. The results are root-mean-square
(RMS) errors between the $V_S$ output from the network and the values obtained using the Bellman-Ford
algorithm, averaged over the states of the 10 × 10 mesh example. Clearly, $\lambda$ is a convergence time constant of
the network which governs the time required to obtain the optimal solutions.
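The dependence of the settling time on $\lambda$ can be illustrated with a discrete, damped update on the 10×10 mesh. The sketch below is an assumption-laden stand-in for the experiment, not a reproduction of it: it uses unit time steps, zero initial values instead of random ones, the Manhattan distance as the reference solution (valid here because every hop costs 1), and hypothetical function names.

```python
import math


def mesh_neighbours(i, j, side):
    """Yield the mesh neighbours of tile (i, j) that lie inside the grid."""
    for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        if 0 <= i + di < side and 0 <= j + dj < side:
            yield i + di, j + dj


def rms_error_after(steps, lam, side=10, dest=(4, 4)):
    """RMS error of the expected costs after a number of damped updates.

    dest=(4, 4) is S_{5,5} in zero-based indexing. Each update moves a value
    a fraction lam of the way towards 1 + min(neighbour values).
    """
    V = [[0.0] * side for _ in range(side)]   # zero initial values (assumption)
    for _ in range(steps):
        new_V = [row[:] for row in V]
        for i in range(side):
            for j in range(side):
                if (i, j) == dest:
                    continue                  # destination is held at zero
                target = 1.0 + min(V[a][b] for a, b in mesh_neighbours(i, j, side))
                new_V[i][j] = (1.0 - lam) * V[i][j] + lam * target
        V = new_V
    squared = sum(
        (V[i][j] - (abs(i - dest[0]) + abs(j - dest[1]))) ** 2
        for i in range(side)
        for j in range(side)
    )
    return math.sqrt(squared / side ** 2)


errors = {lam: rms_error_after(20, lam) for lam in (0.3, 0.5, 0.7, 0.9)}
```

After the same number of updates, a larger $\lambda$ leaves a smaller RMS error in this sketch, which is consistent with the trend described for Fig. 5.4.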
5.3.4 Summary
In this section, the characteristics of the DP-network have been discussed. The DP-network can
be formulated in discrete and continuous-time forms. The monotonic property of the continuous-time
network has been shown and the network convergence has been discussed. The convergence
rate of the network depends on $\lambda$, a time constant that varies with the
implementation platform. In the following, the embedding of the DP-network in a NoC, which
provides on-the-fly shortest path computation and enables dynamic routing to enhance the network
utilization, is discussed.
5.4 NoC Routing with DP-Network
5.4.1 Routing Architecture
An interesting feature of an on-chip communication network, or network-on-chip (NoC), is that
the communication network itself defines the graph of the shortest path problem. This provides an
opportunity to compute the optimal path by embedding a DP unit at each node. Unlike the general
computer network, the shortest-path routing computation is solely attributed to the processors at
each node. The network-on-chip environment demands tighter timing and performance constraints
as well as more flexible implementation methodologies which can be achieved by implementing a
DP-network architecture.
Figure 5.5: An example of a 3 by 3 mesh network coupled with a DP-network.
The DP network shown in Fig. 5.5 consists of distributed computational units and links between
the units. The topology of the network resembles the defined graph topology, which is the com-
munication structure of a NoC. At each node, there is a computation unit, which implements the
Bellman equations in Eq. 5.6. The numerical solution of the unit is propagated to the neighbour
units via the neighbourhood interconnects. The DP network is tightly coupled with the NoC
and each computational unit locally exchanges control and system parameters with the tile or core.
The DP network quickly resolves the optimal solution, as will be shown later in the chapter, and
passes the control decisions to the router or other controllers in the tile, while real-time
information, such as the average queuing time, is fed into the computational unit.
The DP network presents several distinctive features to an on-chip communication system.
Firstly, the distributed architecture enables a scalable real-time monitoring functionality for the
NoC. Each computational unit acquires local information and, through communication with neighbour
units, a global optimization can be achieved. Secondly, because of the simplicity of the
computational unit, the dedicated DP network provides a real-time response and does not consume
any data-flow network bandwidth. Thirdly, because of the convergence property discussed in
Section 5.3.2, the DP-network provides an effective solution to optimal path planning and dynamic
routing.
DP Routing Mechanics
Consider a node-table routing architecture in which a routing table is stored at each router. The
destination of the header flit is checked and the routing direction is decided based on
the routing table entries. In contrast to table-based routing, algorithmic routing, in which a routing
algorithm computes the route or next hop of a packet at runtime, is restricted to
simple routing algorithms and can only be applied to regular topologies, such as a mesh topology.
The routing table approach enables the use of per-hop network state information, such as queue
lengths, to select among several possible next hops at each stage of the route.
Algorithm 1 presents an algorithm for updating the routing table with a DP-network. At each node
unit, there are j inputs from the j neighbour nodes for the expected costs. The output of the unit
at node s is the updated expected cost d(s) and is sent to all adjacent nodes. The main algorithm
is outlined in lines 4-10. For each destination i and direction j, the expected cost will be computed
and the minimum cost will be selected, as stated in line 8. The optimal direction for routing is
selected and used to update the routing table, as stated in line 9. Although the algorithm consists of
two for-loops, this can be realized in hardware with a parallel architecture and the computational
delay complexity can be reduced to linear.
Algorithm 1 Update routing table for node s
1: Inputs: $V_i(j)$, $j \in \Omega(s)$, where $\Omega(s)$ returns all neighbour nodes of $s$ and $i = 1, 2, \ldots, |N|$
2: Outputs: $V_i^{*}(s)$
3: Definitions:
   $s$ is the current node;
   $C_{j,s}$ is the input queue length at node $s$ from direction $j$
4: for all destinations $i$ such that $i \in N$ do
5:   for all directions $j$ such that $j \in \Omega(s)$ do
6:     $V'_i(j) = C_{j,s} + V_i(j)$
7:   end for
8:   $V_i^{*}(s) = \min_{\forall j} V'_i(j)$
9:   $\mu_i(s) = \arg\min_{\forall j} V'_i(j)$ {Update routing table}
10: end for
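A software rendering of Algorithm 1 may help to make the data flow concrete. The sketch below is only an illustration: the function name, the dictionary-based interfaces and the example numbers are hypothetical, and the loops run sequentially here whereas the hardware unit evaluates them in parallel.

```python
def update_routing_table(neighbour_values, queue_length, destinations):
    """Software sketch of Algorithm 1 for one node s.

    neighbour_values[j][i]: expected cost V_i(j) reported by neighbour j
                            for destination i (hypothetical interface).
    queue_length[j]:        C_{j,s}, the queue length seen towards direction j.
    Returns (V_star, mu): the updated expected cost and the routing direction
    for every destination.
    """
    V_star, mu = {}, {}
    for i in destinations:                           # line 4
        candidate = {                                # lines 5-7
            j: queue_length[j] + neighbour_values[j][i]
            for j in neighbour_values
        }
        best = min(candidate, key=candidate.get)
        V_star[i] = candidate[best]                  # line 8
        mu[i] = best                                 # line 9: table entry
    return V_star, mu


# Hypothetical node with two neighbours (north and east) and two destinations.
V_star, mu = update_routing_table(
    neighbour_values={"N": {"t1": 2, "t2": 4}, "E": {"t1": 5, "t2": 1}},
    queue_length={"N": 3, "E": 1},
    destinations=["t1", "t2"],
)
```

In this example destination `t1` is routed north (cost 3 + 2 = 5 against 1 + 5 = 6 through the east port) and `t2` is routed east (cost 1 + 1 = 2).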
Many routers use routing tables either at the source (source routing) or at each hop (node-table
routing) along the route to implement the routing algorithm. In adaptive routing, the routing
table is updated dynamically or periodically, such that the communication traffic can be altered
subject to the choice of switching mechanism. The DP-network does not interact or interfere with
the packet switching mechanism but alters the routing table at run-time. A mesh network
topology is used throughout the chapter to illustrate the idea. However, the proposed method
is not limited to the mesh topology, and simple modifications can be made to handle networks of
different topologies, such as torus, Butterfly Fat-Tree (BFT) and other custom-designed topologies,
based on the flexible routing-table-based design.
Deadlock can effectively be avoided by adopting one of the deadlock-free turn models. In this
chapter, the west-first [GN92] turn model is used. It prohibits all turns to the south-west and
north-west directions. The dynamic routing scheme is switched to XY routing whenever the
destination node lies in these directions. In this case, the north-west and south-west turns are
removed and, thus, the routing dependencies never form a cycle in the network. Alternatively,
other turn models, such as north-last, can also be applied in the DP network to avoid deadlock with
similar performance, at the designer’s discretion.
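The restriction imposed by the west-first turn model can be sketched as a filter on the directions offered to the adaptive selection. This is a minimal sketch under assumptions: the coordinate convention (x grows eastwards, y grows northwards), the restriction to minimal paths, and the names `west_first_candidates` and `choose_direction` are all hypothetical.

```python
def west_first_candidates(cur, dst):
    """Directions permitted for a minimal path under the west-first turn model.

    If the destination lies to the west, the packet must travel west first,
    so the choice is deterministic (as in XY routing). Otherwise the router
    may choose adaptively among the productive east, north and south hops.
    """
    (cx, cy), (dx, dy) = cur, dst
    if dx < cx:
        return ["W"]
    candidates = []
    if dx > cx:
        candidates.append("E")
    if dy > cy:
        candidates.append("N")
    if dy < cy:
        candidates.append("S")
    return candidates


def choose_direction(cur, dst, expected_cost):
    """Pick the permitted direction with the smallest expected cost.

    expected_cost[d] stands for the DP value seen through direction d
    (a hypothetical interface). Returns None when cur == dst.
    """
    candidates = west_first_candidates(cur, dst)
    if not candidates:
        return None
    return min(candidates, key=lambda d: expected_cost[d])
```

Only the set of candidates is constrained by the turn model. The choice among them can still follow the expected costs delivered by the DP-network.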
DP-Network Convergence Analysis
The delay for the DP-network to converge to an optimal routing solution depends on the network
topology, which determines the delay for information to propagate within the network, and on the
delay of each computational unit. It can be seen that each unit involves $O(|A|)$ additions and
comparisons, where $|A|$ is the number of edges. Hence, the solution time is $O(k|A|)$, where $k$ is the
number of iterations evaluated by each unit. In software computation, $k$ is equal to the number
of nodes in the network, $k = |V|$, which guarantees that all nodes have been updated
[CLR01]. However, in a hardware implementation with parallel execution, $k$ is determined by
the network structure and the $|A|$ additions can be executed in parallel. Each computational unit can
simultaneously compute the new expected cost for all neighbour nodes. Therefore, the solution
time becomes the time for the updated value to be distributed to every other node, and the
computational complexity becomes $O(1)$. Consider a mesh network of $n$ nodes with $\sqrt{n}$ rows and
$\sqrt{n}$ columns. The longest path in this network is $2\sqrt{n} - 1$, which is the minimum time required
to update the expected costs at all nodes. Therefore, the network convergence time is proportional
to the network diameter, which is the longest path in the network. The DP-network convergence
times for some network topologies are summarized in Table 5.2.

Table 5.2: Convergence analysis of a DP-network for different network topologies.
Size    Topology                  DP convergence time
n, k    binary k-cube             $n - 1$
n       2D Mesh                   $2\sqrt{n} - 1$
n, k    n-dimension k-ary mesh    $nk - 1$
n       2D Torus                  $\sqrt{n} - 1$
n       n-node k-ary tree         $2 \log_k n - 1$
n, k    k-ary n-cube              $nk/2$
n, k    k-ary n-flies             $n - 1$
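The link between convergence time and network diameter can be checked with a small software model of synchronous updates on a square mesh. The sketch rests on several assumptions that are not taken from the thesis: unit link costs, values initialised to infinity, the destination placed in a corner (the worst case), and a counting convention that includes the final round in which no value changes.

```python
import math


def rounds_to_converge(side, dest=(0, 0)):
    """Count synchronous update rounds on a side x side mesh with unit costs.

    The count includes the last round, in which nothing changes and
    convergence is therefore detected (a convention assumed for this sketch).
    """
    V = [[math.inf] * side for _ in range(side)]
    V[dest[0]][dest[1]] = 0
    rounds = 0
    while True:
        new_V = [row[:] for row in V]
        for i in range(side):
            for j in range(side):
                if (i, j) == dest:
                    continue
                neighbours = ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1))
                new_V[i][j] = min(
                    1 + V[a][b]
                    for a, b in neighbours
                    if 0 <= a < side and 0 <= b < side
                )
        rounds += 1
        if new_V == V:
            return rounds
        V = new_V
```

Under this counting convention a 10×10 mesh ($n = 100$) needs 19 rounds, which coincides with the $2\sqrt{n} - 1$ entry for the 2D mesh in Table 5.2. Whether the table uses the same convention is an assumption.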
5.4.2 Optimality and Memory Trade-off
One concern for the table-based routing mechanics is the routing table size, which requires allo-
cation of memory or registers. Even though the adaptive routing brings in substantial advantage
in routing delay and throughput, the memory requirement could sometimes become a hindrance
for the system to scale up [DT04]. In this subsection, a new method, namely k-step Look Ahead,
is introduced. This method yields a sub-optimal solution in dynamic routing but can substantially
reduce the memory requirement.
Figure 5.6: Comparing two routing strategies, where the shaded area represents the nodes covered
in routing table and vs is the source and vt is the destination. (a) Optimal decision can be made at
vs (b) Since Vu(s) ≤ Vw(s), vu is selected as transition node in the sub-optimal path to vt.
k-Step Look Ahead (KSLA)
Instead of storing routing decisions for all destinations in a routing table, storing a table that
provides optimal decisions for a local region only can enable a sub-optimal path to the destinations
with a substantial reduction in the storage requirement. The idea is that each router computes the
routing decisions for nodes that are up to $k$ steps away from the current node. A $k$-step region is depicted
as the shaded area in Fig. 5.6(b). If the destination is within the $k$-step region, an optimal decision
is readily available in the routing table. Otherwise, a transition node $v_u$ is selected such that the
DP value for $u$ is the smallest, $V_u(s) \le V_w(s)$, $\forall w \in k(s)$, where $k(s)$ is the set of nodes that
are $k$ steps away from $v_s$. This procedure repeats at each hop and eventually the packet
arrives at the destination via a sub-optimal route. Fig. 5.6 compares the two strategies graphically.
The k-step Look Ahead algorithm is presented in Algorithm 2. The input is the destination
node, as for the router designed for global optimal path planning. For every
flit or packet, the algorithm checks whether the destination is within the k-step region. How this
is done depends on the topology. For a mesh, it can be checked by computing the Manhattan
distance from the node coordinates and comparing it with k (line 5 of Algorithm 2). If the
destination is within the k-step region, the optimal routing decision can be readily retrieved from
the routing table. If the destination is outside the region, and thus not covered by the routing table,
the algorithm finds a node within the region that is closest to the destination and with minimal
Algorithm 2 k-step Look Ahead Routing Algorithm
1: Inputs: Destination node t
2: Outputs: Routing direction µ(s)
3: Definitions:
     s is the current node;
     D(s, t) returns the number of steps from s to t;
     µ_t(s) returns the routing direction of destination t at node s;
     V_i(s) is the DP value of destination i at node s;
     f(k, s) returns the nodes that are k steps away from s;
4: Initialization: min_DP_value = RAND_MAX
5: if D(s, t) ≤ k then
6:     return µ_t(s)
7: else
8:     for all nodes i such that i ∈ f(k, s) do
9:         if D(i, t) ≤ D(s, t) − k then
10:            if V_i(s) ≤ min_DP_value then
11:                min_DP_value = V_i(s);
12:                K_region_dest = i;
13:            end if
14:        end if
15:    end for
16:    return µ_{K_region_dest}(s)
17: end if
cost. In line 9, the condition ensures that the node chosen is the closest to the destination. Lines
8-15 search for the node leading to the destination with the minimal expected
cost. Finally, in line 16, the routing direction towards this node is output as the next-hop direction.
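The steps of Algorithm 2 can be sketched in Python for a mesh topology. This is an illustrative reading rather than the thesis implementation: it assumes that the comparison in lines 10-11 uses the DP value of the candidate node i as seen from the current node, and the names `ksla_direction`, `boundary_nodes`, `dp_value` and `direction` are hypothetical. The two dictionaries stand in for the k-step routing table held by the current router.

```python
import math


def manhattan(a, b):
    """D(a, b): number of hops between two mesh nodes."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def boundary_nodes(s, k, width, height):
    """f(k, s): nodes exactly k hops from s that lie inside the mesh."""
    nodes = []
    for dx in range(-k, k + 1):
        rem = k - abs(dx)
        for dy in sorted({rem, -rem}):
            x, y = s[0] + dx, s[1] + dy
            if 0 <= x < width and 0 <= y < height:
                nodes.append((x, y))
    return nodes


def ksla_direction(s, t, k, dp_value, direction, width, height):
    """Return the next-hop direction at node s for destination t.

    dp_value[i] is the DP value of node i stored at s (assumed to cover
    the k-step boundary); direction[i] is the stored routing direction
    towards node i (assumed to cover the whole k-step region).
    """
    if manhattan(s, t) <= k:                      # line 5
        return direction[t]                       # line 6
    best, best_cost = None, math.inf              # line 4
    for i in boundary_nodes(s, k, width, height):  # line 8
        if manhattan(i, t) <= manhattan(s, t) - k:  # line 9
            if dp_value[i] <= best_cost:          # line 10
                best_cost = dp_value[i]           # line 11
                best = i                          # line 12
    return direction[best]                        # line 16
```

On a mesh, a boundary node satisfying the condition in line 9 always exists when the destination is more than k hops away, so `best` is set before it is used.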
With the optimal routing scheme, the total cost to go from node s to t is
$$V^*_t(s_0) = \min_{\forall s_i \in P^m_{s_0,s_m}} \left\{ \sum_{j=1}^{m} C_{s_{i-1},s_i} \right\} \qquad (5.20)$$
where $i = 0, 1, \ldots, m$ and $v_{s_m} = v_t$. In other words, each router is able to look ahead over all
possible paths $P^m_{s_0,s_m}$ to the destination and choose the one with minimal delay. For the KSLA
approach, routers can only look ahead for k steps at each round. So the total expected cost,
$W^k_t(s_0)$, becomes the sum of the $\lfloor m/k \rfloor$ rounds of k-step propagations plus the expected cost of
the last round, which requires at most k steps.
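$$W^k_t(s_0) = \sum_{l=0}^{\lfloor m/k \rfloor} \min_{\forall s_i \in P^k_{s_{lk},s_{(l+1)k}}} \left\{ \sum_{j=lk+1}^{(l+1)k} C_{s_{i-1},s_i} \right\} + \min_{\forall s_i \in P^m_{s_{\lfloor m/k \rfloor k},s_m}} \left\{ \sum_{j=\lfloor m/k \rfloor k+1}^{m-\lfloor m/k \rfloor k} C_{s_{i-1},s_i} \right\} \qquad (5.21)$$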
where $m \ge k$, $i = 0, 1, \ldots, m$ and $v_{s_m} = v_t$. If the intermediate nodes in the KSLA are
the same as those in the optimal path, $P^m_{s_0,s_m}$, the path produced by KSLA is optimal. In this
case, the lower error bound for KSLA is zero, with $W^k_t(s_0) = V^*_t(s_0)$. Furthermore, the expected
costs of the optimal and KSLA cases have an interesting proven² relationship, which can be
expressed by the following inequality
$$W^{k-1}_t(s_0) \ge W^k_t(s_0) \ge V^*_t(s_0) \qquad (5.22)$$
where m ≥ k > 1. This expression implies that the KSLA approximation error decreases mono-
tonically when k increases. Note that there is no theoretical upper bound for the expected cost for
the KSLA approach. If the packet is trapped at a node with a single path to the destination and this
path is faulty, the packet will not reach the destination. Similar to other routing algorithms, such
as XY and Odd-Even, back-tracking or special rescue routines are required to help the packet to
escape from the trapped node. Nonetheless, this situation is rare and the KSLA can approximate
the optimal path in most cases, as shown in the Monte Carlo simulation.
A Monte Carlo simulation has been performed to verify the theoretical results. The relative error of
KSLA with respect to the optimal DP values for different values of the parameter k is shown in Fig. 5.7.
For each k, the optimal path cost and the cost using KSLA are computed. The
relative error is the difference between the path costs of the two approaches. The
figure presents the average relative error over 1000 networks with randomly generated path costs. The
result consistently shows the monotonicity in the parameter k of KSLA. Also, it is interesting to
observe that the error decreases drastically between k = 1 and k = 4. For the case of k = 4, the
error is reduced to 10%. Considering the substantial memory requirement of a full routing table, a
relatively small k in KSLA can already provide a good-quality sub-optimal routing solution.
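A Monte Carlo experiment of this kind can be sketched as follows. The sketch is not the simulation used in the thesis: it assumes a square mesh with uniformly random link costs, takes the relative error as (KSLA cost − optimal cost) / optimal cost, and uses an unrestricted Dijkstra search as a stand-in for the DP values of the boundary nodes. All function names are hypothetical.

```python
import heapq
import random


def neighbours(v, side):
    x, y = v
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nx, ny = x + dx, y + dy
        if 0 <= nx < side and 0 <= ny < side:
            yield (nx, ny)


def manhattan(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def dijkstra(src, cost, side):
    """Shortest-path cost from src to every node of the mesh."""
    dist = {src: 0.0}
    heap = [(0.0, src)]
    while heap:
        d, v = heapq.heappop(heap)
        if d > dist.get(v, float("inf")):
            continue
        for u in neighbours(v, side):
            nd = d + cost[(v, u)]
            if nd < dist.get(u, float("inf")):
                dist[u] = nd
                heapq.heappush(heap, (nd, u))
    return dist


def ksla_cost(s, t, k, cost, side):
    """Cost of the route obtained by hopping between k-step transition nodes."""
    total = 0.0
    while manhattan(s, t) > k:
        dist = dijkstra(s, cost, side)
        remaining = manhattan(s, t)
        candidates = [i for i in dist
                      if manhattan(s, i) == k
                      and manhattan(i, t) <= remaining - k]
        nxt = min(candidates, key=lambda i: dist[i])
        total += dist[nxt]
        s = nxt
    return total + dijkstra(s, cost, side)[t]


def mean_relative_error(k, side=6, trials=20, seed=0):
    """Average relative error of KSLA over randomly weighted meshes (k >= 1)."""
    rng = random.Random(seed)
    s, t = (0, 0), (side - 1, side - 1)
    errors = []
    for _ in range(trials):
        cost = {((x, y), u): rng.uniform(1.0, 10.0)
                for x in range(side) for y in range(side)
                for u in neighbours((x, y), side)}
        optimal = dijkstra(s, cost, side)[t]
        errors.append((ksla_cost(s, t, k, cost, side) - optimal) / optimal)
    return sum(errors) / len(errors)
```

Because every KSLA route is a valid path, its cost can never fall below the optimal cost, so the error returned by `mean_relative_error` is non-negative and becomes zero once k covers the whole source-to-destination distance.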
For any n-node network, the number of memory addresses required can be reduced to $k(k+1)$, where $k \le \sqrt{n}$
and a 4-ary network topology is assumed. In general, there are $2k(k+1)$ nodes within the k-step
region. Using the west-first turn model to avoid deadlock, only $k(k+1)$ destination costs
need to be evaluated and stored. Selecting an appropriate k enables a trade-off between
memory consumption and routing optimality at the designer's disposal. Fig. 5.7 shows the size of the
routing table required for each k as well as the relative error for the KSLA routing. For the
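² This can be derived using the inequality $\min_{\forall P^m_{s_0,s_2}} \{C_{s_0,s_1} + C_{s_1,s_2}\} \le \min_{\forall P^k_{s_0,s_1}} C_{s_0,s_1} + \min_{\forall P^{m-k}_{s_1,s_2}} C_{s_1,s_2}$, where $C_{i,j} \ge 0, \forall i \to j \in A$.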
[Figure 5.7 plot. x-axis: number of steps (k-step look ahead); left y-axis: relative error for KSLA; right y-axis: routing table size (address); series: relative error, memory addresses.]
Figure 5.7: Theoretical estimates for the approximation error of the KSLA approach with respect
to optimal DP values and the routing table size for the corresponding k values.
case of k = 4, the number of memory addresses required is only 20 for the KSLA approach versus
63 for a full routing table.
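The table sizes quoted above can be checked with a short sketch. It counts the nodes of the k-step region directly, ignoring mesh boundaries, and applies the halving attributed in the text to the west-first turn model; the function names are hypothetical.

```python
def nodes_in_region(k):
    """Nodes within k hops (Manhattan distance) of a router, excluding the router."""
    return sum(1
               for dx in range(-k, k + 1)
               for dy in range(-k, k + 1)
               if 0 < abs(dx) + abs(dy) <= k)


def ksla_table_entries(k):
    """Addresses needed by KSLA according to the k(k + 1) expression in the text."""
    return k * (k + 1)


def full_table_entries(side):
    """Addresses needed by a full routing table in a side x side mesh."""
    return side * side - 1


if __name__ == "__main__":
    k = 4
    print(nodes_in_region(k), ksla_table_entries(k), full_table_entries(8))
```

For k = 4 the region holds 2k(k + 1) = 40 nodes, of which 20 need a table entry, against 63 entries for a full table in an 8 × 8 mesh.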
5.5 Results and Discussion
5.5.1 Simulation Environment
In order to perform a complete evaluation of the proposed routing algorithm, Noxim
[bib08], an open-source SystemC simulator for mesh-based NoCs, was modified. In
particular, additional ports to communicate the DP values were added to the router architecture and
the routing-table updating scheme described in the previous section was introduced. A new DP routing
function was implemented to realise both the global path planning and the k-step Look Ahead. The
traffic-pattern benchmarks embedded in Noxim are used for the routing performance evaluation.
These traffic patterns, such as hot-spot random traffic and transpose, provide a comprehensive
evaluation of the routing capability, as shown in other related work [ACPP08].
By varying the packet injection rate, different routing algorithms produce different average packet
delivery delay and saturation point. The average packet delivery delay is used as a metric to eval-
uate the routing algorithm. The DP-network provides the shortest path planning, by minimizing
the packet delivery delay at every node. Also, the maximum packet delay is used to evaluate the
routing performance, which is important for NoC real-time applications. The experiments carried
out refer to an 8 × 8 sized NoC. Traffic sources generate 8-flit packets with an exponential dis-
tribution, the parameters of which depend on the packet injection rate. The FIFO buffers have a
capacity of 16 flits. Each simulation was initially run for 1,000 cycles to allow transient effects to
stabilize and, afterwards, executed for 20,000 cycles.
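The traffic generation and warm-up procedure can be pictured with a small sketch. It is not Noxim code: it assumes that the exponential distribution describes the inter-arrival time between packets at a source, with mean equal to the inverse of the injection rate, and the function name `injection_times` is hypothetical.

```python
import random


def injection_times(rate, warmup, measured, seed=0):
    """Injection times (in cycles) of one traffic source.

    Inter-arrival times are exponential with mean 1 / rate. Packets
    injected during the warm-up period are discarded, so only the
    measured window contributes to the statistics.
    """
    rng = random.Random(seed)
    end = warmup + measured
    t = 0.0
    times = []
    while True:
        t += rng.expovariate(rate)
        if t >= end:
            return times
        if t >= warmup:
            times.append(t)


if __name__ == "__main__":
    # 1,000 warm-up cycles followed by 20,000 measured cycles, as in the text.
    times = injection_times(rate=0.005, warmup=1000, measured=20000)
    print(len(times))
```

At an injection rate of 0.005 packet/cycle/node, about 100 packets per source would be expected within the 20,000 measured cycles.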
[Figure 5.8 plot. x-axis: packet injection rate (packet/cycle/node, ×10⁻³); y-axis: average packet delay (cycle); series: XY, OE, DyAD, NOP, DP.]
Figure 5.8: Random traffic with hot-spot at the center.
5.5.2 Results for Average Packet Delay
In order to evaluate the DP-network performance, the average packet delay of DP is compared with
that of four other well-known routing algorithms, namely XY [DT04], DyAD [Hu05], Odd-Even [SZBR07]
and Odd-Even routing with a Node-on-Path (NoP) selection scheme [ACPP08]. Each
packet is generated randomly from the processors following a traffic pattern and comprises 2
to 10 flits. A fully optimal DP-network dynamic routing is applied for the experiments in this
sub-section. The results for KSLA will be presented in the next sub-section.
[Figure 5.9 plot. x-axis: packet injection rate (packet/cycle/node, ×10⁻³); y-axis: average packet delay (cycle); series: XY, NOP, DyAD, OE, DP.]
Figure 5.9: Random traffic with hot-spot at the corner.
Fig. 5.8 shows the results of random traffic with hot-spots. This type of traffic pattern is considered
to be more realistic than random traffic with a uniform distribution. In most applications,
certain processors or tiles, such as memory nodes and Input/Output nodes, are accessed more
frequently than others. In this scenario, there are four hot-spots located at the center of the network
with 20% hot-spot traffic. When traffic is directed to the center of the network, the central region
becomes substantially congested. Deterministic routing algorithms, however, would still direct
traffic to these regions. Routing algorithms such as NoP and DP can slightly outperform
algorithms with deterministic routing. The DyAD routing adopts a scheme that switches between
XY and Odd-Even dynamically and thus presents a result in between the two algorithms. The
[Figure 5.10 plot. x-axis: packet injection rate (packet/cycle/node); y-axis: average packet delay (cycle); series: XY, OE, DyAD, NOP, DP.]
Figure 5.10: Transpose traffic.
results are consistent with the literature [Hu05]. Fig. 5.9 shows the results of another hot-spot traffic
pattern, where the hot-spots are located in the corners of the network. In this case, there is no
congested traffic at the center of the network. The dynamic routing algorithm has a larger degree of
freedom to divert the packets to the destination along a path with potentially smaller delay. The
performance advantage of dynamic routing is more substantial in this case: DP outperforms the
other routing algorithms by 24.7%.
Figs. 5.10 and 5.11 show the results for transpose and butterfly traffic, respectively. It can be
observed that the performance of XY routing and DyAD is poor due to the congested routes
along the horizontal hops, which agrees with results reported in the literature [Hu05, ACPP08,
Chi92]. The DP routing can delay the saturation point significantly, because the optimal path
planning is able to utilize the throughput of the network effectively. It is interesting to
observe that NoP also provides an efficient routing scheme, which adapts to congestion by
delaying the saturated packet injection rate to 0.02 in transpose traffic. DP outperforms other
[Figure 5.11 plot. x-axis: packet injection rate (packet/cycle/node); y-axis: average packet delay (cycle); series: XY, OE, DyAD, NOP, DP.]
Figure 5.11: Butterfly traffic.
routing schemes by 28.4% and 28.9% for transpose and butterfly traffic, respectively.
We also compared the maximum packet injection rate for a fixed average delay with different
routing algorithms. The results are summarized in Table 5.3. In this scenario, a larger injection
rate implies a better utilization of network throughput. The results show that DP outperforms the other
routing algorithms by 22.3% by utilizing real-time traffic information. The other dynamic
routing scheme, Odd-Even routing with NoP selection, also outperforms deterministic routing
algorithms such as XY.
5.5.3 Results for k-Step Look Ahead
The recently proposed Node-on-Path (NoP) approach in [ACPP08] is a special case of the k-step
Look Ahead. In NoP, each router chooses the routing direction based on the queue information that
is two steps away from the current node. A hill-climbing heuristic is implemented for the routing.
Table 5.3: Comparisons for packet injection rates between five routing algorithms.
Packet injection rate (packet/cycle/node × 10⁻³)
Traffic          XY      DyAD    Odd-Even  NoP     DP
Random-1         5.06    5.05    4.92      5.24    5.45
Random-2         3.07    3.12    4.52      4.54    5.08
Transpose        10.8    11.1    13.0      14.7    17.3
Butterfly        20.1    20.9    21.0      21.1    29.2
DP improvement   28.9%   27.5%   18.4%     14.3%
However, the NoP approach does not compute the DP values for the destination nodes; instead,
a score value, which resembles the DP expected delay, is computed on demand. For the DP-network,
the DP value is computed by the DP-network and distributed to all routers. This provides
a fast decision time, as only a simple table look-up is required when the header flit arrives. In the
following, the experimental results comparing the NoP and KSLA algorithms are discussed.
A special transpose traffic scenario is considered with a packet injection rate of 0.02 packets per
cycle per node. The performance of KSLA with different k, XY and NoP routing is shown in
Fig. 5.12. When k = 0, KSLA has the same performance as XY. This is because the routing table
is initialized following the XY routing scheme and is never updated. For the case
of k = 2, KSLA provides a performance similar to NoP (the average delay is 124 cycles for NoP
and 108 cycles for DP). This suggests that NoP resembles a special case of KSLA routing, specifically
when k = 2. By increasing the k value, the routing delay is further reduced until it converges to 42
cycles. These results confirm the trade-off in routing optimality with different
k steps, as shown in the earlier Monte Carlo simulation in Fig. 5.7.
5.5.4 Summary
This section presented a novel DP-network for fully optimal routing in Network-on-Chip (NoC).
The DP-network provides on-the-fly shortest path computation using distributed dynamic pro-
gramming and enables dynamic routing based on the real-time traffic conditions and congestions.
Also, a k-step Look Ahead (KSLA) routing strategy has been presented. It provides a trade-off
between routing optimality and memory consumption. Experimental results demonstrate the
performance and merits of optimal routing over other deterministic and adaptive routing approaches,
[Figure 5.12 plot. x-axis: k-step look ahead (step); y-axis: average packet delay (cycle); series: NOP, XY, DP (K-Step).]
Figure 5.12: Comparison between k-step look ahead, Odd-Even using a NoP selection and the XY
routing approaches.
which are based on partial or local traffic information. The optimal DP-network based routing
outperforms XY routing by 28.9% and also improves on other adaptive routing strategies, such as
adaptive Odd-Even, by 18.4%. Note that the DP-network approach may result in out-of-order flits
because of the dynamic changes in the routing tables. Simpler schemes, such as XY routing, can
avoid out-of-order flits in a cost-effective way. It is interesting to observe that the new KSLA
approach is a generalization of other adaptive routing algorithms that apply hill-climbing
heuristics for route planning.
5.6 Hardware Implementation
In this section, an implementation of the DP-network and the dynamic-routing-enabled NoC
architecture is presented. Comparisons of the utilization of hardware resources, clock frequencies
and power dissipation are discussed.
[Figure 5.13 diagram. Router blocks: Arbiter; Switch; input and output ports (North, West, East, South, Local); Routing table (Destination to Routing direction); DP Computational Unit (connected to North, South, East, West); Queue length prediction; Update routing table.]
Figure 5.13: Design of a router that comprises an optimal path computation unit.
5.6.1 Router Architecture
Fig. 5.13 shows the architecture of a router that enables dynamic routing. The router design
is similar to that used in NoC [Hu05]. An additional block implements the dynamic routing
algorithm. The queue length prediction unit captures the queue length from the input FIFO and
evaluates the communication cost for that particular direction. The routing table stores the routing
directions, which are constantly updated by the DP-network. The shortest path computation and
optimal routing mechanism are implemented using the DP computational unit, which is shown in
Fig. 5.14. Computational units from different routers are interconnected to form a DP-network.
This figure indicates that the computational network is simultaneously computing the shortest path
while the routers keep feeding new cost estimates into the network.
The shortest path computation requires a minimum operation to evaluate and compare the cost
[Figure 5.14 diagram. Adders combine Cost(East), Cost(South), Cost(West), Cost(North) with V(East), V(South), V(West), V(North); a tree of >= comparators and 0/1 multiplexers produces V(out) and the direction bits D0 and D1.]
Figure 5.14: Design of the DP computational unit.
of all actions at each node. Also, adders are required to sum the cost at the current node
and the expected cost associated with each action, as shown in Eq. 5.6. In addition, a multiplexer is
needed to output the action associated with the minimum expected cost. Therefore, the basic circuit in a
DP computational unit comprises four adders, three comparators and three multiplexers. This
circuit can be further extended to support more inputs by increasing the number of adders,
minimizers and operators.
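The behaviour of this unit can be sketched in software. The following is a behavioural model only, not the hardware description used in the thesis; the function name `dp_unit`, the direction ordering and the two-level comparison tree are assumptions made for illustration.

```python
def dp_unit(costs, values):
    """Behavioural model of a four-input DP computational unit.

    costs[d] is the local cost estimate for direction d and values[d] is
    the expected cost received from the neighbour in that direction.
    Returns (v_out, direction) for the minimum expected cost.
    """
    order = ("east", "south", "west", "north")

    # Four adders: cost of the action plus expected cost at the neighbour.
    sums = [(costs[d] + values[d], d) for d in order]

    def select(a, b):
        # One comparator driving one multiplexer.
        return a if a[0] <= b[0] else b

    # Three comparator/multiplexer pairs arranged as a two-level tree.
    first = select(sums[0], sums[1])
    second = select(sums[2], sums[3])
    return select(first, second)


if __name__ == "__main__":
    costs = {"east": 3, "south": 1, "west": 4, "north": 2}
    values = {"east": 5, "south": 9, "west": 1, "north": 7}
    print(dp_unit(costs, values))
```

Supporting more inputs in this model amounts to adding one adder and one comparator/multiplexer pair per extra direction.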
[Figure 5.15 flow chart. Destination -> Check Manhattan distance. If less than or equal to k steps: Get direction from routing table -> Routing direction. If more than k steps: Get boundary node addresses -> Find the node with minimum expected cost -> Get direction from routing table -> Routing direction.]
Figure 5.15: The data flow routine for the k-step look ahead algorithm.
The data flow diagram for the KSLA algorithm is shown in Fig. 5.15. When the destination infor-
mation is obtained from the packet, Manhattan distance to the destination from the current node
is calculated. If the distance is smaller than or equal to k, the routing direction to the destination
can be directly obtained from the routing table. Otherwise, the nodes within the k-step regions are
obtained. The nodes in the k-step region are temporary destinations that are k steps away from
the current node. For a typical mesh topology, it is relatively trivial to obtain the temporary nodes,
which can be done by using the Manhattan distance and look-up tables. One node is selected based
on an arbitrary selection scheme. Other selection schemes can be used, such as using the expected
cost or traffic. For simplicity, a node is selected randomly in this experiment. The address of this
node will be input to the routing table to obtain the routing direction.
5.6.2 Results of FPGA Implementations
To further evaluate the effectiveness and the hardware cost of the proposed methodology, a DP-
network is implemented using a Xilinx Virtex-4 XC4LX80 FPGA device. A mesh Network-
on-Chip is implemented using System Generator [Xil06b] and synthesised using the Xilinx ISE
synthesis tools. The design has been placed and routed to obtain the hardware area consumption
and power dissipation results.
The experiment is designed to evaluate the hardware overhead of the two different routing meth-
ods, which are the DP-network and KSLA routing. The DP-network routing employs a full routing
table, which provides optimal routing directions for all destinations in the network. The KSLA
provides routing directions for destinations that are k steps away. The XY routing is also im-
plemented as a reference. Algorithmic routing is employed for computing the routing directions
for the XY routing. Similar to other NoC architectures, a wormhole routing mechanism is implemented.
Convergence of DP-Network in an FPGA
The performance and network convergence of the DP-network in an FPGA realization is studied.
The DP-networks with topologies of 3 × 3, 4 × 4, 5 × 5 and 6 × 6 are considered. The network
convergence rate for evaluating the shortest paths problems can be observed from the outputs
of the computational units. These outputs are captured at each time stamp and compared with
the optimal values in order for the root-mean-square errors to be computed. Fig. 5.16 shows the
errors of the DP-network in different clock cycles and in different network topologies. It can be
observed that the network converges to the optimal solution in 5 to 11 clock cycles, depending on
[Figure 5.16 plots. Panels: (a) 3×3 network, (b) 4×4 network, (c) 5×5 network, (d) 6×6 network; x-axis: clock cycle; y-axis: root-mean-square error.]
Figure 5.16: Convergence of DP-network in an FPGA implementation. The y-axis is the root-mean-square
error between the optimal values and the value outputs of each computational unit. The
x-axis is the clock cycle. The period of the clock cycle is not specified here and varies with
different FPGA devices.
the network configurations. The convergence time of the mesh network is bounded by $2\sqrt{n} - 1$ clock cycles,
where n is the total number of nodes in a mesh. Suppose the network is operating at 200 MHz; the
convergence time is then bounded by $5(2\sqrt{n} - 1)$ ns. The DP-network can rapidly evaluate the shortest
path and provide optimal path planning for dynamic routing.
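As a worked instance of this bound, take the smallest and largest networks considered here, with the 5 ns period that corresponds to a 200 MHz clock:

$$n = 9:\quad 2\sqrt{9} - 1 = 5 \text{ cycles},\qquad 5 \times 5\,\text{ns} = 25\,\text{ns}$$

$$n = 36:\quad 2\sqrt{36} - 1 = 11 \text{ cycles},\qquad 11 \times 5\,\text{ns} = 55\,\text{ns}$$

These two values match the observed range of 5 to 11 clock cycles for the 3 × 3 to 6 × 6 networks.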
Table 5.4: Hardware area results for the DP-networks. Results are obtained by using a Xilinx
Virtex-4 XC4VLX40 FPGA
Network format   Area (slice)   Power (mW)   Frequency (MHz)
3 × 3            337            2.76         100.76
4 × 4            805            6.84         83.6
5 × 5            1416           9.06         80.1
6 × 6            1977           12.4         76.2
Hardware Results
The hardware area consumption of routers with 5 input and output ports is summarized in
Table 5.5. The resource consumption of the XY router in this work is similar to that of the
implementation reported in [OM06]. The overhead of a DP-network router is small: the overall
area is slightly larger than that of the XY router, with the DP router using 20.6% more slices.
For the KSLA router, the area overhead is 40.3%. The KSLA employs more hardware resources
for the procedures that evaluate the intermediate nodes for sub-optimal routing. Theoretically,
the KSLA can significantly reduce the memory consumption. However, memories are mapped to
BRAM in Xilinx FPGAs, which is a scratchpad RAM and has a fixed hardware area. The BRAM is
large enough to store a complete routing table for an 8 × 8 network. Although the KSLA approach
requires fewer memory addresses, the BRAM utilization of DP and KSLA is still the same.
Nonetheless, the reduction in routing table size is useful when other forms of on-chip memory,
such as registers, are employed. Since the router area is still dominated by the input FIFO buffers, the
area overhead of the DP-network can be negligible. As seen in Table 5.5, the DP overhead is
only 23% for a typical buffer size. The DP-network with the continuous-time formulation can be
implemented using an analogue circuit as proposed in [MSC+07], in which case the hardware area and
power consumption can be significantly reduced.
Table 5.5: Hardware area results for the XY, DP and KSLA routers. Results are obtained based on
a Xilinx Virtex-4 XC4VLX40 FPGA
Routing algorithm   Buffer size   Area (slice)   Device utilization   BRAM utilization   Frequency (MHz)
XY                  16            310            1%                   10                 136
DP (Optimal)        16            374            2%                   11                 122.4
KSLA (k = 4)        16            435            2%                   11                 108.3
XY                  32            320            3%                   10                 135
DP (Optimal)        32            394            4%                   11                 120
KSLA (k = 4)        32            445            3%                   11                 105
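The overhead percentages quoted in the text follow from the slice counts and frequencies in Table 5.5. A short sketch of the arithmetic is given below; the helper name `overhead_percent` and the dictionary layout are illustrative only, while the numbers are copied from the table.

```python
def overhead_percent(value, reference):
    """Relative difference of value with respect to reference, in percent."""
    return 100.0 * (value - reference) / reference


# Slice counts from Table 5.5, keyed by buffer size.
AREA_SLICES = {
    16: {"XY": 310, "DP": 374, "KSLA": 435},
    32: {"XY": 320, "DP": 394, "KSLA": 445},
}
# Clock frequencies in MHz from Table 5.5 for a buffer size of 16.
FREQ_MHZ = {"XY": 136.0, "DP": 122.4, "KSLA": 108.3}

if __name__ == "__main__":
    a16, a32 = AREA_SLICES[16], AREA_SLICES[32]
    print(overhead_percent(a16["DP"], a16["XY"]))      # DP area overhead, buffer 16
    print(overhead_percent(a16["KSLA"], a16["XY"]))    # KSLA area overhead, buffer 16
    print(overhead_percent(a32["DP"], a32["XY"]))      # DP area overhead, buffer 32
    print(overhead_percent(FREQ_MHZ["XY"], FREQ_MHZ["DP"]))  # XY speed advantage
```

The first two results correspond to the 20.6% and 40.3% area overheads, the third to the figure of about 23% for the larger buffer, and the last to the frequency advantage of about 11% for XY over DP.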
5.7 Conclusion
This chapter presented a novel DP-network for fully optimal routing in network-on-chip (NoC).
The DP-network provides on-the-fly shortest path computation by using distributed dynamic pro-
gramming and updating the routing table for optimal path planning based on the traffic conditions
and congestions. Also, a k-step Look Ahead (KSLA) routing strategy has been presented, which
provides trade-offs between routing optimality and memory consumption. Experimental results
confirm the performance and merits of optimal routing over other deterministic and adaptive
routing approaches, which are based on partial and local traffic information. The new DP-network
provides an interesting alternative to static routing schemes. The optimal DP-network based routing
outperforms XY routing by 28.9% and also improves on other adaptive routing strategies, such as
Odd-Even, by 18.4%. However, simpler schemes, such as XY routing, can effectively avoid
out-of-order flits. In addition, XY is around 23% smaller in terms of hardware area and
11% faster in operating frequency when compared to the DP-network approach. It has been observed
that the new KSLA approach is a generalization of other adaptive routing algorithms that apply
hill-climbing heuristics for latency minimization. The KSLA approach offers great flexibility
and a choice of optimality for different levels of memory consumption. Moreover, the hardware
overhead of the DP-network has been examined. It was found that the DP-network consumes less than
20.6% of extra hardware area when compared to deterministic routing algorithms for a standard
router design. The DP-network offers a new and effective alternative for high-performance on-chip
communication.
Chapter 6
Conclusion and Future Work
6.1 Summary
On-chip communication has become a primary concern in digital system design due to the rapid
deterioration in wire performance caused by the aggressive technology scaling in semiconductor
manufacturing. High-performance systems take advantage of inexpensive and plentiful functional
units and processors to provide massively parallel computation and acceleration. However, the
drastically increased demand on silicon-bandwidth quickly becomes a bottleneck. Communication
bandwidth, not arithmetic, is the critical resource in a modern computing system that dominates
cost, performance, and power.
Design for bandwidth emphasizes the utilisation of silicon resources for signal and information
transportation instead of arithmetic and computation. Bandwidth can be enhanced both by (i)
designing new signalling schemes that exploit semiconductor physical characteristics and by (ii)
introducing intelligence and “on-the-fly” optimization to transport information inside a chip effi-
ciently. These new approaches provide a promising high-bandwidth solution that can mitigate the
future on-chip communication challenges.
This thesis has argued that the reconfigurable architecture in current FPGAs is largely unfavourable
for realizing global interconnections, such as long lines for on-chip networks. Bandwidth
degradation due to fringed interconnects, as a result of congested routing, has been discovered
and analyzed with stochastic mathematical modeling, as presented in Chapter 3. Increasing
communication bandwidth merely by adjusting bit-width results in a large degradation, up to
60%, which eventually leads to inefficient communication systems that hinder the overall
system performance.
As an alternative, this thesis proposed that on-FPGA communication bandwidth can be efficiently
increased by exploiting the electrical properties in the underlying reconfigurable architecture, by
means of wave-pipelining. Using an analogue wave intra-chip signalling scheme, throughput of
interconnection can be increased up to 5.6 times. Evaluations of the wave-pipelining scheme
using SPICE modeling and devices from the Xilinx Virtex-4 family are presented. These have
demonstrated the implementation of test circuits in real-world commercial FPGAs, as presented
in Chapter 4.
Because of the increase in communication and system complexities, traffic for on-chip transporta-
tion is difficult to predict at design time. Optimization and synthesis of the communication archi-
tecture at design time may not be able to fully utilize the hardware resources to adapt to dynamic
traffic. Therefore, incorporating intelligent plasticity with “on-the-fly” optimization in the on-chip
communication system will enable a substantial improvement in communication bandwidth
and efficiency. A Dynamic Programming (DP)-network is introduced in Chapter 5 to enable real-time
optimization for on-chip communication. The DP-network implements an optimization algorithm
with a distributed architecture, which enables the network-on-chip (NoC) to deliver optimal
path planning for packet routing. The network consolidates and adapts to the real-time traffic. It has
been demonstrated that the DP-network can improve the overall delay performance by diverting
the traffic to avoid hot spots when compared to other static routing approaches. The DP-network
provides an interesting alternative to enable adaptation to the network traffic and can potentially
be applied to fault-tolerant routing in NoC.
6.2 Future Work
The work in this thesis can be continued and extended in a variety of directions.
Firstly, the primary objective of the study presented in this thesis is the optimization of on-chip
throughput or bandwidth performance. In particular, a signalling scheme adopting an analogue
wave can substantially enhance throughput. However, as discussed in Chapter 4, numerous dynamic
noise sources and static skew can hamper the performance and may even create a communication
fault. Similarly, other recently proposed novel aggressive signalling schemes also exhibit a
potential vulnerability to dynamic errors. From a reliability perspective, this is also a critical issue
to be investigated. Some steps in this direction have recently been taken by Teehan, Lemieux and
Greenstreet in [TLG09].
Besides, with aggressive scaling, interconnects at nanometre-scale are increasingly vulnerable to
manufacturing defects, process variability and lifetime degradation. The reduction of lithographic
pattern width and separation has led to more defects. Gradual degradation and failure of intercon-
nects in future technologies obligates pessimistic over-design. Interconnect reliability is further
reduced by the nanoscale’s inherent random parametric behaviour. Reliable on-chip communica-
tion is crucial. In the context of communication-centric system design, Network-on-Chip (NoC)
provides an effective solution with high level routing and reliable communication protocols for
complex computer architectures. Faulty regions or paths can be avoided by using this on-chip
dynamic routing network.
Three key directions have been identified that provide the fundamental basis for future research.
(i) Adaptation with “on-the-fly” optimization is a promising technique to implement reliable ser-
vices within an unreliable environment, potentially avoiding the high cost of classical redundancy.
As demonstrated in Chapter 5, a DP-network with real-time resolution of the optimization problem
can substantially enhance communication performance. Such an approach can be further extended
to provide real-time control of power and reliability, and potentially enabling new functionalities.
Current methods tend to focus on design-time optimization that would limit the capability to tackle
the sophisticated dynamic on-chip environment.
(ii) Asynchrony and clock domain decoupling have been proposed as a potential solution to the
problems caused by variation, and continued scaling has forced asynchrony onto long links. As
VLSI systems continue to scale, judicious system partitioning and integration with asynchrony
can potentially enhance the reliability of communication systems.
(iii) Virtualization is an important technique in system design for reducing designer effort and
improving designer efficiency. Currently it is not being explored at a level of abstraction (across
the physical and data link layers) suitable for on-chip communication links. Virtualization can
provide an alternative solution for reliable communication in on-chip links. This can be done by
isolating the system designer from the imperfections of the underlying technology and encapsulating
the complexity of the underlying communication architectures.
In conclusion, while this thesis demonstrated significant progress in the investigation of bandwidth
utilization and degradations (Chapter 3), intra-chip bandwidth optimization using analogue wave
on-FPGA signalling (Chapter 4) and the novel DP-network for on-chip dynamic routing (Chapter
5), a wide variety of objectives, architectures and new technology adaptations will benefit from
further study.
Appendix A
A Simple Approximation of
Interconnections Length
A simple methodology to approximate interconnection length in communication link [MSCL07]
is presented. This approach provides a fast approximation with simple algebraic modelling for
interconnection length and channel utilization.
A.1 Average Interconnections Length
Consider a typical communication link of S bits whose two placement constraint areas are L
tiles apart. Each placement area A has width u and length v, thus A = uv, as shown in
Fig. 3.3(a). Let the number of long wires in each channel be W, so that the number of channels
required to implement an S-bit communication link is S/W. It is also assumed that pins are
placed randomly within the constrained region. This is a reasonable assumption, as the
hierarchical placement assumption for random logic is not applicable to communication links.
The average interconnection length R is then given by
\[ R = \frac{1}{S} \sum_{k=1}^{S/W} W \cdot R_k \tag{A.1} \]
where R_k is the average length of the interconnections going through channel k. There are two
possible cases: u < S/W, in which some interconnections are fringed, and u ≥ S/W, in which all
interconnections are direct. Two sets of expressions are therefore derived separately.
A.2 Mathematical Derivation
 
Figure A.1: The realization of a communication link to a typical island-based FPGA.
The average interconnection length for the case u ≥ S/W is denoted by R_{u≥S/W}. Let index
i be the channel index in the bus cross-section counted from the center of the bus, and let index j be
the channel index counted from the boundary between the direct and fringed interconnection regions
of the bus. Then 2R(i) + L is the average interconnection length for an interconnection
at channel i, where R(i) is the average distance from the end point of the long wire channel
(for example M1 and M2 in Fig. A.1) to any tile in the placement area. The average
interconnection length can then be computed as follows,
\[ R_{u \ge S/W} = \frac{W}{S}\left[ 4 \sum_{i=1}^{S/2W} R(i) + \frac{SL}{W} \right] \tag{A.2} \]
The term R(i) is the expected length of an interconnection within the placement area. Assuming
that the placement is uniformly distributed, the expected value can be computed as the
total number of steps required to visit every tile from channel i, divided by uv. Considering the
placement area as two separate regions, R(i) is given as follows,
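\[ R(i) = \frac{\left[\left(\frac{u}{2}-i\right)\sum_{k=1}^{v}k + v\sum_{k=1}^{u/2-i}k\right] + \left[\left(\frac{u}{2}+i\right)\sum_{k=1}^{v}k + v\sum_{k=1}^{u/2+i-1}k\right]}{uv} = \frac{v+1}{2} + \frac{u}{4} + \frac{i^{2}}{u} - \frac{i}{u} \tag{A.3} \]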
By substituting Eq. A.3 into Eq. A.2, the average interconnections length for u ≥ S/W can be
obtained as
\[ R_{u \ge S/W} = \frac{u}{2} + \frac{1}{u}\left[\left(\frac{S}{2W}+1\right)\left(\frac{S}{3W}-\frac{2}{3}\right) + A\right] + L + 1 \tag{A.4} \]
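The substitution leading to Eq. A.4 can be spelled out as follows. This is a reconstruction of the intermediate algebra rather than a step given in the text; the shorthand n = S/2W is introduced here only for brevity, and the last line uses A = uv.

```latex
\begin{align*}
\sum_{i=1}^{n}\left(i^{2}-i\right)
  &= \frac{n(n+1)(2n+1)}{6}-\frac{n(n+1)}{2}
   = \frac{n(n+1)(n-1)}{3}, \qquad n=\frac{S}{2W},\\
\frac{4W}{S}\sum_{i=1}^{n}R(i)
  &= \frac{2}{n}\left[n\left(\frac{v+1}{2}+\frac{u}{4}\right)+\frac{n(n+1)(n-1)}{3u}\right]
   = v+1+\frac{u}{2}+\frac{1}{u}\,(n+1)\,\frac{2n-2}{3},\\
R_{u\ge S/W}
  &= \frac{u}{2}+\frac{1}{u}\left[\left(\frac{S}{2W}+1\right)\left(\frac{S}{3W}-\frac{2}{3}\right)+uv\right]+L+1 .
\end{align*}
```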
For the case u < S/W , the average interconnections length Ru<S/W comprises both fringed and
direct interconnections. Therefore, it can be considered as a function of the two different cases as
follows,
\[ R_{u < S/W} = \frac{W}{S}\left[ 4\sum_{j=1}^{S/2W-u/2} Q(j) + 4\sum_{i=1}^{u/2} R(i) + \frac{SL}{W} \right] \tag{A.5} \]
where Q(j) is the average distance from the end point of a fringed long wire channel (for example
M3 and M4 in Fig. A.1) to any tile in the placement area, and R(i) is given by Eq. A.3.
\begin{align}
Q(j) &= \frac{v\sum_{k=1}^{u}k + u\sum_{k=1}^{v}k + (j-1)uv}{uv} \tag{A.6} \\
     &= \frac{u+v}{2} + j \tag{A.7}
\end{align}
By substituting Eq. A.7 into Eq. A.5, the following expression for average interconnection lengths
can be obtained:
\[ R_{u < S/W} = \frac{2W}{3S}\left(\frac{u^{2}}{4}-1\right) + \frac{A}{u} + \frac{S}{2W} + L + 1 \tag{A.8} \]
A.2.1 Summary
The average interconnection lengths are expressed in Eqs. A.4 and A.8 for the cases of u ≥ S/W
and u < S/W, respectively. It is interesting to notice that the aspect ratio of the placement area is
crucial to the interconnect length. If u < S/W, the length grows quadratically with u, whereas
if u ≥ S/W, the length grows linearly. In other words, a communication link with fringed
interconnections is affected quadratically by the aspect ratio of the placement area.
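The two closed forms can be cross-checked numerically against the channel sums they were derived from. The sketch below is illustrative only: the function names and the parameter values (S = 512, W = 16, L = 100, v = 10) are assumptions made for this example, not values taken from the experiments, and S/2W and u/2 are assumed to be integers.

```python
import math

def r_direct(i, u, v):
    """Eq. A.3 (closed form): mean distance from direct channel i to a tile."""
    return (v + 1) / 2 + u / 4 + (i * i - i) / u

def q_fringed(j, u, v):
    """Eq. A.7: mean distance from fringed channel j to a tile."""
    return (u + v) / 2 + j

def length_by_sums(S, W, L, u, v):
    """Average length from Eq. A.2 (u >= S/W) or Eq. A.5 (u < S/W)."""
    half_channels = S // (2 * W)   # S/2W, assumed integer
    half_width = u // 2            # u/2, assumed integer
    if u >= S / W:
        total = 4 * sum(r_direct(i, u, v) for i in range(1, half_channels + 1))
    else:
        fringed = range(1, half_channels - half_width + 1)
        total = 4 * sum(q_fringed(j, u, v) for j in fringed)
        total += 4 * sum(r_direct(i, u, v) for i in range(1, half_width + 1))
    return (W / S) * (total + S * L / W)

def length_direct(S, W, L, u, v):
    """Eq. A.4: closed form for u >= S/W, with A = u * v."""
    A = u * v
    return u / 2 + ((S / (2 * W) + 1) * (S / (3 * W) - 2 / 3) + A) / u + L + 1

def length_fringed(S, W, L, u, v):
    """Eq. A.8: closed form for u < S/W, with A = u * v."""
    A = u * v
    return (2 * W / (3 * S)) * (u * u / 4 - 1) + A / u + S / (2 * W) + L + 1

def length_closed_form(S, W, L, u, v):
    """Pick Eq. A.4 or Eq. A.8 according to the width of the placement area."""
    if u >= S / W:
        return length_direct(S, W, L, u, v)
    return length_fringed(S, W, L, u, v)

if __name__ == "__main__":
    for u in (8, 20, 32, 40):      # S/W = 32: the first two widths are fringed
        print(u, length_by_sums(512, 16, 100, u, 10),
              length_closed_form(512, 16, 100, u, 10))
```

Under these assumptions the summed and closed-form values agree for every width, and the two closed forms coincide at the boundary u = S/W.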
A.3 Results for the Simple Approximation
To relate the average interconnection length to the actual circuit delay, the delay is assumed
to be linear in the interconnect length, that is, the delay is αR + β, where α
and β can easily be measured on any FPGA. Fig. 3.15 shows that the interconnection delay
and the average interconnection length in tiles have a linear relationship.
Fig. A.2 shows the theoretical average interconnection delay versus the experimental delay in a
Virtex-4 (XC4VLX200) FPGA. The experimental results match the theoretical prediction well.
In particular, the theoretical model predicts the general trend that delay decreases
quadratically with the parameter u, the width of the placement area. Table A.1 compares
some experimental and theoretical interconnection delays. It can be seen
that the relative error |D−D′|/D is reasonably small and the overall average error is 4.6%, lending
credence to the theory.
Table A.1: Comparison of experimental and theoretical interconnections delay
Bits (S)  Length (L)  Width (u)  D (ns, exp.)  D′ (ns, theor.)  |D′ − D|/D
64        100         8          1.9           1.84             0.032
128       100         5          2.07          2.11             0.019
128       100         20         1.93          1.90             0.016
256       100         5          2.10          2.61             0.24
256       100         20         2.03          2.02             0.0049
512       50          5          2.44          2.63             0.078
512       50          10         1.93          1.87             0.031
512       50          20         1.81          1.57             0.16
512       100         5          2.87          3.61             0.26
512       100         20         2.18          2.26             0.037
[Figure A.2 plots: delay (ns) against width (tile), experimental and theoretical curves, for (a) a 512-bit link, (b) a 256-bit link and (c) a 128-bit link.]
Figure A.2: Theoretical and the experimental interconnections delay.
Appendix B
Power Consumptions for Long
Interconnections
Table B.1 shows the power dissipation of long interconnections in different FPGA
circuits. These results were obtained using VPR¹ with 20 MCNC benchmark circuits, which
include FPGA circuits of different sizes. The power estimation package [PYW02] was also used to
obtain the power dissipation of the circuits and interconnections. According to this table, only
4.5% and 8.6% of the interconnects are longer than 50 and 30 tiles respectively, which
are considered long interconnections. However, these long interconnections consume a significant
proportion of the power in a circuit design and dominate the power consumption.
Specifically, interconnections longer than 50 and 30 tiles consume 38.6% and 43.8% of the
overall power in the circuits, respectively.
Fig. B.1 shows the power dissipation of 5 selected benchmark circuits for different interconnection
lengths. The average curve shows the overall average power for the 20 MCNC circuits. The
overall simulation results for all circuits are shown in Table B.2. These 5 circuits are large circuits
with LUT consumptions between 1800 and 3600. The results were obtained using the VPR tool
together with the power estimation package. Each data point corresponds to the average power
dissipation of interconnections of a particular length. As can be seen from the figure, power
grows exponentially with the length. In particular, for interconnects longer than 50 tiles,
the power dissipation increases drastically.
¹ VPR (Versatile Place and Route) is a software tool that models different architectures and can evaluate an
architecture and computer-aided design flow using benchmark circuits.
[Figure B.1 plot: power (W, 0 to 8 × 10^-4) against interconnection length (tile, 0 to 200) for elliptic, frisc, s38417, ex1010, seq and the average.]
Figure B.1: Power dissipation of 5 MCNC benchmark circuits for different interconnection
lengths.
Circuit    Number of LUTs  % net, L≥50  % net, L≥30  % Power (L≥50)  % Power (L≥30)
alu4       1522            2.3          6.9          7.7             15.9
apex2      1878            3.6          7.68         17              23.7
apex4      1262            3.13         9.39         46.5            50
bigkey     1707            0.67         1.15         27.4            28.2
clma       8381            6.3          13.8         46.6            56.8
des        1591            0.56         1.46         8.78            11.08
diffeq     1494            5.1          8.9          29.6            39.1
dsip       1370            0.44         0.87         19.6            20.4
elliptic   3602            5.7          10.2         54              59.6
ex5p       1064            8            10.6         38.78           42
ex1010     4598            1.62         7.98         71.4            75.5
frisc      3539            6.53         13.37        51.2            58.75
misex3     1397            5            10.11        21              33.4
pdc        4575            9.65         15.4         63.65           66.78
s298       1930            3.08         3.96         65.46           65.83
s38417     6096            9.65         15.4         63.65           66.78
seq        1750            4.88         8.2          22.9            29.2
spla       3690            7.6          12.97        59.4            63.4
tseng      1046            2.04         5.04         18.07           25
Average                    4.52         8.60         38.56           43.76
Table B.1: Distribution and power consumption percentages for interconnections with lengths 30
and 50 tiles in 20 MCNC benchmark circuits.
Power (W) by interconnection length (tile)
Circuit    1-20     21-40    41-60    61-80    81-100   101-120  121-140  141-160  161-180  181-200  201-220
alu4       4.07e-6  2.73e-5  5.80e-5  8.59e-5  1.01e-4  1.27e-4  -        -        -        -        -
apex2      2.79e-6  2.58e-5  7.81e-5  9.43e-5  1.61e-4  2.03e-4  -        -        -        -        -
apex4      2.32e-6  6.85e-6  7.72e-6  6.24e-6  1.65e-4  2.08e-4  2.51e-4  2.37e-4  3.08e-4  5.58e-4  5.93e-4
bigkey     2.92e-5  1.16e-4  1.16e-4  1.16e-4  4.44e-4  -        -        -        -        -
clma       1.55e-6  8.32e-6  1.25e-5  2.34e-5  3.79e-5  5.46e-5  7.22e-5  4.04e-5  6.32e-5  6.33e-5  1.55e-4
des        1.26e-5  1.00e-4  1.50e-4  1.50e-4  1.50e-4  3.50e-4  4.50e-4  4.50e-4  8.50e-4  1.02e-3  2.10e-3
diffeq     4.35e-6  3.44e-5  4.56e-5  1.93e-5  3.48e-5  3.33e-4  1.40e-4  3.61e-4  -        -        -
dsip       1.95e-5  1.05e-4  2.01e-4  2.47e-4  -        -        -        -        -        -        -
elliptic   2.27e-6  1.32e-5  2.17e-5  7.63e-5  1.04e-4  9.94e-5  8.35e-5  1.37e-4  1.95e-4  1.78e-4  2.79e-4
ex5p       7.82e-6  2.45e-5  3.57e-5  9.53e-5  1.92e-4  2.85e-4  3.44e-4  4.12e-4  -        -        -
ex1010     4.15e-7  2.98e-6  3.28e-6  1.04e-6  1.02e-6  1.04e-6  1.02e-6  1.08e-4  1.28e-4  1.62e-4  3.62e-4
frisc      5.98e-7  4.09e-6  1.68e-5  2.57e-5  4.49e-5  6.03e-5  6.33e-5  6.33e-5  6.26e-5  1.76e-4  3.18e-4
misex3     4.68e-6  5.63e-5  1.23e-4  1.49e-4  2.07e-4  1.49e-4  4.59e-4  -        -        -        -
pdc        4.53e-7  1.69e-6  4.00e-6  8.63e-6  1.77e-5  1.30e-5  2.23e-5  1.82e-5  3.69e-5  6.66e-5  5.35e-5
s298       1.90e-6  1.39e-5  6.39e-5  6.40e-5  6.39e-5  1.93e-4  2.30e-4  5.51e-4  3.31e-4  7.63e-5  9.63e-5
s38417     9.52e-6  4.93e-5  9.63e-5  1.18e-4  1.68e-4  2.10e-4  3.84e-4  3.95e-4  4.91e-4  5.81e-4  6.10e-4
seq        3.59e-6  2.80e-5  7.41e-5  1.22e-4  1.57e-4  1.55e-4  1.65e-4  2.83e-4  2.93e-4  3.20e-4  4.83e-4
spla       7.21e-7  3.10e-6  9.14e-6  1.72e-5  2.29e-5  2.88e-5  3.10e-5  6.24e-5  7.98e-5  1.43e-4  2.08e-4
tseng      8.52e-6  4.85e-5  9.48e-5  6.03e-6  5.94e-4  -        -        -        -        -        -
Average    6.15e-6  3.52e-5  6.38e-5  7.50e-5  1.48e-4  1.54e-4  1.93e-4  2.40e-4  2.58e-4  3.04e-4  4.78e-4
Table B.2: Power consumption (W) for different interconnections of MCNC benchmark circuits. Note that some
circuits do not have interconnections of a specific length.
Appendix C
Wave-Pipelined Interconnect Testing
Figure C.1: Failure rate detection circuit [WSC07] for global interconnection.
An example of interconnect wave-pipelining has been realized in a Xilinx Virtex-4 XC4VLX200
FPGA device (speed grade -1). A test circuit for interconnect wave-pipelining was implemented,
following the testing and measurement circuit in [WSC07]. That circuit was originally designed
for measuring the on-chip data path delay. By replacing the combinational path under test
with a global interconnection, the interconnection delay and wave-pipelining throughput can
be examined. The circuit is shown in Fig. C.1. A test stimuli generator generates
test patterns, which are captured by the sink register. The test pattern data rate equals
the clock frequency. The test pattern traverses the long interconnection, which is
built from a combination of interconnect segments. Each test datum is launched
from the source register at the positive edge of clkA and registered at the sink at the negative
edge of clkB, where clkB is a phase-shifted copy of clkA. Further details of the measurement circuit
can be found in [WSC07].
[Figure C.2 plot: phase shift (degrees, -50 to 250) against frequency (0.8 to 2.6 × 10^8 Hz), showing (A) the single bit transmission region, (B) the failure region and (C) the wave-pipelining region; annotations mark zero phase shifting, the maximum frequency for zero phase shifting, and where the data period equals the interconnection delay.]
Figure C.2: Measurement results corresponding to the testing circuit in Fig. C.1.
For synchronous communication, the phase shift of clkB is zero. The maximum clock frequency
equals half the reciprocal of the interconnection delay, as only half a clock cycle is
allowed for the data to traverse the link. For wave-pipelined communication, the interconnection
can support a higher throughput, so data can be injected into the interconnection at a higher rate.
By increasing the phase of clkB, data with a higher data rate can be captured at the sink.
The results of successful and failed communications for different clock frequencies and phase
shifts are shown in Fig. C.2. The grey-shaded area is the failure region (B). In this region, data
fail to be captured at the sink register after being transmitted along the long interconnection.
In contrast, communications are successful in regions (A) and (C). Communication in region (C) is
carried by wave-pipelining, as the data bit period is shorter than the interconnection
delay and multiple bits traverse the line simultaneously. For the circuit without phase
shifting, the maximum frequency is 83 MHz, as indicated in the figure. For the circuit
with clock shifting, on the other hand, the clock frequency can be raised to 248 MHz, which
is about three times that of the conventional synchronous approach. A higher clock frequency can be
obtained with judicious design of the shifted clock.
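The half-cycle rule above can be turned into a small calculation. The sketch below is only an illustration: the function names are hypothetical, and the interconnection delay is not a measured value but is inferred from the reported 83 MHz limit under the half-cycle assumption.

```python
def sync_max_frequency(delay_s):
    """Maximum synchronous clock (Hz) when data must arrive within half a cycle."""
    return 1.0 / (2.0 * delay_s)

def implied_delay(f_max_hz):
    """Interconnection delay (s) implied by a synchronous frequency limit."""
    return 1.0 / (2.0 * f_max_hz)

F_SYNC = 83e6    # reported maximum frequency without phase shifting (Hz)
F_WAVE = 248e6   # reported maximum frequency with the shifted clock (Hz)

delay = implied_delay(F_SYNC)        # inferred, not measured
speedup = F_WAVE / F_SYNC            # ratio of the two reported frequencies
bits_in_flight = delay * F_WAVE      # delay expressed in bit periods at 248 MHz

if __name__ == "__main__":
    print(f"implied delay : {delay * 1e9:.2f} ns")
    print(f"speed-up      : {speedup:.2f}x")
    print(f"bits in flight: {bits_in_flight:.2f}")
```

A value of `bits_in_flight` above one is consistent with the description of region (C), where the bit period is shorter than the interconnection delay.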
Appendix D
Interconnect Design for Maximizing
Throughput
Interconnect design has a great impact on throughput and delay performance. Traditional
interconnect design methodologies, such as buffer insertion and transistor sizing, aim to minimize
signalling delay [Bak90]. Similar techniques are also adapted to design FPGA interconnects, in
order to achieve minimal latency for reconfigurable circuits. It has been reported that these
techniques have also been employed for wave-pipelined interconnects to achieve a higher throughput
[DD05]. However, an interconnect designed for minimal delay does not necessarily achieve the
maximum wave-pipelining throughput. The throughput potential of interconnects is therefore
not fully exploited by conventional interconnect designs. The impact of interconnect design
on throughput performance is investigated, and new methodologies that aim to maximize the
throughput performance are presented. This appendix is organized as follows. In Section D.1, a
theoretical analysis shows that the throughput limit of a wave-pipelined interconnect is constrained
by the discrepancy between the charging and discharging times at each stage. In Section D.2, two new
design methods, cascaded buffers and transistor sizing, are proposed to maximize
the throughput performance of a wave-pipelined interconnect. Experimental results are also
presented to show the significant improvement in throughput and power consumption with the
new design approaches.
D.1 Throughput Analysis
Suppose an interconnection has n identical segments, each of which comprises a switch, a buffer and
a wire segment. The inverter is assumed to switch at 50% of the supply voltage, and the wavefront
observed at the far end of each segment is shown in Fig. D.1. It is also assumed that the buffer is
minimum sized and comprises two inverters, as in [RCN03]. The times for the segment to charge
and discharge to the switching voltage of the inverter are denoted by $t_r^{0.5}$ and $t_f^{0.5}$
respectively. The incoming pulse width to a segment is $t_v$, which allows the segment to charge up to the
fractional voltage v. After the pulse has travelled over this segment, the time that charges up the
subsequent stage is $t_1$ and the time that discharges the subsequent stage is $t_2$. Thus,
\[ t_1 = t_v - t_r^{0.5} + t_f^{0.5} \]
It is assumed that the rising time is the same as the falling time, so $t_2$ can be written as
\[ t_2 = t_v - t_f^{0.5} + t_r^{0.5} \tag{D.1} \]
Substituting Eq. 4.9 into the two expressions above, the charging time $t_1$ and the discharging time $t_2$
for stage i+1 can be expressed as
\[ t_1 = \sigma_i \ln\frac{k_i}{1-v_i} - \sigma_i \ln\frac{k_i}{1-0.5} + \sigma_i \ln\frac{v_i k_i}{0.5} = \sigma_i \ln\frac{k_i}{1-v_i} + \sigma_i \ln v_i \tag{D.2} \]
\[ t_2 = \sigma_i \ln\frac{k_i}{1-v_i} - \sigma_i \ln v_i \tag{D.3} \]
where 0 ≤ vi ≤ 1, ki, σi ≥ 1, for i = 1, 2, ..., n.
As observed from Eqs. D.2-D.3, the charging time t1 is always smaller than the discharging time
t2, which implies that the rising edge traverses the interconnect faster than the falling edge.
This coincides with the result in [DD05] that vi for i = 1, 2, ..., n is monotonically decreasing.
Consequently, after the signal travels over each segment, the time t1 shrinks by the magnitude
of σi ln vi, and the same amount is added to t2. Moreover, the time discrepancy
between t1 and t2 increases with the number of buffered stages (see the conventional
repeaters curves in Fig. D.3) and, eventually, the falling edge interferes with the rising edge on the
line. This becomes a hindrance to improving the interconnect throughput.
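The asymmetry between Eqs. D.2 and D.3 can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and the values σ = 1 and k = 1 (arbitrary time units) are assumptions chosen for the example rather than parameters from the simulations.

```python
import math

def charge_time(sigma, k, v):
    """Eq. D.2: time that charges the next stage (same units as sigma)."""
    return sigma * math.log(k / (1.0 - v)) + sigma * math.log(v)

def discharge_time(sigma, k, v):
    """Eq. D.3: time that discharges the next stage (same units as sigma)."""
    return sigma * math.log(k / (1.0 - v)) - sigma * math.log(v)

def edge_gap(sigma, v):
    """t2 - t1 = -2 * sigma * ln(v), which is non-negative for 0 < v <= 1."""
    return -2.0 * sigma * math.log(v)

if __name__ == "__main__":
    sigma, k = 1.0, 1.0                 # assumed values, arbitrary units
    for v in (0.9, 0.8, 0.7, 0.6):      # fractional voltage reached per stage
        t1 = charge_time(sigma, k, v)
        t2 = discharge_time(sigma, k, v)
        print(f"v={v:.1f}  t1={t1:.3f}  t2={t2:.3f}  gap={t2 - t1:.3f}")
```

Because the gap depends only on σ and v, a smaller fractional voltage at a stage gives a larger difference between the discharge and charge times, which is the growth in discrepancy described above.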
Figure D.1: Waveform at the far end of each segment
D.2 Throughput Maximizing Buffer Design
D.2.1 Cascaded Buffers
The idea for overcoming this bottleneck is to remove the discrepancy between the charging and
discharging times of the buffers, such that the rising and falling edges traverse the
interconnection at the same speed. The cascaded buffer illustrated in Fig. D.2 is designed for this
purpose. It comprises three inverters connected in series: the first two are of minimum size, to
rectify the incoming waveform, and the last one is larger, for better driving ability. Note that minimum
sized buffers were used in our analysis to minimize the hardware area, and a similar trend has been
found with more aggressive buffer sizing.
Figure D.2: Diagram of the cascaded buffers
With such buffers in use, the pulse alternates between its original form and its inverted
form from stage to stage along the line. The pulse variation is then no longer monotonically
increasing or decreasing, as it is in the conventional design. The plots in Fig. D.3 show the variation of
the charge and discharge times when different buffers are used, where t1 and t2 correspond
to the same variables in Fig. D.1 and represent the charge and discharge times respectively.
It can be seen that using cascaded buffers reduces the discrepancy between the charge and discharge
times significantly. Thus, the throughput is expected to improve. Furthermore, a single inverter is not
used as the repeater because, with the first two inverters present, the voltage waveform
is converted to a near-square wave before it reaches the driving inverter, which lets the
driving inverter maintain the minimum output resistance and hence a smaller time constant σ.
D.2.2 Transistor Sizing
The repeater shown in Fig. D.4 has the same structure as the conventional one, two inverters
in series. Conventionally, the PMOS is designed to be twice as large as the NMOS in digital
applications. However, in order to cope with the difference between the rising edge and the falling
edge, the width of the NMOS of the repeater driver is proposed to be further tuned by an index m.
With this design, the times for each segment to charge and discharge to the switching voltage can
be equalized once this index is properly selected. In this case, the pulse width variation is
cancelled out and the throughput is improved.
[Figure D.3 plot: charge/discharge time (ns, 0.7 to 1.4) against number of buffers (2 to 14); t1 and t2 curves for conventional repeaters and for cascaded buffers.]
Figure D.3: Charge/discharge time variation in the cases of using different buffers
Figure D.4: Diagram of the transistor sized repeaters
D.3 Results and Discussions
To verify the efficiency of throughput-centric interconnect design using the proposed buffers,
simulations were carried out in 90 nm technology using Cadence Virtuoso. The transistor
models are from the PTM (Predictive Technology Model). The interconnect is fixed at 1 cm
long, and its physical parameters are estimated on the basis of the aggressive
and conservative predictions in [HMH01].
[Figure D.5 plots: throughput (Gbs) against number of buffers (0 to 50) in four panels, for VDD = 1.2 V (annotated improvements of 20% and 40%), 1 V (68% and 87%), 0.9 V (94% and 140%) and 0.8 V (140% and 185%).]
Figure D.5: Throughput comparison among the three approaches
D.3.1 Throughput Comparison
The relationship between the number of buffers and the maximum throughput of a fixed-length
interconnect at different supply voltages is shown in Fig. D.5.
As expected, either of the proposed buffers always provides higher throughput than
conventional repeaters as the supply voltage varies from 1.2V to 0.8V, and the
improvement becomes more significant as the supply voltage is reduced. Cascaded buffers
produce from 40% to almost 190% higher throughput than conventional repeaters when
the supply voltage scales from 1.2V to 0.8V. Meanwhile, the improvement made by the transistor
sized repeaters ranges from 20% to 140% over the same supply voltage range, compared with the
conventional repeaters.
However, with transistor sized repeaters in use, the throughput performance of the interconnect is
very sensitive to the width of the NMOS of the buffer driver, as shown in Fig. D.6. The numbers
on the right side of the curves represent the number of buffers in the interconnect.
[Figure D.6 plots: throughput (Gbs) against width of NMOS (450 to 550 nm), panels (a) and (b), with curves for interconnects of 25, 35 and 50 buffers.]
Figure D.6: The throughput of the interconnect with transistor-sizing (a) is more sensitive to the
NMOS width than using cascaded buffers (b).
Firstly, it can be seen that the optimal size of the NMOS varies with the number of buffers, which
indicates that this sort of repeater is only applicable in ASIC design. It is also clear that a small
drift in the size of the NMOS of the repeaters has an apparent impact on the throughput of the
interconnect when transistor sized repeaters are in use. In contrast, the throughput is almost
unaffected by the same range of size variation when cascaded buffers are used. Consequently,
using transistor sized repeaters requires more careful consideration of random effects,
such as process variations.
D.3.2 Power Dissipation Comparison
The dynamic power dissipation of an interconnect is highly dependent on the bit transitions. Hence
the maximum dynamic power dissipation is considered, for a sensible comparison. For a given
throughput, the maximum dynamic power dissipation occurs when a clock signal of the frequency
corresponding to that throughput is injected at the source of the interconnect.
It is seen from Fig. D.5 that, to achieve the same throughput on an interconnect, the
number of cascaded buffers needed is always less than the number needed by the other two methods.
Therefore, although the cascaded buffer has two more transistors per buffer, the maximum dynamic
power is still expected to be close.
[Figure D.7 plot: maximum dynamic power (mW, 0 to 5) against throughput (0 to 2.5 Gbs) with conventional repeaters, with transistor sizing and with cascaded buffers; the maximum throughputs of the conventional approach and of transistor sizing are marked.]
Figure D.7: Maximum dynamic power of an interconnect against throughput at 1V VDD
It is seen from Fig. D.7 that, as expected, the maximum dynamic power of the interconnects designed
using conventional repeaters, transistor sized repeaters and cascaded buffers differs only
slightly. This figure is for a 1V supply voltage only, but the same trends are found for the
other supply voltages.
D.4 Summary
A theoretical analysis of a multistage interconnect has shown that the
discrepancy between the charging and discharging times at each stage plays a major role in limiting
the throughput of wave-pipelined interconnects. In order to maximize the performance, two
new design methods, cascaded buffers and transistor sizing, have been proposed and evaluated
with SPICE modelling. Experimental results show that the throughput can be increased by up
to 185% in the 90nm technology generation. Moreover, the technique of buffer insertion and
supply voltage scaling is introduced, which can give up to 60% dynamic power reduction
without performance loss when the new approaches are applied. These results provide useful
references for FPGA architects when a wave-pipelined link is considered in an FPGA architecture.
Bibliography
[ABD+05] A. Ahmadinia, C. Bobda, J. Ding, M. Majer, and J. Teich. A practical approach for
circuit routing on dynamic reconfigurable devices. In Proc. of IEEE International
Workshop on Rapid System Prototyping, pages 84 – 90, 2005.
[ACPP08] G. Ascia, V. Catania, M. Palesi, and D. Patti. Implementation and analysis of a new
selection strategy for adaptive routing in Networks-on-Chip. IEEE Transactions on
Computers, 57(6):809–820, 2008.
[Act02] Actel. ProASIC 500K Family. 2002.
[Act09] Actel. Antifuse FPGA family. 2009.
[AFE+05] M. Amda, T. Felicijan, A. Efthymiou, D. Edwards, and L. Lavagno. Asynchronous
on-chip networks. IEE Proc.-Comput. Digit. Tech., 152(2):273–283, 2005.
[Alt03] Altera. Avalon Bus Specification Reference Manual. 2003.
[Alt05] Altera. Stratix II Device Family Data Sheet. 2005.
[AR95] M.J. Alexander and G. Robins. New performance-driven FPGA routing algorithms.
In Proc. of Design Automation Conference (DAC), pages 562–567, 1995.
[ARM99] ARM. AMBA Specification v2.0. Technical report, 1999.
[ARM01] ARM. AMBA Multi-layer AHB overview. Technical report, 2001.
[Bak90] H.B. Bakoglu. Circuits, Interconnections, and Packaging for VLSI. MA: Addison-
Wesley, 1990.
[BB03] S. Balachandran and D. Bhatia. A-priori wirelength and interconnection estima-
tion based on circuit characteristics. In Proc. of ACM Interconnection Workshop on
System Level Interconnect Prediction (SLIP), pages 77 – 84, 2003.
[BB04] D. Bertozzi and L. Benini. Xpipes: A Network-on-Chip architecture for gigascale
Systems-on-Chip. IEEE Circuits and Systems Magazine, 4(2):18–31, 2004.
[BB05] L. Benini and D. Bertozzi. Network-on-chip architectures and design methods. IEE
Proc.-Comput. Digit. Tech., 152(2):261–272, 2005.
[BC97] D. Bormann and P. Cheung. Asynchronous wrapper for heterogeneous systems.
In Proc. of IEEE International Conference on Computer Design, pages 307 – 314,
1997.
[BCK+04] I. Blunno, J. Cortadella, A. Kondratyev, L. Lavagno, K. Lwin, and C. Sotirious.
Handshake protocol for de-synchronization. In Proc. of IEEE International Sympo-
sium on Asynchronous Circuits and Systems (ASYNC), 2004.
[BCKL98] W. Burleson, M. Ciesielski, F. Klass, and W. Liu. Wave-pipelining: A tutorial and
research survey. IEEE Trans. on VLSI Systems, 6(3):464–474, 1998.
[BDM07] P. Bogdan, T. Dumitras, and R. Marculescu. Stochastic communication: A new
paradigm for fault-tolerant Networks-on-Chip. VLSI Design, 2007:1–17, 2007.
[Bel58] R. Bellman. On a routing problem. Quarterly of Applied Mathematics, 16:87–90,
1958.
[BF02] J. Bainbridge and S. Furber. Chain: A delay-insensitive chip area interconnect. IEEE
Micro, 22(5):16 – 23, 2002.
[BHUH06] M. Beauchamp, K. S. Hemmert, K. Underwood, and S. Hauck. Architectural modi-
fications to improve floating-point unit efficiency in FPGAs. In Proc. of IEEE Sym-
posium on Field-Programmable Custom Computing Machines (FCCM), pages 177–
187, 2006.
[bib08] Noxim, Network-on-Chip simulator. In http://sourceforge.net/projects/noxim, 2008.
[bib09] Predictive Technology Model (PTM), http://ptm.asu.edu/. 2009.
[BJM+05] D. Bertozzi, A. Jalabert, S. Murali, R. Tamhankar, S. Stergiou, L. Benini, and
G. D. Micheli. NoC synthesis flow for customized domain specific multiprocessor
Systems-on-Chip. IEEE Trans. on Parallel and Distributed Systems, 16(2):113–129,
2005.
[BMN+05] T.A. Bartic, J.-Y. Mignolet, V. Nollet, T. Marescaux, D. Verkest, S. Vernalde, and
R. Lauwereins. Topology adaptive Network-on-Chip design and implementation.
IEE Proc.-Comput. Digit. Tech, 152(4):467 – 472, 2005.
[BRM99] V. Betz, J. Rose, and A. Marquardt. Architecture and CAD for Deep-Submicron
FPGAs. Kluwer Academic Publishers, Norwell, MA, USA, 1999.
[BRV93] S. Brown, J. Rose, and Z.G. Vranesic. A stochastic model to predict the routability
of Field-Programmable Gate Arrays. IEEE Trans. on Computer-Aided Design of
Integrated Circuits and Systems, 12(12):1827–1838, 1993.
[BS79] P. O. Borjesson and C. Sundberg. Simple approximations of the error function Q(x)
for communications applications. IEEE Trans. on Communications, 27(3):639–643,
1979.
[BSJ96] E. I. Boemo, L. Sergio, and M. Juan. The wave pipeline effect on LUT-based
FPGA architectures. In Proc. of the Fourth International ACM Symposium on Field-
Programmable Gate Arrays (FPGA), pages 45 – 50, 1996.
[BT89] D. Bertsekas and J. Tsitsiklis. Parallel and Distributed Computation: Numerical
Methods. Prentice-Hall, Inc., Princeton, NJ, 1989.
[CCP06] D. Chen, J. Cong, and P. Pan. FPGA design automation: A survey. Foundations and
Trends in Electronic Design Automation, 1:139–169, 2006.
[CFH+05] J. Cong, Y. Fan, G. Han, X. Yang, and Z. Zhang. Architecture and synthesis for
on-chip multicycle communication. IEEE Transactions on Computer-Aided Design
of Integrated Circuits and Systems, 23(4):550–564, 2005.
[Cha84] D.M. Chapiro. Globally-asynchronous locally-synchronous systems. PhD thesis,
Stanford, 1984.
[Chi92] G.M. Chiu. The odd-even turn model for adaptive routing. IEEE Transactions on
Parallel and Distributed Systems, 11(7):729 – 738, 1992.
[CLC05] R.C.C. Cheung, W. Luk, and P.Y.K. Cheung. Reconfigurable elliptic curve
cryptosystems on a chip. In Proc. of Design, Automation and Test in Europe Conference and
Exhibition (DATE), pages 24 – 29, 2005.
[CLR01] T.H. Cormen, C.E. Leiserson, and R.L. Rivest. Introduction to Algorithms. MIT
Press, Massachusetts Institute of Technology, MA, 2001.
[CMS03] S. Chakraborty, J. Mekie, and D. K. Sharma. Reasoning about synchronization
issues in GALS systems: A unified approach. In Proc. of Workshop on Formal
Methods in GALS Architectures (FMGALS), 2003.
[Con01] J. Cong. An interconnect-centric design flow for nanometer technologies. Proceed-
ings of the IEEE, 89(4):505–528, 2001.
[Cot69] L. Cotton. Maximum rate pipelined systems. In Proc. of AFIPS Spring Joint Com-
puter Conference, 1969.
[CW97] J. Cong and C. Wu. FPGA synthesis with retiming and pipelining for clock pe-
riod minimization of sequential circuits. In Proc. of the 34th annual conference on
Design Automation (DAC), pages 644 – 649, 1997.
[CZ05] A. Chattopadhyay and Z. Zilic. GALDS: a complete framework for designing mul-
ticlock ASICs and SoCs. IEEE Transactions on Very Large Scale Integration (VLSI)
Systems, 13(6):641–654, 2005.
[Dal92] W.J. Dally. Virtual channel flow control. IEEE Transactions on Parallel and Dis-
tributed Systems, 3(2):194–205, 1992.
[DD05] V. V. Deodhar and J. A. Davis. Voltage scaling, wire sizing and repeater insertion
design rules for wave-pipelined VLSI global interconnect circuits. In Proc. of Sym-
posium on Quality of Electronic Design, pages 592 – 597, 2005.
[DeH04] A. DeHon. Unifying mesh- and tree-based programmable interconnect. IEEE Trans.
on VLSI Systems, 12(10):1051–1065, 2004.
[DGS04] R. Dobkin, R. Ginosar, and C. Sotiriou. Data synchronization issues in GALS SoCs.
In Proc. of the 16th IEEE International Symposium on Asynchronous Circuits and
Systems (ASYNC), pages 170–180, 2004.
[Dic96] C. Dick. Computing the discrete Fourier Transform on FPGA based systolic arrays.
In Proc. of ACM/SIGDA International Symposium on Field-Programmable Gate Ar-
rays (FPGA), pages 129 – 135, 1996.
[DMBY07] C. D’Alessandro, A. Mokhov, A. Bystrov, and A. Yakovlev. Delay/phase regenera-
tion circuits. In Proc. of IEEE International Symposium on Asynchronous Circuits
and Systems (ASYNC), pages 105–116, 2007.
[DPL+07] R. Dobkin, Y. Perelman, T. Liran, R. Ginosar, and A. Kolodny. High rate wave-
pipelined asynchronous on-chip bit-serial data link. In Proc. of the IEEE Interna-
tional Symposium on Asynchronous Circuits and Systems, pages 3–14, 2007.
[DSBY05] C. D’Alessandro, D. Shang, A. Bystrov, and A. Yakovlev. PSK signalling on SoC
buses. In Proc. of International Conference on Power and Timing Modeling, Opti-
mization and Simulation (PATMOS), pages 286–296, 2005.
[DT01] W.J. Dally and B. Towles. Route packets, not wires: On-chip interconnection net-
works. In Proc. of Design Automation Conference (DAC), pages 684 – 689, 2001.
[DT04] W.J. Dally and B. Towles. Principles and Practices of Interconnection Networks.
Morgan Kaufmann, 2004.
[EBE04] A. Efthymiou, J. Bainbridge, and D.A. Edwards. Adding testability to an asyn-
chronous interconnect for GALS SoC. In Proc. IEEE VLSI Test Symposium, pages
20 – 23, 2004.
[EKRZ02] D. Edenfeld, Andrew B. Kahng, Mike Rodgers, and Yervant Zorian. Technology
roadmap for semiconductors. Computer, 37(1):42–53, 2002.
[Elm48] W.C. Elmore. The transient analysis of damped linear networks with particular re-
gard to wideband amplifiers. J. Applied Physics, 19(1):55–63, 1948.
[ES81] A.S. El Gamal and Z.A. Syed. A stochastic model for interconnections in custom
integrated circuits. IEEE Trans. on Circuits and Systems, 28(9):888–894, 1981.
[FMM08] R. Francis, S. Moore, and R. Mullins. A network of time-division multiplexed wiring
for FPGAs. In Proc. of the International Symposium on Networks-on-Chips (NOCS),
pages 35 – 44, 2008.
[GBHW08] K. Goossens, M. Bennebroek, J. Y. Hur, and M.-A. Wahlah. Hardwired networks
on chip in FPGAs to unify functional and configuration interconnects. In IEEE
International Symposium on Networks-on-Chip, pages 45 – 54, 2008.
[GBLM07] D. Greenfield, A. Banerjee, J. Lee, and S. Moore. Implications of Rent’s rule for
NoC design and its fault-tolerance. In Proc. of the International Symposium on
Networks-on-Chips (NOCS), pages 283–294, 2007.
[GN92] C. Glass and L. Ni. The turn model for adaptive routing. ACM SIGARCH Computer
Architecture News, 20(2):278 – 287, 1992.
[GOV+03] F.K. Gurkaynak, S. Oetiker, T. Villiger, N. Felber, H. Kaeslin, and W. Fichtner.
On the GALS design methodology of ETH Zurich. Proc. of the Formal Methods
For Globally Asynchronous Locally Synchronous (GALS) Architecture, pages 32–
41, 2003.
[HBE99] S. Hauck, G. Borriello, and C. Ebeling. Mesh routing topologies for multi-FPGA
systems. IEEE Trans. on VLSI Systems, 6(3):400–408, 1999.
[HBH05] M.W. Heath, W.P. Burleson, and I.G. Harris. Synchro-tokens: A deterministic GALS
methodology for chip-level debug and test. IEEE Trans. on Computers, 54(12):1532
– 1546, 2005.
[HCK+02] M. Hutton, V. Chan, P. Kazarian, V. Maruri, T. Ngai, J. Park, R. Patel, B. Pedersen,
J. Schleicher, and S. Shumarayev. Interconnect enhancements for a high-speed PLD
architecture. In Proc. of the International Symposium on Field Programmable Gate
Array (FPGA), pages 3 – 10, 2002.
[HM04] J. Hu and R. Marculescu. DyAD smart routing for Networks-on-Chip. In Proc. of
the Design Automation Conference (DAC), pages 260 – 263, 2004.
[HM06] S. Hollis and S. W. Moore. RasP: An area-efficient, on-chip network. In Proc. of
International Conference on Computer Design (ICCD), pages 63–69, 2006.
[HMH01] R. Ho, K.W. Mai, and M.A. Horowitz. The future of wires. Proceedings of the
IEEE, 89(4):490–504, 2001.
[Hop84] J. J. Hopfield. Neurons with graded response have collective computational
properties like those of two-state neurons. Proceedings of the National Academy of Sciences,
81:3088–3092, 1984.
[Hu05] J. Hu. Design Methodologies For Application Specific Networks-on-Chip. PhD
thesis, Carnegie Mellon University, 2005.
[IBM99] IBM. CoreConnect(TM) Bus Architecture. White paper, 1999.
[JDD06] A. Joshi, V. Deodhar, and J. Davis. Low power multilevel interconnect networks
using wave-pipelined multiplexed (WPM) routing. In Proc. of Symposium on VLSI
Design, pages 773 – 776, 2006.
[JLD07] A.J. Joshi, G.G. Lopez, and J.A. Davis. Design and optimization of on-chip inter-
connects using wave-pipelined multiplexed routing. IEEE Trans. on VLSI Systems,
15(9):990–1002, 2007.
[KBV93] M. Khellah, S. Brown, and Z. Vranesic. Modelling routing delays in SRAM-based
FPGAs. In Proc. of Canadian Conference on VLSI, 1993.
[Kin07] D.J. Kinniment. Synchronization and Arbitration in Digital Systems. John Wiley &
Sons Ltd, Chichester, UK, 2007.
[KJS+02] S. Kumar, A. Jantsch, J.-P. Soininen, M. Forsell, M. Millberg, J. Öberg, K. Tiensyrjä,
and A. Hemani. A Network-on-Chip architecture and design methodology. In Proc.
of International Symposium in VLSI, pages 105 – 112, 2002.
[KLLY05] D. Kim, K. Lee, S.-J. Lee, and H.-J. Yoo. A reconfigurable crossbar switch with
adaptive bandwidth control for networks-on-chip. In Proc. of IEEE International
Symposium on Circuits and Systems (ISCAS), pages 2369 – 2372, 2005.
[KMdL+06] N. Kapre, N. Mehta, M. de Lorimier, R. Rubin, H. Barnor, M. Wilson, M. Wrighton,
and A. DeHon. Packet-switched vs. time-multiplexed FPGA overlay networks.
In Proc. IEEE Symposium on Field-Programmable Custom Computing Machines
(FCCM), pages 205 – 216, 2006.
[KPR04] H. Kalte, M. Porrmann, and U. Rückert. System-on-Programmable-Chip approach
enabling online fine-grained 1D-placement. In Proc. of International Parallel and
Distributed Processing Symposium, page 141, 2004.
[LAB+05] D. Lewis, E. Ahmed, G. Baeckler, V. Betz, M. Bourgeault, D. Cashman, D. Gal-
loway, M. Hutton, C. Lane, A. Lee, P. Leventis, S. Marquardt, C. McClintock,
K. Padalia, B. Pedersen, G. Powell, B. Ratchev, S. Reddy, J. Schleicher, K. Stevens,
R. Yuan, R. Cliff, and J. Rose. The Stratix II logic and routing architecture. In Proc.
International Symposium on Field-programmable Gate Arrays (FPGA), pages 14 –
20, 2005.
[Lat09] Lattice. LatticeXP Non-Volatile FPGA. 2009.
[LH96] D.H. Linder and J.C. Harden. Phased logic: Supporting the synchronous
design paradigm with delay-insensitive circuitry. IEEE Transactions on Computers,
45(9):1031–1044, 1996.
[LKK+05] S.J. Lee, K.K., H.J. Kim, N.J. Cho, and H.J. Yoo. Adaptive Network-on-Chip with
wave-front train serialization scheme. In Proc. Symposium on VLSI Circuits, pages
104 – 107, 2005.
[LLM06] E. Lee, G. Lemieux, and S. Mirabbasi. Interconnect driver design for long wires
in Field-Programmable Gate Arrays. In Proc. of the International Conference on
Field-Programmable Technology (FPT), pages 57 – 76, 2006.
[LLST04] J. Liang, A. Laffely, S. Srinivasan, and R. Tessier. An architecture and compiler for
scalable on-chip communication. IEEE Transactions on VLSI Systems, 12(7):711–
726, 2004.
[LLTY04] G. Lemieux, E. Lee, M. Tom, and A. Yu. Directional and single-driver wires
in FPGA interconnect. In Proc. of the International Conference on Field-
Programmable Technology (FPT), pages 41–48, 2004.
[LT96] K.P. Lam and C.W. Tong. Closed semiring connectionist network for the Bellman-
Ford computation. IEE Proc. Comput. Digit. Tech., 143:189–195, 1996.
[LV05] G. Lakshminarayanan and B. Venkataramani. Optimization techniques for FPGA-
based wave-pipelined DSP blocks. IEEE Trans. on VLSI Systems, 13(7):783–793,
2005.
[LWH+05] L.-Y Lin, C.-Y Wang, P.-J Huang, C.-C Chou, and J.-Y Jou. Communication-driven
task binding for multiprocessor with latency insensitive Network-on-Chip. In Proc.
Asia and South Pacific Design Automation Conference (ASP-DAC), pages 39 – 44,
2005.
[MB06] G. Micheli and L. Benini. Networks on Chips: Technology and Tools. Morgan
Kaufmann, 2006.
[MBV+02] T. Marescaux, A. Bartic, D. Verkest, S. Vernalde, and R. Lauwereins. Interconnec-
tion networks enable fine-grain dynamic multi-tasking on FPGA. In Proc. IEEE
International Conference on Field Programmable Logic and Applications (FPL),
pages 795–805, 2002.
[MCLL09] T. Mak, P. Cheung, W. Luk, and K.P. Lam. A DP-Network for optimal dynamic
routing in Network-on-Chip. In Proc. of ACM International Conference on Hard-
ware/Software Codesign and System Synthesis (CODES), pages 119–128, 2009.
[MCS04] J. Mekie, S. Chakraborty, and D.K. Sharma. Evaluation of pausible clocking for
interfacing high speed IP cores in GALS framework. In Proc. of VLSID, page 559,
2004.
[MCSB06] V. Manohararajah, Gordon R. Chiu, D. P. Singh, and S. D. Brown. Difficulty of
predicting interconnect delay in a timing driven FPGA CAD flow. In Proc. ACM
International Workshop on System Level Interconnect Prediction (SLIP), pages 3 –
8, 2006.
[MDS+08a] T. Mak, C. D’Alessandro, P. Sedcole, P. Cheung, A. Yakovlev, and W. Luk. Global
interconnections in FPGAs: Modeling and performance analysis. In Proc. of Inter-
national Workshop on System Level Interconnect Prediction (SLIP), pages 51–58,
2008.
[MDS+08b] T. Mak, C. D’Alessandro, P. Sedcole, P.Y.K. Cheung, A. Yakovlev, and W. Luk.
Implementation of wave-pipelined interconnects in FPGAs. In Proc. of
the Second IEEE International Symposium on Networks-on-Chips, 2008.
[Mol97] P.A. Molina. The Design of a Delay-Insensitive Bus Architecture Using Handshake
Circuits. PhD thesis, Imperial College of London, 1997.
[MSC+07] T. Mak, P. Sedcole, P.Y.K. Cheung, W. Luk, and K.P. Lam. A hybrid analog-digital
routing network for NoC dynamic routing. In Proc. IEEE International Symposium
on Networks-on-Chip (NoC), pages 173 – 182, 2007.
[MSCL06a] T. Mak, P. Sedcole, P. Cheung, and W. Luk. On-FPGA communication architectures
and design factors. In Proc. IEEE International Conference on Field Programmable
Logic and Applications (FPL), 2006.
[MSCL06b] T. Mak, P. Sedcole, P.Y.K. Cheung, and W. Luk. On-FPGA communication
architectures and design factors. In Proc. IEEE International Conference on
Field-Programmable Logic and Applications (FPL), pages 1–8, 2006.
[MSCL07] T. Mak, P. Sedcole, P.Y.K. Cheung, and W. Luk. Average interconnection delay
estimation for on-FPGA communication links. Electronics Letters, 43(17):918–920,
2007.
[MSCL08a] T. Mak, P. Sedcole, P. Cheung, and W. Luk. Interconnection lengths and delays
estimation for communication links in FPGAs. In ACM International Workshop on
System Level Interconnect Prediction (SLIP), pages 1–10, 2008.
[MSCL08b] T. Mak, P. Sedcole, P.Y.K. Cheung, and W. Luk. Wave-pipelined signalling for
On-FPGA communication. In Proc. IEEE International Conference on Field Pro-
grammable Technology (FPT), pages 9–16, 2008.
[MTC+00] S.W. Moore, G.S. Taylor, P.A. Cunningham, R.D. Mullins, and P. Robinson. Self-
calibrating clocks for globally asynchronous locally synchronous systems. In Proc.
International Conference on Computer Design, page 73, 2000.
[MWM04] R. Mullins, A. West, and S. Moore. Low-latency virtual-channel routers for on-
chip networks. In Proc. of the 31st Annual International Symposium on Computer
Architecture, page 188, 2004.
[NSN+06] M. Najibi, K. Saleh, M. Naderi, H. Pedram, and M. Sedighi. Prototyping globally
asynchronous locally synchronous circuits on commercial synchronous FPGAs. In
Proc. Field Programmable Logic and Applications (FPL), pages 1 – 6, 2006.
[OHM05] U. Y. Ogras, J.-C. Hu, and R. Marculescu. Key research problems in NoC design:
A holistic perspective. In Proc. of International Conference on Hardware/Software
Codesign (CODES), pages 69 – 74, 2005.
[OM05] U.Y. Ogras and R. Marculescu. Application-specific Network-on-Chip architecture
customization via long-range link insertion. In Proc. International Conference on
Computer Aided Design (ICCAD), pages 246 – 253, 2005.
[OM06] U.Y. Ogras and R. Marculescu. “It’s a small world after all”: NoC performance
optimization via long-range link insertion. IEEE Trans. on VLSI Systems, 14(7):693–
706, 2006.
[PHK06] M. Palesi, R. Holsmark, and S. Kumar. A methodology for design of application
specific deadlock-free routing algorithms for NoC systems. In Proc. of International
Conference on Hardware/software codesign (CODES), pages 142 – 147, 2006.
[PYW02] K. Poon, A. Yan, and S. Wilton. A flexible power model for FPGAs. In Proc.
of International Symposium on Field-Programmable Logic and Applications (FPL),
pages 312 – 321, 2002.
[RC03] A. Royal and P. Cheung. Globally asynchronous locally synchronous FPGA archi-
tectures. In Proc. of International Symposium on Field Programmable Logic and
Applications (FPL), pages 355–364, 2003.
[RCN03] J. Rabaey, A. Chandrakasan, and B. Nikolic. Digital Integrated Circuits: A Design
Perspective. Prentice-Hall, 2003.
[RM04] K. Ryu and V. Mooney. Automatic bus generation for multiprocessor SoC design.
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems,
23(11):1531–1549, 2004.
[RMCF88] F.U. Rosenberger, C.E. Molnar, T.J. Chaney, and T.P. Fang. Q-modules: Internally
clocked delay-insensitive modules. IEEE Trans. On Computers, 37(9):1005–1018,
1988.
[Ros02] S.M. Ross. Introduction to Probability Models. Academic Press, 2002.
[RSM01] K. Ryu, E. Shin, and V. Mooney. A comparison of five different multiprocessor SoC
bus architectures. In Proc. of Euromicro Symposium on Digital Systems Design,
page 202, 2001.
[Sak93] T. Sakurai. Closed-form expressions for interconnection delay, coupling, and
crosstalk in VLSI’s. IEEE Trans. on Electron Devices, 40(1):118–124, 1993.
[SBKV05] B. Sethuraman, P. Bhattacharya, K. Khan, and R. Vemuri. LiPaR: A light-weight
parallel router for FPGA-based Network-on-Chip. In Proc. Great Lakes Symposium
on VLSI, pages 452 – 457, 2005.
[SC04] L. Shannon and P. Chow. Maximizing system performance: Using reconfigurability
to monitor system communications. In Proc. of International Conference on Field
Programmable Technology (ICFPT), pages 231 – 238, 2004.
[SC05] L. Shannon and P. Chow. Simplifying the integration of processing elements in
computing systems using a programmable controller. In Proc. of International IEEE
Symposium on Field-Programmable Custom Computing Machines (FCCM), pages
63 – 72, 2005.
[SCCL04] P. Sedcole, Peter Y. K. Cheung, G. Constantinides, and W. Luk. A structured
methodology for System-on-an-FPGA design. In Proc. of International Symposium
in Field Programmable Logic and Applications (FPL), pages 1047–1051, 2004.
[SCEH04] A. Sharma, K. Compton, C. Ebeling, and S. Hauck. Exploration of pipelined FPGA
interconnect structures. In Proc. of the International Symposium on Field Pro-
grammable Gate Array (FPGA), pages 13 – 22, 2004.
[Sec04] T. Seceleanu. Communication on a segmented bus platform. In Proc. of IEEE
International SOC Conference (SOCC), pages 205–208, 2004.
[SF01] J. Sparso and S. Furber. Principles of Asynchronous Circuit Design: A System Per-
spective. Kluwer Academic Publishers, 2001.
[SL03] C. Sotiriou and L. Lavagno. De-synchronization: Asynchronous circuits from syn-
chronous specification. In Proc. of the IEEE System-on-Chip Conference, 2003.
[SsA04] R. Soares, I. S. Silva, and A. Azevedo. When reconfigurable architecture meets
Network-on-Chip. In Proc. of International Symposium on Integrated Circuits and
Systems Design, pages 216 – 221, 2004.
[SSC06] M. Saldana, L. Shannon, and P. Chow. The routability of multiprocessor network
topologies in FPGAs. In Proc. of the International Symposium on System Level
Interconnect Predictions (SLIP), pages 232 – 232, 2006.
[Ste02] N.J. Steiner. A Standalone Wire Database for Routing and Tracing in Xilinx Virtex,
Virtex-E, and Virtex-II FPGAs. Master's Thesis in Electrical Engineering, Virginia
Polytechnic Institute and State University, 2002.
[Str01] D. Stroobandt. A Priori Wire Length Estimates for Digital Design. Springer, Dor-
drecht, The Netherlands, 2001.
[STR+02] M. Singh, J.A. Tierno, A. Rylyakov, S. Rylov, and S.M. Nowick. An adaptively-
pipelined mixed synchronous-asynchronous digital FIR filter chip operating at 1.3
GigaHertz. In Proc. of the International Symposium on Asynchronous Circuits and
Systems (ASYNC), pages 84 – 95, 2002.
[Sut89] I. Sutherland. Micropipelines. Communications of the ACM, 32(6):720 – 738, 1989.
[SWA+05] S. Sivaswamy, G. Wang, C. Ababei, K. Bazargan, R. Kastner, and E. Bozorgzadeh.
HARP: Hard-wired routing pattern FPGAs. In Proc. of the International Symposium
on Field Programmable Gate Array (FPGA), pages 21 – 29, 2005.
[SZBR07] T. Schonwald, J. Zimmermann, O. Bringmann, and W. Rosenstiel. Fully adaptive
fault-tolerant routing algorithm for Network-on-Chip architectures. In Proc. of the
Euromicro Conference on Digital System Design Architectures, Methods and Tools,
pages 527 – 534, 2007.
[Tay02] M. Taylor. The RAW microprocessor: A computational fabric for software circuits
and general-purpose programs. IEEE Micro, 22(2):25 – 35, 2002.
[TKDD05] N. Thepayasuwan, S. Kallakuri, A. Doboli, and S. Doboli. Communication sub-
system synthesis and analysis tool using bus architecture generation and stochastic
arbitration policies. In Proc. of the International Symposium on Circuits and Systems
(ISCAS), pages 1044 – 1047, 2005.
[TLG09] P. Teehan, G. Lemieux, and M. Greenstreet. Towards reliable 5Gbps wave-pipelined
and 3Gbps surfing interconnect in 65nm FPGAs. In Proc. of ACM/SIGDA Inter-
national Symposium on Field-Programmable Gate Arrays (FPGA), pages 43–52,
2009.
[VD05] V. Vinita and J. Davis. Optimization of throughput performance for low-power VLSI
interconnects. IEEE Trans. on VLSI Systems, 13(3):308–318, 2005.
[WAHS06] D. Wu, B. Al-Hashimi, and M. Schmitz. Improving routing efficiency for Network-
on-Chip through contention-aware input selection. In Proc. of Asia and South Pacific
Design Automation Conference (ASP-DAC), pages 36 – 41, 2006.
[Wal00] M.G. Walker. Modeling the wiring of deep submicron ICs. IEEE Spectrum,
37(3):65–71, 2000.
[WMSC09] L. Wang, T. Mak, P. Sedcole, and P.Y.K. Cheung. Throughput maximization for
wave-pipelined interconnects using cascaded buffers and transistor sizing. In Proc.
of IEEE International Symposium on Circuits and Systems, pages 1293 – 1296,
2009.
[WSA+06] G. Wang, S. Sivaswamy, C. Ababei, K. Bazargan, R. Kastner, and E. Bozorgzadeh.
Statistical analysis and design of HARP FPGAs. IEEE Trans. on VLSI Systems,
14(7):2088–2102, 2006.
[WSC07] J.S.J. Wong, P. Sedcole, and P.Y.K. Cheung. Self-characterization of Combinatorial
Circuit Delays in FPGAs. In Proceedings of Field-Programmable Technology, pages
17–23, 2007.
[Xil05a] Xilinx. Virtex-4 Data Sheets. 2005.
[Xil05b] Xilinx. Virtex-II Pro complete data sheet. 2005.
[Xil06a] Xilinx. Xilinx ISE 8.2i Software Manuals. 2006.
[Xil06b] Xilinx. Xilinx System Generator for DSP version 8.2.02: User Guide. 2006.
[Xil06c] Xilinx. Achieving Higher System Performance with the Virtex-5 Family of FPGAs.
White Paper, WP245(v1.1.1), July 7, 2006.
[XW02] J. Xu and W. Wolf. Wave pipelining for application-specific Networks-on-Chips. In
Proc. of CASES, pages 198–201, 2002.
[XW03] J. Xu and W. Wolf. A wave-pipelined on-chip interconnect structure for Networks-
on-Chips. In Proc. of International Symposium on High-Performance Interconnects,
2003.
[YD99] K.Y. Yun and A.E. Dooply. Pausible clocking-based heterogeneous systems. IEEE
Trans. on VLSI Systems, 7(4):482–488, 1999.
[YKLL03] C.W. Yu, K.H. Kwong, K.H. Lee, and P.H.W. Leong. A Smith-Waterman systolic
cell. In Proc. of International symposium on Field Programmable Logic and Appli-
cations (FPL), pages 375–384, 2003.
[YR05] A. Ye and J. Rose. Using bus-based connections to improve Field-Programmable
Gate Array density for implementing datapath circuits. In Proc. ACM International
Symposium on Field Programmable Gate Array (FPGA), pages 3 – 13, 2005.
[ZKS04] C.A. Zeferino, M. E. Kreutz, and A. A. Susin. RASoC: A router soft-core for
Networks-on-Chip. In Proc. of Design, Automation and Test in Europe (DATE),
page 30198, 2004.
