High-Performance and Wavelength-Reused Optical Network on Chip (ONoC) Architectures and Communication Schemes for Manycore Processor by Liu, Feiyang
Feiyang Liu
a thesis submitted for the degree of
Doctor of Philosophy
at the University of Otago, Dunedin,
New Zealand.
29 March 2017
Abstract
Optical Network on Chip (ONoC) is an emerging chip-scale optical
interconnection technology that provides high-performance and power-efficient
inter-core communication for many-core processors. By using silicon
photonic interconnects to transmit data packets as optical signals, it can
achieve ultra-low communication delay, high bandwidth capacity, and low
power dissipation. With Wavelength Division Multiplexing (WDM), multiple
optical signals can be transmitted simultaneously in the same optical
interconnect on different wavelengths. WDM-based ONoC has therefore become
an active research topic. However, the number of available wavelengths is
limited by the need for reliable and power-efficient optical communication.
Hence, designing a high-performance and power-efficient ONoC architecture
with a limited number of wavelengths is an important and challenging problem.
In this thesis, the design methodology of wavelength-reused ONoC
architectures is explored. By reusing wavelengths across optical routing
paths, high-performance and power-efficient communication is realized for
many-core processors using only a small number of available wavelengths.
Three wavelength-reused ONoC architectures and communication schemes
are proposed to address different communication requirements, i.e., network
scalability, multicast communication, and dark silicon.
Firstly, WRH-ONoC, a wavelength-reused hierarchical Optical Network on
Chip architecture, is proposed to achieve high network scalability, namely
low communication delay and high throughput capacity for hundreds of
thousands of cores, by reusing the limited number of available wavelengths
with modest hardware cost and energy overhead. WRH-ONoC
combines the advantages of non-blocking communication within each λ-router
and wavelength reuse across all λ-routers through hierarchical networking.
Both theoretical analysis and simulation results indicate that WRH-ONoC
achieves significant improvements in communication performance
and scalability (e.g., a 46.0% reduction in zero-load packet delay and
a 72.7% improvement in network throughput for 400 cores, with small
hardware cost and energy overhead) in comparison with existing schemes.
Secondly, DWRMR, a dynamic wavelength-reused multicast scheme based
on the optical multicast ring, is proposed for the multicast communication
that is common in many-core processors. In DWRMR, an optical multicast
ring is dynamically constructed for each multicast group, and multicast
packets are transmitted in a single-send-multi-receive manner requiring only
one wavelength. All the cores in the same multicast group can reuse the
established multicast ring through an optical token arbitration scheme for
interactive multicast communication, thereby avoiding the frequent
construction of dedicated multicast routing paths for each core. Simulation
results indicate that DWRMR can reduce end-to-end packet delay by more than
50% with slight hardware cost, or achieve the same performance with only
half the number of wavelengths, compared with existing schemes.
Thirdly, Dark-ONoC, a dynamically configurable ONoC architecture, is
proposed for many-core processors with dark silicon. Dark silicon is an
inevitable phenomenon in which only a small number of cores can be active
simultaneously while the other cores must stay in the dark (power-gated)
state due to the restricted power budget. Dark-ONoC periodically allocates
non-blocking optical routing paths only between the active cores, using as
few wavelengths as possible. Thus, it can obtain high-performance
communication and low power consumption at the same time. Extensive
simulations are conducted with dark silicon patterns from both synthetic
distributions and real data traces. The simulation results indicate that the
number of wavelengths is reduced by around 15% and the overall power
consumption is reduced by 23.4% compared to existing schemes.
Finally, this thesis draws several important principles for the design of
wavelength-reused ONoC architectures, and summarizes some prospective
issues for future research.
Acknowledgements
First and foremost, I would like to express my sincere gratitude and
appreciation to my supervisors, Dr. Haibo Zhang, A/Prof. Zhiyi Huang, and Dr.
Yawen Chen, for their professional and invaluable guidance and support of
my research. They gave me the opportunity and encouraged me to explore
some novel and interesting research problems. I must thank Prof. Huaxi
Gu from Xidian University for his expert technical advice.
Meanwhile, I would like to express my sincere gratitude to the thesis com-
mittee for their insightful and precious comments. They helped to improve
the quality of my PhD thesis and encouraged me to broaden my research
from various perspectives.
I must thank all my colleagues in the Computer Science Department for their
valuable advice and suggestions on my 'boring' presentations. From the
discussions in a wonderful atmosphere, I learned a lot of useful skills and
broadened my view of my own work.
I have to thank all my friends here in Dunedin who made my life colourful
and pleasant. I must thank Zhenglong Cao, Weiwei Zhang, and Huijuan
Hua for their companionship in every Chinese New Year. I must thank
Aleksei Fedorov and his lovely family for their kindness.
Finally, I want to express my love and gratitude to all my family members
for their constant support and encouragement. I must express my special
love to my girlfriend Hui Li for her understanding and support when I was
not around her.
Publications
[1] Feiyang Liu, Haibo Zhang, Yawen Chen, Zhiyi Huang, Huaxi Gu
(2017). Wavelength-Reused Hierarchical Optical Network on Chip Architecture
for Manycore Processors. In IEEE Transactions on Sustainable Computing
(in press). doi:10.1109/TSUSC.2017.2733551.
[2] Feiyang Liu, Haibo Zhang, Yawen Chen, Zhiyi Huang, Huaxi Gu
(2015). WRH-ONoC: A Wavelength-Reused Hierarchical Architecture for
Optical Network on Chips. In IEEE Conference on Computer Communications
(INFOCOM), Hong Kong, pp. 1912-1920.
doi:10.1109/INFOCOM.2015.7218574.
[3] Feiyang Liu, Haibo Zhang, Yawen Chen, Zhiyi Huang, Huaxi Gu
(2016). Dynamic Ring-Based Multicast with Wavelength Reuse for Op-
tical Network on Chips. In IEEE 10th International Symposium on Embed-
ded Multicore/Many-Core Systems-on-Chip (MCSoC), Lyon, pp. 153-160.
doi:10.1109/MCSoC.2016.9.
[4] Feiyang Liu, Zhiyi Huang, Haibo Zhang, Yawen Chen, Hui Li, Huaxi
Gu (2017). Dark-ONoC: A Power-Efficient and Wavelength-Reused Optical
Network on Chip Architecture for Many-Core Processor with Dark Silicon
(under review).
[5] Feiyang Liu, Haibo Zhang, Yawen Chen, Zhiyi Huang, Huaxi Gu
(2017). A Ring-based Multicast Routing and Wavelength Reuse Scheme
for Dynamical Configured Optical Network on Chip (ONoC) (in
preparation).
As co-author:
[6] Xuanzhang Liu, Huaxi Gu, Haibo Zhang, Feiyang Liu, Yawen Chen,
Xiaoshan Yu (2017). Energy-Aware On-chip Virtual Machine Placement
for Cloud-Supported Cyber-Physical Systems. In Microprocessors and Mi-
crosystems, vol. 52, pp. 427-437. doi:10.1016/j.micpro.2016.07.013.
[7] Luming Wan, Haibo Zhang, Feiyang Liu, Yawen Chen (2016). Routing
in Delay Tolerant Networks with Fine-Grained Contact Characterisation
and Dynamic Message Replication. In IEEE 17th International Symposium
on A World of Wireless, Mobile and Multimedia Networks (WoWMoM),
Coimbra, pp. 1-6. doi:10.1109/WoWMoM.2016.7523551.
[8] Luming Wan, Feiyang Liu, Yawen Chen, Haibo Zhang (2015). Routing
Protocols for Delay Tolerant Networks: Survey and Performance Evalua-
tion. In International Journal of Wireless & Mobile Networks, vol. 7, no.
3, pp. 55-69. doi:10.5121/ijwmn.2015.7305.
[9] Yawen Chen, Haibo Zhang, Feiyang Liu, Huaxi Gu (2015). An Op-
timization Framework for Routing on Optical Network-on-Chips (ONoCs)
from a Networking Perspective. In IEEE International Conference on Sig-
nal Processing, Communications and Computing (ICSPCC), Ningbo, pp.
1-5. doi:10.1109/ICSPCC.2015.7338820.
Contents
1 Introduction
1.1 Communication of Many-Core Processor
1.2 Inter-Core Communication Architecture
1.2.1 Electronic Network on Chip
1.2.2 Optical Network on Chip
1.2.3 Advantages and Limitations of ONoC
1.3 Challenging Issues in ONoC Design
1.3.1 Network Architecture
1.3.2 Routing and Wavelength Allocation
1.4 Motivations and Contributions
1.4.1 Network Scalability
1.4.2 Multicast Communication
1.4.3 Dark Silicon
1.5 Thesis Structure
2 Optical Network on Chip Design
2.1 Overview of ONoC Architecture
2.1.1 Basic Optical Components
2.1.2 Typical Network Architecture
2.1.3 Wavelength-Based Routing Scheme
2.2 All-Optical ONoC
2.2.1 Ring-Based ONoC
2.2.2 Crossbar-Based ONoC
2.2.3 Advantages and Disadvantages
2.3 Electronic-Optical Hybrid ONoC
2.3.1 Path-Reserved ONoC
2.3.2 Hierarchical ONoC
2.3.3 Advantages and Disadvantages
2.4 Routing and Wavelength Allocation
2.4.1 Fixed Wavelength Routing
2.4.2 Dynamical Routing and Wavelength Allocation
2.4.3 Advantages and Disadvantages
2.5 Summary
3 WRH-ONoC: Wavelength-Reused Hierarchical ONoC Architecture
3.1 Motivation
3.1.1 Scalability of Many-Core Processor
3.1.2 Optical Network on Chip
3.1.3 Non-Blocking λ-Router
3.1.4 Hierarchical Networking
3.1.5 Main Contributions of WRH-ONoC
3.2 Network Architecture
3.2.1 Hierarchical Interconnection
3.2.2 Gateway Structure
3.3 Communication Scheme
3.3.1 Positional Prefix Address
3.3.2 Unicast Communication
3.3.3 Multicast Communication
3.3.4 Wavelength-Level Flow Control
3.4 Theoretical Modelling and Analysis
3.4.1 Hardware Requirements
3.4.2 Communication Delay
3.4.3 Energy Consumption
3.5 Performance Evaluation
3.5.1 Simulation Setup
3.5.2 Comparison with Theoretical Results
3.5.3 Simulation with Data Traces
3.5.4 Simulation with Synthetic Traffic Patterns
3.5.5 Hardware Cost Analysis
3.5.6 Energy Efficiency
3.6 Summary
4 DWRMR: A Dynamically-configured and Wavelength-Reused ONoC with Multicast Ring based Routing
4.1 Motivation
4.1.1 Multicast Communication
4.1.2 Existing Multicast Routing Schemes
4.1.3 Existing Multicast-Enabled Architectures
4.1.4 Main Contributions of DWRMR
4.2 Network Architecture
4.2.1 Core Plane
4.2.2 Control Plane
4.2.3 Forwarding Plane
4.3 Communication Scheme
4.3.1 Ring-Based Multicast Routing
4.3.2 Multicast Ring Reuse
4.4 Routing and Wavelength Allocation Algorithm
4.4.1 Preliminary Definition
4.4.2 Multicast Ring Model
4.4.3 Heuristic Solution
4.5 Performance Evaluation
4.5.1 Simulation Setup
4.5.2 Synthetic-Based Simulations
4.5.3 Simulation with Data Traces
4.6 Summary
5 Dark-ONoC: A Dark Silicon Aware ONoC Architecture
5.1 Motivation
5.1.1 Many-Core Processor in Dark Silicon
5.1.2 Properties of Dark Silicon
5.1.3 Existing Research on Dark Silicon
5.1.4 Significance of Dark Silicon Aware ONoC
5.1.5 Main Contributions of Dark-ONoC
5.2 Network Architecture
5.2.1 Hierarchical Network
5.2.2 Electronic Core Plane
5.2.3 Optical Control Plane
5.2.4 Optical Data Plane
5.3 Communication Process
5.3.1 Dark Silicon Pattern Transition
5.3.2 Optical Routing Configuration
5.3.3 Optical Data Transmission
5.4 Routing and Wavelength Allocation Scheme
5.4.1 Motivating Example
5.4.2 Problem Formulation
5.4.3 Heuristic Routing and Wavelength Allocation Scheme
5.5 Performance Evaluation
5.5.1 Performance Model
5.5.2 Simulation Setup
5.5.3 Simulation with Fixed Dark Silicon Patterns
5.5.4 Simulation with Random Dark Silicon Patterns
5.6 Summary
6 Conclusion and Future Work
6.1 Conclusions
6.1.1 Main Contributions
6.1.2 Limitation Issues
6.1.3 Potential Improving Solutions
6.2 Future Work
6.2.1 Reliable ONoC Architecture
6.2.2 3D ONoC Architecture
6.2.3 Intra/Inter-chip Hybrid ONoC Architecture
References
List of Tables
2.1 Comparisons of Two Typical ONoC Architectures
2.2 Comparisons of Different Optical Crossbars
2.3 Comparisons of Different All-Optical ONoC Schemes
2.4 Comparisons of Different Hierarchical ONoC Schemes
3.1 Simulation Settings for WRH-ONoC
3.2 Hardware Requirement Comparison between WRH-ONoC and λ-router
3.3 Hardware Costs Comparison (area in mm²)
3.4 Optical Energy Parameters
3.5 Electronic Energy Parameters (45nm Process)
4.1 Simulation Settings for DWRMR
5.1 Power Consumption Parameters of Optical Devices
List of Figures
1.1 An example of ENoC. (a) 36 cores are connected in a 6 × 6 mesh network.
Each core Ci connects to an electronic router Ri through network interface
(NI). (b) A typical virtual-channel router.
1.2 An example of optical communication process in a typical ONoC
architecture using wavelength-based routing, with one source core and two
destination cores by using two different wavelengths.
1.3 A typical electronic-optical hybrid ONoC architecture in mesh topology,
which consists of an electronic control network for path reservation and
an optical data network for optical transmission.
2.1 Optical switching elements with passive and active MRs. (a)-(b) passive
MRs have different wavelengths with different diameters; (c)-(d) active
MRs are tuned by heating or applying a voltage.
2.2 Two typical optical routers for ONoC. (a) Wavelength-based optical
router, GWOR, and its wavelength routing matrix. (b) Configurable
optical router, Cygnus, and its MR configuring matrix.
2.3 The reasons of wavelength limitation in ONoC, by considering (a) the
crosstalk noise of different optical signals and (b) the maximal acceptable
optical input power for each waveguide.
2.4 Example of two different kinds of ONoC architectures. (a) λ-router is
an all-optical wavelength-routed ONoC with fixed wavelength routing.
(b) Optical data network in an electronic-optical hybrid ONoC with
dynamical routing and wavelength allocation.
2.5 ORNoC is a typical ring-based ONoC. (a) Logical interconnection of
an ORNoC with 6 cores; (b) wavelength allocation for clockwise optical
interconnect; (c) wavelength allocation for counter-clockwise optical
interconnect; (d) wavelength routing matrix.
2.6 Three typical optical crossbar architectures. (a) Multi-Write-Single-Read
(MWSR), (b) Single-Write-Multi-Read (SWMR), (c) Multi-Write-Multi-Read
(MWMR).
2.7 Two kinds of hybrid ONoC architectures, (a) mesh-based optical global
network with path reservation, and (b) crossbar-based optical global
network with wavelength routing.
3.1 The operation principle of (a) microring resonator as a
wavelength-selective filter, and (b) the optical switch by filtering optical
signals with different wavelengths.
3.2 The principle of λ-router: (a) the connection architecture; (b) optical
switching element (OSE); (c) wavelength routing matrix.
3.3 An example of WRH-ONoC for connecting 160 cores using 25 wavelengths
with 3 levels of λ-routers and 5 sibling gateways.
3.4 The structure of gateway: (a) two internal data paths for wavelength
assignment of upward and downward traffics; (b) E-O converter with
MR-based modulators; (c) O-E converter with MR-based photodetectors.
3.5 The positional prefix address for the cores, λ-routers, and gateways
according to their positions in the network hierarchy.
3.6 The routing process of an inter-subsystem packet: upward transmission,
turnover, and downward transmission.
3.7 The comparison of average communication delay from simulation results
and modelling, {N,Wmax, g} to be (a) {320,20,4} and {480,30,6}, and
(b) {400,25,5} and {640,40,8}.
3.8 The trace-based simulations for different ONoC schemes with 64 cores:
(a) the average end-to-end packet delay; (b) the packet delay variations
over time with the blackscholes trace.
3.9 Performance analysis for different ONoC schemes with 400 cores using
synthetic unicast traffic: (a) average end-to-end delay; (b) average
throughput per core.
3.10 Performance analysis with different network sizes: (a) average
end-to-end packet delay; (b) throughput per core.
3.11 Performance analysis with different number of sibling gateways:
(a) average end-to-end delay; (b) throughput per core.
3.12 Performance analysis with different buffer sizes: (a) average end-to-end
packet delay; (b) throughput per core.
3.13 Performance analysis with the locality traffic patterns: (a) end-to-end
packet delay; (b) throughput per core.
3.14 Performance analysis with different multicast distributions: (a) average
end-to-end delay; (b) average throughput.
3.15 Performance analysis with different multicast ratios ω: (a) average
end-to-end delay; (b) average throughput.
3.16 Average energy efficiency with (a) 64 cores and data traces; (b) 400 cores
and synthetic traffics.
4.1 Analysis of the multicast traffic in a 64-core system running PARSEC
benchmarks, (a) the ratio of multicast packets for each core; (b) the
average ratio of interactive multicast packets within the same multicast
group.
4.2 Replication-based multicast routing schemes for a 4 × 4 ONoC, (a)
unicast-based scheme with exclusive wavelengths assignment for each
routing path; (b) tree-based and (c) path-based schemes with optical
splitters in the intermediate routers.
4.3 Overview of the proposed multicast scheme, (a) the dynamically
established multicast ring for routing; (b) the principle of
single-send-multi-receive transmission.
4.4 Network architecture with three logical planes: core plane (microcores),
control plane (centralized multicast ring/wavelength allocation), and
forwarding plane (multicast packet delivery).
4.5 Main components and communication process in (a) the core plane (for
the local configuration), and (b) the control plane (for the centralized
routing and wavelength allocation).
4.6 The principle of multicast-enabled optical router, (a) the router
architecture for a single wavelength; (b) different switch status by tuning
the resonant wavelength of MR.
4.7 Multicast ring reuse scheme within the multicast group, by interchanging
between two states: (a) multicast routing, and (b) optical token
arbitration.
4.8 Hamiltonian cycles and the labelling scheme for a 4 × 4 ONoC, starting
from the source core of (a) (xs, ys) = (0, 0) in the counter-clockwise
Hamiltonian cycle H−, and (b) (xs, ys) = (1, 2) in the clockwise
Hamiltonian cycle H+, respectively.
4.9 Comparison with different multicast routing schemes, unicast-based (UM),
tree-based (TM), and path-based (PM), in the average (a) end-to-end
packet delay and (b) network throughput.
4.10 Performance evaluation with different number of available wavelength
channels for DWRMR, in the average (a) end-to-end packet delay and
(b) network throughput.
4.11 Performance evaluation of multicast ring reuse for the interactive
multicast traffic, τ = 0.1, 0.3, 0.5, 0.7, 0.9 in the average (a) end-to-end
packet delay and (b) network throughput.
4.12 Average multicast packet delay in different application of trace-based
simulations.
5.1 Two typical kinds of dark silicon patterns in an 8×8 mesh based ONoC
with 64 cores, (a) fixed pattern, with four equivalent groups which are
active in turn; (b) random pattern, with variable number of active cores
and spatial distributions.
5.2 Dark-ONoC architecture: electronic core plane with a manager core for
the dark silicon pattern transition; optical control plane for the
centralized routing and wavelength allocation, and fast configuration of
optical routing paths; optical data plane for the non-blocking transmission
of massive optical packets.
5.3 (a) Each core connects with an optical router through the network
interface (NI). (b) Local routing table for each active core in the network
interface, in which each routing path is related to a wavelength and an
output port.
5.4 The connection of optical interfaces in the optical control channel from
the RWA to cores and routers, with one manager core and n cores using
n wavelengths (λ1,...,λn).
5.5 The configurable optical router for a specific wavelength, and the MR
configuring matrix. Mi: MR for modulation; Di: MR for photodetection;
Ri: MR for routing.
5.6 Two different routing and wavelength allocation methodologies:
(a) minimizing the number of intermediate routers (9 intermediate routers and
5 wavelengths) and reducing the length of routing paths; (b) minimizing
the number of required wavelengths (29 intermediate routers and 1
wavelength).
5.7 Average number of wavelengths for different schemes with the fixed dark
silicon patterns in different network sizes, (a) with 25% active cores (16,
36, and 64 for 8×8, 12×12, 16×16); and (b) with 16 active cores.
5.8 Performance comparison for fixed dark silicon pattern, (a) average packet
delay; (b) maximal network throughput; (c) average power consumption.
5.9 Average number of required wavelengths for different schemes with
synthetic random dark silicon patterns, (a) in 8×8 ONoC with 8, 16, 24,
and 32 active cores; and (b) in 16×16 ONoC with 16, 32, 48, and 64
active cores.
5.10 Average number of required wavelengths for different dark silicon
patterns from data traces in 8×8 ONoC, in (a) light case with 8-16 active
cores, and (b) heavy case with 16-32 active cores.
Chapter 1
Introduction
An inter-core communication network is an essential component of many-core
processors. Recent advances in silicon photonic devices and three-dimensional
integration technologies have made chip-scale optical communication networks
a reality. Optical Network on Chip (ONoC) is an emerging optical
interconnection technology for many-core processors. It integrates
silicon-compatible optical devices with traditional electronic devices on the
same chip to construct an optical inter-core network. Existing electronic
network architectures and communication schemes cannot be employed in ONoC
directly, since optical interconnects have physical properties that differ
greatly from those of their electronic counterparts. Wavelength division
multiplexing (WDM) is widely used in ONoC to increase the bandwidth capacity.
Wavelength-based routing can be implemented by transmitting optical signals
simultaneously in the same optical interconnect on different wavelengths.
However, the maximal number of available wavelengths is limited by the
requirements of reliable communication and high power efficiency.
In this thesis, the design methodology of wavelength-reused Optical Network on
Chip architectures and communication schemes is explored to achieve
high-performance and energy-efficient inter-core communication for many-core
processors. This chapter first introduces the motivations for wavelength-reused
ONoC architectures, including the communication requirements of many-core
processors, the comparison between electronic Network on Chip (ENoC) and ONoC,
and the main advantages and challenging problems in ONoC design. Then, three
important research issues arising from different communication requirements of
many-core processors are presented, i.e., network scalability, multicast
communication, and dark silicon. Accordingly, three wavelength-reused ONoC
architectures and communication schemes are proposed in this thesis to solve
these problems. Finally, the structure of this thesis is given.
1.1 Communication of Many-Core Processor
The processor, namely the Central Processing Unit (CPU), is one of the most
important components of current computing systems. It can be considered the
heart of a computing system, performing almost all the data processing tasks.
With the continuous development of integrated circuit manufacturing
technologies, more and more transistors can be integrated in the same chip area
of a processor. To further boost computing capability with high energy
efficiency, processors are evolving from single-core to many-core designs,
instead of increasing the clock frequency, which leads to high power
consumption and serious thermal problems (Blake, Dreslinski, and Mudge, 2009).
Nowadays, many-core processors are widely used in personal computers, mobile
phones, and tablets. For some high-performance computing applications, such as
cloud computing, data centers, and supercomputing systems, the number of cores
integrated in the same processor can reach several hundred (Borkar, 2007;
Nychis, Fallin, Moscibroda, Mutlu, and Seshan, 2012).
In industry, several commercial many-core processors have already been released.
TILE-Gx72 with 72 cores in a 40 nm process was released in 2013 and it
can be used for networking and data center applications (Mellanox, 2013). Oracle
announced the 32-core 256-thread processor, SPARC M7, in 2015 and it can even
scale from 32 cores to 512 cores (Oracle, 2015). An 80-core research chip, Teraflops,
was developed by the Tera-Scale Computing Research Program at Intel Corporation
using a 65 nm CMOS process in 2007 (Intel, 2007), and the currently commercially available
Intel Xeon Phi Processor 7290 integrates up to 72 cores running 288 threads in a
14 nm process (Intel, 2016). SW26010 is a 260-core 64-bit RISC processor designed
for the Sunway TaihuLight supercomputer, which is the fastest supercomputer in the
world in TOP500 list at present (Fu, Liao, Yang, Wang, Song, Huang, Yang, Xue, Liu,
Qiao, Zhao, Yin, Hou, Ge, Zhang, Wang, Zhou, and Yang, 2016). In 2016, Adapteva
reported a 1024-core 64-bit RISC processor, Epiphany-V, using 16 nm process for deep-
learning and self-driving applications (Olofsson, 2016). With the powerful computation
capability, many-core processors are becoming the mainstream computational platform
for cloud computing, data center, and supercomputing applications, by enabling high
computational parallelism with large numbers of micro processing cores.
The enormous communication requirements among these high-speed cores also lead
to challenging research problems, especially when the data is distributively stored in
the cache hierarchy of many-core processors (Dally and Towles, 2001; Li, Nicopoulos,
Richardson, Xie, Narayanan, and Kandemir, 2006). For instance, to utilize the cache
space more efficiently, each core has some dedicated data and instruction cache spaces,
while the last level cache is uniformly distributed to all the cores and can be shared
through message passing. Thus, the inter-core communication network must have
very low end-to-end transmission delay and high throughput capacity, especially for
some distributed computing and multimedia applications. According to the technical
prediction in (Vantrease, Schreiber, Monchiero, McLaren, Jouppi, Fiorentino, Davis,
Binkert, Beausoleil, and Ahn, 2008), the inter-core communication network needs to
achieve a bandwidth capacity of more than 10 terabytes per second. Moreover, the enormous
inter-core communication traffic also leads to high energy consumption. For
instance, up to 36% of on-chip power consumption is contributed by data communication
in the 16-core MIT RAW processor, and the power consumption of inter-core
communication in the 80-core Intel TeraFLOPS processor can reach around 100 Watts
(Vangal, Howard, Ruhl, Dighe, Wilson, Tschanz, Finan, Singh, Jacob, Jain, Erraguntla,
Roberts, Hoskote, Borkar, and Borkar, 2008). Furthermore, some critical
messages in different application scenarios, such as protocol-specific control messages
and cache coherence messages, require quality-of-service (QoS) guaranteed
communication (Winter and Fettweis, 2011; Munk, Freier, Richling, and Chen,
2015), i.e., reliable communication with bounded delay, and multicast communication
(Rodrigo, Flich, Duato, and Hummel, 2008), i.e., one-to-multiple communication.
Therefore, the design of high-performance inter-core communication network is be-
coming a significant and challenging problem in the development of many-core pro-
cessors (Nychis, Fallin, Moscibroda, Mutlu, and Seshan, 2012). Generally, the funda-
mental communication requirements of the many-core processor include low end-to-
end packet delay, high network throughput, low hardware cost, and low energy consumption.
For different application scenarios, there are also some application-specific
communication requirements, such as quality-of-service guarantees, multicast
communication, and network scalability.
1.2 Inter-Core Communication Architecture
Most existing many-core processors utilize the electronic Network on Chip (ENoC)
architecture to implement the inter-core communication (Daya, Chen, Subramanian,
Kwon, Park, Krishna, Holt, Chandrakasan, and Peh, 2014; Lu, Fu, Wang, Han, Yan,
and Li, 2015). ENoC introduces the design principles of networking and store-and-
forward routing into the chip-scale communication (Marculescu, Hu, and Ogras, 2005;
Hoskote, Vangal, Singh, Borkar, and Borkar, 2007). A typical ENoC architecture in-
terconnects all the cores with electronic routers in a fixed topology and routes the data
packets distributively from the source core to the destination core in the hop-by-hop
manner. Nevertheless, for a many-core processor with a large number of cores, the
traditional electronic interconnects cannot appropriately satisfy the communication
requirements with low packet delay, high network throughput, and low energy con-
sumption (Owens, Dally, Ho, Jayasimha, Keckler, and Peh, 2007). Optical Network on
Chip (ONoC) is proposed as a promising alternative by using silicon photonic devices
to transmit the data packets through the modulated optical signals (Kirman, Kir-
man, Dokania, Martinez, Apsel, Watkins, and Albonesi, 2006; Kurian, Miller, Psota,
Eastep, Liu, Michel, Kimerling, and Agarwal, 2010). Moreover, wavelength division
multiplexing (WDM) can be employed to transmit multiple optical signals with dif-
ferent wavelengths at the same time to improve the bandwidth capacity and routing
flexibility (Batten, Joshi, Stojanovic, and Asanovic, 2012). In this section, the typical
network architectures of ENoC and ONoC are introduced.
1.2.1 Electronic Network on Chip
Electronic Network on Chip was first proposed around 2000 (Guerrier and Greiner,
2000). It is a communication technology for many-core processors to replace the tra-
ditional bus-based architecture. The main design principle of ENoC is to construct
a distributed inter-core network and to realize the store-and-forward routing between
the cores (Dally and Towles, 2001; Kumar, Jantsch, Soininen, Forsell, Millberg, Oberg,
Tiensyrja, and Hemani, 2002). In a typical ENoC architecture, each core is connected
to an electronic router through a network interface. The electronic router is a compact
and energy-efficient routing/switching device customized for the chip-scale communi-
cation. All the routers are interconnected by electronic links into different network
topologies and are designed to implement different routing algorithms to reduce the
communication delay and improve the network throughput, such as in mesh-based
ENoC schemes (Liu, Gu, and Yang, 2012). As the example in Figure
1.1(a) shows, 36 cores are interconnected by an ENoC in a 6 × 6 mesh topology. Each
core Ci corresponds to an electronic router Ri with the same address in the network.
For the inter-core communication in an ENoC architecture, the source core sends the
data packets through the network interface to the connected electronic router. Then,
the electronic router calculates the routing path hop-by-hop according to the address
Figure 1.1: An example of ENoC. (a) 36 cores are connected in a
6 × 6 mesh network. Each core Ci connects to an electronic router Ri
through network interface (NI). (b) A typical virtual-channel router.
of destination core. At the destination side, the data packets are received through the
connected network interface. The network interface also conducts the end-to-end flow
control (Concer, Bononi, Soulie, Locatelli, and Carloni, 2009).
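The hop-by-hop path computation described above can be illustrated with a minimal sketch. It assumes XY (dimension-order) routing and row-major router numbering in the 6 × 6 mesh of Figure 1.1(a); both are assumptions made for illustration, since the text does not fix a particular routing algorithm, and the function names are hypothetical.

```python
from typing import List

MESH_WIDTH = 6  # 6 x 6 mesh, as in Figure 1.1(a); row-major numbering is assumed


def xy_next_hop(current: int, dest: int, width: int = MESH_WIDTH) -> int:
    """Return the next router on an XY route, or `current` if already at `dest`."""
    cx, cy = current % width, current // width
    dx, dy = dest % width, dest // width
    if cx != dx:  # first correct the column (X dimension)
        return current + (1 if dx > cx else -1)
    if cy != dy:  # then correct the row (Y dimension)
        return current + (width if dy > cy else -width)
    return current


def xy_route(src: int, dest: int, width: int = MESH_WIDTH) -> List[int]:
    """List the routers visited from `src` to `dest`, one hop at a time."""
    path = [src]
    while path[-1] != dest:
        path.append(xy_next_hop(path[-1], dest, width))
    return path


print(xy_route(0, 14))          # routers visited from R0 to R14
print(len(xy_route(0, 35)) - 1)  # hop count from corner to corner
```

Each router only needs the destination address to choose its output port, which is why the decision can be taken independently at every hop.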
Electronic router is the key component in an ENoC architecture (Kim, Nicopoulos,
Park, Narayanan, Yousif, and Das, 2006). It usually conducts the store-and-forward
routing for data packets. In Figure 1.1(b), a typical electronic router with multiple
virtual channels is illustrated for the mesh-based ENoC, where the virtual-channel is
a technology to improve the efficiency of electronic links between neighbouring routers
(Nicopoulos, Park, Kim, Vijaykrishnan, Yousif, and Das, 2006; Liu, Gu, and Yang,
2010). It can be seen that each electronic router has five pairs of input and output
ports to connect with the local core and four neighbouring routers in different directions,
denoted by West, North, East, and South. Internally, the electronic router
consists of input buffers for each virtual channel at each input port, a routing computation
unit, a virtual-channel allocator, a crossbar switch, and a switch allocator.
The electronic router works in a pipeline manner to complete the routing computation,
virtual channel allocation, switch allocation, and switch transfer for each packet (Kim,
Nicopoulos, Park, Narayanan, Yousif, and Das, 2006). It is worth noting that different
adaptive routing algorithms can be implemented in the electronic router considering
both the address of destination core and the load balance in the network to improve
the communication performance (Liu, Gu, and Yang, 2012).
At present, the design methodology of ENoC is widely studied (Marculescu, Hu, and
Ogras, 2005; Marculescu, Ogras, Peh, Jerger, and Hoskote, 2009; Ma, Jerger, Wang,
Lai, and Huang, 2014), including the network topology, routing algorithm, router archi-
tecture, flow control scheme, task mapping algorithm, fault tolerance scheme, etc. For
instance, the design of network topology focuses on reducing the average communica-
tion distance and increasing the network connectivity (Agarwal, Iskander, and Shankar,
2009), the design of routing algorithm focuses on reducing the network congestion
through load balance (Gratz, Grot, and Keckler, 2008), and the design of router archi-
tecture focuses on decreasing the processing delay and hardware cost (Kim, Nicopoulos,
Park, Narayanan, Yousif, and Das, 2006). According to the performance models and
simulation results, ENoC can obtain good communication performance for a processor
with tens of cores (Pande, Grecu, Jones, Ivanov, and Saleh, 2005). However, due to the
physical constraints of electronic interconnects in wire delay, bandwidth, power dissipation,
and signal interference, it is hard for the ENoC architecture to satisfy the communication
requirements of many-core processors with hundreds of cores or more (Shacham,
Bergman, and Carloni, 2008). For instance, the transmission delay, power consump-
tion, and signal interference of the traditional electronic interconnects will increase
significantly, due to the reduced width of metallic wires, the increased average number
of hops in the routing paths, and the lengthened global communication paths (Bjer-
regaard and Mahadevan, 2006). Moreover, the input buffer and crossbar switch in an
electronic router dominate the overall hardware cost and power consumption, and their
requirements are related to the number of cores connected in an ENoC (Kahng, Li,
Peh, and Samadi, 2012). For instance, in a mesh-based ENoC, the energy consumption
for transmitting a 168-bit packet through one hop of router and a 1 mm link is up to
197 pJ, with 4-packet input buffer (Hamedani, Jerger, and Hessabi, 2014). When the
network size increases, the power consumption of ENoC can take more than 30% of
overall power consumption in a many-core processor (Owens, Dally, Ho, Jayasimha,
Keckler, and Peh, 2007; Parikh, Das, and Bertacco, 2014).
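To put the per-hop figure in perspective, the following back-of-the-envelope sketch converts the quoted 197 pJ per hop for a 168-bit packet into energy per bit and scales it to a multi-hop path. It assumes, for illustration only, that every hop (one router plus a 1 mm link) costs the same quoted upper bound; the function names are hypothetical.

```python
PACKET_BITS = 168          # packet size quoted in the text
ENERGY_PER_HOP_PJ = 197.0  # quoted upper bound for one router hop plus a 1 mm link


def energy_per_bit_pj() -> float:
    """Energy per bit for a single hop, in picojoules."""
    return ENERGY_PER_HOP_PJ / PACKET_BITS


def path_energy_pj(hops: int) -> float:
    """Energy for one packet over `hops` hops, assuming identical hops."""
    return hops * ENERGY_PER_HOP_PJ


print(round(energy_per_bit_pj(), 2))  # pJ per bit per hop
print(path_energy_pj(10))             # e.g. a 10-hop corner-to-corner path in a 6 x 6 mesh
```

Under this assumption the packet energy grows linearly with the hop count, which is the scaling concern raised above for large meshes.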
Even though some efforts were made to reduce the communication delay and power
consumption in the ENoC design, they cannot efficiently overcome the physical limitations
of metal-based electronic interconnects. To reduce the packet delay for global
communication, some application-specific express links are added to the traditional
mesh-based ENoC (Ogras and Marculescu, 2006; Jiao and Fu, 2011). These physical
express links provide direct interconnection between two cores, but they are statically
configured for a specific application in advance, which limits scalability. Express
virtual channels have been proposed to dynamically establish virtual express paths for
the communication between specific cores (Kumar, Peh, Kundu, and Jha, 2007; Chen,
Agarwal, Krishna, Koo, Peh, and Saraswat, 2010). However, since this scheme relies
on virtual channels in the electronic routers, the increased buffer space can lead to
high hardware cost and power consumption. Another scheme to reduce the commu-
nication distance is to employ three-dimensional (3D) network architecture to replace
the planar network (Rahmani, Latif, Liljeberg, Plosila, and Tenhunen, 2010; Take,
Matsutani, Sasaki, Koibuchi, Kuroda, and Amano, 2014; Chen, Chao, and Wu, 2015).
The main advantages of 3D ENoC include the reduced average communication distance
and the increased routing diversity. However, it can lead to severe thermal issues with
multiple layers, large buffer space requirements with more input/output ports in the
electronic routers, and serious traffic congestion in the center of the 3D network.
Therefore, some emerging interconnection technologies should be employed in many-
core processors to satisfy the communication requirements and overcome the limitations
of electronic interconnects (Karkar, Mak, Tong, and Yakovlev, 2016). This thesis fo-
cuses on Optical Network on Chip which makes use of the recent advances in silicon
photonic interconnects (Fusella and Cilardo, 2016).
1.2.2 Optical Network on Chip
The idea of exploiting optical interconnects in chip-scale communication was first
proposed in 2000 (Collet, Litaize, Campenhout, Jesshope, Desmulliez, Thienpont,
Goodman, and Louri, 2000; Miller, 2000). All the required optical devices, such as laser
sources, modulators, optical routers/switches, wavelength multiplexers, and photodetectors,
have been demonstrated in nanophotonic technologies, and they can be integrated
with the electronic devices in the same chip, i.e., they are silicon compatible (Chen, Chen,
Haurylau, Nelson, Fauchet, Friedman, and Albonesi, 2005; Kirman, Kirman, Dokania,
Martinez, Apsel, Watkins, and Albonesi, 2006). Optical Network on Chip is an optical
inter-core communication technology that utilizes the emerging silicon-compatible photonic
devices to achieve high-performance and energy-efficient communication for the
many-core processor, especially when it integrates a large number of cores (Shacham,
Bergman, and Carloni, 2008; Li, Browning, Gratz, and Palermo, 2014). In an ONoC
architecture, the data packets are transmitted from the source core to the destination
core through the silicon waveguides (optical medium) using modulated optical signals
with ultra-low end-to-end delay (at the speed of light) and low power dissipation, and
multiple optical signals can be transmitted simultaneously with different wavelengths
to achieve extremely high bandwidth and routing flexibility. The wavelength-multiplexed
optical signals can be transmitted simultaneously with negligible interference
and be demultiplexed at the receivers by using wavelength-specific filters. An overall
bandwidth of multiple terabits per second can be achieved with limited power consumption
by using optical interconnects (Petracca, Lee, Bergman, and Carloni, 2008).
Therefore, it is a promising alternative to address the communication problems of tra-
ditional ENoCs. According to some physical experiments, using optical interconnects,
the average end-to-end communication delay can be reduced by 70% compared to the
optimized electronic interconnects (Zhang and Louri, 2010), and ideally the power
efficiency can be improved by four times (Dokania and Apsel, 2009).
In industry, Luxtera demonstrated the first commercial CMOS (complementary
metal-oxide-semiconductor) compatible photonic interconnects to provide high-speed
optical communications in a single chip with 10 Gbps channel in 2006 (Gunn, 2006).
In 2012, STMicroelectronics envisioned a roadmap for the development of silicon pho-
tonics, in which four key applications are the main driving forces: intensive computing,
broadband communication, mass storage, and consumer multimedia (Zuffada, 2012).
Intel announced the use of silicon photonic architecture to define the next-generation
many-core processors and servers, and revealed its first inexpensive 100 Gbps optical
chip in 2013 (Intel, 2013). IBM advanced a significant step by integrating the photonic
chip on the same package as CPUs in 2015 (Hruska, 2015). Although the development
of silicon photonic interconnects in industry is still at an early stage, the
research on the design methodology of ONoC architecture with the silicon photonic
interconnects is important for future many-core processors.
Figure 1.2 illustrates the optical inter-core communication process in a typical
ONoC architecture with the wavelength-based routing. In this example, the source
core sends optical packets to two destination cores by using two different wavelengths,
λ1 and λ2. It can be seen that the fundamental optical devices in a typical ONoC
architecture generally include laser sources, silicon waveguides, microring resonators,
optical routers, modulators, and photodetectors (Fusella and Cilardo, 2016; Gu, Mo,
Xu, and Zhang, 2009; Chan, Hendry, Biberman, Bergman, and Carloni, 2010). The laser
source provides the optical signals on which the data packets are modulated and carried.
It can be implemented either as an off-chip laser, which saves chip area and
energy consumption and distributes optical signals to all the cores through separate power
waveguides (Morris, Jolley, and Kodi, 2014; Koohi and Hessabi, 2014), or as separate
on-chip lasers (e.g., Vertical Cavity Surface Emitting Lasers, VCSELs) directly for
Figure 1.2: An example of optical communication process in a typical
ONoC architecture using wavelength-based routing, with one source
core and two destination cores by using two different wavelengths.
each core to achieve flexible adjustment of optical power (Chen, Zhang, Contu, Klamkin,
Coskun, and Joshi, 2014). The silicon waveguide is the optical transmission medium with
low power attenuation and high confinement of the optical signals. The propagation
loss of a waveguide can be as low as 1.5 dB/cm, and a 90° waveguide bend can be made
with a signal loss of 0.05 dB (Grani, Bartolini, Furdiani, Ramini, and Bertozzi, 2014).
According to the demonstration in (Lee, Chen, Biberman, Liu, Hsieh, Chou, Dadap,
Xia, Green, Sekaric, Vlasov, Osgood, and Bergman, 2008), a 5 cm long waveguide can
achieve an aggregate bandwidth of 1.28 Tbps by multiplexing 32 wavelength channels,
each having a bandwidth of 40 Gbps. An ENoC, in contrast, would require 256
parallel electronic wires working at 5 GHz to achieve this bandwidth, which leads to huge
hardware costs and power consumption.
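The bandwidth comparison above can be checked with a short calculation. The sketch assumes that each electronic wire carries one bit per clock cycle, so a wire clocked at 5 GHz contributes 5 Gbps; this signalling assumption and the function name are illustrative and not taken from the cited demonstration.

```python
def aggregate_bandwidth_gbps(channels: int, rate_gbps: float) -> float:
    """Aggregate bandwidth of `channels` parallel channels, in Gbps."""
    return channels * rate_gbps


# One waveguide carrying 32 wavelength channels at 40 Gbps each.
optical_gbps = aggregate_bandwidth_gbps(32, 40)

# 256 electronic wires at 5 GHz, assuming one bit per wire per cycle.
electronic_gbps = aggregate_bandwidth_gbps(256, 5)

print(optical_gbps / 1000, "Tbps over a single waveguide")
print(electronic_gbps / 1000, "Tbps over 256 wires")
```

Both configurations reach the same aggregate bandwidth, but the optical case needs a single waveguide where the electronic case needs 256 wires.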
Microring resonator (MR) is a compact (with a radius of 3-5 µm) wavelength-
selective optical device with a specific resonant wavelength. As shown in Figure 1.2,
when an MR is placed close to two waveguides, if the wavelength of input optical signal
matches with the resonant wavelength of MR, the optical signal will be coupled from the
original waveguide into the MR and output to the other waveguide. It can be considered
that the MR can absorb/inject the optical signal with resonant wavelength from/to the
adjacent waveguide. Generally, the resonant wavelength of each MR is determined by
its material and diameter, and it can be tuned by electro-optic
or thermo-optic effects, i.e., by applying a bias voltage or heating (Dong, Shafiiha, Liao,
Liang, Feng, Feng, Li, Zheng, Krishnamoorthy, and Asghari, 2010). Thus, there are
two kinds of MRs: passive MR with diameter-determined resonant wavelength, and
active MR with tunable resonant wavelength. According to this property, optical
routers, modulators, and photodetectors can be implemented based on MRs, as shown
in Figure 1.2. In this figure, the active MRs with two different resonant wavelengths
can be tuned on or off separately by heating.
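The wavelength-selective behaviour of an MR can be sketched as an idealized add-drop filter. The model below is a simplification for illustration: wavelengths are represented by integer channel indices, a matching signal is coupled completely to the other waveguide, and insertion loss and finite filter linewidth are ignored. The class and field names are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Microring:
    resonant_channel: int  # index of the resonant wavelength (1 stands for λ1)
    is_on: bool = True     # an active MR can be tuned on or off

    def route(self, channels: List[int]) -> Tuple[List[int], List[int]]:
        """Split the input channels into (through port, drop port)."""
        through, drop = [], []
        for ch in channels:
            if self.is_on and ch == self.resonant_channel:
                drop.append(ch)     # coupled into the MR and out to the other waveguide
            else:
                through.append(ch)  # stays in the original waveguide
        return through, drop


mr = Microring(resonant_channel=1)
print(mr.route([1, 2]))                      # λ1 is dropped, λ2 passes through
print(Microring(1, is_on=False).route([1, 2]))  # MR tuned off: both pass through
```

A passive MR corresponds to an instance whose `is_on` flag never changes, while an active MR toggles it at run time.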
An optical router is constructed from waveguides and MRs to realize high-speed switching
for optical signals in each optical routing path based on their wavelengths. To leverage
the high-speed optical transmission from the source core to the destination core, all the
optical routers in an optical routing path need to be configured statically or dynamically
in advance. As shown in Figure 1.2, the optical routing path from the source core to
destination core 1 requires two MRs with wavelength λ1 to be configured, while the
optical routing path to destination core 2 does not require any MR to be configured for
transmitting the optical signal with wavelength λ2. If Nλ wavelengths are used in ONoC,
each optical router should integrate Nλ sets of MRs for routing optical signals with
different wavelengths. As shown in Figure 1.2, the MR-based modulator in the network
interface of the source core converts the electronic signal into the optical signal directly
with the on-off keying modulation scheme (i.e., the presence and absence of light for
bit '1' and bit '0', respectively), by using the electronic signal to tune an active MR
directly; in the network interface of the destination core, the MR-based photodetector
filters the optical signal with the matched wavelength, and the optical signal is absorbed
by germanium or SiGe material and is converted back to electronic current.
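The on-off keying scheme described above can be illustrated with a minimal sketch in which light is present for bit 1 and absent for bit 0, the path attenuates the signal by a loss given in dB, and the receiver applies a power threshold. The power levels, the 3 dB loss, and the threshold are hypothetical values chosen only for illustration.

```python
from typing import List


def ook_modulate(bits: List[int], on_power_mw: float = 1.0) -> List[float]:
    """Map each bit to an optical power level: light on for 1, off for 0."""
    return [on_power_mw if b else 0.0 for b in bits]


def attenuate(powers_mw: List[float], loss_db: float) -> List[float]:
    """Apply a total path loss expressed in dB."""
    factor = 10 ** (-loss_db / 10)
    return [p * factor for p in powers_mw]


def ook_detect(powers_mw: List[float], threshold_mw: float) -> List[int]:
    """Recover bits by comparing the received power with a threshold."""
    return [1 if p > threshold_mw else 0 for p in powers_mw]


bits = [1, 0, 1, 1, 0]
received = attenuate(ook_modulate(bits), loss_db=3.0)  # about half the power remains
recovered = ook_detect(received, threshold_mw=0.25)
print(recovered == bits)
```

The threshold has to sit below the attenuated "on" level, which is one reason the accumulated loss along an optical routing path matters for reliable detection.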
Recent advances in silicon photonics and 3D integrated circuits allow an ONoC
architecture to integrate a large number of these area-compact optical devices. The
optical devices can be deployed in different layers of the same chip with the electronic
devices to achieve flexible configuration. As shown in Figure 1.3, a typical electronic-
optical hybrid ONoC architecture in 4 × 4 mesh topology consists of an electronic
control network and an optical data network interconnected in identical topology in
different layers (Shacham, Bergman, and Carloni, 2008; Gu, Mo, Xu, and Zhang, 2009).
Unlike the traditional ENoC, ONoC lacks optical buffering and processing devices.
Thus, the electronic control network is used to conduct the routing and
wavelength allocation for the optical data network. The optical communication process
in Figure 1.2 only illustrates a part of the optical data network.
It can be seen from Figure 1.3 that each core is connected to an electronic router and
an optical router, which are interconnected through inter-layer links. The electronic
Figure 1.3: A typical electronic-optical hybrid ONoC architecture
in mesh topology, which consists of an electronic control network for
path reservation and an optical data network for optical transmission.
control network employs the packet routing to establish the optical routing path from
the source core to the destination core in the hop-by-hop manner, namely the control
packet reserves the corresponding optical routing path in the optical data network when
it passes an electronic router. The structure of electronic router is similar to the typical
router in ENoC. The optical data network utilizes the configurable optical routers with
active MRs. Multiple wavelengths can be used in the optical interconnects and optical
routers in the optical data network. Since the optical routing paths in the ONoC
architecture of Figure 1.3 are reserved dynamically for different pairs of source and
destination cores, path reservation may encounter congestion in the electronic control
network if no wavelength is available in some intermediate
optical interconnect. Hence, a key research problem is the routing and wavelength
allocation scheme, where the wavelength utilization in optical interconnects should be
balanced to deploy more optical routing paths with a limited number of wavelengths
in each optical interconnect.
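The routing and wavelength allocation problem can be illustrated with a minimal sketch. It assumes that a path must use the same wavelength on every optical interconnect it traverses (no wavelength conversion) and that the lowest-indexed free wavelength is chosen (first-fit). First-fit and the whole-path check are stand-ins chosen for brevity; they are not the allocation scheme proposed in this thesis, and the link names are hypothetical.

```python
from typing import Dict, List, Optional, Set


def reserve_path(links: List[str], free: Dict[str, Set[int]],
                 num_wavelengths: int) -> Optional[int]:
    """Reserve the lowest-indexed wavelength that is free on every link of the path.

    Returns the wavelength index, or None if the request is blocked.
    """
    for w in range(num_wavelengths):
        if all(w in free[link] for link in links):
            for link in links:
                free[link].discard(w)  # mark the wavelength as used on each hop
            return w
    return None


NUM_WAVELENGTHS = 2
free = {link: set(range(NUM_WAVELENGTHS)) for link in ("AB", "BC", "CD")}

print(reserve_path(["AB", "BC"], free, NUM_WAVELENGTHS))  # first wavelength on both links
print(reserve_path(["BC"], free, NUM_WAVELENGTHS))        # second wavelength on BC
print(reserve_path(["AB", "BC"], free, NUM_WAVELENGTHS))  # blocked: BC has no free wavelength
print(reserve_path(["CD"], free, NUM_WAVELENGTHS))        # first wavelength reused on a disjoint link
```

The third request is blocked by a single saturated interconnect even though the other link still has a free wavelength, which shows why balancing wavelength utilization across interconnects matters. The last request shows that paths on disjoint interconnects can share a wavelength.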
1.2.3 Advantages and Limitations of ONoC
Compared with ENoC, ONoC has the following advantages. (i) Very low end-to-end
communication delay, which is almost independent of the length of optical routing
path due to the high transmission speed of optical signals. Thus, ONoC is especially
preferable for long-distance and global communication. (ii) High bandwidth capacity
through the wavelength multiplexing ability. Each optical wavelength channel,
which is similar to a separate physical channel in ENoC, can achieve a bandwidth of
up to 40 Gbps (Vlasov, Green, and Xia, 2008) and multiple wavelength channels can
be multiplexed in the same optical interconnect to improve the bandwidth. (iii) Low
power consumption for inter-core communication. Once the optical routing path is
established between the source core and the destination core, the power consumption
of inter-core communication in ONoC is very low and almost independent of the length
of optical routing path. Even though the electronic-optical hybrid ONoC needs to consume
some power for routing path establishment, the overall power consumption is still
much lower than that of the ENoC, because the control packets are much
smaller than the data packets. (iv) Low signal interference in the optical routing paths.
In the ONoC architecture, there is only the insertion loss for every optical device in the
optical routing path. However, compared with the electromagnetic effects in electronic
interconnects with the high clock frequency, the insertion loss of optical interconnect
is much smaller (Chan, Hendry, Biberman, Bergman, and Carloni, 2010).
Due to the physical constraints of optical devices, there are some limitations on
the design of ONoC. (i) It lacks optical buffers and optical processing logic compared
with the traditional electronic interconnects. Thus, store-and-forward transmission
and distributed routing cannot be applied directly; otherwise, every packet would
undergo electronic-to-optical (E-O) and optical-to-electronic (O-E) conversions in each
hop, which leads to significant hardware cost and high power consumption, and the
low transmission delay cannot be achieved with frequent signal conversions. (ii) The
maximal number of available wavelengths in each optical interconnect is limited for
reliable communication. That is because the maximal optical power that can be injected
into the optical interconnect without non-linear effects is limited. For instance, at most
62 wavelengths can be used with 19 Gbps bandwidth and -20 dB noise
tolerance (Preston, Droz, Levy, and Lipson, 2011). Therefore, the limited number of
wavelengths should be used in an efficient way for the optical inter-core communication.
(iii) The hardware cost and power consumption of ONoC are related to the number of
used wavelengths. For the electronic-optical hybrid ONoC, the optical routing paths
are dynamically established by configuring the optical routers with wavelength-specific
MRs. Thus, the more wavelengths are used, the more wavelength-specific MRs are
required, and the more power is consumed for tuning these MRs. (iv) It lacks an
on-chip all-optical wavelength conversion device. The optical signals cannot change
their wavelengths within the limited number of available wavelengths and thus cannot
dynamically change their optical routing paths in wavelength-routed ONoCs.
To realize the optical communication for many-core processors with a limited num-
ber of wavelengths, and to decrease the number of required MRs and the power con-
sumption for tuning these MRs, wavelength reuse is an efficient design methodology for
ONoC architecture. Therefore, this thesis focuses on the design of wavelength-reused
Optical Network on Chip for many-core processors.
1.3 Challenging Issues in ONoC Design
Since the optical interconnects in an ONoC architecture have different physical proper-
ties and constraints compared with the electronic interconnects, such as low transmis-
sion delay, high bandwidth capacity, and wavelength multiplexing, the design method-
ology of network architecture and communication scheme for ONoC is different from
the traditional electronic-based schemes. Generally, the network architecture and the routing
and wavelength allocation scheme are the two main research problems in ONoC design.
This section gives a brief introduction; a more detailed analysis of different
network architectures and communication schemes is given in Chapter 2.
1.3.1 Network Architecture
Existing ONoC architectures generally can be divided into two categories according to
different types of optical components: all-optical ONoC and electronic-optical hybrid
ONoC. Their challenging design issues are discussed separately.
All-Optical ONoC only exploits optical devices to establish a non-blocking com-
munication network, e.g., using optical ring (Le Beux, Trajkovic, O’Connor, Nicolescu,
Bois, and Paulin, 2011) or optical crossbar (O’Connor, 2004; Koohi and Hessabi, 2014)
architectures. In the all-optical ONoC architecture, optical routing paths and wavelengths
are statically allocated between any two cores. Taking the λ-router as an example
(O’Connor, 2004), each optical routing path from/to a specific source/destination
core uniquely corresponds to a wavelength, and at least N different
wavelengths are required to interconnect N cores. Therefore, the most important benefit of all-
optical ONoC is the non-blocking optical communication between all the cores through
wavelength-based routing. However, the main limitations include the following. (i) The
number of available wavelengths is not sufficient for a many-core processor with a large
number of cores. (ii) The hardware cost and power consumption can be very high
when the network size is large. Since there are N(N − 1) optical routing paths for
the non-blocking interconnection of N cores by using N wavelengths, the number of
required MRs increases quadratically with the number of cores, and it
leads to high power consumption to tune the MRs. (iii) In an all-optical ONoC architecture,
the specific wavelength allocated to an optical routing path cannot be reused
dynamically by the other optical routing paths even if there is no communication in
that optical routing path, since the optical routing paths and transmission wavelengths
are statically assigned. Thus, the main challenging research problem in the design of all-
optical ONoC architecture is how to improve the scalability, i.e., to reduce the number
of required wavelengths and to decrease the hardware cost and power consumption.
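The scalability limits listed above can be made concrete with a short calculation. The sketch uses the relations stated in the text (at least N wavelengths and N(N − 1) optical routing paths for N cores) and compares the wavelength demand with the limit of 62 wavelengths quoted in Section 1.2.3 for one specific operating point. Treating 62 as a general bound is an assumption made only for illustration, and the function name is hypothetical.

```python
MAX_WAVELENGTHS = 62  # limit quoted in the text for 19 Gbps and -20 dB noise tolerance


def all_optical_demand(n_cores: int) -> dict:
    """Wavelength and routing-path demand of a non-blocking all-optical ONoC."""
    return {
        "wavelengths": n_cores,                    # at least N wavelengths
        "routing_paths": n_cores * (n_cores - 1),  # one path per ordered core pair
        "within_limit": n_cores <= MAX_WAVELENGTHS,
    }


for n in (16, 64, 256):
    print(n, all_optical_demand(n))
```

Under these assumptions a 16-core network stays within the limit, whereas 64 cores already exceed it, and the number of routing paths grows from 240 to 4032 over the same range.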
Electronic-Optical Hybrid ONoC consists of an optical data network for the
data communication by using the circuit-switching scheme and an extra electronic con-
trol network for the configuration of optical routing paths by using the packet-switching
scheme (Shacham, Bergman, and Carloni, 2008; Gu, Mo, Xu, and Zhang, 2009). As
shown in Figure 1.3, the electronic control network needs to reserve the optical rout-
ing path in a hop-by-hop manner, namely a control packet is routed in the electronic
control network in multiple hops and it reserves each hop of optical interconnect along
the optical routing path, i.e., tuning corresponding active MRs in the optical router
when the control packet arrives at a new electronic router. Thus, it may lead to a high
preparation delay for inter-core communication, especially when the distance between
the source core and the destination core is large. After the configuration of optical
routing path with a specific wavelength, optical packets can be transmitted from the
source core directly to the destination core without any intermediate buffering and
processing. Thus, the transmission delay is very low, and the dedicated optical routing
path can provide the quality-of-service guaranteed communication.
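The dependence of the preparation delay on the distance between the two cores can be sketched with a rough model. It assumes a 2D mesh with dimension-order routing and a fixed cost per hop; the per-hop figure is a hypothetical placeholder, not a measured value.

```python
def hop_count(src, dst):
    """Number of hops between two (x, y) mesh coordinates under dimension-order routing."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def setup_delay_ns(src, dst, per_hop_ns=2.0):
    """Path-reservation delay: the control packet pays a fixed cost at every hop.

    per_hop_ns is a hypothetical figure covering electronic routing and MR tuning.
    """
    return hop_count(src, dst) * per_hop_ns


# Neighbouring cores versus opposite corners of an 8x8 mesh.
print(setup_delay_ns((0, 0), (1, 0)))
print(setup_delay_ns((0, 0), (7, 7)))
```

In this model the reservation cost grows linearly with the hop count, whereas the subsequent optical transmission needs no per-hop processing.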
Compared with the all-optical ONoC architecture, the optical routing path and
transmission wavelength are dynamically allocated for each inter-core communication
in the electronic-optical hybrid ONoC, according to the locations of the source and
destination cores. With the variable wavelength usage in the optical interconnects (i.e.,
the free wavelengths are different from time to time), the optical routing path and the
allocated wavelength for a specific pair of source and destination cores can also be
varied if the wavelength adaptive routing is used (Fusella, Flich, Cilardo, and Mazzeo,
2015). Therefore, the most important advantage of the electronic-optical hybrid ONoC
is that it can achieve high wavelength utilization in the optical interconnects and can
be used with any number of wavelengths. However, the drawbacks of this kind of
ONoC architecture include: (i) the hop-by-hop optical routing path reservation may
lead to high inter-core communication delay for a many-core processor with a large
network size; (ii) the dedicated optical routing path and wavelength reservation can
result in serious congestion in some optical interconnects when the data rate of
inter-core communication is high. Hence, the main challenging research problems in
the electronic-optical hybrid ONoC include: (i) to design an efficient optical routing
path establishment scheme, which can configure the optical routers along the optical
routing path faster than transmitting a control packet hop-by-hop in the
electronic control network; (ii) to distribute the optical routing paths and wavelength
utilization uniformly in the routing and wavelength allocation scheme so as to
prevent the congestion of optical interconnects.
1.3.2 Routing and Wavelength Allocation
Routing and wavelength allocation (RWA) scheme is a key research problem in ONoC
when the network architecture is determined. It calculates the optical routing path from
the source core to the destination core, and allocates a free wavelength for the optical
routing path. In an ONoC architecture with a limited number of available wavelengths,
the RWA scheme has an important impact on the maximal communication capacity.
According to different ONoC architectures, there are two kinds of RWA schemes, i.e.,
fixed and dynamical RWA schemes.
Fixed Routing and Wavelength Allocation scheme is usually employed in the
all-optical ONoC architectures to establish non-blocking optical routing paths between
the cores. In the fixed RWA scheme, the optical routing path and wavelength are
statically allocated for each pair of source and destination cores. For instance, in
the λ-router architecture (O’Connor, 2004), the optical routing path between any two
cores is determined by using a specific wavelength along a fixed route. For the non-
blocking optical communication between N cores, the fixed RWA scheme needs to
allocate N(N − 1)/2 bidirectional optical routing paths without wavelength conflict. Thus,
it generally requires a large number of wavelengths. For instance, in the λ-router
architecture where all the wavelengths are utilized uniformly in each optical waveguide,
it still needs to use N different wavelengths to interconnect N cores. In the ORNoC
architecture (Le Beux, Trajkovic, O’Connor, Nicolescu, Bois, and Paulin, 2011), each
optical routing path is determined by using a different wavelength in the clockwise or
counter-clockwise optical interconnects, while the number of required wavelengths is
reduced by allocating the same wavelength for as many link-disjoint optical routing
paths as possible in the cyclic optical interconnects. However, when the number of
cores increases, the fixed RWA scheme in ORNoC needs to increase the number of
required wavelengths quadratically. Hence, the most important research problem for
the fixed RWA scheme is to balance the wavelength usage in the optical links and reuse
the same wavelength in the link-disjoint optical routing paths.
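The idea of reusing one wavelength on link-disjoint optical routing paths can be sketched on a simplified ring. The sketch assumes a single clockwise ring and a greedy first-fit assignment; both are illustrative simplifications and do not reproduce the actual ORNoC allocation, which also uses the counter-clockwise direction.

```python
def clockwise_links(src, dst, n_nodes):
    """Links crossed when travelling clockwise from src to dst; link k joins node k to k + 1."""
    links = []
    node = src
    while node != dst:
        links.append(node)
        node = (node + 1) % n_nodes
    return links


def first_fit_ring(n_nodes):
    """Give each ordered pair the lowest wavelength that is free on every link of its path."""
    used = []          # used[w] = set of links already carrying wavelength w
    assignment = {}
    for src in range(n_nodes):
        for dst in range(n_nodes):
            if src == dst:
                continue
            links = set(clockwise_links(src, dst, n_nodes))
            for w, busy in enumerate(used):
                if not (busy & links):
                    break              # wavelength w is free on the whole path
            else:
                w = len(used)          # no reusable wavelength: open a new one
                used.append(set())
            used[w] |= links
            assignment[(src, dst)] = w
    return assignment, used


assignment, used = first_fit_ring(6)
print(len(assignment), "paths share", len(used), "wavelengths")
```

Because every clockwise link is crossed by N(N − 1)/2 of the paths in this all-to-all setting, the wavelength count cannot drop below that load, which is consistent with the quadratic growth noted above.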
Dynamical Routing and Wavelength Allocation scheme is usually employed
in the electronic-optical hybrid ONoC architectures by using the electronic control net-
work to conduct the routing computation and wavelength allocation. In the dynamical
RWA scheme, the optical routing path and wavelength are dynamically allocated for
different pairs of source cores and destination cores. Since the wavelength usage of in-
termediate optical interconnects in an optical routing path is variable in this case, the
calculated optical routing path may not have a free wavelength, if the computation of
optical routing path and the allocation of wavelength are conducted separately. For in-
stance, in a mesh-based ONoC, if the dimensional routing algorithm (i.e., XY routing)
is used without consideration of the wavelength utilization in the optical interconnects
(Gu, Mo, Xu, and Zhang, 2009), the available wavelengths in the optical interconnects
near the center of the network will be exhausted very soon. However, according to
the analysis in (Yoo, Ahn, and Kim, 2003), the joint routing and wavelength alloca-
tion problem is NP-complete in general, and some heuristic schemes are often used to
solve this problem. Therefore, the most important research problem for the design of
dynamical RWA scheme is to consider the wavelength usage of optical interconnects
in the routing computation and to design an efficient heuristic scheme to reduce the
number of required wavelengths in the wavelength allocation.
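The situation in which routing and wavelength allocation are conducted separately can be sketched as follows. The sketch assumes a mesh with XY routing, a first-fit wavelength search, and the requirement that one wavelength is free on every link of the path; the function names and the request format are hypothetical.

```python
def xy_route(src, dst):
    """Directed links visited by XY routing: move along x first, then along y."""
    links = []
    x, y = src
    while x != dst[0]:
        nxt = x + (1 if dst[0] > x else -1)
        links.append(((x, y), (nxt, y)))
        x = nxt
    while y != dst[1]:
        nxt = y + (1 if dst[1] > y else -1)
        links.append(((x, y), (x, nxt)))
        y = nxt
    return links


def allocate(requests, n_wavelengths):
    """Route each request with XY routing, then pick the first wavelength free on all its links.

    Returns {request: wavelength index, or None when the request is blocked}.
    """
    busy = {}  # link -> set of wavelengths in use on that link
    result = {}
    for src, dst in requests:
        links = xy_route(src, dst)
        choice = None
        for w in range(n_wavelengths):
            if all(w not in busy.get(link, set()) for link in links):
                choice = w
                break
        if choice is not None:
            for link in links:
                busy.setdefault(link, set()).add(choice)
        result[(src, dst)] = choice
    return result


requests = [((0, 1), (3, 1)), ((1, 1), (2, 1)), ((0, 0), (0, 2))]
print(allocate(requests, 1))
print(allocate(requests, 2))
```

Since the route is fixed before the wavelength is chosen, a request is blocked as soon as its links run out of wavelengths, even when a detour with a free wavelength exists; a joint scheme would take the link occupancy into account when choosing the route.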
1.4 Motivations and Contributions
To realize high-performance and energy-efficient inter-core communication for many-
core processors, this thesis focuses on three important communication problems for
many-core processors: network scalability, multicast communication, and dark silicon.
Three wavelength-reused ONoC architectures and corresponding communication
schemes are proposed to address these communication requirements. Their
motivations and main contributions are presented in the following.
1.4.1 Network Scalability
The fast development of high-performance computing systems requires more and more
cores to be integrated in a many-core processor. It was predicted that thousands of
cores or even more would be integrated in a many-core processor in the near future (Kelm,
Johnson, Lumetta, and Patel, 2010; Nychis, Fallin, Moscibroda, Mutlu, and Seshan,
2012). With the large requirement of data communication between the cores, an
inter-core communication network should achieve high performance and high scalability,
i.e., obtaining low end-to-end communication delay and high bandwidth capacity with
low hardware cost and energy overhead, especially for the cloud computing and data
center applications (Lu, Fu, Wang, Han, Yan, and Li, 2015). At present, the
conventional electronic interconnect is sufficient for small-scale many-core processors,
e.g., with fewer than tens of cores. In the future, however, it will be hard for the
electronic interconnect to meet the communication requirements and energy efficiency
of large-scale many-core processors, e.g., with more than hundreds of cores, due to the
deep submicron effects of metallic interconnects,
e.g., increased wire delay and leakage power (Morris, Kodi, Louri, and Whaley, 2014).
Therefore, WRH-ONoC, a Wavelength-Reused Hierarchical Optical Network on
Chip architecture, is proposed to solve the network scalability problem by exploit-
ing the advantages of optical λ-router in non-blocking wavelength-based routing and
hierarchical networking in wavelength reuse among λ-routers. Using the WRH-ONoC
architecture, a many-core processor with a large number of cores is divided into mul-
tiple subsystems according to the number of available wavelengths. Then, λ-routers
are used for the intra-subsystem communication between the cores in the same subsys-
tem. The inter-subsystem communication is conducted through a hierarchical network,
which is constructed by multiple λ-routers and gateways. The gateway is utilized to
bridge two λ-routers in the network hierarchy for wavelength reassignment. Thus, the
available wavelengths can be reused in all the λ-routers in WRH-ONoC. To improve the
network throughput between the λ-routers in different levels, multiple sibling gateways
are used with load-balancing ability. According to the theoretical analysis and
simulation results in Chapter 3, WRH-ONoC can achieve significant improvement in the
communication performance and network scalability compared with the existing ONoC
schemes, e.g., a 46.0% reduction in the zero-load delay and a 72.7% improvement in
the maximal throughput for interconnecting 400 cores.
1.4.2 Multicast Communication
Multicast communication is common in many-core processors for some cooperative
computing applications and cache coherence protocols (Eisley, Peh, and Shang,
2008; Rodrigo, Flich, Duato, and Hummel, 2008). In each multicast communication,
the same data packet needs to be transmitted from the source core to multiple des-
tination cores. An important property of multicast communication is the interactive
multicast, namely the cores in a multicast group can frequently transmit multicast
packets within the same multicast group. For instance, several cores share the same
cache line, while any core can change it and invoke the multicast communication for
data update. Without an efficient multicast scheme, even a small ratio
of multicast communication (1%) can have a significant influence on the overall
communication performance (Ma, Jerger, and Wang, 2012).
To improve the performance of multicast communication for ONoC, DWRMR, a
Dynamical-configured Wavelength-Reused and Multicast Ring based routing architec-
ture, is proposed in this thesis. In the DWRMR architecture, different from the tradi-
tional replication-based multicast routing schemes, the optical multicast ring is dynam-
ically established for each multicast group to reduce the number of multicast packet
copies. The source core can transmit its multicast packets in the established multicast
ring in a manner of single-send-multi-receive using only a single wavelength. The es-
tablished multicast ring can also be reused among the cores within the same multicast
group for the interactive multicast communication by using an optical-token arbitra-
tion scheme. Thus, it can avoid setting up exclusive multicast routing paths for each
core in the same multicast group. In the routing and wavelength allocation scheme,
the same wavelength is reused in as many link-disjoint multicast rings as possible. The
wavelength utilization in the optical interconnects is balanced to accommodate more
multicast rings with fewer wavelengths. According to the simulation results
in Chapter 4, DWRMR is able to reduce the end-to-end packet delay by more than 50%
with slight hardware cost, or to require only half the number of wavelengths to achieve the
same performance, compared with existing multicast routing schemes.
1.4.3 Dark Silicon
Dark silicon is the phenomenon that only a small fraction of the cores in a many-core
processor can operate simultaneously because of the tight power budget. As the number of
cores integrated in a many-core processor keeps increasing, the proportion of dark
silicon will become larger and larger in the whole chip in the future. It was predicted
that at least 21% of the chip area will be dark in the 22 nm process and this proportion
will increase to more than 50% in the following 8 nm process (Esmaeilzadeh, Blem,
Amant, Sankaralingam, and Burger, 2011). In general, dark silicon can introduce the
following three important influences on the communication of many-core processors. (i)
The communication pattern is highly variable since each core can dynamically change
between active and dark. (ii) The average communication distance is increased because
the distribution of active cores should be roughly uniform. (iii) It is not necessary to
reserve the communication resources for all the cores at all times, since the maximal
number of active cores is limited.
To improve the power efficiency and to alleviate the dark silicon problem, Dark-
ONoC, a dynamically configurable dark silicon aware ONoC architecture, is proposed.
In the Dark-ONoC architecture, non-blocking optical routing paths are only estab-
lished between the active cores to improve the bandwidth capacity, and the number of
required wavelengths is decreased through wavelength reuse to save power. To reduce
the number of required wavelengths in Dark-ONoC for different dark silicon patterns,
a hierarchical network architecture is designed, which consists of a centralized routing
and wavelength allocation plane and a configurable optical data plane. The optical
routing and wavelength allocation scheme is formulated as a mapping problem, from
the logical interconnections of active cores to the optical links and wavelengths. A
heuristic solution is designed by combining the wavelength utilization aware routing
and reusable wavelength allocation scheme. It is able to reduce the number of re-
quired wavelengths in the optical routing paths between the active cores in different
dark silicon patterns. The simulation results in Chapter 5 indicate that the number of
wavelengths is reduced by around 15% and the overall power consumption is reduced
by 23.4% compared to existing routing schemes in ONoC.
1.5 Thesis Structure
This thesis focuses on the design of wavelength-reused Optical Network on Chip archi-
tecture and communication scheme for many-core processors. Three wavelength-reused
ONoC architectures are proposed for three important communication problems: WRH-
ONoC for network scalability, DWRMR for multicast communication, and Dark-ONoC
for dark silicon. Chapter 1 summarizes the motivations and main challenging problems.
The rest of this thesis is organized as follows.
Chapter 2 introduces the background of ONoC design methodology, including the
main components in an ONoC architecture, and the typical ONoC architectures and
communication schemes. The advantages and limitations of existing ONoC
architectures are analysed in more detail. Moreover, the design principle of the routing and
wavelength allocation scheme is investigated.
In Chapter 3, a Wavelength-Reused Hierarchical Optical Network on Chip archi-
tecture, WRH-ONoC, is proposed to solve the network scalability problem. In this
chapter, the network architecture and communication scheme of WRH-ONoC are
described in detail, including the routing schemes for both unicast and multicast
communications, the gateway structure, and the wavelength reassignment scheme. The
communication performance, hardware cost, and energy efficiency are analysed through
theoretical modelling and extensive simulations.
In Chapter 4, a Dynamical-configured Wavelength-Reused and Multicast Ring
based routing architecture, named DWRMR, is proposed to solve the multicast
communication problem. In this chapter, an efficient network architecture is proposed
for DWRMR, which consists of an optical control plane and a configurable optical data
plane. The heuristic routing and wavelength allocation scheme is designed, in which
the number of required wavelengths is reduced through wavelength reuse. Extensive
simulations are carried out to evaluate the communication performance of DWRMR
with different multicast traffic patterns.
In Chapter 5, a dynamically configurable ONoC architecture, called Dark-ONoC,
is proposed to solve the dark silicon problem. In this chapter, a hierarchical network
architecture is proposed for Dark-ONoC, which also separates the routing and wavelength
allocation from the optical data transmission in different planes. Then, the routing
and wavelength allocation scheme is formulated as a mapping problem, and a heuris-
tic solution is designed which combines the wavelength utilization aware routing and
reusable wavelength allocation schemes. Extensive simulations are conducted to eval-
uate the communication performance and energy efficiency of Dark-ONoC with dark
silicon patterns both from synthetic distribution and real data traces.
Finally, Chapter 6 concludes the whole thesis, including the main contributions
of the proposed ONoC architectures, their limitations and drawbacks, and some
potential improvements. Moreover, some prospective research problems are
suggested for future work, including the reliability problem, the three-dimensional
ONoC design, and the intra-/inter-chip hybrid optical communication.
Chapter 2
Optical Network on Chip Design
Optical Network on Chip (ONoC) is an optical inter-core communication technology
for many-core processors. It exploits the emerging nanophotonic devices which can
be integrated with traditional electronic circuits in the same chip. Data packets can
be transmitted in an ONoC from the source core to the destination core by using the
modulated optical signals. Several ONoC architectures and communication schemes
have been proposed for many-core processors at present, and wavelength division mul-
tiplexing (WDM) is widely used to realize the wavelength-based routing.
In this chapter, the design methodology of Optical Network on Chip is explored.
First, the fundamental design issues of ONoC are studied, including the basic opti-
cal components, the typical network architectures, and the wavelength-based routing
scheme. From the perspective of network architecture, most existing ONoCs can be
classified into all-optical ONoCs and electronic-optical hybrid ONoCs. Their design
methodologies are explored separately, including the principles of some representative
ONoC architectures, their main advantages, and their limitations. Routing
and wavelength allocation (RWA) is the key problem in the communication scheme of
ONoC, and it determines the optical routing paths between cores and the wavelength
utilization in the optical links. In this chapter, both the fixed wavelength routing and
the dynamical routing and wavelength allocation schemes are studied. Finally, the
main challenges in the design of wavelength-reused ONoC are summarized.
2.1 Overview of ONoC Architecture
ONoC is a promising inter-core communication technology for many-core processors
(Fusella and Cilardo, 2016). At present, the development of nanophotonic technologies
provides all the required silicon-compatible optical devices, such as laser source, silicon
waveguide, optical router, modulator, and photodetector. In an ONoC architecture, the
wavelength-based routing is preferable and data packets are transmitted with multiple
optical wavelengths in parallel, thereby achieving very low communication delay, high
bandwidth capacity, and low power consumption. This section gives an overview on
the design methodology of ONoC from three aspects: basic optical components, typical
network architecture, and wavelength-based routing scheme.
2.1.1 Basic Optical Components
As the communication process of a typical ONoC in Figure 1.2 of Chapter 1 demon-
strates, from the view of network level, the basic optical components in an ONoC
architecture include laser source, silicon waveguide, microring resonator (MR), optical
router, modulator, and photodetector. These optical devices can be integrated with
the electronic devices in different layers and interconnected through the vertical links,
e.g., Through Silicon Via (TSV) (Morris, Kodi, Louri, and Whaley, 2014). In this way,
they are able to be configured and optimized flexibly.
Laser source provides the optical signals with multiple wavelengths on which the
data packets are modulated and carried. It can be implemented by using an off-chip
laser to save the chip area and energy consumption and coupling the optical signals to
all the cores with separate power waveguides (Morris, Jolley, and Kodi, 2014; Koohi and
Hessabi, 2014), or using separate on-chip lasers (e.g., Vertical Cavity Surface Emitting
Lasers, VCSELs) directly for each core to achieve flexible power intensity adjustment
(Chen, Zhang, Contu, Klamkin, Coskun, and Joshi, 2014). Since the off-chip laser
needs to use an optical coupler between the laser source and many-core processor chip,
the output power intensity cannot be adjusted flexibly. Thus, the worst-case optical
power should always be provided. Moreover, the optical coupler can lead to extra
optical power loss, e.g., 3 dB (Morris, Kodi, Louri, and Whaley, 2014). On the other
hand, the main benefits of on-chip laser include the elimination of coupling power loss,
flexible run-time laser power management (Chen and Joshi, 2013), and dynamical laser
source power-gating (Demir and Hardavellas, 2014). Moreover, a laser source sharing
scheme was proposed in (Chen, Zhang, Contu, Klamkin, Coskun, and Joshi, 2014). All
the laser sources with different wavelengths are integrated in a separate layer, and all
the cores can share the lasers to reduce the overall power consumption. The drawbacks
of the on-chip laser are its low power efficiency and serious thermal effects (Kurian, Miller,
Psota, Eastep, Liu, Michel, Kimerling, and Agarwal, 2010).
Due to the low efficiency of existing laser sources (only 10%-30%), the power con-
sumption of laser source can take more than 36 percent of the overall power consump-
tion in an ONoC based on single-write-multi-read crossbar (Pan, Kim, and Memik,
2010). Thus, for the reliable communication in an ONoC architecture, the minimum
laser power for each source core, Plaser (in dBm), should satisfy the following constraint:
Plaser ≥ Ploss + 10log10(Nλ) + Prs, (2.1)
where Ploss is the worst-case optical signal loss in all the optical devices along the
optical routing path between any source and destination cores, Nλ is the number of
wavelengths used for parallel transmission, and Prs is the optical sensitivity of receiver.
Thus, to reduce the power consumption of the laser source, two important issues must
be considered in the design of ONoC, i.e., reducing the optical signal loss along the optical
routing path, and limiting the maximal number of wavelengths in each optical link.
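The constraint in Eq. 2.1 can be evaluated numerically with a short sketch. The loss and sensitivity values used below are hypothetical inputs chosen only to show the trend, and the function names are not from the thesis.

```python
import math


def min_laser_power_dbm(loss_db, n_wavelengths, rx_sensitivity_dbm):
    """Right-hand side of Eq. 2.1: Ploss + 10*log10(N_lambda) + Prs, in dBm."""
    return loss_db + 10 * math.log10(n_wavelengths) + rx_sensitivity_dbm


def dbm_to_mw(power_dbm):
    """Convert a power level from dBm to milliwatts."""
    return 10 ** (power_dbm / 10)


# Hypothetical inputs: 10 dB worst-case loss and a -20 dBm receiver sensitivity.
for n in (1, 8, 64):
    p = min_laser_power_dbm(10.0, n, -20.0)
    print(n, round(p, 2), round(dbm_to_mw(p), 3))
```

Each doubling of the wavelength count adds about 3 dB to the required laser power, and each additional dB of path loss adds 1 dB, which is why both quantities appear as design targets.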
Silicon waveguide is the optical transmission medium in the ONoC architecture.
It is implemented by using two different materials (i.e., Si as core and SiO2 as cladding)
with a high refractive index difference to guide and confine the transmission of optical
signals. Extremely high bandwidth capacity and very low signal attenuation can be
achieved by transmitting optical signals in waveguide with multiple wavelengths. For
instance, a 5 cm long waveguide is demonstrated in (Lee, Chen, Biberman, Liu, Hsieh,
Chou, Dadap, Xia, Green, Sekaric, Vlasov, Osgood, and Bergman, 2008), and it can
achieve an aggregate bandwidth of 1.28 Tbps with 32 wavelength channels, each with
a bandwidth of 40 Gbps, and 0.6 dB/cm propagation loss. However, the waveguide
bending and waveguide crossing can introduce extra power losses, e.g., 0.005 dB for
90° waveguide bending and 0.18 dB for waveguide crossing (Grani, Bartolini, Furdiani,
Ramini, and Bertozzi, 2014). Thus, in the design of ONoC architecture, besides
reducing the length of each optical routing path, the number of waveguide bendings
and crossings should also be decreased.
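The loss figures quoted above can be combined into a simple per-path loss budget. The sketch below uses the cited values as defaults, although they come from two different studies and are mixed here only for illustration; the path geometry in the usage lines is hypothetical.

```python
def path_loss_db(length_cm, n_bends, n_crossings,
                 prop_db_per_cm=0.6, bend_db=0.005, crossing_db=0.18):
    """Sum propagation, bending, and crossing losses along one optical routing path."""
    return length_cm * prop_db_per_cm + n_bends * bend_db + n_crossings * crossing_db


def aggregate_bandwidth_gbps(n_channels, channel_gbps):
    """Total waveguide bandwidth when every wavelength channel runs at the same rate."""
    return n_channels * channel_gbps


# Hypothetical path: 2 cm long, 4 bends, 10 crossings.
print(round(path_loss_db(2.0, 4, 10), 3))
print(aggregate_bandwidth_gbps(32, 40))
```

With these figures a single crossing costs as much as 36 bends, so the crossings tend to dominate the loss of a long routing path.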
Microring resonator (MR) is a wavelength-selective optical device in ONoC. Each
MR is a small cyclic waveguide (e.g., 5 µm radius) with specific resonant wavelengths
in a fixed wavelength interval (i.e., free spectral range), while only one resonant
wavelength is used in each optical interconnect. The resonant wavelength of each MR is
determined by the geometric diameter and the property of materials, and it can be
tuned through the thermal and electronic effects (Dong, Shafiiha, Liao, Liang, Feng,
Feng, Li, Zheng, Krishnamoorthy, and Asghari, 2010). One important property of MR
is that it can extract/inject the optical signal from/to the waveguide that is close to the
MR, when the wavelength of the optical signal matches its resonant wavelength. The
wavelength-selective optical switch can be realized by using MRs and two waveguides.
As shown in Figure 2.1, two basic kinds of optical switching elements (OSE) are
implemented according to different configuration schemes of MRs. In Figure 2.1 (a) and
(b), two waveguides are placed in the orthogonal and parallel positions, and the optical
switching elements use passive MRs, namely the resonant wavelength of each MR
is determined merely by its geometric diameter. Thus, only the optical input signal
with the matched wavelength can be coupled into an MR and be switched,
namely the optical signal is transmitted out through the other waveguide. In Figure
2.1 (c)-(d), the optical switching elements use active MRs, namely the resonant
wavelength of each MR can be tuned by heating or applying a bias voltage. Thus, the
optical input signal with a specific wavelength can be coupled into the MR and be
switched when the MR is tuned on.
1 2
1
1
2
2
on
on off
off
(a)
(b)
(c)
(d)
Figure 2.1: Optical switching elements with passive and active MRs.
(a)-(b) passive MRs have different wavelengths with different diame-
ters; (c)-(d) active MRs are tuned by heating or applying a voltage.
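The switching behaviour of the two kinds of optical switching elements can be summarized in a small behavioural model. It is a deliberately idealized sketch: wavelengths are represented by integer indices, coupling is treated as perfect, and the class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SwitchingElement:
    """One MR placed between two waveguides."""
    resonant_wavelength: int   # index of the wavelength the MR couples
    active: bool = False       # passive MRs are fixed by geometry; active MRs are tunable
    tuned_on: bool = False     # only meaningful for active MRs

    def output(self, wavelength):
        """Return 'switched' if the signal leaves on the other waveguide, else 'through'."""
        if wavelength != self.resonant_wavelength:
            return "through"
        if self.active and not self.tuned_on:
            return "through"
        return "switched"


passive = SwitchingElement(resonant_wavelength=1)
active = SwitchingElement(resonant_wavelength=1, active=True)
print(passive.output(1), passive.output(2))
print(active.output(1))
active.tuned_on = True
print(active.output(1))
```

The passive element always diverts its matched wavelength, whereas the active element diverts it only after being tuned on, which is the distinction that later separates wavelength-based routers from configurable ones.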
Since the resonant wavelength of each MR is highly sensitive to the geometric dimension
and thermal variations, the electro-optic or thermo-optic effects must be used
to tune each MR, such as applying a bias voltage or heating. For instance,
in a fully-connected optical crossbar with 64 wavelengths, the power consumption for
thermal tuning of MRs can reach up to 38% (Pan, Kim, and Memik, 2010). Therefore,
the number of MRs used in the ONoC architecture should be optimized by consid-
ering both the communication performance and the power consumption. Based on
the wavelength-selective property of MR, the optical router/switch, modulator, and
photodetector can be designed for the optical signals with different wavelengths.
Optical router/switch is an optical component which can change the direction of
optical signal transmission by using MRs. Since there are no optical buffering and
processing devices in ONoC, the optical routing path should be configured in the optical
routers in advance. According to different configuration schemes, the optical routers
can be divided into two types: wavelength-based router (O’Connor, 2004; Tan, Yang,
Zhang, Jiang, and Yang, 2012) and configurable router (Shacham, Bergman, and Car-
loni, 2008; Gu, Mo, Xu, and Zhang, 2009). The design principles of two typical optical
routers are illustrated in Figure 2.2. Figure 2.2(a) gives the GWOR router architecture
based on the passive MRs with different wavelengths (Tan, Yang, Zhang, Jiang, and
Yang, 2012). The optical routing path between any pair of input and output ports is
determined by a specific wavelength according to the wavelength routing table. For
instance, the optical signal from West input can transmit to South output and North
output by using wavelength λ2 and wavelength λ1, respectively. Figure 2.2(b) gives the
Cygnus router architecture based on the active MRs which enables configuring different
MRs (Gu, Mo, Xu, and Zhang, 2009). The optical routing path between any pair of in-
put and output ports is determined by the active MR to be configured according to the
MR configuring matrix. For instance, the optical signal from West input can transmit
to South output and North output by configuring MR {3} and MR {7}, respectively.
Wavelength routing matrix of GWOR (rows: input port, columns: output port, entries: wavelength index):
     W   N   E   S
W    -   1   3   2
N    1   -   2   3
E    3   2   -   1
S    2   3   1   -
MR configuring matrix of Cygnus (rows: input port, columns: output port, entries: MR index; C denotes the Inject/Eject port):
     W   N   E   S   C
W    -   7   x   3   4
N    9   -  12   x  13
E    x  16   -  10  11
S    6   x   8   -   5
C    1  15  14   2   -
Figure 2.2: Two typical optical routers for ONoC. (a) Wavelength-
based optical router, GWOR, and its wavelength routing matrix. (b)
Configurable optical router, Cygnus, and its MR configuring matrix.
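The two matrices of Figure 2.2 can be read as lookup tables, as the following sketch shows. The entries are transcribed from the figure as far as they can be recovered; the C (Inject/Eject) row and column of the Cygnus matrix are reconstructed from the extracted figure text and should be treated as an assumption, and the 'x' entries are stored as None without further interpretation.

```python
# Wavelength routing matrix of GWOR: GWOR[input][output] -> wavelength index.
GWOR = {
    "W": {"N": 1, "E": 3, "S": 2},
    "N": {"W": 1, "E": 2, "S": 3},
    "E": {"W": 3, "N": 2, "S": 1},
    "S": {"W": 2, "N": 3, "E": 1},
}

# MR configuring matrix of Cygnus: CYGNUS[input][output] -> MR index.
# 'x' entries of the matrix (no MR listed) are stored as None.
CYGNUS = {
    "W": {"N": 7, "E": None, "S": 3, "C": 4},
    "N": {"W": 9, "E": 12, "S": None, "C": 13},
    "E": {"W": None, "N": 16, "S": 10, "C": 11},
    "S": {"W": 6, "N": None, "E": 8, "C": 5},
    "C": {"W": 1, "N": 15, "E": 14, "S": 2},
}


def gwor_wavelength(inp, out):
    """Wavelength that carries a signal from port inp to port out in the passive router."""
    return GWOR[inp][out]


def cygnus_mr(inp, out):
    """MR to tune on for the inp -> out path in the active router (None for an 'x' entry)."""
    return CYGNUS[inp][out]


print(gwor_wavelength("W", "S"), gwor_wavelength("W", "N"))
print(cygnus_mr("W", "S"), cygnus_mr("W", "N"))
```

In the passive router the sender selects the output port by choosing the wavelength, while in the active router the control logic selects it by tuning one MR, so any wavelength can be used on any path.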
Using the wavelength-based optical router, an ONoC architecture can achieve non-
blocking optical routing with a fixed wavelength allocation for all the optical routing
paths. However, to interconnect multiple cores, the wavelength-based optical router
needs to increase the input and output ports by increasing the number of waveguides
and MRs and using more wavelengths (Tan, Yang, Zhang, Jiang, and Yang, 2012;
O’Connor, 2004). In general, the number of required wavelengths and waveguides
increases linearly with the number of cores, while the number of required MRs increases
quadratically with the number of cores. On the other hand, using the configurable
router, an ONoC architecture can dynamically establish an optical routing path from a
source core to any destination core with any number of wavelengths. However, multiple
optical routing paths may compete for the same wavelength in the intermediate optical
links. Moreover, it can be seen that 16 MRs are used in each optical router in Figure
2.2(b). For an ONoC with N cores using Nλ wavelengths, 16 × N × Nλ MRs are
required in total. Therefore, an ONoC based on the configurable optical
router should use as few wavelengths as possible.
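The MR count stated above can be tabulated directly. The sketch simply evaluates 16 × N × Nλ for a hypothetical 64-core network; the function name is illustrative.

```python
def configurable_mr_count(n_cores, n_wavelengths, mrs_per_router=16):
    """Total MRs when each core has one configurable router replicated per wavelength."""
    return mrs_per_router * n_cores * n_wavelengths


# Hypothetical 64-core network with an increasing number of wavelengths.
for n_lambda in (1, 4, 16):
    print(n_lambda, configurable_mr_count(64, n_lambda))
```

Since the count grows linearly with the number of wavelengths, every wavelength saved removes 16 × N MRs together with their tuning power.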
Modulator converts the electronic signals from the source core into optical signals
with different wavelengths. It can be implemented using the property of active MR.
An MR-based modulator can realize the On-Off Keying (OOK) modulation with multiple
wavelength-specific MRs and a waveguide. For example, when the MR is tuned on,
the optical signal with the matched wavelength is absorbed and a bit of ’0’ is output;
otherwise a bit of ’1’ is output when the MR is tuned off. Since the MR can be tuned
with a very small delay and low power cost (e.g., 0.1 mW), the MR-based modulator
is able to achieve very high bandwidth (e.g., > 10 Gbps).
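The on-off mapping described above can be written as a pair of small functions for a single wavelength channel. This is an idealized sketch that ignores noise and timing; the function names are hypothetical.

```python
def modulate(bits):
    """MR state per bit: tuned on absorbs the carrier (bit 0), tuned off lets it pass (bit 1)."""
    return ["on" if b == 0 else "off" for b in bits]


def detect(mr_states):
    """Recover the bit stream from the presence or absence of light behind the MR."""
    return [0 if s == "on" else 1 for s in mr_states]


states = modulate([1, 0, 1, 1, 0])
print(states)
print(detect(states))
```

With several wavelength-specific MRs on the same waveguide, one such channel exists per wavelength and all of them can be modulated in parallel.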
Photodetector converts the optical signal with a specific wavelength received by
the destination core back into the electronic signal. It can be implemented using
germanium or SiGe material which can absorb the optical signal and then generate
electronic current (Koester, Dehlinger, Schaub, Chu, Ouyang, and Grill, 2005). In
addition, multiple wavelength-specific MRs are used before photodetectors to filter
the optical signals with the correct wavelengths. Each photodetector has a receiving
sensitivity, such as 26 dBm in (Morris, Kodi, Louri, and Whaley, 2014), which indicates
the minimum optical power required for the reliable photodetection.
It is worth noting that even though multiple wavelengths can be used in ONoC, the
maximal number of available wavelengths is limited. Figure 2.3 illustrates the reasons for
the wavelength limitation by considering the crosstalk noise of different optical signals in an
optical interconnect and the maximal acceptable optical input power for waveguide. In
Figure 2.3(a), it can be seen that for the optical interconnect with a fixed free spectral
range, since each MR is not an ideal wavelength filter, the crosstalk noise among optical
signals with different wavelengths is small when there are only 6 wavelengths; while the
crosstalk noise becomes significant with 13 different wavelengths. Thus, to guarantee an
acceptable signal-to-noise ratio (SNR), the number of wavelengths being multiplexed in
an optical interconnect is limited. Figure 2.3(b) illustrates the optical power intensity
that should be injected to the waveguide with different number of wavelengths, where
26
Wavelength spectrum
Wavelength spectrum Number of wavelength
S
tr
en
gt
h
S
tr
en
g
th
P
ow
er
 i
nt
en
si
ty
 (
dB
)
loss rsP P
maximal input power
Maximal number 
of wavelengths
Crosstalk 
noise
Free spectral range
Free spectral range
(a) (b)
Figure 2.3: The reasons of wavelength limitation in ONoC, by con-
sidering (a) the crosstalk noise of different optical signals and (b) the
maximal acceptable optical input power for each waveguide.
Ploss is the worst-case signal loss accumulated in the optical devices along the optical
routing path, Prs is the receiving sensitivity of photodetector. Since the maximal
acceptable power intensity for the waveguide without non-linear effects is restricted,
according to Eq. 2.1, the maximal number of wavelengths is also limited. According to
the experiment result in (Preston, Droz, Levy, and Lipson, 2011), 62 wavelengths can
be used in an optical interconnect in maximum with 19 Gbps bandwidth and -20 dB
noise tolerance. Thus, how to reuse the limited number of wavelengths in the design
of ONoC is a challenging and important research problem.
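To make the power argument concrete, the following is a rough link-budget sketch. It is an illustration under stated assumptions and may differ from the exact form of Eq. 2.1: each wavelength is assumed to need an input power of at least Ploss + Prs (in dBm), and the power injected into the waveguide is assumed to be the sum over all wavelengths. The function name and all numeric values are hypothetical.

```python
import math


def max_wavelengths(p_max_dbm, p_loss_db, p_rs_dbm):
    """Largest number of wavelengths whose summed input power fits under p_max_dbm.

    Assumption (not taken from the thesis): every wavelength is injected at
    exactly p_loss_db + p_rs_dbm, so N wavelengths add 10*log10(N) dB on top.
    """
    per_wavelength_dbm = p_loss_db + p_rs_dbm
    margin_db = p_max_dbm - per_wavelength_dbm
    if margin_db < 0:
        return 0  # even a single wavelength exceeds the waveguide limit
    return math.floor(10 ** (margin_db / 10))


# Hypothetical numbers: 20 dBm waveguide limit, 12 dB worst-case loss,
# -20 dBm receiver sensitivity.
print(max_wavelengths(20, 12, -20))
```

With these assumed values the power margin is 28 dB. A larger worst-case loss or a less sensitive photodetector shrinks the margin and therefore the number of usable wavelengths.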
2.1.2 Typical Network Architecture
Depending on the optical routers and routing schemes used, there are two typical kinds of
ONoC architectures: all-optical ONoC with the wavelength-based optical router (Tan, Yang,
Zhang, Jiang, and Yang, 2012), optical crossbar (O’Connor, 2004; Tan, Yang, Zhang,
Jiang, and Yang, 2012), and optical ring (Vantrease, Schreiber, Monchiero, McLaren,
Jouppi, Fiorentino, Davis, Binkert, Beausoleil, and Ahn, 2008; Le Beux, Trajkovic,
O’Connor, Nicolescu, Bois, and Paulin, 2011); and electronic-optical hybrid ONoC with
an optical data network using configurable optical routers and an extra electronic
control network (Shacham, Bergman, and Carloni, 2008; Gu, Mo, Xu, and Zhang, 2009;
Chan and Bergman, 2012). The design principles of these two ONoC architectures are
shown in Figure 2.4, taking the λ-router architecture (O’Connor, 2004) and the PNoC
architecture (Gu, Mo, Xu, and Zhang, 2009) as examples.

Figure 2.4: Example of two different kinds of ONoC architectures. (a) λ-router is an
all-optical wavelength-routed ONoC with fixed wavelength routing. (b) Optical data
network in an electronic-optical hybrid ONoC with dynamical routing and wavelength
allocation.

Wavelength Routing Matrix of panel (a); each entry is the index i of the wavelength λi
used from input port Ii (row) to output port Oj (column):

      O1  O2  O3  O4  O5  O6  O7  O8
I1     -   5   3   6   2   7   1   8
I2     5   -   4   7   3   8   2   1
I3     3   4   -   5   1   6   8   7
I4     6   7   5   -   4   1   3   2
I5     2   3   1   4   -   5   7   6
I6     7   8   6   1   5   -   4   3
I7     1   2   8   3   7   4   -   5
I8     8   1   7   2   6   3   5   -

Routing and Wavelength Table of panel (b), for six cores (CPU 0 to CPU 5), each attached
to an optical router:

S   D   Path    λ
0   3   0-3     1
1   2   1-2     3
2   4   2-1-4   2
3   5   3-4-5   2
5   1   5-2-1   1

It can be seen from Figure 2.4(a) that a λ-router only utilizes passive optical
switching elements and waveguides to construct an all-optical wavelength-routed
ONoC. The optical switching elements, each with two passive MRs, are deployed in N stages
to fully connect N cores, and N different wavelengths are required to provide
non-blocking communication between the connected cores, as labelled by λ1 to λ8 using
different colors in Figure 2.4(a). In the λ-router architecture, each core is connected to
an input port Ii and an output port Oi, and the optical routing path from any input
port to an output port is determined by the allocated wavelength, as shown in the
wavelength routing matrix in Figure 2.4(a). For example, when the source core connected to
I2 needs to send data packets to different destination cores, such as the cores connected
to O3 and O8, it selects the corresponding wavelengths for these destination cores in
the wavelength matrix, i.e., wavelength λ4 and wavelength λ1, respectively. The main
advantages of the wavelength-routed architecture are that (i) non-blocking inter-core
communication is achieved by allocating different wavelengths; (ii) the network layout
is simple and regular. However, the number of required optical switching elements in
λ-router increases quadratically with the number of interconnected cores,
and N wavelengths are statically allocated to N cores. Thus, the main drawbacks of
all-optical ONoC are poor scalability and low wavelength utilization.
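The fixed wavelength selection in a λ-router amounts to a table lookup. The sketch below encodes the wavelength routing matrix of Figure 2.4(a) and reproduces the I2 example from the text. The helper names are hypothetical, and `has_distinct_rows_and_columns` is only a simple consistency check (no source sends, and no destination receives, two flows on the same wavelength), not a model of the switching fabric itself.

```python
# MATRIX[i][j] is the wavelength index used from input port I(i+1) to
# output port O(j+1); None marks the unused diagonal.
MATRIX = [
    [None, 5, 3, 6, 2, 7, 1, 8],
    [5, None, 4, 7, 3, 8, 2, 1],
    [3, 4, None, 5, 1, 6, 8, 7],
    [6, 7, 5, None, 4, 1, 3, 2],
    [2, 3, 1, 4, None, 5, 7, 6],
    [7, 8, 6, 1, 5, None, 4, 3],
    [1, 2, 8, 3, 7, 4, None, 5],
    [8, 1, 7, 2, 6, 3, 5, None],
]


def wavelength_for(src_port, dst_port):
    """Wavelength index for sending from I<src_port> to O<dst_port> (1-based)."""
    return MATRIX[src_port - 1][dst_port - 1]


def has_distinct_rows_and_columns(matrix):
    """True if no wavelength repeats within any row or any column."""
    n = len(matrix)
    for k in range(n):
        row = [w for w in matrix[k] if w is not None]
        col = [matrix[i][k] for i in range(n) if matrix[i][k] is not None]
        if len(set(row)) != len(row) or len(set(col)) != len(col):
            return False
    return True


print(wavelength_for(2, 3), wavelength_for(2, 8))
print(has_distinct_rows_and_columns(MATRIX))
```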
Figure 2.4(b) shows the optical data network in a typical electronic-optical hybrid
ONoC with the configurable optical routers in mesh topology (Gu, Mo, Xu, and Zhang,
2009). The interconnection of electronic control network and optical data network is
given in Figure 1.3 in Chapter 1. It can be seen that in this electronic-optical hybrid
ONoC, the optical routing path is dynamically established according to the distribution
of source and destination cores. Figure 2.4(b) only illustrates the optical routing paths
and wavelength allocation in the optical data network. Three wavelengths are used
for five different optical routing paths, and the same wavelength can be allocated to
two link-disjoint optical routing paths. For instance, the optical routing path from
core {2} to core {4} and the optical routing path from core {3} to core {5} can use
the same wavelength λ2, even though they pass through the same intermediate optical
router, because they use different optical interconnects. Thus, the main advantages of
this ONoC architecture are that (i) the number of required wavelengths is independent of
the number of cores, which gives higher scalability; (ii) higher wavelength utilization
can be achieved in the routing and wavelength allocation by balancing the wavelength
usage in the optical interconnects, so that the same number of wavelengths can provide
optical routing paths for more cores. However, the routing reservation process in the
electronic control network is time-consuming and leads to extra hardware overheads.
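The link-disjoint reuse rule can be checked mechanically. The sketch below takes the five routing paths and wavelengths listed in the routing and wavelength table of Figure 2.4(b) and reports any optical interconnect on which two paths would carry the same wavelength. Treating each hop as a directed link is an assumption made for this illustration, and `find_conflicts` is a hypothetical helper name.

```python
from collections import defaultdict

# (path through the optical routers, wavelength index), as in Figure 2.4(b)
ROUTES = [
    ((0, 3), 1),
    ((1, 2), 3),
    ((2, 1, 4), 2),
    ((3, 4, 5), 2),
    ((5, 2, 1), 1),
]


def find_conflicts(routes):
    """Map (directed link, wavelength) to the paths sharing it, if more than one."""
    users = defaultdict(list)
    for path, wavelength in routes:
        for link in zip(path, path[1:]):
            users[(link, wavelength)].append(path)
    return {key: paths for key, paths in users.items() if len(paths) > 1}


print(find_conflicts(ROUTES))                  # no link carries a wavelength twice
print(len({wavelength for _, wavelength in ROUTES}))
```

Paths 2-1-4 and 3-4-5 both traverse router 4 on wavelength λ2 but share no link, so the check reports no conflict. Paths 2-1-4 and 5-2-1 share the link from router 2 to router 1 and therefore need different wavelengths.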
2.1.3 Wavelength-Based Routing Scheme
Wavelength-based optical routing is a favourable communication scheme for ONoC
(Tala, Castellari, Balboni, and Bertozzi, 2016). To make full use of the high-speed
optical transmission, there should be no buffering and intermediate processing in the
optical routing path. Thus, the store-and-forward routing scheme, which is based on
distributed buffering and routing computation, is not preferable in ONoC. Moreover,
due to the technological limitations of optical devices, ONoC lacks optical buffering
devices and optical processing logic, and the electronic-to-optical (E-O) and
optical-to-electronic (O-E) conversions can lead to high delay and power consumption.
Therefore, the optical routing paths and wavelengths should be configured between the
source core and the destination core in advance, either by fixed assignment in all-optical
ONoC or by dynamic routing and wavelength allocation in electronic-optical hybrid ONoC.
In an all-optical ONoC based on the fixed wavelength routing, it can be seen from
Figure 2.2(a) and Figure 2.4(a) that non-blocking optical routing paths are permanently
allocated between the source cores and destination cores. Optical switching only
happens when the input optical signal couples to the passive MR with the matched
wavelength. Thus, each optical routing path is exclusively determined by a wavelength, and
the optical signal with a specific wavelength is directly routed by the corresponding
passive MRs of the statically allocated optical routing path.

Table 2.1: Comparisons of Two Typical ONoC Architectures

| ONoC | Design Principle | Advantages | Limitations |
|---|---|---|---|
| All-Optical ONoC | Optical routing path and wavelength allocation are fixed between any two cores in advance. | Non-blocking wavelength-based routing, multicast-enabled, low communication delay and high throughput. | High demand on the number of wavelengths and the optical devices, low wavelength efficiency, low scalability, only preferable for small-scale ONoC. |
| Electronic-Optical Hybrid ONoC | Optical routing path and wavelength are dynamically assigned and configured for each communication hop-by-hop. | Without limitation on the number of wavelengths, high wavelength efficiency through wavelength reuse. | Extra preparation delay to establish the optical routing paths and extra hardware costs in the electronic control network. |

The wavelength-based routing in electronic-optical hybrid ONoC is implemented by
dynamically allocating an optical routing path and a carrier wavelength for a specific
inter-core communication, according to the distribution of source core and destination
core and the available wavelengths in optical interconnects. The wavelength utilization
in optical interconnects varies over time, thus the routing and wavelength allocation
scheme can produce different optical routing paths, i.e., passing different
intermediate optical routers and allocating different wavelengths in the passed optical
interconnects, even for the same source core and destination core. Once the optical
routing path and wavelength are allocated, the active MRs with the allocated wavelength
in the intermediate optical routers are configured using a matrix similar to Figure
2.2(b). For any source core, the optical routing path to a different destination core is
determined by the allocated wavelength and an output port. Note that if the wavelength
usage in optical interconnects can be balanced through the routing and wavelength
allocation scheme, it can achieve higher wavelength utilization with a limited number
of wavelengths than the fixed wavelength routing scheme in all-optical ONoC.
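As an illustration of dynamic wavelength allocation, the sketch below applies a first-fit policy: for a given routing path it picks the lowest-indexed wavelength that is free on every link of the path. First-fit is an assumption chosen for brevity, not the allocation scheme of any particular ONoC discussed here, and the names `first_fit` and `in_use` are hypothetical.

```python
def first_fit(path, in_use, num_wavelengths):
    """Allocate the lowest wavelength free on all links of `path`.

    `in_use` is a set of (directed link, wavelength) pairs and is updated
    in place. Returns the wavelength index, or None if the path is blocked.
    """
    links = list(zip(path, path[1:]))
    for wavelength in range(1, num_wavelengths + 1):
        if all((link, wavelength) not in in_use for link in links):
            in_use.update((link, wavelength) for link in links)
            return wavelength
    return None


in_use = set()
requests = [(0, 3), (1, 2), (2, 1, 4), (3, 4, 5), (5, 2, 1)]
allocation = [first_fit(path, in_use, 3) for path in requests]
print(allocation)
```

For these five paths only the last request shares a link with an earlier one (the link from router 2 to router 1), so it is the only one pushed to a second wavelength. A different request order or policy can yield a different allocation, which is why the result for the same source and destination may change over time.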
The comparison of the two typical ONoCs is summarized in Table 2.1. It can be seen
that the wavelength reuse ability is very important. All-optical ONoC cannot efficiently
reuse the wavelengths, thus it can only connect a small number of cores. In contrast,
electronic-optical hybrid ONoC dynamically allocates optical routing paths and
wavelengths. It is possible to reuse the wavelength in different optical routing paths
and achieve high-performance and scalable communication with a limited number of
wavelengths.
2.2 All-Optical ONoC
As introduced in Section 2.1.2, all-optical ONoC utilizes only optical devices to
construct an optical network architecture and achieves non-blocking communication through
dedicated wavelength allocation. Two typical optical routers for the all-optical ONoC
architecture, GWOR and λ-router, are illustrated in Figure 2.2(a) and Figure 2.4(a), and the
advantages and limitations of all-optical ONoC are analysed in Section 2.1.2. Even
though the all-optical ONoC architecture has the drawback of poor scalability, due to
the wavelength limitation and the quadratic increase of MRs, it is widely used, especially
as the global optical network in some hierarchical ONoC architectures. In this section,
the design methodologies of two important types of all-optical ONoC architectures are
introduced, namely ring-based ONoC and crossbar-based ONoC.
2.2.1 Ring-Based ONoC
Ring-based ONoC exploits the cyclic waveguides to interconnect all the cores in a
many-core processor, and different wavelengths are allocated for the inter-core com-
munication (Le Beux, Trajkovic, O’Connor, Nicolescu, Bois, and Paulin, 2011; Luo,
Killian, Le Beux, Chillet, Li, O’Connor, and Sentieys, 2015; Wang, Gu, Yang, Wang,
and Hao, 2016). A typical ring-based ONoC is the ORNoC architecture (Le Beux,
Trajkovic, O’Connor, Nicolescu, Bois, and Paulin, 2011), and its design principle is
illustrated in Figure 2.5 for interconnecting six cores, i.e., core {A} to core {F}. It
can be seen from Figure 2.5(a) that the physical layout of ORNoC is simple with two
cyclic optical interconnects in different directions to interconnect all the cores. To
achieve non-blocking communication between each pair of cores in ORNoC, at least five
wavelengths are required, and the wavelength allocations in the clockwise and
counter-clockwise optical interconnects are illustrated in Figure 2.5(b)
and Figure 2.5(c), respectively. Note that each cycle in Figure 2.5(b)-(c) stands for a
different wavelength, and each directed link stands for an optical routing path from
a source core to a destination core. From this wavelength allocation, the wavelength
routing matrix can be derived for each core as shown in Figure 2.5(d), where
the number i represents the wavelength λi, and ’+’ and ’−’ indicate the clockwise
and counter-clockwise optical interconnects, respectively. To transmit/receive optical
signals to/from different cores with different wavelengths at the same time, the optical
network interface (ONI) for each core needs to have multiple MR-based modulators
and photodetectors, which correspond to the wavelength routing matrix.
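The wavelength routing matrix of Figure 2.5(d) can be checked for wavelength reuse conflicts. The sketch below assumes the cores sit on the ring in the clockwise order A, B, C, D, E, F (consistent with the A to D example, which passes B and C), encodes each matrix entry as a signed integer (positive for clockwise, negative for counter-clockwise), and verifies that no ring segment carries the same wavelength twice in the same direction. The helper names are hypothetical.

```python
CORES = "ABCDEF"  # assumed clockwise order on the ring

# MATRIX[src][k] is the signed wavelength index towards CORES[k]:
# +i = wavelength i on the clockwise ring, -i = counter-clockwise ring.
MATRIX = {
    "A": [None, +1, +2, +4, -2, -1],
    "B": [-1, None, +1, +3, -4, -3],
    "C": [-2, -1, None, +1, +2, +5],
    "D": [+4, -3, -1, None, +1, +3],
    "E": [+2, -4, -2, -1, None, +1],
    "F": [+1, +3, +5, -3, -1, None],
}


def ring_links(src, dst, clockwise):
    """Directed ring segments crossed when travelling from src to dst."""
    step = 1 if clockwise else -1
    i, links = CORES.index(src), []
    while CORES[i] != dst:
        j = (i + step) % len(CORES)
        links.append((CORES[i], CORES[j]))
        i = j
    return links


def is_conflict_free(matrix):
    """True if no (segment, signed wavelength) pair is used by two paths."""
    used = set()
    for src, row in matrix.items():
        for dst, entry in zip(CORES, row):
            if entry is None:
                continue
            for link in ring_links(src, dst, clockwise=entry > 0):
                if (link, entry) in used:
                    return False
                used.add((link, entry))
    return True


print(ring_links("A", "D", clockwise=True))
print(is_conflict_free(MATRIX))
```

Under these assumptions all 30 source-destination pairs fit on five wavelength indices because paths sharing a wavelength occupy disjoint ring segments.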

Figure 2.5: ORNoC is a typical ring-based ONoC. (a) Logical interconnection of an
ORNoC with 6 cores; (b) wavelength allocation for clockwise optical interconnect;
(c) wavelength allocation for counter-clockwise optical interconnect; (d) wavelength
routing matrix.

Wavelength routing matrix of panel (d) (rows: source core; columns: destination core):

      A    B    C    D    E    F
A     -   +1   +2   +4   -2   -1
B    -1    -   +1   +3   -4   -3
C    -2   -1    -   +1   +2   +5
D    +4   -3   -1    -   +1   +3
E    +2   -4   -2   -1    -   +1
F    +1   +3   +5   -3   -1    -

The ring-based ONoC has the following advantages. (i) No optical router is required
in the optical routing path. From a specific source core to any destination core, the
optical routing is statically determined with no need for switching, and the wavelength
is exclusively allocated in the optical interconnect. For example, the optical routing
path from core {A} to core {D} is allocated in the clockwise optical interconnect by
using wavelength λ4. Even though it passes core {B} and core {C}, the optical signal
cannot be received there, since there is no MR-based photodetector with wavelength λ4 in
their optical network interfaces. (ii) According to the design in (Le Beux, Trajkovic,
O’Connor, Nicolescu, Bois, and Paulin, 2011; Luo, Killian, Le Beux, Chillet, Li,
O’Connor, and Sentieys, 2015), the optical routing path and wavelength allocation can be
customized considering the real communication requirement. For example, if core {B} and
core {D} have no communication, the allocated wavelength can be released and used in
other optical routing paths to reduce the number of wavelengths. (iii) Since the
ring-based ONoC can eliminate the waveguide crossing and reduce the waveguide bending
in optical routing paths compared with the multi-stage ONoC, such as λ-router, it is
able to reduce the optical signal loss and power consumption of the laser source. This has
been quantitatively verified in (Ramini, Grani, Bartolini, and Bertozzi, 2013). (iv)
To interconnect more cores with a limited number of wavelengths, the number of cyclic
optical interconnects can be increased. For example, connecting 64 cores with 64
wavelengths requires 28 waveguides in an ORNoC.
The limitation of ring-based ONoC is that the hardware cost increases significantly
as the number of cores increases, when the number of available wavelengths is limited. The
number of required MR-based modulators and photodetectors increases linearly with
the number of cores, while the number of required waveguides increases quadratically.
The worst-case optical signal loss in the ring-based ONoC will be very large when the
cyclic optical interconnect connects too many cores. In ORNoC (Le Beux, Trajkovic,
O’Connor, Nicolescu, Bois, and Paulin, 2011), the solution to this limitation is to design
a hierarchical network using ring-based ONoC for global communication among
multiple clusters of cores, instead of the core-to-core communication. QuT architec-
ture is an extension for ring-based ONoC with some extra links between non-adjacent
cores to reduce the network diameter and reduce the number of required wavelengths
(Hamedani, Jerger, and Hessabi, 2014). However, it requires multiple optical switches
to bypass the optical signals into the increased optical links, and needs an optical con-
trol network to prevent optical signals from different source cores being transmitted to
the same destination core with the same wavelength at the same time.
2.2.2 Crossbar-Based ONoC
Optical crossbar can provide passive optical communication in the shared optical in-
terconnects between the source and destination cores by using different wavelengths
(Batten, Joshi, Stojanovic, and Asanovic, 2012). There are three typical kinds of
optical crossbars in ONoC: Multi-Write-Single-Read (MWSR) crossbar, Single-Write-
Multi-Read (SWMR) crossbar, and Multi-Write-Multi-Read (MWMR) crossbar (Xu,
Yang, and Melhem, 2012a). In the example shown in Figure 2.6, there are four cores,
each connected to an input port Ii and an output port Oi, and four different
wavelengths are required for the optical communication.
In a Multi-Write-Single-Read crossbar, each core can receive optical signals from
all the other cores with a dedicated wavelength, and all the cores compete for one
wavelength to send optical packets to the same core. For instance, in Figure 2.6(a),
if core {1} needs to communicate with core {3}, it has to compete with the other
cores to use wavelength λ3, which is determined by the destination core {3}. Thus, an
optical arbitration scheme is required to allocate the wavelength for transmitting, and
the active MRs should be used in MWSR crossbar. Note that if multiple wavelengths
are free in the MWSR crossbar, a source core can multicast optical packets to multi-
ple destination cores by applying for multiple wavelengths at the same time, namely
MWSR crossbar has the multicast capability (i.e., one-to-many communication).
In a Single-Write-Multi-Read crossbar, each core can send optical packets to all
the other cores with a dedicated wavelength, and each core can receive multiple op-
tical packets with different wavelengths at the same time. For instance, in Figure
2.6(b), if core {1} needs to communicate with core {3}, it can directly send the optical
packets using wavelength λ1, which is determined by the source core {1}; meanwhile,
core {3} can receive multiple optical packets from different source cores with different
wavelengths at the same time. In a SWMR crossbar, optical packets from the same
source core to different destination cores need to be queued in the network interface to
use the same wavelength, while optical packets with different wavelengths to the same
destination core can be received at the same time (i.e., many-to-one communication).

Figure 2.6: Three typical optical crossbar architectures. (a) Multi-Write-Single-Read
(MWSR), (b) Single-Write-Multi-Read (SWMR), (c) Multi-Write-Multi-Read (MWMR).
[Each panel shows four cores with input ports I1 to I4 and output ports O1 to O4 sharing
wavelengths λ1 to λ4.]

Multi-Write-Multi-Read crossbar is a combination of MWSR and SWMR. Each core
can send optical packets to different destination cores and receive optical packets from
different source cores by using different wavelengths. On the source side, all the cores
need to compete for one wavelength through an arbitration scheme. On the destination
side, each core can receive multiple optical packets at the same time through different
wavelengths. For instance, in Figure 2.6(c), core {1} can use two different wavelengths
to send optical packets to core {2} and core {3} after the arbitration, and core {3}
can receive optical packets from core {1} and core {2} with two different wavelengths
at the same time. Thus, MWMR has the advantages of both MWSR and SWMR,
namely the multicast capability and simultaneously receiving multiple optical packets
(i.e., many-to-many communication). Moreover, the source core can apply for multiple
wavelengths to communicate with the same destination core, with a lower probability
of wavelength congestion. However, since MWMR needs to arbitrate for multiple
wavelengths, the complexity of arbitration is much higher than that of MWSR crossbar.
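The three crossbar types differ mainly in how the carrier wavelength of a packet is chosen. A minimal sketch of that rule follows; the function names are hypothetical, and for MWMR the wavelength is simply passed in as the outcome of an arbitration step that is not modelled here.

```python
def carrier_wavelength(kind, src, dst, granted=None):
    """Wavelength index used to send a packet from core `src` to core `dst`."""
    if kind == "MWSR":
        return dst  # fixed by the destination; writers compete for it
    if kind == "SWMR":
        return src  # fixed by the source; packets queue at the sender
    if kind == "MWMR":
        if granted is None:
            raise ValueError("MWMR needs a wavelength granted by arbitration")
        return granted
    raise ValueError(f"unknown crossbar kind: {kind}")


def needs_arbitration(kind):
    """Writers share wavelengths in MWSR and MWMR, so they must arbitrate."""
    return kind in ("MWSR", "MWMR")


print(carrier_wavelength("MWSR", src=1, dst=3))
print(carrier_wavelength("SWMR", src=1, dst=3))
print(carrier_wavelength("MWMR", src=1, dst=3, granted=2))
```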

Table 2.2: Comparisons of Different Optical Crossbars

| Crossbar | Properties | Advantages | Limitations |
|---|---|---|---|
| MWSR | A specific wavelength is fixedly assigned for receiving. | The source core can send optical packets to multiple destination cores using different wavelengths, i.e., one-to-many communication. | The destination core can only receive optical packets from one core, and the arbitration scheme is required to solve the conflict. |
| SWMR | A specific wavelength is fixedly assigned for sending. | The destination core can receive optical packets from multiple source cores using different wavelengths, i.e., many-to-one communication. | The source core can only send optical packets to one core, and the packets need to be queued in the source side. |
| MWMR | Multiple wavelengths can be used for sending and receiving. | The source core can apply for multiple wavelengths to send optical packets to multiple cores, and the destination core can receive optical packets from multiple source cores using different wavelengths. | The arbitration scheme is more complicated and more optical devices are required for sending and receiving with multiple wavelengths. |

The comparison of different optical crossbars is given in Table 2.2. It can be seen
that MWSR crossbar has the multicast ability, transmitting multicast packets to
all the destination cores by applying for multiple wavelengths. However, it needs to
use an extra arbitration scheme to allocate each wavelength, for example, the
optical token arbitration scheme in which an optical token represents the ownership of a
wavelength and is passed in the optical interconnect (Vantrease, Schreiber, Monchiero,
McLaren, Jouppi, Fiorentino, Davis, Binkert, Beausoleil, and Ahn, 2008). SWMR
crossbar can receive multiple optical packets with different wavelengths at the same
time. However, in the source core, the optical packets to different destination cores
share the same wavelength, thus they need to be queued up and transmitted one
by one. MWMR crossbar has the advantages of multicast ability in the source core and
the parallel receiving ability in the destination core. In terms of the hardware cost, all
three kinds of crossbars need to use N wavelengths to interconnect N cores. In total,
MWSR requires N^2 MR-based modulators and N MR-based photodetectors, SWMR
requires N MR-based modulators and N^2 MR-based photodetectors, and MWMR
requires N^2 MR-based modulators and N^2 MR-based photodetectors.
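The hardware counts stated above can be tabulated for a given number of cores. This sketch only restates the totals from the text; the function name and the dictionary layout are hypothetical.

```python
def crossbar_cost(kind, n):
    """Total wavelengths, MR-based modulators and photodetectors for n cores."""
    totals = {
        "MWSR": (n * n, n),      # many writers, one reader per wavelength
        "SWMR": (n, n * n),      # one writer, many readers per wavelength
        "MWMR": (n * n, n * n),  # many writers and many readers
    }
    modulators, photodetectors = totals[kind]
    return {
        "wavelengths": n,
        "modulators": modulators,
        "photodetectors": photodetectors,
    }


for kind in ("MWSR", "SWMR", "MWMR"):
    print(kind, crossbar_cost(kind, 64))
```

For 64 cores or clusters, the quadratic terms already amount to 4096 devices, which illustrates why these crossbars are usually confined to small groups of cores or to inter-cluster communication.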
Several ONoCs are designed on the basis of the crossbar-based optical architec-
tures. For instance, Corona architecture utilizes a MWSR crossbar interconnecting
64 clusters of cores, which has the multicast ability for cache coherence (Vantrease,
Schreiber, Monchiero, McLaren, Jouppi, Fiorentino, Davis, Binkert, Beausoleil, and
Ahn, 2008). OCMP architecture divides the whole optical network into several small
MWSR crossbars deployed in different layers to reduce the number of required MRs
and wavelengths (Morris, Kodi, Louri, and Whaley, 2014). ATAC architecture uses a
SWMR crossbar for 64 clusters of cores, and broadcast communication is achieved for
each cluster only filtering 1/64 of optical signals from a specific source cluster (Kurian,
Miller, Psota, Eastep, Liu, Michel, Kimerling, and Agarwal, 2010), while the Firefly
architecture reserves all of the optical signal to a specific destination core (Pan, Kumar,
Kim, Memik, Zhang, and Choudhary, 2009). An ONoC architecture based on MWMR
crossbar is proposed in (Xu, Yang, and Melhem, 2012a). Each core can send/receive
optical signals by using all the wavelengths, and the channel borrowing can be realized
by dynamically allocating the same wavelength to multiple communications if they
share no optical link in the optical crossbar. LumiNOC architecture utilizes MWMR
crossbars to interconnect the cores in the same row/column, where each MWMR crossbar
uses multiple waveguides and can achieve non-blocking communication between
the connected cores by using different wavelengths (Li, Browning, Gratz, and Palermo,
2014). Generally, these crossbar-based architectures cannot interconnect all the cores
in a large many-core processor, due to the wavelength limitation and the large number
of required MRs. Thus, they are usually used for a small group of cores or the global
communication between multiple clusters of cores with sufficient wavelengths.
2.2.3 Advantages and Disadvantages
All-optical ONoC architectures are constructed by waveguides and MRs with multiple
wavelengths. The multi-stage based ONoC and the ring-based ONoC are designed on
the basis of fixed optical routing and wavelength allocation. They can be considered as
a Single-Write-Single-Read crossbar, because each pair of source and destination cores
fixedly uses a specific optical routing path and wavelength. The crossbar-based ONoC
allows sharing the optical interconnects and wavelengths among the cores through
optical arbitration (Ramini, Tala, and Bertozzi, 2014). In the ring-based ONoC and
crossbar-based ONoC schemes, waveguides are deployed in parallel to each other which
can eliminate the waveguide crossing to reduce the optical signal loss, compared with
the multi-stage ONoC scheme. Generally, these all-optical ONoCs have their own
advantages and disadvantages as listed in Table 2.3, and they are only preferable for
connecting a small number of cores or clusters with sufficient available wavelengths.
Ring-based ONoC has the following advantages: (i) non-blocking optical communication
by using different wavelengths; (ii) with a limited number of wavelengths,
it can be extended by increasing the number of waveguides; (iii) only passive MRs
are used, which reduces the MR tuning power; (iv) the same wavelength can be reused in
multiple link-disjoint optical routing paths. The main drawbacks of ring-based ONoC
include: (i) the wavelength allocation scheme is very complicated; (ii) the number of
required waveguides increases very fast as the number of cores increases, especially with
a small number of available wavelengths in optical interconnects.
Crossbar-based ONoC has the following advantages: (i) it can realize multicast
communication, such as by transmitting multicast packets with different wavelengths
in MWSR crossbar, and by filtering a part of optical signals in SWMR crossbar; (ii)
the optical interconnect is shared by all the cores through optical arbitration instead
of being exclusively allocated. However, the main limitation is that an extra optical
arbitration scheme is required for MWSR and MWMR to resolve the competition among
multiple cores to send on the same wavelength. In addition, the number of cores that can be
interconnected is limited due to the constraint on the number of available wavelengths
and the high tuning power consumption for the large number of active MRs.

Table 2.3: Comparisons of Different All-Optical ONoC Schemes

| | Multi-Stage Based | Ring Based | Crossbar Based |
|---|---|---|---|
| Interconnection | Multi-stage network constructed by optical switching units, e.g., N stages for N cores in λ-router, a large number of waveguide crossings for optical switching. | Two optical cyclic interconnects in different directions, without the requirement of optical router/switch, no waveguide crossing. | One or multiple optical interconnects, without the requirement of optical router/switch, no waveguide crossing. |
| Routing and Wavelength Allocation | Fixed optical routing and wavelength allocation, non-blocking, requires N wavelengths for N cores. | Fixed optical routing and wavelength allocation, non-blocking, requires fewer wavelengths, able to reduce the required number of wavelengths by increasing waveguides. | Fixed optical routing and shared wavelength, requires arbitration in MWSR and MWMR and buffering in SWMR, requires N wavelengths for N cores. |
| Hardware Cost | For N cores, each core requires N − 1 E-O and N − 1 O-E converters with different wavelengths, N(N − 2) MRs are required for optical switching. | For N cores, each core requires N − 1 E-O and N − 1 O-E converters with different wavelengths. | For N cores, each core requires N − 1 E-O and 1 O-E converters for MWSR, 1 E-O and N − 1 O-E converters for SWMR, and N − 1 E-O and N − 1 O-E converters for MWMR with different wavelengths. |
| Advantages | Non-blocking wavelength routing, simple wavelength allocation and regular layouts. | Non-blocking, no waveguide crossing, wavelength reuse ability, the trade-off between the number of wavelengths and waveguides. | No waveguide crossing, wavelength reuse ability, low hardware cost. |
| Limitations | A large number of waveguide crossings, high hardware cost, limited by the number of wavelengths. | Complicated wavelength allocation algorithm. | Requires arbitration scheme for MWSR and MWMR and queueing for SWMR, limited by the number of wavelengths. |

2.3 Electronic-Optical Hybrid ONoC
To scale up the network size of ONoC, it is necessary to combine the advantages of
electronic interconnects (such as buffering and pipelined routing computation) and optical
interconnects (such as low communication delay and high bandwidth capacity). Thus, the
electronic-optical hybrid ONoC consists of both electronic devices and optical devices.
According to different network architectures, existing electronic-optical hybrid ONoC
schemes can be divided into two categories: Path-Reserved ONoC and Hierarchical
ONoC. Generally, the path-reserved ONoC employs an electronic control network to
assist establishing the optical routing paths dynamically; while the hierarchical ONoC
employs some electronic local networks which are efficient for short-distance communi-
cation within a cluster of cores, and provides optical global communication for a limited
number of clusters with sufficient wavelengths. In this section, the design principles of
two kinds of ONoC schemes are discussed.
2.3.1 Path-Reserved ONoC
The path-reserved ONoC architecture generally consists of an electronic control net-
work and an optical data network, which have the identical network topology, such
as mesh and torus. It exploits a dynamically configurable optical data network with
the configurable optical routers as shown in Figure 2.2(b), and utilizes an electronic
control network to establish optical routing paths and allocate wavelengths for different
source and destination cores at runtime. Even though the path-reserved ONoC can
be implemented in different network topologies, such as torus (Shacham, Bergman,
and Carloni, 2008; Chan and Bergman, 2012), mesh (Gu, Mo, Xu, and Zhang, 2009;
Xie, Nikdast, Xu, Wu, Zhang, Ye, Wang, Wang, and Liu, 2013), fat-tree (Gu, Xu,
and Zhang, 2009; Wang, Xu, Wu, Ye, Zhang, Nikdast, Wang, and Wang, 2014), and
3D-mesh (Ye, Xu, Huang, Wu, Zhang, Wang, Nikdast, Wang, Liu, and Wang, 2013;
Zhao, Gong, Tan, and Gu, 2016), without loss of generality, only the widely used
mesh-based ONoC architecture is discussed in this thesis. Generally, mesh topology is
a direct and efficient choice for many-core processors with a large number of standard
cores, as in TILE-Gx72 (Mellanox, 2013), Intel Teraflops (Intel, 2007), and Adapteva
Epiphany-V (Olofsson, 2016), considering both the layout of cores and the complexity
of the routing algorithm. The example of path-reserved ONoC architecture in the mesh
topology has been given in Figure 1.3 in Chapter 1.
In the path-reserved ONoC architecture, each core is connected with an electronic
router in the electronic control network and an optical router in the optical data
network. The electronic control network employs packet switching for each control
packet and uses the electronic routers to dynamically establish the optical routing
paths from the source cores to the destination cores. Since the optical routing path
is reserved by the control packet in every electronic router in a hop-by-hop manner,
the control packet experiences buffering, routing computation, and next-hop link
arbitration in
every electronic router along the routing path. The optical data network employs the
circuit-switching, since the optical routing path is exclusively reserved from the source
core to the destination core. Each optical router in the optical data network is con-
figurable by using active MRs, as shown in Figure 2.2(b), and it is configured by the
electronic router when the corresponding optical interconnect is reserved. The main
advantages of path-reserved ONoC include the following. (i) It has no wavelength
limitation in the optical data network, and it can also be designed for the wideband
optical signal without using wavelength multiplexing, namely only one wavelength
channel. (ii) It can provide quality-of-service guaranteed optical communication after
the optical routing path is established, which is preferable for the transmission of criti-
cal messages and massive data blocks (Shacham, Bergman, and Carloni, 2008). (iii) It
can employ the adaptive routing schemes in the electronic control network to balance
the link/wavelength utilization in the optical data network by detouring the overused
optical links (Fusella, Flich, Cilardo, and Mazzeo, 2015).
However, the hop-by-hop optical routing path reservation in the electronic control
network can lead to high preparation delay and power consumption. The exclusive
reservation of optical routing paths can also cause severe congestion in the electronic
routers, especially when the network size is large. When most of the data packets
are small, e.g., a single cache line with 64-128 bytes, the utilization of each optical
routing path is low, while the optical routing paths need to be frequently established
between different cores for different inter-core communications. Thus, the time delay
and power consumption to establish optical routing paths in the electronic control
network might be much higher than the transmission delay and power consumption in
the optical data network. Moreover, the extra electronic control network can lead to
significant hardware cost and power consumption, especially because the electronic
routers need a large buffer space to hold path reservations that are blocked.
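The trade-off between path setup and optical transmission described above can be illustrated with a minimal sketch. All numeric values below (hop count, per-hop electronic router delay, optical bandwidth) are hypothetical and chosen only for illustration; they are not measurements from this thesis or from the cited works.

```python
def path_setup_delay_ns(hops, per_hop_router_ns):
    """Delay for a control packet to reserve an optical path hop by hop."""
    return hops * per_hop_router_ns


def optical_transfer_ns(payload_bytes, bandwidth_gbps):
    """Serialization time of a payload on the reserved optical path."""
    return payload_bytes * 8 / bandwidth_gbps  # bits / (Gbit/s) gives ns


# Hypothetical parameters (assumptions, not taken from the thesis).
HOPS = 8
PER_HOP_ROUTER_NS = 2.0
BANDWIDTH_GBPS = 40.0

setup = path_setup_delay_ns(HOPS, PER_HOP_ROUTER_NS)
for payload in (64, 128, 4096):
    transfer = optical_transfer_ns(payload, BANDWIDTH_GBPS)
    share = setup / (setup + transfer)
    print(f"{payload:5d} B: setup {setup:.1f} ns, "
          f"transfer {transfer:.1f} ns, setup share {share:.0%}")
```

Under these assumed values, the setup delay exceeds the transfer time of a 64-byte cache line, while it becomes negligible for a 4096-byte block, which is the qualitative behaviour the paragraph describes.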
Several studies have sought to alleviate the limitations of the path-reserved
ONoC scheme. First, wavelength multiplexing can be used in the established optical
routing paths either to improve the bandwidth capacity, similar to multiple bits of
data being transmitted over parallel electronic wires, or to perform wavelength routing
that improves the routing diversity, similar to multiple virtual channels in ENoC (Chan
and Bergman, 2012). If wavelength routing is used in the path-reserved ONoC,
non-blocking communication can be realized by using different wavelengths. However,
each optical router then needs multiple sets of active MRs with different resonant
wavelengths. Second, two path-reserved ONoC architectures are proposed to improve the
efficiency of optical interconnects by solving the competition on the shared optical inter-
connects among different communications through time division multiplexing (TDM)
(Hendry, Chan, Kamil, Oliker, Shalf, Carloni, and Bergman, 2010; Zhang, Gu, Yang,
Chen, and Hao, 2014). In PhotonicTDM architecture (Hendry, Chan, Kamil, Oliker,
Shalf, Carloni, and Bergman, 2010), all the possible communications which share the
optical interconnects are scheduled by using round-robin arbitration. In each time slot,
a specific optical routing path is configured to utilize the shared optical interconnect,
thus the congestion in the hop-by-hop optical routing path reservation is solved; while
in the Flyover architecture (Zhang, Gu, Yang, Chen, and Hao, 2014), it further utilizes
WDM to improve the bandwidth capacity by using multiple wavelengths in the optical
routing path in each time slot. However, the drawback of TDM-based arbitration is
that it is only practical with a small number of cores; otherwise, a communication must
wait several time slots for its turn on the shared optical interconnects, and the
continuous optical transmission of massive data packets in an established optical
routing path can be interrupted. Frequent changes of optical routing paths can also
lead to high power consumption.
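As a rough illustration of round-robin time-slot arbitration on one shared optical interconnect, consider the following sketch. The path names and the slot model are hypothetical simplifications; they do not reproduce the actual PhotonicTDM or Flyover schedulers.

```python
def tdm_slot_owner(slot, contenders):
    """Return the contender that owns the shared interconnect in a slot."""
    return contenders[slot % len(contenders)]


def worst_case_wait_slots(num_contenders):
    """A path that has just used its slot waits for all other contenders."""
    return num_contenders - 1


# Hypothetical optical routing paths that share one interconnect.
paths = ["A->C", "B->D", "E->F"]
schedule = [tdm_slot_owner(s, paths) for s in range(6)]
print(schedule)
print("worst-case wait:", worst_case_wait_slots(len(paths)), "slots")
```

In this simplified model the worst-case wait grows linearly with the number of contending paths, which is consistent with the remark that TDM-based arbitration suits only a small number of cores.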
One path-reserved ONoC with an optical control network is proposed in (Grani and
Bartolini, 2014). It employs a wavelength-routed non-blocking optical ring to trans-
mit control packets and configure optical routing paths in a torus-based optical data
network. Instead of using the distributed routing computation and path reservation in
the electronic control network, it utilizes a central arbitrator to compute the optical
routing path for each communication. Thus, the source core can directly send a rout-
ing request to the central arbitrator using the optical ring, and the central arbitrator
can configure all the optical routers after the routing computation in the allocated
optical routing path through the optical ring. CSPIN architecture employs the same
idea for mesh-based ONoC, but the detailed design of optical control network is not
given (Zhang, Ma, Yu, Yang, Liu, Yang, and Jiang, 2015). Compared with the
traditional path-reserved ONoC with an electronic control network, the optical control
network with a central arbitrator can greatly reduce the time delay and power
consumption for optical routing path establishment, and the possibility of congestion
is decreased since centralized routing computation is used.
Besides the mesh topology, other topologies can also be used in path-reserved
ONoCs, such as the torus topology (Shacham, Bergman, and Carloni, 2008; Chan
and Bergman, 2012). The torus-based ONoC architecture is an extension of the
mesh-based ONoC, with extra optical interconnects between the two nodes on opposite
sides of the same row/column. Thus, it reduces the average communication distance and
increases the routing diversity compared with the mesh-based ONoC. Similar to the
path-reserved ONoC in mesh topology, the torus-based ONoC also requires an extra
electronic control network to establish the optical routing paths (Chan and Bergman, 2012).
Thus, torus-based ONoC can lead to significant hardware costs (4N extra optical links
and electronic links for an N×N network). Moreover, with the different lengths of
optical/electronic links in torus-based ONoC, the routing computation must consider
the difference of transmission delays by using different links. 3D X-Torus architecture
combines the ideas of 3D ONoC and torus (Hou, Guo, Cai, and Zhu, 2014) by di-
viding a large network into several identical small torus ONoCs and integrating them
vertically. This design can reduce the average communication distance and increase
the routing diversity by using the vertical optical links. However, the torus topology
and 3D integration can lead to significant complexity on the floorplan of optical links.
According to the analysis of (Feng, Ye, and Xu, 2013; Xie, Xu, Zhao, Huang, Song,
and Guo, 2015), the waveguide crossings can result in serious crosstalk noise in ONoC
if the floorplan of optical links (i.e., waveguides) is not optimized. The optical links
between nodes on opposite sides introduce a large number of waveguide crossings in
the torus-based ONoC. This is a significant drawback for torus-based ONoC compared
with the mesh-based ONoC. To reduce the number of waveguide crossings, the STorus
architecture is proposed by dividing the whole network into two subnetworks (Li, Gu,
Chen, Song, and Hao, 2016). Each subnetwork uses a modified torus topology, with
optical links in the same diagonal direction without waveguide crossing in the center
of the network, and two subnetworks are deployed in different layers. This design can
greatly reduce the waveguide crossings in the center of the network, while the routing
algorithm is quite complex due to the diagonal interconnection and there are still a lot
of waveguide crossings in the border of the network. Moreover, according to the
mathematical analysis model of (Xie, Xu, Zhao, Huang, Song, and Guo, 2015), a waveguide
crossing angle of 60° or 120° is a good choice to reduce the crosstalk noise of each
crossing point. In summary, the torus-based ONoC has the advantages of short
communication distance and high routing diversity, while it also has significant
limitations, including: (i) high hardware costs, with a large number of extra
optical/electronic links; (ii) the increased number of waveguide crossings, especially
at the border of the network, which can lead to high crosstalk noise and increase the
complexity of physical implementation; (iii) the different lengths of
optical/electronic links, due to the long distance between the nodes on opposite sides
of the same row/column; (iv) a complicated routing algorithm, especially for different
communication distances in the electronic network.
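The claim that a torus shortens the average communication distance relative to a mesh can be checked with a small sketch. It counts hops only, over all ordered source/destination pairs of an n×n network with shortest-path routing; it ignores the differing physical link lengths discussed above. The function names are illustrative.

```python
from itertools import product


def mesh_hops(a, b, n):
    """Shortest hop count between nodes a and b in an n x n mesh."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def torus_hops(a, b, n):
    """Shortest hop count in an n x n torus (wraparound links allowed)."""
    dx, dy = abs(a[0] - b[0]), abs(a[1] - b[1])
    return min(dx, n - dx) + min(dy, n - dy)


def average_hops(n, hop_fn):
    """Average hop count over all ordered pairs of distinct nodes."""
    nodes = list(product(range(n), repeat=2))
    pairs = [(a, b) for a in nodes for b in nodes if a != b]
    return sum(hop_fn(a, b, n) for a, b in pairs) / len(pairs)


for n in (4, 8):
    print(n, round(average_hops(n, mesh_hops), 3),
          round(average_hops(n, torus_hops), 3))
```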
2.3.2 Hierarchical ONoC
In the hierarchical ONoC architecture, all the cores are divided into several clusters.
Optical interconnects are used for the global communication between different clus-
ters, while electronic interconnects are used for the local communication between the
cores in the same cluster. Thus, the hierarchical ONoC can address the wavelength
limitations of the all-optical NoC and reduce the time delay for optical routing path
establishment of the path-reserved ONoC by only interconnecting a small number of
clusters. Both the optical global network and the electronic local network can utilize
different network architectures, thus there are several different kinds of hierarchical
ONoC designs. Two typical kinds of hierarchical ONoC schemes are illustrated in Fig-
ure 2.7. They use a mesh-based optical global network with routing path reservation
and an optical crossbar-based global network with wavelength routing, respectively.
Figure 2.7: Two kinds of hybrid ONoC architectures, (a) mesh-based
optical global network with path reservation, and (b) crossbar-based
optical global network with wavelength routing. (Figure legend: optical
router, waveguide, optical interface, core.)
In Figure 2.7(a), the first kind of hierarchical ONoC scheme employs a mesh-based
optical global network and four cores are connected to the same optical router for
global communication. The local communication between four cores in the same cluster
can use an electronic router/crossbar directly, while the optical global communication
between different clusters is conducted using optical circuit-switching similar to the
traditional path-reserved ONoC with an electronic control network and an optical data
network (Shacham, Bergman, and Carloni, 2008). The hierarchical ONoC schemes
based on this design principle include HOME (Mo, Ye, Wu, Zhang, Liu, and Xu, 2010),
3CEO (Kim, Seo, and Han, 2011), and THOE (Ye, Xu, Wu, Zhang, Liu, and Nikdast,
2012), and they use a path-reserved optical global network with mesh, 3D-mesh, and
torus topologies, respectively. The main advantages of this kind of hierarchical ONoC
schemes can be summarized as: (i) it can reduce the time delay for hop-by-hop routing
path reservation in the electronic control network, and increase the efficiency of the
optical data network by allowing more cores to share the optical interconnects; (ii) it
can achieve low communication delay and power consumption for local communication
by using one hop of electronic transmission; (iii) this hierarchical network architecture
has high scalability, and the electronic local network can be extended to an electronic
mesh network when interconnecting more cores. However, the path-reserved optical
global network can become the bottleneck, especially when a large number of cores
are connected to the same optical router and share the optical global network.
In Figure 2.7(b), the hierarchical ONoC scheme utilizes an optical crossbar as the
global communication network, and multiple cores in the same cluster connect to an op-
tical network interface. The optical global communication can use different kinds of op-
tical crossbars, such as MWSR in Corona (Vantrease, Schreiber, Monchiero, McLaren,
Jouppi, Fiorentino, Davis, Binkert, Beausoleil, and Ahn, 2008) and OCMP (Morris,
Kodi, Louri, and Whaley, 2014), SWMR in Firefly (Pan, Kumar, Kim, Memik, Zhang,
and Choudhary, 2009) and ATAC (Kurian, Miller, Psota, Eastep, Liu, Michel, Kimer-
ling, and Agarwal, 2010), and MWMR in Channel Borrowing (Xu, Yang, and Melhem,
2012a) and METEOR (Bahirat and Pasricha, 2014). The optical crossbar can be de-
ployed in a cyclic manner as shown in Figure 2.7(b), and an off-chip laser can be
used as the light source to reduce the hardware cost and power consumption in chip.
The communication properties of different optical crossbars are introduced in Sec-
tion 2.2.2. For example, it supports one-to-many communication in MWSR crossbar,
many-to-one communication in SWMR crossbar, and many-to-many communication
in MWMR crossbar. Moreover, since it requires M wavelengths to interconnect M
optical network interfaces in the global network, each cluster should contain N/M cores
when there are N cores in total in the many-core processor. Within a cluster, the
cores can be interconnected through an electronic crossbar or a small-scale ENoC. For
instance, a mesh-based ENoC is used in Firefly (Pan, Kumar, Kim, Memik, Zhang,
and Choudhary, 2009) and ATAC (Kurian, Miller, Psota, Eastep, Liu, Michel, Kimer-
ling, and Agarwal, 2010) schemes. The main advantage of this kind of hierarchical
ONoC is that the limited number of wavelengths can be used in a many-core processor
for high-bandwidth global communication. However, since multiple cores in the same
cluster need to share one optical network interface for global communication, it would
become the bottleneck of the hierarchical network.
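The relation between the total number of cores N, the number of optical network interfaces M, and the cluster size N/M can be sketched as follows. The mapping of core identifiers to clusters (consecutive identifiers share a cluster) is an assumption made for illustration; the cited architectures may number their cores differently.

```python
def cluster_size(total_cores, num_interfaces):
    """Cores per cluster when M interfaces serve N cores (N/M)."""
    if total_cores % num_interfaces != 0:
        raise ValueError("N must be divisible by M in this simple sketch")
    return total_cores // num_interfaces


def cluster_of(core_id, size):
    """Cluster index of a core, assuming consecutive ids share a cluster."""
    return core_id // size


def is_global(src, dst, size):
    """True if the communication must cross the optical global network."""
    return cluster_of(src, size) != cluster_of(dst, size)


size = cluster_size(256, 64)      # hypothetical N = 256 cores, M = 64 wavelengths
print(size)                       # cores sharing one optical network interface
print(is_global(0, 3, size))      # same cluster: electronic local network
print(is_global(3, 4, size))      # different clusters: optical crossbar
```

The larger N/M becomes, the more cores compete for a single optical network interface, which is the bottleneck noted above.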
Besides the above two typical schemes, there are several other hierarchical
ONoC schemes. For example, the E-PROPEL architecture, which interconnects 256
cores (Morris and Kodi, 2010), uses a multi-hierarchy network architecture with
four 64-core clusters and sixteen 4-core blocks in each cluster. In the multi-hierarchy
network, four cores are connected electrically in a block, non-blocking optical cross-
bar is used for the communication between different blocks in the same cluster, and
four clusters are further connected using multiple non-blocking optical crossbars in a
fat-tree style. A butterfly fat-tree based hierarchical ONoC architecture, HONoC, is
constructed by using electronic interconnects in the lower levels and a non-blocking
wavelength-routed 4×4 optical router, GWOR, in the top level (Tan, Yang, Zhang, Wang,
and Jiang, 2014). ORNoC architecture utilizes the optical ring for global communi-
cation between optical interfaces and the electronic mesh-based NoC for local com-
munication between cores. Multiple wavelengths and multiple waveguides are used
in the optical ring to achieve non-blocking wavelength-based routing (Le Beux,
Trajkovic, O’Connor, Nicolescu, Bois, and Paulin, 2011). The Chameleon architecture
extends ORNoC by adding an electronic control network that configures the optical
ring according to the communication requirements, which reduces the number of
required wavelengths and waveguides (Le Beux, Li, O’Connor, Cheshmi, Liu,
Trajkovic, and Nicolescu, 2014). The H2ONoC (Fusella and Cilardo, 2017) and Amon
(Werner, Navaridas, and Luján, 2015) architectures are designed to solve the problem
of long optical routing paths in large-scale mesh-based ONoC. They divide the whole
network into four small mesh-based ONoCs and interconnect them using non-blocking
optical interconnects. Thus, the time delay for optical routing path reservation in the
mesh-based ONoC is significantly reduced and the network throughput is increased by
using multiple optical interconnects. In H2ONoC in particular, the cores in the same
row and column are connected using a separate non-blocking wavelength-routed crossbar.
Olympic is an ONoC architecture with a hierarchical ring topology, and it uses opti-
cal interconnects for both global communication and local communication (Bartolini,
Lusnig, and Martinelli, 2013). Non-blocking wavelength-routed optical rings are used
both for inter-core communication in the same cluster and the communication between
clusters. In summary, the design principle of hierarchical ONoC schemes is to improve
the bandwidth of optical global network by sacrificing some hardware costs, e.g., using
non-blocking wavelength-routed optical interconnects or connecting in a fat-tree style.
Table 2.4: Comparisons of Different Hierarchical ONoC Schemes

Scheme: Path-Reserved ONoC
  Design Principle: It consists of an electronic control network and an optical data
    network. Optical routing paths and wavelengths are established and allocated
    dynamically.
  Advantages: No limitation on the number of wavelengths and high wavelength
    utilization. Wavelength reuse can be achieved in the routing and wavelength
    allocation.
  Limitations: Hop-by-hop optical routing reservation is time-consuming when
    interconnecting a large number of cores. Routing and wavelength allocation has
    significant influence on the communication performance.

Scheme: Hierarchical ONoC
  Design Principle: It consists of optical global networks and electronic local
    networks, and different kinds of optical networks and electronic networks can be
    combined.
  Advantages: Optical global network can use the all-optical architecture with
    sufficient wavelengths or the path-reserved architecture with fewer hops.
    Electronic local network is efficient for short-distance communication.
  Limitations: Multiple cores share one optical network interface in the optical
    global network, and it can become the bottleneck when the data rate is high.

2.3.3 Advantages and Disadvantages
The main purpose of electronic-optical hybrid ONoC schemes is to utilize the lim-
ited number of wavelengths to provide high-performance optical communication for
the many-core processor with a large number of cores. Their design principles, main
advantages and limitations are summarized in Table 2.4.
The path-reserved ONoC scheme dynamically establishes an optical routing path
from the source core to the destination core. Thus, it can eliminate the wavelength
limitation in the optical data network compared with the all-optical ONoC architectures,
and improve the wavelength utilization in the optical interconnects compared with the
fixed wavelength-routed architectures. However, the optical routing path establish-
ment can introduce extra processing delay, hardware cost, and power consumption,
especially by using an electronic control network to establish the optical routing path
in a hop-by-hop manner. Since the optical interconnects and wavelengths are shared
by different optical routing paths, heavy congestion can be encountered in the path
reservation process if the usage of optical interconnects and wavelengths is not
balanced in the optical data network and the same wavelength cannot be reused in
multiple routing paths. Therefore, routing and wavelength allocation is an important
research problem in the path-reserved ONoC scheme.
The hierarchical ONoC architecture is designed by using the optical interconnects
for global communication and the electronic interconnects for local communication.
Optical communication requires an electronic-to-optical conversion at the
source side and an optical-to-electronic conversion at the destination side, while
electronic interconnects still offer low transmission delay and power consumption for
short-distance communication. Thus, the hierarchical ONoC is an efficient combination
of optical interconnects and electronic interconnects for many-core processors. Differ-
ent design schemes can be used to construct a hierarchical ONoC architecture, such as
different network topologies in the optical global network and electronic local network,
and they have different advantages and limitations as introduced above. In general,
the design principle of hierarchical ONoC architecture is to achieve high-bandwidth
optical global communication at the price of some additional hardware cost, and to
employ efficient electronic local communication within each cluster of cores.
2.4 Routing and Wavelength Allocation
A routing and wavelength allocation (RWA) scheme calculates the optical routing paths
and allocates the wavelengths once the network architecture of the ONoC is determined.
The calculation of optical routing paths and the allocation of wavelengths are interde-
pendent problems, since the allocated wavelength must be available in all the hops of
optical interconnects along the optical routing path. The RWA scheme can determine
the maximal number of inter-core communications that can be supported in an ONoC
architecture with a limited number of available wavelengths. To accommodate more
optical communications, wavelength reuse should be considered in the design of RWA
scheme. According to different ONoC architectures, existing RWA schemes can be
divided into fixed wavelength routing for all-optical ONoC and dynamical routing and
wavelength allocation for electronic-optical hybrid ONoC. Their design methodologies
are presented separately in this section.
2.4.1 Fixed Wavelength Routing
In the all-optical ONoC architecture, taking λ-router and ORNoC as examples, there
is a fixed correspondence between the optical routing path and the wavelength for the
communication between any two cores. By allocating the optical routing paths and
wavelengths in advance, non-blocking wavelength-based optical communication can be
achieved between all the connected cores. From the wavelength routing matrix of λ-router
and ORNoC in Figure 2.4(a) and Figure 2.5, it can be seen that the main design
principles of fixed wavelength routing scheme include that (i) for any source core, the
optical signals transmitted to different destination cores must use different wavelengths
or waveguides; (ii) for any destination core, the optical signals received from different
source cores must also use different wavelengths or waveguides; and (iii) there is no
wavelength conflict (i.e., two optical routing paths using the same wavelength pass
the same optical interconnect) in all the optical interconnects. However, the number
of required MRs in λ-router increases very fast when the network size scales up, and
it can only be used when there are sufficient wavelengths available; while the number
of required wavelengths increases fast in ORNoC, and it can be used with a limited
number of wavelengths by increasing the number of waveguides.
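Design principles (i) and (ii) can be checked mechanically on a wavelength routing matrix. The sketch below uses a hypothetical cyclic assignment, wavelength index (source + destination) mod n, which is not claimed to be the matrix of Figure 2.4(a) or Figure 2.5. It assumes a single waveguide per port, so "different wavelengths or waveguides" reduces to "different wavelengths". Principle (iii) depends on the physical topology and is not checked here.

```python
def cyclic_wavelength_matrix(n):
    """Hypothetical assignment: entry [s][d] is wavelength (s + d) mod n."""
    return [[(s + d) % n for d in range(n)] for s in range(n)]


def satisfies_port_rules(matrix):
    """Check that wavelengths are distinct per source and per destination."""
    n = len(matrix)
    # (i) each source uses a different wavelength for every destination
    rows_ok = all(len(set(row)) == n for row in matrix)
    # (ii) each destination receives a different wavelength from every source
    cols_ok = all(len({matrix[s][d] for s in range(n)}) == n
                  for d in range(n))
    return rows_ok and cols_ok


matrix = cyclic_wavelength_matrix(4)
for row in matrix:
    print(row)
print(satisfies_port_rules(matrix))
```

With this assignment, n wavelengths suffice for an n-port router in the sense of principles (i) and (ii), which is why the required number of wavelengths grows with the network size.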
According to the analysis of all-optical ONoCs, the main research problem in the
fixed wavelength routing scheme, which targets non-blocking optical communication
using a limited number of wavelengths, can be summarized as deploying all the
wavelengths uniformly in every optical interconnect, thereby improving the utilization
of optical interconnects and reducing the maximal required number of wavelengths.
2.4.2 Dynamical Routing and Wavelength Allocation
The dynamical routing and wavelength allocation scheme aims to utilize the limited
number of wavelengths and optical interconnects more efficiently. The optical routing
path and wavelength are dynamically allocated for each inter-core communication from
a specific source core to its destination core. The dynamical RWA scheme can be im-
plemented in the routing path reservation process of path-reserved ONoC architecture
(Chan and Bergman, 2012). Since the same wavelength should be used in all the optical
interconnects of an optical routing path, the wavelength utilization in the intermediate
optical interconnects should also be considered in the routing computation, instead of
only considering the address of destination core. Otherwise, the path-setup packet may
be congested in the electronic control network if there is no available wavelength in
the next hop optical interconnect in the optical routing path. Thus, the calculation of
optical routing paths and the allocation of wavelengths are interdependent problems
in the RWA scheme (Yoo, Ahn, and Kim, 2003).
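The wavelength continuity constraint, that one wavelength must be free on every optical interconnect of a path, can be sketched with a first-fit allocator. First-fit is a common heuristic from the general RWA literature and is used here only for illustration; it is not the scheme proposed in this thesis. Link names and the data structure are hypothetical.

```python
def first_fit_wavelength(path_links, link_usage, num_wavelengths):
    """Return the lowest wavelength index free on all links, else None."""
    for w in range(num_wavelengths):
        if all(w not in link_usage.get(link, set()) for link in path_links):
            return w
    return None


def reserve(path_links, link_usage, num_wavelengths):
    """Reserve a wavelength on every link of the path if one is available."""
    w = first_fit_wavelength(path_links, link_usage, num_wavelengths)
    if w is not None:
        for link in path_links:
            link_usage.setdefault(link, set()).add(w)
    return w


usage = {}  # maps a link name to the set of wavelengths in use on it
print(reserve(["a-b", "b-c"], usage, 2))          # first path
print(reserve(["b-c", "c-d"], usage, 2))          # shares link b-c
print(reserve(["d-e"], usage, 2))                 # disjoint path, reuse
print(reserve(["a-b", "b-c", "c-d"], usage, 2))   # no common free wavelength
```

The third request reuses a wavelength because its path shares no link with the earlier ones, while the fourth is blocked even though every individual link still has spare capacity somewhere along the path. This is the situation in which a path-setup packet would stall in the electronic control network.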
However, current research does not pay much attention to this problem. Some
path-reserved ONoC architectures directly use broadband optical signals in the optical
circuit-switching or only employ wavelength multiplexing to improve the bandwidth
capacity of each optical interconnect, such as in (Shacham, Bergman, and Carloni,
2008; Gu, Mo, Xu, and Zhang, 2009; Ye, Xu, Huang, Wu, Zhang, Wang, Nikdast,
Wang, Liu, and Wang, 2013; Zhao, Gong, Tan, and Gu, 2016). Some other wavelength-
multiplexed path-reserved ONoC architectures divide the routing computation and
wavelength allocation into two separate problems, such as in (Chan and Bergman,
2012; Chen, Gu, Yang, and Chen, 2012; Chen, Gu, Chen, and Zhang, 2013). These
schemes employ the dimensional routing (i.e., XY routing) in a 2D-mesh based ONoC
or torus based ONoC. The optical routing path is uniquely determined for each pair of
source and destination cores according to their addresses. Then, different wavelengths
are allocated to multiple optical routing paths if they share any optical interconnect.
However, wavelength conflicts may appear in some intermediate optical interconnects,
since the optical routing paths are established distributively by the electronic router in
the electronic control network without any wavelength adaptiveness. WANoC scheme
is designed to reduce the possibility of wavelength conflicts in the mesh-based ONoC
by allocating a distinct wavelength to every router in each row and column. According
to the dimensional routing, if an optical routing path needs to make a turn in a specific
router, it must apply for the corresponding wavelength (Chen, Gu, Yang, and Chen,
2012). In this scheme, the wavelength conflicts only occur in the optical interconnects
between the cores in the same row, and there is no wavelength conflict in the vertical
optical interconnects if the wavelength of turning router can be granted. However, since
the dimensional routing in 2D mesh based ONoC is a deterministic routing algorithm
and more traffic can be routed to the center of network along the shortest paths, the
wavelength utilization in the optical interconnects is not uniformly distributed in the
network and heavy wavelength conflicts would occur in the network center.
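The uneven link utilization under dimensional (XY) routing can be reproduced with a short sketch. It assumes uniform all-to-all traffic, with one routing path per ordered pair of distinct cores, on a k×k mesh and counts how many paths cross each directed link. The traffic pattern and function names are assumptions for illustration.

```python
from collections import Counter
from itertools import product


def xy_route(src, dst):
    """Directed links of the XY route: move along X first, then along Y."""
    x, y = src
    links = []
    while x != dst[0]:
        nx = x + (1 if dst[0] > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dst[1]:
        ny = y + (1 if dst[1] > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links


def link_loads(k):
    """Number of XY paths crossing each directed link of a k x k mesh."""
    nodes = list(product(range(k), repeat=2))
    loads = Counter()
    for s in nodes:
        for d in nodes:
            if s != d:
                loads.update(xy_route(s, d))
    return loads


loads = link_loads(4)
print(loads[((0, 0), (1, 0))])  # link at the edge of a row
print(loads[((1, 0), (2, 0))])  # link in the middle of the same row
```

In the 4×4 case the middle link of a row carries more paths than the edge link, so with a fixed number of wavelengths per interconnect the central links run out of wavelengths first.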
Generally, the dynamical RWA scheme in the path-reserved ONoC can determine
the maximal number of optical communications that can be supported at the same
time with a limited number of available wavelengths in the optical interconnects. To
improve the overall wavelength utilization of optical interconnects and accommodate
more optical inter-core communications, the dynamical RWA should have the ability to
balance the wavelength utilization in the optical network and reuse the same wavelength
in as many optical routing paths as possible. In this thesis, an important research
problem is to explore the wavelength reuse in the design of dynamical routing and
wavelength allocation scheme for the path-reserved ONoC architecture.
2.4.3 Advantages and Disadvantages
In summary, the fixed wavelength routing scheme and the dynamical routing and wave-
length allocation scheme are designed for different ONoC architectures. They have their
own advantages and disadvantages.
The fixed wavelength routing scheme is used in the wavelength-routed all-optical
ONoC architectures. The main purpose of fixed wavelength routing is to achieve non-
blocking optical communication between all the cores with a limited number of wave-
lengths, by balancing the wavelength utilization in all the optical interconnects and
reusing the same wavelength in as many optical routing paths as possible, just like in
the ORNoC architecture. However, it can only be used for interconnecting a small
number of cores with sufficient wavelengths; otherwise, the number of wavelengths
required for non-blocking optical communication greatly exceeds the number of
available wavelengths. Moreover, the hardware costs for fixed wavelength routing increase
significantly when the number of cores interconnected in ONoC increases.
The dynamical routing and wavelength allocation scheme is designed for the path-
reserved ONoC architectures. Since the number of available wavelengths is not enough
to provide non-blocking optical communication between all the cores in an ONoC, the
dynamical routing and wavelength allocation scheme only allocates an optical routing
path and wavelength for the inter-core communication at runtime. To accommodate
more optical routing paths in the ONoC architecture with a limited number of wave-
lengths, it should also consider the current wavelength utilization in the intermediate
optical links, namely to achieve wavelength adaptive routing, instead of only consid-
ering the address of destination core. The key issues in the design of dynamical RWA
scheme include balancing the wavelength utilization in all the optical interconnects and
reusing the same wavelength in different optical routing paths. The main advantage
of dynamical RWA scheme is that it can approximately achieve non-blocking optical
communication between all the source and destination cores when the data rate is
low, and it can accommodate more optical routing paths without wavelength conflicts
within the wavelength limitation when the data rate is high. However, the computational
complexity of the dynamical RWA scheme is high, because the routing computation
and wavelength allocation must be considered at the same time, and the optical routing
path configuration introduces some extra time delay.
2.5 Summary
This chapter explores the design methodology of ONoC. Wavelength-based routing is
preferable in ONoC architecture, since it can increase the bandwidth capacity of opti-
cal interconnects and improve the routing flexibility. However, the number of available
wavelengths is limited by the laser power intensity and crosstalk noise. Several different
ONoC architectures and communication schemes are analysed and compared, includ-
ing the all-optical ONoC and electronic-optical hybrid ONoC, and the fixed wavelength
routing and dynamical routing and wavelength allocation. All-optical ONoC generally
obtains the non-blocking wavelength-routed communication through fixed wavelength
allocation. However, the wavelength limitation and hardware costs restrict its scal-
ability. Electronic-optical hybrid ONoC combines the benefits of electronic intercon-
nects and optical interconnects. The hop-by-hop optical routing reservation in the
path-reserved ONoC and the competition to access the optical global network in the
hierarchical ONoC are the main concerns. Routing and wavelength allocation scheme
determines the maximal communication capacity of an ONoC architecture. To utilize
the optical interconnects and wavelengths more efficiently with a limited number of
wavelengths, wavelength reuse should be considered in the RWA scheme.
Chapter 3
WRH-ONoC: Wavelength-Reused
Hierarchical ONoC Architecture
The number of processing cores integrated in a many-core processor chip keeps increas-
ing to improve the computational capability. However, the design of high-bandwidth
and scalable inter-core network architecture becomes a challenging problem. Generally,
such a network is required to achieve low end-to-end communication delay and high
network throughput with modest hardware cost and energy overhead. In this chapter, a
Wavelength-Reused Hierarchical Optical Network on Chip architecture, WRH-ONoC, is proposed
by exploiting the advantages of λ-router for the non-blocking wavelength-based routing
and hierarchical networking for the wavelength reuse in all the λ-routers.
In WRH-ONoC architecture, all the cores are grouped into multiple subsystems,
where the size of each subsystem can be configured according to the number of available
wavelengths. Then, the cores in the same subsystem are directly interconnected using
a single λ-router to realize non-blocking intra-subsystem communication. For the inter-
subsystem communication, all the subsystems are further connected through multiple
λ-routers and gateways in a hierarchical manner, by which optical signals can change
their wavelengths in the gateways. WRH-ONoC is capable of interconnecting hundreds
or even thousands of cores by using only a limited number of wavelengths in each optical
interconnect, through the reuse of available wavelengths in all the λ-routers. Efficient
routing schemes are designed for both unicast and multicast traffic in WRH-ONoC.
Furthermore, the minimum hardware requirement of WRH-ONoC is analysed, and
the expected end-to-end communication delay and the maximum data rate are derived,
given the total number of cores and the maximal number of available wavelengths in
optical links. Both theoretical analysis and simulation results indicate that
WRH-ONoC can achieve significant improvements in communication performance and
network scalability (e.g., a 46.0% reduction in the zero-load packet delay and a 72.7%
improvement in the maximal throughput for 400 cores, with modest hardware
cost and energy overhead) in comparison with existing schemes.
3.1 Motivation
3.1.1 Scalability of Many-Core Processor
As predicted by Moore’s Law, the continuous development of semiconductor manufacturing
technology enables more and more transistors to be integrated in a single
processor chip (Moore, 2006). An efficient design methodology to utilize the abundant
transistors with high performance is to increase the number of processing cores, instead
of increasing the complexity of each core (Geer, 2005; Borkar, 2007). Currently,
multi-core processors are widely used in personal digital devices, e.g., 2 to
8 cores in a desktop computer and 4 to 16 cores in a smartphone or tablet. In industry,
high-performance processors that integrate tens to hundreds of cores are becoming the
mainstream platform for cloud computing, data center, and supercomputing systems
(Gries, Hoffmann, Konow, and Riepen, 2011), such as 80 cores in Intel Teraflops (Intel,
2007), 256 cores in Kalray MPPA (Kalray, 2012), and 260 cores in Sunway SW26010 (Fu,
Liao, Yang, Wang, Song, Huang, Yang, Xue, Liu, Qiao, Zhao, Yin, Hou, Ge, Zhang,
Wang, Zhou, and Yang, 2016). These many-core processors can provide extremely
high computation capability for distributed and parallel computing applications. Ac-
cording to the technological predictions in (Borkar, 2007; Nychis, Fallin, Moscibroda,
Mutlu, and Seshan, 2012), up to trillions of transistors with thousands of cores can
be integrated in the many-core processor in the future. However, the scalability of
many-core processors is still constrained by the performance of the inter-core communication
architecture (Meindl, 2003). In general, the main challenges in the design
of the inter-core network include: (i) the communication delay between cores increases
significantly relative to the computation delay, because the delay of electronic links
does not shrink as fast as transistor speed increases. Thus, communication
between two distant cores, or global communication, can experience a
high packet delay; (ii) the interconnection architecture should provide extremely high
bandwidth capacity for the inter-core communication among a large number of cores,
especially for data-intensive applications and cache coherence messages.
Network on Chip (NoC) is an inter-core communication technology which intro-
duces the principles of networking and packet routing into the communication of many-
core processors (Dally and Towles, 2001; Kumar, Jantsch, Soininen, Forsell, Millberg,
Oberg, Tiensyrja, and Hemani, 2002; Henkel, Wolf, and Chakradhar, 2004; Nychis,
Fallin, Moscibroda, Mutlu, and Seshan, 2012). Generally, in a NoC architecture, each
core is connected to an electronic router through the network interface. The electronic
routers are interconnected in different network topologies to construct a communication
network, which implements the store-and-forward switching scheme for the communi-
cation between any two cores. The design issues of NoC have been widely studied
(Marculescu, Hu, and Ogras, 2005), including network topology (Pande, Grecu, Jones,
Ivanov, and Saleh, 2005), routing algorithm (Ma, Jerger, and Wang, 2011), router
architecture (Kim, Nicopoulos, Park, Narayanan, Yousif, and Das, 2006), mapping
scheme (Hu and Marculescu, 2005), and virtual-channel arbitration (Nicopoulos, Park,
Kim, Vijaykrishnan, Yousif, and Das, 2006), etc. For a processor with tens of cores,
NoC can obtain good communication performance, e.g., with low end-to-end packet
delay and high network throughput. However, when it is extended to hundreds or even
thousands of cores, the traditional NoC architecture with electronic interconnects cannot
satisfy the communication requirements. For example, in a many-core processor
with 400 cores in a 20×20 mesh-based NoC, the hop-by-hop buffering and routing can
lead to high communication delay due to the longer average distance between cores, high
hardware cost and power consumption due to the larger buffer space, and low throughput
due to heavier traffic congestion, especially in the center of the network.
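
To give a rough sense of how the average distance grows with the mesh size, the following sketch counts hops in a k×k mesh. It is an illustration only, and it assumes minimal (e.g., XY) routing, so that the hop count equals the Manhattan distance, and uniform traffic among distinct cores; the function name mean_hop_distance is hypothetical and does not come from any NoC simulator.

```python
from itertools import product


def mean_hop_distance(k: int) -> float:
    """Mean Manhattan (hop) distance between distinct nodes of a k x k mesh."""
    nodes = list(product(range(k), repeat=2))
    # Sum the hop counts over all ordered pairs; a node paired with itself adds 0.
    total = sum(abs(x1 - x2) + abs(y1 - y2)
                for (x1, y1) in nodes
                for (x2, y2) in nodes)
    pairs = len(nodes) * (len(nodes) - 1)  # ordered pairs of distinct nodes
    return total / pairs


if __name__ == "__main__":
    for k in (4, 8, 20):
        print(f"{k * k} cores: {mean_hop_distance(k):.2f} hops on average")
```

Under these assumptions the mean distance is 2k/3 hops, i.e., about 13.3 hops for the 20×20 mesh compared with about 2.7 hops for a 4×4 mesh, and each hop adds buffering and routing delay.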
Therefore, the design of a high-performance and scalable interconnection architecture
is an important and challenging research problem. Such an architecture must provide
efficient communication for hundreds of cores or more with low end-to-end delay and high
bandwidth capacity, as well as low hardware cost and energy overhead.
3.1.2 Optical Network on Chip
Optical Network on Chip (ONoC) is a silicon-compatible optical interconnection net-
work among the cores at the chip level (Kirman, Kirman, Dokania, Martinez, Apsel,
Watkins, and Albonesi, 2006; Petracca, Lee, Bergman, and Carloni, 2009; Li, Brown-
ing, Gratz, and Palermo, 2014; Morris, Jolley, and Kodi, 2014). It can overcome the
limitations of electronic interconnects by transmitting the data packets with modulated
optical signals with extremely low transmission delay, high bandwidth capacity, and
low power consumption. Moreover, with the benefits of Wavelength Division Multiplexing
(WDM), multiple optical signals can be transmitted simultaneously through the
same optical link using different wavelengths. Thus, some wavelength-routed ONoC
architectures have been proposed to eliminate the hop-by-hop routing and buffering of
traditional electronic NoCs. However, at present the optical interconnects also suffer
from some limitations. For example, the number of available wavelength channels is
limited (e.g., at most 62 wavelengths with 19 Gbps bandwidth and -20 dB noise
tolerance (Preston, Droz, Levy, and Lipson, 2011)), and there is no mature chip-scale
optical device for direct wavelength conversion. These limitations pose great challenges to the design of
high-performance and scalable ONoC for many-core processors.
The fundamental optical components in an ONoC include light sources, waveguides,
optical routers, modulators, and photodetectors. The light source provides the optical signals
onto which the data packets are modulated, and it can be implemented by using an off-chip
laser coupled with power waveguides (Morris, Jolley, and Kodi, 2014) or on-chip
lasers directly for each core (Chen, Zhang, Contu, Klamkin, Coskun, and Joshi, 2014).
The waveguide is the optical transmission medium, whose propagation loss can be less than
1 dB/cm. The optical router conducts high-speed switching of optical signals when its
connection is configured in advance. The modulator converts electronic signals into
optical signals (with on-off keying), and the photodetector receives the optical signals
and converts them back to electronic signals. The microring resonator (MR) is the
basic element of optical routers, modulators, and photodetectors. It is a compact and
energy-efficient optical filter which is designed to pass the optical signals with a specific
wavelength (Liu, Liao, Chetrit, Basak, Nguyen, Rubin, and Paniccia, 2010). As shown
in Figure 3.1(a), when the wavelength of an input signal λi equals the resonant
wavelength λr of the MR, the optical signal couples into the MR and changes its direction.
Thus, in an optical router, the optical signals with different wavelengths can be filtered
and routed to different outputs in parallel based on the resonant wavelengths of the MRs,
as shown in Figure 3.1(b). Nλ sets of MRs with different resonant wavelengths are
required in each optical router to transmit optical signals with Nλ wavelengths.
As described in Chapter 2, most existing ONoC architectures can be classified
into two categories: all-optical (O’Connor, 2004; Koohi and Hessabi, 2014) and
electronic-optical hybrid (Shacham, Bergman, and Carloni, 2008; Vantrease, Schreiber,
Monchiero, McLaren, Jouppi, Fiorentino, Davis, Binkert, Beausoleil, and Ahn, 2008;
Kurian, Miller, Psota, Eastep, Liu, Michel, Kimerling, and Agarwal, 2010; Pan, Ku-
mar, Kim, Memik, Zhang, and Choudhary, 2009; Liu, Zhang, Chen, Huang, and Gu,
2016).

[Figure 3.1, panel (a): an MR with resonant wavelength λr and ports Input, Through (λi ≠ λr), and Turn (λi = λr); panel (b): an optical switch with input I carrying λ1, λ2, ..., λn and outputs O1, O2, ..., On.]
Figure 3.1: The operation principle of (a) microring resonator as
a wavelength-selective filter, and (b) the optical switch by filtering
optical signals with different wavelengths.

All-optical ONoC architectures connect all the cores with solely optical
components, e.g., λ-router is a fully-connected architecture constructed by the cascaded
optical switching units (O’Connor, 2004). Generally, most of the all-optical ONoC
architectures can provide non-blocking communication through the fixed wavelength
allocation for the optical routing paths. However, their main drawback is the poor
scalability, due to the wavelength constraint and quadratic increase of required optical
devices. Electronic-optical hybrid ONoC architectures were proposed to achieve better
scalability by combining the properties of electronic and optical interconnects. One
type of hybrid ONoC is to use an extra electronic control network to conduct buffer-
ing and processing for the optical network, and the end-to-end optical routing paths
are established in the electronic control network before the communication (Shacham,
Bergman, and Carloni, 2008). However, each optical communication needs to exclusively
reserve the optical routing path in a hop-by-hop way, which is time-consuming
and leads to severe traffic contention and hardware overhead, especially when connecting
up to hundreds of cores. Another type is to construct a hierarchical ONoC network with
multiple electronic local interconnects and one optical global interconnect (Vantrease,
Schreiber, Monchiero, McLaren, Jouppi, Fiorentino, Davis, Binkert, Beausoleil, and
Ahn, 2008; Kurian, Miller, Psota, Eastep, Liu, Michel, Kimerling, and Agarwal, 2010;
Pan, Kumar, Kim, Memik, Zhang, and Choudhary, 2009). Even though this approach
needs no prior path reservation, it still cannot make full use of the advantages of optical
interconnects, due to the inefficient local communication and the severe contention for
access to the global optical network at high data rates.
3.1.3 Non-Blocking λ-Router
λ-router is a wavelength-routed optical architecture which can provide non-blocking
communication among all the connected cores (O’Connor, 2004). Since the λ-router is one
of the main components used in the proposed ONoC architecture, it is introduced in
detail. Figure 3.2(a) illustrates the architecture of an 8-input × 8-output λ-router.
Each core {i} connects to an input Ii and an output Oi to transmit and receive optical
signals. The key component in a λ-router is the 2-inputs×2-outputs optical switching
element (OSE) using two MRs with the same resonant wavelength λr, as shown in
Figure 3.2(b). Eight stages of MRs with eight different wavelengths, λ1 to λ8, are used
for the optical routing of eight cores. When the wavelength of the input signal λi equals
λr of the OSE, the optical signal will be coupled into an MR and output from one port;
otherwise the optical signal will pass the OSE and output from the other port. The
footprint of an OSE is very small, less than 10×10 µm² when the radius of each MR is
smaller than 3 µm (Kazmierczak, Briere, Drouard, Bontoux, Rojo-Romeo, O’Connor,
Letartre, Gaffiot, Orobtchouk, and Benyattou, 2005).
[Figure 3.2, panel (a): an 8×8 λ-router with inputs I1 to I8, outputs O1 to O8, and OSEs arranged in Stage 1 to Stage 8; panel (b): an OSE built from waveguides and two micro-resonators with resonant wavelength λr; panel (c): the wavelength routing matrix M below, where row i is input Ii, column j is output Oj, entry k denotes λk, and "-" marks the unused diagonal.]

      O1  O2  O3  O4  O5  O6  O7  O8
I1     -   5   3   6   2   7   1   8
I2     5   -   4   7   3   8   2   1
I3     3   4   -   5   1   6   8   7
I4     6   7   5   -   4   1   3   2
I5     2   3   1   4   -   5   7   6
I6     7   8   6   1   5   -   4   3
I7     1   2   8   3   7   4   -   5
I8     8   1   7   2   6   3   5   -

Figure 3.2: The principle of λ-router: (a) the connection architecture;
(b) optical switching element (OSE); (c) wavelength routing matrix.
For non-blocking optical communication among all of the connected cores, an N×N
λ-router needs to employ WDM by using N waveguides and N wavelengths. As illustrated
in Figure 3.2(a), the OSE layout in a λ-router is to place OSEs in N stages where
the number of OSEs in stage i is $\lfloor N/2 \rfloor$ if i is odd and $\lfloor (N-3)/2 \rfloor$ if i is even (Kazmierczak,
Briere, Drouard, Bontoux, Rojo-Romeo, O’Connor, Letartre, Gaffiot, Orobtchouk, and
Benyattou, 2005). Since there is no communication between input Ii and output Oi as
they connect to the same core {i}, the minimum number of MRs and OSEs required
are $N(N-2)$ and $\lceil N(N-2)/2 \rceil$, respectively (Liu, Zhang, Chen, Huang, and Gu, 2015). In
each stage, all the OSEs share the same resonant wavelength λr. Different resonant
wavelengths should be used in different stages. The wavelength used for the optical
communication between Ii and Oj is determined by Mi,j ∈ M, where M is the wavelength
routing matrix. For example, in Figure 3.2(c), M2,3=λ4 and M2,8=λ1. Thus,
the optical communication paths from I2 to O3 and from I2 to O8 are determined only by
the wavelengths λ4 and λ1, respectively, as shown by the dashed lines in Figure 3.2(a).
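
The lookup in the wavelength routing matrix can be illustrated with a short sketch. The matrix values below are transcribed from Figure 3.2(c); the position of the empty diagonal entry in each row is an assumption about the figure layout, and the helper names wavelength and is_contention_free are hypothetical. The check only expresses the non-blocking condition that no input and no output sees the same wavelength twice.

```python
# Wavelength routing matrix of the 8x8 λ-router, transcribed from Figure 3.2(c).
# M[i][j] = k means input I(i+1) reaches output O(j+1) on wavelength λk.
# 0 marks the unused diagonal (assumed position of the gap in each row).
M = [
    [0, 5, 3, 6, 2, 7, 1, 8],
    [5, 0, 4, 7, 3, 8, 2, 1],
    [3, 4, 0, 5, 1, 6, 8, 7],
    [6, 7, 5, 0, 4, 1, 3, 2],
    [2, 3, 1, 4, 0, 5, 7, 6],
    [7, 8, 6, 1, 5, 0, 4, 3],
    [1, 2, 8, 3, 7, 4, 0, 5],
    [8, 1, 7, 2, 6, 3, 5, 0],
]


def wavelength(src: int, dst: int) -> int:
    """Return k such that the path from I_src to O_dst uses λk (1-based indices)."""
    if src == dst:
        raise ValueError("a core does not send to itself through the λ-router")
    return M[src - 1][dst - 1]


def is_contention_free(matrix) -> bool:
    """Each input (row) and each output (column) must use every wavelength at most once."""
    n = len(matrix)
    for i in range(n):
        row = [matrix[i][j] for j in range(n) if j != i]
        col = [matrix[j][i] for j in range(n) if j != i]
        if len(set(row)) != n - 1 or len(set(col)) != n - 1:
            return False
    return True


if __name__ == "__main__":
    print(wavelength(2, 3), wavelength(2, 8))  # the two examples given in the text
    print(is_contention_free(M))
```

With this transcription, the two lookups reproduce the examples M2,3=λ4 and M2,8=λ1 from the text.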
λ-router has several distinctive advantages that make it promising for small-scale
ONoCs: (i) fully non-blocking communication between all the connected cores through
the wavelength routing; (ii) multicast capability through transmitting the multicast
packets to different destination cores using different wavelengths concurrently; (iii) low
transmission delay and high bandwidth capacity with the fixed routing and wavelength
allocation. If a waveguide contains N−1 channels each using a specific wavelength, then
the aggregate bandwidth per waveguide can reach 10×(N−1) Gbps, when the bandwidth
is 10 Gbps/wavelength (Liu, Liao, Chetrit, Basak, Nguyen, Rubin, and Paniccia, 2010).
However, the main drawback of the λ-router is its poor scalability. The numbers of
waveguides and wavelengths are linearly proportional to the scale of the λ-router (the
number of ports), but the numbers of OSEs and MRs increase quadratically. That
makes it unsuitable for large-scale ONoCs due to the limited available wavelengths
and chip area, e.g., to interconnect 100 cores, a λ-router requires as many as 100
wavelengths, 100 waveguides, and 4900 OSEs (9800 MRs) cascaded in 100 stages.
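
The growth of the hardware cost can be checked with a small calculation. The sketch below is only an illustration of the counting formulas quoted in this section (⌊N/2⌋ OSEs in odd stages, ⌊(N−3)/2⌋ OSEs in even stages, N(N−2) MRs, and 10 Gbps per wavelength over N−1 channels); the function names are hypothetical.

```python
def oses_in_stage(n: int, stage: int) -> int:
    """OSEs in stage `stage` (1-based) of an N x N λ-router."""
    if n < 3:
        raise ValueError("the stage formula is only meaningful for N >= 3")
    # floor(N/2) OSEs in odd stages, floor((N-3)/2) OSEs in even stages.
    return n // 2 if stage % 2 == 1 else (n - 3) // 2


def lambda_router_cost(n: int, gbps_per_wavelength: int = 10) -> dict:
    """Resource counts for one N x N λ-router, using the formulas quoted above."""
    total_oses = sum(oses_in_stage(n, s) for s in range(1, n + 1))
    return {
        "wavelengths": n,
        "waveguides": n,
        "stages": n,
        "OSEs": total_oses,
        "MRs": n * (n - 2),
        "Gbps_per_waveguide": gbps_per_wavelength * (n - 1),
    }


if __name__ == "__main__":
    for n in (8, 25, 100):
        print(n, lambda_router_cost(n))
```

Summing the per-stage counts gives the same total as ⌈N(N−2)/2⌉, e.g., 4900 OSEs and 9800 MRs for N=100, whereas a 25×25 λ-router needs only 288 OSEs.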
3.1.4 Hierarchical Networking
To improve the scalability of ONoC for many-core processors with more than hundreds
of cores, several representative electronic-optical hybrid ONoC architectures have been
proposed. PNoC (Shacham and Bergman, 2007) and HOME (Mo, Ye, Wu, Zhang,
Liu, and Xu, 2010) employ the optical circuit-switching which uses an extra electronic
control network, for buffering and routing, to configure the optical routing paths be-
tween source and destination cores in the optical data network. In PNoC, each core
needs to connect with an electronic router and an optical router, and the corresponding
electronic control network and optical data network should use the identical topology.
HOME, in contrast, utilizes a condensed optical data network, in which four cores share the
same electronic router and optical router. In these ONoC architectures, the end-to-end
optical routing path is exclusively reserved before each communication, hop by hop,
by the electronic control network. Thus, they can provide quality of service
(QoS) guaranteed optical communication for some data-intensive applications after the
optical routing path reservation. However, the time-consuming electronic path-setup
process leads to a long preparation delay and low link utilization for short packets
due to severe contention, especially in a many-core processor with a large number
of short control packets for cache coherence. Different from these ONoC schemes, the
proposed scheme in this chapter utilizes non-blocking λ-routers with wavelength-based
routing, thus it needs no dynamic optical routing path reservation in advance of each
inter-core communication. Moreover, to interconnect N cores, PNoC needs
N electronic and optical routers in a mesh/torus topology, while HOME shares one
electronic/optical router among four cores so that it is a mesh with N/4 nodes. Thus,
these ONoC schemes can lead to longer communication paths on average for large-scale
many-core processors. That also leads to higher transmission delay and much more
potential traffic congestion during the optical routing path reservation. In the proposed
scheme, the optical routing path is significantly shortened because of the hierarchical
interconnection of λ-routers, and multiple redundant routing paths are provided in the
gateways between two λ-routers to achieve load balance.
Corona (Vantrease, Schreiber, Monchiero, McLaren, Jouppi, Fiorentino, Davis,
Binkert, Beausoleil, and Ahn, 2008), ATAC (Kurian, Miller, Psota, Eastep, Liu, Michel,
Kimerling, and Agarwal, 2010), and Firefly (Pan, Kumar, Kim, Memik, Zhang, and
Choudhary, 2009) schemes employ the idea of hierarchical networking. They divide
all the cores in a large many-core processor into several small clusters, then use an
electronic local network in each cluster and connect all the clusters through a global
cyclic optical crossbar. For the intra-cluster traffic, they use different electronic inter-
connects, e.g., crossbar for Corona, mesh for ATAC and Firefly. For the inter-cluster
traffic, the global optical crossbar utilizes WDM to provide separate wavelength chan-
nels among different clusters, e.g., MWSR (Multi-Write-Single-Read) for Corona, and
SWMR (Single-Write-Multi-Read) for ATAC and Firefly, and an arbitration scheme
is required to resolve the access contention in the global optical network. In terms
of scalability, the electronic intra-cluster interconnects are not efficient, especially
for data-intensive applications, and the access contention in the global optical network
can also become a bottleneck when multiple cores in the same cluster need to
communicate with cores in other clusters. In the proposed scheme, the optical
interconnects are used for both intra- and inter-subsystem traffic with the non-blocking
λ-routers, in which every input uses different wavelength channels to different outputs,
thus it needs no extra arbitration for optical interconnects. Since the wavelength can
be changed in the gateways, all the available wavelengths can be reused in all
the λ-routers. Redundant optical paths with load-balancing ability are offered for
the inter-subsystem traffic by using multiple gateways between the same two λ-routers.
3.1.5 Main Contributions of WRH-ONoC
To realize high-performance and scalable communication for many-core processors,
a Wavelength-Reused Hierarchical ONoC architecture, WRH-ONoC, is proposed by
exploring the benefits of wavelength reuse and hierarchical networking. In the WRH-
ONoC architecture, all of the cores are grouped into subsystems. Then, λ-router
is utilized to implement non-blocking optical communication within each subsystem,
while multiple λ-routers and gateways are further connected in a hierarchical manner to
provide high-bandwidth inter-subsystem communication. All the available wavelengths
are reused in λ-routers through wavelength reassignment in gateways. Hence,
WRH-ONoC is capable of connecting hundreds or even thousands of cores by using a limited
number of wavelengths. The key contributions of WRH-ONoC are listed as follows:
• A wavelength-reused hierarchical architecture that sustains the strength of λ-
router in high-throughput communication but offsets its weakness in poor scal-
ability is proposed. By dividing the cores into subsystems and connecting them
using λ-routers in a hierarchical network, the available wavelengths can be reused
in all of the λ-routers through the wavelength reassignment in gateways.
• Efficient unicast and multicast routing schemes are designed for WRH-ONoC.
Intra-subsystem communication is fully non-blocking via one hop of λ-router,
while low-latency and high-throughput inter-subsystem communication
is achieved through dynamic load balancing among sibling gateways in
each hop and parallel wavelength assignment in each gateway.
• The expected end-to-end communication delay, the maximum allowed data rate,
and the average energy consumption for the unicast communication are derived,
assuming the traffic follows the Uniform distribution in space and the Poisson
distribution in time.
• The hardware requirements of WRH-ONoC with a given number of cores are analysed,
and the results indicate that the number of optical devices can be reduced by about 90%
in comparison with the λ-router scheme. Meanwhile, compared to existing ONoC
schemes, the overall hardware cost measured in chip area can also be reduced.
• Extensive simulations are carried out to evaluate the performance of WRH-
ONoC, using both realistic data traces and synthetic traffic patterns. Simulation
results indicate that it is efficient for both unicast and multicast communications.
Compared with existing ONoC schemes, it can achieve significant improvements
in communication performance and scalability, e.g., a decrease of 46.0%
in the zero-load packet delay and an increase of 72.7% in the maximal throughput
for 400 cores when using the same number of wavelengths. For multicast
communication, the zero-load delay can also be reduced by 63.9% and the maximal
throughput can be increased by 8.4 times compared to the PNoC scheme
with tree-based multicast routing, even with only 5% of multicast traffic.
3.2 Network Architecture
The key idea of the proposed WRH-ONoC architecture is to sustain the strength of
λ-router in non-blocking routing but offset its weakness in scalability by combining the
hierarchical network with wavelength reuse. Moreover, optical communication is
realized for both local and global traffic in WRH-ONoC. The network architecture
of WRH-ONoC is presented in the following subsections.
3.2.1 Hierarchical Interconnection
As shown in Figure 3.3, WRH-ONoC employs a hierarchical network architecture con-
structed by multiple λ-routers and gateways. The λ-routers are used to provide non-
blocking optical communication among the connected cores or gateways, while the
gateways are used to conduct wavelength re-assignment for optical signals, and thus to
achieve wavelength reuse in λ-routers. Assume that an ONoC is to be designed to interconnect
N cores by using only Wmax wavelengths, where N ≫ Wmax. In the WRH-ONoC architecture,
as illustrated in Figure 3.3, all the cores are grouped into multiple subsystems,
where each subsystem has n cores according to the number of available wavelengths
and the requirements of applications. For simplicity, each subsystem is assumed to
connect the same number of cores to a level-1 λ-router in the following analysis.
Within each subsystem, the cores are connected directly using a small λ-router
with sufficient wavelengths, thereby providing non-blocking optical communication
in each subsystem. Data packets can be transmitted in pre-determined optical
routing paths with the allocated wavelengths according to the wavelength routing ma-
trix as shown in Figure 3.2(c). For the communication between cores in different
subsystems, data packets need to be transmitted through the network hierarchy by
passing multiple λ-routers and gateways. The gateways serve as the bridges between
two λ-routers, and multiple sibling gateways, which connect with the same two λ-routers,
are used to provide high throughput between λ-routers in adjacent levels in
the hierarchical network.

[Figure 3.3: Subsystems 0 to 7, each with a 25×25 level-1 λ-router and sibling gateways G0 to G4; two 25×25 level-2 λ-routers; one 10×10 level-3 λ-router; example cores CA to CE. The legend distinguishes intra-subsystem from inter-subsystem communication and marks the gateways for wavelength conversion, the multiple sibling gateways for load balance, and the non-blocking intra-subsystem communication.]
Figure 3.3: An example of WRH-ONoC for connecting 160 cores using
25 wavelengths with 3 levels of λ-routers and 5 sibling gateways.
Figure 3.3 gives an example for interconnecting 160 cores using only 25 wavelengths
via a three-level hierarchical λ-router network. All 160 cores are divided into 8 subsystems
with each subsystem having n=20 cores, and g=5 sibling gateways are used to
connect one λ-router to the next-level λ-router. Each λ-router in level 1 is a 25×25
λ-router that interconnects all 20 cores in one subsystem and 5 gateways. Each λ-router
in level 2 is also a 25×25 λ-router but interconnects 20 gateways from 4 subsystems and
5 gateways for higher level interconnection. The level-3 λ-router is a 10×10 λ-router
that interconnects with two level-2 λ-routers using 10 gateways. Since all the λ-routers
reuse the same set of available wavelengths, the inter-subsystem communication
between cores in different subsystems may use different wavelengths to pass through
different λ-routers, e.g., the source core CD to the destination core CE in Figure 3.3.
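
The port budget of this example can be verified with a few lines of code. The sketch below is a bookkeeping aid only, not the design procedure of WRH-ONoC: the class LambdaRouter, the function build_example, and the parameter fan_in (the number of child λ-routers per parent, 4 in the example) are hypothetical names, and the only rule applied is that an N×N λ-router needs N wavelengths (Section 3.1.3), so that every router size must not exceed the 25 available wavelengths.

```python
from dataclasses import dataclass


@dataclass
class LambdaRouter:
    level: int
    down_ports: int  # cores (level 1) or gateways coming from the level below
    up_ports: int    # sibling gateways towards the level above

    @property
    def size(self) -> int:
        """Number of ports, i.e., N of the N x N λ-router."""
        return self.down_ports + self.up_ports


def build_example(n: int = 20, g: int = 5, subsystems: int = 8, fan_in: int = 4):
    """Port counts for the three-level example of Figure 3.3."""
    level1 = [LambdaRouter(1, n, g) for _ in range(subsystems)]
    level2 = [LambdaRouter(2, fan_in * g, g) for _ in range(subsystems // fan_in)]
    level3 = [LambdaRouter(3, len(level2) * g, 0)]  # the root has no upward gateways
    return level1 + level2 + level3


if __name__ == "__main__":
    W_MAX = 25
    routers = build_example()
    cores = sum(r.down_ports for r in routers if r.level == 1)
    print("cores:", cores)
    for r in routers:
        print(f"level {r.level}: {r.size}x{r.size}, fits: {r.size <= W_MAX}")
```

The level-1 and level-2 λ-routers use all 25 wavelengths, while the 10×10 level-3 λ-router needs only 10 of them.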
In the proposed WRH-ONoC architecture, an off-chip laser with multiple wave-
lengths is coupled to several power waveguides which provide light source for each core
and gateway separately (Vantrease, Schreiber, Monchiero, McLaren, Jouppi, Fiorentino,
Davis, Binkert, Beausoleil, and Ahn, 2008). Moreover, due to the current technology limitations
of chip-scale optical wavelength converters, the optical signal needs to be
converted to an electronic signal, temporarily buffered, and retransmitted using another
wavelength in the gateways. If all-optical wavelength converters mature in the
future, the gateway can be updated accordingly. Therefore, through the wavelength
reassignment in gateways, all the Wmax available wavelengths can be reused by each
λ-router, which only connects with a group of cores instead of the whole system, thereby
overcoming the scalability drawback of using a large-scale λ-router directly. Note
that to ensure the non-blocking intra-subsystem communication, the number of cores
and sibling gateways connected to the same λ-router should be no larger than the
number of wavelengths Wmax, i.e., g + n ≤ Wmax must always hold.
From the perspective of networking, WRH-ONoC is similar to a tree network, i.e.,
the top-level λ-router is the root and all the cores are the leaves. However, the most
important difference is that WRH-ONoC has multiple redundant channels between
each pair of parent and child λ-routers by using multiple sibling gateways, thus it can
achieve higher bandwidth capacity between two λ-routers in different levels. Load balance
can be realized by dynamically and uniformly choosing a sibling gateway between
λ-routers. As shown in Figure 3.3, there are 625 possible routing paths between the source
core CD and the destination core CE, depending on which sibling gateways are used in
each hop in the routing path.
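
The path count can be reproduced with a simple enumeration. The sketch below assumes that CD and CE lie under different level-2 λ-routers, so that their lowest common λ-router is the level-3 root and a packet crosses four gateway hops (two upward and two downward), each with g=5 sibling gateways to choose from; the function names are hypothetical.

```python
from itertools import product


def num_routing_paths(g: int, common_level: int) -> int:
    """Number of gateway choices when the lowest common λ-router is at `common_level`."""
    gateway_hops = 2 * (common_level - 1)  # hops up to the common λ-router and down again
    return g ** gateway_hops


def enumerate_paths(g: int, common_level: int):
    """Yield every path as a tuple of sibling-gateway indices, one index per hop."""
    return product(range(g), repeat=2 * (common_level - 1))


if __name__ == "__main__":
    print(num_routing_paths(5, 3))           # lowest common λ-router at level 3
    print(num_routing_paths(5, 2))           # lowest common λ-router at level 2
    print(len(list(enumerate_paths(5, 3))))  # explicit enumeration, same count
```

Under this assumption the count is 5^4 = 625, which matches the number stated above; for two cores under the same level-2 λ-router the same reasoning gives 25 paths, and intra-subsystem traffic has a single path.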
3.2.2 Gateway Structure
From Figure 3.3, it can be seen that the gateway is the key component in WRH-ONoC.
The detailed structure of the gateway is shown in Figure 3.4. To conduct the wavelength
conversion between two λ-routers, each gateway consists of input/output ports,
optical-to-electronic (O-E) and electronic-to-optical (E-O) converters, buffer queues, packet
dispatchers, and wavelength routing matrices. Note that each gateway has two separate
pairs of input/output ports: one pair for the upward traffic from a lower-level λ-router
to a higher-level λ-router, and the other for the downward traffic
from a higher-level λ-router to a lower-level λ-router. Each
pair of input/output ports has independent buffer queues. Let wl and wh be the
numbers of wavelengths used in the lower-level and higher-level λ-routers which are
interconnected by the gateway, respectively. As shown in Figure 3.4, there are wl MR-based
photodetectors in the O-E converter for Input1, each receiving the optical
signal with a specific wavelength. The optical signal filtered by the MR with resonant
wavelength λi is converted to an electronic signal by the associated photodetector and
buffered in the input queue for λi. Similarly, there are wh MR-based modulators in
the E-O converter for Output1, each operating at a specific wavelength. The data
packet which is allocated to λj in the output buffer is modulated onto the optical signal
by the MR with wavelength λj and then transmitted through the λ-router. The
input queues and output buffers are fully connected using an internal crossbar for
wavelength switching. For parallel wavelength reassignment in gateways, each input
queue is associated with a packet dispatcher which is responsible for dispatching data
packets from the input queue to the corresponding outputs based on the wavelength to
be used in the next hop, according to the wavelength matrices. The design for the other
pair of input/output ports (the downward direction) is similar.

[Figure 3.4, panel (a): Input1 (from the lower-level λ-router) passes through an O/E converter, independent packet dispatchers, an internal crossbar (Xbar), and an E/O converter to Output1 (to the higher-level λ-router); Input2 (from the higher-level λ-router) to Output2 (to the lower-level λ-router) mirrors this path; the dispatchers consult the lower-level and higher-level λ-router wavelength matrices. Panel (b): electrical inputs, MR-based modulators for λ1 to λw fed by a power waveguide from the off-chip laser, optical outputs. Panel (c): optical inputs, MR-based photodetectors for λ1 to λw, electrical outputs.]
Figure 3.4: The structure of gateway: (a) two internal data paths
for wavelength assignment of upward and downward traffic; (b) E-O
converter with MR-based modulators; (c) O-E converter with MR-based
photodetectors.
The gateway operates as follows: when a set of WDM-based optical signals enters
an input port, each single-wavelength signal is filtered by a specific MR, converted to
an electronic signal, and written to the corresponding input queue based on the receiving
wavelength, as shown in Figure 3.4(c). Each packet dispatcher works for one wavelength
and continually dispatches packets from its input queue to the output port.
According to the destination address, the dispatcher determines the next-hop λ-router
for each packet (details on choosing the next hop will be given in Section 3.3), and looks
up either the lower-level or higher-level wavelength routing matrix to determine the
wavelength to be used. As shown in Figure 3.4(b), the gateway continuously forwards
data packets that are stored in the output buffers to the MR-based modulators in parallel,
while the optical signals are multiplexed onto the waveguide with different wavelengths.
Figure 3.5: The positional prefix address for the cores, λ-routers, and gateways
according to their positions in the network hierarchy. The example addresses shown
are source core 9, {0,0,10;01001}; destination 1, core 7, {0,0,00;00111}; and
destination 2, core 6, {0,1,01;00110}.
The main advantage of the proposed gateway design is that multiple optical signals
with different wavelengths can be processed concurrently, and there is no blocking in
each data channel, thereby maximizing the wavelength utilization. Although the signal
conversion and packet buffering will introduce some delay, they have negligible impact
on the communication performance of many-core processors, because: (i) E-O and O-E
conversions can be done at very high speed (less than 100 ps) and with low energy cost
(100 fJ/bit) (Morris, Kodi, Louri, and Whaley, 2014); (ii) a large portion of data traffic
in an ONoC occurs locally due to the application properties or task mapping algorithm
(Guerre, Ventroux, David, and Merigot, 2010). Thus, if the tasks of an application
are assigned to the cores only in one subsystem or in the neighbouring subsystems,
inter-core communication, such as cache coherence messages, will not experience very
frequent E-O and O-E conversions and buffering in the gateways.
3.3 Communication Scheme
In the WRH-ONoC architecture, inter-core communication within a subsystem directly
employs wavelength routing in the level 1 λ-router, while communication between
cores in different subsystems must traverse multiple λ-routers and gateways. The
routing and wavelength allocation for inter-subsystem communication are conducted
in the gateways. In the following, the communication schemes for both unicast and
multicast traffic are presented.
3.3.1 Positional Prefix Address
The foundation of packet routing in the WRH-ONoC architecture is the positional
prefix address, which is related to the position in the network hierarchy. Each core
has a unique address in the form of {networkID; coreID}. The coreID is the
unique identification of each core within the subsystem it belongs to, and it has
$\lceil \log_2 n \rceil$ bits, where n is the maximum number of cores in a subsystem. The
networkID is composed of several subnetwork fields $\{s_l, ..., s_2, s_1\}$, where l is the
number of levels of λ-routers in the network hierarchy. Figure 3.5 illustrates the
address assignment for the network given in Figure 3.3. In this example, the coreID
field has 5 bits, as there are 20 cores in each subsystem ($\lceil \log_2 20 \rceil = 5$). In the
WRH-ONoC architecture, the routing structure is a tree rooted at the top-level
λ-router, and all the cores and gateways in a subtree rooted at one λ-router form a
subnetwork. Let $|s_i|$ be the number of bits in the $s_i$ field, where $0 < i \le l$. $|s_l|$ is
1 bit, as there is exactly one λ-router at the top level. For $i < l$, $|s_i|$ is
$\lceil \log_2 r_i \rceil$, where $r_i$ is the maximum number of level i λ-routers that connect to
a λ-router at level i+1. This ensures that different level i λ-routers that connect to
the same level i+1 λ-router are denoted by different addresses.
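The field widths above can be checked with a short sketch. The helper names (`field_bits`, `encode_address`) are hypothetical, and the widths used for the example (1, 1, and 2 bits for s3, s2, s1) are read off the three-level example of Figure 3.5 rather than stated in the text, so they should be treated as assumptions.

```python
from math import ceil, log2


def field_bits(count: int) -> int:
    """Bits needed to distinguish `count` items; at least 1 bit is kept."""
    return max(1, ceil(log2(count)))


def encode_address(network_fields, core_id, cores_per_subsystem):
    """Format an address as '{s_l,...,s_1;coreID}' with fixed-width binary fields.

    network_fields: list of (value, width_in_bits) pairs ordered from s_l to s_1.
    """
    net = ",".join(format(value, f"0{width}b") for value, width in network_fields)
    core = format(core_id, f"0{field_bits(cores_per_subsystem)}b")
    return "{" + net + ";" + core + "}"


# Assumed widths for the example hierarchy: |s3| = 1, |s2| = 1, |s1| = 2.
source = encode_address([(0, 1), (0, 1), (2, 2)], core_id=9, cores_per_subsystem=20)
print(source)  # {0,0,10;01001}
```

With 20 cores per subsystem the coreID is padded to 5 bits, which reproduces the address of source core 9 shown in Figure 3.5.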
3.3.2 Unicast Communication
Unicast communication in WRH-ONoC can be classified into intra-subsystem unicast
and inter-subsystem unicast. If the networkID fields of the source and destination
addresses are equal, the packet is an intra-subsystem unicast packet; otherwise it is an
inter-subsystem unicast packet. For an intra-subsystem unicast packet, the source core
looks up its local wavelength routing matrix using the coreID of the destination
address to determine the wavelength to be used for the optical communication, and
sends the data packet directly on this wavelength through the level 1 λ-router.
For an inter-subsystem unicast packet, the source core chooses a gateway uniformly
from the set of sibling gateways that connect to the same subsystem, for load
balancing, and sends the packet to that gateway through the connected level 1 λ-router
by using the corresponding wavelength. The packet is then routed through gateways
and delivered to the destination core in a multi-hop manner. The key task of a gateway
is to determine the next-hop λ-router and the carrier wavelength used in the next-hop
transmission. Figure 3.6 shows the general process of routing an inter-subsystem
packet, which is explained in detail in the following.
(I) Upward Transmission: When an inter-subsystem unicast packet is generated
in the source core, it needs to be transmitted upward in the network hierarchy according
to the location of its destination core. Suppose a gateway receives the packet from a
level i λ-router. If the {sl, ..., si+1} fields of the source and destination addresses are
not equal, the destination core must be located outside of the subnetwork that is
rooted at the level i+1 λ-router. Then, the gateway needs to transmit the packet
further upward. For example, in Figure 3.5, a packet generated by the source core 9 in
the subsystem SS2 with networkID {0, 0, 10} is to be routed in the gateway G2. If the
destination core is core 6 in the subsystem SS5 with networkID {0, 1, 01}, the packet
is routed upward to the gateway G3 since the {s3, s2} fields of the source and
destination addresses are unequal.
Figure 3.6: The routing process of an inter-subsystem packet: upward
transmission, turnover, and downward transmission.
(II) Turnover: If the {sl, ..., si+1} fields of the source and destination addresses
are equal, the destination core must be located in the same subnetwork rooted at the
level i+1 λ-router. The packet stops going upward and is routed to a sibling gateway
that connects to another level i λ-router, namely the one encoded with {si}. For
example, in Figure 3.5, suppose the destination core of the packet generated by the
source core 9 is core 7 in SS0 with networkID {0, 0, 00}. Since the {s3, s2} fields of
the source and destination addresses are the same, G2 routes this packet to the sibling
gateway G1, which connects to another λ-router encoded as {s1}={00} at level 1.
(III) Downward Transmission: After the turnover λ-router, the packet is then
routed in the downward direction. The rules for transmitting a downward packet
are: (i) if the gateway connects to a level i+1 λ-router and a level i λ-router where
i>1, the packet is routed to a gateway that connects the level i λ-router with the λ-
router encoded with {si−1} at level i−1; (ii) if the gateway connects to the destination
subsystem, namely at level 1, it transmits the packet directly to the destination core via
the level 1 λ-router according to the coreID of the destination address. For example,
in Figure 3.5, if G4 needs to route the packet to the destination core 6 in SS5, it sends
the packet to gateway G5 which connects to the λ-router at level 1 with {s1}={01}.
Then gateway G5 will route it to the destination core 6 by coreID={00110}.
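The three routing phases can be illustrated with a minimal sketch that compares the networkID fields of the source and destination. The function names and the tuple representation of a networkID are assumptions made for illustration; the sketch only lists the λ-router levels a packet visits and does not model gateway selection or wavelength lookup.

```python
def turnover_level(src_net, dst_net):
    """Return the level of the turnover λ-router for two networkIDs.

    src_net, dst_net: tuples of field strings ordered (s_l, ..., s_1).
    Level 1 means the packet is intra-subsystem (no gateway is involved).
    """
    if len(src_net) != len(dst_net):
        raise ValueError("networkIDs must have the same number of fields")
    levels = len(src_net)
    if src_net == dst_net:
        return 1
    for i in range(1, levels):
        # Fields {s_l, ..., s_(i+1)} are the first (levels - i) entries.
        if src_net[:levels - i] == dst_net[:levels - i]:
            return i + 1
    raise ValueError("top-level fields differ; addresses are not in one hierarchy")


def route(src_net, dst_net):
    """List the λ-router levels visited: upward, turnover, then downward."""
    top = turnover_level(src_net, dst_net)
    upward = list(range(1, top))            # levels 1 .. top-1
    downward = list(range(top - 1, 0, -1))  # levels top-1 .. 1
    return upward + [top] + downward


print(route(("0", "0", "10"), ("0", "0", "00")))  # [1, 2, 1]
print(route(("0", "0", "10"), ("0", "1", "01")))  # [1, 2, 3, 2, 1]
```

The two calls correspond to the examples in the text: core 9 in SS2 to core 7 in SS0 turns over at level 2, while core 9 to core 6 in SS5 turns over at level 3.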
It is worth noting that in the WRH-ONoC architecture, there are multiple sibling
gateways between two λ-routers at adjacent levels. For example, five sibling gateways
are used in Figure 3.3. These sibling gateways are designed to improve the throughput
between λ-routers at different levels and thus prevent bottlenecks. The source core
can send inter-subsystem packets to any sibling gateway on a different wavelength
without blocking. To balance the traffic load, once the next-hop λ-router is
determined according to the address of the destination core, the intermediate gateway
used for the next-hop routing is chosen uniformly from the sibling gateways in the
routing path. Thanks to the wavelength routing in the λ-router, no extra arbitration
is needed in the gateways. Once the next-hop gateway is determined, the current
gateway checks its wavelength matrix to get the right wavelength and forwards the
packets through the connected λ-router.
3.3.3 Multicast Communication
Multicast communication is quite common in many-core processors due to the coopera-
tive computing and cache coherence protocols (Morris, Jolley, and Kodi, 2014; Karkar,
Mak, Tong, and Yakovlev, 2016). In the WRH-ONoC architecture, a subsystem-level
multicast scheme is designed by making use of the multicast capability of λ-router. For
the intra-subsystem multicast communication, the multicast packet is modulated by
using multiple wavelengths according to the destinations’ coreIDs, and separate copies
are sent to each destination core simultaneously through the non-blocking λ-router.
For example, in Figure 3.3, the source core CA can multicast packets to core CB and
core CC at the same time via two different wavelengths in the λ-router.
Inter-subsystem multicast communication is implemented through the combination
of inter-subsystem unicasting and intra-subsystem multicasting. Specifically, for each
inter-subsystem multicast packet, the source core unicasts a separate packet copy to
every subsystem that contains at least one destination core, in the same way as shown
in Figure 3.6, by using the networkID of the destination subsystem. Once the packet
copy arrives at the gateway that connects to the destination subsystem, the gateway
uses intra-subsystem multicasting to send the packet copy to all the destination cores
in that subsystem through different wavelengths.
To distinguish multicast and unicast packets, the packet header contains a
multi flag field (1 bit), i.e., '1' for a multicast packet and '0' for a unicast packet.
The multicast address is expressed in the form of {networkID; bit-string}, where
{networkID} is used for routing the packet copy to the destination subsystem, and
{bit-string} labels all the destination cores within the subsystem. {bit-string} contains
n bits, where n is the number of cores in each subsystem, and the ith bit is set to
'1' only if the ith core is a destination.
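As an illustration of the {networkID; bit-string} format, the following sketch splits a multicast destination set into one packet copy per destination subsystem. The function name, the string form of the networkID, and the convention that cores are indexed from 0 with bit i counted from the left are all assumptions of this sketch, not details given in the text.

```python
from collections import defaultdict


def split_multicast(destinations, cores_per_subsystem):
    """Group destinations by subsystem and build one bit-string per packet copy.

    destinations: iterable of (networkID, core_index) pairs, with core_index
    counted from 0 inside its subsystem (an indexing convention assumed here).
    Returns {networkID: bit_string}, where character i of the bit-string is '1'
    only if core i of that subsystem is a destination.
    """
    per_subsystem = defaultdict(set)
    for network_id, core_index in destinations:
        if not 0 <= core_index < cores_per_subsystem:
            raise ValueError(f"core index {core_index} out of range")
        per_subsystem[network_id].add(core_index)
    return {
        network_id: "".join(
            "1" if i in cores else "0" for i in range(cores_per_subsystem)
        )
        for network_id, cores in per_subsystem.items()
    }


# Toy example with 4 cores per subsystem and three destinations in two subsystems.
copies = split_multicast(
    [("0,0,00", 0), ("0,0,00", 3), ("0,1,01", 2)], cores_per_subsystem=4
)
for network_id, bits in copies.items():
    print(network_id, bits)
```

The number of entries in the result is the number of unicast copies the source core must send, which is bounded by the number of subsystems.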
Compared with the traditional tree-based multicast routing schemes (Samman,
Hollstein, and Glesner, 2010), the proposed multicast scheme is simple yet efficient.
In tree-based multicast schemes, multiple packet copies are dynamically
generated in the routing nodes based on the distribution of the destination cores.
However, this requires each multicast packet to carry the address information of all the
destination cores, which increases not only the complexity of the routing algorithm but
also the communication overhead. In the proposed multicast scheme for WRH-ONoC,
each packet copy, when it is generated in the source core, only needs to carry the
addresses of the destination cores in the one subsystem to which it is destined, and no
special routing policy is required for multicast packets in the intermediate gateways.
Moreover, in the worst case (e.g., broadcast communication), the number of copies that
need to be sent via inter-subsystem unicasting for each multicast packet is no larger
than the total number of subsystems.
3.3.4 Wavelength-Level Flow Control
Since there is no cyclic dependency among routing paths in WRH-ONoC and every
packet can be delivered to its destination core in a limited number of hops, inter-core
communication in WRH-ONoC is free of deadlock and livelock. Moreover, since the
buffer size of each input is strictly limited (Ruadulescu, Goossens, Micheli, Murali, and
Coenen, 2006), a wavelength-level credit-based flow control scheme is used in the
gateways to prevent buffer overflow. When the buffer queue for a specific wavelength is
full, the gateway postpones any request to use this wavelength until there is a new
vacancy for the incoming packet.
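A minimal sketch of wavelength-level credit-based flow control is given below. The class and method names are hypothetical, and the sketch assumes one credit per free buffer slot on each wavelength; the text does not specify how credits are signalled back, so that part is reduced to an explicit `release` call.

```python
class WavelengthCredits:
    """Per-wavelength credit counters guarding fixed-size input queues."""

    def __init__(self, wavelengths, queue_depth):
        # One credit per free buffer slot on each wavelength.
        self.queue_depth = queue_depth
        self.credits = {w: queue_depth for w in wavelengths}

    def try_send(self, wavelength):
        """Consume a credit if a slot is free; otherwise the request is postponed."""
        if self.credits[wavelength] == 0:
            return False
        self.credits[wavelength] -= 1
        return True

    def release(self, wavelength):
        """Return a credit when the downstream queue dequeues a packet."""
        if self.credits[wavelength] < self.queue_depth:
            self.credits[wavelength] += 1


fc = WavelengthCredits(wavelengths=["w1", "w2"], queue_depth=2)
print([fc.try_send("w1") for _ in range(3)])  # [True, True, False]
fc.release("w1")
print(fc.try_send("w1"))  # True
```

The third request on `w1` is refused because its queue is full, while requests on `w2` are unaffected, which is the per-wavelength behaviour described above.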
3.4 Theoretical Modelling and Analysis
To evaluate the communication performance and hardware cost of WRH-ONoC, the
minimal hardware requirement is analysed for a WRH-ONoC architecture
interconnecting N cores with Wmax wavelengths and g sibling gateways, and the expected
end-to-end communication delay and energy consumption are derived. Considering
that a many-core processor is a general-purpose platform for different applications,
the communication pattern is assumed to follow the widely used Uniform-Random
distribution in the theoretical analysis, namely a Uniform distribution in space and a
Poisson distribution in time. Under this assumption, several important performance
metrics can be derived in the theoretical model (Pande, Grecu, Jones, Ivanov, and
Saleh, 2005). In the following analysis, all the subsystems in WRH-ONoC have the same
number of cores. Moreover, because multicast routing relies on unicasting at the
inter-subsystem level, only the performance of unicast communication is analysed.
3.4.1 Hardware Requirements
The main components in the proposed WRH-ONoC architecture include the network
interfaces for cores, λ-routers, and gateways. The hardware requirements of these
components are analysed separately.
Network Interface
In the WRH-ONoC architecture, each core connects with the network hierarchy through
a network interface (NI), which conducts E-O and O-E conversions. The network
interface consists of a transmitter and a receiver. Since each core can directly
communicate with up to Wmax−1 other cores or gateways in the same subsystem, the
transmitter must have Wmax−1 MR-based E-O converters, each having a narrow-band MR
to modulate the optical signal with a specific wavelength. Similarly, the receiver needs
Wmax−1 MR-based O-E converters, each having an MR filter and a photodetector
to receive the optical signal with a specific wavelength. Therefore, at minimum,
2N(Wmax−1) MRs are required in the network interfaces for a WRH-ONoC with N cores
and Wmax wavelengths, when all the λ-routers in level 1 are fully connected.
λ-Router
To interconnect N cores by using Wmax wavelengths in WRH-ONoC, each λ-router at
level i is connected to a next-level λ-router through g sibling gateways. Assume that
a network hierarchy with L levels of λ-routers is required to interconnect all the cores
and gateways. The following theorem gives the minimum number of λ-routers required
at each level i, 1 ≤ i ≤ L.
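Theorem 1. The minimum number of λ-routers required at level i, denoted by $R_i$, can
be calculated as
\[
R_i =
\begin{cases}
\left\lceil \dfrac{N}{W_{max}-g} \right\rceil, & i = 1; \\[2ex]
\left\lceil \dfrac{g R_{i-1}}{W_{max}-g} \right\rceil, & i \in (1, L); \\[2ex]
1, & i = L.
\end{cases}
\tag{3.1}
\]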
Proof: Since each λ-router can directly connect to $W_{max}-g$ cores and g gateways using
$W_{max}$ wavelengths, the minimum number of λ-routers required at level 1 is
$\lceil N/(W_{max}-g) \rceil$. The λ-routers at level i need to connect to $R_{i-1}$ λ-routers at
level i−1 via $gR_{i-1}$ gateways. If $gR_{i-1} \le W_{max}$, level i is the top level L and only
one λ-router is required, $R_L = 1$; otherwise more than one λ-router is required at
level i, and each λ-router at level i also needs to connect to one λ-router at the higher
level via g sibling gateways. Thus, $gR_{i-1} + gR_i \le R_i W_{max}$, which gives
$R_i \ge gR_{i-1}/(W_{max}-g)$. Hence, the minimum number of λ-routers required at level i,
where $1 < i < L$, is $R_i = \lceil gR_{i-1}/(W_{max}-g) \rceil$.
Let Rsum and Gsum be the minimum number of λ-routers and gateways required to
connect N cores using Wmax wavelengths in WRH-ONoC, respectively. Then L, Rsum
and Gsum can be computed using the following algorithm.
Algorithm 1: COMPUTE(L, Rsum, Gsum)
Input: N, Wmax, g;
Output: L, Rsum, Gsum;
1  R_1 ← ⌈N/(Wmax−g)⌉; i ← 2;
2  Rsum ← R_1;
3  while g·R_{i−1} > Wmax do
4      R_i ← ⌈g·R_{i−1}/(Wmax−g)⌉;
5      Rsum ← Rsum + R_i;
6      i ← i+1;
7  L ← i; R_L ← 1;
8  Rsum ← Rsum + R_L;
9  Gsum ← g(Rsum − 1);
10 return L, Rsum, Gsum;
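A runnable sketch of Algorithm 1 is given below. The function name and the two input checks are additions made here: the check inside the loop stops the computation when the router count fails to shrink from one level to the next, a case the pseudocode does not address. The example values (160 cores, 25 wavelengths, 5 sibling gateways) are assumed to match the eight-subsystem example of Figures 3.3 and 3.5.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for positive integers."""
    return -(-a // b)


def compute_hierarchy(n_cores: int, w_max: int, g: int):
    """Return (L, R_sum, G_sum) following the steps of Algorithm 1."""
    if not 0 < g < w_max:
        raise ValueError("need 0 < g < w_max")
    routers = [ceil_div(n_cores, w_max - g)]           # R_1
    while g * routers[-1] > w_max:                     # one top router is not enough
        nxt = ceil_div(g * routers[-1], w_max - g)     # R_i
        if nxt >= routers[-1]:
            # Guard added in this sketch: without it the loop would not terminate.
            raise ValueError("router count does not shrink; g is too large for w_max")
        routers.append(nxt)
    routers.append(1)                                  # R_L = 1 at the top level
    r_sum = sum(routers)
    g_sum = g * (r_sum - 1)                            # g gateways above every non-top router
    return len(routers), r_sum, g_sum


print(compute_hierarchy(160, 25, 5))  # (3, 11, 50)
```

For this configuration the levels hold 8, 2, and 1 λ-routers, giving three levels, 11 λ-routers, and 50 gateways.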
Since each λ-router, except for the one at the top level, is connected to a higher-level
λ-router via g sibling gateways, $G_{sum} = g(R_{sum}-1)$ (line 9 in Algorithm 1). Since
each m×m λ-router needs at least $\lceil m(m-2)/2 \rceil$ OSEs and each OSE has two MRs,
each m×m λ-router needs m(m−2) MRs. Moreover, in the network architecture there
is no communication between the g sibling gateways that connect to the same
two λ-routers. Thus, g(g−1) MRs can be removed from each of the two connected
λ-routers. Let $N^{i}_{rm}$ represent the number of MRs used by a level i λ-router. If i = 1,
each level 1 λ-router is connected to a level 2 λ-router via g sibling gateways. Thus
$N^{1}_{rm} \le W_{max}(W_{max}-2) - g(g-1)$. If i > 1, each level i λ-router is connected to at most
$\lfloor W_{max}/g \rfloor$ λ-routers in the other levels. Hence,
$N^{i}_{rm} \le W_{max}(W_{max}-2) - \lfloor W_{max}/g \rfloor\, g(g-1)$.
It can be seen that the minimal number of required MRs depends on the detailed
interconnections of λ-routers, and can be computed in the same way as in Algorithm 1.
Gateway
According to the gateway structure shown in Figure 3.4, each gateway has two
separate data paths for routing upward and downward traffic. In each direction, the
gateway can receive optical signals from one previous-hop λ-router with up to Wmax−g
wavelengths, and send optical signals to one next-hop λ-router with up to Wmax−g
wavelengths. Thus, each data path should have up to Wmax−g pairs of MR-based E-O
and O-E converters, Wmax−g input queues, and Wmax−g output buffers. Each path should
also have up to Wmax−g parallel wavelength dispatchers, and a Wmax×Wmax crossbar
that fully connects the input queues and the output buffers.
3.4.2 Communication Delay
To evaluate the performance of WRH-ONoC, the communication delay is defined as the
time interval for end-to-end packet transmission, namely a packet traversing through
the hierarchical network from the source core to its destination core. In the following,
the communication delay is modelled for the unicast traffic. Depending on the location
of the destination cores, different packets may go through different numbers of hops
and thus experience different communication delays. Let D denote the average
communication delay. It can be calculated as
\[
D = \alpha D_{intra} + (1-\alpha) D_{inter}, \tag{3.2}
\]
where $D_{intra}$ and $D_{inter}$ represent the average delivery delays for the intra-subsystem
and inter-subsystem traffic, respectively, and α is the proportion of intra-subsystem
traffic.
For intra-subsystem communication, each packet traverses only one specific
λ-router in level 1. Due to the high-speed optical transmission and non-blocking
wavelength-based routing, every packet experiences the same delay when passing
through the λ-router, whichever input and output it uses. Thus, the delay for an
intra-subsystem packet is constant and can be computed as:
\[
D_{intra} = D_{EO} + D_{\lambda R} + D_{OE}, \tag{3.3}
\]
where $D_{EO}$ and $D_{OE}$ are the signal conversion delays, and $D_{\lambda R}$ is the packet
propagation delay over one hop of λ-router.
For inter-subsystem communication, each packet needs to pass through multiple
λ-routers and gateways. The average communication delay $D_{inter}$ can be computed as:
\[
D_{inter} = D_{EO} + \underbrace{N_{hop} \times D_{\lambda R}}_{\text{I}}
          + \underbrace{(N_{hop}-1) \times D_{GW}}_{\text{II}} + D_Q + D_{OE}, \tag{3.4}
\]
where $N_{hop}$ is the expected number of hops that an inter-subsystem packet needs to
traverse. Part I of Eq. (3.4) is the expected accumulated delay for the packet to
traverse all the λ-routers in the routing path. Part II of Eq. (3.4) is the accumulated
delay incurred by packet processing in gateways excluding the queueing delay, where
$D_{GW}$ is the packet transmission delay in a gateway, which can be modelled as:
\[
D_{GW} = D_{EO} + 2D_{buf} + D_{xbar} + D_{wl} + D_{OE}, \tag{3.5}
\]
where $D_{buf}$ is the delay for reading and writing the sliced buffer, $D_{xbar}$ is the delay
for copying a packet from the input to the output through the internal crossbar, and
$D_{wl}$ is the delay incurred by looking up the wavelength routing matrix to get the right
wavelength for the next-hop λ-router. $D_Q$ in Eq. (3.4) is the expected accumulated
queueing delay in all the gateways in the routing path. $D_Q$ is the only uncertain part
of the packet delay incurred by the gateways, and it is modelled separately so that
Part II only depends on the average number of hops $N_{hop}$.
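Equations (3.2) to (3.5) can be combined into a small calculator, sketched below. The function names are hypothetical and the numeric values are placeholders in clock cycles chosen only to exercise the formulas; they are not measurements from the thesis.

```python
def gateway_delay(d_eo, d_buf, d_xbar, d_wl, d_oe):
    """Eq. (3.5): per-gateway processing delay, excluding queueing."""
    return d_eo + 2 * d_buf + d_xbar + d_wl + d_oe


def intra_delay(d_eo, d_router, d_oe):
    """Eq. (3.3): one λ-router hop between two conversions."""
    return d_eo + d_router + d_oe


def inter_delay(n_hop, d_eo, d_router, d_gw, d_queue, d_oe):
    """Eq. (3.4): n_hop router hops, n_hop - 1 gateways, plus queueing."""
    return d_eo + n_hop * d_router + (n_hop - 1) * d_gw + d_queue + d_oe


def average_delay(alpha, d_intra, d_inter):
    """Eq. (3.2): traffic-weighted mean of the two cases."""
    return alpha * d_intra + (1 - alpha) * d_inter


# Placeholder values (clock cycles), for illustration only.
d_gw = gateway_delay(d_eo=1, d_buf=1, d_xbar=1, d_wl=1, d_oe=1)
d_in = intra_delay(d_eo=1, d_router=2, d_oe=1)
d_out = inter_delay(n_hop=3, d_eo=1, d_router=2, d_gw=d_gw, d_queue=5, d_oe=1)
print(d_gw, d_in, d_out)                                   # 6 4 25
print(average_delay(alpha=0.25, d_intra=d_in, d_inter=d_out))  # 19.75
```

Keeping the queueing term as a separate argument mirrors the model, where $D_Q$ is derived independently of the fixed per-hop costs.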
It can be seen from Eq. (3.4) that $D_{inter}$ is a function of the average number of hops
$N_{hop}$ and the queueing delay $D_Q$, which depend on the network structure and traffic
patterns. In the following, the delay $D_{inter}$ is modelled for WRH-ONoC with
L levels of λ-routers and g sibling gateways under the Uniform-Poisson traffic pattern.
Assume each core generates unicast packets following the Poisson distribution with
the same traffic rate θ packet/cycle, and each core sends packets to other cores with
the same probability. Hence, the traffic rate from any core i to another core j is
$\theta/(N-1)$ packet/cycle, and the proportion of intra-subsystem traffic α is
$(W_{max}-g-1)/(N-1)$.
Lemma 1. The probability that an inter-subsystem packet traverses 2i−1 hops of
λ-routers is
\[
P(2i-1) = \left( \frac{1}{R_i} - \frac{1}{R_{i-1}} \right) \times \frac{N}{N-1},
\]
where i = 2, ..., L. The expected number of hops that an inter-subsystem packet
traverses is $N_{hop} = \sum_{i=2}^{L} (2i-1)\, P(2i-1)$.
Proof: As described in Section 3.3.2, every inter-subsystem packet experiences
two transmission periods, upward and downward, and there is always a turnover
λ-router in which the transmission changes from the upward direction to the downward
direction. For example, in Figure 3.3, L03 is the turnover λ-router for the
communication from the source core CD to the destination core CE. Suppose that the
turnover λ-router for an inter-subsystem communication is located at level i. Then the
number of hops that this communication traverses must be 2i−1. The probability
that an inter-subsystem packet traverses 2i−1 hops therefore equals the probability that
the turnover λ-router for this packet is located at level i. If a level i λ-router is the
turnover router for a packet, the last λ-router from which the packet was received and
the next λ-router to which the packet will be forwarded must be two different routers
at level i−1. Let $P_h(i)$ represent the probability that an inter-subsystem packet passes
a λ-router at level i, and $P_h(\cap|i)$ denote the probability that the packet changes its
direction from upward to downward in a level i λ-router given that the packet passes
through a level i λ-router. Then, $P(2i-1) = P_h(\cap|i) \times P_h(i)$.
In the WRH-ONoC interconnection, each level 1 λ-router aggregates the upward
traffic from $n = W_{max}-g$ cores, and segregates the downward traffic to n cores. Each
level 2 λ-router aggregates/segregates traffic from/to $\frac{R_1}{R_2} \times n = \frac{N}{R_2}$ cores. Each
level 3 λ-router aggregates/segregates traffic from/to
$\frac{R_2}{R_3} \times \frac{R_1}{R_2} \times n = \frac{N}{R_3}$ cores. Similarly, each level i λ-router
aggregates/segregates traffic from/to $\frac{N}{R_i}$ cores. If a packet passes a level
i λ-router, the destination core must not be any core from which the level i λ-router
can aggregate its upward traffic. Since the packet generated by any source core has the
same probability to be sent to all other cores,
$P_h(i) = \left(N - \frac{N}{R_i}\right) \times \frac{1}{N-1}$. Any packet that passes through a level i
λ-router can either keep going upward or change its direction from upward to
downward, depending on the destination address of the packet. Hence,
$P_h(\cap|i) = \left(\frac{N}{R_i} - \frac{N}{R_{i-1}}\right) \times \frac{1}{N - N/R_i}$. Thus,
\[
P(2i-1) = P_h(\cap|i) \times P_h(i) = \left( \frac{1}{R_i} - \frac{1}{R_{i-1}} \right) \times \frac{N}{N-1},
\]
and the expected number of hops is $N_{hop} = \sum_{i=2}^{L} (2i-1)\, P(2i-1)$.
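Lemma 1 can be evaluated numerically with the sketch below. The function names are hypothetical, and the router counts [8, 2, 1] with N = 160 are an assumed example (Wmax = 25, g = 5, 20 cores per subsystem) rather than a configuration evaluated in the text.

```python
def hop_distribution(n_cores, routers):
    """Return {hops: probability} with hops = 2i - 1 for turnover level i = 2..L.

    routers: [R_1, ..., R_L] with R_L = 1, so R_i is routers[i - 1].
    """
    scale = n_cores / (n_cores - 1)
    return {
        2 * i - 1: (1 / routers[i - 1] - 1 / routers[i - 2]) * scale
        for i in range(2, len(routers) + 1)
    }


def expected_hops(n_cores, routers):
    """N_hop = sum over turnover levels of (2i - 1) * P(2i - 1)."""
    return sum(h * p for h, p in hop_distribution(n_cores, routers).items())


dist = hop_distribution(160, [8, 2, 1])
print(dist)
print(expected_hops(160, [8, 2, 1]))
```

Because P(2i−1) as defined is taken over all generated packets, the probabilities in this example sum to 140/159, which equals 1 − α for a balanced tree with 20 cores per subsystem, not to 1.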
In the gateway design, each input buffer slice is a separate FIFO queue. To derive
the maximum traffic injection rate and the relationship between the gateway buffer
size and the traffic rate, the following analysis first assumes infinite input buffers.
The queueing delay with a limited buffer size (with back pressure) is a complicated
queueing network problem for which an explicit result cannot be obtained. Thus, the
communication delay with limited buffers is analysed directly in the simulations. Under
the infinite-buffer assumption, each data path in the gateway is an independent M/M/1
queueing system, which can be solved with the existing queueing model. It is also
assumed that the routing structure illustrated in Figure 3.5 is a balanced and complete
tree, so that in each level all the λ-routers connect to the same number of cores or
gateways and every input/output of the λ-routers is used. The following gives the
expected end-to-end queueing delay for an inter-subsystem packet. If the routing tree is
imbalanced or incomplete, i.e., the network is not fully utilized, Theorem 2 gives an
upper bound of $D_Q$.
Theorem 2. The average accumulated packet queueing delay can be calculated as:
\[
D_Q = \sum_{i=2}^{L} \left( P(2i-1) \times \sum_{j=1}^{i-1} \frac{2\theta_j}{\mu_j(\mu_j - \theta_j)} \right), \tag{3.6}
\]
where the average data rate
$\theta_j = \frac{N^2}{N-1} \times \frac{R_{j-1}-1}{R_{j-1}^2} \times \frac{\theta}{g(W_{max}-g)}$ packet/cycle and the
packet service rate $\mu_j = \frac{s_p}{t_d}$ packet/cycle. $s_p$ is the average packet size and
$t_d = 2D_{buf} + D_{xbar} + D_{wl} + D_{EO}$ is the packet dispatch delay.
Proof: For each inter-subsystem packet that traverses 2i−1 hops of λ-routers, the
packet needs to be routed by i−1 gateways in each of the upward and downward
directions. Let $T_i$ be the average queueing delay in the $i$-th hop gateway. According to
Lemma 1, the probability that a packet passes 2i−1 hops of λ-routers is $P(2i-1)$. Then,
the expected accumulated queueing delay $D_Q$ can be computed by
\[
D_Q = \sum_{i=2}^{L} \left( P(2i-1) \times \sum_{j=1}^{2i-2} T_j \right). \tag{3.7}
\]
Since the upward and downward traffic are processed separately by using two
independent data paths in gateways, the average queueing delay $T_i$ in the $i$-th hop
gateway can be analysed separately for the upward and downward directions.
(I) $T_i$ in the Upward Direction: Since each core generates unicast traffic following
the Poisson distribution in time and the Uniform distribution in space with the same
traffic rate θ packet/cycle, the traffic injection rate in a gateway that bridges a level
j−1 λ-router and a level j λ-router can be computed as follows. The traffic rate from a
level j−1 λ-router to a level j λ-router is
\[
\theta \times \frac{N}{R_{j-1}} \times \frac{N - \frac{N}{R_{j-1}}}{N-1}
= \theta \times \frac{N^2}{N-1} \times \frac{R_{j-1}-1}{R_{j-1}^2}
\]
packet/cycle. For example, consider the λ-router at level 3 and the left level 2 λ-router
in Figure 3.5. $\frac{N}{R_{j-1}} = \frac{N}{R_2}$ is the number of cores whose upward traffic can pass
the left level 2 λ-router. Since only the packets with destination cores outside of the
subnetwork rooted at the left level 2 λ-router can be routed upward to the level 3
λ-router, $\frac{N - N/R_{j-1}}{N-1} = \frac{N - N/R_2}{N-1}$ is the proportion of the packets that are
routed from the left level 2 λ-router to the level 3 λ-router. Since the packets from a
level j−1 λ-router to a level j λ-router are evenly routed via the g sibling gateways that
connect these two λ-routers, the traffic injection rate in each gateway that connects a
level j−1 λ-router to a level j λ-router is
$\frac{N^2}{N-1} \times \frac{R_{j-1}-1}{R_{j-1}^2} \times \frac{\theta}{g}$ packet/cycle. According to the gateway
design, each gateway needs $W_{max}-g$ input queues to buffer the packets received from
the level j−1 λ-router using $W_{max}-g$ different wavelengths, since there is no
communication among the g sibling gateways. Thus, the upward traffic injected into each
gateway can be dispatched to the $W_{max}-g$ input queues following the Uniform
distribution. Let $\theta_j$ denote the traffic injection rate at one input queue in a gateway
that connects a level j−1 λ-router to a level j λ-router. Then
\[
\theta_j = \frac{N^2}{N-1} \times \frac{R_{j-1}-1}{R_{j-1}^2} \times \frac{\theta}{g(W_{max}-g)}
\]
packet/cycle.
In each gateway, the input queues are fully connected with the output buffers, and
each input queue has an independent packet dispatcher. Let $t_d$ represent the packet
dispatch delay, which is defined as the average time interval between two adjacent
packets that are sent out from the same output buffer. Then
$t_d = 2D_{buf} + D_{xbar} + D_{wl} + D_{EO}$. Each input queue can be modelled as a FIFO
queueing system with the packet injection rate of $\theta_j$ packet/cycle and packet service
rate of $\mu_j = s_p/t_d$ packet/cycle, where $s_p$ is the average packet size. Each input queue
is subject to the Birth-Death process (Bhat, 2008) according to queueing theory.
Assume the probability that q packets stay in the queueing system is $P_j(q)$ with an
initial state of $P_j(0)$; then $P_j(q) = P_j(0) \times (\theta_j/\mu_j)^q$ for $q \ge 0$. To achieve a stable
queueing system, $\theta_j/\mu_j < 1$ must hold, and thus the average queue length in the stable
state, denoted by $Q_j$, is
\[
Q_j = \sum_{q=2}^{\infty} (q-1)\, P_j(q)
    = \sum_{q=2}^{\infty} (q-1) \left(1 - \frac{\theta_j}{\mu_j}\right) \left(\frac{\theta_j}{\mu_j}\right)^{q}
    = \frac{\theta_j^2}{\mu_j(\mu_j - \theta_j)}.
\]
According to Little's Law, the average queue delay of each input queue in the stable
state is $T_j = \frac{Q_j}{\theta_j} = \frac{\theta_j}{\mu_j(\mu_j - \theta_j)}$.
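The closed form for the queue length can be checked against a truncated version of the series, as sketched below. The function names and the sample rates (θ = 0.3, μ = 0.5) are illustrative assumptions.

```python
def mm1_queue_length(theta, mu):
    """Closed form: mean number of waiting packets, theta^2 / (mu * (mu - theta))."""
    if not 0 <= theta < mu:
        raise ValueError("a stable queue requires 0 <= theta < mu")
    return theta ** 2 / (mu * (mu - theta))


def mm1_queue_delay(theta, mu):
    """Little's Law: T = Q / theta = theta / (mu * (mu - theta))."""
    if not 0 <= theta < mu:
        raise ValueError("a stable queue requires 0 <= theta < mu")
    return theta / (mu * (mu - theta))


def series_queue_length(theta, mu, terms=2000):
    """Truncated sum of (q - 1) * P(q) with P(q) = (1 - rho) * rho**q, q >= 2."""
    rho = theta / mu
    return sum((q - 1) * (1 - rho) * rho ** q for q in range(2, terms))


theta, mu = 0.3, 0.5
print(mm1_queue_length(theta, mu), series_queue_length(theta, mu))
print(mm1_queue_delay(theta, mu))
```

With these rates the utilisation is 0.6, and both the closed form and the truncated series give a mean queue length of about 0.9 packets, with a mean queueing delay of about 3 cycles.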
(II) $T_i$ in the Downward Direction: Similar to the upward direction, the traffic
rate from a level j λ-router to a level j−1 λ-router is
\[
\theta \times \left(N - \frac{N}{R_{j-1}}\right) \times \frac{\frac{N}{R_{j-1}}}{N-1}
= \theta \times \frac{N^2}{N-1} \times \frac{R_{j-1}-1}{R_{j-1}^2}
\]
packet/cycle, which is equal to the traffic rate from a level j−1 λ-router to a level j
λ-router because the downward process is symmetrical to the upward process. Since
the routing of downward packets is similar to that of upward packets through another
independent data path, $T_k = T_{2i-k-1}$, $k \in [1, i-1]$. Hence,
\[
D_Q = \sum_{i=2}^{L} \left( P(2i-1) \times \sum_{j=1}^{i-1} 2T_j \right)
    = \sum_{i=2}^{L} \left( P(2i-1) \times \sum_{j=1}^{i-1} \frac{2\theta_j}{\mu_j(\mu_j - \theta_j)} \right),
\]
where $\theta_j = \frac{N^2}{N-1} \times \frac{R_{j-1}-1}{R_{j-1}^2} \times \frac{\theta}{g(W_{max}-g)}$
and $\mu_j = s_p/t_d$.
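The accumulated queueing delay can be evaluated with the sketch below. One indexing assumption is made and should be read with care: each gateway is identified by the number of λ-routers at the lower of the two levels it bridges, so a packet turning over at level i crosses the gateways above levels 1 to i−1 once in each direction. The function names and example values are hypothetical.

```python
def queue_rate(n_cores, r_lower, theta, g, w_max):
    """Arrival rate at one input queue of a gateway above a level with r_lower routers."""
    return (
        (n_cores ** 2 / (n_cores - 1))
        * ((r_lower - 1) / r_lower ** 2)
        * theta / (g * (w_max - g))
    )


def accumulated_queue_delay(n_cores, routers, theta, g, w_max, mu):
    """D_Q: sum over turnover levels of P(2i - 1) times twice the one-way queueing delay.

    routers: [R_1, ..., R_L] with R_L = 1. Gateways are indexed by the lower
    level they attach to (an assumption of this sketch).
    """
    scale = n_cores / (n_cores - 1)
    total = 0.0
    for i in range(2, len(routers) + 1):
        p = (1 / routers[i - 1] - 1 / routers[i - 2]) * scale  # P(2i - 1)
        one_way = 0.0
        for r_lower in routers[: i - 1]:                        # levels 1 .. i-1
            rate = queue_rate(n_cores, r_lower, theta, g, w_max)
            if rate >= mu:
                raise ValueError("queue is saturated (theta_j >= mu)")
            one_way += rate / (mu * (mu - rate))                # M/M/1 queueing delay
        total += p * 2 * one_way
    return total


print(accumulated_queue_delay(100, [5, 1], theta=0.1, g=5, w_max=25, mu=0.5))
```

For a two-level network the result reduces to a single term, P(3) multiplied by twice the queueing delay of the gateways between level 1 and level 2.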
Corollary 1. If WRH-ONoC uses the minimal number of λ-routers and in each
λ-router all the available wavelengths are fully utilized, the maximal average data
rate that can be achieved before network saturation is
\[
\theta_{max} = \frac{s_p (N-1) W_{max}^2}{t_d N^2}
\]
packet/cycle, which only relates to the number of cores N, the number of available
wavelengths $W_{max}$, and the packet service rate in each gateway $\frac{s_p}{t_d}$.
Proof: To guarantee the network stability, each input queue in a gateway between
level j−1 and level j λ-routers should satisfy $\frac{\theta_j}{\mu_j} < 1$, i.e.,
\[
\frac{N^2}{N-1} \times \frac{R_{j-1}-1}{R_{j-1}^2} \times \frac{\theta}{g(W_{max}-g)} < \frac{s_p}{t_d}.
\]
Then,
\[
\theta < \min_{j \in [2, L]} \frac{s_p\, g (W_{max}-g)}{t_d} \times \frac{(N-1) R_{j-1}^2}{N^2 (R_{j-1}-1)}.
\]
When $1 \le R_j < R_{j-1}$, it always has $\frac{R_j^2}{R_j - 1} < \frac{R_{j-1}^2}{R_{j-1}-1}$ for
$\forall j \in [1, L]$. Thus,
\[
\theta_{max} = \frac{s_p\, g (W_{max}-g)}{t_d} \times \frac{(N-1) R_{L-1}^2}{N^2 (R_{L-1}-1)}
\]
is the maximal average data rate before network saturation, obtained when j = L. If
$\theta > \theta_{max}$, the queueing delay will keep increasing. If the WRH-ONoC architecture is
constructed by using the minimal number of λ-routers and in each λ-router all the
available wavelengths are fully utilized, the level L λ-router interconnects with
$R_{L-1} = \frac{W_{max}}{g}$ level (L−1) λ-routers. Hence,
\[
\theta_{max} = \frac{s_p\, g (W_{max}-g)}{t_d} \times \frac{N-1}{N^2} \times \left(\frac{W_{max}}{g}\right)^2 \times \frac{g}{W_{max}-g}
             = \frac{s_p (N-1) W_{max}^2}{t_d N^2}
\]
in this network configuration, so it relates to the number of cores N, the number of
available wavelengths $W_{max}$, and the packet service rate $\frac{s_p}{t_d}$.
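The two expressions for the saturation rate can be compared numerically, as in the sketch below. The function names are hypothetical, `mu` stands for the service rate $s_p/t_d$, and the configuration (N = 100, Wmax = 25, g = 5, five level 1 λ-routers) is an assumed example in which the top-level λ-router is fully utilized.

```python
def saturation_rate(n_cores, routers, g, w_max, mu):
    """Largest theta keeping every gateway input queue stable (theta_j < mu).

    routers: [R_1, ..., R_L] with R_L = 1. Levels with a single router are
    skipped because their upward rate in the model is zero.
    """
    bounds = [
        mu * g * (w_max - g) * (n_cores - 1) * r ** 2 / (n_cores ** 2 * (r - 1))
        for r in routers[:-1]
        if r > 1
    ]
    if not bounds:
        raise ValueError("no inter-subsystem gateways in this configuration")
    return min(bounds)


def saturation_rate_full(n_cores, w_max, mu):
    """Closed form of Corollary 1, valid when R_(L-1) = w_max / g."""
    return mu * (n_cores - 1) * w_max ** 2 / n_cores ** 2


print(saturation_rate(100, [5, 1], g=5, w_max=25, mu=1.0))
print(saturation_rate_full(100, w_max=25, mu=1.0))
```

Both calls give 6.1875 times the service rate here; for a configuration whose top-level router is not fully utilized, such as [8, 2, 1] with N = 160, only the general bound applies.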
From Corollary 1, it can be seen that θmax depends on the maximum allowable
traffic rate at the topmost level, since the traffic is aggregated there while the number
of alternative paths remains g sibling gateways. One possible solution to further
increase θmax is to use a fat-tree topology with more λ-routers/gateways at higher
levels. However, this will not only increase the hardware cost and complexity, but
also increase the number of levels required in the network hierarchy, as each λ-router
can only connect to a limited number of λ-routers/gateways, thereby leading to higher
inter-subsystem packet delay.
3.4.3 Energy Consumption
Similar to the analysis of inter-core communication delay, the energy consumption is
also analysed separately for the intra-subsystem and inter-subsystem traffic. Let E
represent the average energy consumption for transmitting a packet in WRH-ONoC.
It can be computed as:
\[
E = \alpha E_{intra} + (1-\alpha) E_{inter}, \tag{3.8}
\]
where $E_{intra}$ and $E_{inter}$ stand for the average energy consumption for intra-subsystem
and inter-subsystem traffic, and α is the proportion of intra-subsystem traffic.
According to the communication process in WRH-ONoC, the average energy consumption
for intra-subsystem traffic $E_{intra}$ is composed of the energy consumption for
electronic-to-optical conversion (modulation) $E_{EO}$, optical-to-electronic conversion
(photodetection) $E_{OE}$, and the light source $E_{ls}$, as shown in Eq. (3.9).
\[
E_{intra} = E_{EO} + E_{OE} + E_{ls}. \tag{3.9}
\]
With a given average data rate, EEO and EOE are constant for specific O-E and E-
O converters. Meanwhile, the light source should provide sufficient optical power to
ensure the photodetector can correctly decode the optical signals. The optical power
of light source Els is determined by the light source efficiency η, the insertion loss
of waveguides and λ-routers IL, and the photodetector sensitivity Pd (Morris, Kodi,
Louri, and Whaley, 2014). Since a separate light source is provided for each core and
gateway with the given clock frequency of fclk, Els can be calculated as:
$E_{ls} = \frac{1}{\eta f_{clk}} \times P_d \times 10^{\frac{IL}{10}}$, (3.10)
where the insertion loss IL (in dB) is the power attenuation of the optical signal as it
propagates through all the optical devices on the routing path between the source core
and the destination core. In the WRH-ONoC architecture, insertion loss occurs in
waveguides (propagation and crossings) and in λ-routers (dropping into or passing an MR).
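As a minimal sketch of Eq. (3.10), the light-source term can be evaluated as below. The function and parameter names are hypothetical, SI units are assumed (watts, hertz, joules), and the per-device loss values passed in are placeholders supplied by the caller, not figures from this thesis.

```python
def total_insertion_loss_db(losses_db):
    """Losses expressed in dB simply add along the routing path."""
    return sum(losses_db)


def light_source_energy(eta, f_clk_hz, p_d_watts, il_db):
    """E_ls = P_d * 10^(IL/10) / (eta * f_clk), following Eq. (3.10).

    eta        light source efficiency, in (0, 1]
    f_clk_hz   clock frequency in Hz
    p_d_watts  photodetector sensitivity in W (assumed unit)
    il_db      total insertion loss of the path in dB
    """
    if not 0 < eta <= 1:
        raise ValueError("eta must lie in (0, 1]")
    return p_d_watts * 10 ** (il_db / 10) / (eta * f_clk_hz)
```

Since IL enters through 10^(IL/10), every additional 10 dB of path loss multiplies the required light-source energy by ten, which is why the number of devices on the path matters so much.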
For the inter-subsystem traffic, the average energy consumption includes an
electronic part (i.e., in the gateways) and an optical part
(i.e., in the light source and the O-E and E-O converters), as shown in Eq. (3.11). For the electronic
part, each packet is transmitted through Nhop−1 gateway hops. For the optical
part, each packet passes through Nhop λ-routers and experiences Nhop
O-E and E-O conversions in the network interfaces of the source and destination cores and
in the intermediate gateways.
Einter = EGW(Nhop−1) + (EEO + EOE + Els)Nhop. (3.11)
The energy consumption of gateway EGW is caused by the sliced input/output buffers,
wavelength dispatchers, crossbars, and the static energy (e.g., leakage).
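Eqs. (3.8), (3.9), and (3.11) combine into a short calculation, sketched here in Python. The function names are illustrative, and all energies are assumed to be expressed in the same unit.

```python
def intra_energy(e_eo, e_oe, e_ls):
    """Eq. (3.9): one E-O conversion, one O-E conversion, one light-source term."""
    return e_eo + e_oe + e_ls


def inter_energy(e_gw, e_eo, e_oe, e_ls, n_hop):
    """Eq. (3.11): N_hop optical segments joined by N_hop - 1 gateways."""
    if n_hop < 1:
        raise ValueError("n_hop must be at least 1")
    return e_gw * (n_hop - 1) + (e_eo + e_oe + e_ls) * n_hop


def average_packet_energy(alpha, e_intra, e_inter):
    """Eq. (3.8): weighted by the intra-subsystem traffic proportion alpha."""
    return alpha * e_intra + (1 - alpha) * e_inter
```

With n_hop = 1 the inter-subsystem expression reduces to the intra-subsystem one, which is a quick consistency check between Eq. (3.9) and Eq. (3.11).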
3.5 Performance Evaluation
The communication performance and scalability of WRH-ONoC are evaluated through
extensive simulations with both realistic data traces and synthetic traffic in different
network configurations. Moreover, it is also compared with some existing ONoC
schemes, including PNoC (Shacham and Bergman, 2007), HOME (Mo, Ye, Wu, Zhang,
Liu, and Xu, 2010), ATAC (Kurian, Miller, Psota, Eastep, Liu, Michel, Kimerling, and
Agarwal, 2010), and Firefly (Pan, Kumar, Kim, Memik, Zhang, and Choudhary, 2009).
3.5.1 Simulation Setup
To evaluate the communication performance and scalability of WRH-ONoC, it is implemented in a network-level simulator based on Noxim (Catania, Mineo, Monteleone, Palesi, and Patti, 2016), which can achieve a confidence interval of 95% according to (Catania, Holsmark, Kumar, and Palesi, 2006; Khalili and Zarandi, 2012). Moreover, the other ONoC schemes are also implemented in the simulator with the same parameter settings to make a fair comparison. Table 3.1 summarizes the parameter settings of the different components used in the simulations. All the optical devices, including E-O and O-E converters and routers, work with a bandwidth of 10 Gbps for each wavelength channel (Liu, Liao, Chetrit, Basak, Nguyen, Rubin, and Paniccia, 2010). All the electronic devices use a system clock of 1 GHz, thus each clock cycle is 1 ns. Considering the physical implementations in many-core processors, each packet has a constant size of 64 bits (Kao and Chao, 2014), and multiple successive packets can be transmitted for a large data block. For instance, since each cache coherence message has eight bytes and each cache line has seventy-two bytes in the data traces (Hestness, Grot, and Keckler, 2010), one and nine packets need to be transmitted for them each time, respectively. According to the analysis of the data traces, in the simulations each core in each clock cycle can generate a single short packet with a probability of 80 percent and a data block of nine packets with a probability of 20 percent. According to the simulations in (Liu, Yang, and Melhem, 2015), the optical signals can go through eight routers in an optical mesh; thus, for a fair comparison, the delay for optical signals in the simulations is configured as eight stages/cycle when passing through a λ-router and eight hops/cycle in the ring-like crossbars. The transmitting delay of the gateways incurred by buffering, wavelength look-up, and packet dispatching is five cycles for every packet, while in the other ONoC schemes the transmitting delay of the electronic router is set to its minimum, 2 cycles. Since Noxim is a simulator for mesh-based electronic NoCs, it is used to simulate the path setup process for PNoC and HOME, and the intra-cluster packet delivery for ATAC and Firefly. Thus, the end-to-end packet delay is the electronic network delay plus the optical delivery delay in these schemes.
Moreover, since the buffer sizes in the gateways and the electronic routers have a significant influence on the energy consumption and area cost, the overall buffer sizes in each gateway and electronic router are configured to be exactly equal. For instance, each sliced buffer in the gateways is fixed to 2 packets/wavelength, while the buffer in an electronic router of the other schemes is set to $\frac{2 \times n_g \times W_{max}}{n_r}$ packets/port, where $n_g$ and $n_r$ are the numbers of ports in the gateway and the electronic router. To ensure the accuracy of the simulation results, each simulation run lasts for 500,000 cycles with 10,000 cycles for warmup, and each simulation setting is run three times. According to the analysis, the coefficient of variation (i.e., the ratio of the standard deviation to the mean value) of the simulation results is always less than 4.78%. The λ-router scheme was not simulated since it is non-blocking with a constant transmission delay.

Table 3.1: Simulation Settings for WRH-ONoC
Parameter                        Value
Clock frequency                  1 GHz
Packet size                      64 bits
Optical channel bandwidth        10 Gbps/wavelength
Gateway delay                    5 cycles
λ-router delay                   8 stages/cycle
Electronic router delay          2 cycles
Optical router/crossbar delay    8 hops/cycle
Overall buffer size              Equal

Figure 3.7: Comparison of the average communication delay from simulation results and modelling, with {N, Wmax, g} set to (a) {320, 20, 4} and {480, 30, 6}, and (b) {400, 25, 5} and {640, 40, 8}. [Both panels plot the average end-to-end delay (ns) against the average data rate (Gbps/core).]
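The packet counts, the equal-buffer rule, and the coefficient of variation used in this setup can be sketched in a few lines of Python. The helper names are illustrative. Applying the 80/20 split per generation event and using the sample standard deviation are assumptions, since the text does not fix either choice.

```python
import math
import statistics

PACKET_BITS = 64  # constant packet size from the simulation settings


def packets_needed(message_bytes):
    """Number of 64-bit packets required to carry a message."""
    return math.ceil(message_bytes * 8 / PACKET_BITS)


def mean_packets_per_event(p_short=0.8, short_bytes=8, block_bytes=72):
    """Expected packets per generation event for the short/block traffic mix."""
    return (p_short * packets_needed(short_bytes)
            + (1 - p_short) * packets_needed(block_bytes))


def electronic_router_buffer(n_g, w_max, n_r, per_wavelength=2):
    """Packets/port giving an electronic router the same total buffering
    as a gateway with n_g ports and w_max sliced buffers per port."""
    return per_wavelength * n_g * w_max / n_r


def coefficient_of_variation(samples):
    """Ratio of the (sample) standard deviation to the mean."""
    return statistics.stdev(samples) / statistics.mean(samples)
```

For example, `packets_needed(8)` and `packets_needed(72)` give the one-packet and nine-packet cases described for coherence messages and cache lines.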
3.5.2 Comparison with Theoretical Results
To verify the average communication delay model of WRH-ONoC given in Section 3.4,
the average end-to-end packet delays obtained from the theoretical analysis and from the
simulations are compared as the average data rate varies, as shown in
Figure 3.7. In the simulations, the unicast packets are generated following the widely
used Poisson-Uniform distribution (Hamedani, Jerger, and Hessabi, 2014; Zhang, Gu,
Wang, Yang, and Tan, 2017; Gu, Chen, Yang, Chen, and Zhang, 2017), and the buffer
size of each gateway is set to 100 packets/wavelength to approximate an infinite buffer. Since
the buffer space is effectively unlimited in each gateway on the routing path, this simulation
also indicates the upper-bound performance for a given network configuration of
WRH-ONoC from the point of view of queueing delay.
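One common reading of a Poisson-Uniform workload is exponential inter-arrival times combined with uniformly chosen destinations. The generator below is a sketch under that reading; it is not the traffic generator of the simulator used in this chapter, and its name and parameters are hypothetical.

```python
import random


def poisson_uniform_traffic(n_cores, rate_per_cycle, n_packets, seed=0):
    """Yield (time, src, dst) tuples for unicast packets.

    Each core injects packets as a Poisson process with the given rate
    (packets/cycle). The merged stream of all cores is itself a Poisson
    process of rate n_cores * rate_per_cycle, whose events belong to a
    uniformly random source core.
    """
    rng = random.Random(seed)
    now = 0.0
    for _ in range(n_packets):
        now += rng.expovariate(rate_per_cycle * n_cores)
        src = rng.randrange(n_cores)
        dst = rng.randrange(n_cores - 1)  # uniform over the other cores
        if dst >= src:
            dst += 1                      # skip the source itself
        yield now, src, dst
```

Fixing the seed makes a run repeatable, which is convenient when the same setting is simulated several times with different seeds.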
It can be seen from Figure 3.7 that four sets of simulations are conducted, in which the network configuration of WRH-ONoC {N, Wmax, g} is set to {320, 20, 4} and {480, 30, 6} in Figure 3.7(a), and {400, 25, 5} and {640, 40, 8} in Figure 3.7(b). According to the parameter settings in Table 3.1, the transmission delays of the λ-routers, DλR, are 3, 4, 4, and 5 cycles in these cases, respectively. The delays DEO, DOE, DGW, and td are set to 1 cycle, 1 cycle, 6 cycles, and 5 cycles in both the theoretical analysis and the simulations. When the average data rate θ is small (e.g., θ ≤ 10 Gbps/core), the average end-to-end delay is also small (around 25 ns) and remains stable because most of the data packets do not experience much queueing delay in the gateways. When the average data rate θ approaches the maximum data rate θmax, the average end-to-end delay increases dramatically because the network becomes saturated and most packets need to be queued in the gateways between the source core and the destination core. It is worth noting that the average delay measured in the simulations stays close to the theoretical results, with average differences of 22.9%, 20.0%, 17.3%, and 13.7%, respectively. The maximum data rates that can be achieved are 15.17, 23.17, 19.18, and 31.17 Gbps/core, respectively, which also match the analysis in Corollary 1. However, as the average data rate increases, the gap between the theoretical analysis and the simulation results becomes bigger. The main reason for this gap is that an infinite buffer size is assumed in each gateway in the theoretical analysis, while the gateway buffer size is set to 100 packets/wavelength in the simulations, which only approximates an infinite buffer. With this configuration, when data packets are congested in the gateways, especially the gateways in the upper levels, waiting for the routing computation and wavelength reallocation, the network congestion can spread back to the source core through the credit-based flow control. Thus, the actual data rate is smaller in the simulations than in the theoretical modelling, and in turn the average end-to-end delay achieved in the simulations is a little lower than that of the theoretical analysis. Moreover, this phenomenon becomes more obvious when the average data rate is higher, due to more severe network congestion. Generally, this comparison validates the theoretical analysis in Section 3.4.
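The λ-router delays quoted above (3, 4, 4, and 5 cycles) are consistent with dividing the number of router stages by the 8 stages/cycle setting of Table 3.1 and rounding up. The sketch below assumes that the number of stages equals Wmax, which is an inference from these four values rather than a definition taken from the text.

```python
import math

STAGES_PER_CYCLE = 8  # from Table 3.1


def lambda_router_delay_cycles(w_max, stages_per_cycle=STAGES_PER_CYCLE):
    """Cycles needed to traverse a lambda-router with w_max stages (assumed)."""
    return math.ceil(w_max / stages_per_cycle)


# Wmax values of the four configurations compared in Figure 3.7
delays = [lambda_router_delay_cycles(w) for w in (20, 30, 25, 40)]
```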
3.5.3 Simulation with Data Traces
In the first set of simulations, the traffic pattern of inter-core communication is obtained from realistic data traces and used to evaluate the performance of WRH-ONoC. The communication data traces are obtained from Netrace (Hestness, Grot, and Keckler, 2010), which constructs a 64-core system running the PARSEC benchmarks (Bienia, Kumar, Singh, and Li, 2008) with a 2-level cache and the MESI coherence protocol. According to the analysis of the data traces, even though the average data rates in these traces are low (less than 0.24 packets/cycle), they contain bursty traffic (massive data packets in a short time) and multicast traffic. Thus, they are good tools to evaluate the communication performance of WRH-ONoC for small-scale many-core processors.

Figure 3.8: The trace-based simulations for different ONoC schemes with 64 cores: (a) the average end-to-end packet delay; (b) the packet delay variations over time with the blackscholes trace. [The plots compare PNoC, HOME, ATAC, Firefly, and WRH-ONoC. The vertical axis is the average end-to-end delay (ns), shown per PARSEC trace (blackscholes, bodytrack, canneal, dedup, ferret, fluidanimate, swaptions, vips, x264) and against the time interval (million cycles).]
In the simulations, WRH-ONoC is configured as a 2-level hierarchical network with 64 cores, namely {N, Wmax, g} = {64, 20, 4}. Thus, there are 16 cores connected in each subsystem, 20 different wavelengths used in each λ-router, and 4 sibling gateways used between two λ-routers. The other ONoC schemes compared with WRH-ONoC are configured as follows. PNoC is configured as an 8×8 mesh network using the traditional dimensional routing algorithm, and HOME uses a 4×4 mesh with four cores sharing one optical router. These two ONoC schemes do not employ WDM and need to reserve the optical routing path through the electronic network in advance. The ATAC and Firefly schemes are configured with 16 clusters, with the 4 cores of each cluster in a 2×2 electronic mesh. In the ATAC scheme, each cluster has a fixed optical access point to the global optical ring, while in the Firefly scheme the cores at the same relative position of different clusters are connected by an optical ring, which gives higher path diversity than the ATAC scheme. The simulation results are shown in Figure 3.8.
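The subsystem size quoted above for {64, 20, 4} can be reproduced with a small helper. The sketch assumes that each level-1 λ-router dedicates g of its Wmax wavelengths to sibling gateways and the remaining Wmax − g to cores. This matches the 16 cores per subsystem stated here, but the rule itself is inferred for illustration, and the function name is hypothetical.

```python
def subsystem_layout(n_cores, w_max, g):
    """Return (cores per subsystem, number of subsystems).

    Assumes Wmax - g wavelengths of a level-1 lambda-router serve cores
    and the other g serve the sibling gateways towards the upper level.
    """
    cores_per_subsystem = w_max - g
    if cores_per_subsystem <= 0:
        raise ValueError("g must be smaller than w_max")
    subsystems, remainder = divmod(n_cores, cores_per_subsystem)
    if remainder:
        raise ValueError("n_cores is not a multiple of the subsystem size")
    return cores_per_subsystem, subsystems
```

Under this assumption, `subsystem_layout(64, 20, 4)` yields 16 cores in each of 4 subsystems.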
Figure 3.8(a) shows the average end-to-end packet delay of the different ONoC schemes when running different data traces. It can be seen that WRH-ONoC achieves the lowest packet delay (around 15 ns) compared to all the other ONoC schemes. It is worth noting that the average end-to-end delay achieved by WRH-ONoC does not vary much for different data traces. This indicates that the communication capacity of WRH-ONoC can accommodate the traffic variations in all these traces, since it can process multiple communications in parallel through different channels via wavelength multiplexing and has better load-balance capability to deal with bursty and multicast traffic. On the other hand, since PNoC requires end-to-end optical path reservation and has a longer communication distance, its packet delay is much higher and varies a lot even at a low data rate. Since the HOME scheme has a shorter communication distance within a 4×4 mesh, its path reservation delay is less significant than that of PNoC. In ATAC, the packets need to be transmitted in an electronic network at both the source and destination sides, and all of the cores need to compete for the global optical ring, hence its average packet delay is a little higher. Firefly also achieves better performance than the other existing schemes by using more optical rings, in which the cores at the same position of each cluster are connected by a separate optical ring.
To further demonstrate the detailed delay variations in different time periods, Figure 3.8(b) shows the short-term average end-to-end delay over fixed time intervals of 1 million cycles for these ONoC schemes, using the blackscholes data trace. It can be seen that the average end-to-end delay of WRH-ONoC remains very low and stable, while the average delay of PNoC fluctuates considerably over time. Even though the ATAC and Firefly schemes also show low fluctuation, their average packet delays are a little higher than that of WRH-ONoC.
Note that in the trace-based simulations, since there are only 64 cores and the average data rate is low, the communication performance improvement of WRH-ONoC over the other ONoC schemes is not very significant. Thus, the WRH-ONoC architecture with larger network sizes and variable data rates is further evaluated and compared with the other schemes under synthetic traffic patterns.
3.5.4 Simulation with Synthetic Traffic Patterns
In the following, synthetic traffic patterns are used to further evaluate the communication performance and scalability of WRH-ONoC for large-scale many-core processors, which interconnect hundreds of cores. With synthetic traffic patterns, the saturation data rate and the maximum throughput can be evaluated by gradually increasing the average data rate of each core. The communication performance of WRH-ONoC for unicast and multicast traffic can also be simulated separately. In order to obtain the average-case communication performance of WRH-ONoC with different network parameters, the synthetic traffic follows the widely used Uniform-Poisson distribution in the simulations (Hamedani, Jerger, and Hessabi, 2014; Zhang, Gu, Wang, Yang, and Tan, 2017; Gu, Chen, Yang, Chen, and Zhang, 2017).

Figure 3.9: Performance analysis for different ONoC schemes with 400 cores using synthetic unicast traffic: (a) average end-to-end delay; (b) average throughput per core. [Panel (a) plots the average end-to-end delay (ns) and panel (b) the average throughput (Gbps/core), both against the average data rate (Gbps/core), for PNoC, HOME, ATAC, Firefly, and WRH-ONoC.]
Comparison with Existing Schemes
In this set of simulations, unicast traffic which follows the Uniform-Poisson distribution is used to evaluate the communication performance of the different ONoC schemes. The total number of cores in the many-core processor, N, is set to 400. In the WRH-ONoC architecture, all the cores are divided into 20 subsystems and organized as {N, Wmax, g} = {400, 21, 1}. With this configuration, WRH-ONoC has 2 levels of λ-routers with g = 1 sibling gateway between two λ-routers. Meanwhile, in the PNoC scheme the 400 cores are connected by a 20×20 electronic-optical mesh network using the dimensional routing algorithm, while HOME interconnects all the four-core clusters using a 10×10 electronic-optical mesh. PNoC and HOME utilize only one wavelength without WDM. In the ATAC and Firefly schemes, the 400 cores are grouped into 25 clusters connected by a WDM-based optical crossbar with a slightly higher maximum number of wavelengths than WRH-ONoC (25 versus 21), and a 4×4 electronic mesh is used for the intra-cluster routing. The simulation results are shown in Figure 3.9.
Figure 3.9(a) shows the average end-to-end packet delay as the average data rate varies. It can be seen that the WRH-ONoC scheme achieves the lowest end-to-end delay and the highest saturation data rate compared with all the other ONoC schemes. When the average data rate θ is small, the packet delay in every ONoC scheme is very low, since most of the packets do not suffer from queueing delay in the gateways of WRH-ONoC or in the electronic routers of the other ONoC schemes. The zero-load delay is used to evaluate the average packet delay when the average data rate is low. It is defined as the delay without contention, thus it is also the lower bound of the average delay. Since the WRH-ONoC scheme has only 2 levels of λ-routers when g = 1 and most of the packets do not experience many wavelength reassignments in the gateways, it obtains the lowest zero-load delay, around 12.6 ns. The zero-load delay of Firefly is also low, about 18.4 ns, owing to its low average distance and high path diversity. Compared with the other ONoC schemes, WRH-ONoC still has significant advantages in zero-load delay. PNoC has the highest zero-load delay, about 60.8 ns, due to the large average length of its routing paths and the end-to-end path reservation delay. The zero-load delays of HOME and ATAC are 37.4 ns and 47.5 ns. Therefore, even in comparison with Firefly, the best of the other ONoC schemes, WRH-ONoC reduces the zero-load delay by about 31.5% (12.6 ns versus 18.4 ns); put differently, the zero-load delay of Firefly is about 46.0% higher.
As the average data rate increases, the average end-to-end packet delay increases as well, because of the increasing contention occurring in the electronic routers or gateways. However, the WRH-ONoC scheme achieves a very low end-to-end delay (<20 ns) even with an average data rate of 16 Gbps/core. On the other hand, HOME gets saturated quickly at 2 Gbps/core due to the heavy traffic loads in the global mesh network, with the 4 cores of a cluster sharing one optical router. PNoC saturates at around 4 Gbps/core due to the long end-to-end path reservation process and the poor scalability of the mesh topology. ATAC and Firefly achieve a higher saturation data rate of around 11 Gbps/core since they both utilize the optical ring for all the 25 clusters. The saturation data rate of WRH-ONoC (around 17 Gbps/core) is about 8.5, 4.3, 1.55, and 1.55 times that of HOME, PNoC, ATAC, and Firefly, respectively, owing to the non-blocking property of the λ-routers and the fast wavelength assignment in the gateways.
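The comparisons above are simple ratios of the quoted figures, and the arithmetic can be reproduced as follows. The helper names are illustrative, and the inputs are the approximate values read from the text, so the results inherit their rounding.

```python
def relative_reduction(value, baseline):
    """Fraction by which value is lower than baseline."""
    return 1 - value / baseline


def relative_increase(value, baseline):
    """Fraction by which value is higher than baseline."""
    return value / baseline - 1


# Approximate saturation data rates in Gbps/core, as quoted in the text
saturation = {"HOME": 2, "PNoC": 4, "ATAC": 11, "Firefly": 11}
wrh_saturation = 17
ratios = {name: round(wrh_saturation / rate, 2)
          for name, rate in saturation.items()}

# Zero-load delays in ns: WRH-ONoC 12.6, Firefly 18.4
wrh_vs_firefly = round(relative_reduction(12.6, 18.4), 3)
firefly_vs_wrh = round(relative_increase(18.4, 12.6), 3)
```

Note that the same pair of delays gives different percentages depending on which scheme is taken as the baseline, so the baseline should be stated alongside any such figure.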
Figure 3.9(b) shows the average throughput of the different ONoC schemes as the average data rate increases. It can be seen that when the average data rate is low, the average throughput of all these schemes increases linearly with the average data rate, since the throughput capacity is above that data rate. However, when the average data rate exceeds the throughput capacity, the achieved average throughput cannot increase any more. Different ONoC schemes have different maximal throughputs, namely the saturated throughput. It can be seen that the proposed WRH-ONoC scheme achieves a much higher maximal throughput than the other ONoC schemes, around 22.1 Gbps/core. Generally, WRH-ONoC has at least a 72.7% improvement in maximal throughput compared to the best of the other schemes (12.8 Gbps/core in Firefly).

Figure 3.10: Performance analysis with different network sizes: (a) average end-to-end packet delay; (b) throughput per core. [Panel (a) plots the average end-to-end delay (ns) and panel (b) the average throughput (Gbps/core), both against the average data rate (Gbps/core), for PNoC and Firefly with 400 and 640 cores and for WRH-ONoC with 400, 480, 640, and 800 cores.]
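The shape of the throughput curves described above (linear growth followed by a plateau) corresponds to an idealised model in which the delivered throughput is the smaller of the offered data rate and the throughput capacity. The sketch below illustrates that idealisation only; measured curves may bend more gradually near saturation.

```python
def ideal_throughput(offered_rate, capacity):
    """Delivered throughput under an idealised saturation model (Gbps/core)."""
    return min(offered_rate, capacity)


def improvement_percent(value, baseline):
    """Relative improvement of value over baseline, in percent."""
    return (value / baseline - 1) * 100


# Maximal throughputs quoted in the text, in Gbps/core
wrh_capacity, firefly_capacity = 22.1, 12.8
gain = round(improvement_percent(wrh_capacity, firefly_capacity), 1)
```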
Impact of Network Size
The scalability of WRH-ONoC is evaluated in the simulations by using different network sizes. Four groups of simulations are carried out with 400, 480, 640, and 800 cores, with the network architectures {N, Wmax, g} organized as {400, 25, 5}, {480, 30, 6}, {640, 40, 8}, and {800, 50, 10}, respectively. PNoC and Firefly are also compared with 400 and 640 cores in the simulations. PNoC is configured as 20×20 and 20×32 mesh networks. Firefly uses the same number of wavelengths as the WRH-ONoC scheme in the global optical ring, and the 16 cores in each cluster are connected by a 4×4 mesh local network. The simulation results are given in Figure 3.10.
It can be seen from the simulation results that WRH-ONoC achieves much better performance, i.e., lower end-to-end delay and higher throughput, than the PNoC and Firefly schemes when connecting the same number of cores. It is worth noting that: (i) the performance of the PNoC and Firefly schemes deteriorates as the network size expands. This is because of the linearly increasing average number of hops and the severe contention in the path reservation for PNoC, and the inefficient electronic local network and the contention in the optical global network for Firefly. Similar results can be obtained for the HOME and ATAC schemes; (ii) benefiting from the non-blocking λ-routers and the load balance among sibling gateways, WRH-ONoC can achieve a higher saturation data rate and average throughput for larger scale systems by increasing the number of wavelengths Wmax and the number of sibling gateways g. For instance, when Wmax is increased from 25 to 50 and g is increased from 5 to 10, the saturation data rate and average throughput are more than doubled, even though the number of cores is also doubled (from 400 to 800). This is because the proportion of intra-subsystem traffic increases with the increase of Wmax, and the number of parallel data paths between λ-routers increases with the increase of g. For the Firefly scheme, even though the optical network can provide higher bandwidth with more wavelengths, the electronic local network becomes its bottleneck; (iii) for each configuration, since WRH-ONoC is connected in 3 levels of λ-routers, the average end-to-end delay achieved before network saturation roughly increases with the increase of the λ-router size (i.e., the number of input/output ports or the number of stages), because of the increased transmission delay.

Figure 3.11: Performance analysis with different number of sibling gateways: (a) average end-to-end delay; (b) throughput per core. [Panel (a) plots the average end-to-end delay (ns) and panel (b) the average throughput (Gbps/core), both against the average data rate (Gbps/core), for L = 2 levels with g = 1 gateway, L = 3 levels with g = 4 gateways, and L = 3 levels with g = 5 gateways.]
Therefore, from this set of simulations, it can be seen that WRH-ONoC achieves
better scalability than the other ONoC schemes, namely with lower packet delay and
higher throughput when interconnecting the same number of cores.
Impact of Gateway and Buffer Size
The communication performance of WRH-ONoC with different network configurations
is also evaluated in the simulations. The influence of the number of sibling gateways
between λ-routers is shown in Figure 3.11, and the influence of the buffer
size in the gateways is shown in Figure 3.12.
Figure 3.11 shows the performance analysis of WRH-ONoC with different numbers of sibling gateways. The simulations are carried out for a many-core processor with 400 cores. To interconnect the same number of cores, WRH-ONoC has 2 levels of λ-routers when g = 1, and 3 levels when g = 4 and 5. It can be seen that when g = 1, WRH-ONoC has fewer levels of λ-routers and gateways, thus it achieves a lower end-to-end packet delay due to fewer wavelength reassignments in the gateways, as shown in Figure 3.11(a); when g = 4 and 5, WRH-ONoC has more alternative channels between two different levels of λ-routers and better load balance ability by randomly choosing among more sibling gateways, thus it achieves a higher saturation throughput, as shown in Figure 3.11(b). Hence, the number of sibling gateways in WRH-ONoC can be configured according to the requirements of specific applications, to achieve a better tradeoff between the average end-to-end packet delay and the network throughput.

Figure 3.12: Performance analysis with different buffer sizes: (a) average end-to-end packet delay; (b) throughput per core. [Panel (a) plots the average end-to-end delay (ns) and panel (b) the average throughput (Gbps/core), both against the average data rate (Gbps/core), for buffer sizes of 1, 2, 4, 8, and 16 packets.]
Figure 3.12 shows the influence of the gateway buffer size on the performance of WRH-ONoC. The simulations are carried out for a many-core processor with {N, Wmax, g} = {400, 25, 5}. The buffer size in each gateway increases from 1 to 16 packets/wavelength. Since wavelength-level credit-based flow control is used, the transmission of a data packet is stopped when the input buffer of the next-hop gateway is full. With this design, the end-to-end packet delay can increase significantly when the network is congested, but there is never any packet loss, whatever the buffer size. It can be seen that the performance is improved by increasing the buffer size from 1 to 2, because a larger packet buffer in the gateways can better deal with bursty traffic. However, further increasing the buffer size to 4, 8, and 16 does not greatly enhance the performance. This is because (i) the average queueing delay in the gateways is very small when the network is not saturated, due to the high-speed transmission in the λ-routers and the parallel packet dispatching in the gateways. Even with a large buffer, only a small portion is used for temporarily storing the packets that wait for wavelength reassignment; (ii) when the average data rate further increases, since the gateways in the upper levels can become the bottleneck due to the limited buffer size and the converged data rate, the increase of buffer size cannot bring an obvious improvement in the communication performance. The optimization of the buffer size configuration in the gateways, such as with different buffer sizes in different levels, will be researched in detail in future work.

Figure 3.13: Performance analysis with the locality traffic patterns: (a) end-to-end packet delay; (b) throughput per core. [Panel (a) plots the average end-to-end delay (ns) and panel (b) the average throughput (Gbps/core), both against the average data rate (Gbps/core), for Uniform traffic and for locality traffic with ratios of 0.1, 0.3, 0.5, and 0.7.]
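The credit-based flow control mentioned in the buffer-size discussion can be illustrated with a minimal single-channel model. This is a sketch of the general technique, not the gateway design of WRH-ONoC; the class and method names are hypothetical.

```python
from collections import deque


class CreditLink:
    """One wavelength channel feeding the input buffer of a next-hop gateway."""

    def __init__(self, buffer_packets):
        self.credits = buffer_packets  # free slots known to the sender
        self.buffer = deque()          # packets held at the receiver

    def try_send(self, packet):
        """Send only if a credit is available; otherwise the sender stalls."""
        if self.credits == 0:
            return False               # packet stays at the sender, not dropped
        self.credits -= 1
        self.buffer.append(packet)
        return True

    def drain_one(self):
        """Receiver forwards one packet and returns a credit to the sender."""
        if not self.buffer:
            return None
        self.credits += 1
        return self.buffer.popleft()
```

Because a packet is only sent when a slot is guaranteed, congestion shows up as stalls (and therefore delay) at the sender rather than as packet loss, regardless of the buffer size chosen.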
Impact of Locality Traffic Pattern
In most many-core processors, the communication tends to occur locally as the tasks are commonly executed only in a subset of cores. Since the ratio of intra-subsystem traffic has a significant influence on the communication performance of WRH-ONoC, the locality traffic pattern is used in this set of simulations. In the locality traffic, the proportion of intra-subsystem traffic α can be tuned manually. Each type of traffic (intra-subsystem or inter-subsystem) still follows the Uniform distribution in space and the Poisson distribution in time. In the simulations, the WRH-ONoC architecture is configured as {N, Wmax, g} = {400, 25, 5}. The simulation results are given in Figure 3.13.
It can be seen from Figure 3.13 that the traffic distribution in WRH-ONoC has a significant impact on the communication performance. Note that the performance gap between Uniform traffic and locality traffic is more apparent with a larger ratio of intra-subsystem traffic α and a larger average data rate θ. This is because, for Uniform traffic, intra-subsystem communication only accounts for (n−1)/(N−1) = 4.76% of the traffic. The intra-subsystem packets only traverse one level-1 λ-router with very low latency, while inter-subsystem packets need to traverse several hops of λ-routers and need to compete for the gateway resources along the routing path. Hence, the average end-to-end delay increases with the increase of θ, but decreases with the increase of α. Likewise, the average throughput increases with the increase of both θ and α.
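The 4.76% figure and the tunable locality ratio α can be reproduced with a short sketch. It assumes n = 20 cores per subsystem for the {400, 25, 5} configuration and, purely for illustration, that cores are numbered consecutively within each subsystem. The function names are hypothetical.

```python
import random


def uniform_intra_fraction(n_cores, cores_per_subsystem):
    """Share of uniform traffic that stays inside the source subsystem."""
    return (cores_per_subsystem - 1) / (n_cores - 1)


def pick_destination(src, n_cores, cores_per_subsystem, alpha, rng):
    """With probability alpha pick another core of src's subsystem,
    otherwise pick a core outside that subsystem (uniformly in both cases)."""
    base = (src // cores_per_subsystem) * cores_per_subsystem
    if rng.random() < alpha:
        dst = base + rng.randrange(cores_per_subsystem - 1)
        return dst + 1 if dst >= src else dst  # skip the source core
    dst = rng.randrange(n_cores - cores_per_subsystem)
    # skip over the whole source subsystem
    return dst + cores_per_subsystem if dst >= base else dst
```

Setting alpha to `uniform_intra_fraction(400, 20)` recovers plain uniform traffic, while larger values shift load away from the gateways.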
Figure 3.14: Performance analysis with different multicast distributions: (a) average end-to-end delay; (b) average throughput. [Panel (a) plots the average end-to-end delay (ns) and panel (b) the average throughput (Gbps/core), both against the average data rate (Gbps/core), for {s,n} = {1,20}, {2,10}, {4,5}, {5,4}, {10,2}, and {20,1}.]
Impact of Multicast Traffic
Multicast communication widely exists in many-core processors and has a significant impact on the communication performance. In the simulations, the multicast communication performance of WRH-ONoC is evaluated with a parametrized distribution of destination cores and a parametrized multicast traffic ratio. The destination cores of each multicast packet can be distributed in different subsystems, and the corresponding simulation results are given in Figure 3.14. Moreover, the multicast traffic ratio, denoted by ω, is the proportion of multicast packets over the total number of packets, and it can be configured to different values, with the simulation results shown in Figure 3.15.
The simulation results in Figure 3.14 reveal the influence of different distributions
of destination cores for a 400-core WRH-ONoC with {N,Wmax, g}={400, 25, 5}. The
traffic pattern follows the Uniform-Poisson distribution in which ω = 5% of traffic is
multicast. The parameter settings {s, n} indicate that the destination cores of each
multicast packet are randomly distributed in s different subsystems with n randomly
chosen cores per subsystem. In the simulations, the total number of destination cores
for a multicast packet is set to 20, which is the number of cores in a subsystem, so that 6
combinations of s and n can be evaluated. It can be seen that WRH-ONoC achieves
much lower end-to-end delay and much higher throughput when the destination cores are
placed in as few subsystems as possible, e.g., the same subsystem or neighbouring ones,
because fewer packet copies need to be transmitted in the hierarchical network. Very low
packet delay (about 20 ns) is achieved at data rates of up to 8 Gbps/core when all the 20
destination cores are in the same subsystem. Moreover, since as many as 20 packet copies
are generated at the last-hop gateways for each multicast packet, the saturated
data rate is lower than in the previous unicast simulations, while the
[Figure 3.15, panels (a) and (b): x-axis, Average Data Rate (Gbps/core); y-axes, (a) Average
End-to-End Delay (ns) and (b) Average Throughput (Gbps/core); curves for PNoC and WRH with
ω = 5%, 10%, and 15%.]
Figure 3.15: Performance analysis with different multicast ratios ω:
(a) average end-to-end delay; (b) average throughput.
saturated throughput is higher due to more copies generated for each multicast packet.
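The six {s, n} settings evaluated in Figure 3.14 are exactly the factor pairs of the 20 destination cores. A minimal sketch that enumerates them is given below; the function name is hypothetical and is not part of the simulator.

```python
def destination_layouts(total_destinations):
    """All (s, n) pairs with s subsystems and n destinations per subsystem
    such that s * n equals the total number of destination cores."""
    return [(s, total_destinations // s)
            for s in range(1, total_destinations + 1)
            if total_destinations % s == 0]


layouts = destination_layouts(20)
print(layouts)  # [(1, 20), (2, 10), (4, 5), (5, 4), (10, 2), (20, 1)]
```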
Figure 3.15 shows the communication performance of WRH-ONoC with different
multicast traffic ratios. The proposed WRH-ONoC scheme is also compared with PNoC,
which uses a tree-based multicast routing algorithm (Samman, Hollstein, and Glesner,
2010). In the simulations, WRH-ONoC is configured with {N,Wmax, g}={400, 25, 5},
while PNoC is a 20×20 mesh. For each multicast packet, there are 20 randomly chosen
destinations. The traffic pattern follows the Uniform-Poisson distribution with different
proportions of multicast traffic. It can be seen that, as ω increases, the maximum data
rate and throughput decrease significantly owing to the much heavier bursty multicast
traffic, especially with as many as 20 destinations. However, WRH-ONoC always
achieves better performance than PNoC: the zero-load delay is reduced by 63.9%
and the saturation throughput is increased by 8.4 times even with only 5% of multicast
traffic. Similar advantages are also obtained over HOME, ATAC, and Firefly,
especially since the multicast traffic can lead to much heavier congestion on their
global optical interconnects.
3.5.5 Hardware Cost Analysis
Table 3.2 gives the hardware requirement of optical devices (in MRs) of WRH-ONoC
and compares it with the λ-router scheme, since WRH-ONoC is built on the
basis of the λ-router to solve its scalability problem for large-scale many-core
processors. From this table, it can be seen that the number of wavelengths and the numbers
of MRs required in the signal converters (O-E&E-O) and in the λ-routers of WRH-ONoC are
much smaller than those of a global λ-router, a reduction of more than 90%. That is because
each core in the λ-router scheme needs to directly interconnect with all the other cores,
Table 3.2: Hardware Requirement Comparison between WRH-ONoC and λ-router

               Configuration       Requirements of MRs
Architecture   N     Wmax   g      in E-O&O-E   Reduction   in Routers   Reduction
λ-router       400   400    -      159600       -           159200       -
WRH-ONoC       400   25     5      14600        90.85%      13950        91.23%
λ-router       480   480    -      229920       -           229440       -
WRH-ONoC       480   30     6      21120        90.81%      20340        91.13%
λ-router       640   640    -      408960       -           408320       -
WRH-ONoC       640   40     8      37760        90.77%      36720        91.01%
while in WRH-ONoC each core only directly connects with the cores or gateways in
the same subsystem. Even though the use of gateways increases the hardware cost of
the electronic devices, the number of gateways is much smaller than the number of
connected cores, e.g., 20 gateways for a 400-core system when Wmax=25, g=1, and the buffer
space is much smaller than in traditional electronic routers owing to the high-speed optical
transmission. Without a mature on-chip wavelength converter, the electronic hardware cost
is inevitable, but all the electronic gateways can be integrated in a separate layer with
3D integration technology (Morris, Kodi, Louri, and Whaley, 2014). In a many-core
processor with a size of 20×20 mm², the overall footprint of each gateway can be less
than 0.16 mm² with a 45 nm process according to the Orion 3.0 simulator (Kahng, Lin,
and Nath, 2015).
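The reduction percentages in Table 3.2 can be re-derived from the MR counts. The sketch below copies the counts from the table; the observation that the λ-router counts equal N(N−1) and N(N−2) is only a pattern that the tabulated values happen to follow, not a formula quoted from the text, and the helper name is illustrative.

```python
# (N, lambda-router MRs in E-O&O-E, WRH-ONoC MRs in E-O&O-E,
#     lambda-router MRs in routers, WRH-ONoC MRs in routers), from Table 3.2
TABLE_3_2 = [
    (400, 159600, 14600, 159200, 13950),
    (480, 229920, 21120, 229440, 20340),
    (640, 408960, 37760, 408320, 36720),
]


def reduction_percent(baseline, proposed):
    """Relative saving of `proposed` with respect to `baseline`, in percent."""
    return 100.0 * (1.0 - proposed / baseline)


for N, conv_base, conv_wrh, router_base, router_wrh in TABLE_3_2:
    # Pattern observed in the table (an assumption, not taken from the text).
    assert conv_base == N * (N - 1)
    assert router_base == N * (N - 2)
    print(N,
          f"{reduction_percent(conv_base, conv_wrh):.2f}%",
          f"{reduction_percent(router_base, router_wrh):.2f}%")
```

The printed values may differ from the table in the last digit, depending on whether the table rounds or truncates.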
Table 3.3 compares the hardware cost of WRH-ONoC with that of the other ONoC schemes
using the Orion 3.0 simulator (Kahng, Lin, and Nath, 2015),
which is often used to estimate the chip area of electronic routers for Networks on Chip
with the commercially popular 45 nm process. According to the table, WRH-ONoC
consumes the second smallest chip area among the existing schemes even when
g=5. For the electronic part, PNoC, ATAC, and Firefly need as many as 400 routers,
while WRH-ONoC needs only 125 gateways when g=5, albeit with larger crossbars (25×25).
Owing to the fast wavelength reassignment and abundant optical channels, WRH-ONoC
significantly reduces the size of the buffers, which could otherwise consume a lot of
chip area. However, to make a fair comparison in the hardware cost analysis, the overall
buffer sizes of the gateways and electronic routers are set to be exactly the same
among all the compared ONoC schemes, a decision that somewhat disadvantages
WRH-ONoC, because WRH-ONoC can work with smaller buffers. It can be
seen that even though WRH-ONoC uses a larger crossbar, its total area cost is smaller
than that of most of the other schemes owing to the fewer gateways used. The overall chip area
Table 3.3: Hardware Costs Comparison (area in mm²)

                      Electronic router/gateway             Optical router/crossbar
Scheme       Wmax     count   size    per area   sum area   count   size    per area   sum area
PNoC         1        400     5×5     0.0606     24.23      400     5×5     0.0015     0.6
HOME         1        100     8×8     0.0782     7.82       100     5×5     0.0015     0.15
ATAC         25       400     5×5     0.0606     24.23      1       25×25   0.0325     0.0325
Firefly      25       400     6×6     0.0665     26.59      8       25×25   0.0325     0.26
WRH (g=5)    25       125     25×25   0.1614     20.17      26      25×25   0.0325     0.845
WRH (g=1)    21       20      21×21   0.1204     2.41       21      21×21   0.0221     0.4641
of the WRH-ONoC scheme using 5 sibling gateways (the second-last row in the table) is
only higher than that of HOME, which has the lowest throughput capacity. Moreover, if the
number of sibling gateways is decreased from 5 to 1, the number of gateways is
significantly reduced and the area cost of the electronic part is only about 2.41 mm², which
is the smallest among all the ONoC schemes (see the last row in the table). For the optical
part, since no synthesis tool is available at present, the chip area is estimated by
assuming that each MR takes 3×3 µm² and each OSE takes 10×10 µm² according to
(Kazmierczak, Briere, Drouard, Bontoux, Rojo-Romeo, O’Connor, Letartre, Gaffiot,
Orobtchouk, and Benyattou, 2005). The PNoC and HOME schemes use a large number
of small optical routers in the mesh topology, while ATAC and Firefly use a large
optical ring for the global interconnection. WRH-ONoC uses multiple λ-routers and
thus requires more optical devices. However, since the optical devices occupy only a
small proportion of the overall chip area of an ONoC architecture, WRH-ONoC still
consumes much less chip area than most of the other schemes.
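To make the area ranking explicit, the electronic and optical sums of Table 3.3 can be added per scheme. The sketch below only uses the "sum area" columns copied from the table; the variable names are illustrative.

```python
# scheme -> (electronic sum area, optical sum area) in mm^2, from Table 3.3
AREAS_MM2 = {
    "PNoC": (24.23, 0.6),
    "HOME": (7.82, 0.15),
    "ATAC": (24.23, 0.0325),
    "Firefly": (26.59, 0.26),
    "WRH (g=5)": (20.17, 0.845),
    "WRH (g=1)": (2.41, 0.4641),
}

# Total area per scheme and the schemes sorted from smallest to largest total.
totals = {name: elec + opt for name, (elec, opt) in AREAS_MM2.items()}
ranking = sorted(totals, key=totals.get)

for name in ranking:
    optical_share = AREAS_MM2[name][1] / totals[name]
    print(f"{name:10s} {totals[name]:7.3f} mm^2  (optical share {optical_share:.1%})")
```

With these values, WRH (g=1) has the smallest total, followed by HOME and then WRH (g=5), which agrees with the ordering described in the text.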
3.5.6 Energy Efficiency
The energy efficiency of WRH-ONoC is evaluated using both data traces and synthetic
traffic patterns, and is also compared with the other ONoC schemes. The energy
efficiency is defined as the average amount of energy consumed for transmitting a packet,
in pJ/packet. In the simulations, both the dynamic energy consumption (e.g.,
routing, forwarding) and the static energy consumption (e.g., leakage) are considered.
Table 3.4 lists the optical energy parameters used in the simulations, which are taken
from existing works (Morris, Jolley, and Kodi, 2014; Morris, Kodi, Louri, and Whaley,
2014; Kao and Chao, 2014). The electronic energy costs of the gateways in WRH-ONoC and of
the electronic routers in the other ONoC schemes are obtained from the Orion 3.0 simulator
(Kahng, Lin, and Nath, 2015) and given in Table 3.5. Since the buffer size in the
electronic routers and gateways has a huge influence on the energy cost, the overall buffer
size is set to be equal in each electronic router and gateway. The overall buffer sizes
differ between the trace-based and the synthetic-traffic-based simulations: 20 wavelengths
are used for connecting 64 cores in the trace-based simulations, while 25 wavelengths
are used for 400 cores in the synthetic-traffic-based simulations. Their energy costs are
therefore listed separately in Table 3.5.

Table 3.4: Optical Energy Parameters

Parameter                    Value   Unit     Source
Laser efficiency             30      %        (Morris, Jolley, and Kodi, 2014)
Receiver                     100     fJ/bit   (Morris, Kodi, Louri, and Whaley, 2014)
Modulator                    100     fJ/bit   (Morris, Kodi, Louri, and Whaley, 2014)
Photodetector sensitivity    -26     dBm      (Morris, Kodi, Louri, and Whaley, 2014)
MR drop                      0.5     dB       (Kao and Chao, 2014)
Waveguide transmitting       1       dB/cm    (Morris, Kodi, Louri, and Whaley, 2014)
MR pass                      0.05    dB       (Kao and Chao, 2014)
Waveguide crossing           0.5     dB       (Morris, Kodi, Louri, and Whaley, 2014)

Table 3.5: Electronic Energy Parameters (45 nm Process)

Wavelengths (max)      20 (for 64 cores)             25 (for 400 cores)
Router/gateway size    5      6      8      20       5      6      8      21     25
Routing (pJ/packet)    8.69   9.83   11.99  20.86    12.12  13.51  16.40  33.51  45.74
Leakage (pJ/packet)    0.36   0.39   0.45   0.62     0.51   0.56   0.65   0.96   1.28
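To illustrate how the loss parameters of Table 3.4 might enter an energy estimate, the following sketch computes a worst-case optical link budget. This is a common textbook-style budget and is an assumption here: the text does not state the exact model used in the simulator, and the path composition in the example (number of MRs passed, crossings, waveguide length) is purely hypothetical.

```python
# Loss and device parameters copied from Table 3.4.
MR_DROP_DB = 0.5
MR_PASS_DB = 0.05
CROSSING_DB = 0.5
WAVEGUIDE_DB_PER_CM = 1.0
SENSITIVITY_DBM = -26.0
LASER_EFFICIENCY = 0.30


def path_loss_db(n_drop, n_pass, n_cross, length_cm):
    """Total insertion loss (dB) of a path, summing per-element losses."""
    return (n_drop * MR_DROP_DB + n_pass * MR_PASS_DB
            + n_cross * CROSSING_DB + length_cm * WAVEGUIDE_DB_PER_CM)


def laser_electrical_power_mw(loss_db):
    """Electrical laser power (mW) so that the detector still receives its
    sensitivity level after `loss_db` of attenuation (assumed budget model)."""
    optical_dbm = SENSITIVITY_DBM + loss_db
    optical_mw = 10 ** (optical_dbm / 10)
    return optical_mw / LASER_EFFICIENCY


# Hypothetical path: 2 MR drops, 48 MRs passed, 10 crossings, 2 cm of waveguide.
loss = path_loss_db(n_drop=2, n_pass=48, n_cross=10, length_cm=2.0)
print(f"loss = {loss:.2f} dB, laser power = {laser_electrical_power_mw(loss):.4f} mW")
```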
Figure 3.16(a) compares the average energy efficiency of the different ONoC schemes
with different data traces. The network configurations of these ONoC schemes are
the same as in Section 3.5.3. It can be seen that WRH-ONoC has the lowest energy
cost among the ONoC schemes, owing to the optical transmission of both intra-
and inter-subsystem traffic and the absence of end-to-end path reservation in advance.
The PNoC and HOME schemes consume much more energy, because they need the path
reservation and release processes in the electronic network. The ATAC and Firefly schemes
have relatively lower energy cost because electronic routing is only used for the local
communication. It is worth noting that WRH-ONoC consumes only about 44 pJ for
transmitting a packet, so the overall power consumption is only 0.676 W for 64 cores at
the peak data rate of 0.24 packets/cycle in these data traces.
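The 0.676 W figure can be reproduced from the per-packet energy under two assumptions that are not stated in the passage: that 0.24 packets/cycle is a per-core rate, and that the clock runs at 1 GHz. The sketch below is only a consistency check under these assumptions.

```python
def average_power_w(energy_pj_per_packet, packets_per_cycle_per_core,
                    cores, clock_hz):
    """Power (W) = energy per packet (J) x packets injected per second."""
    packets_per_second = packets_per_cycle_per_core * cores * clock_hz
    return energy_pj_per_packet * 1e-12 * packets_per_second


# Assumed: per-core injection rate and a 1 GHz clock (not given in the text).
power = average_power_w(44.0, 0.24, 64, 1e9)
print(f"{power:.3f} W")  # 0.676 W
```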
Figure 3.16(b) evaluates the energy efficiency of WRH-ONoC for a 400-core system
with the synthetic traffic patterns. It can be seen that WRH-ONoC consumes the
second least energy among the ONoC schemes when the average data rate is low.
Since packets in WRH-ONoC need to go through about 2.8 hops of λ-routers
and the larger crossbars in the gateways, the energy cost is a little higher than that of the Firefly
[Figure 3.16, panels (a) and (b): y-axis, Average Energy Efficiency (pJ/packet); x-axes,
(a) data traces (blackscholes, bodytrack, canneal, dedup, ferret, fluidanimate, swaptions,
vips, x264) and (b) average data rates of 1, 2, 5, 10, 12, and 15 Gbps/core, with saturated
points marked; bars for PNoC, HOME, ATAC, Firefly, and WRH-ONoC.]
Figure 3.16: Average energy efficiency with (a) 64 cores and data
traces; (b) 400 cores and synthetic traffics.
scheme, which uses only one hop of optical network and small electronic routers in the
local network. The ATAC scheme needs to transmit packets in an electronic local network
at both the source and destination sides, so its energy cost is a little higher. The PNoC and
HOME schemes consume much more energy due to the electronic path reservation
process in the large-scale mesh network. However, as the average data rate increases,
the energy cost for transmitting a packet also increases. That is because much
more static energy is consumed when the end-to-end packet delay increases and
the packets keep being buffered due to contention, even if the static energy cost is
low at the device level. Since WRH-ONoC achieves much lower communication delay
even at the average data rate of 15 Gbps/core, as shown in Figure 3.9, it consumes less
static energy, which increases only slightly when the average data rate increases from
1 to 15 Gbps/core. This also indicates that WRH-ONoC is much more energy efficient
when it is used in many-core processors that require high bandwidth capacity and low
communication delay, such as on-chip cloud computing (Lu, Fu, Wang,
Han, Yan, and Li, 2015) and on-chip storage systems (Morris, Jolley, and Kodi, 2014).
It is worth noting that, due to the limitations on wavelength conversion, it
is difficult to further reduce the energy cost of WRH-ONoC at present. However,
there are some prospective methods that can make a better tradeoff between the
communication performance and energy efficiency: (i) achieving an optimal subsystem
division, which can enable a higher proportion of intra-subsystem traffic; (ii) configuring
an optimal number of sibling gateways g, which can provide sufficient load balance and
decrease the network hierarchy; (iii) designing an optimal gateway structure, which
can reduce the energy cost while ensuring fast wavelength allocation.
3.6 Summary
In this chapter, an optical inter-core communication network, WRH-ONoC, is presented
for many-core processors to further improve the network scalability. It combines
hierarchical networking with wavelength reuse to achieve both high-performance
and scalable inter-core communication. WRH-ONoC has several remarkable advantages:
(i) reusing a limited number of optical wavelengths for large-scale many-core
processors; (ii) optical non-blocking local communication and high-throughput
hierarchical global communication; (iii) efficient unicast and multicast routing schemes.
Moreover, WRH-ONoC is a highly configurable architecture, e.g., in the number of sibling
gateways and the scale of λ-routers (i.e., the number of input/output ports),
through which it can achieve a better tradeoff between the communication performance
and the hardware cost according to the specific requirements of different applications.
Both realistic data traces and synthetic traffic patterns are used in the evaluation of
WRH-ONoC, and fair comparisons are made with existing ONoC schemes. Simulation
results demonstrate that WRH-ONoC achieves much lower end-to-end packet
delay and higher network throughput for both unicast and multicast communications.
For the unicast communication, a 46.0% reduction in the zero-load packet delay and
a 72.7% improvement in the maximal throughput are achieved compared with
other ONoC schemes when interconnecting 400 cores. For the multicast communication,
the zero-load delay is reduced by 63.9% and the maximal throughput
is increased by 8.4 times compared to the PNoC scheme with tree-based multicast
routing, even with only 5% of multicast traffic. Compared with existing ONoC
schemes, the overall hardware cost of WRH-ONoC is also reduced due to the decreased
requirement on buffer size, while the energy efficiency is much higher, especially with
data-intensive applications.
Chapter 4
DWRMR: A Dynamically-configured and Wavelength-Reused ONoC with Multicast Ring based Routing
Multicast communication, in which data packets are transmitted from one source
to multiple destinations at the same time, widely exists in many-core processors. Within
a short time period, multicast communication can occupy a large quantity of network
resources and lead to severe traffic congestion. Thus, even a low ratio
of multicast traffic can have a significant influence on the overall communication
performance of many-core processors. Generally speaking, the main challenges of
multicast communication in a many-core processor include satisfying the high
bandwidth requirement, reducing the delay of each packet and the difference among packet
delays, and reusing the multicast routing paths.
In this chapter, Optical Network on Chip is employed to solve the problems of
multicast communication, and a Dynamically-configured Wavelength-Reused and Multicast
Ring based routing architecture, named DWRMR, is proposed for the typical ONoC
architecture with mesh topology. First, the high bandwidth capacity of optical
channels and the high transmission speed of optical signals improve the communication
performance for multicast traffic in DWRMR. Second, different from the traditional
multicast routing schemes, a ring-based scheme is proposed to reduce the number of
multicast packet copies, in which the source core transmits the multicast packets
in the multicast ring, i.e., a dynamically established cyclic optical routing path, in the
single-send-multi-receive manner using a single wavelength. The same wavelength can
also be reused in the link-disjoint multicast rings. Third, for the interactive multicast
communication within the same multicast group, the established multicast ring can
be reused among the cores via the optical-token arbitration, which avoids setting up
exclusive multicast routing paths for each core.
The most important research problem in DWRMR is to reduce the number of
wavelengths required by all the multicast rings in the routing and wavelength allocation.
In this chapter, the multicast ring routing and wavelength allocation problem is
formulated as an integer linear programming problem, and a heuristic algorithm that is
able to accommodate more multicast rings under the wavelength limitation is proposed.
Simulation results indicate that DWRMR can reduce the average end-to-end packet delay
by more than 50% at a slight hardware cost, or require only half the number of wavelengths
to achieve the same performance, compared with existing schemes.
4.1 Motivation
4.1.1 Multicast Communication
Multicast communication, i.e., one-to-many communication, occurs intensively in many-core
processors due to the demands of cooperative computing systems and cache coherence
protocols (Eisley, Peh, and Shang, 2008; Rodrigo, Flich, Duato, and Hummel, 2008; Krishna,
Peh, Beckmann, and Reinhardt, 2011; Karkar, Mak, Tong, and Yakovlev, 2016). In multicast
communication, the same data packets from one source core need to be delivered
simultaneously to multiple destination cores. For example, in a 64-core system
running PARSEC benchmarks (Bienia, Kumar, Singh, and Li, 2008) with the MESI coherence
protocol (Hestness, Grot, and Keckler, 2010), the multicast traffic takes a large
ratio in each application as shown in Figure 4.1(a), e.g., the multicast packets account for
about 39% on average and 88% at maximum for each core in the blackscholes application.
Moreover, on average there are more than 16 destination cores in each multicast
communication in these applications. Even with only a small ratio of multicast
communication (1%) (Ma, Jerger, and Wang, 2012), without an efficient multicast scheme
the source core needs to transmit a large number of packets, one to each destination, in a
short time period, which can suddenly occupy a large number of network resources and
lead to severe traffic congestion. According to the analysis in (Abadal, Martinez, Alarcon,
and Cabellos-Aparicio, 2014), the intensity of multicast traffic and the number of
[Figure 4.1, panels (a) and (b): x-axis, PARSEC applications (blackscholes, bodytrack,
canneal, dedup, ferret, fluidanimate, swaptions, vips, x264); y-axes, (a) Percentage of
Multicast Traffic (%), with series "Total Average" and "Maximum for Specific Cores", and
(b) Interactive Traffic in Multicast Groups (%), with an "Average" label.]
Figure 4.1: Analysis of the multicast traffic in a 64-core system running
PARSEC benchmarks, (a) the ratio of multicast packets for each
core; (b) the average ratio of interactive multicast packets within the
same multicast group.
destinations in each multicast both increase with the increasing number of cores in the
many-core processors. Therefore, the design of high performance and energy efficient
interconnection architecture and routing protocol to support multicast communication
plays an important role in many-core processors.
Generally, the multicast communication in many-core processors introduces the
following challenges. (i) Low packet delay and high bandwidth capacity. Since multiple
data packets need to be transmitted from the source core to a group of destination
cores in each multicast communication, a large number of network resources, such as
routers and links, need to be occupied at the same time. Thus, the multicast
communication needs to achieve low packet delay and high bandwidth capacity, otherwise
it will lead to heavy traffic congestion when the routing paths are reserved for a long
time. (ii) Small difference in the packet delays to different destinations. In each
multicast communication, the distribution of the source core and the destination cores
has a significant influence on the packet delay, and the packet delays for different
destination cores can differ greatly. This difference in the packet delays should
be reduced, especially for the multicast communication caused by cache coherence.
(iii) Interactive multicast. It is worth noting that an important property of multicast
traffic is the interactive multicast within a certain group of cores, namely the cores
in a multicast group (including one source and multiple destination cores which can
exchange roles) have frequent interactive multicast communications among themselves.
For instance, several cores share the same cache line, and any of them can change it and
invoke the multicast communication for data update. As shown in Figure 4.1(b), more
than 60% of multicast packets are transmitted within the multicast group on average
for different applications. Thus, the efficiency of the multicast routing paths needs to
be improved, especially for the interactive multicast communication. In this chapter, ONoC
is employed to solve the challenging problems of multicast communication in many-core
processors. Existing research falls into two categories: multicast routing schemes
and multicast-enabled architectures.
4.1.2 Existing Multicast Routing Schemes
Multicast routing schemes are widely used in the communication network of many-core
processors (Samman, Hollstein, and Glesner, 2010; Ebrahimi, Daneshtalab, Liljeberg,
Plosila, Flich, and Tenhunen, 2014). Unfortunately, no multicast routing scheme has been
specifically proposed for ONoC. For the traditional mesh/torus-based
ONoC, only the XY routing is widely used (Shacham, Bergman, and Carloni, 2008;
Mo, Ye, Wu, Zhang, Liu, and Xu, 2010; Feng, Ye, and Xu, 2013; Chen, Gu, Chen, and
Zhang, 2013; Xie, Nikdast, Xu, Wu, Zhang, Ye, Wang, Wang, and Liu, 2013; Gu, Chen,
Yang, Chen, and Zhang, 2017; Fusella and Cilardo, 2017), and most existing research
focuses on the design of network architecture, unicast communication, floorplan, the
analysis of insertion loss/crosstalk noise, etc. In this thesis, the design principles of the
traditional multicast routing schemes used in the electronic-based NoC are analysed,
and their implementations in ONoC are discussed in the following. Even though
they were not originally proposed for ONoC, in order to make a fair comparison and make
the results more useful, their implementations in ONoC also employ wavelength
division multiplexing, which is a big improvement over their original designs. In the
literature, most of the existing multicast routing schemes are replication-based.
They can be generally classified into three categories: (i) Unicast-Based Multicast, in which
the source core generates multiple packet copies and forwards them separately to all
the destination cores, as shown in Figure 4.2(a); (ii) Tree-Based Multicast (Samman,
Hollstein, and Glesner, 2010), which employs a spanning tree as the routing paths, in
which the source core is the root and all the destination cores are leaves, and the
copies of a multicast packet are created in the intermediate routers when encountering new
branches of the multicast tree, as shown in Figure 4.2(b); (iii) Path-Based Multicast
(Ma, Jerger, and Wang, 2012; Ebrahimi, Daneshtalab, Liljeberg, Plosila, Flich, and
Tenhunen, 2014), in which the network is divided into several disjoint regions according to
the distribution of destination cores, and in each region a spanning path which locally
connects to all the destination cores is formed. Then the source core generates a separate
Figure 4.2: Replication-based multicast routing schemes for a 4 × 4
ONoC, (a) unicast-based scheme with exclusive wavelengths assign-
ment for each routing path; (b) tree-based and (c) path-based schemes
with optical splitters in the intermediate routers.
multicast copy for each region and transmits it along the predetermined routing path,
as shown in Figure 4.2(c).
However, there are some technological limitations on the optical devices in Optical
Network on Chip, such as the lack of optical buffering and optical processing devices,
the energy-consuming electronic-to-optical (E-O) and optical-to-electronic (O-E)
converters, and the dependence of the hardware cost and the power consumption on the number
of wavelengths used. Thus, the optical routing paths from the source core to the
destination core should generally be established before the data transmission. Therefore,
the replication-based multicast routing schemes cannot be used directly in ONoC. To
implement these multicast routing schemes in ONoC, exclusive wavelengths should be
assigned to the optical routing paths, or optical splitters should be used in the
intermediate routers (Morris, Jolley, and Kodi, 2014), for the parallel transmission of
the copies of multicast packets. In the examples given in Figure 4.2, it can be seen that
for the multicast communication between a source core {9} and the group of destination
cores {0,1,3,6,7,8,15}, the replication-based multicast routing schemes need to establish
multiple routing paths and transmit a separate optical packet on each routing path.
In Figure 4.2(a), the unicast-based scheme uses exclusive wavelength allocation for
different routing paths, and it requires as many as four different wavelengths. In
Figure 4.2(b) and (c), the tree-based and path-based schemes use some optical splitters in
the intermediate optical routers, and only a part of the optical signal is received when
encountering a destination core.
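The wavelength demand of the unicast-based scheme in this example can be illustrated with a small sketch. It assumes row-major core numbering in the 4 × 4 mesh and XY routing for every copy, neither of which is stated explicitly for Figure 4.2(a), and it uses the fact that copies sharing a directed link need distinct wavelengths, so the maximum link load is a lower bound on the number of wavelengths. All function names are illustrative.

```python
from collections import Counter

WIDTH = 4  # 4 x 4 mesh


def coords(node):
    """(x, y) position of a core under the assumed row-major numbering."""
    return node % WIDTH, node // WIDTH


def xy_route(src, dst):
    """Directed links of an XY route: move along X first, then along Y."""
    (x, y), (dx, dy) = coords(src), coords(dst)
    links = []
    while x != dx:
        nx = x + (1 if dx > x else -1)
        links.append(((x, y), (nx, y)))
        x = nx
    while y != dy:
        ny = y + (1 if dy > y else -1)
        links.append(((x, y), (x, ny)))
        y = ny
    return links


def max_link_load(src, dsts):
    """Largest number of unicast copies that share one directed link."""
    load = Counter(link for d in dsts for link in xy_route(src, d))
    return max(load.values())


print(max_link_load(9, [0, 1, 3, 6, 7, 8, 15]))  # 4
```

Under these assumptions the busiest link carries four copies, which is consistent with the four wavelengths mentioned for the unicast-based scheme.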
From Figure 4.2 it can be seen that the tree-based scheme and path-based scheme
can aggregate the multicast packets into the multicast tree and the multicast paths,
instead of transmitting a packet copy separately to each destination core. Thus, they
can reduce the traffic conflicts during the optical path setup process and the power
consumption of electronic-to-optical conversions. However, their performance and
efficiency are still constrained, since (i) their performance closely depends on the
distribution of the destination cores (Ebrahimi, Daneshtalab, Liljeberg, Plosila, Flich,
and Tenhunen, 2014). For example, when the destination cores are located in a small
region of the network, the tree-based multicast scheme will have a lot of branches in
this region, which requires large-fanout splitters, and the path-based multicast scheme
will unnecessarily generate four routing paths for a small region; (ii) the established
multicast trees/paths cannot be reused among the cores in the multicast
group for the interactive multicast traffic, since each tree/path is unidirectional from
a specific source core to the destination cores. When the other cores in the same
multicast group have interactive multicast traffic, separate multicast routing paths need
to be set up, e.g., core {0} cannot reuse the multicast routing paths of core {9}.
4.1.3 Existing Multicast-Enabled Architectures
By making use of the advantages of optical communication, some wavelength-routed
ONoC architectures are inherently multicast enabled. They employ a global optical
interconnection, such as a ring or a crossbar, and allocate a specific wavelength for
the communication between each pair of cores (O’Connor, 2004; Le Beux, Trajkovic,
O’Connor, Nicolescu, Bois, and Paulin, 2011). Thus, the multicast packets can be
transmitted from the source core to all the destination cores using different wavelengths
at the same time without blocking. The main advantage of these multicast-enabled
ONoC architectures is that they can provide all-optical non-blocking communication
for both unicast and multicast traffic. However, since separate wavelength channels
are allocated between the cores in these ONoC architectures, the number of required
wavelengths increases with the network size. Thus, they are limited by the
number of available wavelengths, and their hardware cost and power consumption
also depend on the number of wavelengths used. Therefore, these multicast-enabled
ONoC architectures are only preferable for small-scale ONoCs which connect tens of
cores with sufficient wavelength channels. For larger-scale ONoCs, hierarchical
interconnection and electronic-to-optical conversion can be used to extend these
architectures, but that also introduces extra hardware cost and energy overhead. Moreover,
with the fixed wavelength assignment scheme for all the connected cores, these ONoC
architectures generally cannot achieve high wavelength efficiency with highly variable
communication patterns.
4.1.4 Main Contributions of DWRMR
The motivation of this chapter is to design a high-performance and power-efficient
ONoC architecture and multicast routing scheme for many-core processors. Thus, a
Dynamically-configured Wavelength-Reused and Multicast Ring based routing architecture,
namely DWRMR, is proposed. The main advantages of the proposed ONoC
architecture and multicast routing scheme can be summarized as follows. (i) A cyclic
optical multicast routing path, namely the multicast ring, is dynamically established for
each multicast group to interconnect the source core with all the destination cores.
As shown in Figure 4.3(a), an optical multicast ring is established for the source core
{9}. (ii) The multicast packets can be transmitted from the source core in the manner
of single-send-multi-receive with a single copy and a single wavelength. As shown in
Figure 4.3(b), the source core {9} can directly transmit its multicast packets using
the allocated optical wavelength in the multicast ring, and the destination cores can
receive the multicast packets by accessing the multicast ring, while the other
intermediate nodes only pass the optical signals transparently. (iii) Benefiting from the
cyclic optical routing path, the established multicast ring and the allocated wavelength
can be reused among the cores in the same multicast group for the interactive multicast
traffic. For example, after the multicast of core {9}, core {0} can transmit its multicast
packets by applying for the multicast ring. The multicast ring reuse is conducted on
the basis of an efficient optical token ring arbitration (details in Section 4.3.2). (iv)
The wavelength utilization of optical links is considered in the routing and wavelength
allocation, and the same wavelength is reused in link-disjoint multicast rings. The
detailed routing and wavelength allocation algorithm will be given in Section 4.4.
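The idea of reusing one wavelength in link-disjoint multicast rings can be illustrated with a simple first-fit assignment. This is only a sketch of the reuse principle and not the heuristic algorithm of Section 4.4; the rings in the example are hypothetical, links are treated as undirected for simplicity, and all names are illustrative.

```python
def ring_links(nodes):
    """Undirected links of a cyclic path that visits `nodes` in order."""
    return {frozenset((a, b)) for a, b in zip(nodes, nodes[1:] + nodes[:1])}


def assign_wavelengths(rings):
    """First-fit assignment: each ring gets the lowest wavelength index not
    used by an already assigned ring that shares at least one link with it."""
    assignment = {}
    for name, links in rings.items():
        used = {assignment[other] for other in assignment
                if rings[other] & links}
        wavelength = 0
        while wavelength in used:
            wavelength += 1
        assignment[name] = wavelength
    return assignment


# Hypothetical rings in a 4 x 4 mesh with row-major core numbering.
rings = {
    "A": ring_links([0, 1, 5, 4]),
    "B": ring_links([2, 3, 7, 6]),   # link-disjoint from A
    "C": ring_links([1, 2, 6, 5]),   # shares a link with A and with B
}
print(assign_wavelengths(rings))  # {'A': 0, 'B': 0, 'C': 1}
```

Rings A and B share no link and therefore reuse the same wavelength, while ring C overlaps both and needs a second one.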
In general, the main contributions in this chapter can be summarised as follows:
• A multicast routing scheme based on the Optical Network on Chip, named
DWRMR, is proposed for the multicast communication in many-core processors.
The key idea of DWRMR is to combine the benefits of optical multicast ring
and wavelength reuse. Each multicast ring is a dynamically constructed cyclic
routing path to connect all the cores in a multicast group. Multicast packets are
transmitted in the manner of single-send-multi-receive using a single wavelength.
The same wavelength is reused in link-disjoint multicast rings.
Figure 4.3: Overview of the proposed multicast scheme, (a) the dy-
namically established multicast ring for routing; (b) the principle of
single-send-multi-receive transmission.
• A hierarchical ONoC architecture is designed, which utilizes an optical control
plane to achieve centralized routing and wavelength allocation considering the
global wavelength utilization and to conduct flexible configuration of optical mul-
ticast rings, and an optical forwarding plane to provide non-blocking transmission
for massive multicast packets.
• An efficient optical-token based multicast ring reuse scheme is devised for the
interactive multicast communication between the cores in the same multicast
group. Moreover, the optical token arbitration will not introduce extra hardware
cost for the optical control plane.
• The multicast ring routing and wavelength allocation scheme is formulated using
the Integer Linear Programming (ILP) model, and a heuristic algorithm is pro-
posed which takes both the wavelength utilization of optical links and the length
of multicast ring into consideration to accommodate more multicast rings with
limited wavelength channels in ONoC.
• The performance of DWRMR is evaluated through extensive simulations using
multicast traffic patterns from both synthetic distributions and realistic data
traces. The simulation results indicate that DWRMR can outperform existing
schemes in the communication performance and power efficiency.
4.2 Network Architecture
Traditional electronic-optical hybrid ONoC architectures configure the optical routing
paths hop by hop using an electronic control network. In contrast, this chapter proposes
a flexible ONoC architecture, DWRMR, which employs centralized routing and wavelength
allocation together with fast optical routing path configuration. From the perspective
of networking, the network architecture of DWRMR can be logically divided into three
planes: core plane, control plane, and forwarding plane, as shown in Figure 4.4. This
design brings the idea of software-defined networking (SDN) into ONoC for the purpose of
fast multicast ring construction. In general, the functions of the three planes are
independent, and thus their structures and communication processes can be designed and
optimized separately. Moreover, for the regularity of the floorplan and the simplicity
of routing computation for many-core processors, and without loss of generality, the
design of the DWRMR scheme in this chapter focuses on the widely used mesh-based ONoC.
In the network architecture, the core plane contains all the micro-cores that run the
computation applications, and it is the consumer of multicast services in DWRMR. The
control plane is the kernel part of DWRMR to construct the multicast rings and allocate
optical wavelengths. It collects the multicast requests from the core plane, conducts
the multicast ring routing and wavelength allocation considering the global wavelength
utilization, and dynamically configures the multicast rings in the forwarding plane. The
control plane consists of a centralized Multicast Ring Allocator (MRA) and a cyclic
optical arbitration channel. The MRA allocates a multicast routing ring and a carrier
wavelength for each multicast group from the perspective of global utilization in the
optical links. The arbitration channel uses a cyclic optical control channel to connect
every core with the MRA, and thus it can collect the multicast requests and distribute
the configuration packets in a very fast manner. The forwarding plane provides the
passive transmission for multicast packets from the source core to all the destination
cores along the configured multicast ring by using the allocated wavelength. The optical
routers in the forwarding plane are multicast enabled and they are configured according
to the allocation of the control plane. For the physical implementation of DWRMR,
the control plane and the forwarding plane can be integrated in different layers and
connected with the core plane using vertical links, e.g., through-silicon vias (TSVs). It is
worth noting that the optical data network can also be extended to other topologies,
such as torus, since the optical control network implements the centralized routing
Figure 4.4: Network architecture with three logical planes: core plane
(microcores), control plane (centralized multicast ring/wavelength allocation),
and forwarding plane (multicast packet delivery).
[Figure 4.4 labels: optical router, waveguide, micro-core, access point, vertical link,
Multicast Ring Allocator (MRA), arbitration channel.]
and wavelength allocation scheme and all the optical routers are connected to the
cyclic optical control channel. If the optical data network employs a different topology,
the MRA allocates the optical multicast ring and wavelength in the same way, according to
the physical connections of the new topology. In the following, the main components and
functions of the three planes are introduced.
4.2.1 Core Plane
In the core plane, each micro-core connects with the communication network through
a network interface (NI). From Figure 4.5(a), it can be seen that through the network
interface each core connects with an optical access point in the control plane and an
optical router in the forwarding plane. The network interface realizes the coordination
with the other two planes, such as sending the multicast routing requests to the control
plane and configuring the connected optical router in the forwarding plane to construct
the multicast rings, as shown in Figure 4.5(a). Moreover, in the network interface,
the electronic signals are modulated to optical signals in the electronic-to-optical (E-O)
converter when the source core sends the multicast packets, and the optical signals are
converted back to electronic signals in the optical-to-electronic (O-E) converter when the
destination core receives the multicast packets.
Figure 4.5: Main components and communication process in (a) the
core plane (for the local configuration), and (b) the control plane (for
the centralized routing and wavelength allocation).
Therefore, as shown in Figure 4.5(a), each NI consists of a local multicast table, a
configuring unit, a laser power adjustment unit, and two groups of E-O and O-E con-
verters. The local multicast table records the information of established multicast rings
which involve the current core, i.e., the addresses of the cores in each multicast group
and the carrier wavelength allocated. The configuring unit controls the connection
state of the corresponding optical router in the forwarding plane to establish/release
the multicast ring. The laser power adjustment unit regulates the power intensity of
optical signal according to the number of destinations before E-O conversion to im-
prove communication reliability and power efficiency (Fu, Han, Li, and Li, 2010). It
is worth noting that in Figure 4.5(a) there are two groups of E-O and O-E converters,
which connect with the control plane and the forwarding plane, respectively. In the first
group, which connects with the control plane, each network interface uses a specific
wavelength to exchange control packets with the multicast ring allocator (MRA), so it
only requires one pair of MR-based E-O and O-E converters operating on a single
wavelength. In the second group, which connects with the forwarding plane, each network
interface must be able to transmit and receive data packets on multiple wavelengths, so
it needs multiple pairs of MR-based E-O and O-E converters.
4.2.2 Control Plane
The control plane is designed to improve the efficiency of optical interconnects and
wavelengths and achieve fast optical routing configuration. It mainly comprises an optical
arbitration channel and a centralized multicast ring allocator, as shown in Figure
4.5(b). The optical arbitration channel only transmits the control packets between the
network interfaces and the MRA, e.g., the multicast routing requests and the multicast
ring configuration and release packets. Thus, each network interface in the core plane
has an optical access point in the arbitration channel. For fast multicast ring
construction, the arbitration channel only needs to implement an N-to-1 and a 1-to-N
optical bus. In the N-to-1 optical bus, the network interfaces of all N cores use
different wavelengths to send multicast requests to the MRA in parallel, while in the
1-to-N optical bus, the MRA uses different wavelengths to simultaneously send
configuration packets to the network interfaces of the cores located on the allocated
multicast ring. Moreover, when the design is extended to larger-scale ONoCs without
sufficient wavelengths, multiple cores can share an access point, since the data rate of
control packets is much lower.
In the control plane, the MRA utilizes a centralized algorithm to allocate the multicast
ring and the carrier wavelength for each multicast group, whatever its number of
destinations and their distribution. The MRA consists of a global wavelength
table, a Hamiltonian table, and a multicast ring optimizer. The global wavelength table
maintains the wavelength utilization of each optical link, while the Hamiltonian table
records the connection of Hamiltonian cycle which is used as the primary multicast
ring, details in Section 4.4.2. The multicast ring optimizer further optimizes the multi-
cast ring by considering both the minimization of the maximal wavelength utilization
in each optical link and the length of each multicast ring routing path, to accommodate
as many multicast rings as possible with the limited number of wavelengths. When a
multicast ring is allocated, the MRA updates the global wavelength table and sends
out the configuration packets to all the cores which are located in the multicast ring.
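The bookkeeping role of the global wavelength table can be illustrated with a minimal Python sketch. It is not the thesis implementation: the class name `GlobalWavelengthTable`, its methods, and the first-fit choice of the lowest common free wavelength are all assumptions made for illustration. It only shows how one availability flag per link and wavelength is enough to reserve a wavelength for a ring and to reuse that wavelength in link-disjoint rings.

```python
from typing import Dict, List, Optional, Tuple

Node = Tuple[int, int]      # (x, y) mesh coordinate
Link = Tuple[Node, Node]    # directed optical link between neighbouring nodes


class GlobalWavelengthTable:
    """Hypothetical model of the MRA's per-link wavelength bookkeeping."""

    def __init__(self, w_max: int) -> None:
        self.w_max = w_max
        # free[link][n] is True while wavelength n is unused on that link.
        self.free: Dict[Link, List[bool]] = {}

    def _flags(self, link: Link) -> List[bool]:
        return self.free.setdefault(link, [True] * self.w_max)

    def common_free(self, links: List[Link]) -> List[int]:
        """Wavelength indices that are free on every link of a candidate ring."""
        return [n for n in range(self.w_max)
                if all(self._flags(link)[n] for link in links)]

    def reserve(self, links: List[Link]) -> Optional[int]:
        """Reserve the lowest common free wavelength (first-fit is an assumption)."""
        candidates = self.common_free(links)
        if not candidates:
            return None
        for link in links:
            self._flags(link)[candidates[0]] = False
        return candidates[0]

    def release(self, links: List[Link], n: int) -> None:
        """Free wavelength n on all links of a torn-down ring."""
        for link in links:
            self._flags(link)[n] = True
```

With this model, two rings that share no link both receive wavelength index 0, while a ring that overlaps an existing one is pushed to the next free index or rejected when none is left.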
4.2.3 Forwarding Plane
The forwarding plane is a configurable wavelength-routed optical communication network
that provides passive transmission for multicast packets. It consists of multicast-enabled
optical routers and bidirectional optical links in an N×N mesh topology. As
shown in Figure 4.6, each optical router has five pairs of input/output ports to connect
with four neighbouring routers in different directions (north, east, south, west) and
a local core. The optical router utilizes the active microring resonators (MR) to im-
plement configurable optical switching for dynamically established multicast rings. As
introduced in Section 2.1.1, each MR in the router has a unique resonant wavelength
λr just like a wavelength-selective filter, and it can be tuned by using thermal-optical
Figure 4.6: The principle of multicast-enabled optical router, (a) the
router architecture for a single wavelength; (b) different switch status
by tuning the resonant wavelength of MR.
[Figure 4.6 labels: (a) input and output ports Core, N, E, S, W; MRs numbered 1 to 16;
routing matrix of inputs against outputs, with X marking port pairs that need no MR
tuning. (b) MR states: off-state, on-state, multicast-state (splitter).]
or electronic-optical effects (Batten, Joshi, Stojanovic, and Asanovic, 2012). It can be
seen from Figure 4.6(b) that if the resonant wavelength of the MR is tuned to mismatch
the optical input signal (off-state), namely λi ≠ λr, the optical signal continues
along the original waveguide; if the resonant wavelength is tuned to fully match
the input signal (on-state), namely λi = λr, the optical signal is coupled into the
MR and transmitted along the other waveguide. Moreover, to achieve multicast routing,
the resonant wavelength of the MR can be tuned to partially match the input signal
(multicast-state, mon), namely λi∩λr, so that only part of the optical signal is coupled
into the MR and the signal is output to both waveguides (Fu, Han, Li, and Li, 2010).
To provide wavelength-routed multicast communication, the MRs in the optical
router can be configured according to the routing matrix as shown in Figure 4.6(a).
For example, if the optical signal inputs from the east port and outputs to the west
port, MR {9} is tuned to on-state; if the optical signal inputs from the south port and
outputs to the connected core and the north port, MR {3} is tuned to multicast-state
and MR {16} is tuned to on-state. It can be seen that only MRs {1,2,3,4} may be tuned
to multicast-state, thus this multicast-enabled router only introduces a slight hardware
cost for tuning MRs. Some communications, e.g., from the north input to the west
output, do not need any MR to be tuned, so they are labelled {X} in the routing matrix.
Note that the optical router is configured by the connected network interface in the
core plane, as shown in Figure 4.5(a), thus the routing matrix is stored in the local
configuring unit for fast routing path configuration.
4.3 Communication Scheme
In the proposed DWRMR architecture, the communication scheme mainly involves two
important issues: the multicast ring based multicast communication, and the multicast
ring reuse within the multicast group.
4.3.1 Ring-Based Multicast Routing
In the DWRMR architecture, when a source core has multicast packets to transmit to
a group of destination cores, firstly it checks in the local multicast table of network
interface whether there already exists an established multicast ring connecting this
multicast group, labelled as (1) in Figure 4.5(a). If the multicast ring has been
established, i.e., the addresses of the source core and all the destination cores exactly
match with an existing multicast group in the local multicast table, the source core
can apply for it according to the multicast ring reuse scheme (details in Section 4.3.2);
otherwise, the source core needs to send a multicast request packet, which contains the
addresses of the source core and the destination cores, to the control plane to establish
a new multicast ring, labelled as (2) in Figure 4.5(a). It is worth noting that if the new
multicast group is a subset of an existing multicast group, it can also apply to reuse
the established multicast ring to further improve the efficiency of optical routing path
and wavelength.
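The lookup in the local multicast table described above can be sketched as follows. This is only an illustration under stated assumptions: the table layout (a dictionary from a frozen set of core addresses to a wavelength index) and the function name `find_reusable_ring` are hypothetical and do not come from the thesis. The sketch returns an established ring when the requested group matches it exactly or is a subset of it, and returns `None` when a request has to be sent to the control plane.

```python
from typing import Dict, FrozenSet, Iterable, Optional, Tuple

# Hypothetical layout: multicast group (set of core addresses) -> carrier wavelength index.
LocalMulticastTable = Dict[FrozenSet[int], int]


def find_reusable_ring(
    table: LocalMulticastTable, source: int, destinations: Iterable[int]
) -> Optional[Tuple[FrozenSet[int], int]]:
    """Return (group, wavelength) of an established ring that can be reused."""
    group = frozenset({source, *destinations})
    if group in table:                      # exact match with an existing group
        return group, table[group]
    for existing, wavelength in table.items():
        if group < existing:                # new group is a subset of an existing one
            return existing, wavelength
    return None                             # a new ring must be requested from the MRA
```

Because the source core is part of the requested group, an intermediate core that merely sits on a ring never matches, which agrees with the rule that intermediate cores cannot reuse the ring.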
In the control plane, the multicast request packet is transmitted from the corre-
sponding optical access point of the source core to the multicast ring allocator through
the optical arbitration channel with a specific wavelength, as shown in Figure 4.5(b).
According to the spatial distribution of the destination cores and the global wavelength
utilization of optical links, the MRA calculates an optimized multicast ring as the rout-
ing path and allocates an optical wavelength by using a heuristic algorithm given in
Section 4.4.3. Note that in the allocated multicast ring, the source core connects to
all the destination cores with some intermediate cores, while the intermediate cores do
not belong to the multicast group and they cannot receive the multicast packets and
reuse the multicast ring. After the routing computation and wavelength allocation,
the MRA reserves the allocated wavelength of optical links passed by the multicast
ring in the global wavelength table. Then, it sends the configuration packets to the
network interface of all the cores that are located in the new multicast ring, including
the intermediate cores, in parallel via different wavelengths in the optical arbitration
channel to construct the multicast ring, and sends a notification packet, which contains
the addresses of the source and destination cores and the allocated wavelength, to the
cores in the multicast group to update their local multicast table.
When the configuration packet is received by the network interfaces of the cores that
are on the multicast ring, labelled as (3) in Figure 4.5(a), the local configuring
unit changes the interconnection state of the connected optical router to establish
the multicast ring, labelled as (4). According to the routing matrix in Figure 4.6(a),
there are three cases for the optical router configuration: (i) in the source router,
one MR is tuned to on-state to inject the optical multicast packets into the established
multicast ring, and another MR is tuned to on-state to collect the residual optical
signals after they have traversed the whole multicast ring; (ii) in the routers connecting
to a destination core, one MR is tuned to multicast-state so that the local core receives
the packets and another MR is tuned to on-state to pass them to the next-hop router;
(iii) in the intermediate routers, only one MR is tuned to on-state to pass the optical
signals directly to the next-hop router. Note that if the residual optical signals are
transmitted back to the source core during the multicast communication, they can be
used as a fault-tolerance indication that the optical links in the multicast ring are
working properly. Moreover, when the notification packet is received by the source and
destination cores, their local multicast tables will record the newly allocated multicast
ring and carrier wavelength.
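The three router configuration cases can be summarised in a short Python sketch. The names `MRState`, `router_settings`, `configure_ring` and the dictionary keys are hypothetical labels chosen for illustration; the actual MR to tune in each case is given by the routing matrix of Figure 4.6(a), which is not modelled here.

```python
from enum import Enum
from typing import Dict, Iterable, Sequence


class MRState(Enum):
    ON = "on"                # signal fully coupled to the other waveguide
    MULTICAST = "multicast"  # signal split between both waveguides


def router_settings(core: int, source: int, destinations: set) -> Dict[str, MRState]:
    """MR states needed in the router of one core on the ring (keys are illustrative)."""
    if core == source:
        # Case (i): one MR injects the packets, another collects the residual signal.
        return {"inject": MRState.ON, "collect_residual": MRState.ON}
    if core in destinations:
        # Case (ii): split a copy to the local core and pass the rest on.
        return {"drop_to_local_core": MRState.MULTICAST, "forward": MRState.ON}
    # Case (iii): intermediate router only passes the signal to the next hop.
    return {"forward": MRState.ON}


def configure_ring(
    ring: Sequence[int], source: int, destinations: Iterable[int]
) -> Dict[int, Dict[str, MRState]]:
    """Per-core MR settings for every router located on the multicast ring."""
    dests = set(destinations)
    return {core: router_settings(core, source, dests) for core in ring}
```

For a made-up ring of six cores `[0, 1, 2, 3, 4, 5]` with source 0 and destinations {2, 5}, this yields two tuned MRs in the source router, two in each destination router, and one in each of the three intermediate routers.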
After the multicast ring construction and the laser adjustment according to the
number of destination cores labelled as (5) in Figure 4.5(a), the multicast packets are
modulated to optical signals and transparently transmitted from the source core to all
the destination cores one-by-one in the forwarding plane. At the destination side, each
NI receives the multicast packet and sends it to the destination core directly after O-E
conversion, labelled as (6) in Figure 4.5(a). Finally, when the multicast communication
is over for all the cores in the multicast group, e.g., if a multicast ring is idle for more
than p cycles, the current source core which owns the multicast ring sends a teardown
packet to the MRA. Then, the MRA frees the reserved wavelength of all the passed
optical links in the global wavelength table and sends a release packet to all the cores
located on the multicast ring. Then, in the network interface of all the passed cores,
the local configuring unit releases the corresponding connection of optical router and
the local multicast table deletes the stored information of the multicast ring.
4.3.2 Multicast Ring Reuse
In DWRMR, the multicast ring reuse scheme is based on optical token ring arbitration,
and it is implemented in the established multicast ring itself without extra
hardware costs. The optical token represents the temporary ownership of the multicast
ring, and it is released when the previous source core has finished its multicast packets.
Therefore, as shown in Figure 4.7, the established multicast ring has two operating
states, multicast routing and optical token arbitration, which can be interchanged by
tuning the MRs. In the example of Figure 4.7, a multicast ring connects 10 cores, with the
source core {1} and 6 destination cores {3,5,6,7,9,10}. It can be seen in Figure
4.7(a) that when the multicast ring is in the multicast routing state, each optical router
that connects to a destination core is tuned with one MR in on-state and another in
multicast-state, while the intermediate routers are tuned with only one MR in on-state to
pass the optical signals. In this way, the multicast packets can be received by every
destination core. When the source core {1} has finished its multicast packets, it
multicasts a notification packet to all the destination cores to inform them that the
optical token arbitration will start. Then, the multicast ring shifts to the optical token
arbitration state. As shown in Figure 4.7(b), the NIs of the destination cores configure
their optical routers by (i) only tuning the MR that connects to the next-hop router to
on-state, to pass the optical token on, if the core has no multicast traffic, e.g., cores
{3,5,7,10}; or (ii) only tuning the MR that connects to the local core to on-state if the
core has multicast traffic, e.g., cores {6,9}. Hence, when the original source core {1}
injects the optical token, the first candidate encountered, core {6}, wins the token.
Then in the next clock cycle, core {6} becomes the new source core to transmit its
multicast packets and core {1} shifts to a destination core. Otherwise, if no core
competes for the optical token, it is transmitted back to the original source core, and
the source core waits for q cycles before multicasting a new notification for another
round of token hunting. This repeats until the source core has new multicast packets
or one destination core wins the token.
It is worth noting that due to the high speed of optical signals, the optical token
arbitration only lasts for dc clock cycles, which is determined by the length of the
multicast ring and the optical propagation speed. Regardless of whether the optical token
is finally received by a destination core or the original source, the multicast ring
automatically shifts back to the multicast routing state after dc cycles. Moreover, to
prevent a multicast ring from being occupied by one core for too long, the optical token
is released after being held for a fixed time interval. Hence, every core in the same
multicast group has an equal chance to use the multicast ring.
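The outcome of one round of optical token arbitration can be modelled in a few lines of Python. This is a behavioural sketch only: the function name `arbitrate_token` is hypothetical, timing (the dc and q cycles) is ignored, and the ring order 1 to 10 used in the example is an assumption about Figure 4.7. The token travels downstream from the current source, and the first destination core with pending multicast traffic captures it.

```python
from typing import Sequence, Set


def arbitrate_token(
    ring_order: Sequence[int],
    source: int,
    destinations: Set[int],
    wants_to_send: Set[int],
) -> int:
    """Return the core that owns the ring after one arbitration round."""
    start = ring_order.index(source)
    n = len(ring_order)
    for step in range(1, n):
        core = ring_order[(start + step) % n]
        # Only group members with pending traffic drop the token to their local core;
        # all other routers pass it on to the next hop.
        if core in destinations and core in wants_to_send:
            return core
    return source  # nobody competed, the token returns to the original source


# Example from the text: source {1}, destinations {3,5,6,7,9,10}, candidates {6,9}.
winner = arbitrate_token(list(range(1, 11)), 1, {3, 5, 6, 7, 9, 10}, {6, 9})
print(winner)  # 6
```

An intermediate core never wins in this model even if it is listed in `wants_to_send`, since it is not a member of the multicast group.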
Figure 4.7: Multicast ring reuse scheme within the multicast group,
by interchanging between two states: (a) multicast routing, and (b)
optical token arbitration.
4.4 Routing and Wavelength Allocation Algorithm
The key problem in DWRMR is the multicast ring routing and wavelength allocation
scheme. In this chapter, it is formulated using the integer linear programming (ILP)
model, and the optimization objective is to improve the wavelength efficiency by ac-
commodating as many multicast rings as possible under the wavelength constraints in
the optical links of an N×N ONoC.
4.4.1 Preliminary Definition
The DWRMR architecture, with N² cores connected in the N×N mesh-based optical
forwarding plane, can be modelled as a directed graph G(V,E), in which each node
vi in set V corresponds to the combination of core {i} and the connected router, and
each link eij in set E is the unidirectional optical interconnect from node vi to an
adjacent node vj. Note that there are two optical links in opposite directions between
neighbouring optical routers in the forwarding plane. Moreover, the address of each
node vi ∈ V is denoted using two-dimensional coordinates (xi, yi), where 0 ≤ xi, yi ≤
N−1. Then, the direct network connectivity from node vi to node vj, denoted by χij,
can be represented by the existence of link eij, i.e., χij = 1 if ∃eij ∈ E, and χij = 0
otherwise. Thus, in DWRMR the direct network connectivity is

$$\chi_{ij} = \begin{cases} 1, & |x_i \pm 1| = x_j,\ y_i = y_j \ \text{ or } \ x_i = x_j,\ |y_i \pm 1| = y_j; \\ 0, & \text{otherwise.} \end{cases} \tag{4.1}$$
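As a small check of Eq. (4.1), the connectivity indicator can be written as a Python function. The sketch uses the equivalent condition that the Manhattan distance between the two addresses is exactly 1, and it additionally requires both addresses to lie inside the N×N mesh; the function name `chi` and the explicit range check are assumptions made for illustration.

```python
from typing import Tuple

Node = Tuple[int, int]  # (x, y) address with 0 <= x, y <= N-1


def chi(vi: Node, vj: Node, n: int) -> int:
    """Return 1 if a directed link from vi to vj exists in an n x n mesh, else 0."""
    (xi, yi), (xj, yj) = vi, vj
    in_mesh = all(0 <= c < n for c in (xi, yi, xj, yj))
    # Neighbours differ by exactly one hop in x or in y, never in both.
    return int(in_mesh and abs(xi - xj) + abs(yi - yj) == 1)


# Count the directed links of a 4 x 4 mesh (two per pair of neighbouring routers).
nodes = [(x, y) for x in range(4) for y in range(4)]
print(sum(chi(a, b, 4) for a in nodes for b in nodes))  # 48
```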
In the optical network of the forwarding plane, suppose that each optical link and
optical router can support a maximum of Wmax wavelengths, and the bandwidth of
each wavelength channel (>10 Gbps) is sufficient for the multicast traffic. The set of
available wavelengths Λ = {λ1, λ2, ..., λWmax} can be used in each node vi∈V and link
eij ∈E. Each node can achieve non-blocking optical routing between different input
and output ports with the same wavelength by configuring different MRs as shown in
Figure 4.6, while each link can only use the same wavelength once. Let Wij = {ωijn}
indicate the usage status of each wavelength in link eij, i.e., if the nth wavelength λn is
available then ωijn = 1, otherwise ωijn = 0. Thus, the wavelength utilization of optical
link eij, denoted by uij, is

$$u_{ij} = 1 - \frac{1}{W_{max}} \sum_{n=1}^{W_{max}} \omega_{ijn}. \tag{4.2}$$
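Eq. (4.2) can be evaluated directly from the availability flags of one link. The following Python sketch assumes the flags are stored as a list of 0/1 values, one per wavelength, with 1 meaning available as in the definition of ωijn; the function name `link_utilization` is hypothetical.

```python
from typing import Sequence


def link_utilization(omega: Sequence[int]) -> float:
    """Wavelength utilization u_ij of one link, where omega[n] = 1 if wavelength n is free."""
    w_max = len(omega)
    return 1 - sum(omega) / w_max


# A link with Wmax = 4 wavelengths, two of which are still available.
print(link_utilization([1, 1, 0, 0]))  # 0.5
```

A fully free link therefore has utilization 0 and a link with no free wavelength has utilization 1.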
For the multicast communication in DWRMR, since each multicast packet has a
source core and multiple destination cores, and the interactive multicast packet can be
transmitted from any of these cores to the others, the multicast group is defined as an
arbitrary number of cores which have multicast communications, denoted by Mf . It is
worth noting that a multicast group is the basic object for the multicast ring routing
and wavelength allocation scheme. Moreover, the multicast group Mf is represented
as Mf ={Sf , Df (k)}, where Sf and Df (k) are the source core and k destination cores
at the current time period, respectively.
4.4.2 Multicast Ring Model
According to the proposed communication scheme, the multicast ring is modelled as a
unidirectional cyclic routing path in the form of an ordered node permutation, denoted
by Pf = 〈p0, p1, ..., pl−1, pl〉, where pi ∈ Pf is the ith node counting from the source core,
l is the length of the multicast ring in number of hops, and p0 = pl = Sf. The notation
eij ∈ Pf indicates that the multicast ring Pf passes the optical link eij, where pi, pj ∈ Pf.
Each multicast ring should satisfy the following constraints: (i) spanning cycle: the
multicast ring should connect all the cores of multicast group Mf with a limited number
of intermediate cores; (ii) single wavelength: all the optical links in a multicast ring
should use the same wavelength λf ; (iii) no link overlapping : each multicast ring can
pass an optical link at most once.
To construct as many multicast rings as possible with a limited number of wave-
lengths in the optical interconnects, the proposed routing and wavelength allocation
scheme mainly concerns the following two perspectives: (i) minimizing the maximal
wavelength utilization in each optical link, namely max(Wij); (ii) minimizing the length
of each multicast ring l. Thus, for each multicast ring Pf , an integer variable Xi is
defined to indicate the number of times that it passes the node vi; and an integer
variable Yijf is defined to indicate the number of times that it passes the link eij by
using the wavelength λf . Hence, the problem of multicast ring routing and wavelength
allocation for each multicast group Mf can be formulated by using an integer linear
programming model as follows:
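$$\text{Minimize} \quad \sum_{e_{ij} \in P_f} u_{ij}, \tag{4.3}$$

subject to

$$X_i \ge \begin{cases} 0, & \text{if } v_i \notin \{S_f, D_f(k)\}; \\ 1, & \text{if } v_i \in D_f(k); \\ 2, & \text{if } v_i = S_f; \end{cases} \tag{4.4a}$$

$$l = \sum_{v_i \in V} X_i - 1 \le N^2; \tag{4.4b}$$

$$\chi_{ij} = 1, \quad \text{if } v_i = p_n,\ v_j = p_{n+1}; \tag{4.4c}$$

$$\omega_{ijf} = 1, \quad \text{if } v_i = p_n,\ v_j = p_{n+1},\ \lambda_f \in \Lambda; \tag{4.4d}$$

$$0 \le Y_{ijf} \le 1, \quad \text{if } v_i = p_n,\ v_j = p_{n+1},\ \lambda_f \in \Lambda. \tag{4.4e}$$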
Note that the summation of the wavelength utilization of the optical links in the multicast
ring is used as the optimization objective, which takes the above two perspectives into
consideration. Constraints Eq. (4.4a) and Eq. (4.4b) indicate that each multicast ring
should start from and end at the source core and connect all the destination cores
within a limited length l. If the length of the obtained multicast ring exceeds N², the
Hamiltonian cycle, which passes all the cores in the network exactly once (introduced
in Section 4.4.3), is used as the multicast ring. Eq. (4.4c) means that any two
adjacent nodes in the multicast ring should be physically connected. Eq. (4.4d) is
the wavelength limitation that λf is available for all the links in a multicast ring. Eq.
(4.4e) states that the multicast ring cannot pass an optical link more than once.
Moreover, it can be seen that the above model aims to minimize the summation of
wavelength utilization of optical links passed by each multicast ring, instead of only
minimizing the multicast ring’s length or the maximal wavelength utilization, thus it
can prevent early wavelength exhaustion in some popular optical links. For example,
for a multicast group Mf = {va, vb, vc}, suppose a cyclic path P1 = 〈va, vb, vc, vd, va〉 has
the minimal length while path P2 = 〈va, vb, ve, vc, vf , va〉 has the minimal summation of
wavelength utilization, i.e., uab+ubc+ucd+uda > uab+ube+uec+ucf+ufa, and wavelengths λ1 and
λ2 are both available for P1 and P2. Then the routing and wavelength allocation model
selects P2 as the multicast ring and λ2 as the carrier wavelength. Therefore, with the
limited number of wavelengths, the proposed scheme can hold more multicast rings.
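The comparison between P1 and P2 can be reproduced with a minimal Python sketch of the objective in Eq. (4.3). It is not an ILP solver: it only scores a given list of candidate rings, checks that each ring starts and ends at the source, covers all destinations, and passes no link twice, and breaks ties by length. Mesh adjacency (Eq. 4.4c) and wavelength availability (Eq. 4.4d) are not checked. The function names and the utilization values in the example are made up for illustration.

```python
from typing import Dict, List, Optional, Sequence, Set, Tuple

Link = Tuple[str, str]


def ring_links(ring: Sequence[str]) -> List[Link]:
    """Directed links passed by a ring given as <p0, p1, ..., pl>."""
    return list(zip(ring, ring[1:]))


def ring_cost(ring: Sequence[str], u: Dict[Link, float]) -> float:
    """Objective of Eq. (4.3): sum of wavelength utilization over the ring's links."""
    return sum(u[link] for link in ring_links(ring))


def is_valid_ring(ring: Sequence[str], source: str, destinations: Set[str]) -> bool:
    links = ring_links(ring)
    return (ring[0] == source and ring[-1] == source
            and destinations <= set(ring)
            and len(links) == len(set(links)))   # no link passed twice


def select_ring(candidates, u, source, destinations) -> Optional[Sequence[str]]:
    valid = [r for r in candidates if is_valid_ring(r, source, destinations)]
    if not valid:
        return None
    # Lowest summed utilization first, shorter ring as the tie-break.
    return min(valid, key=lambda r: (ring_cost(r, u), len(r) - 1))


# Illustrative utilization values (assumed, not taken from the thesis).
u = {("a", "b"): 0.5, ("b", "c"): 0.75, ("c", "d"): 0.75, ("d", "a"): 0.5,
     ("b", "e"): 0.25, ("e", "c"): 0.25, ("c", "f"): 0.25, ("f", "a"): 0.25}
p1 = ["a", "b", "c", "d", "a"]
p2 = ["a", "b", "e", "c", "f", "a"]
print(select_ring([p1, p2], u, "a", {"b", "c"}))  # ['a', 'b', 'e', 'c', 'f', 'a']
```

With these values P1 costs 2.5 and P2 costs 1.5, so the longer ring P2 is selected because it avoids the heavily used links.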
4.4.3 Heuristic Solution
In this chapter, a heuristic algorithm is proposed for the multicast ring routing and
wavelength allocation model considering the computation efficiency, because the com-
putation complexity is related to the network size, the number of available wavelengths,
the number of destination cores, the distribution of source and destination cores, the
previous wavelength utilization, and the order of cores in the multicast ring. In general,
the ILP model for similar routing and wavelength allocation problems is NP-complete,
and a heuristic solution is commonly used to reduce the computation complexity (Gong,
Zhou, Liu, Zhao, Lu, and Zhu, 2013). In the proposed heuristic algorithm, the pre-
computed Hamiltonian cycle is used as the primary multicast ring, then the multicast
group Mf is mapped on it, and the connections between any adjacent nodes are heuris-
tically optimized by replacing them with the optical routing paths having the minimal
overall wavelength utilization. In this way, the computation complexity of the pro-
posed scheme can be significantly reduced. For instance, according to the simulations,
it only takes about 20 ms for allocating 100 multicast rings for a 16×16 ONoC with 16
randomly distributed cores in each multicast group, running on a standard computer.
As shown in Figure 4.8, the Hamiltonian cycle is a spanning cycle that passes all
the nodes in a given network exactly once. In an N×N ONoC, two adjacent nodes are
connected by two unidirectional optical links in opposite directions. When N is even,
there are two link-disjoint Hamiltonian cycles as shown in Figure 4.8(a); while when
N is odd, there is only one link-disjoint spanning cycle which needs to pass at least
one node twice. That is because in a cyclic channel the number of links in the positive
directions (x+ or y+) should be equal to the number in the negative directions (x− or
y−). According to the connection method of Hamiltonian/spanning cycle, it is used as
the primary multicast ring in the proposed algorithm. Generally, the current many-core
processors are interconnected in an N×N network with an even N (Morris and Kodi,
2010; Daya, Chen, Subramanian, Kwon, Park, Krishna, Holt, Chandrakasan, and Peh,
2014; Dinechin, Amstel, Poulhis, and Lager, 2014), thus the following only focuses on
this case and uses the Hamiltonian cycle as the primary multicast ring. For the case of
N×N network with an odd N , the routing and wavelength allocation scheme is similar
by using the spanning cycle.
It can be seen from Figure 4.8(a) that the two link-disjoint Hamiltonian cycles, in the
counter-clockwise and clockwise directions, are denoted by H− and H+, respectively.
They can be used alternately as the primary multicast ring in the routing and wavelength
allocation for load balance. Specifically, in the counter-clockwise Hamiltonian cycle
H−, the relationship between the address of any node (xi, yi) and its relative position
t counting from the source node (xs, ys) = (0, 0) can be represented as

$$t = \begin{cases} (N^2 - y_i)\,\%\,(N^2), & x_i = 0; \\ y_i \times (N-1) + x_i, & 0 < x_i < N,\ y_i\,\%\,2 = 0; \\ y_i \times (N-1) + N - x_i, & 0 < x_i < N,\ y_i\,\%\,2 = 1. \end{cases} \tag{4.5}$$
For example, in Figure 4.8(a), t = 2, 6, 9, 14 when (xi, yi) = (2, 0), (1, 1), (3, 2), (0, 2).
From Eq. (4.5), it can be seen that the relationship between the address of any node and
its relative position in the Hamiltonian cycle is uniquely determined. The relationship
in the clockwise Hamiltonian cycle H+, or counting from any other source node (xs, ys),
can be obtained by a reversing or shifting operation, such as the Hamiltonian cycle H+
starting from the source node (1, 2) as shown in Figure 4.8(b). Note that Eq. (4.5) can
be used to map the source core and all the destination cores in a multicast group to
the Hamiltonian cycle.
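The mapping of Eq. (4.5) can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, and modelling the shifting and reversing operations as modular arithmetic on t is an assumption (it is consistent with the example source node (1, 2) in Figure 4.8(b), but the text does not spell out the operation).

```python
def ring_position(x: int, y: int, n: int) -> int:
    """Relative position t of node (x, y) on H-, counted from (0, 0), per Eq. (4.5)."""
    if not (0 <= x < n and 0 <= y < n):
        raise ValueError("node address outside the N x N mesh")
    if x == 0:
        return (n * n - y) % (n * n)
    if y % 2 == 0:
        return y * (n - 1) + x
    return y * (n - 1) + n - x


def relative_position(x: int, y: int, xs: int, ys: int, n: int,
                      clockwise: bool = False) -> int:
    """Position of (x, y) counted from source (xs, ys).

    Assumption: shifting is a subtraction modulo N^2, and the clockwise
    cycle H+ traverses H- in the opposite direction.
    """
    offset = ring_position(x, y, n) - ring_position(xs, ys, n)
    if clockwise:
        offset = -offset
    return offset % (n * n)


if __name__ == "__main__":
    # The four examples given in the text for a 4 x 4 network.
    for node in [(2, 0), (1, 1), (3, 2), (0, 2)]:
        print(node, ring_position(*node, 4))
```

Because every node receives a distinct t in [0, N^2), sorting the destinations of a multicast group by t gives their order along the ring.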
The proposed heuristic algorithm works in three steps, as shown in Algorithm 2.
(i) Mapping and Dividing (lines 1-3). For a new multicast group Mf, the multicast ring
Pf is initialized as the Hamiltonian cycle H that starts from the source node Sf, while
H− and H+ are employed alternately for the purpose of load balance. The destination
nodes Df(k) are then mapped onto H at their relative positions according to
Eq. (4.5). Pf is thus naturally divided by the source node and the k destination nodes
into k+1 segments, Pf = {P0, ..., Pk}. (ii) Segment Optimization (lines 4-16). For
each segment path Pi in Pf, an alternative P∗ with the minimal overall wavelength
utilization u∗ is searched. Note that the overall wavelength utilization of a routing
path is defined as in Eq. (4.3). Firstly, all the alternative paths for Pi (not longer than
Pi) are collected into {P∗m} = {P∗1, P∗2, ...}, and each P∗m is compared with the current
optimized P∗. When an alternative P∗m has lower wavelength utilization than the
current optimized alternative, i.e., u∗m < u∗, then P∗m replaces P∗. Note that if multiple
alternatives have the minimal wavelength utilization u∗, the one with the minimal
length l∗m is selected as P∗. Then, the set of available wavelengths W∗i is obtained by
checking the common free wavelengths in ∀eij ∈ P∗. (iii) Assembling (lines 17-18). The
Algorithm 2: Heuristic Routing and Wavelength Allocation
Input: Mf = {Sf, Df(k)}, G(V, E), Wij = {ωijn}
 1  Initialization: H ← H− or H ← H+; h0 shifts to Sf;
 2  do l ← N^2; Pf ← H; p_0 = p_{N^2} = Sf; Map destinations: p_{t_i} ↔ Df(i), i ∈ [1, k];
 3  then Pf = {P0, ..., Pk}, where P0 = 〈p_0, ..., p_{t_1}〉, Pk = 〈p_{t_k}, ..., p_{N^2}〉,
        Pi = 〈p_{t_i}, ..., p_{t_{i+1}}〉 for i ∈ [1, k);
 4  while 0 ≤ i ≤ k do
 5      li ← ‖Pi‖; ui ← Σ_{eij∈Pi} Σ_{λn∈Λ} ωijn;
 6      Set targets: l∗ ← li; u∗ ← ui; P∗ ← Pi;
 7      Calculate alternative paths {P∗m} = {P∗1, P∗2, ..., P∗m};
 8      while P∗m ∈ {P∗m} do
 9          l∗m ← ‖P∗m‖; u∗m ← Σ_{eij∈P∗m} Σ_{λn∈Λ} ωijn;
10          if u∗m < u∗ then
11              l∗ ← l∗m; u∗ ← u∗m; P∗ ← P∗m;
12          if u∗m = u∗ then
13              if l∗m < l∗ then
14                  l∗ ← l∗m; P∗ ← P∗m;
15      P∗i ← P∗;
16      Check free wavelengths Wi = ∩Wij, eij ∈ P∗;
17  Assemble local optimization: Pf = {P∗0, ..., P∗k};
18  Select carrier wavelength: λf ∈ {W0 ∩ W1 ∩ ... ∩ Wk};
Output: Pf = 〈p0, p1, ..., pl−1, pl〉, λf
[Figure 4.8 panels: (a) H− and (b) H+ on a 4×4 mesh with corner nodes (0,0), (3,0), (0,3), and (3,3); nodes are labelled 0 to 15 by their position along the cycle; link directions are x+, x−, y+, y−.]
Figure 4.8: Hamiltonian cycles and the labelling scheme for a 4 × 4
ONoC, starting from the source core of (a) (xs, ys) = (0, 0) in the
counter-clockwise Hamiltonian cycle H−, and (b) (xs, ys) = (1, 2) in
the clockwise Hamiltonian cycle H+, respectively.
final multicast ring Pf is assembled from all the local optimizations, and the carrier
wavelength λf is randomly chosen from the intersection of the available wavelength sets
W∗i of all the segments. It is worth noting that if there is no available wavelength λf
for a multicast ring Pf in the assembling step, the calculated connection of Pf is stored
and waits for an available wavelength to be released, without the need to repeat the
routing computation.
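The segment optimization and assembling steps can be illustrated with the following Python sketch. It is not the implementation used in the thesis: the data layout (a dictionary `used` mapping each directed link to the set of wavelength indices already occupied on it) and all function names are assumptions, the alternative paths are taken as given rather than computed, and the carrier wavelength is picked as the smallest free index instead of randomly so that the example is deterministic.

```python
from typing import Dict, Iterable, List, Optional, Sequence, Set, Tuple

Link = Tuple[int, int]  # directed optical link (from_node, to_node)


def path_links(path: Sequence[int]) -> List[Link]:
    """Consecutive node pairs of a path."""
    return list(zip(path, path[1:]))


def utilization(path: Sequence[int], used: Dict[Link, Set[int]]) -> int:
    """Accumulated number of occupied wavelengths over all links of the path."""
    return sum(len(used.get(link, set())) for link in path_links(path))


def optimize_segment(segment: Sequence[int],
                     alternatives: Iterable[Sequence[int]],
                     used: Dict[Link, Set[int]]) -> List[int]:
    """Lines 5-15: keep the path with the lowest utilization, then the shortest."""
    best = list(segment)
    best_u, best_l = utilization(best, used), len(best) - 1
    for alt in alternatives:
        u, length = utilization(alt, used), len(alt) - 1
        if u < best_u or (u == best_u and length < best_l):
            best, best_u, best_l = list(alt), u, length
    return best


def free_wavelengths(path: Sequence[int], used: Dict[Link, Set[int]],
                     wavelengths: Iterable[int]) -> Set[int]:
    """Line 16: wavelengths that are free on every link of the path."""
    free = set(wavelengths)
    for link in path_links(path):
        free -= used.get(link, set())
    return free


def assemble(segments: Sequence[Sequence[int]], used: Dict[Link, Set[int]],
             wavelengths: Iterable[int]) -> Tuple[List[int], Optional[int]]:
    """Lines 17-18: join the segments and pick a wavelength free on all of them."""
    all_wavelengths = set(wavelengths)
    ring = [segments[0][0]]
    common = set(all_wavelengths)
    for seg in segments:
        ring.extend(seg[1:])
        common &= free_wavelengths(seg, used, all_wavelengths)
    carrier = min(common) if common else None  # None: wait for a release
    return ring, carrier
```

A `None` carrier corresponds to the case described above in which the computed ring is stored until a wavelength is released.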
The main advantages of the proposed algorithm are that (i) it converts a complicated
network-wide search problem into several parallel local optimizations,
thus it is more appropriate for ONoCs considering the hardware complexity and the
energy consumption; (ii) it reduces both the length of each multicast
ring and the maximal wavelength utilization of optical links in the network, thus it is
possible to accommodate more multicast rings with a limited number of wavelengths.
However, the drawback of this algorithm is that, after the segment optimization, it cannot
always guarantee that a wavelength is available for the whole multicast ring. Thus,
the multicast ring allocator needs to temporarily store the routing path of the multicast
ring, which leads to some extra hardware cost.
Furthermore, to evaluate the proposed heuristic algorithm, a simple comparison
with the optimal results is conducted on a small ONoC. In the comparison,
the network size of the ONoC is set to 4×4, and there are 64 wavelengths available
in the optical routers and optical links. In each set of comparisons, 20 multicast groups
are randomly generated and each multicast group has 4 to 8 cores uniformly distributed
in the network. The optimal result is obtained by enumerating all possible combinations
of routing and wavelength allocation for each multicast group, which yields
the minimum number of wavelengths among all the possible combinations. Over
three sets of comparisons, the average number of wavelengths required by the optimal
results is 9.3, while the average number required by the proposed
heuristic scheme is 12.7, i.e., about 37% more. However, even with such a small network,
each simulation run for the optimal results can last for up to 5 hours on a typical desktop.
4.5 Performance Evaluation
The communication performance of the proposed DWRMR architecture and ring-based
multicast communication scheme is evaluated through extensive simulations using
both synthetic multicast traffic and realistic data traces. Moreover, the ONoC
architecture with the traditional multicast routing schemes, namely unicast-based routing
(UM), tree-based routing (TM), and path-based routing (PM), is also compared
in the simulations.
4.5.1 Simulation Setup
To evaluate the performance of DWRMR and the other multicast schemes, a network-level
simulation platform is developed based on the Noxim simulator (Catania, Mineo,
Monteleone, Palesi, and Patti, 2016). The same parameter settings are used in the
simulations for all the multicast schemes to make a fair comparison. The parameter
settings are summarized in Table 4.1. The network size is set to 8×8 in mesh topology.
In each optical link, the number of available wavelengths is set from 16 to 64. The
channel bandwidth of all the optical devices, including the E-O and O-E converters
and the optical routers, is set to 10 Gbps/wavelength. The system clock works at the
frequency of 5 GHz. With this clock frequency, the propagation speed of the optical signal
is 8 routers/cycle (Liu, Yang, and Melhem, 2015). E-O and O-E conversions both take
1 cycle. Each multicast packet has a constant size of 64 bits, while multiple multicast
packets can be transmitted successively for a large volume of data. Moreover, in the
simulations, the same process is used for the optical path construction in all the schemes,
namely the centralized routing and wavelength allocation is also used for the other
multicast schemes, instead of a traditional hop-by-hop reservation in an electronic control
Table 4.1: Simulation Settings for DWRMR
Optical                                   | Electronic
Maximal wavelengths    16, 32, 64         | Clock frequency         5 GHz
Channel bandwidth      10 Gbps/λ          | Processing delay        2 cycles
Transmission speed     8 routers/cycle    | E-O conversion delay    1 cycle
plane. The processing delay for the centralized routing and wavelength allocation takes
2 cycles in all the schemes. Finally, to ensure the accuracy of simulation results, each
simulation run lasts for 500,000 cycles with a warmup of 10,000 cycles.
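The timing parameters above can be related to one another with a little arithmetic. The sketch below is only an illustration under the stated settings: the constant and function names are hypothetical, a packet is assumed to be serialized on a single wavelength, and rounding the propagation time up to whole cycles is an assumption rather than something stated in the text.

```python
import math

CLOCK_GHZ = 5.0            # system clock frequency
BANDWIDTH_GBPS = 10.0      # channel bandwidth per wavelength
PACKET_BITS = 64           # constant multicast packet size
ROUTERS_PER_CYCLE = 8      # optical propagation speed

# One clock cycle at 5 GHz lasts 0.2 ns.
cycle_ns = 1.0 / CLOCK_GHZ

# Serializing one 64-bit packet at 10 Gbps takes 6.4 ns, i.e. about 32 cycles.
serialization_ns = PACKET_BITS / BANDWIDTH_GBPS
serialization_cycles = serialization_ns / cycle_ns


def propagation_cycles(routers_traversed: int) -> int:
    """Cycles needed to traverse the given number of routers (rounded up)."""
    return math.ceil(routers_traversed / ROUTERS_PER_CYCLE)


if __name__ == "__main__":
    print(f"cycle time: {cycle_ns:.1f} ns")
    print(f"serialization: {serialization_ns:.1f} ns")
    print("a ring over all 64 nodes:", propagation_cycles(64), "cycles")
```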
In the simulations, the communication performance of the different multicast schemes
is evaluated in terms of the average end-to-end packet delay and the average network
throughput. The unicast communication is not considered in the simulations; it
can be implemented by using XY routing (Shacham, Bergman, and Carloni, 2008)
with a separate set of wavelengths. The end-to-end packet delay of multicast
communication is defined as the time interval from the source core generating a new
multicast packet until the packet has been transmitted in the forwarding plane to all
the destination cores. The network throughput is defined as the average volume of
multicast packets received in each core over a fixed time period. Both synthetic
and trace-based multicast traffic patterns are used in the simulations.
4.5.2 Synthetic-Based Simulations
In the synthetic-based simulations, the multicast traffic follows the following
distributions: (i) each core generates the multicast packets independently with an average
data rate of θ packets/cycle/core, which follows the Poisson distribution, θ∈ [0, 1]
(Ma, Jerger, and Wang, 2012); (ii) the number of destination cores k in each multicast
packet follows the Normal distribution, k ∼ N(µk, σk), where µk and σk are the expected
value and standard deviation of k; (iii) in each multicast packet, the source core and
the destination cores are distributed uniformly; (iv) the probability of an interactive
multicast packet, i.e., a multicast packet transmitted in the same multicast group
from a different core, denoted by τ, also follows the Poisson distribution, τ ∈ [0, 1]. It is
worth noting that the multicast traffic for each core is generated according to the above
distributions in advance and stored in a separate traffic file. The same multicast traffic
files are used to evaluate the performance of different multicast routing schemes.
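A traffic file of this kind could be generated along the following lines. This is a hedged sketch, not the generator used in the thesis: the function and field names are hypothetical, Poisson arrivals are modelled through exponential inter-arrival times, τ is treated simply as the probability that a packet reuses an earlier multicast group containing its source core, and k is rounded and clamped to a valid range.

```python
import random
from typing import Dict, FrozenSet, List


def generate_traffic(num_cores: int = 64, theta: float = 0.01,
                     mu_k: float = 16.0, sigma_k: float = 5.0,
                     tau: float = 0.1, horizon: float = 1000.0,
                     seed: int = 1) -> List[Dict]:
    """Return multicast packets sorted by generation time (in cycles)."""
    rng = random.Random(seed)

    # (i) Independent Poisson arrivals per core with rate theta packets/cycle.
    arrivals = []
    for src in range(num_cores):
        t = rng.expovariate(theta)
        while t < horizon:
            arrivals.append((t, src))
            t += rng.expovariate(theta)
    arrivals.sort()

    groups: List[FrozenSet[int]] = []  # multicast groups seen so far
    packets = []
    for t, src in arrivals:
        reusable = [g for g in groups if src in g]
        if reusable and rng.random() < tau:
            # (iv) Interactive packet: same group, different source core.
            group = rng.choice(reusable)
            interactive = True
        else:
            # (ii) k ~ N(mu_k, sigma_k), rounded and clamped to [1, num_cores - 1].
            k = max(1, min(num_cores - 1, int(round(rng.gauss(mu_k, sigma_k)))))
            # (iii) Destinations drawn uniformly from the other cores.
            others = [c for c in range(num_cores) if c != src]
            group = frozenset(rng.sample(others, k)) | {src}
            groups.append(group)
            interactive = False
        packets.append({"cycle": t, "src": src,
                        "dests": sorted(group - {src}),
                        "interactive": interactive})
    return packets
```

Fixing the seed reproduces the same packet list, which mirrors the use of identical traffic files for all routing schemes.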
[Figure 4.9 plots: (a) average end-to-end delay (ns) and (b) average throughput (Gbps/core), each versus the average data rate (Gbps/core), for UM, TM, PM, and DWRMR.]
Figure 4.9: Comparison with different multicast routing schemes,
unicast-based (UM), tree-based (TM), and path-based (PM), in the
average (a) end-to-end packet delay and (b) network throughput.
Comparison with Existing Multicast Schemes
In the first set of simulations, the communication performance of the different multicast
routing schemes is compared with the same number of available wavelengths. The
number of wavelengths in each optical link and optical router is set to 64 for the 8×8
ONoC. For each multicast communication, the number of destination cores follows the
Normal distribution N(16, 5), since according to the analysis of data traces the number
of destination cores for each multicast packet is about 16 (Hestness, Grot, and Keckler,
2010). Moreover, a ratio of τ = 0.1 of the multicast packets are interactive
multicast packets, which need to be transmitted in the same multicast group.
It can be seen from Figure 4.9 that the proposed scheme DWRMR outperforms
the other multicast routing schemes in both the average end-to-end packet delay and
network throughput at any data rate. In Figure 4.9(a), when the average data
rate is low, e.g., 1 Gbps/core, the average packet delay for DWRMR is only 30.0 ns,
whereas it reaches 44.0, 39.8, and 36.3 ns for the other three multicast schemes,
respectively. As the average data rate increases, the average packet delay of all
these schemes increases as well, since there are more conflicts on the routing paths and
wavelengths with more multicast traffic in the network. However, the delay of the DWRMR
scheme increases the most slowly. If the maximal acceptable multicast delay is defined
to be 120 ns, then the maximum data rates that can be tolerated in each scheme are
about 8.5, 11.5, 13.5, and 27.5 Gbps/core, respectively. That indicates DWRMR can
achieve more than twice the multicast capacity of the other multicast schemes with the same
requirement on the average packet delay by using the same number of wavelengths.
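As a quick check of this capacity claim, the tolerated data rates quoted above can be compared directly. The comparison assumes the four values are listed in the order UM, TM, PM, DWRMR, as in the legend of Figure 4.9:

```latex
\[
\frac{27.5}{13.5} \approx 2.04 \ (\text{vs. PM}), \qquad
\frac{27.5}{11.5} \approx 2.39 \ (\text{vs. TM}), \qquad
\frac{27.5}{8.5} \approx 3.24 \ (\text{vs. UM}).
\]
```

Under that assumption, even the smallest ratio is slightly above 2.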
The main reasons for the much lower average packet delay and higher maximal data
rate of the DWRMR scheme are: (i) the established optical multicast ring and the allocated
wavelength can be reused among the cores in the same multicast group, while
the established routing path cannot be reused in the other multicast schemes, so that
a new multicast routing path has to be established for each source core and each
multicast communication; (ii) the multicast ring based routing and wavelength
allocation scheme in DWRMR can decrease both the length of each multicast ring and
the accumulated wavelength utilization in all the optical links used in the multicast
ring, while the other multicast schemes greedily route the multicast packets along the
shortest path from the source core to the destinations, which can lead to unbalanced
wavelength utilization, i.e., no available wavelength in some optical links.
In Figure 4.9(b), the average network throughput tends to increase linearly before
the network gets saturated due to the wavelength limitation, because in this region the
average data rate is below the multicast capacity of these routing schemes in the ONoC.
Similarly, DWRMR can achieve much higher network throughput than the other multicast
schemes when the average data rate further increases. The saturated throughput,
namely the maximal throughput achieved when the network is in saturation, can be 4.00,
3.95, and 3.31 times higher than that of the unicast-based, tree-based, and path-based
multicast routing schemes, respectively.
In terms of hardware cost, unlike the traditional ONoC architectures, DWRMR
employs a hierarchical network with an optical control plane for the centralized routing
and wavelength allocation, instead of using an electronic control network. Thus,
it only requires moderate hardware cost for recording the established optical multicast
rings and the wavelength utilization. Because the buffer dominates the hardware cost
and power consumption in the electronic network (Latif, Seceleanu, and Tenhunen,
2010), the required buffer size is used as an example for the hardware cost comparison.
For an 8 × 8 many-core processor, the electronic routers in traditional ONoC architectures
require in total at least 81920 bits for the input buffers (four 64-bit
packets each), while DWRMR requires only 76800 bits even for recording 100
optical multicast rings which connect all the cores. In both schemes, 14336 bits are
used to indicate the wavelength utilization in each optical interconnect. DWRMR
thus reduces the overall buffer space by 5.32% compared with the traditional ONoC
architectures. Moreover, the optical control channel requires 128 electronic-to-optical
and 128 optical-to-electronic converters, which only leads to slightly higher hardware
cost, as demonstrated in Table 3.3 in Chapter 3.
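The buffer figures above can be reproduced with simple arithmetic. In the sketch below, the decomposition of 81920 bits into 64 routers with five input ports each, and of 14336 bits into 224 directed mesh links with one bit per wavelength, are assumptions that happen to match the quoted totals; the text does not state them. The 76800-bit ring table is taken directly from the text.

```python
ROUTERS = 8 * 8
PORTS_PER_ROUTER = 5        # assumption: four mesh directions plus the local port
PACKETS_PER_BUFFER = 4      # four 64-bit packets per input buffer
PACKET_BITS = 64
input_buffer_bits = ROUTERS * PORTS_PER_ROUTER * PACKETS_PER_BUFFER * PACKET_BITS

DIRECTED_LINKS = 2 * (2 * 8 * 7)   # assumption: both directions of every mesh link
WAVELENGTHS = 64
wavelength_state_bits = DIRECTED_LINKS * WAVELENGTHS

RING_TABLE_BITS = 76800     # quoted in the text for 100 multicast rings

traditional_total = input_buffer_bits + wavelength_state_bits
dwrmr_total = RING_TABLE_BITS + wavelength_state_bits
reduction_pct = 100.0 * (traditional_total - dwrmr_total) / traditional_total

if __name__ == "__main__":
    print(input_buffer_bits, wavelength_state_bits)
    print(f"reduction: {reduction_pct:.2f}%")
```

The 5.32% figure is thus obtained only when the 14336 bits of wavelength state are counted on both sides.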
[Figure 4.10 plots: (a) average end-to-end delay (ns) and (b) average throughput (Gbps/core), each versus the average data rate (Gbps/core), for DWRMR with 16, 32, and 64 wavelengths.]
Figure 4.10: Performance evaluation with different number of avail-
able wavelength channels for DWRMR, in the average (a) end-to-end
packet delay and (b) network throughput.
Influence of the Number of Wavelengths
In this set of simulations, the influence of the number of available wavelengths
on the communication performance of DWRMR is evaluated. The maximal number of
available wavelengths Wmax is set to 64, 32, and 16 in the simulations, respectively.
It can be seen from Figure 4.10 that DWRMR achieves much lower average
packet delay and higher network throughput with more available wavelength channels.
Note that the most important property is that DWRMR achieves even better performance
with 32 wavelengths than the other multicast schemes with 64 wavelengths, as shown by
comparison with the simulation results in Figure 4.9. That indicates DWRMR has the ability to
efficiently utilize the limited wavelength channels to accommodate more multicast
communications. Moreover, if the many-core system has a fixed performance requirement
on the multicast communication, DWRMR needs only half the number of wavelengths
required by the other multicast schemes. Thus, it can greatly reduce the demand for laser
sources, E-O and O-E converters, and microring resonators in the optical routers. Due
to the complexity of optical devices, this reduction is more significant than
the hardware cost of recording the allocated multicast rings and wavelengths for
different multicast groups.
Influence of Interactive Multicast Patterns
Since the DWRMR routing scheme has the advantage of reusing the allocated mul-
ticast rings and wavelengths for the cores in the same multicast group, in this set of
[Figure 4.11 plots: (a) average end-to-end delay (ns) and (b) average throughput (Gbps/core), each versus the average data rate (Gbps/core), for DWRMR with τ = 0.1, 0.3, 0.5, 0.7, 0.9 and for UM, TM, and PM.]
Figure 4.11: Performance evaluation of multicast ring reuse for the
interactive multicast traffic, τ = 0.1, 0.3, 0.5, 0.7, 0.9 in the average
(a) end-to-end packet delay and (b) network throughput.
simulations, the communication performance with different ratios of interactive multi-
cast traffic in the same multicast group is evaluated. In the simulations, the ratio of
interactive multicast traffic is set to τ = 0.1, 0.3, 0.5, 0.7, and 0.9. Moreover, the other
three multicast routing schemes are also compared with DWRMR when τ =0.5.
It can be seen from Figure 4.11: (i) For the DWRMR scheme, since the allocated
multicast ring and wavelength can be reused for the interactive multicast traffic within
the same multicast group, the average delay spent on the multicast ring allocation
and configuration is significantly reduced, and the probability of conflicts in
the wavelength allocation (namely no available wavelength for the multicast ring) is
also decreased. Thus, the higher the ratio of interactive multicast traffic, the more
significant the improvement in communication performance. According
to the simulation analysis, the average probability of wavelength conflicts, namely the
number of cases with no available wavelength after the multicast ring computation
relative to the number of successful routing and wavelength allocations, is about 0.423,
0.277, 0.149, 0.118, and 0.084 when τ = 0.1, 0.3, 0.5, 0.7, and 0.9, respectively. With fewer
wavelength conflicts, the average end-to-end packet delay decreases significantly, since the
multicast packets can be transmitted immediately after the multicast ring configuration
instead of waiting for a free wavelength. In Figure 4.11, the zero-load delay is
reduced from 30.85 ns to 16.04 ns when τ increases from 0.1 to 0.9, while the maximal
allowable data rate is more than doubled. (ii) For the other multicast schemes, the
established multicast routing path is released just after the source core finishes
its multicast packets, and the other destination cores cannot reuse it. Thus, the
interactive multicast traffic has little influence on their performance, as can be seen by
comparing the simulation results in Figure 4.11 and Figure 4.9. Note that when τ = 0.1, 0.3,
0.5, 0.7, 0.9, the maximal achievable data rates of DWRMR with a delay constraint of
120 ns are more than 2.2, 3.2, 4.6, 5.1, and 5.5 times higher than those of the other multicast
schemes, while the maximal network throughput is 3.4, 4.4, 5.9, 6.0, and 6.2 times
higher, respectively. (iii) When the ratio of interactive multicast traffic τ
increases from 0.1 to 0.5, the performance of multicast communication improves very
fast, while the improvement slows down when τ further increases from 0.5 to
0.9. That is because the probability of wavelength conflicts drops very fast when
τ is low, i.e., from 0.423 to 0.149 when τ increases from 0.1 to 0.5,
while the reduction slows down when τ exceeds
0.5, i.e., only from 0.149 to 0.084 when τ further increases from 0.5 to 0.9.
4.5.3 Simulation with Data Traces
In the trace-based simulations, the multicast communication is filtered from the inter-core
communication of a 64-core system running the PARSEC benchmark (Hestness, Grot,
and Keckler, 2010). If a core transmits multiple packets of the same type to
different destination cores in successive clock cycles, then it is considered as a
multicast communication. The addresses of the source core and all the destination
cores are recorded in the multicast trace files. Similarly, the same multicast trace files
are used in the simulations of different multicast routing schemes. Figure 4.12 gives
the simulation results for different multicast schemes with the realistic data traces.
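The filtering rule described above can be sketched as follows. The record layout and all names are hypothetical, since the trace format is not described in the text; "successive clock cycles" is interpreted here as a gap of at most one cycle between consecutive packets of the same source and type, and a run counts as multicast only if it reaches at least two distinct destinations.

```python
from typing import Dict, List, NamedTuple, Tuple


class TraceRecord(NamedTuple):
    cycle: int   # injection cycle of the packet
    src: int     # source core
    dst: int     # destination core
    ptype: str   # packet type (hypothetical field)


def extract_multicasts(records: List[TraceRecord]) -> List[Tuple[int, int, List[int]]]:
    """Return (first_cycle, source, sorted destinations) for each multicast."""
    open_runs: Dict[Tuple[int, str], list] = {}
    finished = []
    for rec in sorted(records, key=lambda r: r.cycle):
        key = (rec.src, rec.ptype)
        run = open_runs.get(key)
        if run is not None and rec.cycle - run[1] <= 1:
            # Same source and type in a successive cycle: extend the run.
            run[1] = rec.cycle
            run[2].add(rec.dst)
        else:
            if run is not None:
                finished.append(run)
            # Run layout: [first_cycle, last_cycle, destinations, source].
            open_runs[key] = [rec.cycle, rec.cycle, {rec.dst}, rec.src]
    finished.extend(open_runs.values())
    return sorted((first, src, sorted(dests))
                  for first, _last, dests, src in finished
                  if len(dests) >= 2)
```

Runs with a single destination are treated as unicast and dropped from the result.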
It can be seen from Figure 4.12 that the average packet delay is significantly reduced
in all the applications when using the DWRMR scheme. On average, the packet delays
achieved by the other multicast schemes are 108.2 ns, 48.9 ns, and 35.8 ns,
respectively, while DWRMR reduces the delay to only 14.9 ns, which is more than 50%
lower than even the best of the other schemes. That is because, as demonstrated in Figure 4.1(b), a large
proportion of multicast packets (>60%) in the many-core processor are interactive
multicast packets transmitted within the same multicast group, thus the advantage of
multicast ring reuse in DWRMR can be well exploited in these applications. The other
multicast schemes have no such ability and need to spend more time on establishing
the multicast routing paths. Thus, DWRMR can achieve much better performance
than the other multicast schemes.
[Figure 4.12 plot: average packet delay (ns) of UM, TM, PM, and DWRMR for the applications blackscholes, bodytrack, canneal, dedup, ferret, fluidanimate, swaptions, vips, and x264.]
Figure 4.12: Average multicast packet delay for different applications
in the trace-based simulations.
4.6 Summary
In this chapter, a multicast-enabled Optical Network on Chip architecture, DWRMR,
and the multicast ring based communication scheme are proposed to solve the problems
of multicast communication in many-core processors. DWRMR facilitates the
multicast communication by combining the wavelength reusable routing with the
dynamically established optical multicast ring. Compared with the traditional multicast
schemes, the proposed scheme has the following advantages: (i) the established
multicast ring can be reused within the same multicast group, thus it avoids the overhead
of setting up exclusive routing paths for each core in the multicast group when there
is interactive multicast traffic; (ii) each multicast ring is dynamically constructed,
so that the wavelength assignment is more flexible and the same wavelength can be
reused in link-disjoint multicast rings; (iii) the multicast packets are transmitted in
the multicast ring using only a single copy and a single wavelength, in a
single-send-multi-receive manner. The simulation results indicate that DWRMR can reduce the
zero-load delay by at least 47.4% and increase the maximal throughput by 4.9 times compared
with the other multicast schemes, with 50% of interactive multicast traffic in an 8×8 ONoC with
64 wavelengths. It is worth noting that when the acceptable packet delay is set to 120
ns, DWRMR requires only half the number of wavelengths (32 wavelengths) to
achieve the same performance as the other schemes.
Chapter 5
Dark-ONoC: A Dark Silicon Aware
ONoC Architecture
The many-core processor, with a large number of cores integrated in a single chip, is
becoming the mainstream computational platform for data center and cloud computing
systems. However, the restricted power budget drives the many-core processor into the
dark silicon era, in which only a small number of cores can be activated simultaneously,
while the others must stay in the dark state (power-gated). The dark silicon phenomenon
introduces significant and challenging problems for the design of the inter-core
communication network, such as the strict power limitation, increased average communication
distance, and highly variable distribution of communication requirements. Generally,
high bandwidth capacity and high power efficiency are both primary design goals for
the inter-core communication network of a many-core processor with dark silicon.
Optical Network on Chip (ONoC) is a chip-scale optical communication technology
with extremely low transmission delay, high throughput, and low power consumption
compared with the traditional electronic interconnects. In this chapter, a dynamically
configurable ONoC architecture named Dark-ONoC is proposed, which takes
dark silicon into consideration. In Dark-ONoC, the non-blocking optical routing
paths are only established between the active cores to improve the bandwidth capacity,
and the number of required wavelengths is reduced through wavelength reuse to
save power. Dark-ONoC utilizes a hierarchical network architecture which divides the
routing and wavelength allocation (RWA) and the optical data transmission into different
planes. The optical routing and wavelength allocation scheme is formulated as
a mapping problem from the logical interconnections of active cores to the optical
links and wavelengths in the ONoC. To decrease the number of required wavelengths, a
heuristic scheme is proposed which combines the wavelength utilization aware routing
and reusable wavelength allocation. Extensive simulations are conducted with the dark
silicon patterns from both synthetic distributions and real data traces. Simulation
results indicate that the number of wavelengths is reduced by about 15% and the overall
power consumption is reduced by 23.4% compared to existing schemes.
5.1 Motivation
5.1.1 Many-Core Processor in Dark Silicon
The many-core processor is currently becoming the mainstream platform for various data
center and cloud computing systems because of its powerful computation capability, achieved by
integrating a large number of cores in a single chip. At present, several commercially
available many-core processors integrate tens to hundreds of cores (Kalray, 2012;
Mellanox, 2013). Next-generation processors are expected to integrate thousands of
cores (Kelm, Johnson, Lumetta, and Patel, 2010; Kurian, Miller,
Psota, Eastep, Liu, Michel, Kimerling, and Agarwal, 2010; Nychis, Fallin, Moscibroda,
Mutlu, and Seshan, 2012). However, the increase of power consumption, especially for
the inter-core communication, is a major concern for many-core processors.
Generally, the overall power consumption of a many-core processor cannot exceed
its maximal cooling capacity and it is thus restricted by a power budget, namely the
Thermal Design Power (TDP) (Khdr, Pagani, Shafique, and Henkel, 2015); otherwise,
the performance and reliability will deteriorate due to thermal issues, such
as increased leakage power, reduced lifetime, and even breakdown. Because of the
restricted power budget and the limited cooling efficiency, dark silicon becomes an
inevitable reality for many-core processors (Esmaeilzadeh, Blem, Amant, Sankaralingam,
and Burger, 2011). With core-level power gating, dark silicon means that only a small
portion of the cores can be powered on (i.e., active cores) at the same time while the other
cores have to be power-gated or stay in a deep sleeping state (i.e., dark cores). Note that
even in a Snapdragon MSM8976 processor (in 28 nm process) with eight cores used in
a VIVO X6 mobile phone which was released in 2015, there are always two cores in
sleeping state (dark) to save power (Qualcomm, 2015). As the manufacturing process
continuously scales down, the proportion of dark silicon in a many-core processor will
become larger and larger in the future. It was predicted that at least 21% of the chip area
will be dark in the 22 nm process and that this proportion will increase to more than 50% in
the 8 nm process (Esmaeilzadeh, Blem, Amant, Sankaralingam, and Burger, 2011).
5.1.2 Properties of Dark Silicon
For a many-core processor with the core-level power gating scheme (i.e., turning on/off
each core separately), dark silicon can introduce several different properties for the
communication between the active cores (Taylor, 2013). In this chapter, dark silicon
pattern is used to illustrate the communication properties of dark silicon, and it refers to
the number of active cores (i.e., the ratio of dark silicon) and their spatial distribution
in the network (i.e., the dark silicon profile). Generally, in order to satisfy the overall
power budget and to achieve the thermal balance, the dark silicon pattern should have
the following properties.
(I) The maximal number of active cores is determined by the overall power budget
and the power cost of each core. To utilize the power more efficiently, some
application-specific cores with low power cost can be used to execute multiple commonly used
applications, instead of using general-purpose cores with high power cost for all the
applications, such as in GreenDroid (Swanson and Taylor, 2011) and QsCores (Venkatesh,
Sampson, Goulding-Hotta, Venkata, Taylor, and Swanson, 2011). In addition,
dynamic voltage and frequency scaling (DVFS) schemes can be used to reduce the power
cost of the cores at run-time by sacrificing some performance (Yan, Li, Han, Li, Guo,
and Liang, 2012). Generally, most existing many-core processors allow
each core to work at multiple frequencies dynamically. In this chapter, from the
viewpoint of inter-core communication, no matter what the variety of cores and their
working voltages and frequencies are, if a core is not in the powered-off or sleeping state,
it is considered as an active core which needs to use the communication resources.
(II) The group of active cores varies dynamically for executing and cooling down, and
their distribution should roughly follow the uniform distribution for thermal balance.
First, to prevent the active cores from being overheated, they can keep operating
for a time period of more than hundreds of seconds before the other dark cores are
activated. Thus, it is not necessary to reserve the communication resources, such as
routers and links, for all the cores all the time, and the communication resources should
be allocated dynamically for the active cores. Second, even though the dark silicon
pattern changes dynamically, it is relatively constant within a fixed time period, and the
active cores should be distributed evenly to achieve thermal balance, instead of being
concentrated in a small region, which would force the dark silicon pattern to change too frequently
(Rahmani, Haghbayan, Kanduri, Weldezion, Liljeberg, Plosila, Jantsch, and Tenhunen,
[Figure 5.1 panels: (a) an 8×8 mesh divided into groups G1 (active), G2 (dark), G3 (dark), and G4 (dark); (b) a mesh with numbered cores marked as active cores or dark cores.]
Figure 5.1: Two typical kinds of dark silicon patterns in an 8×8 mesh
based ONoC with 64 cores, (a) fixed pattern, with four equivalent
groups which are active in turn; (b) random pattern, with variable
number of active cores and spatial distributions.
2015). Thus, the average communication distance between the active cores is increased,
since there are several dark cores located between the active cores.
Thus, the dark silicon pattern has significant influence on the performance and
power consumption of inter-core communication. In this chapter, the many-core pro-
cessor is interconnected using an ONoC with widely-utilized mesh topology (because
of its simple floorplan and flexible configuration). Two typical kinds of dark silicon
patterns are considered: fixed pattern and random pattern, as shown in Figure 5.1.
In the fixed dark silicon pattern, as shown in Figure 5.1(a), all the cores are divided
into several equivalent groups with equal distance between two adjacent active
cores, and only one group is activated in each time period, as in (Bokhari, Javaid,
Shafique, Henkel, and Parameswaran, 2014; Yang, Liu, Jiang, Li, Yi, and Sha, 2016). In
this scenario, since the number and location of active cores are fixed in advance and all
the groups of cores are symmetrical, the optical routing paths and wavelength allocation
can be stored permanently and reused for the interconnection of different groups of cores. In the
random dark silicon pattern, as shown in Figure 5.1(b), the number of active cores and
their distribution are highly variable depending on the communication requirements
of different applications and different task mapping schemes. Since each core can be
power-gated separately, this pattern can obtain good overall performance and power
efficiency by activating an appropriate number of cores for different applications.
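The two kinds of patterns can be illustrated with a short Python sketch. The interleaved group layout (group index derived from x mod 2 and y mod 2) and all function names are assumptions made for illustration only; the text above states just that the groups are equivalent and that adjacent active cores are equally spaced.

```python
import random
from itertools import combinations


def fixed_pattern(n, group):
    """Active cores of one of four interleaved groups in an n x n mesh.

    Assumed layout (not taken from the thesis): the core at (x, y) belongs
    to group (x % 2) + 2 * (y % 2), so active cores are two hops apart.
    """
    gx, gy = group % 2, group // 2
    return [(x, y) for y in range(n) for x in range(n)
            if x % 2 == gx and y % 2 == gy]


def random_pattern(n, num_active, seed=None):
    """Pick num_active distinct cores uniformly at random."""
    rng = random.Random(seed)
    cores = [(x, y) for y in range(n) for x in range(n)]
    return rng.sample(cores, num_active)


def average_distance(active):
    """Mean Manhattan (hop) distance over all pairs of active cores."""
    pairs = list(combinations(active, 2))
    total = sum(abs(ax - bx) + abs(ay - by) for (ax, ay), (bx, by) in pairs)
    return total / len(pairs)
```

Under this assumed layout, each of the four groups in an 8×8 mesh contains 16 cores, and `average_distance` can be used to compare how far apart the active cores are in a fixed or a random pattern.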
5.1.3 Existing Research on Dark Silicon
To satisfy the power budget and thermal safety of many-core processors with dark
silicon, most existing research focuses on scheduling the operating voltages or frequencies
of cores, such as power gating and DVFS (Yan, Li, Han, Li, Guo, and Liang,
2012), and dynamically mapping each application to some specific cores that are uni-
formly distributed to prevent overheating of a chip area (Rahmani, Haghbayan, Kan-
duri, Weldezion, Liljeberg, Plosila, Jantsch, and Tenhunen, 2015). These schemes try
to increase the number of active cores and improve the overall performance through
parallel computing. However, they only consider the power consumption of computa-
tion in the cores, and may deteriorate the communication performance with increased
average distance between the active cores. This chapter focuses on the design of inter-
core network with high communication performance and high power efficiency, which
is complementary to the above schemes.
Existing research on power-efficient communication networks for dark silicon
employs only electronic interconnects. As introduced in Section 1.2.1, the traditional
electronic Network on Chip (NoC) connects every core with a router to provide
store-and-forward routing. If dark silicon is not considered in NoC, network
resources (such as buffers and wires) have to be provided for the interconnection of all
the cores, no matter whether they are dark or active. In this way, NoC takes at least 30% of the
overall power consumption in a many-core processor (Parikh, Das, and Bertacco, 2014).
Bokhari et al. proposed a dark silicon aware NoC scheme which divides the whole network
into multiple identical layers and uses different voltage-frequency configurations
in each layer (Bokhari, Javaid, Shafique, Henkel, and Parameswaran, 2014). Then, a
specific layer is activated according to the communication requirement and the power
budget. Similarly, FoToNoC uses a folded torus topology and divides the network
into four regions with identical communication distance. The cores in different regions
take turns to be activated to achieve good communication performance and thermal
balance (Yang, Liu, Jiang, Li, Yi, and Sha, 2016). Router Packing combines the power
gating of routers and the packet routing algorithm. It ensures the network connec-
tivity between active cores while turning off as many routers as possible by detouring
the packets (Samih, Wang, Krishna, Maciocco, Tai, and Solihin, 2013). DimNoC uses
the buffer-level power gating, since the buffer in each router takes more than 57% of
overall power consumption of the router (Samih, Wang, Krishna, Maciocco, Tai, and
Solihin, 2013). Generally, the above schemes are efficient for small-scale many-core
processors with tens of cores using electronic interconnects. The proposed scheme is
for large-scale many-core processors with hundreds of cores based on ONoC, where the
communication distance has little impact on the packet delay and power consumption.
5.1.4 Significance of Dark Silicon Aware ONoC
Optical Network on Chip is a promising communication architecture for many-core
processors with dark silicon. In the traditional ENoC, hop-by-hop routing and
store-and-forward transmission can lead to high power consumption. The increased
communication distance between active cores in dark silicon can greatly increase the
packet delay and the possibility of traffic congestion. ONoC can address the above
challenges by transmitting data packets in waveguides (optical medium) through the
modulated light signals with multiple wavelengths in parallel (Shacham, Bergman, and
Carloni, 2008; Li, Browning, Gratz, and Palermo, 2014; Liu, Zhang, Chen, Huang,
and Gu, 2015). By using ONoC, the average end-to-end delay can be reduced by
70% compared to the optimized electronic interconnects (Zhang and Louri, 2010), and
ideally the power efficiency can be improved by four times (Dokania and Apsel, 2009).
As introduced in Chapter 2, exploiting ONoC in many-core processors with dark
silicon offers enormous advantages: (i) ultra low transmission delay and power dissipation
that are almost independent of the distance between active cores; (ii) non-blocking
optical routing and high bandwidth capacity via wavelength multiplexing; (iii) high
reliability because of the low propagation loss and noise; (iv) area-compact optical
devices with flexible configuration for different dark silicon patterns.
However, due to the intrinsic properties of optical devices, the design of a high-performance
dark silicon aware ONoC also poses some challenges. First, unlike
electronic devices, optical devices lack optical buffers and optical processing logic, thus the
optical routing paths and the carrier wavelengths have to be allocated before the data
transmission. Since the active cores and their interconnections are different from time
to time but stay constant in a fixed time period, the dark silicon aware ONoC should
conduct the routing and wavelength allocation scheme periodically and configure the
optical routing paths rapidly. Second, although ONoC can achieve high bandwidth
capacity through wavelength multiplexing, the number of available wavelengths in an
optical link is limited. In addition, the overall power consumption and the hardware
cost directly relate to the number of wavelengths used in ONoC (Bahadori, Rumley,
Nikolova, and Bergman, 2016). For example, the power consumption of ONoC increases
with the number of wavelengths Nλ. (i) Since the laser source should provide
enough power intensity for each wavelength used, its output power should be higher
than Nλ×Pworst for each core, where Pworst is the worst-case power consumption between
two cores. (ii) Since there are Nλ×16N² MRs in total in the optical routers of an
N×N mesh-based ONoC, as introduced in Chapter 2.3, the static power for tuning the
resonant wavelength of MRs is proportional to Nλ as well. According to the analysis
in (Morris, Kodi, Louri, and Whaley, 2014), for an ONoC connecting 64 cores with 64
wavelengths, up to 77,000 MRs need to be used for the all-to-all interconnection, and
the power consumption for tuning the MRs is as much as 27.5 Watts, which is about
4.5 times the laser power.
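The two scaling relations above can be checked numerically with a minimal sketch. The function names are hypothetical, and the value passed for Pworst is a placeholder rather than a measured figure. The MR count follows the Nλ×16N² expression of this section; the figure of 77,000 MRs quoted from Morris et al. comes from their own analysis and is not reproduced by this expression.

```python
def total_mrs(n, n_lambda, mrs_per_switch=16):
    """MRs in an n x n mesh where each router has one switch per wavelength."""
    return n_lambda * mrs_per_switch * n * n


def min_laser_power_per_core(n_lambda, p_worst):
    """Lower bound N_lambda * P_worst on the laser output power for one core.

    p_worst is supplied by the caller; no particular value is implied here.
    """
    return n_lambda * p_worst


# 8 x 8 mesh with 64 wavelengths, using the 16-MR switch of this chapter
print(total_mrs(8, 64))
```

Both quantities grow linearly with Nλ, which is why reducing the number of wavelengths directly reduces the laser power and the MR tuning power.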
Thus, an ONoC architecture with fixed wavelength-based routing between all
the connected cores is not feasible for many-core processors with dark silicon. Instead, a
dark silicon aware ONoC architecture is needed, which allocates the optical routing
paths and wavelengths only for the active cores under different dark silicon patterns
to ensure the communication performance, while minimizing the number of required
wavelengths to reduce the overall power consumption.
5.1.5 Main Contributions of Dark-ONoC
The research objective of this chapter is to design an ONoC architecture with high
communication performance and low power consumption for many-core processors with
dark silicon. Thus, a dark silicon aware ONoC architecture, named Dark-ONoC, is
proposed. It establishes non-blocking optical routing paths only between the
active cores in different dark silicon patterns, using a reduced number of wavelengths.
Therefore, it can achieve low end-to-end delay, high network throughput, and low power
consumption for the communication between active cores in the many-core processor
with dark silicon. The main contributions can be summarized as follows:
• A hierarchical network architecture, Dark-ONoC, is designed to address the com-
munication issues of dark silicon for many-core processors. The control plane
periodically allocates the optical routing paths and wavelengths for active cores.
After the routing configuration by the control plane, large numbers of data packets can be
transmitted without blocking in the optical data plane, thereby achieving very
low communication delay, high network throughput, and low power consumption.
• The power consumption model of optical inter-core communication in relation
to the number of wavelengths is built for Dark-ONoC. The joint routing and
wavelength allocation problem is formulated as a mapping problem between
the logical interconnections of active cores and the optical links and wavelengths
in ONoC. The optimization objective is to minimize the maximum number of
wavelengths used in each optical link.
• A heuristic routing and wavelength allocation scheme is proposed. It combines
the wavelength utilization aware routing and reusable wavelength allocation, to
set up non-blocking optical routing paths only for the active cores and to reduce
the number of required wavelengths by balancing the wavelength utilization in
the network and reusing the same wavelength in the link-disjoint routing paths.
• Dark-ONoC with the heuristic routing and wavelength allocation scheme is eval-
uated through extensive simulations. Dark silicon patterns from both synthetic
distributions and data traces are used in the simulations. Compared with existing
schemes, Dark-ONoC reduces the number of wavelengths by around 15%, while the
overall power consumption is reduced by at least 23.4%.
5.2 Network Architecture
The key idea of Dark-ONoC is to periodically allocate non-blocking optical routing
paths only for the active cores with different dark silicon patterns, and to reduce the
number of required wavelengths via wavelength reuse in the link-disjoint routing paths.
The network architecture and main components are introduced in the following.
5.2.1 Hierarchical Network
To obtain fast configuration of optical routing paths and wavelengths for different dark
silicon patterns, Dark-ONoC employs a hierarchical network architecture which logi-
cally consists of three planes: electronic core plane, optical control plane, and optical
data plane, as shown in Figure 5.2. In general, the electronic core plane contains the
cores to realize parallel computing, while the manager core maintains the dark/active
states of the cores and conducts dark silicon pattern transition, namely waking up the
dark cores and powering down the active cores; the optical control plane maintains the
wavelength utilization of optical links in the network, and periodically constructs opti-
cal routing paths and allocates carrier wavelengths for the active cores in a centralized
manner; the optical data plane utilizes the configurable optical routers to provide non-
blocking data transmission between the active cores. Note that there are two control
components in Dark-ONoC: the manager core in the electronic core plane controls the
dark silicon pattern, and the routing and wavelength allocator in the optical control plane
controls the network configuration.
Different from existing ONoC architectures which use an electronic control network
and an optical data network (Shacham, Bergman, and Carloni, 2008), Dark-ONoC has
the following advantages: (i) it can provide fast optical routing configuration using the
optical control plane, namely configuring multiple intermediate routers on the routing
path using different wavelengths at the same time, instead of an electronic hop-by-
hop reservation and routing configuration. Thus, it can significantly reduce the path
configuration delay and avoid the power-consuming buffers in the electronic routers;
(ii) it conducts the routing and wavelength allocation in a centralized manner, using
the global wavelength utilization of the optical links, which can improve the efficiency
of optical links and wavelengths in the data network and reduce the number of required
wavelengths, thereby decreasing the overall power consumption.
Moreover, it can be seen that the network architecture of Dark-ONoC is similar to
the network architecture of DWRMR in Chapter 4, while the functions of corresponding
planes and the related components are different to realize different purposes. Similarly,
Dark-ONoC explores the wavelength reuse in the routing and wavelength allocation
scheme to reduce the number of used wavelengths. Without loss of generality, Dark-ONoC
employs the widely used mesh topology in the optical data network. It can
also be extended to other topologies, such as torus, by changing the constraints of
physical interconnects in the routing and wavelength allocation scheme accordingly. In
the following, the main functions of these three planes and their key components are
introduced separately.
5.2.2 Electronic Core Plane
All the cores are distributed in an N×N array, and every core can be switched between the
active and dark states independently under the control of the manager core, as shown in
Figure 5.2. Each core is connected with the network through a network interface (NI), as
shown in Figure 5.3(a). The key component in this plane is the manager core. Firstly, it maintains
the dark silicon pattern of the current time period, namely the number of active cores
and their locations. Secondly, the manager core supervises the dark silicon pattern
transition. To prevent the active cores from being overheated, they are only allowed to
continuously operate for a fixed time period. The manager core selects a new group of
dark cores and wakes them up to continue the operation. This is dark silicon pattern
transition.
Figure 5.2: Dark-ONoC architecture: electronic core plane with a
manager core for the dark silicon pattern transition; optical control
plane for the centralized routing and wavelength allocation, and fast
configuration of optical routing paths; optical data plane for the
non-blocking transmission of massive optical packets.
Moreover, the most important function of the manager core is to request
the optical routing paths and carrier wavelengths for all the active cores. It can be
seen from Figure 5.2 that the manager core is directly connected with the routing
and wavelength allocator (RWA) in the optical control plane through a vertical link.
After the dark silicon pattern is determined for a new time period, the manager core
will send the routing request containing the locations of active cores and their logical
interconnections to the RWA.
When the optical routing paths are configured between the active cores, the routing
information will be sent back from the RWA to all the active cores. Since in Dark-
ONoC the non-blocking wavelength-routed communication is provided, each routing
path is only determined by the carrier wavelength and the output port of the source
router. As shown in Figure 5.3(b), each active core maintains a local routing table in
its network interface, and utilizes the allocated wavelength and output port to trans-
mit data packets to a specific destination core. The detailed routing and wavelength
allocation scheme will be described in Section 5.4.
Figure 5.3: (a) Each core connects with an optical router through the
network interface (NI). (b) Local routing table for each active core
in the network interface, in which each routing path is related to a
wavelength and an output port.
Note that the electronic direct link is more power efficient for the communication
between neighbouring cores than the optical interconnects, because it avoids the
electronic-to-optical (EO) conversion in the source core and the optical-to-electronic
(OE) conversion in destination core (Shacham, Bergman, and Carloni, 2008). Thus,
the manager core will not request an optical routing path for neighbouring active cores,
and the network interface will use the electronic direct link for the packets between
neighbouring cores as shown in Figure 5.3(a). The increased hardware cost will not
be a major concern, because it needs no electronic router for routing and buffering
between two neighbouring cores.
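The behaviour of the network interface described above can be sketched as a simple lookup. The names `Route`, `is_neighbour`, and `select_link`, and the choice of keying the table by destination coordinates, are hypothetical and introduced only for illustration; the thesis specifies only that each table entry relates a destination to a wavelength and an output port.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    wavelength: int   # index k of the carrier wavelength lambda_k
    output_port: str  # "North", "West", "South" or "East"


def is_neighbour(a, b):
    """True if cores a and b are adjacent in the mesh (one hop apart)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1


def select_link(src, dst, routing_table):
    """Return ("direct", None) for neighbours, else ("optical", Route).

    routing_table maps a destination coordinate to its Route (assumed layout).
    """
    if is_neighbour(src, dst):
        return ("direct", None)
    return ("optical", routing_table[dst])
```

Neighbouring cores never appear in the table in this sketch, which mirrors the fact that the manager core does not request optical routing paths for them.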
5.2.3 Optical Control Plane
The optical control plane consists of a centralized routing and wavelength allocator and
a cyclic optical control channel to establish the optical routing paths. Firstly, according
to the request of logical interconnections from the manager core, the RWA calculates the
optical routing paths between any two active cores and allocates a specific wavelength
for each routing path. Since the RWA maintains the global wavelength utilization of
all the optical links, it considers both reducing the number of required wavelengths
in each optical link and decreasing the length of each routing path, namely balancing
the wavelength utilization in the whole optical network. Moreover, when the optical
routing paths are determined, the same wavelength is preferentially reused among the link-disjoint
routing paths, i.e., routing paths without any shared link. Secondly,
the RWA also configures the optical routers in the optical data plane through the
optical control channel to establish the optical routing paths.
The optical control channel provides high speed transmission for the control packets.
In Dark-ONoC, there are two kinds of control packets. (i) The dark silicon related
control packets are transmitted from the manager core to wake up a group of dark
cores or to shut down a group of active cores in the dark silicon pattern transition. (ii)
The routing related control packets are transmitted from the RWA to configure all the
optical routers which are located on the allocated optical routing paths. It can be seen
that all the control packets are transmitted from one source (the manager core or the
RWA) to multiple destinations (active/dark cores or optical routers), thus the optical
control channel uses a cyclic 1-to-n optical bus when there are n active/dark cores in the
network, as shown in Figure 5.2. There is a separate optical interface to connect with
every core in the electronic core plane and every router in the optical data plane, as
shown in Figure 5.4. The manager core and the RWA share the same optical interface,
since different kinds of control packets do not use the control channel at the same
time. For the non-blocking transmission of control packets, a specific wavelength is
assigned for each optical interface, namely the control packets to different interfaces
are transmitted using different wavelengths at the same time and each interface can
receive the right packets according to the wavelength, e.g., wavelength λ1 to λn in
Figure 5.4. Thus, the optical interface connected to the manager core and the RWA
has n electronic-to-optical (EO) converters with different wavelengths, and the other
interfaces only have an optical-to-electronic (OE) converter with a specific wavelength
listening to the control channel. If the number of cores exceeds the available number of
wavelengths, each interface can be shared by multiple cores/routers via time division
multiplexing (TDM) (Zhang, Gu, Yang, Chen, and Hao, 2014).
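The assignment of control wavelengths, including the TDM fallback, can be sketched as follows. The round-robin policy and the function name are assumptions made for illustration; the text above does not specify how interfaces are grouped when wavelengths are shared.

```python
def assign_control_channels(num_interfaces, num_wavelengths):
    """Map interface index -> (wavelength index, TDM slot).

    With enough wavelengths every interface gets its own wavelength and
    slot 0. Otherwise interfaces sharing a wavelength get distinct slots.
    The round-robin policy is an illustrative assumption.
    """
    return {i: (i % num_wavelengths, i // num_wavelengths)
            for i in range(num_interfaces)}
```

For example, with six interfaces and four wavelengths, the fifth interface shares the first wavelength but uses the second time slot, so no two interfaces ever receive on the same wavelength in the same slot.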
5.2.4 Optical Data Plane
In the optical data plane, optical routers are connected in an N×N mesh using bidi-
rectional optical links. Each optical router is configurable and able to transmit packets
with Nλ different wavelengths. For a specific wavelength, it can use an optical switch
with a separate set of MRs. As shown in Figure 5.5, the configurable optical router
only requires 16 MRs with the parallel injection from and ejection to the active cores.
Figure 5.4: The connection of optical interfaces in the optical control
channel from the RWA to cores and routers, with one manager core
and n cores using n wavelengths (λ1,...,λn).
The configuration of optical routers can be dynamically changed under the control of
the RWA by tuning different MRs according to the configuring matrix given in Figure 5.5.
MRs R1 to R8 are used to pass the packets to different neighbouring optical routers,
MRs M1 to M4 are used to send the packets from the core to four different directions
in parallel, and MRs D1 to D4 are used to receive the packets from four different
directions in parallel. For example, if an optical packet is transmitted from the north
input to the west output, then MR R6 is tuned on.
It can be seen from the configuring matrix that each core can simultaneously
send/receive different packets to/from different directions through the same wavelength
without blocking by tuning on different MRs. For a Dark-ONoC with N×N cores, the
optical data plane requires 16NλN² MRs in total, since each optical router has Nλ sets
of optical switches for Nλ wavelengths. To reduce the static power consumption, only
MRs in the optical routers located in the routing paths with allocated wavelengths will
be tuned on (with a delay of less than 0.1ns (Morris, Kodi, Louri, and Whaley, 2014)).
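The configuring matrix of Figure 5.5 can be encoded as a lookup table, as in the sketch below. The reading of "0" as a connection that needs no MR and of "-" as a connection that is not provided is an interpretation, and the names `CONFIG_MATRIX` and `mr_to_tune` are hypothetical.

```python
PORTS = ("Core", "West", "North", "East", "South")

# Rows are input ports, columns are output ports, in PORTS order.
# None stands for "-" (no connection); "0" is read here as "no MR needed".
CONFIG_MATRIX = {
    "Core":  (None, "M1", "M2", "M3", "M4"),
    "West":  ("D1", None, "R1", "0", "R4"),
    "North": ("D2", "R6", None, "R3", "0"),
    "East":  ("D3", "0", "R8", None, "R5"),
    "South": ("D4", "R7", "0", "R2", None),
}


def mr_to_tune(in_port, out_port):
    """Return the MR to tune on, or None if the connection needs no MR."""
    entry = CONFIG_MATRIX[in_port][PORTS.index(out_port)]
    if entry is None:
        raise ValueError(f"no connection from {in_port} to {out_port}")
    return None if entry == "0" else entry


print(mr_to_tune("North", "West"))  # the north-to-west example from the text
```

Counting the distinct MR names in the table gives the 16 MRs per wavelength stated for the configurable optical router.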
5.3 Communication Process
The dark silicon patterns are assumed to be varied periodically in this chapter. Thus,
the communication process can generally be divided into three steps: dark silicon
pattern transition, optical routing configuration, and optical data transmission.
MR configuring matrix (rows: input port I; columns: output port O):
I\O    Core  West  North  East  South
Core   -     M1    M2     M3    M4
West   D1    -     R1     0     R4
North  D2    R6    -      R3    0
East   D3    0     R8     -     R5
South  D4    R7    0      R2    -
Figure 5.5: The configurable optical router for a specific wavelength,
and the MR configuring matrix. Mi: MR for modulation; Di: MR
for photodetection; Ri: MR for routing.
5.3.1 Dark Silicon Pattern Transition
In Dark-ONoC, the dark silicon pattern transition is conducted periodically under
the control of the manager core. Thus, a fixed time period T is defined for each dark
silicon pattern. When a new time period Ti begins, the manager core first
determines the new dark silicon pattern by selecting a group of cores which were dark in
the previous time period Ti−1. Then, the manager core, through the optical control channel,
wakes up the selected group of dark cores and shuts down the previous group of active
cores to let them cool down. After the new dark silicon pattern is determined, the optical
logical interconnections between the new group of active cores are also determined.
Note that if two active cores are adjacent, they use the electronic direct link
and there is no optical logical interconnection between them. Then, the manager core
sends a routing request, which contains the addresses of all the active cores and their
logical interconnections, to the RWA in the optical control plane.
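The content of the routing request can be sketched as follows: every pair of active cores needs an optical logical interconnection unless the two cores are adjacent. The function name and the representation of cores as (x, y) tuples are assumptions made for illustration.

```python
from itertools import combinations


def logical_interconnections(active):
    """All unordered pairs of active cores that need an optical path.

    Adjacent cores (one hop apart) are skipped because they use the
    electronic direct link instead.
    """
    def adjacent(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1]) == 1

    return [(a, b) for a, b in combinations(sorted(active), 2)
            if not adjacent(a, b)]
```

With five active cores and no adjacent pairs, the function returns ten interconnections; each adjacent pair reduces that number by one.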
5.3.2 Optical Routing Configuration
When the RWA has received the routing request, it conducts the routing computation
and wavelength allocation according to the logical interconnections of active cores and
the global wavelength utilization of each optical link. Each logical interconnection is
mapped to a bidirectional optical routing path with a limited length, and a specific
carrier wavelength is allocated to the routing path. Note that the communication
from active core i to active core j employs the same routing path and carrier
wavelength as the communication from active core j to active core i, but
they use the optical links in opposite directions. To reduce the number of required
wavelengths, the same wavelength is reused in link-disjoint routing paths, i.e., optical
paths without any shared link, and the wavelength utilization is balanced in the
network. The routing and wavelength allocation scheme will be given in Section 5.4.
When the optical routing paths and wavelengths are allocated, the RWA configures
all the optical routers which are located in the routing paths through the optical control
channel. Since different wavelengths are used for configuring different optical routers
in the optical control channel, the configuration of optical routers can be conducted
at the same time. The connection state of each optical router is configured according
to the matrix in Figure 5.5, by tuning on the corresponding MRs. After the optical
routing configuration is finished, the RWA sends the information of routing paths and
wavelengths to the network interfaces of all the active cores.
5.3.3 Optical Data Transmission
For each active core, the optical routing path to another active core is determined only
by the carrier wavelength and the output port of the source optical router. The information
on wavelengths and output ports is maintained in the local routing table of the
network interface. When an active core needs to communicate with another active
core, it checks the local routing table to get the appropriate wavelength and output
port, and then converts electronic signals to optical signals with the selected wavelength.
Note that since the optical router has a separate MR-based modulator for each output
port, as shown in Figure 5.5, the source active core can directly modulate
the data using the MR of the corresponding output port and wavelength. The
optical data packets are then transmitted along the pre-configured optical routing path
to the destination core without blocking or intermediate processing. Similarly,
at the destination side the optical packets will be received by one of the input ports. The
MR-based photodetector for this input port with the corresponding wavelength will
filter the optical data packets out and convert them back to electronic signals. Moreover,
since non-blocking optical transmission is achieved by wavelength multiplexing,
Dark-ONoC can realize multicast communication by concurrently sending data packets
to different destinations with different wavelengths, just as in DWRMR in Chapter 4.
5.4 Routing and Wavelength Allocation Scheme
In this section, a motivating example is first given to introduce the principle of
routing and wavelength allocation. Then, the routing and wavelength allocation scheme
is formulated as a mapping problem. Finally, a heuristic algorithm is proposed to
allocate optical routing paths for the active cores and decrease the number of required
wavelengths through wavelength reuse.
5.4.1 Motivating Example
In the example shown in Figure 5.6, there are five active cores randomly distributed
in an 8×8 mesh-based ONoC. Each node in the figure refers to a core and its optical
router for brevity. In Dark-ONoC, since the dark silicon pattern is fixed in each time
period and an active core can communicate with any other active core during this
long period, the non-blocking all-to-all optical routing paths are established through
wavelength multiplexing between the active cores. Thus, in total ten bidirectional routing paths
need to be established between the five active cores. It is worth noting that if
two routing paths pass the same optical link, they have to use different wavelengths to
prevent wavelength conflict; otherwise, if they have no shared optical link, the same
wavelength can be reused in both routing paths, even if they pass through the same
optical router from different inputs and to different outputs.
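The wavelength conflict rule can be made concrete with a small sketch that assigns wavelengths to already chosen paths. It uses a generic first-fit policy, which is not the heuristic scheme proposed later in this section; it only demonstrates that paths sharing a link receive different wavelengths while link-disjoint paths may reuse one. Links are treated as undirected because a bidirectional path uses the same wavelength in both directions. All names are hypothetical.

```python
def links_of(path):
    """Undirected links traversed by a path given as a list of (x, y) nodes."""
    return {frozenset(pair) for pair in zip(path, path[1:])}


def first_fit_wavelengths(paths):
    """Give each path the lowest wavelength index free on all of its links."""
    used = {}        # link -> set of wavelength indices already on that link
    assignment = []
    for path in paths:
        links = links_of(path)
        busy = set().union(*(used.get(link, set()) for link in links))
        k = 0
        while k in busy:
            k += 1
        for link in links:
            used.setdefault(link, set()).add(k)
        assignment.append(k)
    return assignment
```

In the usage below, the second path shares a link with the first and gets a new wavelength, while the third path only crosses a router used by the second and reuses wavelength 0:

```python
p1 = [(0, 0), (1, 0), (2, 0)]
p2 = [(1, 0), (2, 0), (2, 1)]
p3 = [(0, 1), (1, 1), (2, 1), (3, 1)]
print(first_fit_wavelengths([p1, p2, p3]))
```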
In Figure 5.6(a), the principle of traditional routing schemes in electronic-based NoC
for dark silicon is to use the shortest routing paths to connect the active cores, and
then the routing paths are converged to reduce the number of intermediate routers as
in (Samih, Wang, Krishna, Maciocco, Tai, and Solihin, 2013). This scheme can obtain
low transmission delay for electronic-based NoC and only needs to pass a small number
of intermediate routers, e.g., it passes 9 intermediate routers and the average length
of routing paths is 4 hops. However, if it is used in ONoC, multiple optical routing paths
in Figure 5.6(a) need to use the same optical links in the network. To prevent
wavelength conflicts, 5 different wavelengths are required for non-blocking routing,
which can result in a high static power cost, as 14×5×18 = 1260 MRs need to be tuned on,
even though it only needs to power on 14 optical routers. Moreover, since some optical
links are overused in this scheme, it can only be employed under the assumption that
the number of active cores is small and there are sufficient wavelengths available.
Figure 5.6: Two different routing and wavelength allocation methodologies:
(a) minimizing the number of intermediate routers (9 intermediate
routers and 5 wavelengths) and reducing the length of routing
paths; (b) minimizing the number of required wavelengths (29 intermediate
routers and 1 wavelength).
In Dark-ONoC, to reduce the number of required wavelengths and the overall power
consumption, the design principle is that optical routing paths between the
active cores should share as few optical links as possible. As shown in Figure
5.6(b), with the same active cores, the optical routing paths take detour to avoid
overlapping in any optical link with each other, thus only one wavelength is needed in
this scheme. Accordingly, only 34×18 = 612 MRs need to be tuned on for the single
wavelength in 34 optical routers, which is less than half of the MRs used in the previous
routing and wavelength allocation scheme. In this way, the static power consumption
can be significantly reduced.
The downside of the second scheme is that it needs to pass through 29
intermediate routers and the average length of optical routing paths increases to 4.8
hops. However, this has little impact on the communication performance since
optical transmission is very fast in ONoC, e.g., 8 hops/cycle for an 8×8 mesh-based
ONoC (Liu, Yang, and Melhem, 2015), and the end-to-end transmission delay is still
very small even though the packets are not routed along the shortest optical routing
paths. Also, it can be seen that the number of additional MRs tuned on because of the extra
intermediate routers is much smaller than the number added by using more wavelengths.
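The MR counts of the two schemes in the motivating example can be reproduced with a one-line calculation. The function name is hypothetical, and the default of 18 MRs per router per wavelength is simply the figure used in the worked numbers of this example.

```python
def tuned_mrs(routers_on, n_lambda, mrs_per_router=18):
    """MRs to tune on: powered routers x wavelengths x MRs per router.

    The default of 18 is taken from the worked example in the text.
    """
    return routers_on * n_lambda * mrs_per_router


scheme_a = tuned_mrs(14, 5)   # shortest paths: 14 routers, 5 wavelengths
scheme_b = tuned_mrs(34, 1)   # detoured paths: 34 routers, 1 wavelength
print(scheme_a, scheme_b, scheme_b / scheme_a)
```

The count grows with the product of powered routers and wavelengths, so adding 20 routers while removing four wavelengths still lowers the total.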
5.4.2 Problem Formulation
The routing and wavelength allocation scheme is a fundamental problem in Dark-ONoC.
In this section, it is formulated as a mapping problem from the logical
interconnections of active cores to the optical links and wavelengths. The optimization
objective is to minimize the number of required wavelengths.
In Dark-ONoC, the optical routing paths with their specific wavelengths are configured
in the optical data network. Without loss of generality, it is assumed in the following
that $N^2$ cores are connected in an N×N mesh topology. Each core is attached
to an optical router, and the optical routers are interconnected by bidirectional (two
opposite directional) optical links. A set of wavelengths Λ = {λ1, λ2, ..., λn} is used
in each optical router and optical link, where n is the maximal number of available
wavelengths. Thus, the optical data network can be represented as an undirected graph
G(V,E,W), where the set of vertices V = {vi} includes all $N^2$ cores and their optical
routers; the set of edges E = {eij} includes all the optical links between two neighbouring
vertices vi and vj in an N×N mesh; and the set W = {ωijk} records
the wavelength utilization in each edge eij ∈ E. If the wavelength λk ∈ Λ is used in
eij, then ωijk = 1; otherwise ωijk = 0. Thus, the number of wavelengths allocated in
an optical link is $W_{ij} = \sum_{k} \omega_{ijk}$ for ∀eij. Moreover, for the mesh topology,
each vertex vi has a two-dimensional coordinate (Xi, Yi) denoting its location,
Xi, Yi ∈ [0, N−1], as illustrated in Figure 5.6.
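The graph representation above can be illustrated with a minimal Python sketch. The names build_mesh, link_usage, and omega are hypothetical and chosen only for this illustration; the thesis simulator itself is written in C++ and its data structures are not shown in the text.

```python
from itertools import product

def build_mesh(n):
    """Return (vertices, edges) of an n x n mesh.

    A vertex is its coordinate (X, Y); an undirected edge is a frozenset
    holding the two neighbouring coordinates.
    """
    vertices = list(product(range(n), repeat=2))
    edges = set()
    for x, y in vertices:
        if x + 1 < n:
            edges.add(frozenset({(x, y), (x + 1, y)}))
        if y + 1 < n:
            edges.add(frozenset({(x, y), (x, y + 1)}))
    return vertices, edges

def link_usage(omega, edge):
    """W_ij: number of wavelengths with omega_ijk == 1 on one link."""
    return sum(omega[edge])

n, num_wavelengths = 4, 3            # illustrative sizes, not from the thesis
V, E = build_mesh(n)
omega = {e: [0] * num_wavelengths for e in E}

# Mark wavelengths 1 and 3 (indices 0 and 2) as used on one link.
e = frozenset({(0, 0), (1, 0)})
omega[e][0] = 1
omega[e][2] = 1
print(len(V), len(E), link_usage(omega, e))
```

Under these assumptions a 4×4 mesh has 16 vertices and 24 links, and the marked link has W_ij = 2.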
In a specific time period, suppose there are Na active cores demanding optical
communications between each other in Dark-ONoC. Let Va be the set of active cores,
which is a subset of V, i.e., Va ⊆ V. To provide non-blocking optical communication,
separate routing paths should be configured between any two active cores. Thus, assuming
that no two active cores are neighbours, $\frac{N_a(N_a-1)}{2}$ bidirectional logical
interconnections in total require optical routing paths and wavelengths.
Let La = {lmn} be the set of logical interconnections between active cores, and let
Λa = {λmn} be the set of wavelengths allocated to La, vm ∈ Va, vn ∈ Va, Λa ⊆ Λ.
Hence, the active cores and their interconnections form a logical network, which can be
represented by ℜ(Va, La, Λa). Nλ denotes the number of different wavelengths in Λa
allocated to the interconnections in La. According to the communication scheme in
Section 5.3, if vm and vn are neighbouring vertices, which use a direct electrical link, lmn
is removed from ℜ(Va, La, Λa).
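As a small illustration of how the set La could be enumerated, the following Python sketch lists all pairs of active cores and drops the pairs that are mesh neighbours. The function name logical_interconnections and the example coordinates are hypothetical.

```python
from itertools import combinations

def logical_interconnections(active):
    """Enumerate unordered pairs of active cores that need an optical path.

    Pairs at Manhattan distance 1 are neighbours in the mesh and are
    skipped, since they are assumed to use a direct electrical link.
    """
    pairs = []
    for a, b in combinations(sorted(active), 2):
        if abs(a[0] - b[0]) + abs(a[1] - b[1]) > 1:
            pairs.append((a, b))
    return pairs

# Four active cores, two of which are neighbours.
active = [(0, 0), (0, 1), (2, 2), (3, 0)]
L_a = logical_interconnections(active)

# Sixteen active cores in an 8x8 mesh with no two of them adjacent.
active16 = [(x, y) for x in range(0, 8, 2) for y in range(0, 8, 2)]
L_a16 = logical_interconnections(active16)
print(len(L_a), len(L_a16))
```

In the first case one of the six pairs is removed, leaving five interconnections; in the second case all Na(Na−1)/2 = 120 pairs remain.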
Therefore, the routing and wavelength allocation scheme can be formulated as
a mapping problem, ℜ(Va, La, Λa) → G(V,E,W). The optimization objective is to
minimize the number of required wavelengths in the optical data network, namely
min(Nλ). Since in Dark-ONoC all the active cores are known and fixed in each time
period, the mapping of vertices from Va to V is determined. The mapping problem is
thus simplified to (La, Λa) → (E,W). Each logical interconnection lmn in La is mapped
to a routing path rmn in Ra and a specific wavelength λmn = λk, where Ra represents the
set of routing paths and rmn represents the sequence of vertices visited from vm
to vn. That is, $r_{mn} = \langle v_{p_0}, v_{p_1}, \ldots, v_{p_{h-1}}, v_{p_h} \rangle$,
where $v_{p_i} \in V$ is the ith hop, $v_{p_0} = v_m$, $v_{p_h} = v_n$, and h is the length
of routing path rmn. In addition, $e_{v_{p_i} v_{p_{i+1}}} \in E$ must hold for
each hop, i.e., every hop must follow an optical link.
The routing and wavelength allocation scheme can be formulated as the following
optimization model:
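\[ \text{Minimize } N_\lambda = \max(k), \quad \forall \omega_{ijk} \in W, \tag{5.1} \]
subject to
\[ T_i^{mn} = \begin{cases} 0, & \text{if } v_i \notin r_{mn}; \\ 1, & \text{if } v_i \in r_{mn}; \end{cases} \tag{5.2a} \]
\[ |X_m - X_n| + |Y_m - Y_n| \le h_{mn} \le 2N-1, \quad \forall v_m, v_n \in V_a; \tag{5.2b} \]
\[ \omega_{p_i p_{i+1} k} = 0, \quad \forall r_{mn} \in R,\ v_m, v_n \in V_a,\ \lambda_k \in \Lambda; \tag{5.2c} \]
\[ N_l = \frac{N_a(N_a-1)}{2}. \tag{5.2d} \]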
Note that the number of required wavelengths Nλ is the largest k among the used entries
ωijk ∈ W (those with ωijk = 1), which indicates that k wavelengths have been allocated
to the optical routing paths. Constraint Eq. (5.2a) requires that each vertex vi is passed
at most once in an optical routing path rmn, where $T_i^{mn}$ is the number of times vi is
passed in rmn. Constraint Eq. (5.2b) limits the routing flexibility: the length of each
routing path, denoted by hmn, lies between the distance from the source node vm to the
destination node vn and the largest distance between two nodes in an N×N mesh, i.e.,
2N−1. Constraint Eq. (5.2c) indicates that, on every link of the routing path, there must
exist an available wavelength λk ∈ Λ, namely $\omega_{p_i p_{i+1} k} = 0$. Once the
wavelength is allocated, $\omega_{p_i p_{i+1} k}$ is set to 1. Constraint Eq. (5.2d)
indicates that all the logical interconnections should be mapped to routing paths, where
Nl is the total number of mapped interconnections.
It is worth noting that each optical routing path consists of multiple optical links,
and multiple optical routing paths can pass through the same optical link by using different
wavelengths. Because a wavelength conflict must be prevented in every optical link of a
path, the number of required wavelengths does not always equal the maximal wavelength
usage over all the optical links; in general Nλ ≥ max(Wij). An example is shown in
Figure 5.6(a), where the maximal wavelength usage in the optical links is max(Wij) = 4
and the number of required wavelengths is Nλ = 5.
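The gap between Nλ and max(Wij) can be reproduced on a toy instance. The sketch below is not the configuration of Figure 5.6(a); it uses three hypothetical paths over three abstract links "a", "b", and "c", where every link is shared by only two paths but every pair of paths shares some link. The minimum number of wavelengths is found by brute force, which is feasible only for such tiny inputs.

```python
from collections import Counter
from itertools import product

def max_link_usage(paths):
    """max(W_ij): the largest number of paths passing through one link."""
    usage = Counter(link for path in paths for link in path)
    return max(usage.values())

def min_wavelengths(paths):
    """Smallest number of wavelengths such that any two paths sharing a
    link receive different wavelengths (exhaustive search)."""
    link_sets = [set(p) for p in paths]
    count = len(paths)
    for n_lambda in range(1, count + 1):
        for assignment in product(range(n_lambda), repeat=count):
            conflict_free = all(
                assignment[i] != assignment[j]
                or not (link_sets[i] & link_sets[j])
                for i in range(count)
                for j in range(i + 1, count)
            )
            if conflict_free:
                return n_lambda
    return count  # fallback: one distinct wavelength per path always works

paths = [["a", "b"], ["b", "c"], ["c", "a"]]
print(max_link_usage(paths), min_wavelengths(paths))
```

Here max(Wij) = 2, yet three wavelengths are needed, because the three paths conflict pairwise.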
5.4.3 Heuristic Routing and Wavelength Allocation Scheme
It can be seen from the problem formulation that the optimization complexity is related
to the number of active cores, their locations, the order in which the
interconnections are routed, the network size, and the maximal number of available
wavelengths. The joint problem of routing and wavelength allocation is NP-complete in
general, and heuristic schemes are often used (Yoo, Ahn, and Kim, 2003). In this section, a
heuristic routing and wavelength allocation scheme is proposed, which divides the joint
optimization problem into two sub-problems and solves them separately using
the wavelength aware routing scheme and the reusable wavelength allocation scheme.
Wavelength Aware Routing
The wavelength aware routing scheme maps the logical interconnections onto optical
routing paths while considering the potential wavelength utilization of the optical links
in the whole network. As shown in Figure 5.6(b), even though Nλ ≥ max(Wij) always holds,
if the wavelength utilization of the optical links can be balanced across the network,
namely by reducing max(Wij), the number of required wavelengths Nλ can also be decreased.
Thus, in the proposed scheme, the number of times each optical link is passed is
used as an important parameter for routing the logical interconnections between active
cores in the optical network, as shown in Algorithm 3.
For a specific dark silicon pattern with a fixed number of active cores, the wavelength
aware routing scheme operates in two steps. First, the active cores are sorted according
to their distances to the center of the network, denoted by di. For each active core vi,
its distance to the network center is
$d_i = |X_i - \frac{N-1}{2}| + |Y_i - \frac{N-1}{2}|$. Then, the wavelength
aware routing starts from the active cores nearest the center of the network. The main
reason is that in a mesh network the optical links near the center are
used more often than the links far from the center, and it is harder for
the active cores near the center to detour around the central links. Thus, these cores can
become the bottleneck of the wavelength aware routing scheme if they are not routed
first. When the routing computation starts from the active cores near
the network center, these cores occupy the central links first and leave the other
active cores to detour. This decreases the number of times the central
links are used by the optical routing paths and accordingly reduces the maximum
number of times any optical link is used in the whole network, which in turn
results in a smaller number of wavelengths from the wavelength allocation scheme.
Algorithm 3: Wavelength Aware Routing Scheme
Input:
V : the set of all cores in an N×N mesh, where vi ∈ V ;
E: the set of all links in an N×N mesh, where eij ∈ E;
Va: the set of active cores, where Va ⊆ V ;
Output:
P : the set of routing paths, Pij∈P is a path between two active cores vi and vj ;
χ: the set of link usage counts, where χij∈χ is the number of times link eij is used by
the paths in P ;
Operation:
1) For ∀vi ∈ Va, calculate its distance to the network center; its distance is represented
by $d_i = |X_i - \frac{N-1}{2}| + |Y_i - \frac{N-1}{2}|$;
2) Sort the active cores in Va by di in the ascending order;
3) Remove vi∈Va in sequence and apply a modified Dijkstra’s algorithm as below to
find out the shortest path to another active core ∀vj∈Va;
3-1) Find out the shortest paths from vi to the active cores in Va using the cost
function max(χmn), where edge emn is in the corresponding path Pij that is under
construction;
3-2) Any time an edge emn is added or deleted from a path Pij in the search of the
shortest paths, χmn is increased or decreased by one accordingly;
4) Repeat Step 3) until Va is empty.
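Steps 1) and 2) of Algorithm 3 can be sketched in Python as follows. The function names are hypothetical, and breaking ties between equally distant cores by coordinate is an assumption made here for determinism; the text does not specify a tie-breaking rule.

```python
def center_distance(core, n):
    """d_i = |X_i - (N-1)/2| + |Y_i - (N-1)/2| for a core at (X_i, Y_i)."""
    x, y = core
    c = (n - 1) / 2
    return abs(x - c) + abs(y - c)

def routing_order(active, n):
    """Active cores in ascending order of distance to the network center.

    Ties are broken by coordinate (an assumption of this sketch).
    """
    return sorted(active, key=lambda core: (center_distance(core, n), core))

order = routing_order([(0, 0), (1, 2), (3, 3), (2, 1)], n=4)
print(order)
```

For this 4×4 example, the two cores next to the center, (1, 2) and (2, 1), are routed before the corner cores (0, 0) and (3, 3).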
Second, the scheme searches for the shortest routing paths from each active core to the
other active cores using a modified Dijkstra’s algorithm. In the proposed scheme, the cost
of each edge is the wavelength usage of the optical link, namely the number of routing
paths passing through it. The traditional Dijkstra’s algorithm uses the sum
of the costs of all the links passed to calculate the shortest path between two nodes.
In this case, the cost in the traditional Dijkstra’s algorithm would be the accumulated
wavelength usage along the optical routing path. In the modified Dijkstra’s algorithm, by
contrast, the maximal wavelength usage among the optical links passed by the optical
routing path is used as the cost of the routing path. More precisely, for a routing path
Pij, if its edge emn has the maximum χmn among all the edges in Pij, then χmn is
the cost of Pij. When an edge emn is added to or deleted from a routing path
in the modified Dijkstra’s algorithm, χmn is increased or decreased by one accordingly.
The reasons for using the maximal wavelength usage of the optical links within the
routing path, instead of the accumulated wavelength usage of the routing path,
are as follows: (i) if the accumulated wavelength usage is used, the search can
return routing paths with a small number of hops but a high wavelength usage
on some intermediate edges, which leads to a high number of required wavelengths;
(ii) if the maximal wavelength usage of the optical links within the routing path
is used, the wavelength usage is balanced across the whole network, which can decrease
the number of required wavelengths. The resulting paths may have more hops, but this does
not increase the communication delay much because of the fast optical transmission.
Therefore, to distinguish it from the algorithm that
uses the accumulated wavelength usage of the routing path, the proposed algorithm is
referred to as the modified Dijkstra’s algorithm.
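The difference between the two cost functions can be seen in a small Python sketch. It is a simplified illustration rather than the thesis implementation: the link usage values χ are held fixed during the search (the incremental update of χmn in step 3-2 is omitted), the hop limit is ignored, and the graph, node names, and usage values are hypothetical.

```python
import heapq

def undirected(edges):
    """Build {node: {neighbour: usage}} from (u, v, usage) triples."""
    adj = {}
    for u, v, usage in edges:
        adj.setdefault(u, {})[v] = usage
        adj.setdefault(v, {})[u] = usage
    return adj

def dijkstra(adj, src, dst, combine):
    """Label-setting search where combine(path_cost, link_usage) gives the
    cost of the extended path: addition for the traditional algorithm,
    max for the modified (maximal usage) variant."""
    best = {src: 0}
    heap = [(0, src, [src])]
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == dst:
            return cost, path
        if cost > best.get(node, float("inf")):
            continue  # stale heap entry
        for nxt, usage in adj[node].items():
            new_cost = combine(cost, usage)
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                heapq.heappush(heap, (new_cost, nxt, path + [nxt]))
    return None

# Hypothetical link usage counts: a short route through a heavily used
# link A-B, and a longer route over lightly used links.
adj = undirected([("A", "B", 3), ("B", "D", 0),
                  ("A", "C", 1), ("C", "E", 1), ("E", "F", 1), ("F", "D", 1)])

accumulated = dijkstra(adj, "A", "D", lambda cost, usage: cost + usage)
maximal = dijkstra(adj, "A", "D", max)
print(accumulated, maximal)
```

With the accumulated cost the two-hop path through the heavily used link is chosen (cost 3), whereas with the maximal cost the four-hop detour is chosen (cost 1), which keeps the most loaded link off the path.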
To balance the wavelength utilization while preventing the optical routing paths from
becoming too long, in step 3-1) of Algorithm 3, if the sum of the hops already routed from
the source core and the hops remaining to the destination core equals 2N−1, namely the
largest distance between two cores in an N×N mesh, then the next hop is chosen only from
the cores on a shortest path, i.e., cores one hop closer to the destination, by
using χmn. It is worth noting that the proposed wavelength aware routing scheme aims
to decrease the number of required wavelengths by reducing the maximal wavelength
usage of the optical links in the network, namely by balancing the wavelength utilization.
In this case, there may exist multiple routing paths between the same two active cores
that have the same maximal wavelength usage in their optical links, where some routing
paths have only one optical link with the maximal wavelength usage and others
have several such links. Currently, in
the implementation of the proposed wavelength aware routing scheme, all the possible
routing paths with the same maximal wavelength usage in their optical links are stored
in a table and sorted by the length of the routing path (i.e., the number of hops) and
the addresses of the intermediate nodes, and the first routing path in the table is chosen
for the two active cores. That is, the maximal wavelength usage is the first criterion
for an optical routing path, and the length of the routing path is the second criterion.
However, there may still exist routing paths between two active cores with
the same maximal wavelength usage and the same length but different
numbers of optical links with the maximal wavelength usage. Even though the current
implementation reduces the likelihood of this situation by preferring short routing
paths, one drawback of the proposed wavelength aware routing is that
it may choose an optical routing path that has more optical links with the maximal
wavelength usage. One possible solution is to add the accumulated wavelength
utilization of the routing path as the third criterion. However, this would further
increase the computation complexity of wavelength aware routing.
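The selection criteria described above can be expressed as a lexicographic sort key. The Python sketch below is only an illustration under assumed data: the node addresses, the usage values, and the names path_key and select_path are hypothetical, and the optional flag models the suggested third criterion (accumulated wavelength usage).

```python
def path_key(path, usage, use_accumulated=False):
    """Sort key of a candidate path given as a list of node addresses.

    Criteria in order: maximal link usage, number of hops, optionally the
    accumulated usage, and finally the node addresses.
    """
    links = [frozenset(pair) for pair in zip(path, path[1:])]
    worst = max(usage[link] for link in links)
    total = sum(usage[link] for link in links)
    if use_accumulated:
        return (worst, len(links), total, path)
    return (worst, len(links), path)

def select_path(candidates, usage, use_accumulated=False):
    """Pick the first candidate in the sorted table."""
    return min(candidates, key=lambda p: path_key(p, usage, use_accumulated))

# Two candidates with the same maximal usage (2) and the same length (3).
usage = {
    frozenset({0, 1}): 2, frozenset({1, 2}): 2, frozenset({2, 5}): 1,
    frozenset({0, 3}): 2, frozenset({3, 4}): 0, frozenset({4, 5}): 0,
}
candidates = [[0, 1, 2, 5], [0, 3, 4, 5]]

current = select_path(candidates, usage)
with_third = select_path(candidates, usage, use_accumulated=True)
print(current, with_third)
```

With the first two criteria only, the tie is resolved by node addresses and the path with two maximally used links is selected; with the accumulated usage as the third criterion, the path with a single maximally used link is preferred.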
Reusable Wavelength Allocation
The reusable wavelength allocation scheme assigns a specific wavelength to each optical
routing path, reducing the number of required wavelengths through wavelength
reuse. It searches for as many link-disjoint routing paths as possible and allocates the
same wavelength to them, and assigns different wavelengths only to routing
paths that share a link, as summarized in Algorithm 4.
To decrease the number of required wavelengths, the wavelength allocation scheme
iteratively conducts the following three steps until every optical routing path is
allocated a wavelength. (i) Finding the optical link $e^*_{ij}$ that is used by the maximum
number of routing paths, namely the link eij with max(χij). Since different wavelengths
need to be allocated to the optical routing paths that share any link, the proposed
scheme starts from the routing paths that share the optical link with max(χij),
denoted as $P^*$. Note that max(χij) is the minimum number of wavelengths needed for all
the routing paths in $P^*$. (ii) Allocating different wavelengths to the optical routing
paths that pass link $e^*_{ij}$. For each path Pij in $P^*$, the scheme first checks whether
a wavelength λk can be reused, where k starts from 1. Specifically, if ωmnk = 0 for
∀emn ∈ Pij, no link of Pij uses λk and λk can be allocated to Pij. It then sets ωmnk = 1
for each link emn in Pij to reserve λk for Pij. (iii) Updating the link usage times χ.
Once a path has
Algorithm 4: Reusable Wavelength Allocation Scheme
Input:
χ: the set of link usage times, where χij∈χ;
P : the set of routing paths, Pij∈P is a path between two cores vi and vj ;
Output:
W : wavelength utilization matrix, where wijk∈W ; wijk=1 means λk is used in link
eij ; otherwise λk is free in link eij ;
Λ: the set of wavelength allocation for each path, λij ∈ Λ is the wavelength allocated
to path Pij ;
Nλ: the number of required wavelengths in Dark-ONoC;
Operation:
1) Initialize Nλ=1, wijk=0;
2) Find the link $e^*_{mn}$ with the maximum χmn in χ; put every path Pij that contains
link $e^*_{mn}$ into $P^*$ and remove the paths in $P^*$ from P;
3) For each path Pij ∈ $P^*$, allocate a wavelength as follows:
3-1) k=1;
3-2) Check if $\sum_{e_{mn} \in P_{ij}} w_{mnk} = 0$;
3-3) If the condition is true, goto (3-4); else k=k+1, goto (3-2);
3-4) If k > Nλ, Nλ=Nλ+1;
3-5) Allocate wavelength λk to Pij, λij = λk; for each edge emn ∈ Pij, set wmnk = 1,
χmn=χmn−1;
4) Repeat steps 2) to 3) until P becomes empty;
been allocated a wavelength, it is removed from P and the link usage count χmn of every
hop is decreased by 1. Note that if no wavelength in the existing
used wavelength set, namely the wavelengths λk with k ≤ Nλ, can be reused, a new
wavelength has to be allocated by increasing Nλ. Therefore, the algorithm reuses
wavelengths among the existing routing paths as much as possible and thus decreases the
number of required wavelengths Nλ.
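The three steps can be sketched in Python as follows. This is a minimal reading of Algorithm 4, not the thesis code: paths are given as lists of abstract link labels, the path and link names are hypothetical, every path is assumed to contain at least one link, and Nλ is tracked as the largest wavelength index allocated so far (equivalent to the initialization Nλ = 1 whenever at least one path exists).

```python
from collections import Counter

def allocate_wavelengths(paths):
    """paths: {path name: list of links}.

    Returns ({path name: wavelength index starting at 1}, N_lambda).
    """
    remaining = dict(paths)
    # chi: number of not-yet-allocated paths using each link.
    chi = Counter(link for links in remaining.values() for link in links)
    used = {}        # link -> set of wavelength indices in use (w_mnk = 1)
    allocation = {}
    n_lambda = 0
    while remaining:
        # Step (i): the link shared by the most unallocated paths.
        busiest = max(chi, key=chi.get)
        group = [name for name, links in remaining.items() if busiest in links]
        for name in group:
            links = remaining.pop(name)
            # Step (ii): lowest wavelength that is free on every link.
            k = 1
            while any(k in used.get(link, ()) for link in links):
                k += 1
            n_lambda = max(n_lambda, k)
            allocation[name] = k
            # Step (iii): reserve the wavelength and update the usage counts.
            for link in links:
                used.setdefault(link, set()).add(k)
                chi[link] -= 1
    return allocation, n_lambda

paths = {"P1": ["a", "b"], "P2": ["b", "c"], "P3": ["d"], "P4": ["b"]}
allocation, n_lambda = allocate_wavelengths(paths)
print(allocation, n_lambda)
```

In this example link "b" is shared by three paths, so P1, P2, and P4 receive wavelengths 1, 2, and 3, while the link-disjoint path P3 reuses wavelength 1.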
The heuristic algorithm greatly reduces the computational complexity by dividing
the joint routing and wavelength allocation problem into two relatively simple
sub-problems. Its complexity is approximately $O(N^4 N_a^2)$ for an N×N many-core
processor with Na active cores. For example, for an ONoC with 16×16 cores and
64 randomly chosen active cores, it takes only 73 ms on a typical computer.
To assess the proposed heuristic routing and wavelength allocation scheme,
a small-scale comparison with the optimal results is also carried out using the same
dark silicon patterns. In the comparison, an ONoC with a small network size, i.e., 4×4,
is configured with 64 available wavelengths in the optical routers and optical links.
Each set of comparisons is conducted with 20 groups of randomly generated active
cores, and each group has 4 to 8 active cores uniformly distributed in the network.
The optimal result is obtained by enumerating all possible combinations of routing
and wavelength allocation for the non-blocking routing of each group of active cores.
Over three sets of comparisons, the average number of required wavelengths
in the optimal results is 7.6, while the average number of required wavelengths
from the proposed heuristic scheme is 9.4. It is worth noting that even with such a
small network, each simulation run for the optimal results takes about 4 hours
on the same desktop.
5.5 Performance Evaluation
In this section, the proposed Dark-ONoC architecture and the heuristic routing and
wavelength allocation scheme are evaluated by using the performance model and sim-
ulations. Extensive simulations are carried out with both data traces and synthetic
traffic patterns. Moreover, the proposed routing and wavelength allocation scheme is
compared with existing schemes in ONoC.
5.5.1 Performance Model
The performance of inter-core communication in Dark-ONoC generally can be evalu-
ated using three parameters: average end-to-end packet delay, network throughput, and
power consumption. They are modelled separately as follows.
The end-to-end packet delay is defined as the time interval from when a packet is sent by
the source core until it is received by the destination core. In Dark-ONoC, the
optical routing paths and wavelengths are allocated periodically before communication,
and the packets are transmitted directly through the optical routing path with the
allocated wavelength, without blocking or intermediate processing. Thus, in the time
period of a specific dark silicon pattern, the average end-to-end packet delay, denoted
by D, can be calculated as
\[ D = D_{eo} + \frac{N_{hop}}{V_o} + D_{oe}, \tag{5.3} \]
where Deo and Doe are the delays for E-O and O-E conversions in the network interfaces
of the source and destination cores, respectively; Nhop is the average length of the optical
routing paths in hops (routers); and Vo is the transmission speed of optical signals
in the waveguide. Note that the time spent constructing the optical routing paths
between active cores is negligible compared to the time period of each dark silicon
pattern (hundreds of seconds, or $10^{11}$ cycles), and the optical routing paths are fixed
between active cores for a dark silicon pattern; thus it is not considered in the average
packet delay. Therefore, the average end-to-end packet delay is
mainly determined by the average length of the optical routing paths. However, due to the
high speed of optical signals, e.g., Vo = 8 hops/cycle for an 8×8 mesh in a 20 mm × 20 mm
chip (Liu, Yang, and Melhem, 2015), the difference in packet delay between any two pairs
of active cores in Dark-ONoC is very small.
The network throughput is defined as the average volume of data traffic which can
be transmitted in the network for each core in each clock cycle. Since the non-blocking
optical communication is provided between the active cores in Dark-ONoC, the network
throughput, denoted by T , can be computed as
\[ T = \min\left(\theta, \frac{1}{D_{eo}}\right), \tag{5.4} \]
where θ is the average traffic rate for each core. When the traffic rate is small, the
network throughput equals the average traffic rate due to the non-blocking communication;
when the traffic rate exceeds the speed of E-O conversion, the network
throughput equals the maximal traffic rate allowed by the E-O converters, i.e.,
$\frac{1}{D_{eo}}$.
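Eq. (5.3) and Eq. (5.4) can be evaluated with a few lines of Python. The sketch below uses values quoted elsewhere in this chapter (about 7 cycles for each conversion and Vo = 8 hops/cycle from Section 5.5.2, and the average path length of 4.8 hops from the example in Section 5.4.1); the two traffic rates are hypothetical and are assumed to be expressed in packets per cycle per core.

```python
def packet_delay(d_eo, d_oe, n_hop, v_o):
    """Eq. (5.3): D = D_eo + N_hop / V_o + D_oe, in cycles."""
    return d_eo + n_hop / v_o + d_oe

def throughput(theta, d_eo):
    """Eq. (5.4): T = min(theta, 1 / D_eo)."""
    return min(theta, 1.0 / d_eo)

D_EO = D_OE = 7      # cycles per packet
V_O = 8              # hops per cycle

delay = packet_delay(D_EO, D_OE, n_hop=4.8, v_o=V_O)
light_load = throughput(theta=0.05, d_eo=D_EO)   # below the E-O limit
heavy_load = throughput(theta=0.5, d_eo=D_EO)    # capped at 1 / D_eo
print(round(delay, 2), light_load, round(heavy_load, 4))
```

Under these values the conversions account for 14 of roughly 14.6 cycles, which illustrates why detoured paths change the delay only slightly.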
The power consumption for inter-core communication in Dark-ONoC includes two
parts: the optical power consumption PO which is provided by the laser source and
attenuated in all the optical devices on route, and the electronic power consumption PE
which is spent in the optical routers for tuning MRs and in the E-O and O-E converters
for modulation and photodetection (Li, Browning, Gratz, and Palermo, 2014).
To guarantee the communication between any two active cores, PO is the worst-case
optical power provided by the laser source. The main reason for using the worst-case
power is that the off-chip laser used in ONoC cannot be flexibly tuned by the processor
(Zhang and Louri, 2010). PO is computed as:
\[ P_O = \frac{1}{\eta} \times N_\lambda \times 10^{P_{rs}} \times 10^{IL_{wc}} \times N_a, \tag{5.5} \]
where η is the power efficiency of laser source, Prs is the sensitivity of photodetector
in dBm, ILwc is the worst-case insertion loss (optical power attenuation in dB) of
optical devices on the optical routing path between any two active cores (Chan, Hendry,
Biberman, Bergman, and Carloni, 2010). The optical devices include waveguides, MRs,
and optical routers. η and Prs are constant parameters of optical devices. According to
Eq. (5.5), for a many-core processor with Na active cores, PO is decided by the number
of wavelengths Nλ and the worst-case insertion loss ILwc in the optical routing paths.
For a specific optical routing path, the insertion loss IL can be calculated as
\[ IL = IL_l \times (N_{hop}-1) + IL_r \times N_{hop} + IL_{eo} + IL_{oe}, \tag{5.6} \]
where Nhop is the total number of hops (routers) in the routing path; ILl and ILr are
the insertion losses of one optical link and one optical router, respectively; ILeo and
ILoe are the insertion losses of E-O and O-E converters, respectively (Chan, Hendry,
Biberman, Bergman, and Carloni, 2010). Since ILl, ILr, ILeo, and ILoe are constant
parameters for specific optical devices and router structure, the worst-case insertion
loss ILwc is decided by the maximum length of the routing paths between the active
cores. Therefore, in summary, the optical power consumption PO is determined by
the number of wavelengths Nλ and the maximum length of the optical routing paths
max(Nhop).
The electronic power consumption PE mainly includes the dynamic power for mod-
ulation (E-O conversion) and photodetection (O-E conversion), and the static power for
thermally tuning the resonant wavelength of MRs (Li, Browning, Gratz, and Palermo,
2014). Thus, PE can be calculated as:
\[ P_E = N_{MR} \times N_\lambda \times P_{MT} + (E_{EO}+E_{OE}) \times B_O \times N_\lambda \times \sum_{i} \theta_i, \tag{5.7} \]
where NMR is the total number of MRs used for each wavelength; PMT is the tuning
power for each MR (Zhang and Louri, 2010); EEO and EOE are the energy costs
for modulation and photodetection in fJ/bit, respectively (Chan, Hendry, Biberman,
Bergman, and Carloni, 2010); BO is the bandwidth of optical link for each wavelength;
θi is the actual traffic load of an active core, θi ∈ [0, 1]. In Dark-ONoC, each optical
router needs 16 MRs for one wavelength, as shown in Figure 5.5. Thus,
NMR = 16(Na+Nr), where Nr is the total number of intermediate optical routers passed
by all the optical routing paths. In general, Nr depends on the distribution of active
cores and the length of routing paths. PMT , EEO, EOE, and BO are device dependent
constants. θi depends on the traffic pattern of active cores.
From Eq. (5.7), it can be seen there is a tradeoff between the total number of
intermediate optical routers Nr and the number of wavelengths used Nλ, since the
reduced Nλ is often due to the long detoured routing paths. However, as analysed in
Section 5.4.1, the reduction in the number of wavelengths is the dominant factor in
reducing the overall power consumption of Dark-ONoC. Moreover, the power model
does not include the power consumption of routing and wavelength allocation since it
is negligible compared to the power consumption for inter-core communications in a
long time period of dark silicon pattern.
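Eq. (5.7) can be evaluated numerically with the device parameters of Table 5.1. In the Python sketch below, the device constants are taken from that table, whereas the scenario (16 active cores, 18 intermediate routers, 20 wavelengths, and a traffic load of 0.1 for every active core) is hypothetical and chosen only to show the relative size of the two terms.

```python
P_MT = 26e-6       # thermal tuning power per MR, in W (Table 5.1)
E_EO = 85e-15      # modulation energy, in J/bit (Table 5.1)
E_OE = 50e-15      # photodetection energy, in J/bit (Table 5.1)
B_O = 10e9         # optical bandwidth per wavelength, in bit/s (Table 5.1)
MRS_PER_ROUTER = 16

def electronic_power(n_active, n_intermediate, n_lambda, loads):
    """Eq. (5.7), returned as (static tuning power, dynamic power) in W."""
    n_mr = MRS_PER_ROUTER * (n_active + n_intermediate)
    static = n_mr * n_lambda * P_MT
    dynamic = (E_EO + E_OE) * B_O * n_lambda * sum(loads)
    return static, dynamic

# Hypothetical scenario, not a result reported in the thesis.
static, dynamic = electronic_power(
    n_active=16, n_intermediate=18, n_lambda=20, loads=[0.1] * 16)
print(f"static = {static * 1e3:.2f} mW, dynamic = {dynamic * 1e3:.2f} mW")
```

Because both terms scale linearly with Nλ while only the static term grows with Nr, the sketch makes the tradeoff discussed above easy to explore for other assumed values.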
5.5.2 Simulation Setup
To evaluate the communication performance and power consumption of Dark-ONoC,
an architecture-level simulator is developed by using C++, which implements the pro-
posed routing and wavelength allocation scheme for different dark silicon patterns and
calculates the optical routing paths and the number of required wavelengths. In gen-
eral, the simulator consists of four parts: the kernel network architecture, the dark
silicon pattern regulator, the routing and wavelength allocator, and the performance
analyser. The kernel network architecture is an N ×N mesh-based ONoC constructed
by the components of cores, optical routers, and optical links, where each core has a
dark/active indicator and a local routing table, each optical router and optical link
has a wavelength utilization table which maintains the status of each wavelength. The
dark silicon pattern regulator reads different dark silicon patterns obtained from the data
traces or synthetic distributions, and accordingly activates or shuts down a group of cores.
The routing and wavelength allocator implements the proposed routing and wavelength
allocation scheme according to the wavelength utilization tables in the optical routers and
optical links. The performance analyser calculates the performance parameters, such
as the number of required wavelengths, packet delay, network throughput, and power
consumption, according to the optical routing paths and wavelength allocation results from
the routing and wavelength allocator. This simulator can also cooperate with
existing network-level simulators, such as Noxim (Catania, Mineo, Monteleone, Palesi,
and Patti, 2016), by providing the optical routing paths and wavelength allocations.
Table 5.1: Power Consumption Parameters of Optical Devices
Parameter             Value        Parameter            Value
Laser efficiency      30%          Waveguide passing    1.5 dB/cm
Coupler               1 dB         Waveguide crossing   0.15 dB
Modulator             85 fJ/bit    Waveguide bending    0.005 dB/90°
Photodetector         50 fJ/bit    MR drop              0.5 dB/MR
Receiver sensitivity  -26 dBm      MR pass              0.005 dB/MR
Optical bandwidth     10 Gbps/λ    Thermal tuning       26 µW/MR
Typical simulation parameters for ONoCs are configured in the proposed simulator.
All electronic devices use a system clock of 1 GHz. Each packet has a fixed size of 64
bits, which is the same size used for the control messages in cache coherence (Hestness,
Grot, and Keckler, 2010). Each cache line, e.g., 64 bytes, is thus transmitted using
multiple successive packets. All the optical devices, including E-O and O-E converters
and optical routers, work at a bandwidth of 10 Gbps per wavelength (Liu, Zhang,
Chen, Huang, and Gu, 2015). The delays for E-O and O-E conversions are each about
7 cycles per packet. The transmission speed of optical signals in the
waveguide, Vo, is 8 routers/cycle (Liu, Yang, and Melhem, 2015). In the simulations,
the time duration of each dark silicon pattern is set to 10 thousand cycles, i.e., the dark
silicon pattern regulator in the simulator configures a new dark silicon pattern every 10
thousand cycles. The whole simulation run lasts for 100 million cycles. Thus, there
are 10,000 time periods with different dark silicon patterns. To analyse the overall
power consumption, the simulator uses the typical power parameters for the optical devices
given in (Zhang and Louri, 2010; Chan, Hendry, Biberman, Bergman, and Carloni, 2010),
which are listed in Table 5.1.
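A few of the quantities above follow from simple arithmetic, as the Python sketch below shows. The serialization time computed at the end is only an observation: 64 bits at 10 Gbps correspond to 6.4 cycles of the 1 GHz clock, which is in line with the conversion delay of about 7 cycles, although the text does not state how that figure was derived.

```python
import math

CLOCK_HZ = 1_000_000_000          # 1 GHz system clock
PACKET_BITS = 64
CACHE_LINE_BYTES = 64
BANDWIDTH_BPS = 10_000_000_000    # 10 Gbps per wavelength
PERIOD_CYCLES = 10_000            # duration of one dark silicon pattern
TOTAL_CYCLES = 100_000_000        # length of one simulation run

# Successive packets needed to carry one cache line.
packets_per_cache_line = CACHE_LINE_BYTES * 8 // PACKET_BITS

# Dark silicon patterns configured during one simulation run.
num_periods = TOTAL_CYCLES // PERIOD_CYCLES

# Time to serialize one packet on one wavelength, in clock cycles.
serialization_cycles = PACKET_BITS * CLOCK_HZ / BANDWIDTH_BPS

print(packets_per_cache_line, num_periods,
      serialization_cycles, math.ceil(serialization_cycles))
```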
In the simulations, Dark-ONoC is compared with the electronic NoC and the optical
NoC using traditional XY routing and XY-YX routing. In XY routing, data packets
are always routed along the X coordinate first and then turned to the Y coordinate towards
their destination (Shacham, Bergman, and Carloni, 2008), while XY-YX routing can
adaptively balance the traffic in the X and Y coordinates (Li, Zeng, and Jone, 2006). The
electronic NoC does not need to establish routing paths in advance. The processing
delay of an electronic router is set to 2 cycles, which is generally its minimum (Parikh,
Das, and Bertacco, 2014). According to the Orion 3.0 simulator (Kahng, Lin, and Nath,
2015), the energy consumption for a single-hop transmission is about 97.58 fJ/bit in an
electronic router with 64-bit packets and an 8-packet input buffer, and the static power
cost is 0.247 mW. To make a fair comparison, the optical NoC also periodically establishes
non-blocking optical routing paths between active cores for the XY and XY-YX routing
schemes, instead of dynamically setting up an optical routing path for each communication
in a hop-by-hop manner, and the number of required wavelengths is compared.
5.5.3 Simulation with Fixed Dark Silicon Patterns
The Number of Required Wavelengths
In the fixed dark silicon patterns, all the cores are divided into several groups and
activated in turn in each time period, as shown in Figure 5.1(a). In the first set of
simulations, the number of required wavelengths is evaluated with different network
sizes and different fixed dark silicon configurations. As shown in Figure 5.7, the network
sizes are set to 8×8, 12×12, and 16×16, with a fixed ratio of active cores and a fixed
number of active cores, respectively. In Figure 5.7(a), 25% of the cores are active, i.e.,
16, 36, and 64 active cores are arranged in one of 4 groups as in Figure 5.1(a); while in
Figure 5.7(b), there are always 16 active cores distributed in the different network sizes;
thus the cores are divided into 4, 9, and 16 groups.
From Figure 5.7(a), it can be seen that Dark-ONoC requires the fewest
wavelengths compared with ONoCs using the traditional XY and XY-YX routing
schemes for all network sizes, because of its capability of balancing the wavelength
utilization through the heuristic routing and wavelength allocation scheme. On
average, the number of required wavelengths in Dark-ONoC is reduced by 27.5% and 13.2%
compared to the XY and XY-YX routing schemes, respectively. In Dark-ONoC,
the wavelength utilization of the optical links tends to be more balanced and more
link-disjoint paths can reuse the same wavelength; in ONoCs with the XY and XY-YX
schemes, the wavelength utilization is not balanced, and in particular the optical links
near the center of the network are overused, so more wavelengths are required for
the non-blocking communication of active cores. It is worth noting that as the network
size increases, the number of required wavelengths in Dark-ONoC increases
more slowly than in the other schemes. Moreover, according to the analysis of wavelength
aware routing in Section 5.4.3, the accumulated wavelength usage of optical links within a
routing path can be used as the third criterion for choosing the optical routing path
between active cores. However, according to the simulation results, the number of
required wavelengths can only be reduced by 2.1% (from 20.25 to 19.82), 3.9% (from
71.25 to 68.43), and 4.9% (from 178.5 to 169.8) for the three network sizes, respectively,
while the computation complexity increases considerably (the time for each simulation run
can increase by at least two times).
[Figure 5.7, panels (a) and (b): x-axis: network size (8×8, 12×12, 16×16); y-axis: Average Number of Required Wavelengths; series: XY, XY-YX, Dark-ONoC.]
Figure 5.7: Average number of wavelengths for different schemes with
the fixed dark silicon patterns in different network sizes, (a) with 25%
active cores (16, 36, and 64 for 8×8, 12×12, 16×16); and (b) with 16
active cores.
In Figure 5.7(b), it can be seen that Dark-ONoC also uses the fewest wavelengths.
Compared to the XY and XY-YX routing schemes, the number of wavelengths in
Dark-ONoC is reduced on average by 34.5% and 23.6%, respectively. Moreover, it is
worth noting that the total number of wavelengths used in Dark-ONoC decreases as
the network size scales up, i.e., 20, 18, and 17 wavelengths are used in Dark-ONoC for the
three network sizes, while XY and XY-YX do not show such a trend. That
is because the number of optical routing paths between 16 active cores is fixed, i.e.,
120, while more optical links become available for detouring, which reduces the number of
wavelengths used in each link in Dark-ONoC, as the network size increases.
Communication Performance
In this set of simulations, the communication performance of Dark-ONoC and the
proposed routing and wavelength allocation scheme is evaluated with the fixed dark
silicon patterns. As shown in Figure 5.8, the average end-to-end packet delay, network
Figure 5.8: Performance comparison for fixed dark silicon pattern among
ENoC(XY), ENoC(XY-YX), ONoC(XY), ONoC(XY-YX), and Dark-ONoC, (a)
average end-to-end packet delay (ns) for 8×8 and 16×16; (b) maximal
network throughput capacity (Gbps) for 8×8 and 16×16; (c) average
power consumption (mW), split into optical and electrical power.
throughput, and power consumption of different schemes are compared. The network
sizes are set to 8×8 and 16×16, with 25% of the cores active in the simulations. For the
electronic NoCs, the XY and XY-YX routing schemes are used because they yield
the shortest paths, which generally favours the electronic NoCs.
From Figure 5.8(a), it can be seen that the end-to-end packet delays in the electronic
NoCs are much higher than in Dark-ONoC and the ONoCs. The optical schemes use
pre-configured non-blocking optical paths between the active cores, while the electronic
NoCs have to route the packets hop by hop from the source core to the destination core
for each communication. Note that the average packet delays for the electronic NoCs
are obtained at a very low data rate, i.e., 0.01 packet/cycle/core, which favours
the electronic NoCs since congestion does not come into play in the simulations
at such a low data rate. For ONoCs with XY and XY-YX routing schemes, since the
optical routing paths and wavelengths are configured in advance and the optical signals
propagate at high speed in waveguides, the average packet delay is very low and is
dominated by the delays of E-O and O-E conversions; the lengths of routing paths have
little impact on it. Therefore, although the optical routing paths in
Dark-ONoC are slightly longer in order to reduce the number of wavelengths, its average
packet delay is almost equal to the delays achieved in ONoCs using the traditional XY and
XY-YX routing schemes.
Figure 5.8(b) shows the maximal network throughput of the different schemes.
The maximal throughput is obtained by increasing the average data rate until the
network becomes saturated, e.g., until the average packet delay exceeds 200 cycles. Owing
to the non-blocking optical communication through wavelength multiplexing between active
cores in the Dark-ONoC and ONoC schemes, their throughput capacity is limited only by
the speed of E-O conversions. Their overall throughput also increases with the number
of active cores, because each core can transmit packets to other active cores simultaneously
using different wavelengths. In contrast, the electronic NoCs cannot achieve high
throughput because congestion emerges as soon as the average data rate
increases. The throughput of the electronic NoCs decreases with the network size due
to the higher probability of congestion in the longer routing paths of a larger network.
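The saturation criterion used for the maximal throughput (raise the data rate until the average packet delay exceeds 200 cycles) can be sketched as a simple sweep. The delay curve `toy_delay` below is a hypothetical queueing-style model used only to make the sketch runnable; it is not the simulator or a result from the thesis, and the step size and starting rate are likewise assumptions.

```python
from typing import Callable


def saturation_rate(avg_delay: Callable[[float], float],
                    threshold_cycles: float = 200.0,
                    start: float = 0.01,
                    step: float = 0.01,
                    max_rate: float = 1.0) -> float:
    """Highest swept data rate whose average delay stays within the threshold."""
    rate = start
    last_ok = 0.0
    while rate <= max_rate + 1e-12:
        if avg_delay(rate) > threshold_cycles:
            break  # network is considered saturated at this rate
        last_ok = rate
        rate = round(rate + step, 10)
    return last_ok


def toy_delay(rate: float, zero_load: float = 20.0, capacity: float = 0.4) -> float:
    """Hypothetical delay curve (cycles) that blows up near an assumed capacity."""
    if rate >= capacity:
        return float("inf")
    return zero_load / (1.0 - rate / capacity)


print(f"saturation at about {saturation_rate(toy_delay):.2f} packets/cycle/core")
```

In a real evaluation, `avg_delay` would be replaced by a simulation run at the given data rate.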
Figure 5.8(c) shows the average power consumption of the 8×8 mesh network. In the
simulations, the average data rate of each core is set to 0.2 packets/cycle/core. From
the figure, it can be seen that Dark-ONoC has the lowest overall power consumption
of all the schemes. First, the electronic NoCs consume much more power due to the
hop-by-hop routing and buffering in the electronic routers, especially when the
distances between active cores are increased for thermal balance. Second, ONoCs with
XY and XY-YX routing schemes need more wavelengths (28 and 25 wavelengths,
respectively) than Dark-ONoC, which leads to higher electronic power consumption for
tuning the resonant wavelengths of MRs. Because Dark-ONoC uses only 20 wavelengths in
this configuration, its electronic power consumption is significantly (36%) lower than
that of the other ONoC schemes. Even though the optical power consumption of Dark-ONoC
is slightly higher than that of the other ONoC schemes due to its detoured optical
routing paths, its overall power consumption is still much lower. According to the
figure, the overall power consumption of Dark-ONoC is around 400 mW, which is at least
23.4% less than the other ONoC schemes.
In summary, with the fixed dark silicon patterns, Dark-ONoC achieves much higher
communication performance and lower power consumption than the electronic NoCs,
and also much lower power consumption than the traditional ONoC schemes, because it
reduces the number of required wavelengths.
5.5.4 Simulation with Random Dark Silicon Patterns
With random dark silicon patterns, the number of active cores and their distribution
vary widely across time periods. Two sets of simulations are conducted, with
synthetic patterns and with real communication patterns based on data traces.
Since both the power consumption and the average end-to-end packet delay of the
electronic NoCs are much higher than those of Dark-ONoC, Dark-ONoC is compared only
with ONoCs in the following simulations. In addition, when the non-blocking
optical routing paths are configured between the active cores in Dark-ONoC and the
Figure 5.9: Average number of required wavelengths for different
schemes (XY, XY-YX, and Dark-ONoC) with synthetic random dark silicon
patterns, (a) in 8×8 ONoC with 8, 16, 24, and 32 active cores; and (b)
in 16×16 ONoC with 16, 32, 48, and 64 active cores.
other ONoC schemes, the average end-to-end delay and the average network throughput
achieved are very similar. Since the number of required wavelengths is the dominant
factor in the power consumption, to show the advantage of Dark-ONoC over ONoC
with the traditional XY and XY-YX routing schemes, only the number of required
wavelengths is presented in the simulation results.
Synthetic Patterns
In this set of simulations, synthetic dark silicon patterns are used, namely the
number of active cores Na is a parameter and the active cores are selected
uniformly at random across the whole network.
As shown in Figure 5.9, the number of active cores Na ranges from 8 to 32
in the 8×8 ONoCs and from 16 to 64 in the 16×16 ONoCs. It can be
seen that Dark-ONoC with the proposed heuristic routing and wavelength allocation
scheme requires the fewest wavelengths across the different network sizes
and numbers of active cores. When the number of active cores is small, such
as 8 in 8×8 in Figure 5.9(a) or 16 in 16×16 in Figure 5.9(b), ONoCs with XY and
XY-YX routing schemes do not noticeably overuse the optical links in the
network: the total number of routing paths is small relative to the network size, and
the uniformly distributed active cores have little chance of sharing optical links in
their routing paths. The difference from Dark-ONoC in the number of required
wavelengths is therefore small.
However, as the number of active cores increases and more optical routing paths
have to be allocated between active cores, Dark-ONoC shows its advantage in
balancing wavelength utilization and reusing wavelengths in the link-disjoint
optical routing paths. Thus, it requires significantly fewer wavelengths than the
other two schemes in ONoCs. For instance, Dark-ONoC needs only 44 wavelengths in
the 16×16 mesh with 32 active cores, while 64 and 60 wavelengths are required by
the XY and XY-YX routing schemes in the same
scenario, namely a reduction of 31.25% and 26.67%, respectively. Moreover, according
to the two simulations in Figure 5.9, averaged over the different numbers of active
cores, Dark-ONoC reduces the number of wavelengths by at least 15.8% and 16.6% compared
with the other schemes in 8×8 and 16×16 ONoCs, respectively.
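The quoted reductions for the 16×16 mesh with 32 active cores follow directly from the wavelength counts given above. A minimal check, with an illustrative helper name:

```python
def reduction_percent(baseline: float, improved: float) -> float:
    """Relative reduction of `improved` with respect to `baseline`, in percent."""
    return 100.0 * (baseline - improved) / baseline


dark_onoc = 44  # wavelengths needed by Dark-ONoC in this scenario
for scheme, wavelengths in (("XY", 64), ("XY-YX", 60)):
    print(f"{scheme}: {reduction_percent(wavelengths, dark_onoc):.2f}% fewer wavelengths")
```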
Real Patterns based on Data Traces
In this set of simulations, the dark silicon patterns are derived from the communication
patterns of real data traces. The data traces are obtained from a 64-core
system running different applications in the PARSEC benchmark (Hestness, Grot, and
Keckler, 2010). According to the analysis of data traces, around 90% of communications
randomly happen between two cores within a time interval of 250 cycles in every
data trace. Thus, in the simulations, a core that does not send or receive any packet
within 250 cycles is considered a dark core from the perspective of networking, and
different dark silicon patterns are obtained accordingly. Even though these dark
silicon patterns only reflect the communication properties of the many-core processor,
rather than the real power gating of cores under a power budget, they can be
used to evaluate the number of wavelengths required in Dark-ONoC for providing
non-blocking optical routing paths between a set of randomly distributed cores. Moreover,
according to the number of active cores, the dark silicon patterns derived
from the data traces are divided into two cases: a light case with 8-16 active cores and
a heavy case with 16-32 active cores, as shown in Figure 5.10.
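The rule for deriving dark silicon patterns from a trace can be sketched as follows. The record layout `(cycle, source, destination)` and the use of fixed, non-overlapping 250-cycle windows are assumptions made for illustration; the thesis does not specify the trace format or the windowing details.

```python
from collections import defaultdict
from typing import Iterable, List, Set, Tuple

Packet = Tuple[int, int, int]  # (cycle, source core, destination core): assumed layout


def active_core_patterns(trace: Iterable[Packet], window: int = 250) -> List[Set[int]]:
    """Per window, the set of cores that sent or received at least one packet."""
    by_window = defaultdict(set)
    for cycle, src, dst in trace:
        by_window[cycle // window].update((src, dst))
    if not by_window:
        return []
    # Windows without any packet yield an empty active set.
    return [by_window.get(i, set()) for i in range(max(by_window) + 1)]


def dark_cores(active: Set[int], total_cores: int = 64) -> Set[int]:
    """Cores treated as dark from the networking perspective in one window."""
    return set(range(total_cores)) - active


trace = [(0, 1, 2), (100, 2, 3), (260, 5, 6), (700, 1, 5)]
for i, active in enumerate(active_core_patterns(trace)):
    print(f"window {i}: active={sorted(active)}, dark={len(dark_cores(active))}")
```

Each resulting active set is one dark silicon pattern; the number of active cores in a pattern then decides whether it falls into the light or the heavy case.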
It can be seen that Dark-ONoC needs the fewest wavelengths in the dark silicon
patterns of all the applications. In the light case of dark silicon patterns in Figure
5.10(a), Dark-ONoC reduces the number of wavelengths by 23.4% on average, while
in the heavy case of dark silicon patterns in Figure 5.10(b), it reduces the number of
wavelengths by 14.5% across the dark silicon patterns of the different applications. Note
that since the distribution of active cores is not uniform in these dark silicon patterns,
the number of required wavelengths can be higher in the light case with fewer
Figure 5.10: Average number of required wavelengths for XY, XY-YX, and
Dark-ONoC with different dark silicon patterns from data traces (canneal,
dedup, ferret, fluidanimate, swaptions, vips, x264) in 8×8 ONoC, in (a)
light case with 8-16 active cores, and (b) heavy case with 16-32 active cores.
active cores, such as in the canneal application, when the active cores are located in a
small region. This also indicates that a uniform distribution of active cores is necessary
from the viewpoint of reducing the number of required wavelengths. Moreover, across
different numbers and distributions of active cores, Dark-ONoC
is able to balance the wavelength utilization in the network and uses fewer wavelengths to
provide non-blocking communication between active cores.
5.6 Summary
In this chapter, the Dark-ONoC architecture is proposed for the inter-core communication
of many-core processors with dark silicon. The research objective is to achieve high
communication performance and low power consumption at the same time by establishing
non-blocking optical routing paths only between the active cores under different
dark silicon patterns. The main contributions are as follows: (i) Dark-ONoC employs a
flexible network architecture which can dynamically configure the optical routing paths
for different dark silicon patterns; (ii) a dark-silicon-aware routing and wavelength
allocation scheme is proposed to provide non-blocking communication between the active
cores, and the number of required wavelengths is decreased through wavelength reuse;
(iii) simulation results indicate that Dark-ONoC can reduce the average number of
required wavelengths by at least 15% and the overall power consumption by 23.4%
compared with the existing schemes.
Chapter 6
Conclusion and Future Work
This thesis concentrates on the design of wavelength-reused Optical Network on Chip
for the high-performance and energy-efficient communication of many-core processors.
The main idea is to transmit data packets with optical signals simultaneously in different
wavelengths through wavelength routing, and to use as few wavelengths as possible
through wavelength reuse. Targeting three important communication issues, namely
network scalability, multicast communication, and dark silicon, three wavelength-reused
ONoC architectures and communication schemes have been proposed accordingly. It
is worth noting that WRH-ONoC is an architecture-level wavelength-reused scheme,
whereas DWRMR and Dark-ONoC are routing-level wavelength-reused schemes. Without loss
of generality, DWRMR and Dark-ONoC are designed based on the widely used mesh-topology
ONoC architecture and focus on the exploration of the routing and wavelength
allocation scheme; the proposed routing and wavelength allocation schemes themselves are
topology independent. They can also be implemented in ONoCs with other topologies
by adjusting the interconnections of the network architecture and updating the connection
function in the routing and wavelength allocation model accordingly. This chapter
summarizes the main contributions of the proposed schemes and their limitations. Some
prospective research problems are also discussed as future work.
6.1 Conclusions
In this thesis, three wavelength-reused ONoC architectures and communication schemes,
i.e., WRH-ONoC, DWRMR, and Dark-ONoC, have been proposed for the network scalability,
multicast communication, and dark silicon problems of many-core processors.
Their advantages and limitations are summarized in the following.
6.1.1 Main Contributions
WRH-ONoC is an architecture-level wavelength-reused ONoC scheme targeting
the network scalability problem. The main contributions of the WRH-ONoC architecture
and its communication scheme include the following. (i) The limited number of
available wavelengths is reused in each local λ-router for non-blocking optical
communication within a subsystem, and the hierarchical network with multiple λ-routers
and gateways provides high-bandwidth global communication. (ii) Optical communication
is provided for both intra-subsystem and inter-subsystem traffic.
Non-blocking wavelength-based routing is realized for the intra-subsystem communication,
and an inter-subsystem packet needs to pass through only a few gateway hops.
(iii) Multiple sibling gateways are used between two λ-routers for load balancing, which
alleviates the bottleneck in the upper levels of a hierarchical network.
(iv) Multicast communication is implemented as a combination of intra-subsystem
multicasting and inter-subsystem unicasting, where the maximal number of multicast
copies that need to be generated equals the number of subsystems. (v) The
communication performance of WRH-ONoC is analysed using both theoretical
modelling and simulations. According to the simulation results with both synthetic
traffic patterns and real data traces, WRH-ONoC achieves much better communication
performance than the related ONoC schemes. For example, when interconnecting 400 cores,
a reduction of at least 46.0% in the zero-load packet delay and an improvement of 72.7%
in the maximal throughput are achieved compared with the other ONoC schemes for unicast
communication; for multicast communication, the zero-load delay is reduced by
63.9% and the maximal throughput is increased by 8.4 times
compared with the PNoC scheme with the tree-based multicast routing
scheme, even with only 5% of multicast traffic.
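The bound on multicast copies in contribution (iv) can be illustrated by grouping the destinations of a multicast packet by subsystem: one copy is needed per destination subsystem, and delivery inside a subsystem is left to the local λ-router. The mapping from core ID to subsystem by integer division is a hypothetical numbering chosen for this sketch, not the addressing used in WRH-ONoC.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def multicast_copies(destinations: Iterable[int], subsystem_size: int) -> Dict[int, List[int]]:
    """Group destination cores by subsystem; each key stands for one packet copy."""
    groups = defaultdict(list)
    for core in destinations:
        # Assumed numbering: cores of one subsystem have contiguous IDs.
        groups[core // subsystem_size].append(core)
    return dict(groups)


copies = multicast_copies([1, 2, 17, 18, 40], subsystem_size=16)
print(len(copies), "copies:", copies)
```

However many destination cores a multicast packet has, the number of copies never exceeds the number of subsystems.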
DWRMR is a routing-level wavelength-reused scheme targeting the multicast
communication problem. The main contributions of DWRMR and its communication
scheme include the following. (i) It considers the property of multicast communication,
i.e., a large ratio of multicast traffic is interactive multicast within the same multicast
group, and the property of optical interconnects, i.e., high-speed transmission in an
optical cyclic routing path with the single-send-multi-receive ability. (ii) An efficient
ONoC architecture is proposed, consisting of a separate optical control plane and an
optical data plane. The optical control plane can make use of the information of global
wavelength utilization to allocate an efficient multicast ring dynamically according to
the distribution of destination cores and the wavelength utilization in all the optical
interconnects. The optical control channel can configure all the optical routers in the
allocated multicast routing path at the same time with very low delay. (iii) Each
multicast ring requires only one wavelength for all the cores in a multicast group,
and it can be reused for the interactive multicast communications. Link-disjoint
multicast rings are also identified so that they can reuse the same wavelength. (iv) The
routing and wavelength allocation problem is formulated as an optimization problem to
minimize the accumulated wavelength utilization of each multicast ring, and a heuristic
algorithm is proposed to minimize the length of the routing path and the number of used
wavelengths at the same time, which significantly reduces the computational
complexity. (v) The communication performance of DWRMR is evaluated through simulations
with multicast traffic from both synthetic patterns and real data traces. Simulation
results indicate that it achieves much better multicast communication performance
and requires far fewer wavelengths. For example, the zero-load delay is reduced by at
least 47.4% and the maximal throughput is increased by 4.9 times with 50%
of interactive multicast in an ONoC with 64 wavelengths.
Dark-ONoC is also a routing-level wavelength-reused scheme, targeting the dark
silicon problem. The main contributions of Dark-ONoC and its communication scheme
include the following. (i) By considering the properties of dark silicon, namely that only
a set of cores can be active in the same time period and a new set of cores will replace
the previous active cores, the non-blocking optical routing paths are dynamically
established only for the active cores in Dark-ONoC according to different dark silicon
patterns, i.e., the number and distribution of active cores. (ii) A hierarchical network
architecture is designed, with a manager core to regulate the dark silicon patterns in
the core plane, a centralized routing and wavelength allocator and a fast optical control
channel in the optical control plane for the allocation and configuration of optical
routing paths, and a configurable optical data plane for the optical communication between
the active cores. (iii) The relation between the power consumption of Dark-ONoC and
the number of used wavelengths is modelled, and the routing and wavelength allocation
scheme is designed to reduce the number of required wavelengths for the non-blocking
optical communication of active cores. A heuristic algorithm which combines
wavelength-aware routing and reusable wavelength allocation is proposed to balance the
wavelength utilization in the optical interconnects and to reuse the same wavelength
in as many optical routing paths as possible. (iv) The communication performance
and power consumption of Dark-ONoC are evaluated through simulations with dark
silicon patterns from both real data traces and synthetic patterns. According to the
simulation results, Dark-ONoC can achieve high communication performance and very
low power consumption for a many-core processor with dark silicon. For instance, it
can reduce the average number of required wavelengths by at least 15% and the overall
power consumption by 23.4% compared with the traditional schemes.
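The idea of reusing one wavelength on link-disjoint routing paths can be illustrated with a small sketch. It is deliberately simpler than the heuristic summarized above: routes are plain XY (dimension-order) paths rather than wavelength-aware detours, wavelengths are assigned first-fit, links are treated as undirected, and one path is set up per pair of active cores. All of these are assumptions made for illustration.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Set, Tuple

Core = Tuple[int, int]   # (x, y) position in the mesh
Link = FrozenSet[Core]   # undirected link between neighbouring routers (assumption)


def xy_path(src: Core, dst: Core) -> List[Link]:
    """Links visited by dimension-order routing: X first, then Y."""
    links = []
    x, y = src
    while x != dst[0]:
        nx = x + (1 if dst[0] > x else -1)
        links.append(frozenset({(x, y), (nx, y)}))
        x = nx
    while y != dst[1]:
        ny = y + (1 if dst[1] > y else -1)
        links.append(frozenset({(x, y), (x, ny)}))
        y = ny
    return links


def first_fit_wavelengths(paths: List[List[Link]]) -> List[int]:
    """Give each path the lowest wavelength index not yet used on any of its links."""
    used: Dict[Link, Set[int]] = {}
    assignment = []
    for links in paths:
        busy = set().union(*(used.get(link, set()) for link in links))
        wavelength = 0
        while wavelength in busy:
            wavelength += 1
        for link in links:
            used.setdefault(link, set()).add(wavelength)
        assignment.append(wavelength)
    return assignment


# Four active cores at the corners of a 4x4 mesh, all-to-all paths.
active = [(0, 0), (3, 0), (0, 3), (3, 3)]
paths = [xy_path(a, b) for a, b in combinations(active, 2)]
wavelengths = first_fit_wavelengths(paths)
print(len(paths), "paths use", max(wavelengths) + 1, "wavelengths:", wavelengths)
```

Paths that share no link receive the same wavelength index, so the number of wavelengths is driven by the most heavily shared link rather than by the number of paths. A wavelength-aware routing step, as in Dark-ONoC, would additionally choose routes that lower that maximum.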
6.1.2 Limitation Issues
Due to the technological limitations of current optical devices, the design of
high-performance and energy-efficient ONoC remains a challenging problem. The limitations
of the proposed wavelength-reused ONoC architectures are summarized in the following.
WRH-ONoC utilizes electronic gateways for wavelength reassignment between
two λ-routers. Thus, it introduces some extra hardware cost, such as electronic-to-optical
(E-O) and optical-to-electronic (O-E) converters, buffers, and internal wavelength
switching. The efficiency of the gateways also has a significant influence on the
communication delay, the probability of traffic congestion, and the power consumption.
Moreover, in a hierarchical network architecture more traffic converges on the upper
levels of the network hierarchy, so the upper-level gateways may become the bottleneck
of WRH-ONoC. Although multiple sibling gateways with load balancing are used between the
same two λ-routers to increase the throughput in the upper levels, the number of sibling
gateways involves a trade-off between the communication performance
and the hardware cost. The number of sibling gateways is limited by the hardware
cost, especially when the many-core processor integrates a large number of cores.
DWRMR employs a dynamically established optical multicast ring for the multicast
communication within a multicast group. It can achieve high-performance multicast
communication for many-core processors with far fewer wavelengths than existing
multicast routing schemes. However, when the data rate of interactive multicast traffic
is high, the other cores in the same multicast group need to wait for the occupied
multicast ring. Moreover, the proposed heuristic routing and wavelength allocation scheme
is designed for the common cases of multicast communication, for example, when the number
of destination cores is about 25% of the total number of cores in the network and the
destination cores are uniformly distributed. For some specific multicast patterns, e.g.,
when the number of destination cores is small and they are concentrated in a small region
of the network, the proposed routing and wavelength allocation scheme, which uses a
Hamiltonian cycle as the initial multicast ring, needs several optimization steps to
achieve a good multicast ring.
Dark-ONoC periodically establishes the non-blocking optical routing paths among
the active cores. Since, over a long enough time period, inter-core communication
can generally occur between any two active cores, all-to-all optical routing paths are
established among the active cores in Dark-ONoC. Thus, no extra routing
and wavelength allocation is needed within each time period before two active cores
communicate, and an optical routing path is determined solely by the allocated
wavelength and the output port in the source router. However, for some
application-specific communication patterns, the communication graph of active cores is
fixed and the inter-core communication may exist only among some specific cores, so it is
not necessary to establish a fully connected optical network among all the active cores.
6.1.3 Potential Improving Solutions
To leverage the main advantages of the proposed wavelength-reused ONoC architectures
and communication schemes, and to alleviate their technological limitations, some
potential improvements can be employed accordingly.
For WRH-ONoC, one possible solution to the bottleneck problem in the upper
levels is to configure the number of sibling gateways between two λ-routers
independently at each level according to the communication properties, instead
of using the same number of sibling gateways for the whole network. This solution
can use more sibling gateways in the upper levels of the network. However, with a
fixed wavelength limitation, the number of levels in the network hierarchy will increase
accordingly, since more sibling gateways are used between λ-routers in the upper levels.
Thus, the trade-off between network throughput (the number of sibling gateways in the
upper levels) and communication delay (the number of levels in the network), and the
trade-off between network throughput and hardware cost, need to be considered.
For DWRMR, one possible solution to the congestion problem due to heavy interactive
multicast is to configure the number of multicast rings for each multicast group
according to the communication requirements. When the interactive
multicast communication takes a large ratio and the established multicast ring is
occupied by the other cores for a long time period, a core with multicast packets can
request a new multicast ring for the same multicast group. Note that the
new multicast ring can use the same route but with a different wavelength. However,
this solution can only be employed when there are sufficient wavelengths. Moreover,
a possible solution for the localized multicast group is to extend the heuristic routing
and wavelength allocation scheme with a smaller baseline multicast
ring, such as a cycle spanning only a small region, instead of a Hamiltonian cycle
spanning the whole network. However, this solution cannot always provide the best
multicast rings for all kinds of multicast patterns either.
For Dark-ONoC, one possible solution to improve the efficiency of optical routing
paths is to design an application-specific routing and wavelength allocation scheme
when the communication graph is determined. The logical interconnections between
active cores that do not communicate are deleted before the routing and
wavelength allocation, and the number of required wavelengths can be decreased
accordingly. Moreover, if the communication graph is totally unknown, a sleeping-based
scheme can be designed for the established optical routing paths, namely an optical
routing path between two active cores can be switched to a sleeping state when it has
carried no communication for longer than a fixed time period. This solution can
dynamically reduce the power consumption, but it does not reduce the number of wavelengths.
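The sleeping-based scheme can be sketched as an idle timer per optical routing path. The class, its fields, and the choice to treat a never-used path as sleeping are all hypothetical; the thesis only outlines the idea and does not define an interface or an idle threshold.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple

Pair = Tuple[int, int]  # (source core, destination core)


@dataclass
class PathSleepTracker:
    """Tracks the last use of each routing path; the idle limit is an assumed parameter."""
    idle_limit: int
    last_used: Dict[Pair, int] = field(default_factory=dict)

    def record_packet(self, src: int, dst: int, cycle: int) -> None:
        """Mark the path as used (and therefore awake) at the given cycle."""
        self.last_used[(src, dst)] = cycle

    def is_sleeping(self, src: int, dst: int, cycle: int) -> bool:
        """A path sleeps if it was never used or has been idle longer than the limit."""
        last = self.last_used.get((src, dst))
        return last is None or cycle - last > self.idle_limit


tracker = PathSleepTracker(idle_limit=100)
tracker.record_packet(1, 2, cycle=10)
print(tracker.is_sleeping(1, 2, cycle=50), tracker.is_sleeping(1, 2, cycle=200))
```

The wavelength of a sleeping path stays allocated, which matches the observation above that this scheme saves power without lowering the wavelength count.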
6.2 Future Work
This thesis focuses on solving three important communication problems with
wavelength-reused ONoC. However, there are still many challenging problems in the design
of high-performance and energy-efficient ONoC for many-core processors, which should
exploit wavelength reuse. Three prospective research issues are given in the following.
6.2.1 Reliable ONoC Architecture
There are several physical constraints in the ONoC architecture which can influence
the reliability of optical devices (Miller, 2000), especially for the wavelength-specific
components, e.g., the microring resonator. The resonant wavelength of an MR is highly
sensitive to the geometric dimensions, e.g., the radius of the microring and the width
of the waveguide (Xu, Yang, and Melhem, 2012b), and to thermal effects (Ye, Xu,
Wu, Zhang, Wang, Nikdast, Wang, and Liu, 2013). However, manufacturing-induced
process variations and run-time thermal effects are inevitable in
a many-core processor. Some research has been carried out to model the influence of
process variations and thermal effects on the communication performance (Ye, Xu,
Wu, Zhang, Wang, Nikdast, Wang, and Liu, 2013; Li, Mohamed, Chen, Dudley, Meng,
Shang, Mickelson, Joseph, Vachharajani, Schwartz, and Sun, 2012). The traditional
schemes for handling the faults and errors induced by process variations and thermal
effects include compensating for the resonant wavelength drift by adding supplementary
microrings (Xu, Yang, and Melhem, 2012b), and heating or applying electronic current
to the microring resonators (Nitta, Farrens, and Akella, 2011). These solutions can
dynamically adjust the resonant wavelengths of MRs to guarantee the communication
performance and reliability of ONoC. However, they can also introduce a large hardware
cost, e.g., to tune every MR in a large-scale ONoC architecture.
The design of a high-performance and reliable ONoC architecture is a challenging
problem. Thus, one item of future work is to explore the properties of process variations
and thermal effects, and to design a reliable ONoC architecture with the wavelength reuse
ability. The prospective solution is to devise backup optical routing paths which
use a separate set of wavelengths. When a wavelength drift occurs in an
optical routing path, the backup routing path can be rapidly configured to replace the
unreliable routing path. The wavelength allocated to the unreliable optical routing
path is reclaimed and can be used to establish a new backup routing path. In this
way, reliable communication can be achieved in the ONoC architecture
without a large amount of extra hardware cost.
6.2.2 3D ONoC Architecture
Several three-dimensional (3D) many-core architectures have been proposed to inte-
grate the cores, memory subsystem, and communication network in different layers
of the same chip through 3D stacking technologies (Morris, Kodi, and Louri, 2012;
Ramini, Grani, Bartolini, and Bertozzi, 2013). The main advantages of 3D network
architecture include that the average communication distance is significantly reduced
compared to the 2D network with the same number of cores, and the routing diversity
is increased since there are more links available between the cores. Some 3D ONoC
architectures have been proposed to deploy the optical network in multiple physical
layers (Ramini, Bertozzi, and Carloni, 2012; Zhang and Louri, 2010). They can employ
the wavelength-based routing between the cores to make use of the path diversity in
3D network. The insertion loss induced by optical devices along the optical routing
path can also be decreased, since the average length of optical routing path is reduced.
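The reduction in average communication distance can be made concrete by comparing a 2D and a 3D mesh with the same number of cores. The sketch below assumes shortest-path (Manhattan) routing, an 8×8 mesh against a 4×4×4 mesh, and averaging over all ordered pairs of distinct cores; these choices are illustrative and are not taken from the cited architectures.

```python
from itertools import product
from typing import Tuple


def average_hops(dims: Tuple[int, ...]) -> float:
    """Average Manhattan distance over all ordered pairs of distinct mesh nodes."""
    nodes = list(product(*(range(d) for d in dims)))
    total = sum(sum(abs(a - b) for a, b in zip(u, v)) for u in nodes for v in nodes)
    return total / (len(nodes) * (len(nodes) - 1))


print(f"8x8 mesh:   {average_hops((8, 8)):.2f} hops")
print(f"4x4x4 mesh: {average_hops((4, 4, 4)):.2f} hops")
```

For 64 cores, the 3D arrangement shortens the average route by more than one hop under these assumptions.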
3D ONoC introduces significant opportunities and challenges for the design of the
wavelength reuse scheme. First, there are more optical interconnects in a 3D ONoC, so
more wavelengths can be reused in the link-disjoint optical routing paths in
the routing and wavelength allocation scheme. Second, it can lead to a bottleneck
in the center of the optical network, since more optical routing paths need
to pass through the optical interconnects near the center of the network. It may even
require more wavelengths than a 2D ONoC without an efficient wavelength reuse
scheme. Thus, another prospective work is to explore the wavelength-reused 3D ONoC
architecture to reduce the number of required wavelengths and the power consumption.
The main idea is to utilize the large number of optical interconnects in 3D ONoC, and to
minimize the number of required wavelengths by considering the maximal wavelength
utilization in the optical interconnects and the maximal insertion loss in each optical
routing path. In this way, it is possible to balance the wavelength utilization in 3D
ONoC by bypassing some congested optical interconnects for wavelength reuse,
and to restrict the number of bypasses in each optical routing path according to an
expected insertion loss for high power efficiency.
6.2.3 Intra/Inter-chip Hybrid ONoC Architecture
Silicon optical interconnects are also promising for the communication between different
many-core processor chips. Some intra/inter-chip optical communication architectures
have been proposed (Grot, Hestness, Keckler, and Mutlu, 2011; Wu, Xu, Ye, Wang, Nikdast,
Wang, and Wang, 2015). Within each many-core processor chip, an ONoC block is
designed for the optical communication among the local cores; for the communication
between different chips, an inter-chip optical network is designed to connect all the
ONoC blocks. These architectures can obtain extremely low communication delay
and power consumption, since the intra-chip and inter-chip optical communication
architectures are unified in the design.
However, existing intra/inter-chip hybrid communication architectures only address
how to construct the hierarchical optical network from the intra-chip and inter-chip
networks. Since wavelength reuse is not considered, the size of each optical
network block is small. Hence, one prospective work is to explore wavelength reuse
in intra/inter-chip network architectures. The main idea is to design a block-based
ONoC architecture, with multiple ONoC blocks of different network sizes for local
communication, and an inter-ONoC optical network to interconnect all the ONoC blocks.
All the wavelengths can be reused in each ONoC block, as in the previous research;
multiple optical routing paths are established between two ONoC blocks to improve
the bandwidth capacity of the inter-ONoC optical network. This intra/inter-chip
hierarchical ONoC network makes wavelength reuse in the routing and wavelength
allocation scheme more complicated.
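A back-of-the-envelope sketch shows why reuse across blocks matters. It rests on assumptions that are not stated in the text: a block of n cores is taken to need n wavelengths (as in an n x n wavelength-routed crossbar), and each inter-block path is taken to carry a fixed number of wavelengths at a fixed data rate. The function and parameter names are hypothetical.

```python
from typing import Sequence


def wavelengths_needed(block_sizes: Sequence[int], reuse: bool = True) -> int:
    """Wavelength count for a set of ONoC blocks.

    Assumption (not from the text): a block with n cores needs n wavelengths.
    With reuse, every block shares the same wavelength set, so the largest
    block dominates; without reuse, each block needs its own disjoint set.
    """
    if not block_sizes:
        return 0
    return max(block_sizes) if reuse else sum(block_sizes)


def inter_block_bandwidth_gbps(
    num_paths: int, wavelengths_per_path: int, gbps_per_wavelength: float
) -> float:
    """Aggregate bandwidth between two blocks over parallel optical paths."""
    return num_paths * wavelengths_per_path * gbps_per_wavelength
```

Under these assumptions, three blocks of 4, 8, and 16 cores need 16 wavelengths with reuse and 28 without, and doubling the number of paths between two blocks doubles their aggregate bandwidth.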
In summary, the principle of wavelength reuse will be further investigated in
future research on high-performance and energy-efficient Optical Network on Chip for
the communication of many-core processors.
References
Abadal, S., Martinez, R., Alarcon, E., and Cabellos-Aparicio, A. (2014). Scalability-
Oriented Multicast Traffic Characterization. In IEEE/ACM International Sympo-
sium on Networks-on-Chip (NoCS), 180–181.
Agarwal, A., Iskander, C., and Shankar, R. (2009). Survey of Network on Chip (NoC)
Architectures and Contributions. Journal of Engineering, Computing and Architec-
ture, 3 (1), 1–15.
Bahadori, M., Rumley, S., Nikolova, D., and Bergman, K. (2016). Comprehensive De-
sign Space Exploration of Silicon Photonic Interconnects. IEEE Journal of Lightwave
Technology, 34 (12), 2975–2987.
Bahirat, S. and Pasricha, S. (2014). METEOR: Hybrid Photonic Ring-mesh Network-on-Chip
for Multicore Architectures. ACM Transactions on Embedded Computing Systems (TECS) -
Special Issue on Design Challenges for Many-Core Processors, 13 (3s), 116:1–116:33.
Bartolini, S., Lusnig, L., and Martinelli, E. (2013). Olympic: A Hierarchical All-
Optical Photonic Network for Low-Power Chip Multiprocessors. In IEEE Euromicro
Conference on Digital System Design (DSD), 56–59.
Batten, C., Joshi, A., Stojanovic, V., and Asanovic, K. (2012). Designing Chip-Level
Nanophotonic Interconnection Networks. IEEE Journal on Emerging and Selected
Topics in Circuits and Systems, 2 (2), 137–153.
Bhat, U. (2008). Simple Markovian Queueing Systems. In An Introduction to Queueing
Theory: Modeling and Analysis in Applications, 29–73. Birkhäuser Boston.
Bienia, C., Kumar, S., Singh, J., and Li, K. (2008). The PARSEC Benchmark Suite:
Characterization and Architectural Implications. In ACM International Conference
on Parallel Architectures and Compilation Techniques (PACT), 72–81.
Bjerregaard, T. and Mahadevan, S. (2006). A Survey of Research and Practices of
Network-on-Chip. ACM Computing Surveys (CSUR), 38 (1), 1–51.
Blake, G., Dreslinski, R., and Mudge, T. (2009). A Survey of Multicore Processors.
IEEE Signal Processing Magazine, 26 (6), 26–37.
Bokhari, H., Javaid, H., Shafique, M., Henkel, J., and Parameswaran, S. (2014). dar-
kNoC: Designing Energy-Efficient Network-on-Chip with Multi-Vt Cells for Dark
Silicon. In ACM/IEEE Design Automation Conference (DAC), 1–6.
Borkar, S. (2007). Thousand Core Chips: a Technology Perspective. In ACM/IEEE
Design Automation Conference (DAC), 746–749.
Catania, V., Holsmark, R., Kumar, S., and Palesi, M. (2006). A methodology for
design of application specific deadlock-free routing algorithms for NoC systems. In
IEEE International Conference on Hardware/Software Codesign and System Synthe-
sis (CODES+ISSS), 142–147.
Catania, V., Mineo, A., Monteleone, S., Palesi, M., and Patti, D. (2016). Cycle-
Accurate Network on Chip Simulation with Noxim. ACM Trans. Model. Comput.
Simul., 27 (1), 4:1–4:25.
Chan, J. and Bergman, K. (2012). Photonic Interconnection Network Architec-
tures Using Wavelength-Selective Spatial Routing for Chip-Scale Communications.
IEEE/OSA Journal of Optical Communications and Networking, 4 (3), 189–201.
Chan, J., Hendry, G., Biberman, A., Bergman, K., and Carloni, L. (2010). PhoenixSim:
A Simulator for Physical-Layer Analysis of Chip-Scale Photonic Interconnection Net-
works. In IEEE/ACM Design, Automation Test in Europe Conference Exhibition
(DATE), 691–696.
Chen, C., Agarwal, N., Krishna, T., Koo, K., Peh, L.-S., and Saraswat, K. (2010).
Physical vs. Virtual Express Topologies with Low-Swing Links for Future Many-
Core NoCs. In ACM/IEEE International Symposium on Networks-on-Chip (NOCS),
173–180.
Chen, C. and Joshi, A. (2013). Runtime Management of Laser Power in Silicon-
Photonic Multibus NoC Architecture. IEEE Journal of Selected Topics in Quantum
Electronics, 19 (2), 1–13.
Chen, C., Zhang, T., Contu, P., Klamkin, J., Coskun, A., and Joshi, A. (2014). Sharing
and Placement of On-chip Laser Sources in Silicon-Photonic NoCs. In IEEE/ACM
International Symposium on Networks-on-Chip (NOCS), 88–95.
Chen, G., Chen, H., Haurylau, M., Nelson, N., Fauchet, P., Friedman, E., and Albonesi,
D. (2005). Predictions of CMOS Compatible On-chip Optical Interconnect. In ACM
International Workshop on System Level Interconnect Prediction (SLIP), 13–20.
Chen, K., Chao, C., and Wu, A. (2015). Thermal-Aware 3D Network-On-Chip (3D
NoC) Designs: Routing Algorithms and Thermal Managements. IEEE Circuits and
Systems Magazine, 15 (4), 45–69.
Chen, Z., Gu, H., Chen, Y., and Zhang, H. (2013). Wavelength Assignment in Optical
Network-on-Chip: Design and Performance. In IEEE International Conference of
IEEE Region 10 (TENCON), 1–4.
Chen, Z., Gu, H., Yang, Y., and Chen, K. (2012). Low Latency and Energy Efficient
Optical Network-on-Chip Using Wavelength Assignment. IEEE Photonics Technology
Letters, 24 (24), 2296–2299.
Collet, J., Litaize, D., Campenhout, J., Jesshope, C., Desmulliez, M., Thienpont, H.,
Goodman, J., and Louri, A. (2000). Architectural Approach to the Role of Optics in
Monoprocessor and Multiprocessor Machines. OSA Applied Optics, 39 (5), 671–682.
Concer, N., Bononi, L., Soulie, M., Locatelli, R., and Carloni, L. (2009). CTC: An
End-to-End Flow Control Protocol for Multi-Core Systems-on-Chip. In ACM/IEEE
International Symposium on Networks-on-Chip (NOCS), 193–202.
Dally, W. and Towles, B. (2001). Route Packets, Not Wires: On-Chip Interconnection
Networks. In ACM/IEEE Design Automation Conference (DAC), 684–689.
Daya, B., Chen, C.-H., Subramanian, S., Kwon, W.-C., Park, S., Krishna, T., Holt,
J., Chandrakasan, A., and Peh, L.-S. (2014). SCORPIO: A 36-Core Research Chip
Demonstrating Snoopy Coherence on a Scalable Mesh NoC with In-network Order-
ing. In ACM/IEEE International Symposium on Computer Architecture (ISCA),
25–36.
Demir, Y. and Hardavellas, N. (2014). EcoLaser: An Adaptive Laser Control for
Energy-Efficient On-Chip Photonic Interconnects. In IEEE/ACM International
Symposium on Low Power Electronics and Design (ISLPED), 3–8.
Dinechin, B., Amstel, D., Poulhiès, M., and Lager, G. (2014). Time-Critical Computing
on a Single-Chip Massively Parallel Processor. In IEEE/ACM Design, Automation
Test in Europe Conference Exhibition (DATE), 1–6.
Dokania, R. and Apsel, A. (2009). Analysis of Challenges for On-chip Optical Inter-
connects. In ACM Great Lakes Symposium on VLSI (GLSVLSI), 275–280.
Dong, P., Shafiiha, R., Liao, S., Liang, H., Feng, N.-N., Feng, D., Li, G., Zheng, X.,
Krishnamoorthy, A., and Asghari, M. (2010). Wavelength-Tunable Silicon Microring
Modulator. Optics Express, 18, 10941–10946.
Ebrahimi, M., Daneshtalab, M., Liljeberg, P., Plosila, J., Flich, J., and Tenhunen, H.
(2014). Path-Based Partitioning Methods for 3D Networks-on-Chip with Minimal
Adaptive Routing. IEEE Transactions on Computers (TC), 63 (3), 718–733.
Eisley, N., Peh, L., and Shang, L. (2008). Leveraging On-Chip Networks for Data
Cache Migration in Chip Multiprocessors. In ACM/IEEE International Conference
on Parallel Architectures and Compilation Techniques (PACT), 197–207.
Esmaeilzadeh, H., Blem, E., Amant, R., Sankaralingam, K., and Burger, D. (2011).
Dark Silicon and the End of Multicore Scaling. In ACM/IEEE International Sym-
posium on Computer Architecture (ISCA), 365–376.
Feng, K., Ye, Y., and Xu, J. (2013). A formal study on topology and floorplan char-
acteristics of mesh and torus-based optical networks-on-chip. Microprocessors and
Microsystems, 37 (8), 941–952.
Fu, B., Han, Y., Li, H., and Li, X. (2010). Accelerating Lightpath Setup via Broadcast-
ing in Binary-tree Waveguide in Optical NoCs. In IEEE/ACM Design, Automation
Test in Europe Conference Exhibition (DATE), 933–936.
Fu, H., Liao, J., Yang, J., Wang, L., Song, Z., Huang, X., Yang, C., Xue, W., Liu, F.,
Qiao, F., Zhao, W., Yin, X., Hou, C., Ge, W., Zhang, J., Wang, Y., Zhou, C., and
Yang, G. (2016). The Sunway TaihuLight Supercomputer: System and Applications.
Science China - Information Sciences, 59 (7), 072001:1–072001:16.
Fusella, E. and Cilardo, A. (2016). Lighting Up On-Chip Communications with Pho-
tonics: Design Tradeoffs for Optical NoC Architectures. IEEE Circuits and Systems
Magazine, 16 (3), 4–14.
Fusella, E. and Cilardo, A. (2017). H2ONoC: A Hybrid Optical-Electronic NoC Based
on Hybrid Topology. IEEE Transactions on Very Large Scale Integration (VLSI)
Systems, 25 (1), 330–343.
Fusella, E., Flich, J., Cilardo, A., and Mazzeo, A. (2015). On the Design of a Path-Setup
Architecture for Exploiting Hybrid Photonic-Electronic NoCs. In IEEE Workshop
on Exploiting Silicon Photonics for Energy-Efficient High Performance Computing,
9–16.
Geer, D. (2005). Chip Makers Turn to Multicore Processors. IEEE Computer Maga-
zine, 38 (5), 11–13.
Gong, L., Zhou, X., Liu, X., Zhao, W., Lu, W., and Zhu, Z. (2013). Efficient Re-
source Allocation for All-optical Multicasting over Spectrum-sliced Elastic Optical
Networks. IEEE/OSA Journal of Optical Communications and Networking, 5 (8),
836–847.
Grani, P. and Bartolini, S. (2014). Design Options for Optical Ring Interconnect
in Future Client Devices. ACM Journal on Emerging Technologies in Computing
Systems (JETC), 10 (4), 30:1–30:25.
Grani, P., Bartolini, S., Furdiani, E., Ramini, L., and Bertozzi, D. (2014). Integrated
cross-layer solutions for enabling silicon photonics into future chip multiprocessors.
In IEEE International Mixed-Signals, Sensors, and Systems Test Workshop Proceed-
ings, 1–8.
Gratz, P., Grot, B., and Keckler, S. (2008). Regional Congestion Awareness for Load
Balance in Networks-on-Chip. In IEEE International Symposium on High Perfor-
mance Computer Architecture (HPCA), 203–214.
Gries, M., Hoffmann, U., Konow, M., and Riepen, M. (2011). SCC: a Flexible Architecture
for Many-Core Platform Research. IEEE Computing in Science & Engineering,
13 (6), 79–83.
Grot, B., Hestness, J., Keckler, S., and Mutlu, O. (2011). Kilo-NOC: A Heterogeneous
Network-on-chip Architecture for Scalability and Service Guarantees. In ACM/IEEE
International Symposium on Computer Architecture (ISCA), 401–412.
Gu, H., Chen, K., Yang, Y., Chen, Z., and Zhang, B. (2017). MRONoC: A Low Latency
and Energy Efficient on Chip Optical Interconnect Architecture. IEEE Photonics
Journal, 9 (1), 1–12.
Gu, H., Mo, K., Xu, J., and Zhang, W. (2009). A Low-power Low-cost Optical Router
for Optical Networks-on-Chip in Multiprocessor Systems-on-Chip. In IEEE Com-
puter Society Annual Symposium on VLSI (ISVLSI), 19–24.
Gu, H., Xu, J., and Zhang, W. (2009). A Low-Power Fat Tree-based Optical Network-
on-Chip for Multiprocessor System-on-Chip. In IEEE/ACM Design, Automation
Test in Europe Conference Exhibition (DATE), 3–8.
Guerre, A., Ventroux, N., David, R., and Merigot, A. (2010). Hierarchical Network-on-
Chip for Embedded Many-Core Architectures. In ACM/IEEE International Sympo-
sium on Networks-on-Chip (NOCS), 189–196.
Guerrier, P. and Greiner, A. (2000). A Generic Architecture for On-chip Packet-
Switched Interconnections. In IEEE/ACM Design, Automation Test in Europe Con-
ference Exhibition (DATE), 250–256.
Gunn, C. (2006). CMOS Photonics for High-Speed Interconnects. IEEE Micro Maga-
zine, 26 (2), 58–66.
Hamedani, P., Jerger, N., and Hessabi, S. (2014). QuT: A Low-Power Optical Network-
on-Chip. In ACM/IEEE International Symposium on Networks-on-Chip (NOCS),
80–87.
Hendry, G., Chan, J., Kamil, S., Oliker, L., Shalf, J., Carloni, L., and Bergman, K.
(2010). Silicon Nanophotonic Network-on-Chip Using TDM Arbitration. In IEEE
Symposium on High Performance Interconnects (HOTI), 88–95.
Henkel, J., Wolf, W., and Chakradhar, S. (2004). On-Chip Networks: A Scalable,
Communication-Centric Embedded System Design Paradigm. In IEEE International
Conference on VLSI Design, 845–851.
Hestness, J., Grot, B., and Keckler, S. (2010). Netrace: Dependency-Driven Trace-based
Network-on-chip Simulation. In IEEE International Workshop on Network on Chip
Architectures (NoCArc), 31–36.
Hoskote, Y., Vangal, S., Singh, A., Borkar, N., and Borkar, S. (2007). A 5-GHz Mesh
Interconnect for a Teraflops Processor. IEEE Micro Magazine, 27 (5), 51–61.
Hou, W., Guo, L., Cai, Q., and Zhu, L. (2014). 3D Torus ONoC: Topology design,
router modeling and adaptive routing algorithm. In IEEE International Conference
on Optical Communications and Networks (ICOCN), 1–4.
Hruska, J. (2015). IBM to Demonstrate First On-package Silicon Photonics. In Ex-
tremeTech.
Hu, J. and Marculescu, R. (2005). Energy- and Performance-Aware Mapping for Regu-
lar NoC Architectures. IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems, 24 (4), 551–562.
Intel (2007). Intel Research Advances ‘Era Of Tera’. In Intel News Release.
Intel (2013). Intel Silicon Photonics Demonstrated at 100 Gbps. In Intel News Release.
Intel (2016). Product Specifications: Intel Xeon Phi Processor 7290F. In
http://ark.intel.com/products/95831/.
Jiao, J. and Fu, Y. (2011). B2RAC: A Physical Express Link Addition Methodol-
ogy for Network on Chip. In ACM International Workshop on Network on Chip
Architectures (NoCArc), 17–22.
Kahng, A., Li, B., Peh, L., and Samadi, K. (2012). ORION 2.0: A Power-Area
Simulator for Interconnection Networks. IEEE Transactions on Very Large Scale
Integration (VLSI) Systems, 20 (1), 191–196.
Kahng, A., Lin, B., and Nath, S. (2015). ORION 3.0: A Comprehensive NoC Router
Estimation Tool. IEEE Embedded Systems Letters, 7 (2), 41–45.
Kalray (2012). MPPA: The Supercomputing on a Chip Solution. In
http://www.kalrayinc.com/kalray/products/.
Kao, Y. and Chao, H. (2014). Design of a Bufferless Photonic Clos Network-on-Chip
Architecture. IEEE Transactions on Computers (TC), 63 (3), 764–776.
Karkar, A., Mak, T., Tong, K.-F., and Yakovlev, A. (2016). A Survey of Emerging
Interconnects for On-Chip Efficient Multicast and Broadcast in Many-Cores. IEEE
Circuits and Systems Magazine, 16 (1), 58–72.
Kazmierczak, A., Briere, M., Drouard, E., Bontoux, P., Rojo-Romeo, P., O’Connor,
I., Letartre, X., Gaffiot, F., Orobtchouk, R., and Benyattou, T. (2005). Design,
Simulation, and Characterization of a Passive Optical Add-drop Filter in Silicon-on-Insulator
Technology. IEEE Photonics Technology Letters, 17 (7), 1447–1449.
Kelm, J., Johnson, M., Lumetta, S., and Patel, S. (2010). WayPoint: Scaling Coherence
to 1000-Core Architectures. In ACM/IEEE International Conference on Parallel
Architectures and Compilation Techniques (PACT), 99–109.
Khalili, F. and Zarandi, H. (2012). A Fault-Tolerant Low-Energy Multi-Application
Mapping onto NoC-based Multiprocessors. In IEEE International Conference on
Computational Science and Engineering, 421–428.
Khdr, H., Pagani, S., Shafique, M., and Henkel, J. (2015). Thermal Constrained
Resource Management for Mixed ILP-TLP Workloads in Dark Silicon Chips. In
ACM/IEEE Design Automation Conference (DAC), 1–6.
Kim, H., Seo, J., and Han, T. (2011). 3CEO: Three Dimensional Cmesh Based
Electrical-Optical Router for Networks-on-Chip. In IEEE International Conference
on ICT Convergence, 114–119.
Kim, J., Nicopoulos, C., Park, D., Narayanan, V., Yousif, M., and Das, C. (2006).
A Gracefully Degrading and Energy-Efficient Modular Router Architecture for On-
Chip Networks. In ACM/IEEE International Symposium on Computer Architecture
(ISCA), 4–15.
Kirman, N., Kirman, M., Dokania, R., Martinez, J., Apsel, A., Watkins, M., and Al-
bonesi, D. (2006). Leveraging Optical Technology in Future Bus-based Chip Multi-
processors. In IEEE/ACM International Symposium on Microarchitecture (MICRO),
492–503.
Koester, S., Dehlinger, G., Schaub, J., Chu, J., Ouyang, Q., and Grill, A. (2005).
Germanium-on-Insulator Photodetectors. In IEEE International Conference on
Group IV Photonics, 171–173.
Koohi, S. and Hessabi, S. (2014). All-Optical Wavelength-Routed Architecture for a
Power-Efficient Network on Chip. IEEE Transactions on Computers (TC), 63 (3),
777–792.
Krishna, T., Peh, L., Beckmann, B., and Reinhardt, S. (2011). Towards the Ideal
On-Chip Fabric for 1-to-Many and Many-to-1 Communication. In IEEE/ACM In-
ternational Symposium on Microarchitecture (MICRO), 71–82.
Kumar, A., Peh, L.-S., Kundu, P., and Jha, N. (2007). Express Virtual Channels:
Towards the Ideal Interconnection Fabric. In ACM/IEEE International Symposium
on Computer Architecture (ISCA), 150–161.
Kumar, S., Jantsch, A., Soininen, J., Forsell, M., Millberg, M., Oberg, J., Tiensyrja, K.,
and Hemani, A. (2002). A Network on Chip Architecture and Design Methodology.
In IEEE Computer Society Annual Symposium on VLSI (ISVLSI), 105–112.
Kurian, G., Miller, J., Psota, J., Eastep, J., Liu, J., Michel, J., Kimerling, L., and
Agarwal, A. (2010). ATAC: A 1000-core Cache-coherent Processor with On-chip
Optical Network. In ACM/IEEE International Conference on Parallel Architectures
and Compilation Techniques (PACT), 153–164.
Latif, K., Seceleanu, T., and Tenhunen, H. (2010). Power and Area Efficient Design of
Network-on-Chip Router through Utilization of Idle Buffers. In IEEE International
Conference and Workshops on Engineering of Computer Based Systems, 131–138.
Le Beux, S., Li, H., O’Connor, I., Cheshmi, K., Liu, X., Trajkovic, J., and Nicolescu,
G. (2014). Chameleon: Channel Efficient Optical Network-on-chip. In IEEE/ACM
Design, Automation Test in Europe Conference Exhibition (DATE), 304:1–304:6.
Le Beux, S., Trajkovic, J., O’Connor, I., Nicolescu, G., Bois, G., and Paulin, P. (2011).
Optical Ring Network-on-Chip (ORNoC): Architecture and Design Methodology.
In IEEE/ACM Design, Automation Test in Europe Conference Exhibition (DATE),
1–6.
Lee, B., Chen, X., Biberman, A., Liu, X., Hsieh, I., Chou, C., Dadap, J., Xia, F.,
Green, W., Sekaric, L., Vlasov, Y., Osgood, R., and Bergman, K. (2008). Ultrahigh-
Bandwidth Silicon Photonic Nanowire Waveguides for On-Chip Networks. IEEE
Photonics Technology Letters, 20 (6), 398–400.
Li, C., Browning, M., Gratz, P., and Palermo, S. (2014). LumiNOC: A Power-Efficient,
High-Performance, Photonic Network-on-Chip. IEEE Transactions on Computer-
Aided Design of Integrated Circuits and Systems (TCAD), 33 (6), 826–838.
Li, F., Nicopoulos, C., Richardson, T., Xie, Y., Narayanan, V., and Kandemir,
M. (2006). Design and Management of 3D Chip Multiprocessors Using Network-
in-Memory. In ACM/IEEE International Symposium on Computer Architecture
(ISCA), 130–141.
Li, M., Zeng, Q.-A., and Jone, W.-B. (2006). DyXY - A Proximity Congestion-Aware
Deadlock-Free Dynamic Routing Method for Network on Chip. In ACM/IEEE De-
sign Automation Conference (DAC), 849–852.
Li, X., Gu, H., Chen, K., Song, L., and Hao, Q. (2016). STorus: A new topology for
optical network-on-chip. Optical Switching and Networking, 22 (1), 77–85.
Li, Z., Mohamed, M., Chen, X., Dudley, E., Meng, K., Shang, L., Mickelson, A., Joseph,
R., Vachharajani, M., Schwartz, B., and Sun, Y. (2012). Reliability Modeling and
Management of Nanophotonic On-Chip Networks. IEEE Transactions on Very Large
Scale Integration (VLSI) Systems, 20 (1), 98–111.
Liu, A., Liao, L., Chetrit, Y., Basak, J., Nguyen, H., Rubin, D., and Paniccia, M.
(2010). Wavelength Division Multiplexing Based Photonic Integrated Circuits on
Silicon-on-Insulator Platform. IEEE Journal of Selected Topics in Quantum Electronics,
16 (1), 23–32.
Liu, F., Gu, H., and Yang, Y. (2010). Performance Study of Virtual-Channel Router
for Network-on-Chip. In IEEE International Conference On Computer Design and
Applications, Volume 5, V5–255–V5–259.
Liu, F., Gu, H., and Yang, Y. (2012). DTBR: A Dynamic Thermal-Balance Routing
Algorithm for Network-on-Chip. Computers and Electrical Engineering, 38 (2), 270–281.
Liu, F., Zhang, H., Chen, Y., Huang, Z., and Gu, H. (2015). WRH-ONoC: A
Wavelength-Reused Hierarchical Architecture for Optical Network on Chips. In IEEE
Conference on Computer Communications (INFOCOM), 1912–1920.
Liu, F., Zhang, H., Chen, Y., Huang, Z., and Gu, H. (2016). Dynamic Ring-based
Multicast with Wavelength Reuse for Optical Network on Chips. In IEEE Interna-
tional Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSOC),
153–160.
Liu, J., Yang, J., and Melhem, R. (2015). GASOLIN: Global Arbitration for Streams
of Data in Optical Links. In IEEE International Parallel and Distributed Processing
Symposium (IPDPS), 93–102.
Lu, H., Fu, B., Wang, Y., Han, Y., Yan, G., and Li, X. (2015). RISO: Enforce
Noninterfered Performance With Relaxed Network-on-Chip Isolation in Many-Core
Cloud Processors. IEEE Transactions on Very Large Scale Integration (VLSI) Systems,
23 (12), 3053–3064.
Luo, J., Killian, C., Le Beux, S., Chillet, D., Li, H., O’Connor, I., and Sentieys,
O. (2015). Channel Allocation Protocol for Reconfigurable Optical Network-on-
Chip. In IEEE Workshop on Exploiting Silicon Photonics for Energy-Efficient High
Performance Computing, 33–39.
Ma, S., Jerger, N., and Wang, Z. (2011). DBAR: an Efficient Routing Algorithm
to Support Multiple Concurrent Applications in Networks-on-Chip. In ACM/IEEE
International Symposium on Computer Architecture (ISCA), 413–424.
Ma, S., Jerger, N., and Wang, Z. (2012). Supporting Efficient Collective Communi-
cation in NoCs. In IEEE International Symposium on High Performance Computer
Architecture (HPCA), 1–12.
Ma, S., Jerger, N., Wang, Z., Lai, M., and Huang, L. (2014). Holistic Routing Algo-
rithm Design to Support Workload Consolidation in NoCs. IEEE Transactions on
Computers, 63 (3), 529–542.
Marculescu, R., Hu, J., and Ogras, U. (2005). Key Research Problems in NoC Design:
A Holistic Perspective. In IEEE/ACM/IFIP International Conference on Hard-
ware/Software Codesign and System Synthesis (CODES+ISSS), 69–74.
Marculescu, R., Ogras, U., Peh, L., Jerger, N., and Hoskote, Y. (2009). Outstanding
Research Problems in NoC Design: System, Microarchitecture, and Circuit Per-
spectives. IEEE Transactions on Computer-Aided Design of Integrated Circuits and
Systems (TCAD), 28 (1), 3–21.
Meindl, J. (2003). Beyond Moore’s Law: the Interconnect Era. Computing in Science
& Engineering, 5 (1), 20–24.
Mellanox (2013). Seventy Two Core Processor SoC with 8x 10Gb Ethernet Ports, PCIe
and Networking Offloads. In TILE-Gx72 Processor Product.
Miller, D. (2000). Rationale and Challenges for Optical Interconnects to Electronic
Chips. Proceedings of the IEEE, 88 (6), 728–749.
Mo, K., Ye, Y., Wu, X., Zhang, W., Liu, W., and Xu, J. (2010). A Hierarchical Hybrid
Optical-Electronic Network-on-Chip. In IEEE Computer Society Annual Symposium
on VLSI (ISVLSI), 327–332.
Moore, G. (2006). Progress in Digital Integrated Electronics. IEEE Solid-State Circuits
Society Newsletter, 20 (3), 36–37.
Morris, R., Jolley, E., and Kodi, A. (2014). Extending the Performance and Energy-
Efficiency of Shared Memory Multicores with Nanophotonic Technology. IEEE
Transactions on Parallel and Distributed Systems (TPDS), 25 (1), 83–92.
Morris, R. and Kodi, A. (2010). Exploring the Design of 64- and 256-Core Power
Efficient Nanophotonic Interconnect. IEEE Journal of Selected Topics in Quantum
Electronics, 16 (5), 1386–1393.
Morris, R., Kodi, A., and Louri, A. (2012). Dynamic Reconfiguration of 3D Photonic
Networks-on-Chip for Maximizing Performance and Improving Fault Tolerance. In
IEEE/ACM International Symposium on Microarchitecture (MICRO), 282–293.
Morris, R., Kodi, A., Louri, A., and Whaley, R. (2014). Three-Dimensional Stacked
Nanophotonic Network-on-Chip Architecture with Minimal Reconfiguration. IEEE
Transactions on Computers (TC), 63 (1), 243–255.
Munk, P., Freier, M., Richling, J., and Chen, J. (2015). Dynamic Guaranteed Service
Communication on Best-Effort Networks-on-Chip. In IEEE Euromicro International
Conference on Parallel, Distributed, and Network-Based Processing (PDP), 353–360.
Nicopoulos, C., Park, D., Kim, J., Vijaykrishnan, N., Yousif, M., and Das, C. (2006).
ViChaR: a Dynamic Virtual Channel Regulator for Network-on-Chip Routers. In
IEEE/ACM International Symposium on Microarchitecture (MICRO), 333–346.
Nitta, C., Farrens, M., and Akella, V. (2011). Addressing System-Level Trimming
Issues in On-Chip Nanophotonic Networks. In IEEE International Symposium on
High Performance Computer Architecture (HPCA), 122–131.
Nychis, G., Fallin, C., Moscibroda, T., Mutlu, O., and Seshan, S. (2012). On-chip
Networks from a Networking Perspective: Congestion and Scalability in Many-core
Interconnects. In ACM Conference on Applications, Technologies, Architectures, and
Protocols for Computer Communication (SIGCOMM), 407–418.
O’Connor, I. (2004). Optical Solutions for System-level Interconnect. In ACM/IEEE
International Workshop on System Level Interconnect Prediction (SLIP), 79–88.
Ogras, U. and Marculescu, R. (2006). “It’s a Small World After All”: NoC Performance
Optimization via Long-range Link Insertion. IEEE Transactions on Very Large Scale
Integration (VLSI) Systems, 14 (7), 693–706.
Olofsson, A. (2016). Epiphany-V: A 1024 processor 64-bit RISC System-On-Chip. In
arXiv:1610.01832, 1–15.
Oracle (2015). Oracle Announces Breakthrough Processor and Systems Design with
SPARC M7. In Oracle Press Release.
Owens, J., Dally, W., Ho, R., Jayasimha, D., Keckler, S., and Peh, L. (2007). Research
Challenges for On-Chip Interconnection Networks. IEEE Micro Magazine, 27 (5),
96–108.
Pan, Y., Kim, J., and Memik, G. (2010). FlexiShare: Channel Sharing for an
Energy-Efficient Nanophotonic Crossbar. In IEEE International Symposium on
High-Performance Computer Architecture (HPCA), 1–12.
Pan, Y., Kumar, P., Kim, J., Memik, G., Zhang, Y., and Choudhary, A. (2009).
Firefly: Illuminating Future Network-on-Chip with Nanophotonics. In ACM/IEEE
International Symposium on Computer Architecture (ISCA), 429–440.
Pande, P., Grecu, C., Jones, M., Ivanov, A., and Saleh, R. (2005). Performance Evalu-
ation and Design Trade-Offs for Network-on-Chip Interconnect Architectures. IEEE
Transactions on Computers, 54 (8), 1025–1040.
Parikh, R., Das, R., and Bertacco, V. (2014). Power-aware NoCs Through Routing and
Topology Reconfiguration. In ACM/IEEE Design Automation Conference (DAC),
1–6.
Petracca, M., Lee, B., Bergman, K., and Carloni, L. (2008). Design Exploration of
Optical Interconnection Networks for Chip Multiprocessors. In IEEE Symposium on
High Performance Interconnects (HOTI), 31–40.
Petracca, M., Lee, B., Bergman, K., and Carloni, L. (2009). Photonic NoCs: System-
Level Design Exploration. IEEE Micro Magazine, 29 (4), 74–85.
Preston, K., Droz, N., Levy, J., and Lipson, M. (2011). Performance Guidelines for
WDM Interconnects based on Silicon Microring Resonators. In IEEE Conference on
Laser Science to Photonic Applications (CLEO), 137–153.
Qualcomm (2015). Snapdragon 652 Processor. In
https://www.qualcomm.com/products/snapdragon/processors/652.
Rahmani, A., Haghbayan, M., Kanduri, A., Weldezion, A., Liljeberg, P., Plosila, J.,
Jantsch, A., and Tenhunen, H. (2015). Dynamic Power Management for Many-
core Platforms in the Dark Silicon Era: A Multi-objective Control Approach.
In IEEE/ACM International Symposium on Low Power Electronics and Design
(ISLPED), 219–224.
Rahmani, A., Latif, K., Liljeberg, P., Plosila, J., and Tenhunen, H. (2010). Research
and Practices on 3D Networks-on-Chip Architectures. In IEEE Norchip Conference,
1–6.
Ramini, L., Bertozzi, D., and Carloni, L. (2012). Engineering a Bandwidth-Scalable
Optical Layer for a 3D Multi-core Processor with Awareness of Layout Constraints.
In ACM/IEEE International Symposium on Networks-on-Chip (NOCS), 185–192.
Ramini, L., Grani, P., Bartolini, S., and Bertozzi, D. (2013). Contrasting Wavelength-
routed Optical NoC Topologies for Power-efficient 3D-stacked Multicore Processors
Using Physical-layer Analysis. In IEEE/ACM Design, Automation Test in Europe
Conference Exhibition (DATE), 1589–1594.
Ramini, L., Tala, M., and Bertozzi, D. (2014). Exploring Communication Protocols
for Optical Networks-on-Chip based on Ring Topologies. In IEEE/OSA Asia Com-
munications and Photonics Conference, 1–3.
Rodrigo, S., Flich, J., Duato, J., and Hummel, M. (2008). Efficient Unicast and
Multicast Support for CMPs. In IEEE/ACM International Symposium on Microar-
chitecture (MICRO), 364–375.
Rădulescu, A., Goossens, K., Micheli, G., Murali, S., and Coenen, M. (2006). A
Buffer-Sizing Algorithm for Networks on Chip Using TDMA and Credit-Based End-
to-end Flow Control. In IEEE International Conference on Hardware/Software Code-
sign and System Synthesis (CODES+ISSS), 130–135.
Samih, A., Wang, R., Krishna, A., Maciocco, C., Tai, C., and Solihin, Y. (2013).
Energy-Efficient Interconnect via Router Parking. In IEEE International Symposium
on High Performance Computer Architecture (HPCA), 508–519.
Samman, F., Hollstein, T., and Glesner, M. (2010). Adaptive and Deadlock-Free Tree-
Based Multicast Routing for Networks-on-Chip. IEEE Transactions on Very Large
Scale Integration (VLSI) Systems, 18 (7), 1067–1080.
Shacham, A. and Bergman, K. (2007). Building Ultralow-Latency Interconnection
Networks Using Photonic Integration. IEEE Micro Magazine, 27 (4), 6–20.
Shacham, A., Bergman, K., and Carloni, L. (2008). Photonic Networks on Chip
for Future Generations of Chip Multiprocessors. IEEE Transactions on Comput-
ers (TC), 57, 1246–1260.
Swanson, S. and Taylor, M. (2011). Greendroid: Exploring the Next Evolution in
Smartphone Application Processors. IEEE Communications Magazine, 49 (4), 112–
119.
Take, Y., Matsutani, H., Sasaki, D., Koibuchi, M., Kuroda, T., and Amano, H. (2014).
3D NoC with Inductive-Coupling Links for Building-Block SiPs. IEEE Transactions
on Computers, 63 (3), 748–763.
Tala, M., Castellari, M., Balboni, M., and Bertozzi, D. (2016). Populating and Explor-
ing the Design Space of Wavelength-Routed Optical Network-on-Chip Topologies by
Leveraging the Add-Drop Filtering Primitive. In ACM/IEEE International Sympo-
sium on Networks-on-Chip (NOCS), 1–8.
Tan, X., Yang, M., Zhang, L., Jiang, Y., and Yang, J. (2012). A Generic Optical
Router Design for Photonic Network-on-Chips. IEEE Journal of Lightwave Technology,
30 (3), 368–376.
Tan, X., Yang, M., Zhang, L., Wang, X., and Jiang, Y. (2014). A Hybrid Optoelectronic
Networks-on-Chip Architecture. IEEE/OSA Journal of Lightwave Technology, 32 (5),
991–998.
Taylor, M. (2013). A Landscape of the New Dark Silicon Design Regime. IEEE Micro
Magazine, 33 (5), 8–19.
Vangal, S., Howard, J., Ruhl, G., Dighe, S., Wilson, H., Tschanz, J., Finan, D., Singh,
A., Jacob, T., Jain, S., Erraguntla, V., Roberts, C., Hoskote, Y., Borkar, N., and
Borkar, S. (2008). An 80-Tile Sub-100-W TeraFLOPS Processor in 65-nm CMOS.
IEEE Journal of Solid-State Circuits, 43 (1), 29–41.
Vantrease, D., Schreiber, R., Monchiero, M., McLaren, M., Jouppi, N., Fiorentino,
M., Davis, A., Binkert, N., Beausoleil, R., and Ahn, J. (2008). Corona: System
Implications of Emerging Nanophotonic Technology. In ACM/IEEE International
Symposium on Computer Architecture (ISCA), 153–164.
Venkatesh, G., Sampson, J., Goulding-Hotta, N., Venkata, S., Taylor, M., and Swan-
son, S. (2011). QsCores: Trading Dark Silicon for Scalable Energy Efficiency with
Quasi-specific Cores. In IEEE/ACM International Symposium on Microarchitecture
(MICRO), 163–174.
Vlasov, Y., Green, W., and Xia, F. (2008). High-Throughput Silicon Nanophotonic
Wavelength-Insensitive Switch for On-Chip Optical Networks. Nature Photonics,
2 (4), 242–246.
Wang, X., Gu, H., Yang, Y., Wang, K., and Hao, Q. (2016). A Highly Scalable
Optical Network-on-Chip With Small Network Diameter and Deadlock Freedom.
IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 24 (12),
3424–3436.
Wang, Z., Xu, J., Wu, X., Ye, Y., Zhang, W., Nikdast, M., Wang, X., and Wang,
Z. (2014). Floorplan Optimization of Fat-Tree-Based Networks-on-Chip for Chip
Multiprocessors. IEEE Transactions on Computers (TC), 63 (6), 1446–1459.
Werner, S., Navaridas, J., and Luján, M. (2015). Amon: An Advanced Mesh-like Optical
NoC. In IEEE Annual Symposium on High-Performance Interconnects (HOTI), 52–
59.
Winter, M. and Fettweis, G. (2011). Guaranteed Service Virtual Channel Allocation
in NoCs for Run-Time Task Scheduling. In IEEE/ACM Design, Automation & Test
in Europe Conference & Exhibition (DATE), 1–6.
Wu, X., Xu, J., Ye, Y., Wang, X., Nikdast, M., Wang, Z., and Wang, Z. (2015). An
Inter/Intra-Chip Optical Network for Manycore Processors. IEEE Transactions on
Very Large Scale Integration (VLSI) Systems, 23 (4), 678–691.
Xie, Y., Nikdast, M., Xu, J., Wu, X., Zhang, W., Ye, Y., Wang, X., Wang, Z., and
Liu, W. (2013). Formal Worst-Case Analysis of Crosstalk Noise in Mesh-Based Op-
tical Networks-on-Chip. IEEE Transactions on Very Large Scale Integration (VLSI)
Systems, 21 (10), 1823–1836.
Xie, Y., Xu, W., Zhao, W., Huang, Y., Song, T., and Guo, M. (2015). Performance
Optimization and Evaluation for Torus-Based Optical Networks-on-Chip. Journal
of Lightwave Technology, 33 (18), 3858–3865.
Xu, Y., Yang, J., and Melhem, R. (2012a). Channel Borrowing: An Energy-efficient
Nanophotonic Crossbar Architecture with Light-weight Arbitration. In ACM Inter-
national Conference on Supercomputing (ICS), 133–142.
Xu, Y., Yang, J., and Melhem, R. (2012b). Tolerating Process Variations in Nanopho-
tonic On-Chip Networks. In ACM/IEEE International Symposium on Computer
Architecture (ISCA), 142–152.
Yan, G., Li, Y., Han, Y., Li, X., Guo, M., and Liang, X. (2012). AgileRegulator: A
Hybrid Voltage Regulator Scheme Redeeming Dark Silicon for Power Efficiency in
a Multicore Architecture. In IEEE International Symposium on High Performance
Computer Architecture (HPCA), 1–12.
Yang, L., Liu, W., Jiang, W., Li, M., Yi, J., and Sha, E. (2016). FoToNoC: A Hierar-
chical Management Strategy based on Folded Torus-like Network-on-Chip for Dark
Silicon Many-core Systems. In IEEE Asia and South Pacific Design Automation
Conference (ASP-DAC), 725–730.
Ye, Y., Xu, J., Huang, B., Wu, X., Zhang, W., Wang, X., Nikdast, M., Wang, Z., Liu,
W., and Wang, Z. (2013). 3-D Mesh-Based Optical Network-on-Chip for Multipro-
cessor System-on-Chip. IEEE Transactions on Computer-Aided Design of Integrated
Circuits and Systems, 32 (4), 584–596.
Ye, Y., Xu, J., Wu, X., Zhang, W., Liu, W., and Nikdast, M. (2012). A Torus-Based Hi-
erarchical Optical-Electronic Network-on-Chip for Multiprocessor System-on-Chip.
ACM Journal on Emerging Technologies in Computing Systems (JETC), 8 (1), 5:1–
5:26.
Ye, Y., Xu, J., Wu, X., Zhang, W., Wang, X., Nikdast, M., Wang, Z., and Liu, W.
(2013). System-Level Modeling and Analysis of Thermal Effects in Optical
Networks-on-Chip. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 21 (2),
292–305.
Yoo, Y., Ahn, S., and Kim, C. (2003). Adaptive Routing Considering the Number
of Available Wavelengths in WDM Networks. IEEE Journal on Selected Areas in
Communications, 21 (8), 1263–1273.
Zhang, B., Gu, H., Wang, K., Yang, Y., and Tan, W. (2017). Low polling time TDM
ONOC with direction-based wavelength assignment. IEEE/OSA Journal of Optical
Communications and Networking, 9 (6), 479–488.
Zhang, B., Gu, H., Yang, Y., Chen, K., and Hao, Q. (2014). Flyover Architecture
for Cluster and TDM-Based Optical Network-On-Chip. IEEE Photonics Technology
Letters, 26 (24), 2422–2425.
Zhang, L., Ma, X., Yu, J., Yang, M., Liu, P., Yang, J., and Jiang, Y. (2015). Exploration
of generalized circuit switching optical network-on-chip architecture. In IEEE Optical
Interconnects Conference (OI), 82–83.
Zhang, X. and Louri, A. (2010). A Multilayer Nanophotonic Interconnection Net-
work for On-Chip Many-Core Communications. In ACM/IEEE Design Automation
Conference (DAC), 156–161.
Zhao, J., Gong, Y., Tan, W., and Gu, H. (2016). 3D-DMONoC: A New Topology for
Optical Network on Chip. In IEEE International Conference on Optical Communi-
cations and Networks (ICOCN), 1–3.
Zuffada, M. (2012). The Industrialization of the Silicon Photonics: Technology Road
Map and Applications. In IEEE Proceedings of the ESSCIRC (ESSCIRC), 7–13.
