Rochester Institute of Technology

RIT Scholar Works
Theses
8-2014

Heterogeneous Photonic Network-on-Chip with Dynamic
Bandwidth Allocation
Ankit Himanshu Shah


Recommended Citation
Shah, Ankit Himanshu, "Heterogeneous Photonic Network-on-Chip with Dynamic Bandwidth Allocation"
(2014). Thesis. Rochester Institute of Technology. Accessed from

This Thesis is brought to you for free and open access by RIT Scholar Works. It has been accepted for inclusion in
Theses by an authorized administrator of RIT Scholar Works. For more information, please contact
ritscholarworks@rit.edu.

Heterogeneous Photonic Network-on-Chip with Dynamic
Bandwidth Allocation
By

Ankit Himanshu Shah
A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of
Master of Science in Computer Engineering
Supervised by
Dr. Amlan Ganguly
Department of Computer Engineering
Kate Gleason College of Engineering
Rochester Institute of Technology
Rochester, NY
August, 2014

Approved By:

Dr. Amlan Ganguly
Primary Advisor – R.I.T. Dept. of Computer Engineering

Dr. Andres Kwasinski
Secondary Advisor – R.I.T. Dept. of Computer Engineering

Dr. Sonia Lopez Alarcon
Secondary Advisor – R.I.T. Dept. of Computer Engineering

Dedication

Dedicated to my parents Mr. Himanshu Navinchandra Shah
And Mrs. Mona Himanshu Shah


Acknowledgements

I would like to express my great appreciation to my primary advisor, Dr.
Amlan Ganguly, for the constant guidance, patience and support that he
extended throughout the duration of this work. Dr. Ganguly was always there
to review my work and give valuable suggestions, which always helped in
keeping the work on track.
I would also like to thank Naseef Mansoor for all the valuable
discussions and my parents for their encouragement and heartfelt support
during the course of this work.


Abstract
Advances in chip fabrication have made it possible to integrate more transistors
in a given area, which has led to an era of multi-core processors. Future
multi-core chips or chip multiprocessors (CMPs) will have hundreds of
heterogeneous components including processing engines, custom logic, GPU units,
programmable fabrics and distributed memory. Such multi-core chips are expected
to run multiple, varied parallel workloads simultaneously. Hence, different
communicating cores will require different bandwidths, which makes a
heterogeneous Network-on-Chip (NoC) architecture necessary. Simply
over-provisioning for performance will invariably result in a loss of power
efficiency. On the other hand, recent research has shown that photonic
interconnects are capable of achieving high-bandwidth and energy-efficient
on-chip data transfer. In this thesis we propose a dynamic heterogeneous photonic
NoC (d-HetPNoC) architecture with dynamic bandwidth allocation to achieve better
performance and energy efficiency than a homogeneous photonic NoC architecture
with the same aggregate data bandwidth.


Contents
Abstract
Chapter 1 Introduction
1.1 Era of Multi-core processors
1.2 Heterogeneous Multi-Core chips
1.3 Interconnects in Multi-Core chips
1.4 Network on Chip (NoC) paradigm
1.5 Emerging interconnects
1.6 Thesis Contribution
Chapter 2 Related Work
2.1 Photonic Elements
2.2 Existing PNoC Architectures
Chapter 3 Dynamic Heterogeneous Photonic NOC (d-HetPNoC)
3.1 Network Architecture
3.2 Dynamic Bandwidth Allocation (DBA) Mechanism
3.3 Flow Control and Routing
3.4 Experimental Results
Chapter 4 Conclusion
Reference

List of figures
Figure 1-1: Speedup of 1024B flit size over baseline (32B flit size) with benchmarks
from CUDA SDK (upper case) [26] and Rodinia (lower case) [25] with number of
kernel launches in parenthesis
Figure 1-2: Network-on-Chip architecture
Figure 1-3: NoC switch architecture
Figure 2-1: A basic photonic switch
Figure 2-2: Firefly Architecture: (a) Cross-bar between clusters of same assembly,
(b) waveguide supporting inter-cluster crossbars. Reproduced from [20]
Figure 2-3: Reservation-assisted Single Write Multiple Read. Adapted from [20]
Figure 3-1: Dynamic bandwidth allocation enabled PNoC architecture.
Figure 3-2: Microarchitecture of photonic router
Figure 3-3: Peak Bandwidth of Firefly PNoC and d-HetPNoC for uniform-random
and skewed traffic patterns for (a) Bandwidth set 1 (Total Wavelengths = 64) (b)
Bandwidth set 2 (Total Wavelengths = 256) (c) Bandwidth set 3 (Total Wavelengths
= 512).
Figure 3-4: Packet Energy of Firefly PNoC and d-HetPNoC for uniform-random and
skewed traffic patterns for (a) BW Set 1 (Total Wavelengths = 64) (b) BW set 2
(Total Wavelengths = 256) (c) BW Set 3 (Total Wavelengths = 512)
Figure 3-5: Peak Core Bandwidth and Packet Energy for Firefly PNoC for synthetic
and real application based traffic scenarios
Figure 3-6: Peak Core Bandwidth and Packet Energy for Firefly PNoC for synthetic
and real application based traffic scenarios
Figure 3-7: Comparison of (a) Peak Core Bandwidth (b) Energy Per Message of
d-HetPNoC for BW Set 1 (Total Wavelengths = 64), BW Set 2 (Total Wavelengths =
256) and BW Set 3 (Total Wavelengths = 512) for Uniform Random and Skewed
Traffic Patterns
Figure 3-8: Effect of increase in total number of wavelengths on Peak Bandwidth
and Area for d-HetPNoC for Skewed 3 traffic pattern.
Figure 3-9: Effect of increase in total number of wavelengths on Energy per Message
and Area for d-HetPNoC for Skewed 3 traffic pattern.
Figure 3-10: Comparison of (a) Peak Core Bandwidth (b) Energy Per Message of
Firefly architecture for BW Set 1 (Total Wavelengths = 64), BW Set 2 (Total
Wavelengths = 256) and BW Set 3 (Total Wavelengths = 512) for Uniform Random
and Skewed Traffic Patterns

List of Tables
Table 3-1: Frequency of communication for applications with different bandwidth
for skewed traffic scenarios
Table 3-2: Frequency of communication for applications with bandwidth set 1 for
skewed traffic scenarios
Table 3-3: Simulation Parameters
Table 3-4: Power or Energy dissipation of photonic components
Table 3-5: Energy of different photonic components

Chapter 1 Introduction
Modern sophisticated applications need more computational power and have
been the driving force behind building more powerful computers. Initially this
was achieved by increasing the clock speed of the processor. This, however, led
to increased power consumption and heat dissipation. In accordance with Moore’s
Law, the number of transistors fabricated on a chip has doubled roughly every 18
months. With the growing transistor count and the limit on further clock speed
increases, the number of cores on a chip was increased to exploit the parallelism
in the computational requirements of an application. A significant amount of
information is exchanged between these cores; hence the interconnects between
the cores play an important role in multi-core chips.

1.1 Era of Multi-core processors

In the past, processor performance was improved by increasing the clock speed,
which sped up single-threaded, serial code. Higher clock speeds were achieved by
increasing the depth of the pipeline. Deeper pipelines are no longer profitable
because the flip-flop delay becomes comparable to the combinational logic delay.
Another disadvantage of higher clock speeds was that Central Processing Units
(CPUs) were hitting a power wall. Power dissipation in a processor is directly
proportional to the CPU frequency, i.e. the clock speed at which the CPU
operates, so higher clock speeds led to increased power dissipation. Moreover,
high-frequency micro-architectures are not suited to some of the low-power
techniques that have been developed to reduce power.
The steady advancements in device and fabrication technology have made it
possible to increase the transistor integration density [1]. Consequently, the
number of CPU cores that can be fabricated on a chip is increasing. Multi-core
chips or chip multiprocessors (CMPs) provide parallel processing capabilities by
running multiple threads at the same time, thereby reducing the processing time.
Most modern applications are multithreaded and perform better on a multi-core
chip than on a single-core chip. Multi-core chips can also be operated at a lower
frequency than a single-core chip, thus reducing the power dissipation in the
chip. Moreover, some of the cores in a multi-core chip can be turned off for
tasks that require less computational power, making such chips ideal for
low-power devices.

1.2 Heterogeneous Multi-Core chips

The decrease in transistor feature size has resulted in an increase in the
integration density on a chip. Future CMPs will have hundreds to thousands of
cores. To enable energy-efficient data-intensive computations, such chips will
comprise heterogeneous components like custom logic, CPUs, GPGPUs,
reconfigurable hardware and distributed memory units [2]. Custom logic, GPGPUs
and reconfigurable hardware would be used to speed up the parallelizable sections
of the application, while a conventional CPU would be used to execute the serial
section of the application [3].
Such a heterogeneous system will require a heterogeneous interconnection fabric in
which different channels have different bandwidths, to support non-uniform data
traffic. Such a heterogeneous interconnection fabric is envisioned to improve the
communication between different heterogeneous components by reducing the latency
and energy required for communication. Traditional planar dielectric interconnects
cannot deliver the dynamic bandwidth requirements of heterogeneous CMPs. The
ability of interconnects to provide dynamic bandwidth will determine the system
performance as the number of heterogeneous cores in a CMP increases. Hence new
interconnect options are being studied to support the increasing number of
heterogeneous cores in a multi-core chip.

1.2.1 Heterogeneous Bandwidth Requirement
As noted above, a heterogeneous multi-core system requires an interconnection
fabric in which different channels have different bandwidths, to support
non-uniform data traffic. It has been shown that traditional Networks-on-Chip
(NoCs) with heterogeneous resource allocation can improve the
performance-overhead trade-offs even with conventional metallic interconnects and
mesh-based topologies [31]. Different classes of applications are shown to
benefit from different types of interconnects in [32]. Some applications benefit
from low-latency interconnects while others benefit from high-bandwidth ones.
The heterogeneous bandwidth requirements in a future generation CMP can be
understood from studying GPU-memory interactions, which will make up parts of
its fabric. GPUs are highly bandwidth dependent [27], with drastic performance
losses when the GPU-memory bandwidth is low. We have evaluated the maximum
performance improvement that can be attained when very high bandwidth is
provided. Figure 1-1 shows the speedup when the bandwidth of the GPU-memory
interconnects was increased by varying the flit size from 32B to 1024B at 700MHz.
Despite the high-bandwidth links, most of the benchmarks show a very modest
performance improvement of less than 1%. On the other hand, a few of the
benchmarks show a considerable speedup of up to 63%. This indicates that
differentiated interconnect channels are required in the same NoC fabric to
harness their benefits for different types of workloads running in parallel on
heterogeneous multi-core chips.
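To give a feel for the bandwidth range covered by this flit-size sweep, the
following sketch computes the raw link bandwidth from the flit size and the clock
frequency. It assumes that exactly one flit crosses the link per clock cycle,
which is an illustrative assumption and not a detail stated above; the function
name is hypothetical.

```python
def raw_link_bandwidth_gbytes_per_s(flit_bytes: int, clock_mhz: float) -> float:
    """Raw bandwidth in GB/s, assuming one flit per clock cycle (assumption)."""
    bytes_per_second = flit_bytes * clock_mhz * 1e6
    return bytes_per_second / 1e9


# Flit sizes and clock frequency taken from the experiment described in the text.
baseline = raw_link_bandwidth_gbytes_per_s(32, 700)
widest = raw_link_bandwidth_gbytes_per_s(1024, 700)

print(f"32B flits:   {baseline:.1f} GB/s")
print(f"1024B flits: {widest:.1f} GB/s")
print(f"ratio:       {widest / baseline:.0f}x")
```

Under this assumption the link bandwidth grows by a factor of 32, while most
benchmarks gain less than 1%, which is the mismatch that motivates
differentiated channels.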

Figure 1-1: Speedup of 1024B flit size over baseline (32B flit size) with benchmarks
from CUDA SDK (upper case) [26] and Rodinia (lower case) [25] with number of kernel
launches in parenthesis

1.3 Interconnects in Multi-Core chips

Initially, CMPs used shared memory to communicate between the cores. However,
the shared memory model is not scalable. As the number of cores increases,
sophisticated interconnects are needed for communication between the cores.
Different interconnection architectures were developed for effective
communication between the cores, including bus-based, crossbar-based and
packet-based interconnect architectures. Intel and AMD use bus-based
interconnection architectures. QuickPath Interconnect from Intel is a
point-to-point interconnect that uses a 20-bit wide bus running at 3.2 GHz to
provide a uni-directional raw bandwidth of 12.8 GB/s for communication between
cores [4]. AMD’s HyperTransport 3.0 is a 32-bit wide bus running at 5.2 GHz [5].
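The QuickPath figure quoted above can be reproduced with a short calculation.
The sketch below is a hedged illustration: it assumes that the link transfers
data on both clock edges and that 16 of the 20 bits in each transfer carry
payload. Neither detail is stated in the text, so both are labelled as
assumptions in the code, and the function name is hypothetical.

```python
def payload_bandwidth_gbytes_per_s(clock_ghz: float,
                                   transfers_per_clock: int,
                                   payload_bits_per_transfer: int) -> float:
    """One-directional payload bandwidth in GB/s."""
    giga_transfers_per_s = clock_ghz * transfers_per_clock
    return giga_transfers_per_s * payload_bits_per_transfer / 8


qpi = payload_bandwidth_gbytes_per_s(
    clock_ghz=3.2,                 # from the text
    transfers_per_clock=2,         # assumption: double data rate
    payload_bits_per_transfer=16,  # assumption: 16 of the 20 bits are payload
)
print(f"QuickPath, one direction: {qpi:.1f} GB/s")
```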

1.4 Network on Chip (NoC) paradigm

With increase in the number of cores on a chip, global on-chip communication
will play an important role in overall performance of the chip. Using long electrical
wires for global communication is unreliable because of increased crosstalk and
noise sensitivity. Hence there is a need for an interconnection network for
communication between the cores.


Figure 1-2: Network-on-Chip architecture

A Network-on-Chip uses a modular approach for communication between the
cores. Each core or processing element is connected to a network router and the
network routers are interconnected using some network topology. SPIN, CLICHÉ or
Mesh, Torus, Folded Torus, Octagon and Butterfly Fat Tree (BFT) are some of the
network architectures [6]. Figure 1-2 depicts the CLICHÉ network architecture.
Various switching techniques are used to move data from source to destination:
circuit switching, packet switching and wormhole switching.
In circuit switching, a complete path needs to be set up between the source and
the destination before the actual communication begins. The physical links on the
path cannot be used by other traffic until the entire transmitted data reaches
the destination. The disadvantage of this kind of switching is that the path
setup is very slow, which delays the transfer of the message from source to
destination. Moreover, the channels between source and destination are not used
during idle periods, which leads to low channel utilization [7].
In packet switching, data is divided into fixed-length units called packets. Each
packet carries the routing information, so there is no need for path setup. The
network router routes each packet based on the routing information within the
packet. Since each packet needs extra bits to store the routing information, the
size of the buffers required at the switches is high. Moreover, since each packet
is routed individually, the packets may reach the destination out of order and
additional processing is required to put the message back together.
In wormhole switching, each packet is divided into fixed-length flow control
units (flits). The header flit has the routing information and is used to
establish a path from source to destination. The body flits follow the path
established by the header flit. Since the flits are small, less buffer space is
required at the intermediate switches. Since the header flit establishes a path
between source and destination, it can block other communications. This problem
is solved by introducing virtual channels (VCs) in the intermediate switches, as
shown in Figure 1-3.

Figure 1-3: NoC switch architecture
Each incoming port and outgoing port has multiple VCs to hold flits
belonging to different packets. Arbitration is performed for the use of the link
interconnecting the network routers. If all the VCs are full, the header flit is
dropped and the source has to retransmit the header flit. Figure 1-3 shows
the basic switch architecture.
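The wormhole and virtual channel behaviour described above can be sketched as
follows. This is a simplified illustration rather than the switch model used in
this work: a VC is treated as occupied for as long as a packet owns it, the last
body flit is zero-padded to keep flits at a fixed length, and all class and
function names are hypothetical.

```python
from typing import List, Optional


class Flit:
    """One flow control unit; only the header flit carries routing information."""

    def __init__(self, packet_id: int, kind: str, payload: bytes = b"",
                 destination: Optional[int] = None):
        self.packet_id = packet_id
        self.kind = kind                # "header" or "body"
        self.payload = payload
        self.destination = destination  # set on the header flit only


def packetize(packet_id: int, destination: int, data: bytes,
              flit_bytes: int) -> List[Flit]:
    """Split a packet into a header flit followed by fixed-length body flits."""
    flits = [Flit(packet_id, "header", destination=destination)]
    for start in range(0, len(data), flit_bytes):
        # Pad the final chunk so that every body flit has the same length.
        chunk = data[start:start + flit_bytes].ljust(flit_bytes, b"\x00")
        flits.append(Flit(packet_id, "body", chunk))
    return flits


class InputPort:
    """Input port with a fixed number of virtual channels (simplified model)."""

    def __init__(self, num_vcs: int):
        self.vc_owner: List[Optional[int]] = [None] * num_vcs

    def accept_header(self, flit: Flit) -> bool:
        """Reserve a free VC for the packet; False means the header is dropped."""
        for vc, owner in enumerate(self.vc_owner):
            if owner is None:
                self.vc_owner[vc] = flit.packet_id
                return True
        return False  # every VC is occupied: the source must retransmit

    def release(self, packet_id: int) -> None:
        """Free the VC once the last flit of the packet has left the port."""
        self.vc_owner = [None if o == packet_id else o for o in self.vc_owner]


flits = packetize(packet_id=1, destination=5, data=b"abcdefghij", flit_bytes=4)
port = InputPort(num_vcs=2)
results = [port.accept_header(Flit(pid, "header", destination=5))
           for pid in (1, 2, 3)]
print(len(flits), results)
```

With two VCs, the third header is refused until one of the earlier packets
releases its channel, which mirrors the drop-and-retransmit rule in the text.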

1.5 Emerging interconnects

Electrical wires are the most frequently used interconnects for the transfer of
data and control messages between the cores. However, as fabrication technologies
scale down, the dimensions of the electrical wires also scale down, resulting in
increased wire resistance. Moreover, for a shared-medium arbitrated bus, each
core connected to the shared bus increases the intrinsic parasitic capacitance.
The increased resistance and intrinsic parasitic capacitance of electrical wires
result in higher propagation delay and heat dissipation for global communication.
The bandwidth offered by a single electrical wire is also very limited, so
multiple wires are needed for communication between a pair of cores. Future chips
are expected to have hundreds to thousands of cores. Laying down parallel buses
as the number of cores increases would not be feasible, which limits the
scalability of the system. According to the International Technology Roadmap for
Semiconductors (ITRS), novel scalable interconnects will be needed to meet the
performance requirements of future chips.
Some of the emerging interconnect paradigms are three dimensional (3-D)
integration, wireless interconnects and photonic interconnects. These interconnect
technologies are envisioned to support communication on a multi-core chip.
3-D integrated circuits (ICs) use 3-D integration to vertically stack dies with
through silicon vias (TSVs) for inter-layer communication. Various architectural
designs for 3-D NoC are Symmetric NoC Router Design, 3-D NoC-Bus Hybrid Router
Design, True 3-D Router Design and Multi-layer 3-D NoC Router Design [8]. The
major advantage of 3-D ICs is a considerable decrease in the length and number of
global interconnects, resulting in an increase in performance and a decrease in
the power consumption and area of wire-limited circuits. Despite these
advantages, the major challenges in implementing a 3-D NoC are crosstalk and
noise analysis, thermal mitigation and interconnect modeling [9].


In a wireless NoC, global communication takes place using single-hop, long-range,
high-bandwidth and low-energy links operating in the millimeter (mm)-wave
frequency range [10]. Various architectures designed using traditional CMOS
technology have been proposed for reduced energy dissipation and latency. One
such architectural implementation has been shown in [10]. However, it has been
predicted that the bandwidth available using conventional CMOS-based RF
technology is not going to be sufficient for intra-chip communication [11].
Recent research has been directed towards carbon nanotubes (CNTs), since they
exhibit excellent emission and absorption characteristics leading to dipole-like
radiation behavior, making them promising for use as antennas in a wireless NoC
[11]. However, the failure rate in the fabrication of CNTs is much higher than
that of the CMOS process, and the electrical characteristics of CNTs are
difficult to control. This leads to link failures, negating the advantages of
wireless links.
Photonic interconnects carry optical data from source to destination.
The data from the processing element (PE) is sent to the modulator, which is
basically an electro-optic device. An off-chip laser source injects light of
various wavelengths into the optical waveguide. The modulator converts the
received data into optical data by modulating it onto one of the wavelengths of
the laser source. When the optical data reaches the detector, the detector
converts it back to electrical form, after which it is directed towards the PE.
The optical waveguide can carry multiple wavelengths from the laser source at
the same time. Dense Wavelength Division Multiplexing (DWDM) is used to improve
the data bandwidth. Light travels much faster in an optical interconnect than
signals do in planar dielectric interconnects. Hence optical interconnects
facilitate high-bandwidth, low-latency communication. Moreover, the energy
dissipation in photonic interconnects is lower than in planar dielectric
interconnects.

1.6 Thesis Contribution

In this thesis work, we propose a crossbar-based photonic NoC (PNoC)
architecture which dynamically allocates bandwidth between different
heterogeneous components like custom logic, CPUs, GPGPUs, reconfigurable
hardware and distributed memory units on a chip.
The following points summarize the contributions made during this work.
• Proposed Network Architecture
  A heterogeneous PNoC architecture with dynamic bandwidth allocation, called
  d-HetPNoC, is proposed.
• Experimental Evaluations
  o Develop a cycle-accurate simulator to implement the PNoC architecture with
    3-stage switches, namely input arbitration, output arbitration and routing.
  o Obtain experimental results comparing the proposed d-HetPNoC architecture
    with the crossbar-based Firefly architecture with respect to the following
    parameters using the cycle-accurate simulator:
    - Peak achievable bandwidth
    - Packet energy dissipation
    - Non-uniform traffic patterns
    - Area overheads
• Publication
  o Ankit Shah, Naseef Mansoor, Ben Johnstone, Amlan Ganguly, Sonia Lopez
    Alarcon. “Heterogeneous Photonic Network-on-chip with Dynamic
    Bandwidth Allocation,” System on Chip Conference (SoCC) 2014, Las
    Vegas, Nevada.


Chapter 2 Related Work
Recent advances in fabrication technology have made it possible to integrate
various photonic elements on multi-core chips. Consequently, various
high-performance and low-energy architectures have been proposed for the
implementation of PNoCs. The photonic elements and architectures used in this
study are described below.

2.1 Photonic Elements

The on-chip laser source, Micro Ring Resonators (MRRs) and the photonic
waveguide are the most important components of a PNoC. The on-chip laser source
provides the necessary multi-wavelength light. A PNoC uses Dense Wavelength
Division Multiplexing (DWDM) to increase the bandwidth of the optical links. A
Micro Ring Resonator is used to convert an electrical packet into an optical
packet on a particular wavelength. The photonic waveguide carries the photonic
packets between photonic routers. Multiple demodulators are used to filter out
the data packets modulated on specific wavelengths. Each of the photonic elements
is explained in more detail in the sections below.

2.1.1 Micro ring resonator
The micro ring resonators (MRRs) act as optical filters and can be made into
electro-optical modulators, lasers and detectors when carrier injection, optical
gain or absorption mechanisms are incorporated. A photonic network needs high
integration density and low power consumption, which makes the micro ring
resonator a popular choice because of its small size, high quality factor Q,
transparency to off-resonance light and lack of intrinsic reflection [12].
Silicon adiabatic micro ring resonators with a radius of 2 µm are shown in [13].
These micro ring resonators provide high integration density because of their
small radius. Moreover, these MRRs consume less power, as the power consumption
of the modulator is directly proportional to the circumference and inversely
proportional to the quality factor Q of the micro ring resonator. The total
bandwidth of a micro-ring based WDM modulation system is limited by the
free-spectral range (FSR), which is inversely proportional to the circumference
of the MRR. Because of their small circumference, these MRRs have an FSR of 6.92
THz, making it possible to fit in more wavelengths and thus increasing the total
aggregate data bandwidth [13].
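The inverse relation between FSR and circumference can be illustrated with the
standard ring-resonator expression FSR = c / (n_g · L), where L is the
circumference and n_g is the group index. The group index is not given in the
text; the sketch below back-computes an effective value from the quoted 2 µm
radius and 6.92 THz FSR purely for illustration, and the function names are
hypothetical.

```python
import math

SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def fsr_hz(radius_m: float, group_index: float) -> float:
    """Free-spectral range of a ring: FSR = c / (n_g * circumference)."""
    circumference_m = 2.0 * math.pi * radius_m
    return SPEED_OF_LIGHT_M_PER_S / (group_index * circumference_m)


def implied_group_index(radius_m: float, fsr: float) -> float:
    """Group index implied by a measured FSR (same relation, rearranged)."""
    return SPEED_OF_LIGHT_M_PER_S / (fsr * 2.0 * math.pi * radius_m)


# Radius and FSR quoted in the text; the group index is derived, not sourced.
n_g = implied_group_index(radius_m=2e-6, fsr=6.92e12)
print(f"implied group index: {n_g:.2f}")

# Doubling the radius doubles the circumference and halves the FSR.
for radius_um in (2, 4, 8):
    print(f"radius {radius_um} um -> FSR {fsr_hz(radius_um * 1e-6, n_g) / 1e12:.2f} THz")
```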
The MRRs can modulate the light signal from the laser source at a speed of 12.5
Gb/s. The adiabatic micro ring modulators are able to meet the requirements of
PNoC architectures by providing lower power consumption and lower resistance
than the older Mach-Zehnder modulator (MZM) [12]. The light from the laser
source consists of various wavelengths. Only the light whose frequency matches
the resonant frequency of the MRR is coupled into the MRR; light of other
frequencies is not affected. The resonant frequency of each MRR can be changed by
applying heat to it with a local heater. We assume a single heater element per
MRR in the PNoC to enable this thermal tuning. Hence each MRR can be tuned to a
different resonant frequency, thus utilizing all the frequencies from the laser
source and enabling WDM for higher aggregate data bandwidth.
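As a rough illustration of how WDM scales the aggregate bandwidth, the sketch
below multiplies the 12.5 Gb/s modulation rate by the total wavelength counts
that appear in the figure captions (64, 256 and 512). It assumes that every
wavelength is modulated at the full rate at the same time, which is an idealized
upper bound and not a measured result; the function name is hypothetical.

```python
MODULATION_RATE_GBPS = 12.5  # per-wavelength rate quoted in the text


def aggregate_bandwidth_gbps(num_wavelengths: int,
                             rate_gbps: float = MODULATION_RATE_GBPS) -> float:
    """Idealized aggregate bandwidth: all wavelengths busy at the full rate."""
    return num_wavelengths * rate_gbps


for wavelengths in (64, 256, 512):
    total = aggregate_bandwidth_gbps(wavelengths)
    print(f"{wavelengths:3d} wavelengths -> {total:.0f} Gb/s")
```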


2.1.2 Photo-detector
Demodulation is done using on-chip photo-detectors. When WDM is used, an MRR
is used in conjunction with the on-chip photo-detector. The MRR selects the
light whose wavelength matches the resonant wavelength of the MRR. The filtered
output of the MRR goes to a germanium (Ge) p-i-n photo-detector, which absorbs
the light and converts it into an electrical current. This current is amplified
and fed to a threshold device. If the amplified signal is greater than the
threshold, it is interpreted as a 1; otherwise it is interpreted as a 0.
Photo-detector parameters such as power consumption, bit rate and
photo-detection threshold play an important role in governing the efficiency of
a PNoC. Germanium photo-detectors of dimension 0.7 µm × 20 µm have been
demonstrated to operate at 40 Gbps [13]. Photo-detector responsivity as high as
1.08 A/W has been demonstrated [14].
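The detection step can be sketched numerically: the photocurrent is the
responsivity multiplied by the received optical power, and the amplified signal
is compared against a threshold. The responsivity is the value quoted above,
while the optical power levels, the gain and the threshold are hypothetical
values chosen only to make the example concrete.

```python
RESPONSIVITY_A_PER_W = 1.08  # value quoted in the text


def photocurrent_amps(optical_power_w: float,
                      responsivity: float = RESPONSIVITY_A_PER_W) -> float:
    """Photocurrent produced by the detector: I = R * P."""
    return responsivity * optical_power_w


def detect_bit(optical_power_w: float, threshold_amps: float,
               gain: float = 1.0) -> int:
    """Return 1 if the amplified photocurrent exceeds the threshold, else 0."""
    return 1 if gain * photocurrent_amps(optical_power_w) > threshold_amps else 0


# Hypothetical operating point: 10 uW for a 'one', 1 uW for a 'zero'.
THRESHOLD_AMPS = 5e-6
bright = detect_bit(10e-6, THRESHOLD_AMPS)
dark = detect_bit(1e-6, THRESHOLD_AMPS)
print(bright, dark)
```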

2.1.3 Photonic Switching Elements (PSEs)
PSEs are required in some PNoC architectures to turn light by 90°. PSEs are
made up of MRRs. The basic structure of a PSE is shown in Figure 2-1. When the
PSE is in the off state, the light passes through without making a turn, as shown
in Figure 2-1 (b). When the PSE is in the on state, the wavelength of light that
matches the resonant wavelength of the MRR is turned by 90°, as shown in Figure
2-1 (a). An example of a PNoC requiring PSEs is the 2-Dimensional Folded Torus
(2DFT) [15]. This PNoC uses an electronic network to carry the header flits and
the photonic network to carry the body flits. The header flit of the photonic
packet uses the electronic network to set up the path from source to destination
using dimension order routing. The header flit reserves each intermediate router
for the photonic flits that are supposed to follow. Since this PNoC architecture
uses dimension order routing, photonic flits may need to turn at an intermediate
router. This is done by means of a PSE. Implementation of a blocking router using
PSEs has been demonstrated in [15].
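Dimension order routing, as used for path setup above, can be sketched on a
plain 2-D grid. The example routes along X first and then along Y and reports
the router at which the path changes dimension, which is where a PSE would have
to be switched on. For simplicity it ignores the wrap-around links of a folded
torus; that simplification and the function names are assumptions of this
sketch.

```python
from typing import List, Optional, Tuple

Coord = Tuple[int, int]


def xy_route(src: Coord, dst: Coord) -> List[Coord]:
    """Routers visited when routing fully along X first, then along Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def turn_router(src: Coord, dst: Coord) -> Optional[Coord]:
    """Router where the path switches from X to Y, or None for a straight path."""
    if src[0] != dst[0] and src[1] != dst[1]:
        return (dst[0], src[1])
    return None


path = xy_route((0, 0), (2, 1))
print(path, turn_router((0, 0), (2, 1)))
```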

Figure 2-1: A basic photonic switch
Though a blocking router restricts the simultaneous flow of information in
multiple directions, constructing a non-blocking switch using PSEs requires a
highly complex structure. This has a negative impact on the area and, more
importantly, on the optical signal integrity, as each PSE hop introduces
additional loss and crosstalk. Therefore, the preferred design choice is a
blocking switch, because of its compactness, while bearing its blocking
properties in mind when designing the network topology and routing algorithm
[15].


2.1.4 Laser Source
A multi-wavelength laser source with a small area is required for a PNoC. There
are two types of multi-wavelength laser source: off-chip laser sources, like the
comb laser source, which are coupled to the chip using fiber optics, and on-chip
laser sources, like the single-wavelength distributed-feedback (DFB) laser array
[16]. It has been demonstrated in [16] that heterogeneously integrated on-chip
sources are preferred, as they are energy efficient and energy proportional and
improve overall system efficiency.

2.1.5 Optical waveguides and couplers
On-chip optical waveguides work on the same principle as the conventional
optical fibers that carry optical signals over long distances. An on-chip optical
waveguide consists of a core, which carries the light, and a cladding surrounding
the core. The core and cladding are made of materials with different refractive
indices. The refractive index of the cladding is significantly lower than that of
the core, causing total internal reflection of the light and confining the
optical signal within the core. The optical signal undergoes multiple reflections
inside the core while moving along the waveguide. In a PNoC, nanophotonic
waveguides in silicon on insulator (SOI), fabricated with deep ultraviolet (UV)
lithography, are used as the medium for carrying the optical packets [17]. For
low power consumption, the output from the laser diode should be efficiently
coupled into the silicon waveguide. Spot-size converters (SSCs) are used to
couple the light from the laser diode into the silicon waveguide. The
photo-detectors can be fabricated by bonding InGaAs/InP wafers directly to
silicon waveguides and forming metal-semiconductor-metal structures, giving
responsivities as high as 0.74 A/W [18]. A complete optical transmission link,
having a single silicon waveguide integrated with both laser diode and
photo-detector, is demonstrated in [19].

2.2 Existing PNoC Architectures

The existing PNoC from the literature that is used as the baseline in this study is the Firefly
architecture [20].

2.2.1 Firefly Architecture
Firefly architecture is a crossbar-based hybrid, hierarchical architecture. Four
processing elements share a router to form a local node. Local nodes within the
cluster are connected in the form of a concentrated mesh (CMESH). Communication
between the nodes within the cluster takes place using traditional electrical
networks, thus exploiting the benefits offered by electrical interconnects for short,
local communication. Inter-cluster communication takes place using nanophotonic
interconnects.
As shown in figure 2-2(a), each router is named CxRy, where ‘x’ denotes the
cluster number and ‘y’ stands for the assembly number. All the routers having the same ‘x’
value belong to the same cluster and hence communicate using electrical interconnects.
All the routers having the same ‘y’ value belong to an assembly and are
interconnected using crossbar-based nanophotonic interconnects as shown in
figure 2-2(b).


Figure 2-2: Firefly Architecture: (a) Cross-bar between clusters of same
assembly, (b) waveguide supporting inter-cluster crossbars. Reproduced
from [20]

Figure 2-3: Reservation-assisted Single Write Multiple Read. Adapted from
[20]


Firefly architecture uses reservation-assisted single write multiple read (R-SWMR) to implement the nanophotonic crossbar. There are two types of channels
for communication between the routers: the reservation channel and the data channel.
The reservation channel is used to establish a path between the source router and the
destination router. The data channel carries the actual data from the source router to the
destination router. Reservation channels carry the reservation flit, which contains
the source router ID, the destination router ID and the duration of communication. Once the
reservation flit reaches the destination, the photo-detectors on the data channel
corresponding to the source router are turned on to accept the data. A conceptual
diagram of R-SWMR is shown in figure 2-3.
The disadvantage of this architecture is its inability to dynamically assign
bandwidth between pairs of nodes in different clusters. Also, since all the modulators
and demodulators are on for any communication, this architecture is energy
inefficient.


Chapter 3 Dynamic Heterogeneous Photonic NOC (d-HetPNoC)
Bandwidth requirement between various cores in future multi-core chips will
dynamically vary depending on the mapped applications and nature of the cores. The
applications mapped on specific cores may change over time due to various reasons
such as start and end of a task or dynamic thermal management schemes such as
temperature-aware task allocation. This will result in dynamically varying traffic
patterns between the cores. In order to cater to such dynamically varying demands
of bandwidth between communicating pairs we propose a scheme to allocate
bandwidth-on-demand for photonic NoC architectures. Recent literature has
explored photonic NoCs from tile based architectures to crossbar based high radix
ones [15], [20] – [22]. It is argued in [23] that crossbar-based photonic NoC
architectures can scale better in terms of reliability and performance by using novel
photonic devices with crosstalk suppression. Hence we modify a crossbar based
baseline photonic NoC architecture to enable the dynamic bandwidth allocation.

3.1 Network Architecture

The crossbar architecture adopted is a Single Write Multiple Read (SWMR)
photonic crossbar. Cores are grouped in clusters and each cluster will have a data
channel consisting of multiple DWDM wavelengths to all other clusters. An energy-efficient variation of the SWMR crossbar has been demonstrated in [20], where a
reservation request containing the destination ID is broadcast on separate channels
from the source cluster to establish a path. This allows the destination to keep its
demodulators switched on only when it receives a packet rather than always,
thus saving energy. We propose to modify this baseline crossbar so that in addition
to establishing a path, a variable number of wavelengths are allocated to the channel
in proportion to the traffic requirement. This traffic requirement is determined by
the task running on the cores, which governs the frequency and volume of data
communication with other cores. Consequently, this bandwidth allocation happens
whenever there is a change in the task mapping on the chip and not on a per-packet
basis. Hence, the overheads associated with this scheme are greatly mitigated.

Figure 3-1: Dynamic bandwidth allocation enabled PNoC architecture.
In the proposed d-HetPNoC, we have considered a hierarchical, hybrid
crossbar configuration as in [20]. The whole CMP is divided into clusters of 4 cores.
These 4 cores are interconnected using traditional copper interconnects in an all-to-all manner, avoiding multi-hop paths within a cluster. As the cores in a cluster are
physically close, using wireline links can achieve reliable and fast communication.
This intra-cluster configuration is different from the concentrated Mesh in [20]. Each
cluster is equipped with a photonic router, which is interconnected using photonic
channels with all other photonic routers. This architecture is shown in figure 3-1.

3.2 Dynamic Bandwidth Allocation (DBA) Mechanism

DBA is made possible by assigning a variable number of wavelengths to the write
channels of the clusters. When there is a change in the task allocation on a core, the
network reconfigures itself and allocates necessary bandwidth to the cluster of the
core. The total aggregate data bandwidth depends on the total number of DWDM
wavelengths in all the data waveguides together. Since this aggregate bandwidth
budget has to be shared between all the clusters we propose a token-based
distributed mechanism to request and acquire wavelength channels in each photonic
router.

3.2.1 Token Passing based Channel Allocation
The DBA is achieved by using a token-based mechanism. This mechanism grants the
right to request bandwidth or wavelengths to one photonic cluster at a time to avoid
reusing already allocated wavelengths within a single waveguide. This token is
circulated between the photonic routers using a separate control waveguide with
maximum DWDM. The token consists of several bits, where each bit
denotes the status of a specific wavelength in a specific data waveguide, i.e., whether
it is currently allocated to any router or not. The size of the token in bits, NTW, is
equal to the total number of wavelengths that can be dynamically allocated and is
given by

$$N_{TW} = (\lambda_W \cdot N_W) - N_{\lambda R} \tag{1}$$

In (1), NW is the number of waveguides needed for data communication, λW is the
number of DWDM wavelengths that can be accommodated in a single waveguide and
NλR is sum total of the number of wavelengths reserved by each cluster as discussed
later in this section. When a cluster has the token, the photonic router can acquire
wavelengths and change their status based on its requirements.
If there is any change in the applications running on a particular core, it sends an
updated demand for bandwidth to the photonic router. This information is in the
form of a demand table, which contains the number of wavelengths required for
communication with all the other clusters. We have assumed that the core will
determine these numbers based on the traffic requirements of the current task with
all other clusters. The photonic router maintains 6 tables: a current table, a request table
and 4 demand tables from the 4 cores. The current table holds the current
bandwidth allocated to the cluster for communication with the other clusters. This
table is initialized to a certain predetermined minimum number such that each
cluster has a minimum bandwidth allocated to its write channel. This ensures that no
cluster starves even if all other clusters consume all the data bandwidth. This
minimum number can be determined based on the overall data bandwidth in the
PNoC and is at least 1 wavelength per cluster. The total number of such reserved
wavelengths for minimum bandwidth allocation is denoted by NλR in (1).

Each entry in the request table is the maximum of all the corresponding entries in
the demand tables. In this way, the entries in the request table always contain the
highest demanded bandwidths or number of wavelengths to the other clusters. Once
the photonic router acquires the token, it captures or relinquishes wavelengths based
on the request table and the number of currently acquired and available wavelengths.
The cluster aims to acquire the highest number of wavelengths among all the entries
in the request table, which corresponds to the maximum bandwidth that the cluster
will need for communication. Multiple wavelengths for a particular cluster could be
spread over multiple waveguides depending upon the availability of wavelengths. Once
the wavelengths are acquired or relinquished, the current table in the router is
updated to reflect the current allocated bandwidths to all other clusters. The router
also records the specific identifiers of all the wavelengths it has acquired. The
wavelength identifiers consist of the waveguide number and the wavelength number
within that waveguide. After this the token is modified to reflect the latest status of
the wavelengths and released to the next cluster.
Depending upon the availability of the wavelengths it may not be possible to satisfy
all the requests from all the clusters. Hence, the request table is not modified after
the wavelengths are allocated and the current table is updated. This will enable the
router to try to acquire additional wavelengths if necessary the next time the token
returns to the cluster. This scheme works even when the task allocation to specific
cores happens asynchronously with the circulation of the token, as the request table
can be updated even when the token is not present in the photonic router. The
micro-architecture of the photonic router is shown in figure 3-2.
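
The token-based allocation described above can be illustrated with a minimal Python sketch. It is not the thesis implementation: the names `PhotonicRouter`, `request_table` and `on_token` are hypothetical, the token is modelled as a plain list of booleans, and the reserved minimum is assumed to be 1 wavelength per cluster. The mapping of a token bit to a (waveguide, wavelength) identifier is left out.

```python
def request_table(demand_tables):
    """Element-wise maximum over the per-core demand tables."""
    return [max(entries) for entries in zip(*demand_tables)]


class PhotonicRouter:
    """Hypothetical model of one cluster's photonic router."""

    def __init__(self, num_clusters, reserved=1, cores=4):
        self.reserved = reserved  # assumed minimum allocation, not tracked by the token
        self.acquired = []        # token bit positions currently held
        # One demand table per core, one entry per destination cluster.
        self.demand_tables = [[0] * num_clusters for _ in range(cores)]

    def allocated(self):
        """Total wavelengths currently available to this cluster."""
        return self.reserved + len(self.acquired)

    def on_token(self, token):
        """Acquire or relinquish wavelengths, then pass the token on."""
        target = max(request_table(self.demand_tables))
        need = target - self.allocated()
        for i, busy in enumerate(token):
            if need <= 0:
                break
            if not busy:              # free wavelength: mark it and record it
                token[i] = True
                self.acquired.append(i)
                need -= 1
        while need < 0 and self.acquired:   # holding more than requested
            token[self.acquired.pop()] = False
            need += 1
        return token


token = [False] * 6                    # 6 dynamically allocatable wavelengths
a, b = PhotonicRouter(3), PhotonicRouter(3)
a.demand_tables[0] = [4, 1, 0]
a.demand_tables[3] = [1, 0, 3]
b.demand_tables[1] = [0, 8, 0]
token = b.on_token(a.on_token(token))  # first round of token circulation
first_round = (a.allocated(), b.allocated())
a.demand_tables[0] = [2, 1, 0]         # task change lowers the peak demand of a
token = b.on_token(a.on_token(token))  # second round
second_round = (a.allocated(), b.allocated())
```

Because the request table is left unchanged after a partial allocation, router `b` in this sketch picks up the wavelength released by `a` the next time the token returns, which mirrors the retry behaviour described in the text.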


There is an overhead associated with token passing. The time TL taken by a token to
traverse the link between two photonic routers is given by

$$T_L = \frac{N_{TW}}{\lambda_W \cdot B} \tag{2}$$

where B is the bandwidth per DWDM wavelength. The worst-case time required
by a particular photonic router to repossess the token is given by TL * NPR, where NPR
is the total number of photonic routers. As there is one photonic router per cluster of
4 cores, NPR is equal to (NC/4) where NC is the total number of cores on the chip. In
addition, there would be an overhead required by the photonic router to process
demands and update the request and current tables. However, since this will happen
only when there is a change in the task mapping on a core, the overheads will be
greatly mitigated if not completely amortized, as these changes will happen at a
slower rate by several orders compared to packet transfer. The transmission of
demand tables and computation of the request tables can happen while the router is
waiting to capture the token resulting in complete masking of the overhead. The
updating of the request table and current table are also disjoint from the path of data
flow within the router thus eliminating its impact on the data latency.
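
As a rough worked example of equations (1) and (2), the following Python sketch evaluates the token size and token-passing latency. The parameter values are assumptions chosen for illustration: a single data waveguide with 64 wavelengths, 1 reserved wavelength for each of 16 clusters, 12.5 Gbps per wavelength, and a 64-core chip. The function names are hypothetical.

```python
def token_size_bits(wl_per_waveguide, num_waveguides, reserved_total):
    """Equation (1): N_TW = (lambda_W * N_W) - N_lambdaR."""
    return wl_per_waveguide * num_waveguides - reserved_total


def token_link_time_s(token_bits, wl_per_waveguide, bw_per_wl_bps):
    """Equation (2): T_L = N_TW / (lambda_W * B)."""
    return token_bits / (wl_per_waveguide * bw_per_wl_bps)


def worst_case_repossess_time_s(link_time_s, num_cores, cores_per_cluster=4):
    """Worst case T_L * N_PR, with one photonic router per cluster."""
    num_routers = num_cores // cores_per_cluster
    return link_time_s * num_routers


# Assumed configuration: 64 wavelengths, 1 waveguide, 16 reserved wavelengths.
n_tw = token_size_bits(64, 1, 16)
t_l = token_link_time_s(n_tw, 64, 12.5e9)
t_worst = worst_case_repossess_time_s(t_l, 64)
print(f"N_TW = {n_tw} bits, T_L = {t_l * 1e12:.0f} ps, worst case = {t_worst * 1e12:.0f} ps")
```

Under these assumptions the token is 48 bits long, takes about 60 ps per link, and returns to a given router within roughly 960 ps in the worst case.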

3.3 Flow Control and Routing

The routing and flow control is achieved by using the reservation channel
assisted SWMR channels as proposed in [20]. Intra-cluster communication happens
through the electronic links between the cores or from cores to the photonic router.
Inter-cluster communication utilizes the photonic channels between the photonic
routers.


3.3.1 Photonic Flow Control
Whenever a cluster needs to communicate a packet to another cluster it
broadcasts a reservation flit over its reservation channel for establishing a
connection between source and destination. The reservation flit contains the ID of
the destination, and the wavelength identifiers that are to be used for this pair from
the current table at the source. The specific wavelengths are chosen among the
allocated ones for the cluster based on the corresponding entry in the demand table
for the destination. Upon receiving the wavelength identifiers, the destination cluster
switches on the demodulators on those specific wavelengths only for the duration of
a packet. This results in energy savings compared to Firefly where all the
wavelengths are turned on for all transmissions irrespective of the required data
rate. It will be shown in section 4.1 that the timing requirement of piggybacking the
wavelength identifiers with the reservation flit in d-HetPNoC results in no additional
timing overheads.

3.3.2 Photonic Router Architecture
All the intra-cluster routers are electronic and are responsible for packet
transfer between cores within a particular cluster. These are 3-stage routers with
input arbitration, routing/crossbar and output arbitration adopted from [24]. The
photonic router in each cluster has a micro-architecture similar to that of the electronic
routers. It has 4 electronic links to the 4 switches in its cluster and photonic
channels to other clusters. The photonic router with DBA is schematically shown in
figure 3-2.


Figure 3-2: Microarchitecture of photonic router.

Using this fabric of hybrid and hierarchical photonic crossbar based NoC
architecture with non-uniform DBA, we improve performance of the CMPs, which
are designed for applications with heterogeneous and dynamically varying traffic
patterns. In the next section, we present experimental evaluation of the proposed
architecture.


3.4 Experimental Results

In this section, we evaluate the performance and energy efficiency of the proposed
d-HetPNoC architecture and compare it with the baseline crossbar-based Firefly
architecture. Traffic patterns that require uniform bandwidth as well as highly
unbalanced bandwidths are used to evaluate these architectures.

3.4.1 Performance Evaluation of the d-HetPNoC
Applications mapped on the cores can demand high or low bandwidths with
other clusters. For our experiments we have considered 3 sets of 4 different
bandwidths for the photonic channels. Different sets of bandwidths used for
communication between different cores are shown in table 3-1.
| Bandwidth (BW) Set | Bandwidth (Gbps) | | | |
|---|---|---|---|---|
| BW Set 1 (Total Wavelengths = 64) | 12.5 | 25 | 50 | 100 |
| BW Set 2 (Total Wavelengths = 256) | 50 | 100 | 200 | 400 |
| BW Set 3 (Total Wavelengths = 512) | 100 | 200 | 400 | 800 |

Table 3-1: Frequency of communication for applications with different
bandwidth for skewed traffic scenarios

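| Traffic | Frequency of 100 Gbps application | Frequency of 50 Gbps application | Frequency of 25 Gbps application | Frequency of 12.5 Gbps application |
|---|---|---|---|---|
| Skewed1 | 50% | 25% | 12.5% | 12.5% |
| Skewed2 | 75% | 12.5% | 6.25% | 6.25% |
| Skewed3 | 90% | 5% | 2.5% | 2.5% |

Table 3-2: Frequency of communication for applications with bandwidth
set 1 for skewed traffic scenarios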


Electro-optic modulators and demodulators operating at 12.5Gbps on a single
wavelength carrier channel have been demonstrated [28]. Hence, the minimum
channel bandwidth we have considered is 12.5Gbps, which can be realized with a
single wavelength. Higher speed channels can be achieved by using higher number
of wavelengths. The number of wavelengths required by an application running on a
core is given by dividing the required bandwidth by minimum channel bandwidth.
The values shown in table 3-2 represent the actual memory-interaction bandwidths
required by various processing cores (e.g. CPU, GPGPU, custom logic etc.) [2].
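
The wavelength count per application can be sketched in Python as follows. The bandwidth values are taken from table 3-1 and the 12.5 Gbps minimum channel bandwidth from the text; rounding up for bandwidths that are not exact multiples is an assumption, and the names `wavelengths_needed` and `wavelength_demands` are hypothetical.

```python
import math

MIN_CHANNEL_GBPS = 12.5  # bandwidth of a single wavelength, as stated in the text

# Bandwidth sets from table 3-1 (Gbps).
BW_SETS_GBPS = {
    "BW Set 1": [12.5, 25, 50, 100],
    "BW Set 2": [50, 100, 200, 400],
    "BW Set 3": [100, 200, 400, 800],
}


def wavelengths_needed(bandwidth_gbps):
    """Required bandwidth divided by the minimum channel bandwidth.

    Rounding up is an assumption for bandwidths that are not exact multiples.
    """
    return math.ceil(bandwidth_gbps / MIN_CHANNEL_GBPS)


wavelength_demands = {
    name: [wavelengths_needed(bw) for bw in bandwidths]
    for name, bandwidths in BW_SETS_GBPS.items()
}

for name, demands in wavelength_demands.items():
    print(name, demands)
```

The largest demand in each set (8, 32 and 64 wavelengths) corresponds to the maximum channel widths listed for the d-HetPNoC in table 3-3.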
We experimented by applying the skewed traffic patterns in table 3-2 to the
different bandwidth sets in table 3-1 to study its effect on throughput and energy
per message (EPM). Traffic patterns with increasing skew demand a higher
frequency of communication for high bandwidth applications than for the low
bandwidth ones. We also evaluate the DBA-enabled d-HetPNoC with a uniform-random traffic pattern where all communication requires the same uniform
bandwidth and all cores communicate with all other cores at an equal data rate. The
performance and energy consumption of the d-HetPNoC is compared with that of
the baseline Firefly architecture to demonstrate the advantages over a uniform
bandwidth allocation.
The NoC architectures are characterized using a cycle accurate simulator that
models the progress of the data flits accurately per clock cycle accounting for those
flits that reach the destination as well as those that are dropped. The simulation
parameters are listed in table 3-3.


| Parameter | Value |
|---|---|
| System Size | Number of cores, 64; number of clusters, 16; cluster size, 4 cores |
| Die Area | 20 mm * 20 mm |
| Clock Frequency | 2.5 GHz |
| Simulation Cycle | 10000 with 1000 reset cycles |
| Packet Property, BW Set 1 | Packet Size 64 flits, Flit Size 32 bits |
| Packet Property, BW Set 2 | Packet Size 16 flits, Flit Size 128 bits |
| Packet Property, BW Set 3 | Packet Size 8 flits, Flit Size 256 bits |
| Router Memory | VC per port, 16; Buffer Depth per VC, 64 flits |
| Switching | Wormhole based packet switching |
| Photonic Data and Bandwidth, BW Set 1 | Firefly PNoC, 4 wavelengths per channel * 16 channels; d-HetPNoC, maximum channel bandwidth of 8 channels |
| Photonic Data and Bandwidth, BW Set 2 | Firefly PNoC, 16 wavelengths per channel * 16 channels; d-HetPNoC, maximum channel bandwidth of 32 channels |
| Photonic Data and Bandwidth, BW Set 3 | Firefly PNoC, 32 wavelengths per channel * 16 channels; d-HetPNoC, maximum channel bandwidth of 64 channels |

Table 3-3: Simulation Parameters
The network switches are synthesized from an RTL-level design using 65nm
standard cell libraries from [29], using Synopsys. The delays and energy dissipation
on the wired links were obtained through Cadence simulations taking into account
the specific lengths of each link based on the established connections following the
topology of the NoCs.
| Component | Power/Energy |
|---|---|
| Modulator/Demodulator | 40 fJ/bit [28] |
| Tuning | 2.4 mW/nm [28] |
| Laser Source | 1.5 mW/wavelength [30] |

Table 3-4: Power or Energy dissipation of photonic components

The power dissipation of the photonic components such as modulators,
demodulators and laser sources is shown in table 3-4. Power dissipation is
explained in detail in section 3.4.1.2. The maximum number of wavelengths that can
be accommodated in a single waveguide is considered to be 64 as in [20].

3.4.1.1 Peak Bandwidth

Peak bandwidth is measured as average number of bits successfully arriving at all
cores per second.


Figure 3-3: Peak Bandwidth of Firefly PNoC and d-HetPNoC for uniform-random and skewed traffic patterns for (a) Bandwidth set 1 (Total
Wavelengths = 64) (b) Bandwidth set 2 (Total Wavelengths = 256) (c)
Bandwidth set 3 (Total Wavelengths = 512).

The peak bandwidth of both Firefly and d-HetPNoC architectures for different
bandwidth sets and different traffic patterns is shown in Figure 3-3. As can be seen,
with uniform traffic the d-HetPNoC and the baseline crossbar-based Firefly
performs similarly for all three bandwidth sets as both architectures provide the
exact same bandwidth between all pairs of clusters. This is because in a uniform-random traffic pattern all communication channels require the same bandwidth, resulting in
the same configuration for both Firefly and the d-HetPNoC. This equality is despite
the fact that the d-HetPNoC has to send some additional information regarding
which wavelengths to use to the destinations in the reservation flit along with the
destination ID and packet size. The size of each wavelength identifier is 6 bits, which
denote the binary encoded wavelength number (out of 64 per waveguide). For BW
set 1, which is the best case, a waveguide number is not needed, as a single
waveguide is sufficient to accommodate all 64 wavelengths for the data channels
used in our experiments. Since a cluster may need a maximum of 8 wavelength
identifiers to be sent to the destination, it will take 60ps using a single waveguide
(64 wavelengths providing 800Gbps) to send the reservation flit. Consequently, this
information can be sent in a single clock cycle (400ps) along with the rest of the
reservation flit as in Firefly, requiring no additional timing overhead. For BW set 3,
which is the worst case, 8 waveguides are needed to accommodate 512
wavelengths. Consequently, 3 bits (log2 8) would be required for the waveguide number.
These 3 bits denote the waveguide number in binary format. Since a cluster may
need a maximum of 64 wavelength identifiers to be sent to the destination, it will
take 720ps using a single waveguide (64 wavelengths providing 800Gbps) to send
the reservation flit. This information can be sent in two clock cycles along with the
rest of the reservation flit, resulting in a slight additional timing overhead.
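
The timing figures quoted above can be reproduced with a short Python sketch. It assumes, as in the text, 64 wavelengths per waveguide at 12.5 Gbps each (800 Gbps in total) and a 400 ps clock cycle; the function names are hypothetical, and only the wavelength identifiers are counted, not the remaining fields of the reservation flit.

```python
import math


def identifier_time_ps(num_ids, num_waveguides, wl_per_waveguide=64, gbps_per_wl=12.5):
    """Time to send `num_ids` wavelength identifiers over one reservation waveguide."""
    wl_bits = (wl_per_waveguide - 1).bit_length()  # 6 bits for 64 wavelengths
    wg_bits = (num_waveguides - 1).bit_length()    # 0 bits when there is one waveguide
    total_bits = num_ids * (wl_bits + wg_bits)
    link_gbps = wl_per_waveguide * gbps_per_wl     # 800 Gbps for the defaults
    return total_bits * 1000 / link_gbps           # bits / Gbps = ns, converted to ps


def clock_cycles(time_ps, cycle_ps=400):
    """Number of whole clock cycles needed to cover `time_ps`."""
    return math.ceil(time_ps / cycle_ps)


best_case_ps = identifier_time_ps(num_ids=8, num_waveguides=1)    # BW set 1
worst_case_ps = identifier_time_ps(num_ids=64, num_waveguides=8)  # BW set 3
print(best_case_ps, clock_cycles(best_case_ps))
print(worst_case_ps, clock_cycles(worst_case_ps))
```

With these assumptions the best case needs 48 bits (60 ps, one cycle) and the worst case needs 576 bits (720 ps, two cycles), in agreement with the values in the text.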
As the skew in traffic increases, the communication between the high bandwidth
applications increases. In Firefly architecture, the uniformly assigned bandwidth is
insufficient for the high bandwidth applications. This insufficient bandwidth causes
the packets from this frequently communicating high bandwidth application to wait
longer in the photonic routers. This in turn, congests the photonic routers resulting
in a degraded performance. Conversely, the d-HetPNoC provides sufficient
bandwidth to these high bandwidth applications, reducing the waiting time for the
packets in the photonic routers. Hence even with an increase in the frequency of
communication between the high bandwidth applications the photonic routers do
not suffer from congestion as much as in the case of Firefly. Consequently, the d-HetPNoC architecture performs better than the Firefly architecture with an
increased skew in the traffic.
Percent increase in peak bandwidth of d-HetPNoC architecture as compared to
Firefly architecture goes from as low as 0.1% in case of Uniform Random traffic to as
high as 7% in case of skewed traffic pattern.

3.4.1.2 Packet Energy

There are several components of packet energy dissipation as data is transferred over
the PNoC fabrics. The energy dissipated in a PNoC is given by equation (3),

$$E_{packet} = E_{electrical} + E_{photonic} \tag{3}$$

Energy dissipated by the photonic components is given by equation (4),

$$E_{photonic} = E_{launch} + E_{modulation} + E_{tuning} + E_{buffer} \tag{4}$$

where Elaunch, Emodulation, Etuning, and Ebuffer are the energy dissipated at launching
photonic signals from light source, modulation/demodulation, tuning of MRR, and
storing in buffer respectively. The energy dissipation per bit for various components
of a PNoC is given in table 3-5.
| Component | Energy in picojoule (pJ)/bit |
|---|---|
| Emodulation | 0.04 |
| Etuning | 0.24 |
| Elaunch | 0.15 |
| Ebuffer | 0.0781250 |
| Erouter | 0.625 |

Table 3-5: Energy of different photonic components
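
As an illustration of equation (4), the Python sketch below sums the per-bit photonic components from table 3-5 and scales them to one packet. It is a simplification under stated assumptions: every bit is assumed to incur each component exactly once, congestion and retransmission are ignored, the electrical term of equation (3) is not modelled, and the packet size (64 flits of 32 bits) is the BW Set 1 value from table 3-3. The names used are hypothetical.

```python
# Per-bit photonic energy components from table 3-5, in pJ/bit.
PHOTONIC_PJ_PER_BIT = {
    "launch": 0.15,
    "modulation": 0.04,
    "tuning": 0.24,
    "buffer": 0.078125,
}


def photonic_energy_per_bit_pj():
    """Equation (4) evaluated per bit: launch + modulation + tuning + buffer."""
    return sum(PHOTONIC_PJ_PER_BIT.values())


def photonic_packet_energy_pj(flits, flit_bits):
    """Photonic energy of one packet, assuming each bit pays every component once."""
    return flits * flit_bits * photonic_energy_per_bit_pj()


per_bit = photonic_energy_per_bit_pj()
per_packet = photonic_packet_energy_pj(flits=64, flit_bits=32)
print(f"{per_bit:.6f} pJ/bit, {per_packet:.2f} pJ per 2048-bit packet")
```

This gives a lower bound of the photonic contribution only; the packet energies reported in the figures are measured at network saturation and include the effect of waiting in the routers.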


Figure 3-4: Packet Energy of Firefly PNoC and d-HetPNoC for uniform-random and skewed traffic patterns for (a) BW Set 1 (Total Wavelengths =
64) (b) BW set 2 (Total Wavelengths = 256) (c) BW Set 3 (Total
Wavelengths = 512)

The packet energy of the Firefly and the d-HetPNoC architectures is shown in Figure
3-4. Packet energy is the energy dissipated in transferring one packet completely
from source to destination at network saturation. The Firefly architecture has the
same packet energy as the d-HetPNoC for the uniform-random traffic, as
they are practically the same architecture in this case. However, with increased
skew in traffic the packet energy of Firefly increases as the congestion in the photonic
routers increases. In contrast, for the d-HetPNoC the packet energy increases less
with increase in skew of the traffic because of more efficient utilization of the
available bandwidth. In the next subsection, we evaluate our proposed d-HetPNoC
with specific case studies.


3.4.2 Case Studies with Synthetic and Real Application based Traffic Patterns
In this section, we present case studies for both the architectures with synthetic
and real application based traffic patterns. For the synthetic traffic patterns we
considered hotspot traffic coupled with the skewed communication pattern. In this
case, a core is determined to be the hotspot core and all cores send a certain
percentage of all traffic to the hotspot. The rest of the traffic is distributed following
the skewed traffic types outlined in table 3-1. For our case study, the skewed
hotspot1 and skewed hotspot2 traffic patterns generate 10% of the total traffic to
the hotspot core and the remaining 90% utilizes the skewed 2 and skewed 3 traffic
patterns mentioned in table 3-1 respectively. The skewed hotspot3 and skewed
hotspot4 patterns send 20% of traffic to the hotspot, coupled with the skewed 2 and
skewed 3 traffic patterns respectively. These patterns capture both the high
frequency communication with some central authority in the CMP, like a scheduler
or controller, via the hotspot pattern, as well as skewed core-to-memory interactions.
For the real application based traffic, parallel GPU applications like MUM,
BFS, CP, RAY and LPS [26] are mapped to 20, 4, 4, 4 and 16 cores respectively. These
cores are considered to be GPUs occupying 12 clusters. Remaining 4 clusters are
considered to have memory cores, which contain the data for the applications
mapped to the GPU cores. Then the bandwidth requirement is determined using
actual core to memory interaction from profiling these applications in GPGPUSim
[27], using GPU-memory bandwidth of 128B flit-size at 700MHz. These particular
benchmarks are chosen as BFS and MUM show significant speedup with increase in
GPU-memory bandwidth, while the others do not. Hence, this combination
represents an actual multi-core chip running multiple parallel applications. The
peak bandwidth and packet energy values for these traffic patterns are shown in
Figure 3-5. In all the cases the peak bandwidth of the d-HetPNoC is better than the
Firefly architecture. This is because of the insufficient bandwidth allocation in the
Firefly architecture for the high bandwidth communications. However, the
degradation in energy and bandwidth is less for the d-HetPNoC as it can allocate
high bandwidth to communication channels that need it unlike the baseline Firefly.
The same trend is observed regardless of the actual percentage traffic with the
hotspot.

Figure 3-5: Peak Core Bandwidth and Packet Energy for Firefly PNoC for
synthetic and real application based traffic scenarios

In case of the real application based traffic, the interaction between the memory
clusters and some of the core clusters require higher bandwidth. This results in a
lower peak bandwidth for Firefly compared to the d-HetPNoC as it cannot provide
the high bandwidth to the clusters that need it.

3.4.3 Area Overhead
The dynamic PNoC gives the flexibility to dynamically allocate bandwidth for data
communications. However, it incurs an overhead in terms of photonic waveguides
and electro-optic devices to enable the dynamic allocation scheme.
Dynamic bandwidth allocation for data communication can be achieved by assigning
different numbers of wavelengths between any pair of photonic routers. Let λN be
the total number of wavelengths required to support all data communications between
photonic routers. For the dynamic PNoC, the number of waveguides, NWD, is proportional to
the total bandwidth requirement and is given by ⌈λN / λW⌉, where the ceiling function ⌈·⌉ gives the
next higher integer value in case the division results in a non-integer number.
The total number of modulators, TMD, for the dynamic PNoC is the sum total of the
modulators required for the data waveguide(s), NMDD, the reservation waveguide(s),
NMRD, and the control waveguide, NMCD. It is given by the formula below:

$$T_{MD} = N_{MDD} + N_{MRD} + N_{MCD} \tag{5}$$

Since each photonic router needs to have the capability to modulate on any
wavelength in any waveguide, NMDD is given as the product of the number of photonic
routers NPR, the maximum number of wavelengths in a waveguide and the number of
waveguides.

$$N_{MDD} = N_{PR} \cdot \lambda_W \cdot N_{WD} \tag{6}$$

In the reservation waveguides, each photonic router writes to a dedicated waveguide.
Hence NMRD is given as the product of the number of photonic routers and the maximum
number of channels per waveguide.

$$N_{MRD} = N_{PR} \cdot \lambda_W \tag{7}$$

In the control waveguide, each photonic router, on receiving the token, can write to the
control waveguide using all the channels in the waveguide. NMCD is given as the product
of the number of photonic routers and the maximum number of channels per waveguide in
the control waveguide.

$$N_{MCD} = N_{PR} \cdot \lambda_W \tag{8}$$

Putting the values of equations 6, 7 and 8 in equation 5 we get:

$$T_{MD} = N_{PR} \cdot \lambda_W \cdot N_{WD} + 2 \cdot N_{PR} \cdot \lambda_W \tag{9}$$

Each photonic router in the Firefly architecture writes on its dedicated waveguide for
data communication. Hence the number of waveguides, NWF, is equal to the number of
photonic routers. The number of wavelengths required per waveguide, λNF, for
achieving the same total bandwidth as the dynamic PNoC is given by (λN / NWF). The total
number of modulators, TMF, for the Firefly architecture is the sum total of the
modulators required for the data waveguide(s), NMDF, and the reservation waveguide(s),
NMRF. It is given by the formula below:

$$T_{MF} = N_{MDF} + N_{MRF} \tag{10}$$

In the data waveguides, each photonic router writes to a dedicated waveguide on λNF
channels. Hence NMDF is given as the product of the number of photonic routers NPR and λNF.

$$N_{MDF} = N_{PR} \cdot \lambda_{NF} \tag{11}$$

In the reservation waveguides, each photonic router writes on all channels in its
dedicated waveguide. Hence NMRF is given as the product of the number of photonic routers
and the maximum number of channels per waveguide.

$$N_{MRF} = N_{PR} \cdot \lambda_W \tag{12}$$

Putting the values of equations 11 and 12 in equation 10 we get:

$$T_{MF} = N_{PR} \cdot \lambda_{NF} + N_{PR} \cdot \lambda_W \tag{13}$$

For the dynamic PNoC, the total number of detectors, TDMD, is the sum
total of the detectors required for the data waveguide(s), NDMDD, the reservation
waveguide(s), NDMRD, and the control waveguide, NDMCD. It is given by the formula below:

$$T_{DMD} = N_{DMDD} + N_{DMRD} + N_{DMCD} \tag{14}$$

In the data waveguides, since each photonic router needs to have the capability to
receive on any wavelength in any waveguide, NDMDD is given as the product of the number
of photonic routers NPR, the maximum number of channels per waveguide
and the number of waveguides.

$$N_{DMDD} = N_{PR} \cdot \lambda_W \cdot N_{WD} \tag{15}$$

In the reservation waveguides, each photonic router reads from all waveguides except
the one to which it writes. Hence NDMRD is given as:

$$N_{DMRD} = N_{PR} \cdot \lambda_W \cdot (N_{PR} - 1) \tag{16}$$

In the control waveguide, each photonic router can receive on all the channels in the
waveguide. NDMCD is given as the product of the number of photonic routers and the maximum
number of channels/wavelengths in the control waveguide.

$$N_{DMCD} = N_{PR} \cdot \lambda_W = N_{PR} \cdot 64 \tag{17}$$

Putting the values of equations 15, 16 and 17 in equation 14 we get:

$$T_{DMD} = N_{PR} \cdot \lambda_W \cdot N_{WD} + N_{PR} \cdot \lambda_W \cdot (N_{PR} - 1) + N_{PR} \cdot \lambda_W \tag{18}$$

The total number of detectors, TDMF, for the Firefly architecture is the sum total of the
detectors required for the data waveguide(s), NDMDF, and the reservation
waveguide(s), NDMRF. It is given by the formula below:

$$T_{DMF} = N_{DMDF} + N_{DMRF} \tag{19}$$

Each photonic router in the Firefly architecture reads on λNF channels in all data
waveguides except its own write waveguide. Hence NDMDF is given as:

$$N_{DMDF} = N_{PR} \cdot \lambda_{NF} \cdot (N_{PR} - 1) \tag{20}$$

In the reservation waveguides, each photonic router reads from all waveguides except
the one to which it writes. Hence NDMRF is given as:

$$N_{DMRF} = N_{PR} \cdot \lambda_W \cdot (N_{PR} - 1) \tag{21}$$

Putting the values of equations 20 and 21 in equation 19 we get:

$$T_{DMF} = N_{PR} \cdot \lambda_{NF} \cdot (N_{PR} - 1) + N_{PR} \cdot \lambda_W \cdot (N_{PR} - 1) \tag{22}$$

We consider MRRs having a radius of 5 µm [28]. Hence the total area, AD, required by
electro-optic devices in the dynamic PNoC is given by:

$$A_D = (T_{MD} + T_{DMD}) \cdot \pi \cdot (5\,\mu\mathrm{m})^2 \tag{23}$$

The total area, AF, required by electro-optic devices in the Firefly architecture is given by:

$$A_F = (T_{MF} + T_{DMF}) \cdot \pi \cdot (5\,\mu\mathrm{m})^2 \tag{24}$$
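
The area model of equations 5 to 24 can be evaluated with a short Python sketch. The function and parameter names are hypothetical; the formulas follow equations 9, 13, 18, 22, 23 and 24, with 64 wavelengths per waveguide and an MRR radius of 5 µm as stated in the text.

```python
import math

MRR_RADIUS_UM = 5.0


def dynamic_device_count(n_pr, total_wl, wl_per_wg=64):
    """Modulators plus detectors for the dynamic PNoC (equations 9 and 18)."""
    n_wd = math.ceil(total_wl / wl_per_wg)                 # data waveguides
    modulators = n_pr * wl_per_wg * n_wd + 2 * n_pr * wl_per_wg
    detectors = (n_pr * wl_per_wg * n_wd
                 + n_pr * wl_per_wg * (n_pr - 1)
                 + n_pr * wl_per_wg)
    return modulators + detectors


def firefly_device_count(n_pr, total_wl, wl_per_wg=64):
    """Modulators plus detectors for Firefly (equations 13 and 22)."""
    wl_nf = total_wl / n_pr                                # wavelengths per data waveguide
    modulators = n_pr * wl_nf + n_pr * wl_per_wg
    detectors = n_pr * wl_nf * (n_pr - 1) + n_pr * wl_per_wg * (n_pr - 1)
    return modulators + detectors


def device_area_mm2(device_count, radius_um=MRR_RADIUS_UM):
    """Equations 23 and 24: count * pi * r^2, converted from um^2 to mm^2."""
    return device_count * math.pi * radius_um ** 2 / 1e6


# 64-core, 16-cluster system with 64 data wavelengths.
area_dynamic = device_area_mm2(dynamic_device_count(n_pr=16, total_wl=64))
area_firefly = device_area_mm2(firefly_device_count(n_pr=16, total_wl=64))
print(f"d-HetPNoC: {area_dynamic:.3f} mm^2, Firefly: {area_firefly:.3f} mm^2")
```

For the 64-core, 16-cluster configuration with 64 data wavelengths, this evaluates to about 1.608 mm² for the d-HetPNoC and 1.367 mm² for Firefly.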

We consider a 64-core, 16-cluster system to study the increase in area with increase
in total bandwidth requirement. The total modulator/demodulator areas for d-HetPNoC and Firefly are 1.608 mm² and 1.367 mm² respectively for the
configuration with 64 data wavelengths studied in this work. Summing equation 9
and equation 18 gives the total number of modulators and demodulators
needed by the d-HetPNoC. From Figure 3-6 we observe that when the aggregate
data bandwidth or total number of wavelengths for data communication is low, the
area overhead is minimal. When the bandwidth requirement is small, the number of
data waveguides, NWD, is small. The dynamic PNoC provides the feature of dynamically
allocating the wavelengths. Hence it needs to provide the flexibility to write to any
wavelength within any waveguide. With fewer waveguides, the photonic
router needs the capability to write to fewer waveguides and hence the
hardware overhead is less. As the total bandwidth requirement increases, NWD
increases. Since the photonic router may need to write to any WDM channel within
any waveguide depending on the dynamic wavelength allocation, the number of
modulators needed to support data communication also increases.

Figure 3-6: Comparison of total area of d-HetPNoC and Firefly
architecture with increase in total bandwidth requirement
This is the major factor for area overhead in d-HetPNoC. It can be seen from
equation 6 that there is a linear relationship between the modulators needed for
data communication in d-HetPNoC and the total bandwidth requirement. The other,
although less significant, factor for area overhead is the use of a dedicated control
waveguide for circulating the token between photonic routers, allowing them to
dynamically use the available wavelengths for data communication. This factor
remains constant and is independent of the aggregate data bandwidth requirement.
While the total aggregate data bandwidth remains the same between the crossbar-based
Firefly and the d-HetPNoC, the total area dedicated to the modulators and
demodulators is higher in d-HetPNoC due to the flexibility of all clusters being able to
write to any wavelength in the data waveguides.

Figure 3-7: Comparison of (a) Peak Core Bandwidth (b) Energy Per
Message of d-HetPNoC for BW Set 1 (Total Wavelengths = 64), BW Set 2
(Total Wavelengths = 256) and BW Set 3 (Total Wavelengths = 512) for
Uniform Random and Skewed Traffic Patterns

Figure 3-8: Effect of increase in total number of wavelengths on Peak
Bandwidth and Area for d-HetPNoC for Skewed 3 traffic pattern.

Figure 3-9: Effect of increase in total number of wavelengths on Energy
per Message and Area for d-HetPNoC for Skewed 3 traffic pattern.

Though the area overhead of d-HetPNoC compared to Firefly increases with the total
bandwidth requirement, d-HetPNoC also provides a corresponding performance
improvement, as shown in figure 3-7. From figure 3-7 we can infer that, for all traffic
patterns, there is a significant improvement in peak bandwidth and a decrease in
energy per message as the total bandwidth requirement increases. In figure 3-8 we
compare the total area against peak bandwidth, and in figure 3-9 we compare the
total area against energy per message, for the skewed 3 traffic pattern as the total
number of wavelengths increases from 64 to 512. From figures 3-8 and 3-9 we can
infer that as the total number of wavelengths changes from 64 to 512, the total area
increases by 70%, the peak bandwidth increases by 751.31%, and the packet energy
decreases by 10.89%. While the bandwidth increases significantly, the packet energy
does not change much. Peak bandwidth is given as the product of throughput, flit size
and clock frequency. An increase in the total number of wavelengths signifies that
most of the cores are running high-bandwidth applications. Since these applications
are high bandwidth, more wavelengths are used to carry data on the photonic links.
With more wavelengths carrying data, the number of bits carried in one clock cycle
increases. This is achieved by increasing the flit size, which contributes to the increase
in peak bandwidth. With more wavelengths carrying data, throughput also increases,
as more flits reach the destination in less time. These are the major reasons for the
significant increase in peak bandwidth. The energy dissipated for a single bit transfer
remains practically unchanged across the photonic interconnects, resulting in only a
small decrease in overall packet energy.
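The relation stated above, peak bandwidth as the product of throughput, flit size and clock frequency, can be sketched as follows. The function name, the units (flits per cycle, bits, Hz) and all numeric values are assumptions for illustration only, not measurements from this work.

```python
def peak_bandwidth_bps(throughput_flits_per_cycle, flit_size_bits, clock_hz):
    """Peak bandwidth in bits per second: throughput x flit size x clock frequency."""
    return throughput_flits_per_cycle * flit_size_bits * clock_hz


# Hypothetical values: 0.5 flits/cycle, 32-bit flits, 2.5 GHz clock.
base = peak_bandwidth_bps(0.5, 32, 2.5e9)
# Doubling the flit size (more wavelengths per transfer) doubles the result.
wider = peak_bandwidth_bps(0.5, 64, 2.5e9)
print(base, wider)  # 40000000000.0 80000000000.0
```

Because the three factors multiply, a larger flit size and a higher throughput compound, which is consistent with the large bandwidth gain described above.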

Figure 3-10: Comparison of (a) Peak Core Bandwidth (b) Energy Per
Message of Firefly architecture for BW Set 1 (Total Wavelengths = 64), BW
Set 2 (Total Wavelengths = 256) and BW Set 3 (Total Wavelengths = 512)
for Uniform Random and Skewed Traffic Patterns

Figure 3-10 shows the effect of an increase in total bandwidth requirement on peak
bandwidth and energy per message for the Firefly architecture under different traffic
patterns. From figure 3-10 we can infer that as the total number of wavelengths
changes from 64 to 256, the total area increases by 41.17%, while the total peak
bandwidth increases by 764.52% and the energy per message decreases by 10.85%.
The reasons for the increase in peak bandwidth and decrease in packet energy are the
same as for d-HetPNoC. However, comparing figures 3-7 and 3-10, we can see that
for all traffic patterns, as the total number of wavelengths increases, Firefly's absolute
peak bandwidth is lower and its energy per message is higher than those of
d-HetPNoC. The reason for the higher peak bandwidth of d-HetPNoC is that it uses
the available wavelengths more effectively, which increases throughput and hence
peak bandwidth. Energy per message consists of photonic link energy and photonic
buffer energy. With increased skew in traffic, the routers get congested. Since
d-HetPNoC uses bandwidth more effectively under increased skew, flits occupy the
router buffers for a shorter duration than in the Firefly architecture. This causes far
less congestion in the routers of d-HetPNoC. Also, since flits occupy the buffers for a
shorter duration, the photonic buffer energy is lower, making the energy per message
lower for d-HetPNoC.

Chapter 4 Conclusion
In this thesis, we propose a crossbar-based heterogeneous photonic NoC with
dynamic bandwidth allocation, which can allocate different bandwidths between
different clusters of cores. This dynamic bandwidth requirement depends on
the type of core and on the application running on the cores.
Section 3.4.1.1 shows that d-HetPNoC provides higher peak data bandwidth
than the Firefly architecture for all kinds of traffic patterns. The d-HetPNoC
architecture provides an increase of about 8% in peak data bandwidth over the
Firefly architecture. Peak data bandwidth also increases with the total
bandwidth requirement. Section 3.4.1.2 shows that d-HetPNoC dissipates less
energy than the Firefly architecture, making it the more energy-efficient of the
two. The d-HetPNoC dissipates up to 5% less energy than the Firefly
architecture. Thus this scheme is demonstrated to achieve higher performance and
energy efficiency for the same overall data bandwidth compared to a homogeneous
photonic NoC.
Since d-HetPNoC provides DBA, there are certain overheads associated with
the scheme. The main overhead of d-HetPNoC is the area overhead. We envision
that this overhead could be mitigated by restricting each cluster to use wavelengths
from certain waveguides. An example would be to restrict a certain photonic router,
say PRx, to the wavelengths of Waveguide(x) and Waveguide(x+1). PRx would then
need modulators and de-modulators only for the wavelengths in Waveguide(x) and
Waveguide(x+1), thus reducing the number of modulators and de-modulators.
Despite the overheads, architectures with DBA are suitable for future CMPs, which
integrate heterogeneous cores like custom logic, GPGPUs, programmable fabrics
and memory.
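The restriction proposed above can be illustrated with a rough count of data modulators. This is a sketch under stated assumptions, not an evaluated design: the function names are hypothetical, the unrestricted count assumes each router needs one modulator per wavelength in every data waveguide, Waveguide(x+1) is assumed to wrap around modulo the number of data waveguides, and the sample values (16 routers, 64 wavelengths per waveguide, 8 data waveguides) are chosen for illustration.

```python
def accessible_waveguides(x, n_wd):
    """Waveguides PRx may write to: Waveguide(x) and Waveguide(x+1).

    Wrap-around modulo n_wd is an assumption made for this sketch.
    """
    return sorted({x % n_wd, (x + 1) % n_wd})


def data_modulators(n_pr, lambda_w, n_wd, restricted=False):
    """Total data modulators across all routers, with or without the restriction."""
    if restricted:
        return sum(lambda_w * len(accessible_waveguides(x, n_wd))
                   for x in range(n_pr))
    return n_pr * lambda_w * n_wd


# Assumed example: 16 routers, 64 wavelengths per waveguide, 8 data waveguides.
full = data_modulators(16, 64, 8)
limited = data_modulators(16, 64, 8, restricted=True)
print(full, limited)  # 8192 2048
```

In this sketch the restricted count no longer grows with the number of data waveguides once there are at least two, at the cost of less freedom in wavelength allocation.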
Future work would be to find better ways to effectively manage bandwidth
allocation with minimal overheads.

References
[1] Moore's Law, http://www.intel.com/content/www/us/en/silicon-innovations/moores-law-technology.html

[2] E.S. Chung, P.A. Milder, J.C. Hoe and K. Mai, "Single-Chip Heterogeneous
Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs?," 43rd
Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Dec.
2010, pp. 225-236.

[3] Chung, E.S.; Milder, P.A.; Hoe, J.C.; Mai, K., "Single-Chip Heterogeneous
Computing: Does the Future Include Custom Logic, FPGAs, and GPGPUs?," 43rd
Annual IEEE/ACM International Symposium on Microarchitecture (MICRO),
pp. 225-236, 4-8 Dec. 2010. doi: 10.1109/MICRO.2010.36

[4] Intel QuickPath Interconnect, http://www.intel.com/content/www/us/en/io/quickpath-technology/quickpath-technology-general.html

[5] HyperTransport, http://www.hypertransport.org/default.cfm?page=Technology

[6] Pande, P.P.; Grecu, C.; Jones, M.; Ivanov, A.; Saleh, R., "Performance evaluation
and design trade-offs for network-on-chip interconnect architectures," IEEE
Transactions on Computers, vol. 54, no. 8, pp. 1025-1040, Aug. 2005.
doi: 10.1109/TC.2005.134

[7] Yogita A. Sadawarte, Mahendra A. Gaikwad, and Rajendra M. Patrikar. 2011.
Comparative study of switching techniques for network-on-chip architecture. In
Proceedings of the 2011 International Conference on Communication, Computing &
Security (ICCCS '11). ACM, New York, NY, USA, 243-246.
DOI: 10.1145/1947940.1947992 http://doi.acm.org/10.1145/1947940.1947992

[8] Carloni, L.P.; Pande, P.; Xie, Y., "Networks-on-chip in emerging interconnect
paradigms: Advantages and challenges," 3rd ACM/IEEE International Symposium on
Networks-on-Chip (NoCS 2009), pp. 93-102, 10-13 May 2009.
doi: 10.1109/NOCS.2009.5071456

[9] Pavlidis, V.F.; Friedman, E.G., "3-D Topologies for Networks-on-Chip," IEEE
Transactions on Very Large Scale Integration (VLSI) Systems, vol. 15, no. 10,
pp. 1081-1090, Oct. 2007. doi: 10.1109/TVLSI.2007.893649

[10] Deb, S.; Ganguly, A.; Chang, K.; Pande, P.; Belzer, B.; Heo, D., "Enhancing
performance of network-on-chip architectures with millimeter-wave wireless
interconnects," 21st IEEE International Conference on Application-specific Systems
Architectures and Processors (ASAP), pp. 73-80, 7-9 July 2010.
doi: 10.1109/ASAP.2010.5540799

[11] Ganguly, A.; Chang, K.; Deb, S.; Pande, P.P.; Belzer, B.; Teuscher, C., "Scalable
Hybrid Wireless Network-on-Chip Architectures for Multicore Systems," IEEE
Transactions on Computers, vol. 60, no. 10, pp. 1485-1502, Oct. 2011.
doi: 10.1109/TC.2010.176

[12] Q. Xu, D. Fattal, and R. Beausoleil, "Silicon microring resonators with 1.5-µm
radius," Opt. Express 16, 4309-4315 (2008).

[13] Biberman, A., "Adiabatic microring modulators," Proc. Optical Fiber
Communication Conference and Exposition and the National Fiber Optic Engineers
Conference (OFC/NFOEC), 2013, pp. 1-3.

[14] D. Ahn, C. Hong, J. Liu, W. Giziewicz, M. Beals, L. Kimerling, J. Michel, J. Chen, and
F. Kärtner, "High performance, waveguide integrated Ge photodetectors," Opt.
Express 15, 3916-3921 (2007).

[15] A. Shacham et al., "Photonic Network-on-Chip for Future Generations of Chip
Multi-Processors," IEEE Transactions on Computers, vol. 57, no. 9, 2008,
pp. 1246-1260.

[16] Heck, M.J.R.; Bowers, J.E., "Energy Efficient and Energy Proportional Optical
Interconnects for Multi-Core Processors: Driving the Need for On-Chip Sources,"
IEEE Journal of Selected Topics in Quantum Electronics, vol. 20, no. 4, pp. 1-12,
July-Aug. 2014. doi: 10.1109/JSTQE.2013.2293271

[17] Bogaerts, W.; Baets, R.; Dumon, P.; Wiaux, V.; Beckx, S.; Taillaert, D.; Luyssaert, B.;
Van Campenhout, J.; Bienstman, P.; Van Thourhout, D., "Nanophotonic waveguides in
silicon-on-insulator fabricated with CMOS technology," Journal of Lightwave
Technology, vol. 23, no. 1, pp. 401-412, Jan. 2005. doi: 10.1109/JLT.2004.834471

[18] K. Ohira, K. Kobayashi, N. Iizuka, H. Yoshida, M. Ezaki, H. Uemura, A. Kojima, K.
Nakamura, H. Furuyama, and H. Shibata, "On-chip optical interconnection by using
integrated III-V laser diode and photodetector with silicon waveguide," Opt. Express
18, 15440-15447 (2010).

[19] S. Assefa et al., "CMOS-integrated 40GHz germanium waveguide photodetector for
on-chip optical interconnects," Proc. Optical Fiber Communication - includes post
deadline papers (OFC 2009), 2009, pp. 1-3.

[20] Yan Pan, Prabhat Kumar, John Kim, Gokhan Memik, Yu Zhang, and Alok
Choudhary. 2009. Firefly: illuminating future network-on-chip with nanophotonics.
In Proceedings of the 36th annual international symposium on Computer architecture
(ISCA '09). ACM, New York, NY, USA, 429-440. DOI: 10.1145/1555754.1555808
http://doi.acm.org/10.1145/1555754.1555808

[21] A. Joshi et al., "Silicon-Photonic Clos Network for Global On-Chip
Communication," Proc. 3rd International Symposium on Networks-on-Chip (NOCS),
May 2009, pp. 124-133.

[22] D. Vantrease et al., "Corona: System Implications of Emerging Nanophotonic
Technology," Proc. IEEE International Symposium on Computer Architecture
(ISCA), 21-25 June, 2008, pp. 153-164.

[23] Y. Xie et al., "Crosstalk Noise and Bit Error Rate Analysis for Optical
Network-on-Chip," Proceedings of IEEE/ACM Design Automation Conference (DAC),
2010, pp. 657-660.

[24] P. Pande, C. Grecu, M. Jones, A. Ivanov, R. Saleh, "Performance evaluation and
design trade-offs for network-on-chip interconnect architectures," IEEE
Transactions on Computers, vol. 54, no. 8, pp. 1025-1040, Aug. 2005.

[25] S. Che et al., "Rodinia: A Benchmark Suite for Heterogeneous Computing,"
In Proceedings of the 2009 IEEE International Symposium on Workload
Characterization (IISWC), Washington, DC, USA, 44-54.

[26] CUDA Toolkit Documentation, URL: http://docs.nvidia.com/cuda/cuda-samples/

[27] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, "Analyzing
CUDA Workloads Using a Detailed GPU Simulator," (ISPASS), 2009, pp. 163-174.

[28] P. Dong et al., "Tunable high speed silicon microring modulator," Conference
on Lasers and Electro-Optics (CLEO) and Quantum Electronics and Laser Science
Conference (QELS), May 2010, pp. 1-2.

[29] Circuits MultiProjets, URL: http://cmp.imag.fr

[30] K. Preston et al., "Performance Guidelines for WDM Interconnects Based on
Silicon Microring Resonators," Proc. Laser and Electro-Optics (CLEO), 2011, pp. 1-2.

[31] A. Mishra, N. Vijaykrishnan and C. R. Das, "A case for Heterogeneous On-Chip
Interconnects for CMPs," Proc. of International Symposium Computer Architecture
(ISCA), 2011, pp. 389-400.

[32] A. K. Mishra, O. Mutlu, and C. R. Das. 2013. A heterogeneous multiple
network-on-chip design: an application-aware approach. In Proceedings of the 50th
Annual Design Automation Conference (DAC '13). ACM, New York, NY, USA,
Article 36, 10 pages.

