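A Low Latency Optical Switch for High Performance Computing with Minimized Processor Energy Load [Invited]
Shiyun Liu, Qixiang Cheng, Muhammad Ridwan Madarbux, Adrian Wonfor, Richard V. Penty, Ian H. White and Philip M. Watts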




Abstract: Power density and cooling issues are limiting the performance of high performance chip multiprocessors (CMP) and off-chip communications currently consume over 20% of power for memory, coherence, PCI and Ethernet links. Photonic transceivers integrated with CMPs are being developed to overcome these issues, potentially allowing low hop count switched connections between chips or data center servers. However, latency in setting up optical connections is critically important in all computing applications and having transceivers integrated on the processor chip also pushes other network functions and their associated power consumption onto the chip. In this paper, we propose a low latency optical switch architecture which minimizes power consumed on the processor chip for two scenarios: multiple socket shared memory coherence networks and optical top-of-rack switches for data centers. The switch architecture reduces power consumed on the CMP using a control plane with a simplified send and forget server interface and the use of a hybrid Mach-Zehnder Interferometer (MZI) and semiconductor optical amplifier (SOA) integrated optical switch with electronic buffering. Results show that the proposed architecture offers a 42% reduction in head latency at low loads compared with a conventional scheduled optical switch as well as offering increased performance for streaming and incast traffic patterns. Power dissipated on the server chip is shown to be reduced by over 60% compared with a scheduled optical switch architecture with ring resonator switching.

Manuscript received July 1 2014.
S. Liu was with the Electronic Engineering Department, University College London, London WC1E 7JE, UK. She is now with Barclays Investment Bank, London, UK.
M. R. Madarbux and P. M. Watts are with the Electronic Engineering Department, University College London, London WC1E 7JE, UK (philip.watts@ucl.ac.uk).
Q. Cheng, A. Wonfor, R. V. Penty and I. H. White are with the Centre for Advanced Photonics and Electronics, University of Cambridge, Cambridge CB3 0FA, UK.
 
Index Terms: Assignment and routing algorithms; Networks; Optical Interconnects

I. INTRODUCTION
Research efforts in optical networking for data centers are aimed at both lower latency and lower energy consumption. Although total energy consumption is a critical issue for the largest data centers, networking equipment only accounts for 5% of this with the majority consumed at the server or chip multiprocessor (CMP) level [1]. In addition, CMP power density and thermal management issues are seriously limiting processor performance [2]. High performance server chips require >1 Tb/s of off-chip bandwidth including Ethernet, PCI, main memory and coherence links which are consuming >20% of total power [3]. In parallel, there is a major research effort aimed at packaging optical communications components within the CMP, for example using silicon photonics [4-9] to minimize latency and energy consumption and eliminate communications bottlenecks. However, on-chip optics also necessitates integrating the PHY and MAC layers and their associated energy consumption onto chip and also requires an optical power supply. Previous work has shown that a large proportion of the network energy consumed on the processor chip is due to buffering, transmission control and absorbed optical power in on-chip and chip-to-chip networks [10, 11]. Therefore, network architectures are required which provide low latency but reduce the energy dissipated on the processor chip.
A large proportion of the bandwidth of a high performance server is used for point-to-point main memory links (200 Gb/s in [3]) and low latency and power optical replacement options have been studied [12, 13]. However, in this paper we focus on applications which can benefit from optical switching, in particular chip-to-chip memory coherence, Ethernet and PCI networks. Chip-to-chip coherence networks are used in high performance servers which share the memory space across multiple chips to improve parallel application performance (Fig. 1) by
exchanging control (typically 8 B) and data (16-256 B) messages to ensure that memory is consistent across all caches [14]. Coherence network latency has a critical effect on multiprocessor performance as processors must stall until coherence transactions on the network have completed. This low latency requires high bandwidth (460 Gb/s in [2]) and hence coherence networks consume a significant proportion of processor chip power. Switched photonics potentially reduces the latency and power consumption of coherence networks spanning multiple chips by providing a single network connecting every core or cluster of cores (Fig. 1b) rather than the separate on-chip and chip-to-chip networks used in current servers (Fig. 1a).
Ethernet and PCI server interfaces can both benefit from switched photonic connections. Although data centers can have >10^5 servers, scaling optical switches to these port counts is challenging as is the associated global allocation problem. Semiconductor optical amplifier (SOA) switching has been shown to allow large switching fabrics at the physical layer [15]. Optical switches on a single integrated circuit are limited by losses, but have been shown to be viable with 64 ports or more using SOAs [16], Mach-Zehnder interferometers (MZI) [5] and silicon ring resonators [6]. Switches of this radix can replace the electronic top-of-rack (ToR) switch in leaf and spine data center architectures as shown in Fig. 2 [17], providing sufficient links to connect a rack of servers as well as uplinks to core electronic routers. This architecture offers two hop connections to any other server within the data center, keeps the allocation problem manageable and avoids the issues of multiple stage switching [18]. Current servers feature a small number of 10 Gb/s Ethernet interfaces and therefore consume a much smaller proportion of server off-chip communications power than memory and coherence links. However, in this case, optical switching has the potential to reduce latency and total network power consumption and provide higher bandwidth without electronic pin or front panel limits. Higher bandwidth can also mitigate the performance issues of data center workloads such as those caused by incast traffic.
Optical networks for shared memory [19-22] and data center [23-26] applications have been previously proposed. In contrast, we propose a low latency optical switching architecture supporting at least 64 ports which specifically minimizes power consumed and dissipated on the CMP in these applications. Low latency is provided using speculative transmission, in which messages are sent before a switch path is established, combined with fast electronic allocators and electronic buffers at the switch for packets which fail allocation. Energy consumption on the server chip is minimized by (1) providing a simple server-side send and forget network interface with minimal buffering and control logic, (2) use of hybrid MZI/SOA switching which reduces the optical power absorbed at the server transmitter and (3) avoiding any significant receiver side buffering by ensuring in-order delivery. Initial results were presented in [10]. This paper provides a more detailed description and power models for the proposed switch architecture including practical allocation circuits for the MZI/SOA switch and characterization of the latency in problem workloads such as incast traffic.
The rest of the paper is organized as follows: Section II describes the recirculation network control plane and the hybrid MZI/SOA switch architecture. Section III presents latency results for common data center workloads taking into account the operating clock frequency of the control plane. Section IV presents the power model for the network along with results showing the reduction in power dissipated on the processor chip. Section V discusses the results and their impact on future computing systems focusing on the multi-socket shared memory network and optical top-of-rack applications described above. Finally, Section VI concludes.
 
II. NETWORK ARCHITECTURE
A. Control Plane

Baseline Virtual Channel Switch
Figure 3a shows a high performance input queued virtual channel (VC) scheduled switch connecting multiple compute cores. In an N port switch, the source port contains N-1 first in first out (FIFO) queues, known as virtual channels (VC), one for each destination. The source ports queue new messages in the appropriate FIFO and send requests for a switch path to the allocator. The allocator (also known as scheduler or arbiter) attempts to find the best switch configuration to serve all requests and sends grants back to the ports which have been successful. Unsuccessful requests will be served in future allocation cycles. The use of VCs and the iSLIP allocation algorithm has been shown to achieve 100% throughput under random traffic in a fair manner [27]. iSLIP is a separable (arbitrates separately for output and input ports of all outstanding requests) round robin allocator which updates priority states in a way which avoids individual arbiters becoming synchronized. In this work, a broadband switch and wavelength striped transmission is used in order to achieve high bandwidth per port and hence low serialization latency (compared with alternative approaches using wavelength selective elements, e.g. [18, 23]). As shown in Fig. 4a, control signaling between the port and allocator (requests and grants) increases arbitration latency. For this reason, optical-electrical-optical (OEO) conversion to allow queuing at every switch port has been proposed [18]. However, to get the full energy and latency advantage of optical switching, data should remain in the optical domain from the source port through to the destination port. In this work, we use the VC switch shown in Fig. 3a with latency characteristics shown in Fig. 4a as the baseline. Our proposed switch deals with the issue of control latency while using OEOs for only the packets which fail allocation.
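The separable round robin allocation described above can be illustrated with a minimal Python sketch of a single request-grant-accept iteration in the style of iSLIP. This is not the SystemVerilog allocator used in this work; the function name, the data layout and the pointer arrays are assumptions made for illustration only.

```python
def islip_iteration(requests, grant_ptr, accept_ptr):
    """One request-grant-accept iteration of a separable round robin allocator.

    requests[i][j] is True when input i has a packet queued for output j.
    grant_ptr[j] and accept_ptr[i] are the round robin priority pointers
    (hypothetical names). Returns {input: output} and updates the pointers.
    """
    n = len(requests)

    # Grant stage: each output grants the first requesting input found
    # at or after its round robin pointer.
    grants = {}  # input -> list of outputs that granted it
    for out in range(n):
        for k in range(n):
            inp = (grant_ptr[out] + k) % n
            if requests[inp][out]:
                grants.setdefault(inp, []).append(out)
                break

    # Accept stage: each input accepts the first granting output found
    # at or after its own round robin pointer.
    matches = {}
    for inp, outs in grants.items():
        for k in range(n):
            out = (accept_ptr[inp] + k) % n
            if out in outs:
                matches[inp] = out
                break

    # Pointers advance one place past the matched port, and only when the
    # grant was accepted. This is what stops the arbiters synchronizing.
    for inp, out in matches.items():
        grant_ptr[out] = (inp + 1) % n
        accept_ptr[inp] = (out + 1) % n
    return matches
```

With every input requesting every output and all pointers at zero, the first call matches only one pair because all outputs grant the same input. The pointer update then desynchronizes the arbiters, so the next call matches two pairs.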
    
Proposed Send and Forget Interface with Buffered Switch
In the proposed scheme, shown in Fig. 3b and hereafter described as the buffered switch, speculative transmission is used to minimize control signaling latency. Speculative transmission of messages, in which data is sent without waiting for a grant, has been previously proposed either operating independently [4] or in parallel with a scheduled allocator [28]. However, our previous work showed that high performance speculative schemes require complex logic and buffering at the transmitter (and also at the receiver if in-order delivery is required) which increases power consumption on the server chip [10].
Our speculative implementation, which simplifies the server side of the network, operates as follows. Each transmitter has a simple FIFO queue which considerably reduces the power and area of buffering resources on the processor chip. To meet the aim of providing a low energy send and forget interface at the server, the switch must not drop packets. When the channel is free, the transmitter controller first checks that there is a free slot in the switch buffers (and hence there is no chance that the packet may be dropped). This single bit full control signal, the only connection required back to the server from the switch, is asserted when there is one free slot in the switch buffers to allow for the fact that a packet may already be in transit. If there is buffer space available, the controller sends a switch path request to the allocator for the packet at the front of the FIFO and then, several clock cycles later, speculatively sends the packet in a wavelength striped format. The number of clock cycles between request and data transmission is determined by the allocation time and the switch reconfiguration time and is discussed further in Section III.
Contention resolution is handled entirely at the switch using electronic buffers. Although fiber delay line buffers have been studied for all-optical packet switching, single chip integrated WDM transceivers [7] and fast dense electronic memory provide improved area and timing characteristics for low latency networks [23]. If allocation is successful, the switch is reconfigured and the packet is delivered with low latency (Fig. 4b). If allocation is unsuccessful, the packet is sent to the switch buffers after conversion back to the electronic domain (Fig. 4c). Packets are queued by source port (mapping input 1 to buffer 1, input 2 to buffer 2, etc.) and transmitted through the switch in a later allocation cycle. In contrast to [23, 26], this direct mapping of buffers to source ports considerably simplifies the allocation problem and switch architecture and is essential for the send and forget protocol to ensure that no packet will be dropped. Also in contrast to [23, 26], these buffers store wavelength striped messages rather than a single serial message per wavelength, increasing optical transmitter and receiver count but reducing serialization latency and increasing throughput. Strict in-order delivery is adopted by always giving priority in allocation to packets in the switch buffers over new packets from the servers, as our previous work showed that there is a significant power cost in reordering packets at the receiver [10].
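The send and forget behaviour described above can be sketched as a cycle-level toy model in Python. It assumes zero time of flight, one packet per port per cycle and fixed-priority ordering among ports, none of which reflects the real control plane. The class and method names are hypothetical, and a packet is represented only by its destination port.

```python
from collections import deque


class SendAndForgetPort:
    """Server-side interface: a single FIFO and no retransmission state."""

    def __init__(self):
        self.fifo = deque()

    def step(self, channel_free, switch_full):
        """Return the packet to send speculatively this cycle, or None."""
        if channel_free and not switch_full and self.fifo:
            return self.fifo.popleft()  # sent and forgotten: no copy is kept
        return None


class BufferedSwitch:
    """Switch side: one electronic buffer per source port (input i -> buffer i)."""

    def __init__(self, n_ports, depth=4):
        self.buffers = [deque() for _ in range(n_ports)]
        self.depth = depth

    def full(self, port):
        # Asserted while only one slot remains, since a packet may be in transit.
        return len(self.buffers[port]) >= self.depth - 1

    def cycle(self, arrivals):
        """arrivals maps input port -> destination; returns delivered pairs."""
        delivered, busy_outputs = [], set()
        # Buffered packets are allocated first, which preserves ordering.
        for port, buf in enumerate(self.buffers):
            if buf and buf[0] not in busy_outputs:
                dest = buf.popleft()
                busy_outputs.add(dest)
                delivered.append((port, dest))
        for port, dest in sorted(arrivals.items()):
            # A new packet fails if its output is taken or if older packets
            # from the same source are still waiting in the buffer.
            if dest in busy_outputs or self.buffers[port]:
                self.buffers[port].append(dest)  # OEO conversion into buffer
            else:
                busy_outputs.add(dest)
                delivered.append((port, dest))
        return delivered
```

In this sketch the server never learns whether a speculative packet succeeded. The only feedback it receives is the single `switch_full` bit, which is the property that keeps the on-chip interface small.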
B. Optical Switch Architecture
Several optical broad-band (multiple wavelength) integrated switching technologies with ns reconfiguration times have been demonstrated based on semiconductor optical amplifiers (SOA) [16], Mach-Zehnder interferometers (MZI) [5] and ring resonators [6, 9]. In this work we use the hybrid MZI and SOA dilated switch architecture [29] which has been shown to scale to 128 ports by using an 8x8 port switch in a recirculating loop experiment [30]. The MZIs in this device have been designed to operate over the wavelength range 1540-1560 nm, providing a large bandwidth for wavelength striped transmission. Operation using 10 wavelengths of 10 Gb/s each has been demonstrated for an 8-port switch [31]. In this architecture, the SOAs overcome the main limitation of pure MZI devices to increase crosstalk suppression to over 50 dB and also provide gain to reduce the overall switch insertion loss, which significantly reduces input optical link power (and hence the power absorbed on the processor chip as described in Section IV A).
Figure 5a shows the hybrid dilated switch architecture, which is a type of butterfly network [32] based on 4-port switch building blocks each using 4 MZIs and 8 SOAs. These blocks are interconnected with passive shuffle networks which comprise passive waveguides, bends and waveguide crossings. An NxN switch, such as that required for the baseline VC switch, requires an array of (N log2(N))/2 4-port switching blocks arranged as log2(N) stages (or columns) and N/2 rows. The hybrid switch design uses a dilated scheme to achieve a lower crosstalk ratio. The purpose of dilation is to ensure that each individual MZI switching element only carries one signal at a time and hence the maximum usage of the total switch fabric is 50%. This is the reason that the input/output stages only use 2 of the 4 ports. This architecture significantly reduces component count and waveguide crossing losses compared with a crossbar architecture. For further details of the hybrid MZI/SOA switches refer to [29-31].
Although the switch buffer scheme requires 2N ports (N ports for processor links and N ports for the buffers), as buffer ports never need to send to other buffer ports, a full 2N x 2N switch is not required and the internal architecture can be simplified. We consider two cases as shown in Fig. 5b and 5c. Case 1 uses two NxN switches, N 1x2 input switches and (N/2) 4x2 output switches. The input and output switches are constructed from the 4-port switch blocks described above and shown in Figure 5a. Overall, the case 1 switch uses N log2(N) + (3N/2) 4-port blocks. Failed speculative packets are routed to the buffers by the input switch. These packets are routed across the second NxN switch to the output switches when the output port is free.
Further simplified structures are possible such as the case 2 architecture shown in Fig. 5c. As with case 1, an input switch determines whether the packet is routed to the main NxN switch or the buffers. However, in this case failed speculative packets stored in the buffers are routed back through the 2x2 input switch and the main NxN switch when both the input and output ports are free. The number of 4-port blocks is reduced to N log2(N)/2 + N, but the limitation is that packets from the transmitter and recirculation buffer on the same input port but destined for different output ports cannot pass through the switch simultaneously. In the following sections, we evaluate the performance and power characteristics of the two buffered switch designs relative to the VC switch.
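The block counts quoted above can be checked with a short calculation. The Python sketch below evaluates the three expressions for N = 64 and multiplies by the 4 MZIs and 8 SOAs per 4-port block. The function and key names are illustrative, and N is assumed to be a power of two.

```python
from math import log2


def four_port_blocks(n):
    """Number of 4-port blocks for each architecture, from the expressions above."""
    stages = int(log2(n))
    assert 2 ** stages == n, "n is assumed to be a power of two"
    return {
        "vc": n * stages // 2,               # (N log2 N)/2, single NxN switch
        "case1": n * stages + 3 * n // 2,    # two NxN + N input + N/2 output blocks
        "case2": n * stages // 2 + n,        # one NxN + N input blocks
    }


counts = four_port_blocks(64)
for name, blocks in counts.items():
    # Each 4-port block uses 4 MZIs and 8 SOAs.
    print(f"{name}: {blocks} blocks, {4 * blocks} MZIs, {8 * blocks} SOAs")
```

For 64 ports this gives 192 blocks for the baseline NxN switch, 480 for case 1 and 256 for case 2, so case 2 needs roughly half the components of case 1.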
Another key advantage of the proposed architecture is that variations in power and optical signal to noise ratio which could affect physical layer performance are expected to be very small and, if necessary, can be calibrated out by adjusting the gain of individual SOAs. In the ToR scenario, all transmissions, whether server to server or server to/from core router, take one pass through the switch. Within the switch itself, all transmissions pass through the same number of 4-port switching blocks. In addition, the differences in attenuation due to fiber transmission distance between server-to-server traffic and server-to-router traffic will be minimal on data center scales (<1 km). However, future work is required to define fabrication variations for the integrated photonic components and calibration procedures for their mitigation.
III. LATENCY RESULTS
We have modeled the control planes of the baseline scheduled VC network and the proposed buffered switch network in SystemVerilog including buffering, transmission control and switch allocation. Delays are inserted into the control plane model to account for the time of flight of optical data and control signals between network ports, allocator and switch. The SystemVerilog model allows us to simulate the network control plane to obtain latency under various traffic patterns and, in addition, to synthesize key circuits using an application specific integrated circuit (ASIC) design flow to obtain the minimum clock period, area and power consumption possible in a real CMOS circuit. In our previous work, latency values for a 64-port ToR switch were reported in clock cycles [10]. However, as discussed in Section III A below, the allocation circuit depends on both the switch and control plane architecture, each having a different achievable clock period. The allocator clock period in turn determines the clock periods of other circuits and hence has a major impact on overall latency. Section III A describes allocation circuits required for the different control plane and switch cases and their timing characteristics in a 45 nm CMOS process. Section III B then reports latency without congestion while Sections III C and D show the relative performance under load with random, streaming and incast traffic.
A. Allocation Circuits
Figure 6a shows a separable allocator such as iSLIP suitable for a VC crossbar switch, consisting of a two stage process of output port arbitration followed by input port arbitration. In previous work [10], it was shown that the clock period of this circuit increases rapidly with number of ports, reaching 2.3 ns for a 64-port switch in a 45 nm CMOS process. The input and output arbitration stages can be pipelined to reduce the clock period [27] and although this does not reduce allocation latency, it reduces the latency of other control plane functions which use the same clock as the allocator. Allocators for speculative transmission using crossbar switches (Fig. 6b) only require output port arbitration (as the input port decision has been made at the transmitter) and hence are more scalable, having a clock period of 0.75 ns for 64 ports [10]. However, as discussed above, crossbar switches have a high component count, which motivates more scalable switch architectures such as the hybrid dilated structure described in Section II B.
The hybrid dilated switch requires more complex allocation because there are multiple paths through the switch for each pair of input and output ports. This switch is a type of butterfly network, for which destination tag routing can be used [32]. Destination tag routing is an oblivious/deterministic routing technique which can suffer from poor load balancing. On the other hand, it is simple to implement and fast so is often used in practice. Here, we adopt destination tag routing to obtain minimum latency at low loads. Adaptive routing may reduce latency at high network loads and will be considered in future work. The allocator for the hybrid dilated switch (using either VC or buffered switch architectures) is shown in Fig. 6c. It consists of an array of 4-port arbiters, one for each 4-port block. The 4-port arbiter has been synthesized in the same 45 nm CMOS process and found to have a critical path length of 0.34 ns, giving a minimum clock period including sequencing overheads of 2.15 ns for the 6 cascaded arbiters of a 64-port switch. Pipelining can easily be applied between stages of arbitration.
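As an illustration of destination tag routing, the Python sketch below routes a packet through a textbook 2-ary butterfly in which each stage fixes one bit of the port address, most significant bit first. This wiring is an assumption made for illustration: the shuffle networks and dilated 4-port blocks of the hybrid switch are not modelled, and the function name is hypothetical.

```python
def destination_tag_route(src, dst, n_ports):
    """Return (per-stage output choices, final port) for a 2-ary butterfly.

    Assumes n_ports is a power of two and that stage s sets address bit
    (stages - 1 - s) of the packet's position to the matching bit of dst.
    """
    stages = n_ports.bit_length() - 1   # log2(n_ports) for a power of two
    pos, hops = src, []
    for s in range(stages):
        bit = stages - 1 - s
        out = (dst >> bit) & 1          # choice depends on the destination only
        pos = (pos & ~(1 << bit)) | (out << bit)
        hops.append(out)
    return hops, pos


hops, final = destination_tag_route(src=5, dst=42, n_ports=64)
print(hops, final)
```

The per-stage choices are simply the bits of the destination address, so no path search is needed. This is why the technique is fast and also why it cannot steer around contention. For a 64-port switch there are 6 stages, and 6 cascaded arbiters at 0.34 ns each account for 2.04 ns of the 2.15 ns clock period quoted above, the remainder being sequencing overhead.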
A final important point to be made about allocator circuits is that the results in [10] showed that, despite their critical effect on latency, the allocator power consumption is not significant compared with other network power sources. However, the allocator synthesis power results are included in the energy analysis of Section IV.
 
B. Latency without contention
Figure 7 shows the head latency for various server-to-switch distances and switch configurations without contention (the case in which all speculative transmissions in the recirculation case are successful and the switch buffers are therefore not used). The head latency is defined as the time between new data arriving in the input buffers and the first bit arriving at the receiver. It does not include serialization latency, which removes the effect of the difference in message sizes between applications. Table I summarizes the clock period and allocation pipelining used in each case. Other control plane functions shown in Fig. 4 such as sending requests, processing grants and synchronization of requests and grants with the local clock domain take one clock cycle each (at the allocator clock period) based on timing results from synthesis. The discontinuities in Figure 7 are caused by rounding up request, grant and data transmission to the nearest clock cycle. Using a linear fit on these results and adding the serialization latency (assuming 10 wavelengths of 10 Gb/s), the no contention latency of the buffered MZI/SOA switch in ns as a function of distance from port to switch, x (in m), and packet length, p (in B), is:

L = 8.9 + 20.0x + 0.08p            (1)

L = 7.1 + 10.0x + 0.08p            (2)

for the VC switch and crossbar. It can be observed that the latency advantage of the recirculation switch increases with network dimensions, from several ns for a chip-to-chip coherence network (typical dimensions 10-30 cm) to 20-40 ns in the case of a rack scale network (2-4 m).
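Two terms in these fits can be checked by hand, and the Python sketch below does so. The 0.08 ns per byte serialization coefficient follows from 8 bits per byte sent at 10 x 10 Gb/s. The distance coefficients of the two equations differ by 10.0 ns per metre, which reproduces the 20-40 ns gap quoted for 2-4 m links. The function names are illustrative, and the sketch deliberately ignores the constant terms.

```python
def serialization_ns(packet_bytes, wavelengths=10, gbps_per_wavelength=10.0):
    """Serialization latency in ns for a wavelength striped packet.

    Bits divided by Gb/s gives ns directly.
    """
    return packet_bytes * 8 / (wavelengths * gbps_per_wavelength)


def distance_gap_ns(distance_m, slope_a=20.0, slope_b=10.0):
    """Difference between the distance terms of Eq. (1) and Eq. (2), in ns."""
    return abs(slope_a - slope_b) * distance_m


print(serialization_ns(1))      # per-byte coefficient used in both equations
print(serialization_ns(128))    # a 128 B packet
print(distance_gap_ns(2.0), distance_gap_ns(4.0))  # rack scale links
```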
 
C. Latency with Random Traffic
The SystemVerilog model was used to characterize the performance of the switch under load using the techniques described in [32]. Figure 8a shows the comparison between the VC switch and the two buffered switch cases with uniform random packet inter-arrival times and random destinations for the ToR application case. The latencies include the optical time of flight for data and control signals between servers and ToR (assuming a 2 m fiber connection) and the serialization latency of 128 B packets using 10 wavelengths of 10 Gb/s. In practice, packets in the ToR switch application could be up to 9000 B long (assuming Ethernet). However, as our SystemVerilog allocator and buffer designs are currently limited to fixed packet sizes, we simulate for 128 B packets. Larger packets would need to be split up and routed separately in this scenario. All FIFOs in the transmitter and switch buffers can contain 4 packets. Unlike the results in [17], realistic clock periods and synchronization overheads are included as discussed in Section III A. It can be observed that the case 1 buffered switch maintains its latency advantage over the VC switch up to the saturation load of 65% despite having approximately 32 times lower buffering requirements. The simpler case 2 recirculation switch saturates at 50% load.
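One plausible way to arrive at a buffering ratio of about 32 is to count FIFOs, and the Python sketch below does this. It assumes that the VC switch holds N-1 FIFOs at each source port, as described in Section II A, and that the buffered switch holds one transmitter FIFO plus one switch buffer per port. This accounting is an interpretation offered for illustration, not a derivation stated in this paper.

```python
def fifo_counts(n_ports):
    """FIFO totals under the assumed accounting (see lead-in)."""
    vc = n_ports * (n_ports - 1)   # N-1 virtual channels at each of N ports
    buffered = n_ports * 2         # one Tx FIFO + one switch buffer per port
    return vc, buffered


vc, buffered = fifo_counts(64)
print(vc, buffered, vc / buffered)
```

For 64 ports the ratio is 31.5, which is consistent with the approximate figure of 32. Because every FIFO holds the same number of packets in these simulations, the ratio of total buffer capacity is the same.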
The allocation algorithms used in this work are designed to be fast rather than achieving a maximal matching between requests and grants. The saturation load or maximum throughput of the VC network could be increased using multiple iterations of iSLIP to approach maximal matching at the expense of a latency penalty at low loads [10, 27]. However, increasing the number of iterations in the current buffered switch allocator using a fast deterministic routing algorithm will not provide any further benefit. Therefore the proposed buffered switch architecture trades off throughput to achieve minimum latency. Further research is required to investigate adaptive allocation and routing algorithms to increase throughput in the MZI/SOA buffered switch.
D. Latency with Streaming and Incast Traffic
Random traffic is well known to be benign [27, 32]. We also tested the switch control planes using streaming and incast traffic. In the streaming case, one source port sends all its traffic to a single destination port with random destinations for traffic from all other source ports. This simulates the transmission of packetized video or large segmented messages. In the incast case, all source ports send to a single destination port. This traffic pattern is common in data center workloads, for example in large scale search algorithms, and is well known to stress data center networks. Figures 8b and 8c show the performance for streaming and incast traffic respectively for the same ToR scenario. Both buffered switch cases have a higher saturation load than the VC switch for both streaming and incast traffic. Round robin arbitration, used in both the VC and buffered switch cases, will not give priority to the streaming or incast packets. However, in the buffered switch case, failed speculative packets are stored close to the switch for rapid retransmission whereas in the VC case additional control latency is incurred, reducing the streaming port utilization and hence maximum throughput. It can be observed that the saturation loads are very low in the incast case, as expected, due to stressing a single receiver. The saturation load was found to be very sensitive to the number of incast ports but independent of the switch buffer depths due to the strict in-order delivery policy.
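The three traffic patterns differ only in how each source chooses a destination, which can be sketched in Python as follows. The function, its parameters and the choice of hot ports are hypothetical, and the sketch is not the traffic generator used in the SystemVerilog testbench.

```python
import random


def pick_destination(src, n_ports, pattern, rng, hot_src=0, hot_dst=1):
    """Choose a destination port for one packet from source port src.

    pattern is "random", "streaming" or "incast". hot_src and hot_dst are
    illustrative choices. In the incast case the caller is assumed not to
    generate traffic from hot_dst itself.
    """
    if pattern == "incast":
        return hot_dst                       # every source targets one port
    if pattern == "streaming" and src == hot_src:
        return hot_dst                       # one source streams to one port
    # Otherwise pick uniformly among the other n_ports - 1 ports.
    dst = rng.randrange(n_ports - 1)
    return dst if dst < src else dst + 1


rng = random.Random(1)
print([pick_destination(s, 8, "streaming", rng) for s in range(8)])
```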
IV. ENERGY ANALYSIS
To assess the energy consumption in each network and demonstrate that the send and forget interface combined with MZI/SOA switching can reduce power consumed in future processor chips with integrated optical transceivers, we have modeled the power consumption of each network component. This section describes the energy models and gives results for the total network power and the power consumed on the processor chip. This analysis is for transceivers which are integrated on chip with the processor elements with optical power supplied by off-chip lasers as shown in Fig. 9. Lasers are not power gated. Other assumed parameters with references are given in Table II.
A. Optical Power and Switch Power Requirements
As previously discussed, one of the key advantages of the
MZI/SOA switch architecture in the ToR or shared memory
applications is the low insertion loss due to the gain of the
SOA elements. However, increasing the SOA length and
bias current to increase gain and reduce insertion loss has
to be balanced against the increased spontaneous emission
noise (and hence a higher receiver power penalty) and
higher power consumption. Although only 8-port MZI/SOA
switches have been fabricated to date [13], the architecture
has been shown to operate with 64 ports with 1.9 dB
receiver penalty using 20 mA bias current for each SOA by
measurement of a 2x2 hybrid dilated switch in a
recirculating loop [20]. In this configuration, each 4-port
switch block has a loss of 1.2 dB, giving overall losses of
7.2 dB and 8.4 dB respectively for the two buffered switch cases.
This represents a good tradeoff between low insertion loss,
optical signal to noise ratio (OSNR) and drive power
consumption. The drive power of each SOA is 20 mW, which
dominates the overall power consumption of the MZI/SOA
switch, with the MZI drive power being negligible by
comparison [29]. For comparison, we use a silicon photonic
micro ring resonator (MRR) switch connected in a 3-stage
Clos configuration. Silicon ring resonator switches are
attractive due to low area, drive powers and potential cost,
but have relatively high losses. Using figures extrapolated
from published literature (see Table II) we calculate that a
64-port MRR switch will have a loss of 17.7 dB. Note that in
practice the hybrid MZI/SOA switch, ring resonator
switches and laser sources require temperature control.
However, as the power consumption of temperature control
will be similar for all cases, we do not include this in the
energy comparison.
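As a rough check on the loss figures above, the following Python sketch multiplies the 1.2 dB block loss by the number of cascaded 4-port blocks and computes the SOA drive power from the bias current and voltage in Table II. The block counts of 6 and 7 are an assumption inferred from the quoted 7.2 dB and 8.4 dB totals, and the function names are illustrative only.

```python
# Illustrative arithmetic only; names and block counts are assumptions.
BLOCK_LOSS_DB = 1.2          # loss of one 4-port MZI/SOA switch block (Table II)
SOA_BIAS_CURRENT_MA = 20.0   # bias current per SOA (Table II)
SOA_BIAS_VOLTAGE_V = 1.0     # bias voltage (Table II)
MRR_SWITCH_LOSS_DB = 17.7    # 64-port MRR switch loss quoted in the text


def cascaded_loss_db(num_blocks, block_loss_db=BLOCK_LOSS_DB):
    """Total loss in dB of a path through num_blocks identical blocks."""
    return num_blocks * block_loss_db


def soa_drive_power_mw(current_ma=SOA_BIAS_CURRENT_MA,
                       voltage_v=SOA_BIAS_VOLTAGE_V):
    """Electrical drive power of one SOA in mW (mA x V = mW)."""
    return current_ma * voltage_v


if __name__ == "__main__":
    for blocks in (6, 7):  # assumed path lengths giving 7.2 dB and 8.4 dB
        loss = cascaded_loss_db(blocks)
        margin = MRR_SWITCH_LOSS_DB - loss
        print(f"{blocks} blocks: {loss:.1f} dB "
              f"({margin:.1f} dB below the MRR switch)")
    print(f"SOA drive power: {soa_drive_power_mw():.0f} mW")
```

Under these assumptions the MZI/SOA paths are roughly 9 to 10 dB less lossy than the 17.7 dB MRR switch.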
Input optical power requirements were calculated using
the switch insertion losses discussed above and other
component loss parameters given in Table II. Figure 9
shows the power budgets for the MZI/SOA switch and the
ring resonator crossbar switch, demonstrating the reduction
in optical power absorbed on the processor chip in the
former case. It is important to note that all chip-to-chip
links are assumed to use fiber, which has negligible loss on
these network scales and, hence, there is no significant
difference in the power budgets between the ToR and shared
memory network applications.
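The way a loss budget translates into required optical power can be sketched as follows. The sketch assumes that the power budget is the total loss in dB between the laser output and the receiver, that the receiver sensitivity of -18 dBm from Table II applies per wavelength, and that the 30% laser efficiency in Table II is a wall-plug efficiency. The budget values of 16.3 dB and 26.8 dB are those quoted for the MZI/SOA and MRR switches in Section IV.C; all function names are hypothetical.

```python
# Sketch under stated assumptions; not the authors' power budget model.
RECEIVER_SENSITIVITY_DBM = -18.0  # Table II
LASER_EFFICIENCY = 0.30           # Table II, assumed to be wall-plug efficiency


def db_to_ratio(value_db):
    """Convert a dB difference to a linear power ratio."""
    return 10.0 ** (value_db / 10.0)


def dbm_to_mw(power_dbm):
    """Convert an absolute power in dBm to mW."""
    return 10.0 ** (power_dbm / 10.0)


def required_laser_power_dbm(budget_db,
                             sensitivity_dbm=RECEIVER_SENSITIVITY_DBM):
    """Laser output needed so the receiver just reaches its sensitivity."""
    return sensitivity_dbm + budget_db


def laser_electrical_power_mw(budget_db, efficiency=LASER_EFFICIENCY):
    """Electrical power drawn by the laser per wavelength, in mW."""
    return dbm_to_mw(required_laser_power_dbm(budget_db)) / efficiency


if __name__ == "__main__":
    for name, budget_db in (("MZI/SOA", 16.3), ("MRR", 26.8)):
        optical_dbm = required_laser_power_dbm(budget_db)
        print(f"{name}: {optical_dbm:+.1f} dBm optical, "
              f"{laser_electrical_power_mw(budget_db):.2f} mW electrical")
    print(f"Optical power ratio: {db_to_ratio(26.8 - 16.3):.1f}x")
```

Because the two budgets differ by 10.5 dB, the MRR case needs roughly 11 times more optical power per wavelength under these assumptions.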
B. Electronic Control and Transmission Power
Power models for the control plane circuits are obtained
by synthesizing the SystemVerilog models of the
transmitter controller, allocator and recirculation buffers
using a 45 nm standard cell ASIC flow and Synopsys
Design Vision. Activity data is captured from the
SystemVerilog simulations using ModelSim and power is
estimated using Synopsys PrimeTime.
The power consumption of transmitter and receiver front
ends is taken from measurements on a recently reported
transceiver [7]. Serialization and deserialization (SERDES)
power is found using the CONTEST open source transceiver
design toolkit [37]. SERDES and transmitter front ends are
assumed to be power gated; receiver front ends, control
plane circuits and optical power supplies are always on.
C. Power Dissipated on Server Chip
Figure 10a shows the power dissipated on the processor
chip at 30% network load for the MRR and MZI/SOA
switches and the VC and proposed buffered switch
architectures. The gain of the MZI/SOA switch
substantially reduces the optical power absorbed on the
server chip due to a reduction in the power budget from
26.8 dB with the MRR switch down to 16.3 dB. The simplified
send and forget interface used for the buffered switch also
significantly reduces the network adapter (transmitter
control) power due to reduced FIFO storage requirements
(reduced from 55.7 mW in the VC case to 0.9 mW at 30%
load for the ToR application). However, these figures are for
128 B packets. Greater packet lengths will increase FIFO
memory requirements and hence adapter power
consumption. For example, providing storage for four
1500 B Ethernet packets in the ToR case will increase the power
consumption of the send and forget adapter to 8.0 mW at
30% load. However, in this case, the VC adapter power will
increase to 304 mW. In the shared memory case, maximum
packet lengths are fixed by the cache block size. The
remaining power consumption in the buffered switch is
dominated by receivers and SERDES. Receivers could be
power gated at the expense of a latency penalty using a
reservation scheme [19]. SERDES is an inevitable
consequence of operating at the high bit rates of optical
links, but is energy proportional with a fixed energy per bit
[37]. Overall, at 30% network load, the buffered switch
architecture reduces the power dissipated on the processor
chip by 64% from 171.0 mW to 61.1 mW in the ToR
application and by 60% from 150.6 mW to 60.8 mW in the
shared memory application. These results are for the case 2
switch. There is an additional power dissipation of 1.7 mW,
constant over all load levels, using the case 1 switch due to
the loss of the output switch.
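The reductions quoted above follow directly from the before and after power figures. A minimal Python check, using only numbers stated in this section (the helper name and dictionary labels are illustrative):

```python
def percent_reduction(before_mw, after_mw):
    """Percentage reduction when power falls from before_mw to after_mw."""
    return 100.0 * (before_mw - after_mw) / before_mw


# (VC architecture, buffered send and forget architecture) in mW at 30% load
CASES = {
    "On-chip power, ToR": (171.0, 61.1),
    "On-chip power, shared memory": (150.6, 60.8),
    "Adapter power, ToR, 128 B packets": (55.7, 0.9),
    "Adapter power, ToR, four 1500 B packets": (304.0, 8.0),
}

if __name__ == "__main__":
    for label, (vc_mw, buffered_mw) in CASES.items():
        print(f"{label}: {percent_reduction(vc_mw, buffered_mw):.1f}% lower")
```

The first two entries round to the 64% and 60% figures given in the text; the adapter entries show that the saving in transmitter control power stays above 97% even with storage for four 1500 B packets.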
In all cases, the power dissipated on the server chip
scales linearly with load as shown in Fig. 10b. The gradient
of dissipated power against load is greater for the VC
architectures, due to the more complex adapter, particularly
for the larger packets of the ToR application.
D. Total Network Power
Figure 11a shows contributions to the total power of the
64-port switch networks at 30% load. For MRR switches,
the power is dominated by optical power due to high optical
losses. Assuming MZI/SOA switches are used, the buffered
switch architectures have increased power consumption
over the VC case as the power of the additional
transmitters, receivers and adapters at the switch and the
effect of the additional input/output switches outweigh that
of the larger VC transmitter adapter. As shown by Figure
11b, the low required transmitter powers combined with the
gain provided by the SOAs mean that the MZI/SOA switch
cases are also more energy proportional, with power
consumption of 4.8 to 6.6 W at low loads. The power of the
MZI/SOA switches approaches that of the MRR switch at
high loads as the SOA power dominates. MRR switches
would require transmitter based optical power gating, not
easy to apply without a latency penalty, to achieve the same
levels of energy proportionality. The increase in the
buffered case 1 switch compared with the VC switch is
2.4 W or 48%. This increases to 4.3 W (20%) at 60% load as the
switch buffers are used more often. The energy
proportionality of the SOA based switches also means that
there is only a small increase in total power for the more
complex case 1 buffered switch compared with the case 2
switch. The power difference between the two switch cases
is reduced at high loads as more packets use the buffers in
case 2.
V. DISCUSSION 
In section I, two potential applications of the buffered
switch architecture in future high performance servers were
described: optical top-of-rack replacement and multiple
socket shared memory networks.
For the top-of-rack switch application, optical switching
using wavelength striped (WDM) links provides high
bandwidth without pin or front panel limitations or the
requirement for power hungry electronic switching fabrics.
Store and forward 10G Ethernet switches can introduce
latencies from 100 ns up to 10 µs plus processing depending
on the packet length. High performance cut through routers
can start to forward the packet after receiving the first 54 B
(MAC addresses, Ethertype and IPv4 layer 3 and 4 headers)
taking on the order of 100 ns before starting to forward
packets of any length. By comparison, the optical ToR
bypasses the buffering and processing in electronic switches
but introduces an overhead due to optical switch allocation
and reconfiguration. The optical buffered switch proposed
in this paper mitigates this overhead using speculative
transmission. To accurately compare the optical buffered
switch with an electronic cut through switch independently
of packet length and distance, the 100 ns cut through
forwarding time should be compared with the sum of the
request synchronization, allocation and switching times,
which is 7 clock cycles or 5 ns (see Table I). The 100 Gb/s
bandwidth of the optical ToR also reduces serialization
latency compared with current 10 Gb/s Ethernet ToRs to
reduce incast issues without pin or front panel bandwidth
limits. It has to be noted, however, that applications
running in a data center environment have a wide range of
end-to-end latency requirements down to a few
microseconds and not all applications will benefit from the
reduced latency. From an energy point of view, the
Ethernet ports on current CMPs represent only a small
proportion of chip power consumption, so the reduction in
CMP dissipation for the proposed architecture is a relatively
minor advantage. Total power consumption comparisons
with electronic Ethernet switches are difficult; however, the
energy proportionality demonstrated by the MZI/SOA
switch is an important advantage over electronic
equivalents [1]. It has to be noted, however, that the energy
savings through reduced buffering in the send and forget
interface are near the lower bound as we consider relatively
small packets of 128 B.
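The latency figures in this comparison can be reproduced with simple arithmetic. The Python sketch below computes the serialization delay for a given packet size and line rate, and the clock period implied by 7 cycles taking 5 ns. The packet sizes (54 B of headers, 128 B and 1500 B packets) are taken from the text; treating serialization delay as packet bits divided by line rate, with no framing or encoding overhead, is a simplifying assumption, and the names are illustrative.

```python
CONTROL_CYCLES = 7          # request synchronization + allocation + switching
CONTROL_LATENCY_NS = 5.0    # quoted total for those 7 cycles
CUT_THROUGH_NS = 100.0      # quoted electronic cut through forwarding time


def serialization_delay_ns(packet_bytes, line_rate_gbps):
    """Time to clock a packet onto the link; bits / (Gb/s) is in ns."""
    return packet_bytes * 8 / line_rate_gbps


if __name__ == "__main__":
    period_ns = CONTROL_LATENCY_NS / CONTROL_CYCLES
    print(f"Implied allocator clock period: {period_ns:.2f} ns")
    print(f"Cut through / optical control overhead: "
          f"{CUT_THROUGH_NS / CONTROL_LATENCY_NS:.0f}x")
    for size_bytes in (54, 128, 1500):
        slow = serialization_delay_ns(size_bytes, 10)    # 10 Gb/s Ethernet
        fast = serialization_delay_ns(size_bytes, 100)   # 10 x 10 Gb/s WDM
        print(f"{size_bytes:5d} B: {slow:7.1f} ns at 10 Gb/s, "
              f"{fast:6.1f} ns at 100 Gb/s")
```

For a 1500 B packet the serialization delay drops from 1200 ns at 10 Gb/s to 120 ns at 100 Gb/s, which is why the wider link matters more than the switch overhead for long packets.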
In the shared memory coherence network case, while
energy proportionality is also an important advantage, the
power dissipated on the CMPs is critical in order to reduce
the large proportion of power consumed by off-chip
communications in such chips. The server chip described in
[3] has total coherence bandwidth of 460 Gb/s using
electronic SERDES consuming 11.1 mW/(Gb/s), giving a
power consumption of 5.1 W, significant compared with the
120 W total processor power envelope. By comparison, the
processor chip power dissipation of our architecture (at 30%
load) is 0.5 mW/(Gb/s), consuming only 0.23 W for the same
coherence bandwidth. Such comparisons are difficult: for
example, the electronic SERDES power includes other
physical layer functions such as clock recovery, coding and
equalization (some of which are not required in the optical
case) whereas our buffered switch power figures include
buffering not included in the electronic case. However, the
more than order of magnitude reduction suggests that the
proposed architecture can make significant reductions in
CMP dissipation. As discussed in section I, latency is a key
factor in shared memory networks. The proposed
architecture has the ability to connect each core over an
optical switch, avoiding the two stage network of current
multiple-socket systems. The results demonstrate that
cores on different chips can be connected with similar
latency to cores on the same chip. For example, for an
electronic 16 core mesh network-on-chip using single cycle
routers operating at 1 GHz clock frequency, the head
latency (ignoring message size) is between 3 and 13 ns
depending on the position of source and destination cores
[32]. Figure 7 shows that networks with <30 cm distance
between port and switch have a head latency of <10 ns.
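The coherence link power comparison earlier in this paragraph can be checked with a short Python sketch. It uses only the bandwidth and energy figures quoted in the text; the function and constant names are illustrative, and the calculation assumes that power scales linearly with bandwidth at the stated mW/(Gb/s) figures.

```python
COHERENCE_BANDWIDTH_GBPS = 460.0   # total coherence bandwidth of the chip in [3]
ELECTRONIC_MW_PER_GBPS = 11.1      # electronic SERDES
OPTICAL_MW_PER_GBPS = 0.5          # proposed architecture at 30% load
PROCESSOR_ENVELOPE_W = 120.0       # total processor power envelope


def link_power_w(bandwidth_gbps, mw_per_gbps):
    """Power in W for a given bandwidth and energy cost per Gb/s."""
    return bandwidth_gbps * mw_per_gbps / 1000.0


if __name__ == "__main__":
    electronic_w = link_power_w(COHERENCE_BANDWIDTH_GBPS, ELECTRONIC_MW_PER_GBPS)
    optical_w = link_power_w(COHERENCE_BANDWIDTH_GBPS, OPTICAL_MW_PER_GBPS)
    print(f"Electronic SERDES: {electronic_w:.2f} W "
          f"({100 * electronic_w / PROCESSOR_ENVELOPE_W:.1f}% of envelope)")
    print(f"Proposed architecture: {optical_w:.2f} W "
          f"({100 * optical_w / PROCESSOR_ENVELOPE_W:.2f}% of envelope)")
    print(f"Reduction factor: {electronic_w / optical_w:.1f}x")
```

The ratio of the two per-bit figures is about 22, consistent with the "more than order of magnitude" reduction stated above, subject to the caveats in the text about what each figure includes.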
In both applications, scalability in both port count and
bandwidth per port is important to support future increases
in compute capacity and density. Our ongoing research into
hybrid switch design aims to build very large port count
optical switches. We believe that integration of larger than
128 port count optical switches is feasible in the future. We
have demonstrated 10×10 Gb/s operation with the hybrid
MZI/SOA switch [31] and we are now aiming at
demonstrating higher bit rate operation. The large
operating wavelength range also allows operation with more
than 10 wavelengths.
Finally, we do not consider the latency or energy
implications of data synchronization in this work, which will
be an important issue in future chip-to-chip optically
switched interconnects. Source synchronous wavelength
striped optical links have been demonstrated operating at
up to 4 Gb/s [38] and, due to the fundamentally lower delay
variation in photonic compared with electronic links [39], are
a possible candidate for higher bit rates. Injection locking
clock recovery, either electronic [40, 41] or optical [42], is
another promising solution to the synchronization problem
and recovery times below 25 ns have been demonstrated in
both cases.
VI. CONCLUSIONS
We have proposed a low latency optical switch
architecture for data center top of rack and shared memory
coherence network applications and compared it with a high
performance optical VC switch and electronic alternatives.
The proposed architecture has the important property of
minimizing the power consumed and dissipated in future
server chips with integrated photonic transceivers, thus
mitigating the dark silicon effect. SOA based switching is
often thought to be a high power option. However, the
results shown in this paper demonstrate that it gives
greater energy proportionality and allows effective power
management. The speculative control plane with electronic
buffering at the switch both reduces latency and further
reduces the complexity and power consumption of the server
side circuits.
ACKNOWLEDGMENT 
This work was supported by the UK Engineering and
Physical Sciences Research Council (EPSRC) INTERNET
program grant and an EPSRC Fellowship grant to Philip
Watts. Both University College London and the University
of Cambridge are members of GreenTouch.
REFERENCES 
[1] L.A. Barroso, J. Clidaras, U. Hölzle, “The Datacenter as a
Computer”, 2nd edition, (Morgan Claypool 2013).
[2] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam,
D. Burger, “Dark Silicon and the End of Multicore Scaling”,
IEEE Micro 32(12), 2012
[3] J. L. Shin, D. Huang, B. Petrick, C. Hwang, K.W. Tam, A.
Smith, H. Pham, H. Li, T. Johnson, F. Schumacher, A.S. Leon,
A. Strong, “A 40 nm 16-core 128-thread SPARC SoC
processor”, IEEE J. Solid-State Circuits 46(1), pp. 131-144,
2011.
[4] A. Shacham and K. Bergman, “Building Ultralow Latency
Interconnection Networks Using Photonic Integration,” IEEE
Micro 27(4), 2007.
[5] B. Lee, A. Rylyakov, W. Green, S. Assefa, C. Baks,
R. Rimolo-Donadio, D. Kuchta, M. Khater, T. Barwicz, C. Reinholm,
E. Kiewra, S. Shank, C. Schow, and Y. Vlasov, “Monolithic silicon
integration of scaled photonic switch fabrics, CMOS logic, and
device driver circuits,” J. of Lightwave Technology, vol. 32, pp.
743-751, Feb 2014
[6] A. Biberman, G. Hendry, J. Chan, H. Wang, K.B Preston, N.
Sherwood-Droz, J.S. Levy, M. Lipson, “CMOS-Compatible
Scalable Photonic Switch Architecture Using 3D-Integrated
Deposited Silicon Materials for High-Performance Data
Center Networks”, Proceedings of Optical Fiber
Communications (OFC) Conference, Los Angeles, March 2011.
[7] X. Zheng, F. Liu, J. Lexau, D. Patil, G. Li, Y. Luo, H. Thacker,
I. Shubin, J. Yao, K. Raj, R. Ho, J.E. Cunningham, A.V.
Krishnamoorthy, “Ultra-efficient 10Gb/s hybrid integrated
silicon photonic transmitter and receiver”, Opt. Express 19(6),
5172-5186, 2011
[8] Y. Liu, J. M. Shainline, X. Zeng, and M. Popovic,
“Ultra-low-loss waveguide crossing arrays based on imaginary
coupling of multimode Bloch waves,” in Advanced Photonics 2013.
[9] A. Poon, X.S. Luo, F. Xu, H. Chen, “Cascaded
Microresonator-Based Matrix Switch for Silicon On-Chip Optical
Interconnection,” Proc. of the IEEE, vol. 97, no. 7, 2009.
[10] P.M. Watts, A.W. Moore, S.W. Moore, “Energy implications of
photonic networks with speculative transmission”, J. Optical
Comms and Networking 4(6), 2012
[11] M. Ortin Obon, L. Ramini, V. Viñals, D. Bertozzi, “Capturing
Sensitivity of Optical Network Quality Metrics to its Network
Interface Parameters”, Workshop on Exploiting Silicon
Photonics for energy-efficient heterogeneous parallel
architectures (SiPhotonics'14), Vienna, Jan 2014.
[12] C. Batten, A. Joshi, J. Orcutt, A. Khilo, B. Moss, C.W.
Holzwarth, M.A. Popovic, H.Q. Li, H.I. Smith, J.L. Hoyt, F.X.
Kartner, R.J. Ram, V. Stojanovic, K. Asanovic, “Building
Manycore Processor-to-DRAM Networks”, IEEE Micro, Vol.
29, pp. 8-21, 2009
[13] S. Beamer, C. Sun, Y-J. Kwon, A. Joshi, C. Batten, V.
Stojanovic, K. Asanovic, “Re-Architecting DRAM Memory
Systems with Monolithically Integrated Silicon Photonics”,
Proceedings of the International Symposium on Computer
Architecture (ISCA), June 2010.
[14] J. L. Hennessy and D. A. Patterson, Computer Architecture, A
Quantitative Approach. Morgan Kaufmann, 4th ed., 2007.
[15] O. Liboiron-Ladouceur, B.A. Small, K. Bergman, “Physical
layer scalability of WDM optical packet interconnection
networks”, Journal of Lightwave Technology, Vol. 24,
pp. 262-270, 2006
[16] I. White, A.E. Tin, K. Williams, H.B. Wang, A. Wonfor, R.
Penty, “Scalable optical switches for computing applications”,
Journal of Optical Networking 8(2), pp. 215-224, 2009.
[17] S. Liu, Q. Cheng, A. Wonfor, R. Penty, I. White, P.M. Watts,
“A Low Latency Optical Top of Rack Switch for Data Centre
Networks with Minimized Processor Energy Load”,
Proceedings of Optical Fiber Communications (OFC), San
Francisco, March 2014.
[18] R. Luijten, C. Minkenberg, R. Hemenway, M. Sauer, R.
Grzybowski, “Viable opto-electronic HPC interconnect fabrics”,
Proceedings of the ACM/IEEE Supercomputing Conference
2005
[19] Y. Pan, P. Kumar, J. Kim, G. Memik, Y. Zhang, A. Choudhary,
“Firefly: Illuminating future network-on-chip with
nanophotonics,” in Int. Symp. on Comput. Archit., 2009.
[20] D. Vantrease, R. Schreiber, M. Monchiero, M. McLaren, N.P.
Jouppi, M. Fiorentino, A. Davis, N. Binkert, R.G. Beausoleil,
J.H. Ahn, “Corona: System implications of emerging
nanophotonic technology,” in Int. Symp. on Computer
Architecture (ISCA), 2008.
[21] A. Krishnamoorthy, R. Ho, X.Z. Zheng, H. Schwetman, J.
Lexau, P. Koka, G.L. Li, I. Shubin, J.E. Cunningham,
“Computer systems based on silicon photonic interconnects,”
Proc. of the IEEE, vol. 97, no. 7, 2009.
[22] S. Beamer, K. Asanovic, C. Batten, A. Joshi, and V.
Stojanovic, “Designing multi-socket systems using silicon
photonics,” in Proceedings of the 23rd International
Conference on Supercomputing (ICS), 2009.
[23] X. Ye, Y. Yin, S. Yoo, P. Mejia, R. Proietti, V. Akella, “DOS: A
scalable optical switch for datacenters,” in Proc. Symp. Arch.
for Networking and Comms. Systems (ANCS), 2010.
[24] N. Farrington, G. Porter, S. Radhakrishnan, H. H. Bazzaz, V.
Subramanya, Y. Fainman, G. Papen, and A. Vahdat, “Helios: a
hybrid electrical/optical switch architecture for modular data
centers,” ACM SIGCOMM Computer Communication Review,
vol. 41, no. 4, pp. 339-350, 2011.
[25] A. Singla, A. Singh, K. Ramachandran, L. Xu, and Y. Zhang,
“Proteus: a topology malleable data center network,” in
Proceedings of the 9th ACM SIGCOMM Workshop on Hot
Topics in Networks, Oct. 2010.
[26] L. Liu, Z. Zhang, Y. Yang, “Packet Scheduling in a
low-latency optical interconnect with electronic buffers”, J. of
Lightwave Technology, Vol. 30, No. 12, pp. 1869-1881, June
2012
[27] N. McKeown, “The iSLIP scheduling algorithm for
input-queued switches,” IEEE/ACM Trans. Netw., vol. 7, no. 2,
pp. 188-201, 1999
[28] I. Iliadis and C. Minkenberg, “Performance of a speculative
transmission scheme …” IEEE Trans. Networking 16(1), 2008
[29] Q. Cheng, A. Wonfor, R.V. Penty, I.H. White, “Scalable, low
energy hybrid photonic space switch”, Journal of Lightwave
Tech. 31 (18), pp. 3077-3084, 2013
[30] Q. Cheng, A. Wonfor, J.L. Wei, R.V. Penty, I.H. White,
“Demonstration of the feasibility of large port count optical
switching using a hybrid MZI-SOA switch module in a
recirculating loop”, Optics Letters 39 (18), pp. 5244-5247, Sept
2014.
[31] Q. Cheng, A. Wonfor, J.L. Wei, R.V. Penty, and I.H. White,
“Monolithic MZI-SOA Hybrid Switch for Low-power and
Low-penalty Operation”, Optics Letters, Vol. 39, Issue 6,
pp. 1449-1452, 2014.
[32] W.J. Dally and B. Towles, “Principles and Practices of
interconnection networks”, Morgan Kaufmann, 2004
[33] P. Koka, M. O. McCracken, H. Schwetman, X. Zheng, R. Ho,
and A. V. Krishnamoorthy, “Silicon-photonic network
architectures for scalable, power-efficient multi-chip systems,”
SIGARCH Comput. Archit. News, vol. 38, pp. 117-128, June
2010.
[34] Y. Liu, J. M. Shainline, X. Zeng, and M. Popovic,
“Ultra-low-loss waveguide crossing arrays based on imaginary
coupling of multimode Bloch waves”, Optics Letters, Vol. 39,
No. 2, pp. 335-338, 2014
[35] D. Livshits, A. Gubenko, S. Mikhrin, V. Mikhrin, C.H. Chen,
M. Fiorentino, R. Beausoleil, “High efficiency diode comb laser
for DWDM optical interconnects”, IEEE Optical Interconnects
Conference, May 2014.
[36] V.R. Almeida, R.R. Panepucci, M. Lipson, “Nanotaper for
compact mode conversion”, Optics Letters, vol. 28,
pp. 1302-1304, 2003
[37] Y. Audzevich, P.M. Watts, A. West, A. Mujumdar, S.W. Moore,
A.W. Moore, “Power Optimized Transceivers for Future
Switched Networks,” IEEE Trans. on VLSI, Vol. 22, No. 10,
pp. 2081-2092, 2013.
[38] C.E Gray, O. Liboiron-Ladouceur, D.C. Keezer, K. Bergman,
“Test electronics for a multi-Gb/s optical packet switching
network,” in Electronics Packaging Technology Conference
(EPTC), December 2006.
[39] G. Q. Chen, H. Chen, M. Haurylau, N.A. Nelson, D.H.
Albonesi, P.M. Fauchet, E.G. Friedman, “Predictions of CMOS
compatible on-chip optical interconnect,” in Integration, the
VLSI journal, vol. 40, 2007.
[40] B. Li, L.S. Tamil, D. Wolfe, J. Plessa, “10 Gb/s burst-mode
optical receiver based on active phase injection and dynamic
threshold level setting”, IEEE Communications Letters, Vol.
10, No. 10, pp. 722-724, Oct 2006.
[41] J. Lee, M. Liu, “A 20 Gb/s Burst-Mode CDR Circuit Using
Injection-Locking Technique”, Proc. IEEE Int. Solid-State
Circuits Conf. (ISSCC), 2007.
[42] L. Jun, J. Parra-Cetina, P. Landais, H.J.S Dorren, N.
Calabretta, “Performance Assessment of 40 Gb/s Burst Optical
Clock Recovery Based on Quantum Dash Laser”, IEEE


Fig. 1. Shared memory coherence networks for multiple socket
servers (a) Due to the fundamental difference between electronic
communications for on-chip (wide buses of small wires) and
off-chip (serial transceivers driving transmission lines), separate
networks are currently used for on-chip and chip-to-chip
coherence. (b) Optical switching could provide a single network





Fig. 2. Integrated optical transceivers packaged with the chip
multiprocessor and an optical top-of-rack switch connecting to
spine Ethernet switches can provide 2 hop connections between
any two processors in a data center. The optical top-of-rack
switch replaces the conventional Ethernet switch used for this
purpose with lower power consumption and latency.






Fig. 3. Control plane architectures (a) baseline input queued VC
switch (b) proposed send and forget interface with electronic
buffers at the switch. All data transmission is wavelength striped
consisting of 10 wavelengths at 10 Gb/s each. The network
adapter is the server side interface. The switch and allocator are
located in the top-of-rack switch. OE = optical to electronic





Fig. 4. Latency comparison of (a) VC switch (b) successful allocation in a speculative or buffered switch and (c) failed allocation in the










Fig. 5. Hybrid MZI/SOA switch architectures (a) An NxN switch
consists of a matrix of 4-port switch blocks as shown in the
callout. The input and output stages of the NxN switch use only 2
ports due to dilation. The two options for connecting the buffer
ports in the buffered switch architectures are shown in (b) case 1





TABLE I
CONTROL PLANE LATENCY PARAMETERS
Parameter                                    Value
Synchronization of requests/grants           1 cycle
VC allocator pipelining                      2 cycles
VC allocator clock period
Buffered xbar switch allocator pipelining
Buffered MZI/SOA allocator pipelining
Buffered MZI/SOA allocator clock period

Fig. 6. Allocators for optical switches used in this work. (a) A
separable VC allocator for a crossbar optical switch. (b) An
allocator for an optical crossbar using the buffered switch control
plane (c) An allocator for the hybrid MZI/SOA optical switch using


Fig. 7. Head latency without contention taking into account
allocator clock period differences and network dimensions.
Typical scales for the shared memory coherence and top-of-rack




TABLE II
ENERGY MODELING ASSUMPTIONS
Parameter                                                Value
Bit rate per wavelength                                  10 Gb/s
No. of wavelengths per port                              10
Loss of 4-port MZI/SOA switch block                      1.2 dB [30]
SOA drive current at 1 V bias voltage                    20 mA [30]
Ring resonator modulator loss                            4 dB [33]
Silicon waveguide loss                                   1.3 dB/cm
Off-chip waveguide loss                                  negligible
Waveguide crossing loss                                  0.04 dB [34]
Ring resonator through loss                              0.33 dB [9]
Ring resonator drop loss                                 1.6 dB [9]
Power consumption of ring resonator per circumference    1.3 W/m [9]
Receiver sensitivity                                     -18 dBm [7]
Receiver front-end power                                 2.6 mW [7]
Transmitter front-end power                              0.66 mW [7]
Laser efficiency                                         30% [35]
Loss at silicon/fibre interface
Packet size for ToR application
Packet size for shared memory application

Fig. 8. Latency vs load for (a) uniform random traffic (b)
streaming traffic between two ports with random traffic on other
ports (c) incast traffic.
 
 
Fig. 9. Power budgets for links using a 64-port micro-ring
resonator (MRR) crossbar and a 64-port MZI/SOA switch.







Fig. 10. Power dissipated on the server chip (a) sources of power
dissipation at 30% load (b) power dissipation versus load. The









Fig. 11. Total network power (a) showing breakdown by
component at 30% load and (b) power against network load. The
switch buffers include receivers, electronic FIFOs, modulators and
SERDES. Server transmitters include modulators and SERDES.
The adapter contains all the server based FIFOs and transmission
control.
