A Winograd-based Integrated Photonics Accelerator for Convolutional
  Neural Networks by Mehrabian, Armin et al.
JOURNAL OF SELECTED TOPICS IN QUANTUM ELECTRONICS 1
A Winograd-based Integrated Photonics Accelerator
for Convolutional Neural Networks
Armin Mehrabian, Member, IEEE, Mario Miscuglio, Member, OSA, Yousra Alkabani, Member, IEEE, Volker J.
Sorger, Senior Member, IEEE and Tarek El-Ghazawi, Fellow, IEEE
Abstract—Neural Networks (NNs) have become the main-
stream technology in the artificial intelligence (AI) renaissance
over the past decade. Among different types of neural networks,
convolutional neural networks (CNNs) have been widely adopted
as they have achieved leading results in many fields such as
computer vision and speech recognition. This success in part
is due to the widespread availability of capable underlying
hardware platforms. Applications have always been a driving
factor for the design of such hardware architectures. Hardware
specialization can expose novel architectural solutions,
which can outperform general-purpose computers for the tasks
at hand. Although different applications demand different
performance measures, they all share speed and energy efficiency
as high priorities. Meanwhile, photonic processing has seen a
resurgence due to its inherent high speed and low power nature.
Here, we investigate the potential of using photonics in CNNs by
proposing a CNN accelerator design based on the Winograd filtering
algorithm. Our evaluation results show that while a photonic
accelerator can compete with current state-of-the-art electronic
platforms in terms of both speed and power, it has the potential to
improve the energy efficiency by up to three orders of magnitude.
Index Terms—Convolutional Neural Networks, Photonics,
Winograd
I. INTRODUCTION
The field of AI has undergone revolutionary progress over the past decade. The wide availability of data and cheaper-than-ever compute resources have contributed immensely to this growth. In parallel, advancements in the field of modern neural networks, known as deep learning (DL), have attracted the attention of academia and industry owing to their success across a variety of AI applications, including but not limited to computer vision, speech recognition, and natural language processing. Among the different types of neural networks, CNNs are considered a viable architecture for many AI applications. This is in part due to their remarkable versatility, which makes them applicable to most AI tasks. However, all of this comes at the price of high computational cost.
The use of integrated photonics to implement neuron functionality in neural networks has proved to be a valuable approach for limiting power consumption and increasing operating speed [1]. Photonics benefits from the coherent nature of electromagnetic waves, which interfere while propagating through the photonic integrated circuit (PIC), mimicking the multiply-and-accumulate (MAC) function, which is central to many AI techniques and algorithms. The main advantage of photonic neural networks over electronics is that the energy consumed in performing a series of multiplications and additions does not scale with MAC speed. However, training an optical neural network necessitates active modulation of the optical signal in a hybrid optical-electronic configuration [2]. These architectures face significant hurdles when compared to their electronic counterparts. To be competitive, they are expected to have low-power, high-speed electro-optic modulators [3], [4], [5] in addition to converters and an I/O interface. However, once trained, photonic neural networks do not rely on any additional energy for active switching. Therefore, the architectures that perform the weighting could be realized as completely passive, and the computations happen without consuming any dynamic power [6], [7].
Manuscript received April 1, 2019
In this landscape, all-optical neural networks (AONNs) represent a promising direction. Current all-optical implementations in free space [8] and in integrated photonics [9], [10], [11] can outperform their electronic counterparts, promising great energy efficiency and speed enhancement for learning tasks.
In this manuscript, we explore the potential of using high-speed, low-power photonics in a CNN accelerator, exploiting coherent all-optical matrix multiplication in wavelength division multiplexing (WDM) using microring resonator (MRR) weight banks. Our architecture is inspired by [12], [13], where the Winograd filtering algorithm is adopted to perform convolution in order to shorten the execution time and reduce the computational complexity. We investigate the performance of the architecture in terms of speed and power. We also investigate the robustness of the network and its tolerance to noise.
We summarize the main contributions of this work as follows:
• the first proposed photonic CNN architecture based on the Winograd filtering algorithm
• an analytical framework to evaluate the speed performance of our proposed accelerator
• an in-house simulator, based on a modified version of the Google TensorFlow tool, to simulate the performance of our proposed photonic accelerator with power and noise awareness
• a modified training process to become more robust to inevitable hardware noise sources during the inference stage
© 2019 IEEE. arXiv:1906.10487v1 [cs.ET] 25 Jun 2019
II. CONVOLUTIONAL NEURAL NETWORKS (CNNS)
A CNN is a neural network comprising one or more convolutional layers. CNNs are mostly known for their strong performance on image data; however, their applications extend to many other data types with local features. At a very high level, each convolutional layer is comprised of many feature detectors, known as filters, that scan an input for the presence or absence of a particular set of features. Hence, in a CNN the layer inputs and outputs are referred to as feature maps (fmaps). By cascading several of these convolutional layers, a hierarchy of feature detectors is formed. In this hierarchy, feature detectors closer to the input detect primitive features. As we move towards the final layers, the features detected become more abstract. Conventionally, each filter in a CNN is 3D, with the first two dimensions being the height and width of the filter and the last dimension, known as the channel dimension, representing the various filters. Using convolutional filters to scan input data had been practiced well before the rise of the field of deep learning and CNNs. However, in traditional signal processing, such filters are hand-engineered by experts in the corresponding fields, which is costly, generally targeted at specific purposes, and vulnerable to designer bias. In a modern CNN, these filters are learned through the training process. Figure 1 shows the overall architecture of a CNN layer.
Fig. 1: A single layer of a CNN. Each of the N filters (left)
scans the input feature maps (middle) for features. This results
in output feature maps with N channels, equal to the total
number of filters.
III. PHOTONIC REALIZATION OF CNNS
In data communication and computation, photonics has the potential to offer practical solutions to overcome some of the limitations currently facing electronic systems. In a neuromorphic system, processing elements (PEs) are arranged in a distributed fashion with, ideally, a large number of incoming (fan-in) and outgoing (fan-out) connections. As in the biological neural systems that inspire them, some of these connections are required to link neurons in the farthest parts of the brain. In addition, neuromorphic PEs are mostly special-purpose processors, in contrast to general-purpose processors such as CPUs.
Neuromorphic processing can benefit from photonics in three major ways. First, photonics can significantly reduce the amount of energy consumed in the interconnects among PEs by avoiding the energy dissipated in charging and discharging electrical wires. Second, current neuromorphic algorithms, namely neural networks and in particular CNNs, rely heavily on the multiply-and-accumulate (MAC) operation, which can be realized with a very low energy budget in photonics compared to its electronic counterpart. Finally, photonics can increase communication and computation bandwidth by exploiting WDM. WDM allows for higher-density computation and communication between PEs by packing more channels and parallel computations into a neuromorphic processor.
A. Photonic Convolution Kernels and MAC Operation
One major advantage of a photonic MAC operator is that it can be performed with almost zero energy consumption [14]. However, if the signal is converted from optical to electrical, the conversion and the subsequent electronic manipulations impose an energy loss. To build a photonic convolutional filter, we use the microring resonator (MRR) network proposed in [15]. Figure 2 depicts a single MRR neuron. Input WDM signals are weighted through tunable MRRs. The weighted inputs are then incoherently summed using a photodetector. Thus, by use of N wavelengths, it is possible to establish up to N^2 independent connections. The maximum N with current technologies is estimated to be around 108 channels, resulting in a total of about 10k connections [16]. It should be noted that, in modern neural networks, fully-connected layers have all-to-all connections, thus requiring N^2 synaptic connections between two N-neuron layers. However, 10k connections are nearly sufficient even to implement a fully-connected neural network on simple benchmark datasets such as MNIST, with 784 neurons at the input. In contrast, CNNs benefit from sparse connections between local input regions and filters. A common CNN architecture usually has filters of shape 3 × 3 up to 11 × 11, thus requiring 9 to 121 connections between the input's receptive field and the filter. In general, smaller filters are favored over larger filters, as they are capable of detecting finer local patterns. Larger and more global patterns are usually detected at later layers of CNNs because they are more abstract and are built on top of previously found features.
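The connection budget discussed above can be checked with a few lines of arithmetic. The following is a minimal sketch; the function names are hypothetical and chosen only for illustration.

```python
def max_connections(num_wavelengths: int) -> int:
    """Upper bound on independent connections with N WDM channels (N^2)."""
    return num_wavelengths ** 2


def filter_connections(filter_size: int) -> int:
    """Connections between one receptive field and one square filter (r^2)."""
    return filter_size ** 2


if __name__ == "__main__":
    # 108 channels is the estimate quoted from [16].
    print("connections available:", max_connections(108))
    for r in (3, 5, 7, 11):
        print(f"{r} x {r} filter needs {filter_connections(r)} connections")
```

With 108 channels the bound evaluates to 11,664, which is the "about 10k" figure quoted in the text, while even an 11 × 11 filter needs only 121 connections.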
We use the scheme proposed in Figure 2 to implement two heuristic Winograd transformations and one element-wise matrix multiplication (EWMM) unit on each wavelength.
Figure 3 shows the details of an MRR weighting function operating on a single wavelength λ0. The MRR behaves as a tunable analog filter centred at λ0, in which the voltage applied to the EOM module lets only a portion of the light travel through the waveguide. The modulation can be triggered by an analog electric field fed by a memristor, which stores the weights with a 6-bit resolution. In detail, when a bias voltage V0 is applied, the transmission spectrum (T) of the ring has a resonance at wavelength λ0, and when WDM
Fig. 2: A broadcast-and-weight neuron. Inputs Xi modulate
lasers of different wavelengths. The modulated beams are then
bundled through WDM.
Fig. 3: Microring resonator (MRR) operation for performing
point-wise multiplication. (Figure labels: In, Through, and Drop ports; bias voltages V0 and V1; transmission T versus wavelength λ, with resonances at λ0 and λ1.)
light passes through the coupled waveguide, the component with wavelength λ0 is coupled into the ring. By raising the bias voltage to V1, the resonance shifts to λ1 due to the change in the effective refractive index of the ring. The difference between V0 and V1 controls the difference between λ0 and λ1, i.e. the transmission (∆i). The variation of the transmission at λ0 represents, in our scheme, the pointwise multiplication. The most widely used MRR modulator has a silicon-based p-i-n junction that is side-coupled to a waveguide, as described in [17], or a p-n junction, as reported in [18]. Current silicon-based MRR modulators [19][20][21], as well as foundry-level implementations, exhibit speeds of up to 50 GHz, with a driving voltage of typically a few volts (1-2 V) and a modulation efficiency (VπL) of a few tenths of V·cm. Experimental results that corroborate our estimation are reported in [22], where silicon-based electro-optic MRRs exhibit modulation over a working spectrum of 0.1 nm, a speed of 11 GHz, and insertion losses as low as 2 dB. This is by no means a limiting factor in the inference stage, considering that the network has been trained and the weights are set. Therefore, the latency of the network is given by the time-of-flight of the photon.
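The weighting mechanism described above can be illustrated with a toy model. The sketch below assumes an ideal Lorentzian dip in the through-port transmission, a linewidth (FWHM) of 0.1 nm, and full extinction on resonance; these are illustrative assumptions, not parameters of the devices in [17]-[22], and the function names are hypothetical.

```python
def through_transmission(detuning_nm: float, fwhm_nm: float = 0.1,
                         depth: float = 1.0) -> float:
    """Toy Lorentzian through-port transmission of a ring.

    detuning_nm: distance between the signal wavelength and the ring resonance.
    fwhm_nm:     assumed full width at half maximum of the resonance dip.
    depth:       assumed extinction on resonance (1.0 means full extinction).
    """
    x = detuning_nm / (fwhm_nm / 2.0)
    return 1.0 - depth / (1.0 + x * x)


def weighted_power(input_power: float, resonance_shift_nm: float) -> float:
    """Power left at lambda_0 after the bias shifts the resonance away from it."""
    return input_power * through_transmission(resonance_shift_nm)


if __name__ == "__main__":
    for shift in (0.0, 0.025, 0.05, 0.1, 0.2):
        print(f"shift {shift:5.3f} nm -> weight {weighted_power(1.0, shift):.3f}")
```

In this model a zero shift blocks the signal at λ0 entirely, a shift of half the linewidth passes half of the power, and larger shifts approach full transmission, so the bias voltage acts as a continuous multiplicative weight between 0 and 1.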
Besides the uncertainty due to fabrication imperfections, which could be compensated, the main source of noise that affects an MRR modulator is electrical noise and, in this case, possible non-ideality in setting the analog voltages with memristive devices, which could vary over time. Moreover, for high data rates (> 20 Gb/s), the intra-channel cross-talk becomes relevant, and power penalties need to be considered [23][24]. Regarding the operating dynamic power, the optical power flowing in each physical channel of the photonic accelerator is bounded from above by the power that would produce non-linearities in the silicon waveguides and from below by the minimum power that the photodetector can distinguish from noise (SNR = 1). Foundry-level [25] integrated germanium photodiodes can reach up to 40 GHz with a responsivity of 0.6 A/W and a noise equivalent power (NEP) of around 1 pW/√Hz when operating in reverse bias (−2 V). Research-level photodetectors working in the 100s of GHz range have also been demonstrated [26], [27], [28]. However, the dynamic range of the photodiode needs to be accurately set to avoid saturation and to account for the bit resolution [9]. For this scheme, according to the bit resolution, the estimated dynamic range is 20 dB. The speed of the optical part of the accelerator, without considering the I/O interface, is given, according to [29], by the total number of MRRs and their pitch. Photodetection and phase cross-talk are expected to be the main sources of error in the proposed scheme.
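The lower power bound mentioned above follows from the standard relation between NEP and detection bandwidth, P_min = NEP × √B at SNR = 1. The sketch below applies it to the quoted photodiode figures; taking the full 40 GHz as the noise bandwidth is an assumption made here for illustration, and the function names are hypothetical.

```python
import math


def noise_floor_watts(nep_w_per_sqrt_hz: float, bandwidth_hz: float) -> float:
    """Minimum detectable optical power (SNR = 1) for a given NEP and bandwidth."""
    return nep_w_per_sqrt_hz * math.sqrt(bandwidth_hz)


def dynamic_range_db(max_power_w: float, min_power_w: float) -> float:
    """Optical dynamic range in dB between two power levels."""
    return 10.0 * math.log10(max_power_w / min_power_w)


if __name__ == "__main__":
    floor = noise_floor_watts(1e-12, 40e9)  # 1 pW/sqrt(Hz), 40 GHz (assumed bandwidth)
    ceiling = floor * 100.0                 # 20 dB above the floor
    print(f"noise floor: {floor * 1e6:.2f} uW")
    print(f"top of a 20 dB range: {ceiling * 1e6:.1f} uW")
    print(f"check: {dynamic_range_db(ceiling, floor):.1f} dB")
```

Under these assumptions the noise floor is about 0.2 µW per channel, so a 20 dB dynamic range would place the strongest signal at roughly 20 µW.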
B. Memristors as Analogue Weight Storage
Neuromorphic systems inspired by the human brain rely upon two major principles, namely massively distributed processing and the proximity of local memory to the processing elements. While these memory units demand some level of programmability (plasticity), the required programming speed is relatively low, only in the kHz regime. At this time, almost all state-of-the-art neural networks perform the training and inference phases separately. This means that once the weights are trained and set, they do not need to change during the inference phase. In addition, weights in our proposed system are represented by the analog bias voltages of the MRRs. Thus, a suitable weight storage would be analog and non-volatile with a long retention time.
Memristive memory devices have attracted the attention of researchers due to their interesting characteristics, including but not limited to non-volatility, long state retention time, and ultra-low power consumption [30][31][32]. Over the past few years, the bit resolution of such memristive memory devices has risen monotonically [33][34][35][36]. Recently, the authors of [32] proposed, fabricated, and evaluated an analog multibit memristive memory with a bit resolution of up to 6.5 bits. Each memristive device takes up 20 µm × 20 µm in area and can retain its resistance state for up to 8 hours. In AlexNet, the 3rd convolutional layer has the largest number of convolutional filter weights, equal to 884,736. Assuming overhead circuitry increases the footprint to 50 µm × 50 µm, the memristive memory required for the largest layer of AlexNet can be realized in less than 0.25 cm^2.
IV. FAST ALGORITHMS FOR CONVOLUTION OPERATION
We already know that convolution operations comprise the bulk of all operations in a CNN for both the training and inference stages. However, each stage imposes a different type of performance requirement. During training, the emphasis is more on throughput than on latency. This is mainly due to the fact that the model being trained needs to observe
TABLE I: Kernel size breakdown in state-of-the-art CNNs. It
can be seen that filters of size 5 × 5 comprise only a minute
fraction of total filters.
CNN            1 × 1 (%)   3 × 3 (%)   Small 1D filters (%)   5 × 5 (%)
GoogLeNet      64.9        17.5        1.7                    15.9
Inception V3   43.2        17.9        35.7                   3.2
Inception V4   40.9        16.1        43                     0
MobileNet      93.3        6.7         0                      0
ResNet50       68.5        29.6        1.9                    0
VGG16          0           100         0                      0
a large "ensemble" of data, known as the batch, as fast as possible. As a result, time is generally amortized over many inputs. On the other hand, during the inference stage, applications are mostly latency sensitive. For instance, in a self-driving car application only a few input image scenes may need to be processed per second, but each at a very low latency. Consequently, a neuromorphic processor designed for inference is expected to satisfy stringent timing requirements.
An important parameter that has been shown to have a significant impact on the latency of CNNs is the size of their filters. It is generally known from a functional point of view that CNNs with smaller filters are preferred over CNNs with large filters [37][38][39]. This is mainly due to the fact that small filters are better at finding local features without sacrificing resolution. More abstract and more global features can be detected in higher layers of a CNN, built on the local features of previous layers. Table I shows the breakdown of filter sizes for some state-of-the-art CNN architectures. Note that, as we discussed in section III, a physical implementation of photonic MRRs favors small filters due to the limited number of available wavelength bands. This synergy between the functional and photonic realizations of CNNs is the primary motivation behind this work.
At the time of writing this paper, there are three major ways to speed up the convolution operation. The first is the General Matrix Multiplication (GEMM) approach, in which the convolution is converted to a matrix multiplication using a Toeplitz matrix. The downside of this method is that the Toeplitz conversion expands the input by a factor of r × r, where r is the size of the filter. The second method uses the Fast Fourier Transform (FFT) to perform tiled convolution operations. From the convolution theorem we know that cyclic convolution can be performed by transforming the input and filters into the Fourier domain. An element-wise multiplication (also known as Hadamard multiplication) then yields the equivalent of the convolution, but in the Fourier domain. An inverse FFT transforms the calculated convolution back into the original domain. FFT-based convolution had been the method of choice for convolution [40][41][42] until the recent past. Lately, it has been shown that FFT-based convolution is better suited to larger filter sizes [12]. The third method uses the Winograd filtering algorithm, which we explain in detail in the following section.
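To make the GEMM approach and its input expansion concrete, the following is a minimal sketch of the patch-unrolling (Toeplitz or "im2col") conversion for a single channel with stride 1 and no padding. The helper names are hypothetical, and plain Python lists are used instead of an optimized BLAS call.

```python
def im2col(image, r):
    """Unroll every r x r patch of a 2D list into one row (stride 1, no padding)."""
    h, w = len(image), len(image[0])
    return [[image[p + x][q + y] for x in range(r) for y in range(r)]
            for p in range(h - r + 1) for q in range(w - r + 1)]


def conv_gemm(image, kernel):
    """Convolution (CNN-style, no kernel flip) as a matrix-vector product."""
    r = len(kernel)
    flat_kernel = [kernel[x][y] for x in range(r) for y in range(r)]
    out_w = len(image[0]) - r + 1
    flat = [sum(a * b for a, b in zip(row, flat_kernel)) for row in im2col(image, r)]
    return [flat[i:i + out_w] for i in range(0, len(flat), out_w)]


if __name__ == "__main__":
    image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    kernel = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
    patches = im2col(image, 3)
    print("input elements:", 16, "unrolled elements:", len(patches) * len(patches[0]))
    print(conv_gemm(image, kernel))
```

Each output position stores its own copy of an r × r patch, so the unrolled matrix holds r × r values per output; for feature maps much larger than the filter this approaches the r × r expansion mentioned above.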
A. Winograd Algorithm
In a 2D convolution, a single output component of the convolution is calculated by
$$y_{n,k,p,q} = \sum_{c'=1}^{c} \sum_{x=1}^{r} \sum_{y=1}^{r} x_{n,c',p+x,q+y} \times w_{k,c',x,y} \tag{1}$$
where c' runs over the c input channels. The operation in equation 1 is repeated for all output components of the convolution. In a brute-force convolution, the total number of multiplications required to perform a full convolution is equal to
$$(m \times r)^2 \tag{2}$$
where m is the size of an output feature map channel and r is the size of the filter. At the time of writing this paper, Winograd convolution is the most efficient convolution algorithm being used for CNNs [12]. Winograd convolution is based on minimal filtering principles. The algorithm states that in order to calculate m outputs with a finite impulse response (FIR) filter of size r, denoted by F(m, r), the number of required multiplications is
$$n = m + r - 1 \tag{3}$$
While equation 3 is derived for the 1D convolution operation, one can nest it with itself to obtain a 2D convolution. Therefore, the number of multiplications needed for the same 2D convolution is given by
$$(m + r - 1)^2 \tag{4}$$
From equations 2 and 4 we can infer that Winograd reduces the complexity by a factor of
$$\frac{(m r)^2}{(m + r - 1)^2} \tag{5}$$
It should be noted that in our proposed photonic accelerator,
multiplication operations are carried out by MRRs. Any
reduction in the total number of multiplication operations,
and thus MRRs, can save us not only in footprint of the
design, but also the design complexity.
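As a quick numerical illustration of equations 2, 4, and 5, the sketch below evaluates the multiplication counts for two common tile sizes. The function names are hypothetical.

```python
def direct_mults(m: int, r: int) -> int:
    """Multiplications for a brute-force m x m output tile with an r x r filter (eq. 2)."""
    return (m * r) ** 2


def winograd_mults(m: int, r: int) -> int:
    """Multiplications for the same tile with 2D Winograd (eq. 4)."""
    return (m + r - 1) ** 2


def reduction_factor(m: int, r: int) -> float:
    """Ratio of the two counts (eq. 5)."""
    return direct_mults(m, r) / winograd_mults(m, r)


if __name__ == "__main__":
    for m, r in ((2, 3), (4, 3)):
        print(f"F({m}x{m}, {r}x{r}): {direct_mults(m, r)} vs "
              f"{winograd_mults(m, r)} multiplications, "
              f"reduction {reduction_factor(m, r):.2f}x")
```

For F(2 × 2, 3 × 3) the count drops from 36 to 16 multiplications (2.25×), and for F(4 × 4, 3 × 3) from 144 to 36 (4×); in the proposed accelerator this count corresponds to the number of MRR multiplications per tile.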
Now, in order to understand how minimal Winograd filtering works, let us first consider the case of 1D convolution. Let W be the matrix of weights and D be the data matrix. Winograd computes the F(2, 3) convolution as follows:
$$\begin{bmatrix} d_0 & d_1 & d_2 \\ d_1 & d_2 & d_3 \end{bmatrix} \begin{bmatrix} w_0 \\ w_1 \\ w_2 \end{bmatrix} = \begin{bmatrix} m_1 + m_2 + m_3 \\ m_2 - m_3 - m_4 \end{bmatrix} \tag{6}$$
where the values m_i are intermediate values found by
$$m_1 = (d_0 - d_2) \times w_0$$
$$m_2 = (d_1 + d_2) \times \frac{w_0 + w_1 + w_2}{2}$$
$$m_3 = (d_2 - d_1) \times \frac{w_0 - w_1 + w_2}{2}$$
$$m_4 = (d_1 - d_3) \times w_2$$
The above equations show that with only 4 multiplications between inputs and weights, Winograd can compute an F(2, 3)
Fig. 4: High-level flow diagram of the Winograd filtering technique for the convolution operation. Unlike conventional convolution, which computes a single output at a time, the Winograd algorithm computes a tile of outputs, here of size m × m. In order to generate an output tile, Winograd needs to fetch an input tile of size n × n. Both the input tile and the filter are transformed into the Winograd domain. Within the Winograd domain, the transformed input and filter are multiplied in an element-wise fashion. Finally, the output of the element-wise multiplication is transformed back into the original domain and the channels are collapsed into a single value per tile element.
convolution. Note that all w_i terms can be pre-computed after the training stage. In other words, during inference the data values d_i, corresponding to the inputs, change, while the w_i values remain the same throughout. The 1D Winograd algorithm can be expressed in closed matrix form as
$$Y = A^T \left[ (G w) \odot (B^T d) \right] \tag{7}$$
where ⊙ denotes element-wise multiplication, and A^T, B^T, and G are three heuristic transforms given by equations 8, 9, and 10. w is the weight vector and d is the input vector.
$$A^T = \begin{bmatrix} 1 & 1 & 1 & 0 \\ 0 & 1 & -1 & -1 \end{bmatrix} \tag{8}$$
$$B^T = \begin{bmatrix} 1 & 0 & -1 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & -1 & 1 & 0 \\ 0 & 1 & 0 & -1 \end{bmatrix} \tag{9}$$
$$G = \begin{bmatrix} 1 & 0 & 0 \\ \frac{1}{2} & \frac{1}{2} & \frac{1}{2} \\ \frac{1}{2} & -\frac{1}{2} & \frac{1}{2} \\ 0 & 0 & 1 \end{bmatrix} \tag{10}$$
One conclusion from equation 6 is that to compute m outputs of a 1D convolution, only a window of (m + r − 1) input values is needed.
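The 1D F(2, 3) case can be checked numerically. The sketch below follows the four-multiplication form of the intermediate values, with the last product taken as (d1 − d3) × w2 so that it agrees with the last row of B^T in equation 9, and compares the result against a direct sliding dot product. The function names are hypothetical.

```python
def winograd_f23(d, w):
    """Two outputs of a 3-tap filter from four inputs using four multiplications."""
    d0, d1, d2, d3 = d
    w0, w1, w2 = w
    # Filter-side terms depend only on the weights and can be pre-computed.
    u0, u1, u2, u3 = w0, (w0 + w1 + w2) / 2, (w0 - w1 + w2) / 2, w2
    m1 = (d0 - d2) * u0
    m2 = (d1 + d2) * u1
    m3 = (d2 - d1) * u2
    m4 = (d1 - d3) * u3
    return [m1 + m2 + m3, m2 - m3 - m4]


def direct_f23(d, w):
    """Reference: two sliding dot products, six multiplications."""
    return [sum(d[i + j] * w[j] for j in range(3)) for i in range(2)]


if __name__ == "__main__":
    d, w = [1, 2, 3, 4], [2, 4, 6]
    print(winograd_f23(d, w), direct_f23(d, w))
```

Only the data-side differences and the four products change from tile to tile; the filter-side terms u0 to u3 are fixed after training, which is the property the accelerator exploits by storing the transformed weights.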
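In a modern CNN, the bulk of the convolution operations are 2D convolutions. Equation 7 can be easily extended to 2D convolution by nesting two 1D Winograd convolutions. The resulting 2D Winograd convolution is
$$Y = A^T \left[ (G w G^T) \odot (B^T d B) \right] A \tag{11}$$
From [12], for the case of F(4 × 4, 3 × 3), the matrices B^T, G, and A^T have the forms
$$A^T = \begin{bmatrix} 1 & 1 & 1 & 1 & 1 & 0 \\ 0 & 1 & -1 & 2 & -2 & 0 \\ 0 & 1 & 1 & 4 & 4 & 0 \\ 0 & 1 & -1 & 8 & -8 & 1 \end{bmatrix} \tag{12}$$
$$B^T = \begin{bmatrix} 4 & 0 & -5 & 0 & 1 & 0 \\ 0 & -4 & -4 & 1 & 1 & 0 \\ 0 & 4 & -4 & -1 & 1 & 0 \\ 0 & -2 & -1 & 2 & 1 & 0 \\ 0 & 2 & -1 & -2 & 1 & 0 \\ 0 & 4 & 0 & -5 & 0 & 1 \end{bmatrix} \tag{13}$$
$$G = \begin{bmatrix} \frac{1}{4} & 0 & 0 \\ -\frac{1}{6} & -\frac{1}{6} & -\frac{1}{6} \\ -\frac{1}{6} & \frac{1}{6} & -\frac{1}{6} \\ \frac{1}{24} & \frac{1}{12} & \frac{1}{6} \\ \frac{1}{24} & -\frac{1}{12} & \frac{1}{6} \\ 0 & 0 & 1 \end{bmatrix} \tag{14}$$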
The number of additions and multiplications required for the Winograd transforms, as opposed to the element-wise multiplications, increases quadratically with the tile size. Thus, Winograd is expected to perform most efficiently for smaller filter sizes, and thus smaller input tiles.
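As a sanity check on the larger transforms, the sketch below applies the 1D form Y = A^T[(G w) ⊙ (B^T d)] with the F(4, 3) matrices of [12], written out in full in the code, and compares the four outputs against a direct computation. Exact rational arithmetic from the standard library is used so that the comparison is not affected by rounding; the function names are hypothetical.

```python
from fractions import Fraction as Fr

# F(4, 3) transform matrices following [12]: six inputs, three taps, four outputs.
AT = [[1, 1, 1, 1, 1, 0],
      [0, 1, -1, 2, -2, 0],
      [0, 1, 1, 4, 4, 0],
      [0, 1, -1, 8, -8, 1]]
BT = [[4, 0, -5, 0, 1, 0],
      [0, -4, -4, 1, 1, 0],
      [0, 4, -4, -1, 1, 0],
      [0, -2, -1, 2, 1, 0],
      [0, 2, -1, -2, 1, 0],
      [0, 4, 0, -5, 0, 1]]
G = [[Fr(1, 4), 0, 0],
     [Fr(-1, 6), Fr(-1, 6), Fr(-1, 6)],
     [Fr(-1, 6), Fr(1, 6), Fr(-1, 6)],
     [Fr(1, 24), Fr(1, 12), Fr(1, 6)],
     [Fr(1, 24), Fr(-1, 12), Fr(1, 6)],
     [0, 0, 1]]


def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def winograd_f43(d, w):
    """Four outputs from six inputs with six element-wise multiplications."""
    u = matvec(G, w)                       # transformed filter (pre-computable)
    v = matvec(BT, d)                      # transformed input tile
    prod = [a * b for a, b in zip(u, v)]   # element-wise multiplication
    return matvec(AT, prod)                # inverse transform


def direct_f43(d, w):
    return [sum(d[i + j] * w[j] for j in range(3)) for i in range(4)]


if __name__ == "__main__":
    d, w = [1, 2, 3, 4, 5, 6], [1, 1, 1]
    print([int(y) for y in winograd_f43(d, w)], direct_f43(d, w))
```

The transforms here involve 6 × 6 and 6 × 3 matrices with entries as large as 8 and as small as 1/24, compared with 4 × 4 matrices of 0, ±1, and 1/2 for F(2, 3), which illustrates the growing transform cost noted above.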
Algorithm 1: Winograd for 2D convolution
for row=0; row<H; row+=m do
  for column=0; column<H; column+=m do
    for channel=0; channel<c; channel+=1 do
      for filter=0; filter<N; filter+=1 do
        load input tile;
        transform input tile;
        load transformed filter;
        perform EWMM;
      end
    end
  end
  Output Winograd convolution result
end
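The loop structure of Algorithm 1 can be sketched in software as follows, here for F(2 × 2, 3 × 3) with the matrices of equations 8 to 10. This is an illustrative reference model only, not the photonic implementation: it uses plain Python lists, assumes stride 1 with no padding and an output size divisible by m, transforms each input tile once per channel rather than once per filter, and accumulates over channels after the inverse transform. All function names are hypothetical.

```python
AT = [[1, 1, 1, 0], [0, 1, -1, -1]]
BT = [[1, 0, -1, 0], [0, 1, 1, 0], [0, -1, 1, 0], [0, 1, 0, -1]]
G = [[1, 0, 0], [0.5, 0.5, 0.5], [0.5, -0.5, 0.5], [0, 0, 1]]


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def sandwich(left, mid):
    """Return left @ mid @ left^T, the form shared by all three transforms."""
    return matmul(matmul(left, mid), transpose(left))


def winograd_conv2d(fmap, filters, m=2, r=3):
    """fmap: [c][H][W]; filters: [N][c][r][r]; returns [N][H-r+1][W-r+1]."""
    c, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    n = m + r - 1
    out_h, out_w = h - r + 1, w - r + 1
    assert out_h % m == 0 and out_w % m == 0, "output must tile evenly"
    # Filter transforms are computed once, since weights are fixed after training.
    u = [[sandwich(G, f_ch) for f_ch in f] for f in filters]
    out = [[[0.0] * out_w for _ in range(out_h)] for _ in filters]
    for row in range(0, out_h, m):
        for col in range(0, out_w, m):
            for ch in range(c):
                tile = [fmap[ch][row + i][col:col + n] for i in range(n)]
                v = sandwich(BT, tile)  # input tile in the Winograd domain
                for k in range(len(filters)):
                    # EWMM: n * n element-wise products
                    prod = [[u[k][ch][i][j] * v[i][j] for j in range(n)]
                            for i in range(n)]
                    y = sandwich(AT, prod)  # inverse transform, m x m
                    for i in range(m):
                        for j in range(m):
                            out[k][row + i][col + j] += y[i][j]
    return out


def direct_conv2d(fmap, filters):
    """Reference brute-force convolution (stride 1, no padding)."""
    c, h, w = len(fmap), len(fmap[0]), len(fmap[0][0])
    r = len(filters[0][0])
    return [[[sum(fmap[ch][p + x][q + y] * f[ch][x][y]
                  for ch in range(c) for x in range(r) for y in range(r))
              for q in range(w - r + 1)]
             for p in range(h - r + 1)]
            for f in filters]
```

Because the work inside the two innermost loops is independent across channels and filters, those iterations are natural candidates for replication in parallel hardware paths.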
V. ARCHITECTURE DESIGN
In this paper we propose a photonic CNN accelerator
based on Winograd algorithm and realized using the photonic
neuron introduced in [16]. Figure 4 depicts the architecture of a single Winograd PE. Our proposed accelerator processes a single layer of a CNN at a time. This is mainly due to the fact that in a CNN different tiles of the output feature maps are computed sequentially, and thus arrive at different times. However, in order to initiate processing of the next layer, all the inputs from the previous layer need to be available and synchronized. Our approach of processing one layer at a time enforces this synchronization. Furthermore, implementing multiple layers of a CNN would result in large area overheads.
At the input of our accelerator, an input tile of shape n × n × c, along with filters of size r × r × c, is transformed into the Winograd domain. The transformed input and filters are then multiplied element by element. The output of this multiplication needs to be transformed back using an inverse Winograd transform. The signals at this stage are digitized using an array of ADCs and placed onto the output line buffers to be stored back in the off-chip memory.
Figure 5 presents an overview of our proposed architecture, which runs on two clock domains. The first is a high-speed 5 GHz clock domain, which accommodates the low-latency components of the accelerator, including the photonic components. In section VI-A we explain how we arrived at the 5 GHz high-speed clock frequency. The rest of the accelerator, including the input feature map buffers, filter buffers, filter Winograd DSP module, and filter path DAC, runs on a slower clock domain because there is no time sensitivity on the filter path or on data transfer from/to off-chip memory. At the heart of our accelerator is an element-wise matrix multiplication unit, which we implement in photonics using the photonic neuron. We store the input feature maps and filters in an off-chip memory. Both the input feature maps and the filters must go through Winograd transformations, which are the matrix multiplications described by equations 13 and 14. It should be noted that while the input feature maps change for different input tiles, the filters are fixed for each layer. For that reason, we implemented the input feature map transformations in photonics and the filter Winograd transformations in an electronic DSP. This way, we do not pay the overhead associated with a photonic implementation, including the conversion of electronic filters to photonics. Later, the transformed filters and input feature map tile are converted into analog signals to modulate the laser beams. However, as the filters are fixed over the processing time of the layer, the analog filter signals need to be maintained for that time. Thus, we propose to use a non-volatile analog memristive memory bank, which maintains these voltages in their analog form for long retention times.
In Winograd convolution, a tile of size n × n is processed in each iteration. In order to process an entire feature map, the transformed filter tile needs to move across the input feature map. In this paradigm, two successive input tiles share (r − 1) × n elements. This introduces data reuse opportunities that avoid multiple queries for the same data block. Our goal here is to exploit this opportunity at the front end of our accelerator. Our design is inspired by the work in [13], where the authors utilize line buffers to avoid redundant queries to the off-chip memory. Figure 6 shows an example line buffer design to load and hold a 3 × 3 input feature map tile. Input tiles are fetched from off-chip memory and loaded into the line buffer. Buffered tiles are then passed to the digital-to-analog converter (DAC) using parallel channels.
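The benefit of reusing the (r − 1) × n shared elements can be estimated with a simple fetch-count model for one horizontal strip of tiles. This is a counting sketch under the assumption that tiles slide horizontally with stride m = n − r + 1; it does not model the actual buffer circuit of Figure 6, and the function names are hypothetical.

```python
def tiles_per_strip(width: int, n: int, r: int) -> int:
    """Number of n-wide tiles that fit across a strip, stride m = n - r + 1."""
    m = n - r + 1
    return (width - n) // m + 1


def fetches_without_reuse(width: int, n: int, r: int) -> int:
    """Every tile reloads all of its n * n elements from off-chip memory."""
    return tiles_per_strip(width, n, r) * n * n


def fetches_with_line_buffer(width: int, n: int, r: int) -> int:
    """First tile loads n columns; each later tile loads only m new columns."""
    m = n - r + 1
    return n * n + (tiles_per_strip(width, n, r) - 1) * m * n


if __name__ == "__main__":
    width, n, r = 224, 4, 3
    print("tiles:", tiles_per_strip(width, n, r))
    print("without reuse:", fetches_without_reuse(width, n, r))
    print("with line buffer:", fetches_with_line_buffer(width, n, r))
```

For a 224-element-wide strip with n = 4 and r = 3, the buffered count equals 4 × 224 = 896, meaning each element is fetched exactly once, roughly half of the 1,776 fetches needed without reuse.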
In parallel with the input data stream, the transformed filter weights are converted into analog signals to program the analog memristive memory. We then use the voltages generated from the stored analog signals to modulate the laser source for the filters. Each output signal generated by a DAC is then used to modulate a laser beam of a particular wavelength λi. It is worth noting that for each set of filters modulated by the laser source, the input line goes through multiple iterations corresponding to different input tiles. Once both the input tile laser beam and the filter laser beam are ready, the EWMM multiplies each element of the Winograd input feature map tile by its corresponding Winograd filter value. The output of the EWMM unit must be transformed back from the Winograd domain into the original domain by the inverse Winograd transform. The result contains output feature map tiles for multiple channels c. The output feature map values are digitized and stored back into the off-chip memory.
A key principle in HPC is to minimize I/O and other communication latencies, ideally to less than the computing time, in order to avoid under-utilization of the computing units. From Algorithm 1 we can see that the two innermost loops iterate over the different channels of the input feature map tile and over the different filters. Moreover, the operations within these two loops are independent of one another. This provides parallelization opportunities at the cost of additional hardware. In other words, the amount of parallelization, and thus the speedup, that we can achieve scales linearly with the number of pipeline replications in our system. This linear scaling plateaus once the computation bandwidth approaches the data transfer bandwidth. Our envisioned design uses an arbitrarily chosen 100 parallel paths. Our evaluation results in the next section justify this selection.
VI. EVALUATION
In this section we evaluate the performance of our acceler-
ator for the 3 × 3 filters of the VGG16 network against the
recent FPGA [43][13][44][45][46] and GPU implementations
[13].
A. Speed
Here we develop a model to estimate the execution time of our accelerator. First, we model the time required to convolve one input tile with one filter, which we call T_tile_filter. We then generalize the model to the case where we parallelize the process based on the available resources. For one input feature map tile and one filter, both the input branch and the filter branch of Figure 5 are fully pipelined. Therefore, the execution time of a layer is determined by the longer of
Fig. 5: High-level architecture of our proposed photonic accelerator. Input feature maps and filters are initially stored in an off-chip memory. Input tiles of size n × n are loaded into the input line buffer one at a time. Kernel weights do not change once the CNN is trained. Thus, we perform the filters' Winograd transform in electronics, and its cost is amortized over many input tiles. The Winograd transform for input feature map tiles is computed in photonics. The photonic element-wise matrix multiplication (EWMM) unit performs the core element-wise Winograd multiplications. Outputs are digitized and placed onto the output line buffer. Finally, the processed layer outputs are stored in the off-chip memory.
Fig. 6: An example line buffer design to load and hold an
input tile of size 3× 3.
the two paths, namely the filter path and the input path. For each iteration of the filter path, the input data path goes through multiple iterations. This is because a single filter operates on many input data tiles. Hence, the input data path sets the upper bound on the delay. Our execution time model is comprised of two major components, namely the input/output time (T_IO) and the computation time (T_compute). We define T_IO as
$$T_{IO} = \max(T_{load}, T_{offload}) \tag{15}$$
where T_load is the time it takes to transfer data from the off-chip memory to the input of the laser sources. Moreover, we can implement a total of P_input DACs to speed up the data transfer. Considering the fact that our input matrices are of shape 4 × 4, we used an array of 16 DACs in this work. Similarly, T_offload is the time to store the computed outputs of the inverse Winograd transform back to the off-chip memory. It should be noted that we want to match the rate of the ADC at the output with that of the input DAC to avoid any speed mismatch, and thus congestion, in the pipeline. At the time of writing, both on-chip DACs and ADCs are capable of operating at sampling rates of more than 18 GS/s for a bit resolution of at least 8 bits [47][17]. Furthermore, with recent advances in memory technology, current memories are able to transfer data at high I/O bandwidths of more than 512 Gb/s [48]. This high memory bandwidth allows us to buffer data and filters from off-chip memory at high transfer rates and feed them to our input line buffers. However, for our line buffers we need memories with a high clock frequency and a short access time. Currently reported memory technologies have access times as short as 200 ps.
At the photonic core of our accelerator, Tcompute is
$$T_{\mathrm{compute}} = T_{\mathrm{laser}} + T_{\mathrm{Winograd}} + T_{\mathrm{EWMM}} + T_{\mathrm{iWinograd}} \tag{16}$$
where TWinograd is the time to compute the Winograd transform,
TEWMM is the time to perform the element-wise matrix
multiplication, and TiWinograd is the time to compute the inverse
Winograd transform. Once the laser is set up, input signals
only incur a time delay equivalent to the flight time of the light
before they are fed into the ADC. The clock
frequency of the pipeline is therefore determined by
$$\mathrm{frequency}_{\mathrm{clock}} \le \frac{1}{\min(T_{\mathrm{load}}, T_{\mathrm{offload}}, T_{\mathrm{compute}})} \tag{17}$$
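The three photonic stages timed in Eq. (16), namely the input transform, the element-wise multiplication, and the inverse transform, can be illustrated with the smallest one-dimensional Winograd case, F(2, 3), using the transforms given by Lavin and Gray [12]. The following is a minimal Python sketch under that assumption. It is not the tile configuration or the implementation used in this work, and all function names are hypothetical.

```python
def filter_transform(g):
    """Transform a 3-tap filter. Done once, since trained weights are fixed."""
    g0, g1, g2 = g
    return [g0, (g0 + g1 + g2) / 2, (g0 - g1 + g2) / 2, g2]


def input_transform(d):
    """Transform a 4-sample input tile using additions and subtractions only."""
    d0, d1, d2, d3 = d
    return [d0 - d2, d1 + d2, d2 - d1, d1 - d3]


def elementwise_multiply(u, v):
    """The EWMM step: 4 multiplications instead of the 6 of direct convolution."""
    return [a * b for a, b in zip(u, v)]


def inverse_transform(m):
    """Map the 4 products back to the 2 convolution outputs."""
    m1, m2, m3, m4 = m
    return [m1 + m2 + m3, m2 - m3 - m4]


def winograd_f23(d, g):
    """Two outputs of a 3-tap filter over a 4-sample tile via Winograd F(2, 3)."""
    return inverse_transform(
        elementwise_multiply(filter_transform(g), input_transform(d))
    )


def direct_f23(d, g):
    """Reference result: sliding dot product, as used in CNN convolution."""
    return [sum(d[i + k] * g[k] for k in range(3)) for i in range(2)]


tile = [1, 2, 3, 4]
taps = [1, 2, 3]
print(winograd_f23(tile, taps), direct_f23(tile, taps))
```

Both calls return the same two outputs. The two-dimensional case nests the same transforms over rows and columns of a tile.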
Based on Eq. (17), we picked a clock frequency
of 5 GHz, which satisfies the bound. From Eq. (17),
Ttile filter is simply found by
$$T_{\mathrm{tile\,filter}} = \frac{1}{5\,\mathrm{GHz}} = 200\,\mathrm{ps} \tag{18}$$
For an F(4 × 4, 3 × 3) Winograd, each Ttile filter returns an
output block of size 9, equivalent to 9 convolution operations.
With a clock frequency of 5 GHz, our proposed accelerator
performs at 9 × 5G = 45 GOPs/s. Figure 7 shows the average
convolution speed comparison of our proposed accelerator
against the state-of-the-art FPGA and GPU implementations.
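The timing relations in Eqs. (15), (16), and (18), together with the throughput figure quoted above, can be checked with a few lines of Python. This is a minimal sketch: the function and variable names are hypothetical, and only the 5 GHz clock and the 9 operations per tile are taken from the text.

```python
def io_time(t_load, t_offload):
    """Eq. (15): I/O time is the slower of loading and offloading."""
    return max(t_load, t_offload)


def compute_time(t_laser, t_winograd, t_ewmm, t_iwinograd):
    """Eq. (16): the photonic stages add up in series."""
    return t_laser + t_winograd + t_ewmm + t_iwinograd


def tile_filter_time(clock_hz):
    """Eq. (18): one tile-filter result per clock cycle."""
    return 1.0 / clock_hz


def throughput(clock_hz, ops_per_tile):
    """Operations per second when each cycle yields ops_per_tile outputs."""
    return clock_hz * ops_per_tile


CLOCK_HZ = 5e9      # clock frequency chosen in the text
OPS_PER_TILE = 9    # output block size stated in the text

t_tile = tile_filter_time(CLOCK_HZ)
ops_per_second = throughput(CLOCK_HZ, OPS_PER_TILE)
print(f"T_tile_filter = {t_tile * 1e12:.0f} ps")           # 200 ps
print(f"throughput = {ops_per_second / 1e9:.0f} GOPs/s")   # 45 GOPs/s
```

The helpers `io_time` and `compute_time` take stage delays as arguments, so measured or simulated values can be substituted without changing the model.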
Fig. 7: Comparison of convolution operation speed for FPGA,
GPU, and our photonic implementation. The last column
labeled with (p) represents the speed of the photonic core in
the absence of electronics.
B. Power
In order to estimate the dynamic power consumption of our
proposed system, we built an in-house estimator by augmenting
the standard Google TensorFlow tool. While primarily used
for the training and inference stages of neural networks,
TensorFlow is, at its core, a symbolic mathematical graph processing
platform. TensorFlow enables users to express arbitrary
computations as a dataflow graph, which is extremely useful
in the context of neural networks. However, out-of-the-box
TensorFlow is completely agnostic to the physical realization of
the neural networks being implemented. Thus, we augmented
TensorFlow's high-level components with mathematical models
of electro-optical components. Figure 8 depicts the native
TensorFlow toolkit hierarchy against our augmented version. In
Fig. 8: High-level TensorFlow toolkit hierarchy vs. augmented
TensorFlow.
our estimator, each primitive mathematical operation is given
two physical models, namely the power model and the noise
model. While the noise model can impact the functionality,
and thus the accuracy, of a neural network, the power model only
estimates the power consumed. Table II shows some
TABLE II: Mapping of primitive math operations to their
hardware realization.
Math Operation        | Photonic Representation
Addition              | Photodiode
Multiplication        | MRR
Connection            | Waveguide
Non-linear Activation | Electro-absorption Modulator
of these mathematical operations mapped to their physical
realizations.
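The idea of attaching a power model and a noise model to each primitive operation can be sketched in plain Python. This is an illustrative sketch only: it does not use TensorFlow, the class and function names are hypothetical, and the numeric values are unit placeholders rather than device figures from this work. Only the operation-to-component mapping is taken from Table II.

```python
from dataclasses import dataclass

# Mapping from Table II: math operation -> photonic component.
TABLE_II = {
    "addition": "Photodiode",
    "multiplication": "MRR",
    "connection": "Waveguide",
    "nonlinear_activation": "Electro-absorption Modulator",
}


@dataclass(frozen=True)
class PhysicalModel:
    component: str    # device that realizes the operation
    power_w: float    # power per operation in watts (placeholder)
    noise_std: float  # noise std as a fraction of signal swing (placeholder)


def build_models(power_w, noise_std):
    """Attach a power and a noise figure to every Table II operation."""
    return {
        op: PhysicalModel(component, power_w[op], noise_std[op])
        for op, component in TABLE_II.items()
    }


def estimate_power(op_counts, models):
    """Sum the per-operation power over the operations in a graph."""
    return sum(count * models[op].power_w for op, count in op_counts.items())


# Placeholder figures only (unit power, zero noise); NOT values from this work.
unit_power = {op: 1.0 for op in TABLE_II}
no_noise = {op: 0.0 for op in TABLE_II}
models = build_models(unit_power, no_noise)

# Hypothetical operation counts for one graph node.
total = estimate_power({"multiplication": 16, "addition": 4}, models)
print(total)
```

With unit power per operation the estimate reduces to an operation count, which makes the bookkeeping easy to check before real device figures are substituted.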
Fig. 9: Comparison of convolution operation power for FPGA,
GPU, and our photonic implementation. The last column
labeled with (p) represents the power consumption of the
photonic core in the absence of electronics.
Figure 9 depicts the power comparison results. Finally, we
plotted the energy efficiency figure of merit, defined as the
ratio of speed to power, in Figure 10.
VII. TRAINING, INFERENCE, AND NOISE
We initially trained our neural network offline on a
conventional digital computer. Later, at inference
time, we loaded the trained weights into our in-house
simulator, which is equipped with noise sources modeling the
physical noise in a photonic implementation. Our hypothesis
was that inference on a noisy neural network would result in
some loss in accuracy. This is mostly because
the network used during training is noiseless, with 32-bit
floating-point resolution, while during inference the weights
suddenly face a noisy network. In other words, the
network during inference experiences unseen noise behavior
that results in accuracy loss. We tested our hypothesis by
sweeping the noise level and observing its effect on accuracy. To
that end, we identified two major noise sources, namely
the neuron output noise and the weight noise. The neuron
output noise represents the noise introduced at the output of
each neuron by the photodiode and the nonlinear activation
Fig. 10: Comparison of energy efficiency for FPGA, GPU, and
our photonic implementation. The last column labeled with (p)
represents the energy efficiency of the photonic core in the
absence of electronics. The results show that using photonics
as an accelerator has the potential of improving energy
efficiency by up to more than three orders of magnitude.
Fig. 11: Visualization of an augmented convolutional layer
using power and noise models for the VGG16 network.
function. Figure 12 shows how accuracy is impacted by noise
during inference for the case that the network was trained
free of any noise source.
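The two noise sources described above can be sketched for a single neuron in plain Python. This is a minimal illustration under stated assumptions: the noise is taken to be zero-mean Gaussian with a standard deviation given as a fraction of the signal swing, the nonlinear activation is omitted, and the function names and example values are hypothetical rather than taken from our simulator.

```python
import random


def noisy_neuron(inputs, weights, output_noise, weight_noise, swing=1.0, rng=random):
    """Weighted sum with Gaussian weight noise and Gaussian output noise.

    Both noise levels are standard deviations expressed as a fraction of `swing`.
    """
    noisy_w = [w + rng.gauss(0.0, weight_noise * swing) for w in weights]
    total = sum(x * w for x, w in zip(inputs, noisy_w))
    return total + rng.gauss(0.0, output_noise * swing)


def mean_abs_error(noise_level, trials=1000, seed=0):
    """Average deviation from the noiseless output at a given noise level."""
    rng = random.Random(seed)
    inputs = [0.5, -0.25, 1.0]    # hypothetical example inputs
    weights = [0.2, 0.4, -0.1]    # hypothetical example weights
    exact = sum(x * w for x, w in zip(inputs, weights))
    errors = [
        abs(noisy_neuron(inputs, weights, noise_level, noise_level, rng=rng) - exact)
        for _ in range(trials)
    ]
    return sum(errors) / trials


# Sweep the noise level, in the spirit of the inference-noise sweep.
for level in (0.0, 0.001, 0.01, 0.1):
    print(level, mean_abs_error(level))
```

In a full network the same perturbation would be applied at every neuron, and the swept quantity of interest would be classification accuracy rather than output error.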
Our next hypothesis was that, if we allow for a certain amount
of noise during training, the model would become more
robust to noise during the inference stage. To that end, we trained
the network with the output noise source on. We only added the
output noise, and left the weight noise off, because weights
are required to be calculated with maximum precision during
training. In fact, we observed that even a minute amount of
noise added to the weights during training could reduce
the accuracy of the network to its baseline level of about
10%. Figure 12 depicts the effect of adding an output noise
equivalent to 0.1% of the maximum signal swing at the output
of neurons. Note that adding this amount of noise during
training may result in a slight accuracy loss of about 2% at low levels
of inference noise. However, the model becomes more
robust to higher levels of inference noise. This shows that
adding noise during training can fine-tune the network for
a noisy physical realization. Lastly, adding
noise beyond the initial 0.1% resulted in a more significant
loss of inference accuracy.
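Training with the output noise source on and the weight noise off can be illustrated on a toy problem. The sketch below trains a single logistic neuron in plain Python rather than a CNN in TensorFlow, so it shows only the mechanism and not our results. The data set, learning rate, and all names are hypothetical. The 0.1% noise level follows the text, applied to a sigmoid output whose swing is 1.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def make_data(n=40, seed=1):
    """Toy 2-D points labeled 1 when x1 + x2 > 0 (hypothetical data set)."""
    rng = random.Random(seed)
    points = [(rng.uniform(-1, 1), rng.uniform(-1, 1)) for _ in range(n)]
    return [((x1, x2), 1.0 if x1 + x2 > 0 else 0.0) for x1, x2 in points]


def train_with_output_noise(data, noise_level=0.001, epochs=200, lr=1.0, seed=0):
    """Gradient descent where only the neuron output is perturbed."""
    rng = random.Random(seed)
    w, b = [0.0, 0.0], 0.0
    n = len(data)
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0], 0.0
        for (x1, x2), y in data:
            out = sigmoid(w[0] * x1 + w[1] * x2 + b)
            out += rng.gauss(0.0, noise_level)  # output noise; weights stay exact
            err = y - out
            grad_w[0] += err * x1
            grad_w[1] += err * x2
            grad_b += err
        w = [w[0] + lr * grad_w[0] / n, w[1] + lr * grad_w[1] / n]
        b += lr * grad_b / n
    return w, b


def accuracy(data, w, b):
    """Fraction of points classified correctly by the noiseless neuron."""
    hits = sum(
        1 for (x1, x2), y in data
        if (1.0 if w[0] * x1 + w[1] * x2 + b > 0 else 0.0) == y
    )
    return hits / len(data)


data = make_data()
w, b = train_with_output_noise(data)
print(accuracy(data, w, b))
```

Raising `noise_level` in this sketch corresponds to increasing the training noise; evaluating the trained neuron under added inference noise would mirror the sweep in Figure 12.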
VIII. CONCLUSION
In this paper we presented a photonic CNN accelerator
based on the Winograd filtering convolution algorithm. Winograd
reduces the total number of multiplications, and thus the hardware,
required to perform the convolution operation. We evaluated the speed of our
accelerator by developing an analytical framework. Our results
show that a photonic accelerator can compete with
state-of-the-art Winograd-based FPGA and GPU implementations. In
addition, such a photonic accelerator has the potential to improve the
energy efficiency by up to three orders of magnitude. However,
the overall speed is bound by the limitations of I/O and the
conversions in the DAC and ADC. To evaluate power performance, we
augmented the native hardware-agnostic Google TensorFlow
tool with power models of our hardware components. Similar
to the speed analysis, electronic I/O and converters are the major
consumers of power. The photonic core can operate while
consuming up to two orders of magnitude less power. Furthermore,
we modeled noise in our TensorFlow-based simulator to
investigate the effect of implementation noise sources, such
as photodiode noise and MRR noise, on the functionality
(accuracy) of our CNN. We found that training the CNN with a
small noise component equivalent to 0.1% of the signal swing
results in the CNN becoming more robust to inference-time noise
introduced by noisy photodiodes and MRRs.
REFERENCES
[1] P. R. Prucnal and B. J. Shastri, Neuromorphic Photonics. CRC Press,
May 2017.
[2] J. K. George, A. Mehrabian, R. Amin, J. Meng, T. F. d. Lima, A. N.
Tait, B. J. Shastri, T. El-Ghazawi, P. R. Prucnal, and V. J. Sorger,
“Neuromorphic photonics with electro-absorption modulators,” Optics
Express, vol. 27, no. 4, pp. 5181–5191, Feb. 2019. [Online]. Available:
https://www.osapublishing.org/oe/abstract.cfm?uri=oe-27-4-5181
[3] C. Wang, M. Zhang, X. Chen, M. Bertrand, A. Shams-Ansari,
S. Chandrasekhar, P. Winzer, and M. Lončar, “Integrated lithium niobate
electro-optic modulators operating at CMOS-compatible voltages,”
Nature, vol. 562, no. 7725, p. 101, Oct. 2018. [Online]. Available:
https://www.nature.com/articles/s41586-018-0551-y
[4] A. Liu, R. Jones, L. Liao, D. Samara-Rubio, D. Rubin, O. Cohen,
R. Nicolaescu, and M. Paniccia, “A high-speed silicon optical
modulator based on a metal-oxide-semiconductor capacitor,” Nature,
vol. 427, no. 6975, p. 615, Feb. 2004. [Online]. Available:
https://www.nature.com/articles/nature02310
[5] R. Amin, R. Maiti, C. Carfano, Z. Ma, M. H. Tahersima,
Y. Lilach, D. Ratnayake, H. Dalir, and V. J. Sorger, “0.52 V
mm ITO-based Mach-Zehnder modulator in silicon photonics,” APL
Photonics, vol. 3, no. 12, p. 126104, Dec. 2018. [Online]. Available:
https://aip.scitation.org/doi/10.1063/1.5052635
[6] H. Bagherian, S. Skirlo, Y. Shen, H. Meng, V. Ceperic, and M. Soljačić,
“On-chip optical convolutional neural networks,” arXiv preprint
arXiv:1808.03303, 2018.
[7] A. Mehrabian, Y. Al-Kabani, V. J. Sorger, and T. El-Ghazawi, “Pcnna: A
photonic convolutional neural network accelerator,” in 2018 31st IEEE
International System-on-Chip Conference (SOCC). IEEE, 2018, pp.
169–173.
[8] “Optalysys.” [Online]. Available: https://www.optalysys.com/
[9] Y. Shen, N. C. Harris, S. Skirlo, M. Prabhu, T. Baehr-Jones,
M. Hochberg, X. Sun, S. Zhao, H. Larochelle, D. Englund, and
M. Soljačić, “Deep learning with coherent nanophotonic circuits,” Nature
Photonics, vol. 11, no. 7, pp. 441–446, Jul. 2017. [Online]. Available:
http://www.nature.com/articles/nphoton.2017.93
Fig. 12: Evaluation of the effect of physical photodiode
and MRR noise on inference accuracy. This effect can be partially
compensated by introducing an artificial noise
source during the training stage. In the absence of a training
noise source (top), inference accuracy deteriorates quickly
as we sweep the photodiode and MRR noise. By introducing
an equivalent of 0.1% Gaussian noise, the network becomes
more robust to inference noise. A further increase in the training
noise level (bottom) hinders the network from training properly.
[10] T. W. Hughes, M. Minkov, Y. Shi, and S. Fan, “Training of
photonic neural networks through in situ backpropagation and gradient
measurement,” Optica, vol. 5, no. 7, pp. 864–871, Jul. 2018.
[Online]. Available: https://www.osapublishing.org/optica/abstract.cfm?uri=optica-5-7-864
[11] M. Miscuglio, A. Mehrabian, Z. Hu, S. I. Azzam, J. George, A. V.
Kildishev, M. Pelton, and V. J. Sorger, “All-optical nonlinear activation
function for photonic neural networks [Invited],” Optical Materials
Express, vol. 8, no. 12, p. 3851, Dec. 2018. [Online]. Available:
https://www.osapublishing.org/abstract.cfm?URI=ome-8-12-3851
[12] A. Lavin and S. Gray, “Fast algorithms for convolutional neural net-
works,” in Proceedings of the IEEE Conference on Computer Vision
and Pattern Recognition, 2016, pp. 4013–4021.
[13] L. Lu, Y. Liang, Q. Xiao, and S. Yan, “Evaluating fast algorithms for
convolutional neural networks on FPGAs,” in 2017 IEEE 25th Annual
International Symposium on Field-Programmable Custom Computing
Machines (FCCM). IEEE, 2017, pp. 101–108.
[14] Y. Shen, N. C. Harris, S. Skirlo, M. Prabhu, T. Baehr-Jones,
M. Hochberg, X. Sun, S. Zhao, H. Larochelle, D. Englund et al., “Deep
learning with coherent nanophotonic circuits,” Nature Photonics, vol. 11,
no. 7, p. 441, 2017.
[15] A. N. Tait, M. A. Nahmias, B. J. Shastri, and P. R. Prucnal, “Broadcast
and weight: an integrated network for scalable photonic spike process-
ing,” Journal of Lightwave Technology, vol. 32, no. 21, pp. 3427–3439,
2014.
[16] A. N. Tait, A. X. Wu, T. F. de Lima, E. Zhou, B. J. Shastri, M. A.
Nahmias, and P. R. Prucnal, “Microring weight banks,” IEEE Journal
of Selected Topics in Quantum Electronics, vol. 22, no. 6, pp. 312–325,
2016.
[17] B. Xu, Y. Zhou, and Y. Chiu, “A 23-mW 24-GS/s 6-bit voltage-time
hybrid time-interleaved ADC in 28-nm CMOS,” IEEE Journal of
Solid-State Circuits, vol. 52, no. 4, pp. 1091–1100, 2017.
[18] M. Ziebell, D. Marris-Morini, G. Rasigade, P. Crozat, J.-M. Fédéli,
P. Grosse, E. Cassan, and L. Vivien, “Ten Gbit/s ring resonator silicon
modulator based on interdigitated PN junctions,” Optics Express,
vol. 19, no. 15, pp. 14 690–14 695, Jul. 2011. [Online]. Available:
https://www.osapublishing.org/oe/abstract.cfm?uri=oe-19-15-14690
[19] F. Y. Gardes, A. Brimont, P. Sanchis, G. Rasigade, D. Marris-Morini,
L. O’Faolain, F. Dong, J. M. Fedeli, P. Dumon, L. Vivien, T. F. Krauss,
G. T. Reed, and J. Martí, “High-speed modulation of a compact silicon
ring resonator based on a reverse-biased pn diode,” Optics Express,
vol. 17, no. 24, pp. 21 986–21 991, Nov. 2009. [Online]. Available:
https://www.osapublishing.org/oe/abstract.cfm?uri=oe-17-24-21986
[20] “OSA | Ten Gbit/s ring resonator silicon modulator
based on interdigitated PN junctions.” [Online]. Available:
https://www.osapublishing.org/oe/abstract.cfm?uri=oe-19-15-14690
[21] T. Baba, S. Akiyama, M. Imai, N. Hirayama, H. Takahashi, Y. Noguchi,
T. Horikawa, and T. Usuki, “50-Gb/s ring-resonator-based silicon
modulator,” Optics Express, vol. 21, no. 10, pp. 11 869–11 876, May
2013. [Online]. Available: https://www.osapublishing.org/oe/abstract.cfm?uri=oe-21-10-11869
[22] P. Dong, S. Liao, D. Feng, H. Liang, D. Zheng, R. Shafiiha, C.-C. Kung,
W. Qian, G. Li, X. Zheng, A. V. Krishnamoorthy, and M. Asghari,
“Low V pp, ultralow-energy, compact, high-speed silicon electro-optic
modulator,” Optics Express, vol. 17, no. 25, p. 22484, Dec. 2009.
[Online]. Available: https://www.osapublishing.org/oe/abstract.cfm?uri=oe-17-25-22484
[23] H. Jayatilleka, K. Murray, M. Caverley, N. A. F. Jaeger, L. Chrostowski,
and S. Shekhar, “Crosstalk in SOI Microring Resonator-Based Filters,”
Journal of Lightwave Technology, vol. 34, no. 12, pp. 2886–2896, Jun.
2016. [Online]. Available: http://ieeexplore.ieee.org/document/7272050/
[24] M. Bahadori, S. Rumley, H. Jayatilleka, K. Murray, N. A. F. Jaeger,
L. Chrostowski, S. Shekhar, and K. Bergman, “Crosstalk Penalty in
Microring-Based Silicon Photonic Interconnect Systems,” Journal of
Lightwave Technology, vol. 34, no. 17, pp. 4043–4052, Sep. 2016.
[Online]. Available: http://ieeexplore.ieee.org/document/7506337/
[25] “imec-ePIXfab SiPhotonics Passives.” [Online]. Available: http://www.
europractice-ic.com/SiPhotonics technology IHP passives.php
[26] L. Vivien, A. Polzer, D. Marris-Morini, J. Osmond, J. M. Hartmann,
P. Crozat, E. Cassan, C. Kopp, H. Zimmermann, and J. M. Fédéli,
“Zero-bias 40 Gbit/s germanium waveguide photodetector on silicon,” Optics
Express, vol. 20, no. 2, pp. 1096–1101, Jan. 2012. [Online]. Available:
https://www.osapublishing.org/oe/abstract.cfm?uri=oe-20-2-1096
[27] Y. Salamin, P. Ma, B. Baeuerle, A. Emboras, Y. Fedoryshyn,
W. Heni, B. Cheng, A. Josten, and J. Leuthold, “100 GHz Plasmonic
Photodetector,” ACS Photonics, vol. 5, no. 8, pp. 3291–3297, Aug.
2018. [Online]. Available: http://pubs.acs.org/doi/10.1021/acsphotonics.8b00525
[28] P. Ma, Y. Salamin, B. Baeuerle, A. Emboras, Y. Fedoryshyn,
W. Heni, B. Cheng, A. Josten, and J. Leuthold, “100 GHz
Photoconductive Plasmonic Germanium Detector,” in Conference on
Lasers and Electro-Optics (2018), paper SM2I.3. Optical Society
of America, May 2018, p. SM2I.3. [Online]. Available: https:
//www.osapublishing.org/abstract.cfm?uri=CLEO SI-2018-SM2I.3
[29] A. N. Tait, T. F. de Lima, E. Zhou, A. X. Wu, M. A.
Nahmias, B. J. Shastri, and P. R. Prucnal, “Neuromorphic photonic
networks using silicon photonic weight banks,” Scientific Reports,
vol. 7, no. 1, p. 7430, Aug. 2017. [Online]. Available:
https://doi.org/10.1038/s41598-017-07754-z
[30] C. Yoshida, K. Tsunoda, H. Noshiro, and Y. Sugiyama, “High speed
resistive switching in Pt/TiO2/TiN film for nonvolatile memory
application,” Applied Physics Letters, vol. 91, no. 22, p. 223510, 2007.
[31] J. Borghetti, G. S. Snider, P. J. Kuekes, J. J. Yang, D. R. Stewart, and
R. S. Williams, “‘Memristive’ switches enable ‘stateful’ logic operations via
material implication,” Nature, vol. 464, no. 7290, p. 873, 2010.
[32] S. Stathopoulos, A. Khiat, M. Trapatseli, S. Cortese, A. Serb, I. Valov,
and T. Prodromakis, “Multibit memory operation of metal-oxide bi-layer
memristors,” Scientific reports, vol. 7, no. 1, p. 17532, 2017.
[33] I. Baek, M. Lee, S. Seo, M. Lee, D. Seo, D.-S. Suh, J. Park, S. Park,
H. Kim, I. Yoo et al., “Highly scalable nonvolatile resistive memory
using simple binary oxide driven by asymmetric unipolar voltage
pulses,” in IEDM Technical Digest. IEEE International Electron Devices
Meeting, 2004. IEEE, 2004, pp. 587–590.
[34] E. J. Merced-Grafals, N. Dávila, N. Ge, R. S. Williams, and J. P.
Strachan, “Repeatable, accurate, and high speed multi-level programming
of memristor 1T1R arrays for power efficient analog computing
applications,” Nanotechnology, vol. 27, no. 36, p. 365202, 2016.
[35] A. Prakash, D. Deleruyelle, J. Song, M. Bocquet, and H. Hwang,
“Resistance controllability and variability improvement in a TaOx-based
resistive memory for multilevel storage application,” Applied Physics
Letters, vol. 106, no. 23, p. 233104, 2015.
[36] S. R. Lee, Y.-B. Kim, M. Chang, K. M. Kim, C. B. Lee, J. H. Hur,
G.-S. Park, D. Lee, M.-J. Lee, C. J. Kim et al., “Multi-level switching
of triple-layered TaOx RRAM with excellent reliability for storage class
memory,” in 2012 Symposium on VLSI Technology (VLSIT). IEEE,
2012, pp. 71–72.
[37] K. Simonyan and A. Zisserman, “Very deep convolutional networks for
large-scale image recognition,” arXiv preprint arXiv:1409.1556, 2014.
[38] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in Proceedings of the IEEE conference on computer vision
and pattern recognition, 2016, pp. 770–778.
[39] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking
the inception architecture for computer vision,” in Proceedings of the
IEEE conference on computer vision and pattern recognition, 2016, pp.
2818–2826.
[40] M. Mathieu, M. Henaff, and Y. LeCun, “Fast training of convolutional
networks through FFTs,” arXiv preprint arXiv:1312.5851, 2013.
[41] N. Vasilache, J. Johnson, M. Mathieu, S. Chintala, S. Piantino, and
Y. LeCun, “Fast convolutional nets with fbfft: A GPU performance
evaluation,” arXiv preprint arXiv:1412.7580, 2014.
[42] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro,
and E. Shelhamer, “cuDNN: Efficient primitives for deep learning,” arXiv
preprint arXiv:1410.0759, 2014.
[43] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing
FPGA-based accelerator design for deep convolutional neural networks,”
in Proceedings of the 2015 ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays. ACM, 2015, pp. 161–170.
[44] J. Qiu, J. Wang, S. Yao, K. Guo, B. Li, E. Zhou, J. Yu, T. Tang,
N. Xu, S. Song et al., “Going deeper with embedded FPGA platform for
convolutional neural network,” in Proceedings of the 2016 ACM/SIGDA
International Symposium on Field-Programmable Gate Arrays. ACM,
2016, pp. 26–35.
[45] N. Suda, V. Chandra, G. Dasika, A. Mohanty, Y. Ma, S. Vrudhula, J.-s.
Seo, and Y. Cao, “Throughput-optimized OpenCL-based FPGA accelerator
for large-scale convolutional neural networks,” in Proceedings of the
2016 ACM/SIGDA International Symposium on Field-Programmable
Gate Arrays. ACM, 2016, pp. 16–25.
[46] C. Zhang, G. Sun, Z. Fang, P. Zhou, P. Pan, and J. Cong, “Caffeine:
Towards uniformed representation and acceleration for deep convolu-
tional neural networks,” IEEE Transactions on Computer-Aided Design
of Integrated Circuits and Systems, 2018.
[47] A. Nazemi, K. Hu, B. Catli, D. Cui, U. Singh, T. He, Z. Huang,
B. Zhang, A. Momtaz, and J. Cao, “3.4 A 36Gb/s PAM4 transmitter using
an 8b 18GS/s DAC in 28nm CMOS,” in 2015 IEEE International Solid-State
Circuits Conference-(ISSCC) Digest of Technical Papers. IEEE, 2015,
pp. 1–3.
[48] D. U. Lee, K. W. Kim, K. W. Kim, K. S. Lee, S. J. Byeon, J. H. Kim,
J. H. Cho, J. Lee, and J. H. Chun, “A 1.2 V 8 Gb 8-channel 128 GB/s
high-bandwidth memory (HBM) stacked DRAM with effective I/O test circuits,”
IEEE Journal of Solid-State Circuits, vol. 50, no. 1, pp. 191–203, 2015.
Armin Mehrabian is a PhD candidate in Electrical
Engineering at the George Washington University.
His research interests include High Performance
Computing (HPC), Neuromorphic Computing, and Artificial
Intelligence (AI) from both software and hardware
points of view. He received his B.S. degree in
Electrical Engineering from Shahid Beheshti University
of Tehran, Iran, focusing on Analog Electronics, and
his M.S. degree from the George Washington University
(GWU), DC, USA, in computer engineering, focusing
on VLSI and digital electronics design. His current
research involves leveraging nanophotonics for HPC architecture designs.
Mario Miscuglio is a post-doctoral
researcher in the Electrical Engineering department
at the George Washington University. He received
his Masters in Electric and Computer engineering
from Polytechnic of Turin, working as researcher at
Harvard/MIT. He completed his PhD in Optoelec-
tronics from University of Genova (IIT), working as
research fellow at the Molecular Foundry in LBNL.
His interests extend across science and engineering,
including photonic neuromorphic computing, nano-
optics and plasmonics.
Yousra Alkabani received the BSc and MSc de-
grees in computer and systems engineering from
Ain Shams University, Cairo, Egypt, in 2003 and
2006, respectively. She received the PhD degree in
computer science from Rice University, Houston,
TX, USA, in December 2010. She has been an
assistant professor of computer and systems engi-
neering at Ain Shams University since May 2011
and a visiting assistant professor of computer science
and engineering at the American University in Cairo
since 2013. Her research interests include hardware
security, low power design, and embedded systems. She is a member of the
IEEE.
Volker J. Sorger is an associate professor in the
Department of Electrical and Computer Engineering,
and the director of the Integrated Nanophotonics lab
at the George Washington University. He received
his PhD from the University of California Berkeley
and MS from UT Austin. His research focuses
on integrated photonics and plasmonics, and ana-
log information processing such as programmable
photonic circuits and neuromorphic computing. His
work was recognized by the Emil Wolf prize from
the Optical Society of America, the AFOSR Young
Investigator (YIP) award, the Hegarty Innovation Prize, the National Academy
of Sciences paper-of-the-year award, the MRS Gold medal, and both the Early
Career and Outstanding Research awards at GWU. He is the editor-in-chief
of Nanophotonics and the OSA division chair for Optoelectronics-and-
Photonics. He serves on the boards of OSA and SPIE, and is a senior member
of IEEE, OSA & SPIE.
Tarek El-Ghazawi is a Professor in the Depart-
ment of Electrical and Computer Engineering at
The George Washington University, where he leads
the university-wide Strategic Program in High-Performance
Computing. He is the founding director
of The GW Institute for Massively Parallel Ap-
plications and Computing Technologies (IMPACT).
His research interests include high-performance
computing, parallel computer architectures, high-
performance I/O, reconfigurable computing, exper-
imental performance evaluations, computer vision,
and remote sensing. He has published over 200 refereed research papers
and book chapters in these areas and his research has been supported by
DoD/DARPA, NASA, NSF, and also industry, including IBM and SGI. He is
the first author of the book UPC: Distributed Shared Memory Programming,
which has the first formal specification of the UPC language used in high-
performance computing. Dr. El-Ghazawi is a member of the ACM and the Phi
Kappa Phi National Honor Society; he was also a U.S. Fulbright Scholar, a
recipient of the Alexander Schwarzkopf Prize for Technological Innovations
and a recipient of the Alexander von Humboldt research award from the
Humboldt Foundation in Germany. He is a fellow of the IEEE.
