Digital Electronics and Analog Photonics for Convolutional Neural
  Networks (DEAP-CNNs) by Bangari, Viraj et al.
Viraj Bangari,∗ Bicky A. Marquez, Heidi B. Miller, and Bhavin J. Shastri†
Department of Physics, Engineering Physics & Astronomy,
Queen’s University, Kingston, ON K7L 3N6, Canada
Alexander N. Tait
National Institute of Standards and Technology (NIST), Boulder, Colorado 80305, USA
Mitchell A. Nahmias, Thomas Ferreira de Lima, Hsuan-Tung Peng, and Paul R. Prucnal
Department of Electrical Engineering, Princeton University, Princeton, NJ 08544, USA
(Dated: July 3, 2019)
Convolutional Neural Networks (CNNs) are powerful and ubiquitous tools for extracting
features from large datasets for applications such as computer vision and natural language
processing. However, a convolution is a computationally expensive operation in digital electronics. In
contrast, neuromorphic photonic systems, which have seen a surge of interest over
the last few years, promise higher bandwidth and energy efficiency for neural network training
and inference. Neuromorphic photonics exploits the advantages of optical electronics, including the
ease of analog processing and the ability to bus multiple signals on a single waveguide at the speed of light.
Here, we propose a Digital Electronic and Analog Photonic (DEAP) CNN hardware architecture
that has the potential to be 2.8 to 14 times faster than current state-of-the-art GPUs while
maintaining the same power usage.
I. INTRODUCTION
The success of CNNs for large-scale image recognition
has stimulated research in developing faster and more
accurate algorithms for their use. However, CNNs are
computationally intensive and therefore result in long
processing latency. One of the primary bottlenecks is
computing the matrix multiplication required for forward
propagation. In fact, over 80% of the total processing
time is spent on the convolution [1]. Therefore, tech-
niques that improve the efficiency of even forward-only
propagation are in high demand and researched exten-
sively [2, 3].
In this work, we present a complete digital electronic
and analog photonic (DEAP) architecture capable of per-
forming highly efficient CNNs for image recognition. The
competitive MNIST handwriting dataset [4] is used as a
benchmark test for our DEAP CNN. First, we train a
standard two-layer CNN offline, after which network pa-
rameters are uploaded to the DEAP CNN. Our scope is
limited to the forward propagation, but includes power
and speed analyses of our proposed architecture.
Due to their speed and energy efficiency, photonic neu-
ral networks have been widely investigated from different
approaches that can be grouped into three categories:
(1) reservoir computing [5–8]; reconfigurable architec-
tures based on (2) ring-resonators [9–12], and (3) Mach-
Zehnder interferometers [13, 14]. Reservoir computing in
the discrete photonic domain successfully implements neural
networks for fast information processing; however, the
predefined random weights of their hidden layers cannot
be modified [8].
∗ viraj.bangari@queensu.ca
† shastri@ieee.org
An alternative approach uses silicon photonics to de-
sign fully programmable neural networks [15], using a so-
called broadcast-and-weight protocol [10–12]. This pro-
tocol is capable of implementing reconfigurable, recurrent
and feedforward neural network models, using a bank of
tunable silicon microring resonators (MRRs) that recreate
on-chip synaptic weights. Such a protocol can therefore
emulate physical neurons. Mach-Zehnder interferometers
have also been used to model synaptic-like
connections of physical neurons [14]. The advantage of
the former approach over the latter is that it has already
demonstrated fan-in, inhibition, time-resolved process-
ing, and autaptic cascadability [12]. The DEAP CNN de-
sign is therefore compatible with mainstream silicon pho-
tonic device platforms. This approach leverages the ad-
vances in silicon photonics that have recently progressed
to the level of sophistication required for large-scale integration.
Furthermore, the proposed architecture supports
multi-layer networks, as required to implement
the deep learning framework.
Inspired by the work of Mehrabian et al. [16], which
lays out a potential architecture for photonic CNNs with
DRAM, buffers, and microring resonators, our design
goes a step further by considering specific input repre-
sentation, as well as an example of how an algorithm for
tasks such as MNIST handwritten digit recognition can
be mapped to photonics. Moreover, we consider summa-
tion of multi-channel inputs, multi-dimensional kernels,
the limitations on weights being between 0 and 1, and
the architecture for the depth of kernel or inputs.
This work is divided into five sections. Following this
introduction, in Section (II), we describe convolutions as
used in the field of signal processing and then introduce
silicon photonic devices to perform convolutions in
photonics. Section (III) introduces a hardware-inspired
algorithm to perform such full photonic convolutions. In
Section (IV), we use the previously described architecture
to build a two-layer DEAP CNN for MNIST handwritten
digit recognition. Finally, in Section (V), we show
an energy-speed benchmark test, where we compare the
performance of DEAP with the empirical dataset DeepBench
[17]. Note that we have made the high-level simulator
and mapping tool for the DEAP architecture publicly
available [18].
arXiv:1907.01525v1 [eess.SP] 23 Apr 2019
II. CONVOLUTIONS AND PHOTONICS
II.1. Convolutions Background
A convolution of two discrete domain functions f and
g is defined by:
$$(f * g)[t] = \sum_{\tau=-\infty}^{\infty} f[\tau]\, g[t-\tau], \tag{1}$$
where (f ∗ g) represents a weighted average of the function
f[τ] when it is weighted by g[−τ] shifted by t. The
weighting function g[−τ] emphasizes different parts of
the input function f[τ] as t changes.
In digital image processing, a similar process is fol-
lowed. The convolution of an image A with a kernel F
produces a convolved image O. An image is represented
as a matrix of numbers with dimensionality H × W, where
H and W are the height and width of the image, respec-
tively. Each element of a matrix represents the intensity
of a pixel at that particular spatial location. A kernel
is a matrix of real numbers with dimensionality R × R.
The value of a particular convolved pixel is defined by:
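$$O_{i,j} = \sum_{k=1}^{R} \sum_{l=1}^{R} F_{k,l}\, A_{i+k,\,j+l}. \tag{2}$$
Using matrix slicing notation, Eq. (2) can be represented
as a dot product of two vectorized matrices:
$$O_{i,j} = \mathrm{vec}(F)^{T} \cdot \mathrm{vec}\!\left((A_{m,n})_{m \in [i,\,i+R],\; n \in [j,\,j+R]}\right)^{T}. \tag{3}$$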
A convolution reduces the dimensionality of the input
image to (H − R + 1) × (W − R + 1), so a padding of
zero values is normally applied around the edges of the
input image to counteract this. A schematic illustration
of a convolution in digital image processing is shown at
the top of Fig. 1.
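The dot-product form of Eqs. (2) and (3) can be illustrated with a minimal pure-Python sketch. The function name `convolve2d`, the 0-based indexing, and the small test image are assumptions made for illustration only; no padding is applied, so the output has size (H − R + 1) × (W − R + 1).

```python
def convolve2d(image, kernel):
    """Unpadded, stride-1 convolution in the sense of Eq. (2), using 0-based indices."""
    H, W = len(image), len(image[0])
    R = len(kernel)
    vec_f = [v for row in kernel for v in row]  # vec(F)
    out = []
    for i in range(H - R + 1):
        out_row = []
        for j in range(W - R + 1):
            # vec of the R x R patch whose top-left corner is (i, j)
            vec_a = [image[i + k][j + l] for k in range(R) for l in range(R)]
            # Eq. (3): one output pixel is a dot product of two vectors
            out_row.append(sum(f * a for f, a in zip(vec_f, vec_a)))
        out.append(out_row)
    return out


image = [[1, 2, 3, 4],
         [5, 6, 7, 8],
         [9, 10, 11, 12],
         [13, 14, 15, 16]]
kernel = [[1, 0],
          [0, -1]]
result = convolve2d(image, kernel)
```

With this 4 × 4 image and 2 × 2 kernel the result is a 3 × 3 output, which matches the (H − R + 1) × (W − R + 1) size quoted above.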
When convolutions are used to perform parallel ma-
trix multiplications in neural networks such as CNNs, a
convolution operation is defined as:
$$O_{i,j} = \mathrm{vec}(F)^{T} \cdot \mathrm{vec}\!\left((A_{m,n,k})_{m \in [iS,\,iS+R],\; n \in [jS,\,jS+R],\; k \in [1,\,D]}\right)^{T}, \tag{4}$$
where the input A has dimensionality H×W ×D, kernel
F has dimensionality R×R×D and D refers to the num-
ber of channels within the input image. The additional
[Figure 1 graphic: panels labeled image, kernel, and output; see caption.]
Figure 1. Schematic illustration of a convolution. At the top
of the figure, an input image is represented as a matrix of
numbers with dimensionality H × W × D, where H, W and
D are the height, width and depth of the image, respectively.
Each element Ai,j of A represents the intensity of a pixel at
that particular spatial location. The kernel F is a matrix with
dimensionality R × R × D, where each element Fi,j is defined
as a real number. The kernel is slid over the image by using a
stride S equal to one. As the image has multiple channels (or
depth) D, the same kernel is applied to each channel. Assuming
H = W, the overall output dimensionality is (H − R + 1)^2.
The bottom of the figure shows how a convolution operation
is generalized into a single matrix-matrix multiplication, where
the kernel F is transformed into a vector F with DR^2 elements,
and the image A is transformed into a matrix A of
dimensionality DR^2 × (H − R + 1)^2. Therefore, the output is
represented by a vector with (H − R + 1)^2 elements.
Table I. Summary of Convolutional Parameters
Parameter   Meaning
N           Number of input images
H           Height of input image including padding
W           Width of input image including padding
D           Number of input channels
R           Edge length of kernel
K           Number of kernels
S           Stride
parameter S is referred to as the “stride” of the convolution.
This convolution is similar to Eq. (3), except that
the outputs from each channel are summed together at
the end and the stride is no longer fixed to 1, as it is
in image processing. The dimensionality of the output
feature is:
$$\left\lceil \frac{H-R}{S} + 1 \right\rceil \times \left\lceil \frac{W-R}{S} + 1 \right\rceil \times K, \tag{5}$$
where K is the number of different kernels applied to an
image, and ⌈·⌉ is the ceiling function. Table (I) contains
a summary of all the convolutional parameters described
so far.
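As a quick check of Eq. (5), the output dimensionality can be computed with a small helper. This is a sketch only; the function name `output_shape` is hypothetical, and the parameter names follow Table (I).

```python
import math


def output_shape(H, W, R, S, K):
    """Output dimensionality from Eq. (5): (height, width, number of kernels)."""
    out_h = math.ceil((H - R) / S + 1)
    out_w = math.ceil((W - R) / S + 1)
    return out_h, out_w, K


# A 28 x 28 input with eight 5 x 5 kernels and stride one
first_layer = output_shape(28, 28, 5, 1, 8)
```

For a 28 × 28 input, R = 5, S = 1 and K = 8, this gives a 24 × 24 × 8 output feature.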
One of the challenges with convolutions is that they
are computationally intensive operations, taking up 86%
to 94% of execution time for CNNs [1]. For heavy work-
loads, convolutions are typically run on graphical pro-
cessing units (GPUs), as they are able to perform many
mathematical operations in parallel. A GPU is a spe-
cialized hardware unit that is capable of performing a
single mathematical operation on large amounts of data
at once. This parallelization allows GPUs to compute
matrix-matrix multiplication at speeds much higher than
a CPU [19]. The convolution operation can be generalized
into a single matrix-matrix multiplication [20].
This is shown at the bottom of Fig. 1, where the kernel
F is transformed into a vector F with dimensionality
$KDR^2 \times 1$, and the image is transformed into a matrix
A of dimensionality $KDR^2 \times \left\lceil \frac{H-R}{S} + 1 \right\rceil \left\lceil \frac{W-R}{S} + 1 \right\rceil K$.
Therefore, the output is represented by a vector with
$\left\lceil \frac{H-R}{S} + 1 \right\rceil \left\lceil \frac{W-R}{S} + 1 \right\rceil K$ elements, where in this particular
case K = 1, S = 1 and H = W.
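The rearrangement into a single matrix product can be sketched as follows for the K = 1 case. This is an illustrative pure-Python version of the idea (often called im2col); the function names, the `image[d][h][w]` layout, and the example values are assumptions, not the layout used by any particular GPU library.

```python
def im2col(image, R, S=1):
    """Rearrange image[d][h][w] into a matrix with D*R*R rows and one column per output pixel."""
    D, H, W = len(image), len(image[0]), len(image[0][0])
    columns = []
    for i in range(0, H - R + 1, S):
        for j in range(0, W - R + 1, S):
            columns.append([image[d][i + k][j + l]
                            for d in range(D) for k in range(R) for l in range(R)])
    # Transpose: rows index kernel elements, columns index output pixels
    return [list(row) for row in zip(*columns)]


def convolve_as_matmul(image, kernel, S=1):
    """Multiply the vectorized kernel[d][k][l] with the im2col matrix (single kernel, K = 1)."""
    R = len(kernel[0])
    vec_f = [v for channel in kernel for row in channel for v in row]
    A = im2col(image, R, S)
    return [sum(vec_f[r] * A[r][c] for r in range(len(vec_f)))
            for c in range(len(A[0]))]


image = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]],   # channel 1
         [[1, 1, 1], [1, 1, 1], [1, 1, 1]]]   # channel 2
kernel = [[[1, 1], [1, 1]],
          [[1, 1], [1, 1]]]
flat_output = convolve_as_matmul(image, kernel)
```

Here D = 2 and R = 2, so the image matrix has DR² = 8 rows and (H − R + 1)² = 4 columns, and the output is a vector with four elements.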
II.2. Silicon Photonics Background
An emerging alternative to GPU computing is optical
computing using silicon photonics for ultrafast informa-
tion processing. Silicon photonics is a technology that al-
lows for the implementation of photonic circuits by using
the existing complementary-metal-oxide-semiconductor
(CMOS) platform for electronics [21]. In recent years,
the silicon photonic based broadcast-and-weight archi-
tecture has been shown to perform multiply-accumulate
operations at frequencies up to five times faster than con-
ventional electronics [22]. Therefore, there is motivation
to explore how photonics can be used to perform convo-
lutions, and how it compares to GPU-based implemen-
tations.
MRRs are the essential devices of our approach. An
MRR is a circular waveguide that is coupled to either
one or two waveguides. Such silicon waveguides can be
manufactured to have a width of 500 nm while having
a thickness of 220 nm. These waveguides have a bend
radius of 5 µm and can support TE and TM polarized
wavelengths between 1.5 µm and 1.6 µm [21]. The single
waveguide configuration is called an all-pass MRR, see
Fig. 2(a).
The light from the waveguide is transferred into the
ring via a directional coupler and then recombined. The
effective index of refraction between the waveguide and
the MRR and the circumference of the MRR cause the re-
combined wave to have a phase shift, thereby interfering
with the intensity of the original light. The transfer function
of the intensity of the light coming out of the through port
with respect to the light going into the input port of the all-pass
resonator is described by:
$$T_n(\phi) = \frac{a^2 - 2ra\cos(\phi) + r^2}{1 - 2ra\cos(\phi) + (ar)^2}. \tag{6}$$
[Figure 2 graphic: (a) MRR with input and through ports; (b) transmission versus phase φ.]
Figure 2. (a) All-pass MRR and (b) transfer function: the
orange curve represents the Lorentzian line shape described
by Eq. (6), centered at the initial phase where the MRR is in
resonance with the incoming light. The blue triangle curve
shows how this phase can be modified by heating the MRR
via the application of a current proportional to Ai.
The parameter r is the self-coupling coefficient, and a
defines the propagation loss from the ring and the direc-
tional coupler. The phase φ depends on the wavelength
λ of the light and radius d of the MRR [23]:
$$\phi = \frac{4\pi^2 d\, n_{\mathrm{eff}}}{\lambda}, \tag{7}$$
where neff is the effective index of refraction between
the ring and waveguide. The value of neff can be modi-
fied to indirectly change the resonance peak. Such tuning
is usually made by applying current to the ring propor-
tional to the variable Ai. This process heats the ring,
yielding a shift of the resonance peak. Figure 2(b) shows
an example of such tuning: the orange curve represents
the Lorentzian line shape described by Eq. (6), centered
in the initial phase of the ring resonator, indicating that
the MRR is in resonance with the incoming light. The
blue triangle curve shows how such phase can be modified
by heating the MRR.
The phase for an all-pass resonator corresponding to
a particular intensity modulation value can be computed
by using Eq. (6):
$$\phi_i = \arccos\left[\frac{A_i\left(1 + (ar)^2\right) - a^2 - r^2}{2ra\,(A_i - 1)}\right], \tag{8}$$
resulting in a modulated intensity equal to Ai:
$$I_{\mathrm{mod}} = T_n(\phi_i)\,|E_0|^2 = A_i, \tag{9}$$
where E0 is the amplitude of the electric field.
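The relation between a target intensity and the required phase can be checked numerically by solving Eq. (6) for cos(φ) and substituting back. The sketch below assumes a normalized field, |E0|² = 1, and uses illustrative values r = 0.9 and a = 0.95 that are not taken from the text; the function names are hypothetical.

```python
import math


def allpass_transmission(phi, r, a):
    """All-pass MRR intensity transfer function, Eq. (6)."""
    num = a**2 - 2 * r * a * math.cos(phi) + r**2
    den = 1 - 2 * r * a * math.cos(phi) + (a * r)**2
    return num / den


def phase_for_transmission(T, r, a):
    """Phase that yields transmission T, obtained by solving Eq. (6) for cos(phi)."""
    cos_phi = (a**2 + r**2 - T * (1 + (a * r)**2)) / (2 * r * a * (1 - T))
    if not -1.0 <= cos_phi <= 1.0:
        raise ValueError("target transmission is outside the range reachable by this ring")
    return math.acos(cos_phi)


r, a = 0.9, 0.95          # illustrative values only
target = 0.5              # desired normalized intensity A_i
phi = phase_for_transmission(target, r, a)
recovered = allpass_transmission(phi, r, a)
```

Only transmissions between (a − r)²/(1 − ar)² and (a + r)²/(1 + ar)² are reachable, which is why the helper raises an error for targets outside that interval.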
An alternative double waveguide configuration is called
the add-drop MRR. The transfer function of the through
port light intensity with respect to the input light is:
$$T_p(\phi) = \frac{(ar)^2 - 2r^2 a\cos(\phi) + r^2}{1 - 2r^2 a\cos(\phi) + (r^2 a)^2}; \tag{10}$$
and the transfer function of the drop port light intensity
with respect to the input light is:
$$T_d(\phi) = \frac{(1-r)^2 a}{1 - 2r^2 a\cos(\phi) + (r^2 a)^2}. \tag{11}$$
[Figure 3 graphic: (a) add-drop MRR with input, through, and drop ports feeding a balanced photodiode (PD) and TIA; (b)-(d) transmission versus phase φ.]
Figure 3. (a) Add-drop configuration and O/E conversion and
amplification. (b) Output of the balanced photodiode, i.e., the
transfer function Tp − Td. The orange circle and green curves
are the drop and through ports, described by Eqs. (11) and (10),
respectively. In panels (c) and (d), the phase-shifted (φ + 0.2)
blue curves show positive and negative kernel values obtained
from the drop and the through outputs, respectively. The orange
triangle curves show how those values can be amplified
by a factor of two using a TIA at the output of the balanced
photodiode. The phase shifts are achieved by the application
of a current through Ai.
In the case where the coupling losses are negligible,
a ≈ 1, the relationship between the add-drop through
and drop transfer functions is Tp = 1 − Td. In addition,
if we connect the through and drop ports into a balanced
photodiode and TIA as in Fig. 3(a), we get an effective
transfer function of g(Tp − Td), where g is the gain of the
TIA. Therefore, we get a modulation of:
$$I_{\mathrm{mod}} = g\left(T_p(\phi_i) - T_d(\phi_i)\right)|E_0|^2 = A_i. \tag{12}$$
At the output of the balanced photodiode, the transfer
function Tp − Td is shown by the blue triangle curve in
Fig. 3(b). The orange circle and green curves are Lorentzian
line shapes, centered at the initial phase where the MRR is in
resonance with the incoming light, described by Eqs. (11)
and (10), respectively. In contrast, the curves in Fig. 3(c) and (d) are
centered at a modified phase (φ + 0.2), set by a specific
value of the current Ai. Here we aim to demonstrate
how to represent positive and negative kernel values in
analog photonics. This can be achieved by incorporating
a balanced PD at the output of the add-drop MRR.
In panels (c) and (d), the blue curves show such positive
and negative kernel values from the drop and the through
outputs, respectively. The orange triangle curves show
the TIA transfer function g(Tp − Td), where g amplifies
Tp − Td by a factor of two.
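The behaviour of the balanced add-drop output can be explored with a short numerical sketch. Note the assumptions: the code uses the textbook expressions for a symmetric add-drop ring, in which the drop-port numerator is (1 − r²)² a and the loss factor a multiplies the cosine term; this form is chosen here because it gives through + drop = 1 in the lossless case a = 1. The function names and the value r = 0.9 are illustrative only.

```python
import math


def add_drop(phi, r, a=1.0):
    """Through and drop intensity transmissions of a symmetric add-drop MRR.

    Assumption: textbook symmetric form with drop numerator (1 - r^2)^2 * a.
    """
    den = 1 - 2 * r**2 * a * math.cos(phi) + (r**2 * a)**2
    through = ((a * r)**2 - 2 * r**2 * a * math.cos(phi) + r**2) / den
    drop = (1 - r**2)**2 * a / den
    return through, drop


def balanced_output(phi, r, a=1.0, gain=1.0):
    """Balanced photodiode followed by a TIA: gain * (through - drop)."""
    through, drop = add_drop(phi, r, a)
    return gain * (through - drop)


r = 0.9                                   # illustrative self-coupling coefficient
on_resonance = balanced_output(0.0, r)    # all light leaves through the drop port
off_resonance = balanced_output(math.pi, r)
```

On resonance the lossless output is −1 and far from resonance it approaches +1, so tuning the phase sweeps the balanced output across both signs; the TIA gain then rescales this range.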
II.3. Dot Products with Photonics
The fundamental operation of a convolution is the dot
product of two vectorized matrices. Therefore, one needs
to understand how to compute a vector dot product us-
ing photonics before proposing an architecture capable
of performing convolutions.
A wavelength multiplexed signal consists of k electro-
magnetic waves, each with angular frequency ωi, i =
1, . . . , k. If it is assumed that each wave has an ampli-
tude of E0, a power enveloping function µi whose modu-
lation frequency is significantly smaller than ωi, then the
slowly varying envelope approximation and a short-time
Fourier transform can be used to derive an expression for
the multiplexed signal in the frequency domain:
$$E_{\mathrm{mux}}(\omega) = \sum_{i=1}^{k} E_0 \sqrt{\mu_i}\, \delta(\omega - \omega_i), \tag{13}$$
where δ(ω − ωi) is the Dirac delta function and µi ≥ 0,
since power envelopes are not negative. If the enveloping
function is prevented from amplifying the electric field,
µi can further be restricted to the domain 0 ≤ µi ≤
1. Next, we introduce tunable linear filters H+(ω) and
H−(ω) such that when they interact with multiple fields,
the following weighted signals are created:
$$E_w^{-}(\omega) = H^{-}(\omega)\, E_{\mathrm{mux}}(\omega), \qquad E_w^{+}(\omega) = H^{+}(\omega)\, E_{\mathrm{mux}}(\omega). \tag{14}$$
Assuming that the two signals are fed into a balanced
photodiode (balanced PD) with spectral response R(ω),
the induced photocurrent is described by:
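$$\begin{aligned}
i_{PD} &= \int_{-\infty}^{\infty} d\omega\, R(\omega) \left( \left|E_w^{+}(\omega)\right|^2 - \left|E_w^{-}(\omega)\right|^2 \right) \\
&= \int_{-\infty}^{\infty} d\omega\, R(\omega) \left( \left|H^{+}(\omega)\right|^2 - \left|H^{-}(\omega)\right|^2 \right) \left|E_{\mathrm{mux}}(\omega)\right|^2 \\
&= \sum_{i=1}^{k} R(\omega_i) \left( \left|H^{+}(\omega_i)\right|^2 - \left|H^{-}(\omega_i)\right|^2 \right) E_0\, \mu_i.
\end{aligned} \tag{15}$$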
Assuming that R(ω) is roughly constant in the area
of spectral interest, one can set $A_i = E_0 R_0 \mu_i$ and
$F_i^{*} = |H^{+}(\omega_i)|^2 - |H^{-}(\omega_i)|^2$, resulting in a photocurrent equal
to
$$i_{PD} = \sum_{i=1}^{k} A_i F_i^{*} = \vec{A} \cdot \vec{F}^{*}. \tag{16}$$
The through and drop ports of an MRR can be used
to implement the linear filters H+ and H− such that
|H+|^2 = Td and |H−|^2 = Tp. Knowing that Tp = 1 − Td
with minimal losses, we can set a particular weight using:
$$F_i^{*} = 2T_d(\phi_i) - 1, \tag{17}$$
[Figure 4 graphic: input intensities A1, ..., Ak, WDM multiplexer (MUX), weight banks F1, ..., Fk, balanced PD, and TIA.]
Figure 4. An electro-optic architecture that performs dot
products. Ai (i = 1, . . . , k) are input elements encoded in
intensities, multiplexed by a WDM and linked to the weight
banks via a silicon waveguide. Fi are filter values that modu-
late the MRRs in the PWB. Drop and through output ports
are connected to a balanced PD, where the matrix multiplication
is performed, followed by a TIA amplifier.
where the phase φi can be obtained by using Eq. (10) and
Eq. (11) to get:
$$\phi_i = \arccos\left[-\frac{1}{2r^2 a}\left(\frac{2(1-r)^2 a}{F_i^{*} + 1} - 1 - (r^2 a)^2\right)\right]. \tag{18}$$
Since Td is a filter that only represents values between 0 and 1,
we can see that $F_i^{*}$ lies between −1 and 1. In
order to perform a dot product with a weight vector $\vec{F}$
whose components are not limited to the range −1 to 1, a
gain gTIA can be applied to the photocurrent such that:
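$$\vec{A} \cdot \vec{F} = g_{\mathrm{TIA}}\, \vec{A} \cdot \vec{F}^{*} = g_{\mathrm{TIA}} \sum_{i=1}^{k} A_i F_i^{*}, \tag{19}$$
if:
$$g_{\mathrm{TIA}} = \max_{1 \le i \le k} |F_i|, \tag{20}$$
then,
$$\vec{F} = g_{\mathrm{TIA}}\, \vec{F}^{*}; \tag{21}$$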
assuming that each φi corresponds to a weighting of
$F_i^{*}$. This electronic gain can be performed using a transimpedance
amplifier (TIA), which can be manufactured
in a standard CMOS process [24] and packaged or integrated
with the photonic chip [21]. A diagram of the
electro-optic architecture described in this section is presented
in Fig. 4. From now on, this amalgamation of
electronic and optical components is referred to as a photonic
weight bank (PWB). PWBs similar to the one in
Fig. 4 have been successfully implemented in the past
[11, 25, 26].
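The role of the TIA gain in Eqs. (19) to (21) can be illustrated with a few lines of arithmetic. This sketch models only the normalization and rescaling, not the optics; the function name `pwb_dot` and the example values are hypothetical.

```python
def pwb_dot(inputs, weights):
    """Dot product with weights normalized to [-1, 1] and restored by a gain."""
    g_tia = max(abs(w) for w in weights)          # Eq. (20)
    if g_tia == 0:
        return 0.0
    normalized = [w / g_tia for w in weights]     # F* = F / g_TIA, from Eq. (21)
    photocurrent = sum(a * f for a, f in zip(inputs, normalized))  # Eq. (16)
    return g_tia * photocurrent                   # Eq. (19)


inputs = [0.2, 0.5, 1.0]       # normalized input intensities A_i
weights = [3.0, -4.0, 2.0]     # kernel values outside the range [-1, 1]
result = pwb_dot(inputs, weights)
direct = sum(a * w for a, w in zip(inputs, weights))
```

The rescaled result agrees with the direct dot product, while every weight actually applied by the bank stays within the range −1 to 1.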
We can represent negative inputs between −1 and 1 by
modifying the power enveloping function to $\mu_i = \frac{1}{2}(x_i + 1)$.
If the same set of derivations is followed, we can
modify Eq. (21) to be:
$$\vec{x} \cdot \vec{w} = g\left(\sum_{i=1}^{k} A_i F_i^{*} + \sum_{i=1}^{k} E_0 R_0 F_i^{*}\right). \tag{22}$$
The second term in this sum is a predictable bias current
term that can conceptually be subtracted before feeding
into the TIA. This is a disadvantage of supporting negative
inputs, as additional optical or electronic control
circuitry would need to be designed. Another trade-off is
a loss in precision due to a larger range of inputs needing
to be represented, analogous to the loss in precision with
signed integers for classical computing.
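The bias term that appears when signed inputs are mapped onto non-negative power envelopes can be made concrete with a small sketch. It assumes E0 R0 = 1 and omits the TIA gain, so it is a simplified illustration of the idea rather than a reproduction of Eq. (22); the function names are hypothetical. With µi = (xi + 1)/2, the measured sum equals half the signed dot product plus half the sum of the weights, and the latter is the predictable bias.

```python
def encode_signed(x):
    """Map inputs in [-1, 1] to power envelopes in [0, 1]: mu = (x + 1) / 2."""
    return [(xi + 1) / 2 for xi in x]


def signed_dot(x, weights):
    """Recover the signed dot product from non-negative envelopes.

    Assumes E0 * R0 = 1 and no TIA gain (simplifications for illustration).
    """
    mu = encode_signed(x)
    measured = sum(m * w for m, w in zip(mu, weights))   # what the weight bank produces
    bias = 0.5 * sum(weights)                            # known once the weights are set
    return 2 * (measured - bias)


x = [-1.0, 0.5, 0.2]
weights = [0.5, -0.25, 1.0]
recovered = signed_dot(x, weights)
direct = sum(xi * w for xi, w in zip(x, weights))
```

The bias depends only on the weights, which is why it can be computed in advance and subtracted, at the cost of extra control circuitry.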
III. PERFORMING CONVOLUTIONS USING
PHOTONICS
The goal of this section is to present a photonic ar-
chitecture capable of performing convolutions for CNNs.
This new architecture is called DEAP.
DEAP is bounded by a maximum number of input channels Dm and a
maximum kernel edge length Rm. These bounding parameters
define the range of convolutional
parameters that a particular implementation of DEAP
can support. If a convolutional parameter described in
Table (I) does not have a corresponding bounding parameter,
the DEAP architecture can support
arbitrary values of that convolutional parameter.
III.1. Producing a Single Convolved Pixel
First, we consider an architecture that can produce
one convolved pixel at a time. To handle convolutions
for kernels with dimensionality up to $R_m \times R_m \times D_m$, we
will require $R_m^2$ lasers with unique wavelengths, since a
particular convolved pixel can be represented as the dot
product of two $1 \times R_m^2$ vectors. To represent the values of
each pixel, we require $D_m R_m^2$ modulators (one per kernel
value), where each modulator keeps the intensity of
the corresponding carrier wave proportional to the normalized
input pixel value. The $R_m^2$ lasers are multiplexed
together using wavelength division multiplexing (WDM),
and the multiplexed signal is then split into $D_m$ separate lines. On every line,
there are $R_m^2$ all-pass MRRs, resulting in $D_m R_m^2$ MRRs
in total. Each WDM line will modulate the signals corresponding
to a subset of $R_m^2$ pixels on channel k, meaning
that the modulated wavelengths on a particular line
correspond to the pixel inputs $(A_{m,n,k})_{m \in [i,\,i+R_m],\; n \in [j,\,j+R_m]}$, where
$k \in [1, D_m]$.
The $D_m$ WDM lines will then be fed into an array
of $D_m$ PWBs. Each PWB will contain $R_m^2$ MRRs
with the weights corresponding to the kernel values at
a particular channel. For example, the PWB on line
k should contain the vectorized weights for the kernel
$(F_{m,n,k})_{m \in [1,\,R_m],\; n \in [1,\,R_m]}$. Each MRR within a PWB should be
[Figure 5 graphic: laser diodes (λ1, ..., λ_{R^2}), WDM multiplexer (MUX), modulation stage (input A), weight banks (kernel F) with TIAs, and voltage adder with resistors R producing the output.]
Figure 5. Photonic architecture for producing a single convolved pixel. Input images are encoded in intensities Al,h, where the
pixel inputs Am,n,k with m ∈ [i, i + Rm], n ∈ [j, j + Rm], k ∈ [1, Dm] are represented as Al,h, l = 1, . . . , D and h = 1, . . . , R^2.
Considering the boundary parameters, we set D = Dm and R = Rm. Likewise, the filter values Fm,n,k are
represented as Fl,h under the same conditions. We use an array of R^2 lasers with different wavelengths λh to feed the MRRs.
The input and kernel values, Al,h and Fl,h, modulate the MRRs via electrical currents proportional to those values. Once the
parallel matrix multiplications are performed, the voltage adder adds all signals from the weight banks. Here,
R denotes resistance values. The output is then the convolved feature.
tuned to a unique resonant wavelength within the multiplexed
signal. The outputs of the weight bank array are
electrical signals, each proportional to the dot product
$(F_{m,n,k})_{m \in [1,\,R_m],\; n \in [1,\,R_m]} \cdot (A_{p,q,k})_{p \in [i,\,i+R_m],\; q \in [j,\,j+R_m]}$. Finally, the signals
from the weight banks need to be added together. This
can be achieved using a passive voltage adder. The output
from this adder will therefore be the value of a single
convolved pixel. Fig. 5 shows a complete picture of what
such an architecture would look like.
To perform a convolution with a kernel edge length R
less than Rm, one can set the kernel values $F_{m,n,k}$ with
$m \in [R+1,\,R_m]$ or $n \in [R+1,\,R_m]$ to zero.
Similarly, if the number of channels D of the kernel is less than
Dm, then the modulators $(A_{m,n,k})_{m \in [1,\,H],\; n \in [1,\,W]}$ should also be
set to zero for $k \in [D+1,\,D_m]$.
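Zeroing the unused weights can be sketched as follows for a single channel. The helper names `pad_kernel` and `patch_dot` are hypothetical, and the example assumes hardware built for Rm = 3 running a 2 × 2 kernel.

```python
def pad_kernel(kernel, R_m):
    """Embed an R x R kernel in an R_m x R_m weight array, zeroing the unused entries."""
    R = len(kernel)
    return [[kernel[m][n] if m < R and n < R else 0.0 for n in range(R_m)]
            for m in range(R_m)]


def patch_dot(weights, patch):
    """Dot product of two equally sized 2D arrays, as one weight bank would compute."""
    return sum(w * p for w_row, p_row in zip(weights, patch)
               for w, p in zip(w_row, p_row))


kernel = [[1, 2],
          [3, 4]]
padded = pad_kernel(kernel, 3)          # hardware supports up to 3 x 3
patch = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
value = patch_dot(padded, patch)
```

Because the extra weights are zero, the pixels outside the top-left 2 × 2 region of the patch do not contribute to the result.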
III.2. Performing a Full Convolution
In the previous section, we have discussed how DEAP
can produce a single convolved pixel. In order to perform
a convolution of arbitrary size, one would need to stride
along the input image and readjust the modulation array.
Since the same kernel is applied across the set of inputs,
the weight banks do not need to be modified until a new
kernel is applied. Fig. 6(a) demonstrates this process on
an input with S = 1. To handle S > 1, the inputs being
passed into DEAP should also be strided accordingly. In
this approach, the inputs should have been zero padded
before being passed into DEAP. In pseudocode, perform-
ing a convolution with K filters can be implemented as
shown in Algorithm 1.
Algorithm 1 Convolutions for CNNs using DEAP
  A is the input image
  F is the kernel
  R is the edge length of the kernel
  K is the number of kernels
  O is a memory block to store the convolution
  S is the stride
  H and W are the height and width of the input image
  function convolve(A, F, R, K, O, S, H, W)
    for (k = 1; k ≤ K; k = k + 1) do
      load kernel weights from F[:,:,:,k]
      for (h = 1; h ≤ H − R + 1; h = h + S) do
        for (w = 1; w ≤ W − R + 1; w = w + S) do
          load inputs from A[h:h+R−1, w:w+R−1, :]
          perform convolution
          store results in O[(h−1)/S + 1, (w−1)/S + 1, k]
        end for
      end for
    end for
  end function
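The loop structure of Algorithm 1 can be sketched in plain Python. This is an illustrative software model only: the nested-list layouts `A[h][w][d]` and `F[k][r][c][d]`, the 0-based indexing, and the function name are assumptions, and the inner sum stands in for the work done optically by the modulators, weight banks and voltage adder.

```python
def convolve(A, F, S=1):
    """Strided multi-kernel convolution: A[h][w][d], F[k][r][c][d] -> O[i][j][k]."""
    H, W, D = len(A), len(A[0]), len(A[0][0])
    K, R = len(F), len(F[0])
    out_h = (H - R) // S + 1      # number of window positions visited per axis
    out_w = (W - R) // S + 1
    O = [[[0.0] * K for _ in range(out_w)] for _ in range(out_h)]
    for k in range(K):
        kernel = F[k]             # weights are loaded once per kernel
        for i, h in enumerate(range(0, H - R + 1, S)):
            for j, w in enumerate(range(0, W - R + 1, S)):
                # one "cycle": load an R x R x D input window and take the dot product
                O[i][j][k] = sum(kernel[r][c][d] * A[h + r][w + c][d]
                                 for r in range(R) for c in range(R) for d in range(D))
    return O


# 4 x 4 single-channel image with values 1..16 and one 2 x 2 kernel, stride 2
A = [[[h * 4 + w + 1] for w in range(4)] for h in range(4)]
F = [[[[1], [0]],
      [[0], [-1]]]]
O = convolve(A, F, S=2)
```

The kernel index is the outermost loop so that the weights change only K times, mirroring the observation that the weight banks need not be modified until a new kernel is applied.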
The DEAP architecture also allows for parallelization
by treating the photonic architecture proposed in the
previous section as a single-output “convolutional unit”.
By creating nconv instances of these convolutional
units, one could produce nconv pixels per cycle
by passing the next set of inputs to each unit. This is
demonstrated in Fig. 6(b) for nconv = 2. The computation
of output pixels can be distributed across the
convolutional units, resulting in a runtime complexity of
$$O\!\left(\frac{KHW}{S^2\, n_{\mathrm{conv}}}\right).$$
Figure 6. (a) Cycling through a convolution using DEAP. (b) Performing a convolution with two convolutional units.
IV. PHOTONIC CONVOLUTIONAL NEURAL
NETWORKS
In this section, we show how DEAP can be used to run
a CNN. CNNs are a type of neural network that was
developed for image recognition tasks. A CNN consists
of some combination of convolutional, nonlinear, pool-
ing and fully connected layers [27], see Fig. 7(a). As
introduced previously, convolutions perform a highly ef-
ficient and parallel matrix multiplication using kernels
[3]. Furthermore, since kernels are typically smaller than
the input images, the feature extraction operation allows
efficient edge detection, therefore reducing the amount of
memory required to store those features.
CNNs are well suited to implementation in photonic
hardware since they demand fewer resources for
matrix multiplication and memory. The linear operation
performed by convolutions allows single feature
extraction per kernel. Hence, many kernels are required
to extract as many features as possible. For this reason,
kernels are usually applied in blocks, allowing the net-
work to extract many different features all at once and
in parallel.
In feed-forward networks, it is typical to use a rectified
linear unit (ReLU) activation function. Since ReLUs are
piecewise linear functions that model an overall nonlinearity,
they allow CNNs to be easily optimized during
training. The pooling layer introduces a stage where a
set of neighboring pixels is combined in a single operation.
Typically, this operation applies
a function that determines the maximum value among
neighboring values. An average operation can be implemented
likewise. These two approaches are called max and
average pooling, respectively. This statistical operation allows
for a direct down-sampling of the image, since the
dimensions of the object are reduced by a factor of two.
This step aims to make the network invariant
and robust to small translations of the detected features.
The triplet, convolution-activation-pooling, is usually
repeated several times for different kernels, keeping the
pooling and activation functions unchanged. Once all
possible features are detected, a fully connected
layer is required for the classification stage. This
layer prepares and shows the solutions of the task.
CNNs are trained by changing the values of the ker-
nels, analogous to how feed-forward neural networks are
trained by changing the weighted connections [28]. The
estimated kernel and weight values are required in the
testing stage. In this work, this stage is performed by
our on-chip DEAP CNN. Figure 7(b) shows a high-level
overview of the proposed testing on-chip architecture.
Here, the testing input values stored in the PC modulate
the intensities of a group of lasers with identical powers
but unique wavelengths. These modulated inputs would
[Figure 7 graphic: (a) CNN pipeline: image input, convolution, activation, pooling, fully connected, with output classes 0 to 9; (b) DEAP block diagram: PC, SDRAM, DAC, laser sources, modulator array, weight bank array, voltage adder, ADC, activation.]
Figure 7. Block diagrams that describe: (a) a typical CNN, which contains convolutions, activation functions, pooling and
fully connected layers. In this case we illustrate the diagram using an MNIST-based recognition task that predicts the number
5; and (b) the DEAP architecture. The input image, kernel weights, and convolved features are stored in the computer (PC).
The commands to implement the activation function off-chip are also stored in the PC. The input image, kernel weights and
convolved features are transferred to the chip via DACs from the SDRAM. Then, the convolution is performed on-chip. Finally,
the output is digitized via an ADC and stored in an SDRAM connected to the computer.
be sent into an array of photonic weight banks, which
would then perform the convolution for each channel.
The kernels obtained in the training step are used to
modulate these weight banks. Finally, the outputs of the
weight banks would be summed using a voltage adder,
which produces the convolved feature. Our simulator
uses the transfer functions of the MRRs, the through-port
and drop-port summing equations at the balanced
PDs, and the TIA gain term to simulate a convolution.
The simulator assumes that the MRRs can only be controlled
with 7 bits of precision, as that has been empirically
observed in a lab setting. The MRR self-coupling
coefficient is equal to the loss, r = a = 0.99 [29], in Eqs. (6),
(10) and (11).
The interfacing of optical components with electronics
would be facilitated by the use of digital-to-analog con-
verters (DACs) and analog-to-digital converters (ADCs),
while the storage of outputs and retrieval of inputs would
be achieved using GDDR SDRAM memory. The
SDRAM is connected to a computer, where the infor-
mation is already in a digital representation. Then, the
implementation of the ReLU nonlinearity and the reuse
of the convolved feature to perform the next convolution
can be performed. The idea is to use the same archi-
tecture to implement the triplet convolution-activation-
pooling on hardware.
In this work, we trained the CNN to perform image
recognition on the MNIST dataset. The training stage
uses the ADAM optimizer, with the back-propagation
algorithm computing the gradients. The optimized
parameters fall into two groups: (i) two different
5 × 5 × 8 kernels and (ii) two fully connected layers
of dimensions 128 × 800 and 128 × 10, together with
their respective bias terms. Each kernel thus consists
of eight different 5 × 5 filters. In the following,
we use our DEAP CNN simulator to classify new input
images drawn from a set of 500 images reserved for
testing. Our simulator works only at the
transfer-function level and does not simulate noise
or distortion from analog components. The process of
feature extraction performed by the DEAP CNN is
illustrated in Fig. 8(a). A 28 × 28 input image from
the test dataset is filtered by the first 5 × 5 × 8
kernel with stride one. The output of this process is
a 24 × 24 × 8 convolved feature, with the ReLU
activation function already applied. Following the
same process, the second group of filters is applied to
this convolved feature to generate the second output,
i.e., a 20 × 20 × 8 convolved feature.
After the second ReLU is applied to the output,
average pooling is used for invariance and down-sampling
of the convolved features. The average pooling is
implemented by a 2 × 2 kernel whose elements are all 1/4.
However, stride one is kept, so the pooled feature
has dimensionality 19 × 19 × 8. The down-sampling is
implemented offline: from the 19 × 19 × 8 output, a simple
algorithm extracts the elements that have even indexes.
The result of this process is a 10 × 10 × 8 pooled output.
Finally, the flattened pooled output is fed to the first
fully connected layer. The resulting vector feeds the
last fully connected layer, whose output gives the
MNIST classification.
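The dimensions quoted above can be checked with a short sketch. It assumes convolutions without padding and treats a single channel as a list of lists; the helper names are illustrative and are not taken from the DEAP simulator.

```python
def conv_output_size(size, kernel, stride=1):
    """Edge length of the output of a convolution without padding."""
    return (size - kernel) // stride + 1


def average_pool_stride_one(feature):
    """Apply a 2 x 2 kernel whose elements are all 1/4, using stride one."""
    rows, cols = len(feature), len(feature[0])
    return [
        [
            0.25 * (feature[i][j] + feature[i][j + 1]
                    + feature[i + 1][j] + feature[i + 1][j + 1])
            for j in range(cols - 1)
        ]
        for i in range(rows - 1)
    ]


def keep_even_indexes(feature):
    """Offline down-sampling: keep rows and columns with even indexes."""
    return [row[::2] for row in feature[::2]]


first = conv_output_size(28, 5)       # first 5 x 5 kernel, stride one
second = conv_output_size(first, 5)   # second 5 x 5 kernel, stride one

# Dummy single-channel feature map standing in for one of the 8 channels.
feature = [[float(i + j) for j in range(second)] for i in range(second)]
pooled = average_pool_stride_one(feature)
downsampled = keep_even_indexes(pooled)

channels = 8
flattened_length = len(downsampled) * len(downsampled[0]) * channels
```

The flattened length matches the 800 inputs of the first fully connected layer.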
The results of the MNIST task solved by our simulated
DEAP CNN are shown in Fig. 8(b). For a test set
of 500 images, we obtained an overall accuracy of 98%.
This performance was compared with that of a standard
two-layer CNN that includes a max pooling layer, which
achieves an overall accuracy of 98.6%. Therefore, we
can conclude that our simulator is sufficiently robust
despite the 7 bits of precision considered in the DEAP
CNN simulation.
[Figure 8 graphic. Panel (a): MNIST input 28×28, Conv + ReLU 24×24×8, Conv + ReLU 20×20×8, Conv-Avg 10×10×8, FC,
output classes 0 to 9. Panel (b): prediction (%) on the vertical axis, ticks from 20 to 100.]
Figure 8. (a) An illustrative block diagram of the two-layer DEAP CNN solving MNIST. (b) Results of the MNIST task using
a simulated DEAP CNN.
V. ENERGY AND SPEED ANALYSES
V.1. Energy Estimation
The energy used by a single DEAP convolutional
unit depends on the R and D parameters. The
100-wavelength limitation for MRRs constrains the maximum
R to be 10, as each multiplexed waveguide will carry $R^2$
signals. The number of MRRs used in the modulator
array is equal to $R^2 D$, meaning that only certain D and
$R^2$ values are allowed for a finite number of MRRs.
Assuming that a maximum of 1024 MRRs can be manufactured
in the modulator array, a convolutional unit can support
a large kernel size with a limited number of channels, R
= 10, D = 12, or a small kernel size with a large number
of channels, R = 3, D = 113. We will consider both edge
cases to get a range of energy consumption values. For
the smaller convolution size, we will have $R^2$ lasers, $R^2$
MRRs and DACs in the modulator array, $R^2 D$ MRRs
and D TIAs in the weight bank array and one ADC to
convert back into a digital signal. With 100 mW per laser,
19.5 mW per MRR, 26 mW per DAC, 17 mW per TIA
[30] and 76 mW per ADC, we get a power usage of 112
W for the large kernel size and 95 W for the small kernel
size. Therefore, we estimate a single convolution unit
to use around 100 W when 1024 modulators are used to
represent inputs.
V.2. DEAP Performance
The time it takes for light to propagate from the WDM
to just before the balanced PDs is estimated by the
following equation:

$$t_{\mathrm{prop}} = \frac{k \cdot 2\pi r_{\mathrm{MRR}}}{c}, \tag{23}$$

where c is the speed of light, $2\pi r_{\mathrm{MRR}}$ is the
circumference of the MRR and k is the number of MRRs.
Assuming 100 MRRs with a radius of 10 µm [11, 31], the PWB
has a propagation time of around 21 ps and a throughput
of $1/t_{\mathrm{prop}} \approx 50$ GS/s. The bottlenecks come from
the fact that the balanced PDs have a throughput of 25
GS/s [30] and the TIA has a throughput of 10 GS/s [32].
An individual MRR can be modulated at speeds of 128
GS/s [31], meaning that the modulation frequency of the
MRRs does not bottleneck the throughput of the PWB.
The throughput of a PWB is around 5 GS/s. The
DACs [33] and ADCs [34] both operate at 5 GS/s and
support up to 7 bits. The GDDR6 SDRAM operates at
16 G with a 256-bit bus size [35]. Consequently, the
speed of the system is limited by the throughput of the
DACs/ADCs, resulting in DEAP producing a single
convolved pixel at 5 GS/s, or t = 200 ps.
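The arithmetic behind these figures can be reproduced with a brief sketch. It evaluates Eq. (23) with the vacuum speed of light and takes the slowest stage as the system throughput; the dictionary of stage rates simply restates the values quoted in the text, and the variable names are illustrative.

```python
import math

SPEED_OF_LIGHT = 299_792_458.0  # m/s, vacuum value as in Eq. (23)


def propagation_time(num_mrrs, radius_m):
    """Eq. (23): time to travel around num_mrrs ring circumferences."""
    return num_mrrs * 2.0 * math.pi * radius_m / SPEED_OF_LIGHT


t_prop = propagation_time(num_mrrs=100, radius_m=10e-6)  # about 21 ps

# Throughput of each stage in samples per second, as quoted in the text.
stage_throughput = {
    "PWB propagation": 1.0 / t_prop,
    "MRR modulation": 128e9,
    "balanced PDs": 25e9,
    "TIA": 10e9,
    "DAC": 5e9,
    "ADC": 5e9,
}

# The slowest stage sets the rate at which convolved pixels are produced.
system_throughput = min(stage_throughput.values())
time_per_pixel = 1.0 / system_throughput
```

With these inputs the minimum is set by the 5 GS/s converters, which gives the 200 ps per convolved pixel used in the runtime estimate.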
DeepBench [17] is an empirical dataset that records
how long various types of GPUs took to perform a
convolution for a given set of convolutional parameters.
Table II contains the parameters used for each of these
benchmarks, and Table III contains the power consumption
of the benchmarked GPUs.

Table II. Benchmarking parameters for DEAP
W    H    D    N   K    R_w  R_h  S
700  161  1    4   32   5    20   2
112  112  64   8   128  3    3    1
7    7    832  16  256  1    1    1

Table III. Benchmarked GPUs with power consumption
GPU                        Power Usage (W)
AMD Vega FE [36]           375
AMD MI25 [37]              300
NVIDIA Tesla P100 [38]     250
NVIDIA GTX 1080 Ti [39]    250

The speeds of the various GPUs were taken directly from
Ref. [17], while the speed of the DEAP convolution was
estimated using the following equation:

$$t_{\mathrm{runtime}} = 200\,\mathrm{ps} \times \frac{N K}{n_{\mathrm{conv}}} \left(\frac{H - R}{S} + 1\right)\left(\frac{W - R}{S} + 1\right). \tag{24}$$
In some of the benchmarks, the kernel edge lengths were
not equal, hence the parameters $R_w$ and $R_h$, which
correspond to the width and height of the kernels. For each of
the selected benchmarks, the parameters satisfy $R^2 D \le 1024$,
meaning that the convolutional network is compatible
with DEAP implementations.
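As a rough illustration, Eq. (24) can be evaluated for the rows of Table II as sketched below. Several choices here are assumptions rather than details stated in the text: R is replaced by R_h in the height factor and R_w in the width factor, the divisions by the stride are rounded down, $n_{\mathrm{conv}}$ is read as the number of convolutional units, and the MRR budget is checked with R_w · R_h · D in place of $R^2 D$.

```python
from dataclasses import dataclass

TIME_PER_PIXEL = 200e-12  # seconds per convolved pixel
MAX_MRRS = 1024           # MRR budget assumed for one convolutional unit


@dataclass(frozen=True)
class Benchmark:
    W: int   # input width
    H: int   # input height
    D: int   # input channels
    N: int   # batch size
    K: int   # number of kernels
    Rw: int  # kernel width
    Rh: int  # kernel height
    S: int   # stride


BENCHMARKS = [
    Benchmark(W=700, H=161, D=1, N=4, K=32, Rw=5, Rh=20, S=2),
    Benchmark(W=112, H=112, D=64, N=8, K=128, Rw=3, Rh=3, S=1),
    Benchmark(W=7, H=7, D=832, N=16, K=256, Rw=1, Rh=1, S=1),
]


def fits_in_one_unit(b):
    """Check the MRR budget, using Rw * Rh * D for non-square kernels."""
    return b.Rw * b.Rh * b.D <= MAX_MRRS


def deap_runtime(b, n_conv=1):
    """Eq. (24) with floor division by the stride (an assumption)."""
    out_h = (b.H - b.Rh) // b.S + 1
    out_w = (b.W - b.Rw) // b.S + 1
    return TIME_PER_PIXEL * b.N * b.K / n_conv * out_h * out_w


runtimes_one_unit = [deap_runtime(b, n_conv=1) for b in BENCHMARKS]
runtimes_two_units = [deap_runtime(b, n_conv=2) for b in BENCHMARKS]
```

Because $n_{\mathrm{conv}}$ appears only in the denominator, doubling the number of units halves every estimated runtime.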
Figure 9. Estimated DEAP convolutional runtime compared
to actual GPU runtimes from DeepBench benchmarks
The estimated DEAP runtimes using one and two
convolutional units were plotted against actual DeepBench
runtimes in Fig. 9. From this, we can see that DEAP with
two convolutional units performs slightly better than all
the GPU benchmarks. While the mean GPU power
consumption is 295 W, DEAP with a single convolutional
unit uses about 110 W. Therefore, DEAP can perform
convolutions between 1.4 and 7.0 times faster than the mean
GPU runtime while using 0.37 times the energy
consumption. Using two convolutional units doubles the speed of
DEAP, meaning that DEAP can be between 2.8 and 14
times faster than a conventional GPU while using almost
0.75 times the energy consumption. It is expected that
DEAP with a single unit performs at a speed broadly
similar to that of the GPUs.
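The ratios quoted above follow from simple arithmetic, sketched here using the power figures of Table III and the 110 W per convolutional unit stated in the text. The mean computed directly from the four table entries comes out close to the 295 W quoted; the function names are illustrative.

```python
GPU_POWER_W = {
    "AMD Vega FE": 375,
    "AMD MI25": 300,
    "NVIDIA Tesla P100": 250,
    "NVIDIA GTX 1080 Ti": 250,
}
DEAP_UNIT_POWER_W = 110.0  # per convolutional unit, as stated in the text

mean_gpu_power = sum(GPU_POWER_W.values()) / len(GPU_POWER_W)


def power_ratio(n_units):
    """DEAP power relative to the mean GPU power for n_units units."""
    return n_units * DEAP_UNIT_POWER_W / mean_gpu_power


def speedup_range(n_units, single_unit_range=(1.4, 7.0)):
    """Scale the single-unit speedup range linearly with the unit count."""
    low, high = single_unit_range
    return (n_units * low, n_units * high)


ratio_one_unit = power_ratio(1)
ratio_two_units = power_ratio(2)
speedup_two_units = speedup_range(2)
```

Both quantities scale linearly with the number of units, which is the basis for the scalability argument in the conclusion.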
VI. CONCLUSION
We have proposed a photonic network, DEAP, suited
for convolutional neural networks. DEAP was estimated
to perform convolutions between 2.8 and 14 times faster
than a GPU while using roughly 0.75 times the energy.
A linear increase in processing speed corresponds to a
linear increase in energy consumption, allowing DEAP
to be as scalable as electronics.
High-level software simulations have shown that DEAP
is theoretically capable of performing a convolution. We
demonstrated that our DEAP CNN is capable of solving
the MNIST handwritten digit recognition task with an
overall accuracy of 98%. The largest bottleneck is the
I/O interfacing with digital systems via DACs and ADCs.
If photonic DACs [40] and ADCs [41] can be built with
higher bit precision, replacing the electronic converters
with optical ones could significantly decrease the runtime
and make the speedup over GPUs even higher.
In order to realize a physical implementation, there are
a number of issues that still need to be solved. Packaging
a silicon photonic chip with an electronic chip with a high
I/O count is a challenging RF engineering task, but it
is a central thrust in the roadmap for silicon photonic
foundries [21]. There also needs to be control circuitry
that routes the outputs of the SDRAM into the rele-
vant DACs and from the ADCs into the SDRAM. Since
we assume that the control circuitry can operate signif-
icantly faster than a memory access, we believe it will
have a negligible impact on the overall throughput.
Another issue is that DEAP processes data in the analog
domain, whereas GPUs perform floating-point arithmetic.
Though floating-point arithmetic does have some degree
of error due to rounding in the mantissa, its errors are
deterministic and predictable. On the other hand, the
errors from photonics are due to stochastic shot,
spectral, Johnson-Nyquist and flicker noise, as well as
quantization noise in the ADC and distortion from the RF
signals applied to the modulators. However, artificially
adding random noise to CNNs has been shown to reduce
over-fitting [42], meaning that some degree of stochastic
behaviour is tolerable in the domain of machine learning
problems.
Finally, MRRs have only been shown to have up to
7 bits of precision, which is significantly lower than the
precision supported by even half-precision (16-bit)
floating-point representations. In conclusion, photonics
has the potential to perform convolutions at speeds faster
than top-of-the-line GPUs while having a lower energy
consumption. Moving forward, the greatest challenges
to overcome have to do with increasing the precision of
photonic components so that they are comparable to
classical floating-point representations. Overall, silicon
photonics has the potential to outperform conventional
electronic hardware for convolutions while having the
ability to scale up in the future.
ACKNOWLEDGMENT
Funding for B.J.S., B.A.M., H.B.M., and V.B. was pro-
vided by the Natural Sciences and Engineering Research
Council of Canada (NSERC) and the Queen’s Research
Initiation Grant (RIG).
[1] X. Li, G. Zhang, H. H. Huang, Z. Wang, and W. Zheng,
in 2016 45th International Conference on Parallel Pro-
cessing (ICPP) (2016) pp. 67–76.
[2] M. Jaderberg, A. Vedaldi, and A. Zisserman, CoRR
abs/1405.3866 (2014), arXiv:1405.3866.
[3] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learn-
ing (The MIT Press, 2016).
[4] Y. LeCun and C. Cortes, (2010).
[5] F. Duport, A. Smerieri, A. Akrout, M. Haelterman, and
S. Massar, Scientific Reports 6, 22381 (2016).
[6] D. Brunner, M. C. Soriano, C. R. Mirasso, and I. Fischer,
Nature Communications 4, 1364 (2013).
[7] K. Vandoorne, P. Mechet, T. Van Vaerenbergh, M. Fiers,
G. Morthier, D. Verstraeten, B. Schrauwen, J. Dambre,
and P. Bienstman, Nature Communications 5, 3541
(2014).
[8] L. Larger, M. C. Soriano, D. Brunner, L. Appeltant, J. M.
Gutierrez, L. Pesquera, C. R. Mirasso, and I. Fischer,
Optics Express 20, 3241 (2012).
[9] P. R. Prucnal and B. J. Shastri, Neuromorphic Photonics
(CRC Press, Taylor & Francis Group, Boca Raton, FL,
USA, 2017).
[10] A. N. Tait, M. A. Nahmias, B. J. Shastri, and P. R. Pruc-
nal, Journal of Lightwave Technology 32, 4029 (2014).
[11] A. N. Tait, A. X. Wu, T. F. de Lima, E. Zhou, B. J. Shas-
tri, M. A. Nahmias, and P. R. Prucnal, IEEE Journal of
Selected Topics in Quantum Electronics 22, 312 (2016).
[12] A. N. Tait, T. Ferreira de Lima, M. A. Nahmias,
H. B. Miller, H.-T. Peng, B. J. Shastri, and P. R.
Prucnal, arXiv:1812.11898 [physics.app-ph] (2018).
[13] T. W. Hughes, M. Minkov, Y. Shi, and S. Fan, Optica 5,
864 (2018).
[14] Y. Shen, N. C. Harris, S. Skirlo, M. Prabhu, T. Baehr-
Jones, M. Hochberg, X. Sun, S. Zhao, H. Larochelle,
D. Englund, and M. Soljačić, Nat. Photonics 11, 441
(2017).
[15] T. F. de Lima, H. Peng, A. N. Tait, M. A. Nahmias,
H. B. Miller, B. J. Shastri, and P. R. Prucnal, Journal of
Lightwave Technology 37, 1515 (2019).
[16] A. Mehrabian, Y. Al-Kabani, V. J. Sorger, and
T. A. El-Ghazawi, CoRR abs/1807.08792 (2018),
arXiv:1807.08792.
[17] Baidu Research, DeepBench.
[18] V. Bangari, B. Marquez, H. Miller, and B. J. Shastri,
DEAP, https://github.com/Shastri-Lab/DEAP (2019).
[19] G. Tan, L. Li, S. Triechle, E. Phillips, Y. Bao, and N. Sun,
in Proceedings of 2011 International Conference for High
Performance Computing, Networking, Storage and Anal-
ysis, SC ’11 (ACM, New York, NY, USA, 2011) pp. 35:1–
35:11.
[20] S. Chetlur, C. Woolley, P. Vandermersch, J. Co-
hen, J. Tran, B. Catanzaro, and E. Shelhamer, CoRR
abs/1410.0759 (2014), arXiv:1410.0759.
[21] A. Rahim, T. Spuesens, R. Baets, and W. Bogaerts, Pro-
ceedings of the IEEE 106, 2313 (2018).
[22] M. A. Nahmias, B. J. Shastri, A. N. Tait, T. F. de Lima,
and P. R. Prucnal, Opt. Photon. News 29, 34 (2018).
[23] W. Bogaerts, P. De Heyn, T. Van Vaerenbergh, K. De
Vos, S. Kumar Selvaraja, T. Claes, P. Dumon,
P. Bienstman, D. Van Thourhout, and R. Baets, Laser &
Photonics Reviews 6, 47 (2012), arXiv:1208.0765v1.
[24] H. Zheng, R. Ma, and Z. Zhu, Analog Integrated Circuits
and Signal Processing 90, 217 (2017).
[25] M. Lipson, Journal of Lightwave Technology 23, 4222
(2005).
[26] A. N. Tait, H. Jayatilleka, T. F. D. Lima, P. Y. Ma, M. A.
Nahmias, B. J. Shastri, S. Shekhar, L. Chrostowski, and
P. R. Prucnal, Opt. Express 26, 26422 (2018).
[27] K. O’Shea and R. Nash, CoRR abs/1511.08458 (2015),
arXiv:1511.08458.
[28] K. Mehrotra, C. K. Mohan, and S. Ranka, Elements of
Artificial Neural Networks (MIT Press, Cambridge, MA,
USA, 1997).
[29] Y. Tan and D. Dai, Journal of Optics 20, 054004 (2018).
[30] Z. Huang, C. Li, D. Liang, K. Yu, C. Santori,
M. Fiorentino, W. Sorin, S. Palermo, and R. G. Beau-
soleil, Optica 3, 793 (2016).
[31] J. Sun, R. Kumar, M. Sakib, J. B. Driscoll, H. Jayatilleka,
and H. Rong, Journal of Lightwave Technology 37, 110
(2019).
[32] M. Atef and H. Zimmermann, Analog Integr. Circuits
Signal Process. 76, 367 (2013).
[33] B. Sedighi, M. Khafaji, and J. C. Scheytt, International
Journal of Microwave and Wireless Technologies 4, 275
(2012).
[34] J. Fang, S. Thirunakkarasu, X. Yu, F. Silva-Rivas,
C. Zhang, F. Singor, and J. Abraham, IEEE Transac-
tions on Circuits and Systems I: Regular Papers 64, 1673
(2017).
[35] Micron Technology, Inc., GDDR6 SGRAM MT61K256M32 8Gb:
2 channels x16/x8 GDDR6 SGRAM.
[36] Advanced Micro Devices, Inc., Radeon Vega Frontier Edition
(liquid-cooled).
[37] Advanced Micro Devices, Inc., Radeon Instinct MI25
accelerator.
[38] NVIDIA Corporation, NVIDIA Tesla P100 GPU accelerator.
[39] NVIDIA Corporation, GeForce GTX 1080 Ti.
[40] F. Zhang, B. Gao, X. Ge, and S. Pan, Optical Engineer-
ing 55, 031115 (2015).
[41] M. A. Piqueras, P. Villalba, J. Puche, and J. Martí,
in 2011 IEEE International Conference on Microwaves,
Communications, Antennas and Electronic Systems
(COMCAS 2011) (2011) pp. 1–6.
[42] Z. You, J. Ye, K. Li, and P. Wang, CoRR
abs/1805.08000 (2018), arXiv:1805.08000.
