PCNNA: A Photonic Convolutional Neural Network Accelerator by Mehrabian, Armin et al.
Armin Mehrabian, Yousra Al-Kabani, Volker J Sorger, and Tarek El-Ghazawi
Department of Electrical and Computer Engineering
The George Washington University, Washington, DC, USA
{armin,yousra,sorger,tarek}@gwu.edu
Abstract— Convolutional Neural Networks (CNN) have been
the centerpiece of many applications including but not limited
to computer vision, speech processing, and Natural Language
Processing (NLP). However, the computationally expensive
convolution operations impose many challenges to the perfor-
mance and scalability of CNNs. In parallel, photonic systems,
which are traditionally employed for data communication, have
enjoyed recent popularity for data processing due to their
high bandwidth, low power consumption, and reconfigurability.
Here we propose a Photonic Convolutional Neural Network
Accelerator (PCNNA) as a proof-of-concept design to speed up
the convolution operation for CNNs. Our design is based on the
recently introduced silicon photonic microring weight banks,
which use broadcast-and-weight protocol to perform Multiply
And Accumulate (MAC) operation and move data through
layers of a neural network. Here, we aim to exploit the synergy
between the inherent parallelism of photonics in the form
of Wavelength Division Multiplexing (WDM) and sparsity of
connections between input feature maps and kernels in CNNs.
While our full system design offers more than 3 orders of
magnitude speedup in execution time, its optical core potentially
offers more than 5 orders of magnitude speedup compared to
state-of-the-art electronic counterparts.
I. INTRODUCTION
CNNs have been able to reach record-breaking perfor-
mance in many tasks including but not limited to computer
vision [1], speech recognition [2], and NLP [3]. However, the
success comes at the cost of high computational demands.
Convolution operations account for roughly 90% of the
total operations in a CNN [4]. While CNNs enjoy highly-
parallel operations within a layer, data dependencies across
layers challenge any attempt of inter-layer parallelization.
The latter also poses scalability burdens in terms of power
and throughput. Such challenges are even more magnified
when electronics seem to be hitting fundamental power and
speed limitations.
Photonics is considered a promising alternative to
electronics both for communication and, more recently, for
computation [5]. Photonic systems offer inherent parallelism
through their Wavelength Division Multiplexing
(WDM) capability. Low Light Matter Interaction (LMI)
makes photonics a great choice for signal transmission over
long distances with minimal loss and energy. In addition, the
linear nature of light can be exploited to perform linear math-
ematical operations such as multiplication and addition. This
makes photonics an appealing choice for the implementation
of CNNs due to their heavy reliance on MAC operation.
©2018 IEEE
Previous work on CNN inference hardware has primarily
focused on electronic implementations, either in the form of
FPGA [6], [7] or ASIC designs [8], [9]. Despite some recent
efforts to implement neural networks using optics [10], [11],
to the authors’ knowledge an optical realization of CNNs has
not been explored yet.
In this paper we propose a photonic convolution accelerator
for CNN inference based on the recently proposed
Micro-Ring Resonator (MRR) banks and the broadcast-and-weight
protocol [10]. We summarize the main contributions
of this work as:
• A first proposed design for an optical CNN based on the
recently proposed broadcast-and-weight protocol with
MRR weight banks.
• An optimization technique is proposed to reduce the
number of microrings for the optical CNN realization.
• An analytical framework is introduced to identify the
number of microrings for any CNN layer and to estimate
the execution time.
II. CONVOLUTIONAL NEURAL NETWORKS (CNNS)
The architecture of CNNs enables them to receive inputs of
higher-dimensional shape and construct a hierarchy of feature
representations. High-dimensional inputs are inspected for the
presence of features (kernels) learned during a training process.
This inspection process is carried out through a series of 4D
convolution operations. Thus, the output of a convolution layer
is a condensed set of values indicating the presence or
absence of features in the input. For that reason, the output of
a layer in a CNN is referred to as a feature map. Unlike in dense
fully-connected layers, a kernel in a CNN only observes and
operates on a narrow window of inputs referred to as the
receptive field. The size of this window equals the size of the
kernel.
Moreover, the connections between the kernel values and
the input feature map receptive field are one-to-one rather
than all-to-all. In other words, in CNNs convolutional layers
enjoy sparsity of connections between input feature maps
and kernels. This sparsity results in interesting characteristics
including input feature map reuse. As the name suggests, a
single feature map is reused as the convolution input for
many different kernels. In this paper we will exploit this
property of CNNs as the conceptual foundation of our design.
III. PHOTONIC MICRORING RESONATOR (MRR) BANKS
In [10], the authors proposed a photonic ANN design based
on the broadcast-and-weight protocol. Figure 1 shows the
arXiv:1807.08792v1 [cs.ET] 23 Jul 2018
TABLE I
SUMMARY OF CONVOLUTION LAYER PARAMETERS USED IN THIS WORK
Parameter   Description
n           Input feature map height and width
m           Kernel height and width
p           Padding size
s           Stride step size
nc          Input feature map number of channels
Ninput      Input feature map size
Noutput     Output feature map size
Nkernel     Size of kernel
conceptual design of the broadcast-and-weight protocol. In
this model, MRR banks and photodiodes perform MAC
operations while the broadcast-and-weight protocol carries MAC
results across layers. In the broadcast-and-weight protocol, each
neuron output is multiplexed onto a distinct light wavelength
using Laser Diodes (LD). Multiplexed wavelengths are bundled
together and placed on a waveguide to be broadcast to
the destination layer. At the destination layer, each neuron
receives all the incoming wavelengths. Each wavelength is
then weighted in amplitude by its corresponding microring.
Multiplication is carried out by tuning each ring in and
out of resonance with its respective laser wavelength. Finally, a
photodiode sums up all the incoming wavelengths into an
aggregate photo-current.
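The weighting and summation steps described above can be illustrated numerically. The following is a minimal sketch, not the implementation from [10]: it assumes each microring applies an effective weight normalized to [-1, 1] and that the photodiode output is simply the sum of the weighted amplitudes. The function names weight_bank_mac and broadcast_and_weight_layer are hypothetical.

```python
import numpy as np

def weight_bank_mac(inputs, weights):
    """Model one MRR weight bank followed by a photodiode.

    inputs  : amplitudes carried on distinct wavelengths (one per source neuron)
    weights : effective per-wavelength ring weights, assumed normalized to [-1, 1]
    """
    inputs = np.asarray(inputs, dtype=float)
    weights = np.asarray(weights, dtype=float)
    if inputs.shape != weights.shape:
        raise ValueError("one microring is needed per incoming wavelength")
    weighted = inputs * weights   # each ring scales its own wavelength
    return weighted.sum()         # the photodiode sums all wavelengths

def broadcast_and_weight_layer(inputs, weight_matrix):
    """Every destination neuron receives all wavelengths and applies its own bank."""
    return np.array([weight_bank_mac(inputs, row) for row in weight_matrix])

# Three source neurons broadcast to two destination neurons.
x = np.array([0.2, 0.5, 1.0])
W = np.array([[1.0, 0.0, -1.0],
              [0.5, 0.5, 0.5]])
y = broadcast_and_weight_layer(x, W)
print(y)
```

Under these assumptions a full layer reduces to a matrix-vector product, which is why the ring count grows with the product of the sizes of the two layers.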
Fig. 1. Broadcast-and-weight protocol using MRR banks. Incoming
bundled wavelengths propagate through the MRR banks. Each bank weights
each wavelength by tuning the corresponding ring. A photodiode
sums up all the wavelengths into a photo-current. The photo-current
modulates a laser beam of wavelength λm. All outgoing laser beams are
multiplexed together to be broadcast to the next layer.
IV. DESIGN METHODOLOGY
PCNNA takes advantage of the MRR banks proposed in [10]
to perform MAC operations. However, one major drawback
of this scheme is that the number of microrings required
to perform the multiplication part of MAC in each layer
scales with Ni × Ni+1, where N is the number of nodes in
a layer and i is the layer number. This multiplicative (roughly
quadratic) growth in the number of microrings makes
implementation of such networks challenging. For a kernel of
size Nkernel, at each location of the kernel over the input
feature map, only the Nkernel values corresponding to the
receptive field of the input feature map take part in the
convolution. Therefore, using MRR banks, we only need to
allocate Nkernel microrings per kernel for weighting, and the
remaining Ninput − Nkernel values can be ignored. We will
refer to these values as non-receptive field values for the rest
of this paper. Ignoring the non-receptive field values results in
large savings in both the number of wavelengths that represent
the input feature map and the number of microrings required in
the following layer to demultiplex incoming wavelengths.
Figure 2 conceptually shows the effect of filtering
non-receptive field values in MRRs for the convolution
operation: (a) with no filtering of the non-receptive field
values and (b) with non-receptive field values filtered.
Current CNNs comprise tens, if not hundreds, of layers
with almost the same range of kernels per layer [1], [12],
[13]. While filtering non-receptive field values results in large
savings in the number of microrings and, in turn, power
consumption, implementing a full CNN using MRRs still
requires a large footprint and high power consumption. Therefore,
in PCNNA we base our design on the implementation
of a single CNN layer and virtually reuse it sequentially
for different layers.
In PCNNA convolution layers are processed sequentially.
Convolution result values of each layer are stored back to the
off-chip DRAM. In addition, for each location of the kernels
corresponding to a receptive field, partial convolutions are
processed sequentially at the optical core of PCNNA. But,
as multiple kernels share the same receptive field values,
convolution computations for different kernels are performed
in parallel. Figure 3 illustrates the parallel execution of K
kernels as they progress across the input feature map.
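The execution order described above (locations visited sequentially, all K kernels evaluated in parallel at each location) can be sketched in software. This is an illustrative model only, assuming no padding and a square input; the function name pcnna_style_conv and the array shapes are hypothetical, and the matrix-vector product stands in for what the optical core would compute in one step.

```python
import numpy as np

def pcnna_style_conv(feature_map, kernels, stride=1):
    """feature_map: (n, n, nc); kernels: (K, m, m, nc). No padding is applied."""
    n = feature_map.shape[0]
    K, m = kernels.shape[0], kernels.shape[1]
    out = (n - m) // stride + 1
    weight_banks = kernels.reshape(K, -1)   # K banks with Nkernel weights each
    result = np.zeros((out, out, K))
    for i in range(out):                    # kernel locations are visited one by one
        for j in range(out):
            r, c = i * stride, j * stride
            field = feature_map[r:r + m, c:c + m, :].reshape(-1)
            # The same receptive field feeds all K banks, so every kernel's
            # MAC for this location is produced in a single step.
            result[i, j, :] = weight_banks @ field
    return result

rng = np.random.default_rng(0)
fmap = rng.standard_normal((9, 9, 2))       # hypothetical 9 x 9 x 2 input
kern = rng.standard_normal((5, 3, 3, 2))    # 5 hypothetical kernels of 3 x 3 x 2
out = pcnna_style_conv(fmap, kern)
print(out.shape)
```

With these example sizes the loop visits 7 × 7 = 49 locations, and the kernel weights stay fixed while only the receptive field changes between iterations.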
At a high level, PCNNA runs on two clock domains:
a fast clock domain (5GHz), which runs the optical sub-systems
and their immediate electronic circuitry, and a slower
main clock domain to interface with the external environment.
Figure 4 depicts the overall hardware architecture
of our design. PCNNA consists of a weighting MRR bank
repository whose microrings are tuned to the kernel weights.
Kernel weights are initially stored in an off-chip DRAM
memory. Upon arrival of a new layer request, kernel weights
are loaded from the off-chip DRAM into the Kernel Weights
Buffer. The buffered digital weights are then converted into
analog voltages, which control the tuning of the microrings
in the MRR banks.
Similarly, the input feature maps are initially stored in the
off-chip DRAM. It should be noted that over the execution
of one layer of a CNN the kernel weights do not change.
However, due to sequential progression of kernels over the
input feature maps, the input values are regularly updated.
For instance, in Figure 3 the values corresponding to kernels
do not change until a new layer is loaded, but the input
receptive field goes through 49 cycles.
For each receptive field corresponding to a particular
location of the kernel, a subset of input feature map values
(Nkernel) is loaded into the Input Buffer. These buffered
values are then moved closer to the PCNNA core and stored
in a small but fast cache memory. Each value in the cache
memory is then converted to an analog signal using the input
Digital-to-Analog Converters (DACs). Laser beams of different
wavelengths generated by Laser Diodes (LD) are modulated by
the analog signals from the DACs. The laser beams then pass
through the MRR banks and their output photodiodes, which
perform the multiplication and summation operations respectively.
This process is fast (limited by the flight time of light) once
all input values are loaded and converted to analog signals.
Even the integrated photodiodes’ operating frequency at 0
bias voltage can be as high as tens, if not hundreds, of GHz
[14]. Hence, the whole optical weighting and summation fits
within a single clock cycle of our fast clock domain. Lastly, the
computed convolution values are digitized through the
ADC and stored into the off-chip DRAM.
Fig. 2. MRR bank for an input feature map of size 16×16 and 5 kernels of size 3×3, a) without filtering the input feature map and b) with input feature
map filtered to only pass through receptive field. It can be seen that taking advantage of the narrow receptive field results in fewer total microrings.
Fig. 3. Parallel execution of kernels as they sequentially progress over
various locations of the input feature map.
The latter procedure repeats for every location of kernels
across the input feature map. For any consecutive location of
kernels within a layer, only a fraction of input feature map
values proportional to the size of the stride is required to be
loaded into the cache. As a result, the bandwidth required
for continuous loading of various receptive field values of
input feature maps is minimized.
Fig. 4. Conceptual high-level hardware architecture of PCNNA. The shaded
areas, which include the core optical components, run off a fast clock (5GHz).
Buffers isolate the fast optical core from the outside slow clock environment.
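The stride-proportional loading described above can be illustrated with a small sliding-window cache model. This is a sketch under stated assumptions: it covers only a horizontal move within one row of kernel locations, it assumes the stride does not exceed the kernel width, and the helper name slide_right is hypothetical.

```python
import numpy as np

def slide_right(cache, feature_map, row, col, stride):
    """Update a cached (m, m, nc) receptive field after a horizontal move.

    cache holds feature_map[row:row+m, col:col+m, :]. The function returns the
    window starting at column col + stride and the number of values loaded.
    """
    m = cache.shape[0]
    if stride > m:
        raise ValueError("this sketch assumes stride <= kernel width")
    keep = cache[:, stride:, :]                                     # columns still valid
    fresh = feature_map[row:row + m, col + m:col + m + stride, :]   # newly loaded columns
    return np.concatenate([keep, fresh], axis=1), fresh.size

rng = np.random.default_rng(1)
fmap = rng.standard_normal((6, 6, 2))   # hypothetical 6 x 6 x 2 input
m, stride = 3, 1
cache = fmap[0:m, 0:m, :].copy()        # first location: all m*m*nc values loaded
cache, loaded = slide_right(cache, fmap, row=0, col=0, stride=stride)
print(loaded)                            # values fetched for the second location
```

In this model only m × s × nc values are fetched per horizontal step, compared with m × m × nc values for the first location.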
V. EVALUATIONS
In this section we develop analytical models for the
evaluation of our design in terms of the area consumed by
microrings and the execution time.
A. Microring Area
For simplicity we assume that input feature maps are
square-face volumes. As a result, the sizes of input feature
maps and kernels are as follows,
\[ N_{input} = n \times n \times n_c \tag{1} \]
\[ N_{kernel} = m \times m \times n_c \tag{2} \]
where n and m are the sizes of the input feature map and kernel
in the x and y directions, and nc is the number of channels. For
a given input feature map and a set of K kernels, the output
feature map will have the size,
\[ N_{output} = \left( \left[ \frac{n + 2p - m}{s} \right] + 1 \right)^2 \times K \tag{3} \]
where p is the size of the padding, s is the size of the stride,
and K is the number of kernels. Given the above input layer,
without any filtering of the non-receptive field values, the
number of required microrings would be,
\[ N_{microrings} = N_{input} \times K \times N_{kernel} \tag{4} \]
By filtering the non-receptive field values, the total number
of rings will drop to:
\[ N_{microrings} = K \times N_{kernel} \tag{5} \]
One important takeaway from equation 5 is that, unlike
equation 4 where the total number of rings scales with the
product of input size, number of kernels, and kernel size,
here the total number of rings scales linearly with the number
of kernels K and the kernel size Nkernel.
For instance, the first convolutional layer of AlexNet with
an input feature map of shape 224 × 224 × 3 and 96
kernels of shape 11 × 11 × 3 will require approximately
5.2 billion microrings without filtering non-receptive field
values. However, once non-receptive field values are filtered,
the same number drops to about 35 thousand. The latter translates
into a saving of more than 150k× in the number of microrings.
Similarly, the 4th layer of AlexNet, which accounts for the
largest number of kernel weights, will require 3456 microrings.
Considering a microring size of 25µm × 25µm [10], it
takes an area of 2.2mm2 to fit all the microrings. Figure
5 compares the number of microrings for different layers of
AlexNet.
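The AlexNet figures quoted above follow directly from equations 1, 2, 4, and 5, and the arithmetic can be checked with a short script. This is a minimal sketch; the function names are hypothetical, and the area estimate simply multiplies the ring count by the 25µm × 25µm footprint without any routing or spacing overhead.

```python
def n_input(n, nc):
    """Equation 1: size of a square input feature map."""
    return n * n * nc

def n_kernel(m, nc):
    """Equation 2: size of one kernel."""
    return m * m * nc

def microrings_unfiltered(n, m, nc, K):
    """Equation 4: ring count when non-receptive field values are not filtered."""
    return n_input(n, nc) * K * n_kernel(m, nc)

def microrings_filtered(m, nc, K):
    """Equation 5: ring count when only the receptive field is weighted."""
    return K * n_kernel(m, nc)

def ring_area_mm2(num_rings, side_um=25.0):
    """Footprint of num_rings square rings of side side_um, in mm^2 (no spacing assumed)."""
    return num_rings * side_um ** 2 / 1e6

# First convolutional layer of AlexNet as given in the text.
unfiltered = microrings_unfiltered(n=224, m=11, nc=3, K=96)
filtered = microrings_filtered(m=11, nc=3, K=96)
saving = unfiltered // filtered
print(unfiltered, filtered, saving)   # 5245599744 34848 150528

# Area for the 3456 rings quoted for the 4th layer.
print(ring_area_mm2(3456))
```

The saving factor equals Ninput, because equation 4 differs from equation 5 only by that multiplier.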
Fig. 5. Total number of microrings required in MRR banks for different
convolutional layers of AlexNet for two cases, namely redundant microrings
Filtered and Not-Filtered.
B. Execution Time
Here we derive an analytical estimation of the execution
time for PCNNA. As discussed in Section IV, the
optical convolution core of PCNNA is capable of computing
convolutions of multiple kernels in parallel for a single
receptive field within a single clock cycle. We name this
time window Tmac,i, which equals the time to complete all
multiply-and-accumulate operations for a series of kernels
over the i-th receptive field location. The number of locations
(Nlocs) kernels can have on the input feature maps is found
by dividing equation 3 by the number of output channels, K,
\[ N_{locs} = \frac{N_{output}}{K} = \left( \left[ \frac{n + 2p - m}{s} \right] + 1 \right)^2 \tag{6} \]
Therefore, without considering the electronic IO limitations,
the computation time to perform a full convolution
for an input feature map and K kernels is,
\[ T_{conv} = N_{locs} \times \frac{1}{f_{clock}} \tag{7} \]
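Equations 6 and 7 can be evaluated with a few lines of code. The sketch below is illustrative: it reads the square brackets in equation 6 as integer (floor) division, which is an assumption, and it uses the 16×16 input and 3×3 kernel sizes of Figure 2 with an assumed padding of 0 and stride of 1. The function names are hypothetical.

```python
def num_locations(n, m, p, s):
    """Equation 6, reading the square brackets as integer (floor) division."""
    return ((n + 2 * p - m) // s + 1) ** 2

def optical_conv_time(n, m, p, s, f_clock_hz=5e9):
    """Equation 7: one fast-clock cycle per kernel location, for any number of kernels."""
    return num_locations(n, m, p, s) / f_clock_hz

# 16 x 16 input, 3 x 3 kernels; padding 0 and stride 1 are assumed values.
locs = num_locations(n=16, m=3, p=0, s=1)
t_conv = optical_conv_time(n=16, m=3, p=0, s=1)
print(locs, t_conv)   # 196 locations, about 39.2 ns at 5 GHz
```

Neither function takes K as an argument, which reflects that the optical-core time does not depend on the number of kernels.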
It should be noted that Tconv in equation 7 is independent
of the number of kernels. This allows for increasing the
number of kernels without sacrificing execution time. The
only overhead associated with an increased number of kernels
in PCNNA is the allocation of more dedicated microrings
per kernel. However, the number of microrings increases only
linearly with the number of kernels. The total execution
times based on a 5GHz clock for each layer of AlexNet
using PCNNA are shown in Figure 6. This result shows that
the PCNNA core has the potential for speedups of up to
5 orders of magnitude in comparison to its cutting-edge
electronic counterparts. However, the performance of a full
system implementation of PCNNA is bound by the electronics,
both at the front-end and the back-end. On the front-end, for
each iteration of kernels across the input feature map, the
corresponding receptive field values are loaded from the
off-chip DRAM into the buffer. For each convolution layer, the
first location of kernels on the input feature map requires
loading Nkernel = m × m × nc values into the buffer. However, as
the kernel moves across the input feature map, for subsequent
locations of kernels, only a subset of the Nkernel values, equal
to nc × m × s (see equation 8), must be updated. The stride
value s is usually 1, and in general smaller stride values are
preferred over larger ones as they tend to retain more
information at the boundaries of kernel locations on inputs.
Fig. 6. Comparison of execution time for convolution layers of AlexNet.
PCNNA(O) indicates the Optical core of PCNNA without electronic IO
constraints. PCNNA(O+E) indicates full system comprising Optical and
Electronic sub-systems.
Buffered inputs are cached in the SRAM memory [15],
which has a 128kb capacity and can store 8 thousand 16-bit
values. The access time for the memory is 7ns and it has a
footprint of 0.443mm2. Cached values need to be converted
to analog signals using Digital-to-Analog Converters (DAC).
In PCNNA, DACs operate at a rate of 6GSa/s [16] while
each takes up an area of 0.52mm2. Our design comprises 1
kernel weight DAC and 10 input DACs. It is worth noting
that for every single set of kernel weights for a CNN layer,
multiple input values need to be loaded corresponding to
different locations of kernels over the input feature map.
Here, analog input values from the DACs modulate the laser
beams through Mach-Zehnder Modulators (MZM), which are
usually faster than the 5GHz clock.
At the output, calculated convolutions are digitized with
a 2.8GSa/s Analog-to-Digital Converter (ADC) [17] and
stored into the off-chip DRAM through the output buffer.
Considering all hardware specifications, the speed bottleneck
of PCNNA is the DAC. For every location of kernels over
the input feature map, each DAC needs to sequentially convert
its share of the updated digital inputs to analog values at each
stride. For the largest layer of AlexNet with a stride of 1 and
10 DACs (NDAC), this number equals:
\[ N_{\text{updated-inputs}} = \frac{n_c \times m \times s}{N_{DAC}} = \frac{384 \times 3 \times 1}{10} \approx 116 \tag{8} \]
We calculated the execution time using this speed constraint.
Figure 6 reports the estimated execution time of convolution
layers of AlexNet on PCNNA in comparison with
Eyeriss [18] and YodaNN [19]. In Figure 6, PCNNA(O)
indicates the execution times of the purely optical core and
PCNNA(O+E) represents the estimated execution times of the
full system considering the electronic constraints. Even with
electronic IO speed restrictions, the full PCNNA system still has
the potential to save more than 3 orders of magnitude
in execution time.
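The DAC constraint of equation 8 can be reproduced numerically. The sketch below rounds the per-DAC share up to a whole number of conversions, which matches the value of 116 in equation 8. The per-location conversion time is an added assumption (one sample per value at the 6GSa/s rate quoted for the DAC) and is not a formula stated in this paper. The function names are hypothetical.

```python
def updated_inputs_per_dac(nc, m, s, n_dac):
    """Equation 8, rounded up: values each DAC converts per kernel move."""
    total = nc * m * s
    return -(-total // n_dac)   # integer ceiling division

def dac_time_per_location(nc, m, s, n_dac, dac_rate_sps=6e9):
    """Assumed model: each DAC converts its values back to back, one sample per value."""
    return updated_inputs_per_dac(nc, m, s, n_dac) / dac_rate_sps

# Largest AlexNet layer as used in equation 8: nc = 384, m = 3, s = 1, 10 input DACs.
per_dac = updated_inputs_per_dac(nc=384, m=3, s=1, n_dac=10)
t_loc = dac_time_per_location(nc=384, m=3, s=1, n_dac=10)
print(per_dac, t_loc)   # 116 conversions, roughly 19.3 ns under the assumed model
```

Under this assumed model the DAC conversion time per location is far longer than one 5GHz clock cycle (0.2 ns), which is consistent with the DAC being identified as the speed bottleneck.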
VI. CONCLUSIONS
In this paper we presented a proof-of-concept photonic
accelerator for convolutional neural networks. Our proposed
accelerator is based on the broadcast-and-weight protocol, which
takes advantage of microring weight banks to perform
multiply-and-accumulate operations. In PCNNA, we exploited
the sparsity of connections between input feature maps
and kernels to reduce the number of microrings required
to implement modern convolutional neural networks. We
showed that the PCNNA optical core has the potential of
achieving speedups of up to 5 orders of magnitude. However,
electronic IO imposes bandwidth limitations on efficiently
transferring input data to the optical core. Yet, even when taking
these electronic IO limitations into account, this optical
accelerator shows a 3 orders of magnitude execution time
improvement over electronic engines.
REFERENCES
[1] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image
recognition,” in Proceedings of the IEEE conference on computer
vision and pattern recognition, 2016, pp. 770–778.
[2] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen,
R. Prenger, S. Satheesh, S. Sengupta, A. Coates et al., “Deep
speech: Scaling up end-to-end speech recognition,” arXiv preprint
arXiv:1412.5567, 2014.
[3] D. Chen, A. Fisch, J. Weston, and A. Bordes, “Reading wikipedia
to answer open-domain questions,” arXiv preprint arXiv:1704.00051,
2017.
[4] J. Cong and B. Xiao, “Minimizing computation in convolutional neural
networks,” in International conference on artificial neural networks.
Springer, 2014, pp. 281–290.
[5] D. A. Miller, “Attojoule optoelectronics for low-energy information
processing and communications,” Journal of Lightwave Technology,
vol. 35, no. 3, pp. 346–396, 2017.
[6] C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, and J. Cong, “Optimizing
fpga-based accelerator design for deep convolutional neural networks,”
in Proceedings of the 2015 ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays. ACM, 2015, pp. 161–170.
[7] N. Suda, V. Chandra, G. Dasika, A. Mohanty, Y. Ma, S. Vrud-
hula, J.-s. Seo, and Y. Cao, “Throughput-optimized opencl-based
fpga accelerator for large-scale convolutional neural networks,” in
Proceedings of the 2016 ACM/SIGDA International Symposium on
Field-Programmable Gate Arrays. ACM, 2016, pp. 16–25.
[8] S. Han, X. Liu, H. Mao, J. Pu, A. Pedram, M. A. Horowitz, and
W. J. Dally, “Eie: efficient inference engine on compressed deep neural
network,” in Computer Architecture (ISCA), 2016 ACM/IEEE 43rd
Annual International Symposium on. IEEE, 2016, pp. 243–254.
[9] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P.
Strachan, M. Hu, R. S. Williams, and V. Srikumar, “Isaac: A con-
volutional neural network accelerator with in-situ analog arithmetic
in crossbars,” ACM SIGARCH Computer Architecture News, vol. 44,
no. 3, pp. 14–26, 2016.
[10] A. N. Tait, T. F. Lima, E. Zhou, A. X. Wu, M. A. Nahmias, B. J.
Shastri, and P. R. Prucnal, “Neuromorphic photonic networks using
silicon photonic weight banks,” Scientific Reports, vol. 7, no. 1, p.
7430, 2017.
[11] Y. Shen, N. C. Harris, S. Skirlo, M. Prabhu, T. Baehr-Jones,
M. Hochberg, X. Sun, S. Zhao, H. Larochelle, D. Englund et al.,
“Deep learning with coherent nanophotonic circuits,” Nature Photon-
ics, vol. 11, no. 7, p. 441, 2017.
[12] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification
with deep convolutional neural networks,” in Advances in neural
information processing systems, 2012, pp. 1097–1105.
[13] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov,
D. Erhan, V. Vanhoucke, A. Rabinovich et al., “Going deeper with
convolutions.” Cvpr, 2015.
[14] E. R. Fossum, D. B. Hondongwa et al., “A review of the pinned
photodiode for ccd and cmos image sensors,” IEEE J. Electron Devices
Soc, vol. 2, no. 3, pp. 33–43, 2014.
[15] T. Fukuda, K. Kohara, T. Dozaka, Y. Takeyama, T. Midorikawa,
K. Hashimoto, I. Wakiyama, S. Miyano, and T. Hojo, “13.4 a 7ns-
access-time 25µw/mhz 128kb sram for low-power fast wake-up mcu
in 65nm cmos with 27fa/b retention current,” in Solid-State Circuits
Conference Digest of Technical Papers (ISSCC), 2014 IEEE Interna-
tional. IEEE, 2014, pp. 236–237.
[16] C.-H. Lin, K. L. J. Wong, T.-Y. Kim, G. R. Xie, D. Major, G. Unruh,
S. R. Dommaraju, H. Eberhart, and A. Venes, “A 16b 6gs/s nyquist
dac with imd<-90dbc up to 1.9 ghz in 16nm cmos,” in Solid-State
Circuits Conference-(ISSCC), 2018 IEEE International. IEEE, 2018,
pp. 360–362.
[17] D. Stepanovic and B. Nikolic, “A 2.8 gs/s 44.6 mw time-interleaved
adc achieving 50.9 db sndr and 3 db effective resolution bandwidth of
1.5 ghz in 65 nm cmos,” IEEE Journal of Solid-State Circuits, vol. 48,
no. 4, pp. 971–982, 2013.
[18] Y.-H. Chen, T. Krishna, J. Emer, and V. Sze, “Eyeriss: An energy-
efficient reconfigurable accelerator for deep convolutional neural net-
works,” 2016.
[19] R. Andri, L. Cavigelli, D. Rossi, and L. Benini, “Yodann: An ultra-
low power convolutional neural network accelerator based on binary
weights,” in VLSI (ISVLSI), 2016 IEEE Computer Society Annual
Symposium on. IEEE, 2016, pp. 236–241.
