Performance evaluation of an integrated photonic convolutional neural
  network based on delay buffering and wavelength division multiplexing by Xu, Shaofu et al.
High-energy-efficiency integrated photonic convolutional neural 
networks 
Shaofu Xu,1 Jing Wang,1 and Weiwen Zou*,1 
1 State Key Laboratory of Advanced Optical Communication Systems and Networks, Intelligent Microwave Lightwave 
Integration Innovation Center (iMLic), Department of Electronic Engineering, Shanghai Jiao Tong University, 800 
Dongchuan Road, Shanghai 200240, China. 
*Correspondence to: wzou@sjtu.edu.cn. 
Photonic technologies offer a promising route to high-speed, energy-efficient neural network
accelerators. Previously presented photonic neural network architectures are mainly designed for
fully-connected layers. When they execute convolutional layers, their energy efficiency is strictly
limited by extra electronic data manipulation and by arrays of electro-optic conversions. Here we
show an integrated photonic architecture designed specifically for energy-efficient computation of
convolutional layers. Optical delay lines perform the data manipulation passively, mitigating data
latency and power consumption and reducing the number of power-consuming electro-optic
conversions. Consequently, the energy efficiency of the proposed architecture is evaluated to be two
to eight times higher than that of previous photonic architectures on the AlexNet benchmark. With
wavelength-division multiplexing, the required length of the delay lines is significantly reduced,
making them practical to fabricate. Furthermore, this architecture has the potential to eliminate
analog-digital interconversions for even higher energy efficiency. A proof-of-concept experiment
validates the proposed architecture on typical classification tasks with computer-equivalent
accuracy. We anticipate that the proposed architecture will inspire future designs of
energy-efficient convolutional neural network accelerators. Especially for edge computing terminals
such as smartphones, drones, unmanned vehicles and smart cameras, where the energy supply is
limited, high-efficiency accelerators will boost computing capability at an affordable power
consumption.
Introduction 
In recent years, driven by rapid advances in computer science and explosive growth in the amount of data, deep
learning has become prominent among artificial intelligence technologies across a broad range of applications,
including computer vision, natural language processing, games, and scientific research [1-3]. As one of the
most widely deployed deep learning models, convolutional neural networks (CNNs) are especially effective at
processing regularly structured input data. With multiple convolutional layers, input data from audio, images
or videos can be abstracted into high-level features, which benefits performance on various tasks [4-6].
Therefore, in modern deep learning, the majority of tasks involving regularly structured inputs adopt
convolutional layers to extract features.
Along with the performance improvement brought by large-scale neural networks, the computational burden and
power consumption are increasing markedly. To reduce energy consumption while maintaining high computing
speed, integrated-circuit researchers are working to develop more efficient deep learning accelerators. For
example, application-specific integrated circuits (ASICs) such as ISAAC [7], DianNao [8], and tensor processing
units (TPUs) [9] have achieved unprecedented speed and energy efficiency. However, integrated electronic
circuits face an energy bottleneck from on-chip data registering and manipulation, which limits the best
performance to 1 picojoule per multiply-and-accumulate operation (pJ/MAC). Recently, optical neural network
accelerators have been proposed, heralding a novel way to break through the electronic energy bottleneck.
Taking advantage of integrated or free-space optical components, such as Mach-Zehnder interferometer arrays
[10], micro-ring resonator arrays [11], spatial light modulators [12-14], and 3-D printed diffraction plates [15],
optical hardware can calculate the linear part (i.e. vector-matrix multiplication) of neural networks with
ultra-low or no power consumption. Note that the linear part of neural networks accounts for the majority of
the power consumption in traditional computers. Therefore, optical hardware is promising for realizing
ultra-energy-efficient deep learning accelerators. Nevertheless, current optical neural network architectures
focus on the fully-connected layers of neural networks. When they are used to compute convolutional layers,
their energy efficiency is compromised. A convolutional layer can be regarded as a generalized matrix
multiplication (GeMM) [16]. Before the matrix multiplication, the input data must first be patched and
rearranged to match the GeMM format. In current optical architectures, this data manipulation is carried out
by digital electronic circuits, which introduce extra latency and power consumption. In addition, the large
number of high-speed, low-loss input electro-optic conversion ports causes substantial power consumption.
Photonic architectures designed specifically for CNNs have recently been proposed [17, 18], but they suffer
from the same energy efficiency limitations as described above.
Here, we propose an energy-efficient integrated photonic architecture designed specifically for computing the
convolutional layers of neural networks. We adopt passive optical delay lines as data buffers, replacing the
electronic circuits that perform data patching and rearrangement. These data manipulations proceed at the
speed of light and consume no power. Because the data manipulations are executed in optical hardware, the
proposed architecture requires fewer input electro-optic conversion ports, further enhancing the energy
efficiency. Since optical delay lines have large footprints even on integrated chips, wavelength-division
multiplexing (WDM) is applied to reduce the total length of the delay lines. Consequently, the delay lines are
much easier to fabricate than those in [19]. This architecture shows a path toward large-scale integrated
photonic neural networks, in which photonic parallelism is exploited to increase the computing speed. In
addition, digital-analog interconversions can potentially be removed from the integrated chip, opening a way
to higher on-chip energy efficiency. In this paper, we present the principles of the proposed architecture and
evaluate its energy efficiency by comparison with state-of-the-art electronic accelerators and integrated
optical neural networks. We also present a proof-of-concept experiment to validate the feasibility of this
architecture on practical convolutional neural networks.
The proposed architecture of the integrated photonic convolutional neural network (IPCNN) is shown in Fig.
1(a). The illustrated structure is equivalent to a convolutional layer with Ci input channels and Co output
channels, where the size of the convolutional kernels is Q. For every input channel, a laser with a specific
wavelength and a high-speed electro-optic (E/O) modulator are deployed. In total, Ci wavelengths are deployed
to modulate the data of all input channels simultaneously. Then, these wavelengths are combined into one
waveguide with a wavelength-division multiplexer (WDM) to reduce the required amount of optical delay line.
In the delay line bank, (Q-1) delay lines are cascaded with a drop port between each pair. The physical delay
of a dropped light signal is the accumulated travel length before it is dropped. By designing the lengths of
the delay lines and the coupling ratios of the drop ports, we can obtain Q equal-intensity copies of the input
with different physical delays. With this step, the data manipulations required before the matrix
multiplication are complete. Then, the matrix multiplication is implemented within the vector-matrix
multiplication (VMM) cores. The outputs of the delay line bank are split equally into Co copies, and each copy
enters a VMM core to produce the convolution result of one output channel. The structure of a VMM core is
depicted in Fig. 1(b), and its principle is detailed in [11, 16]. Each VMM core has Q input waveguides, each
carrying Ci wavelengths. A VMM core holds CiQ parameters in total, represented by the micro-ring resonators. A
single micro-ring resonator acts as a tunable coupler for a specific wavelength, tuned to control the coupling
ratio between the through port and the drop port, so that the intensity of that wavelength is weighted. Then,
the intensities of the Ci weighted wavelengths are summed and converted to an electrical voltage by the
balanced photodetector (BPD) and the transimpedance amplifier (TIA). All Q voltages are summed in the voltage
adder to give the output of the VMM core. A single VMM core yields the result of one output channel; with Co
VMM cores, the entire convolutional layer is computed.

Fig. 1 Schematic of IPCNN. a, The overall schematic of the architecture. The input part contains Ci laser sources with
different wavelengths. High-speed electro-optic modulators (E/O) accept the serialized input data from the Ci input
channels. With a WDM, the wavelengths are combined into one waveguide and enter the delay line bank. After the delay line
bank, the outputs are split equally among Co VMM cores. Each VMM core completes the computation of one output channel.
b, The structure of a VMM core. Each micro-ring resonator is tuned to control the coupling ratio between the through port
and the drop port for a specific wavelength. The intensities of the different wavelengths are summed in the BPD and then
amplified by transimpedance amplifiers (TIAs). The voltages are finally summed using a voltage adder (ADD).
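The signal flow through one VMM core described above can be sketched numerically. This is a minimal, idealized model and not the authors' device model: it assumes lossless rings, a linear mapping in which a weight w in [-1, 1] sends a fraction (1 + w)/2 of the intensity to one BPD input and (1 - w)/2 to the other, and unit TIA gain. The function name `vmm_core` and the array layout are illustrative choices.

```python
import numpy as np

def vmm_core(x, w):
    """Idealized VMM core.

    x: array of shape (Q, Ci, T), optical intensity of each wavelength on each
       of the Q delayed waveguides over T time samples.
    w: array of shape (Q, Ci), weights in [-1, 1] set by the micro-rings.
    Returns the voltage-adder output, shape (T,).
    """
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    # Assumed ring model: a fraction (1 + w) / 2 of the intensity reaches one
    # BPD input and the remaining (1 - w) / 2 reaches the other.
    port_a = ((1.0 + w) / 2.0)[:, :, None] * x
    port_b = ((1.0 - w) / 2.0)[:, :, None] * x
    # Each BPD sums the Ci wavelengths on both inputs and subtracts them.
    v = port_a.sum(axis=1) - port_b.sum(axis=1)    # shape (Q, T)
    # The voltage adder sums the Q per-waveguide voltages (unit TIA gain).
    return v.sum(axis=0)
```

Under these assumptions the balanced detection reduces to a signed weighted sum over the Ci wavelengths and the Q waveguides, which is one row of the vector-matrix product per time sample.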
In Fig. 2, we visualize the GeMM of a convolutional layer with image inputs to show its equivalence to delay
buffering. Suppose the input data is a group of square images with width L and Ci channels, and the
convolutional layer convolves the Ci-channel input data into Co-channel output data. Before the matrix
multiplication, the image data are preprocessed to form an input matrix. The first step is patching: dividing
the input images into kernel-sized patches with a fixed stride. In Fig. 2(a), we take the size of the
convolutional kernels to be Q = σ² = 3², and the patching stride is set to 1. Therefore, a 6×6 image is divided
into 16 patches. The patching is repeated for every input channel. Then, each patch is flattened to form a
piece of the input matrix X. Patches from the same input channel are arranged horizontally, and different
input channels are arranged vertically. Consequently, the size of matrix X is (L-σ+1)²×Ciσ² (width × height).
In a single convolutional layer, the number of convolutional kernels is CiCo, so the total number of parameters
is CiCoσ². The convolutional kernels are likewise rearranged to form a weight matrix W of size Ciσ²×Co (see
Suppl. for details). With the matrices X and W above, the convolution result is expressed as the matrix
multiplication Y=WX. Each row of the result matrix Y can be transformed back into an output image channel. A
standard GeMM is thus completed. Alternatively, the input matrix can be formed by delay buffering. As shown in
Fig. 2(b), we first serialize an input image into a row and then duplicate it into Q copies, each of which is
given a different fixed delay. After every input channel is manipulated in the same way, we obtain the delayed
matrix X’. The reddish zone of X’ is identical to X. The delay amounts can be calculated with the following
equation.
 
$$ D_i = L \left\lfloor \frac{i-1}{\sigma} \right\rfloor + (i-1) \bmod \sigma, \quad i = 1, 2, 3, \ldots, Q $$

Fig. 2 The processes of GeMM for a convolutional layer. The convolutional layer shown here transforms Ci input
channels to Co output channels with 3×3-sized convolutional kernels. The image size is L×L. a, The conventional way of
GeMM in digital computers and previous photonic neural network architectures. Processes include patching, tiling and
matrix multiplications. Patching and tiling are executed with electronic circuits in previous photonic neural networks. b,
The IPCNN way to build the input matrix by serialization and delay. By imposing fixed delay amounts on input data, optical
passive delay line bank can provide the input matrix X’. The valid part of input matrix X’ is equivalent to the conventional
input matrix X (shown in reddish part). After matrix multiplications, the valid part of the output matrix Y’ is equivalent to
conventional output matrix Y (shown in purple part).

Note that the physical delay on the hardware depends on the modulation rate of the input E/O modulators, i.e. a
faster modulation rate requires a shorter physical delay line to provide the required delay amount. If fm
denotes the modulation rate, the physical delay is
$$ D_i^{\mathrm{phy}} = D_i / f_m . $$
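As a small worked example of the delay amounts and physical delays, the sketch below evaluates the delay of each of the Q = σ² copies as L·floor((i-1)/σ) + (i-1) mod σ samples and divides by the modulation rate. The function names are illustrative, and the example values (a 28-pixel-wide image at 2 GHz) are chosen only to match the image size of MNIST and the modulation rate used in the experiment.

```python
def delay_amounts(L, sigma):
    """Delay, in samples, of each of the Q = sigma**2 copies (i = 1..Q)."""
    Q = sigma * sigma
    return [L * ((i - 1) // sigma) + (i - 1) % sigma for i in range(1, Q + 1)]

def physical_delays(L, sigma, f_m):
    """Physical delay in seconds of each copy at modulation rate f_m (Hz)."""
    return [d / f_m for d in delay_amounts(L, sigma)]

if __name__ == "__main__":
    # 6x6 image with a 3x3 kernel, as in Fig. 2.
    print(delay_amounts(6, 3))
    # 28-pixel-wide image at 2 GHz: the smallest non-zero delay is one symbol period.
    print(physical_delays(28, 3, 2e9)[:4])
```

The largest delay is (σ-1)(L+1) samples, which is the quantity that sets the longest delay line in the bank.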
Since the height of the delayed matrix is also Ciσ², it can be multiplied with the weight matrix W. As a
consequence, the colored part of the output matrix Y’ is the same as the standard GeMM output Y.
Because only the valid part (reddish part) of the delayed matrix X’ gives valid output (purple part in Y’), we
can define the valid part of Y’, as well as of X’, by the following equation.
$$ Y_{\mathrm{valid}} = \{\, Y'(i,j) \mid i \in [(\sigma-1)(L+1)+1+nL,\ (\sigma-1)L+nL+L],\ 1 \le j \le C_o,\ 0 \le n \le L-\sigma,\ i, j, n \in \mathbb{Z} \,\}. $$

Note that every row of the output matrix Y’ is a serialized image channel. After the pointwise nonlinear
activation function, these serialized image channels can be regarded as the input channels of the next
convolutional layer. If the number of output channels equals the number of input channels of the next layer,
the layers can be cascaded directly. Suppose K convolutional layers are cascaded and the output matrix of the
K-th layer is Y^(K). The valid part of Y^(K) is defined as:
$$ Y^{(K)}_{\mathrm{valid}} = \{\, Y^{(K)}(i,j) \mid i \in [K(\sigma-1)(L+1)+1+nL,\ K(\sigma-1)L+nL+L],\ 1 \le j \le C_o^{(K)},\ 0 \le n \le L-K(\sigma-1)-1;\ i, j, n \in \mathbb{Z} \,\}. $$

 
Fig. 3 Energy budget and efficiency evaluation of IPCNN. a, Energy budgets of different integrated photonic neural 
network architectures executing the convolutional layers of AlexNet (CONV3~CONV5). b, Energy efficiency of IPCNN 
versus modulation rate. For reference, we depict the energy efficiency levels of high-performance commercial GPU (Nvidia 
V100) and state-of-the-art (SOTA) ASIC. In the figure, we also mark the region of conventional modulation rate in typical 
optical communication systems. When the proposed architecture is implemented with cascaded layers, higher energy
efficiency is expected. “2-cascaded” and “3-cascaded” represent the cascading of 2 convolutional layers and 3 convolutional
layers, respectively.
 
Table 1 Hyperparameters of AlexNet. CONV denotes convolutional layers and FC denotes fully-connected layers. The 
boldface lines are convolutional layers adopted in the energy budget evaluations. 
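The equivalence between conventional patching and delay buffering described for Fig. 2 can be checked numerically. The sketch below is a single-channel illustration under stated assumptions: `im2col`, `delayed_matrix` and `valid_columns` are hypothetical helper names, the patches are flattened row by row, and column indices are 0-based (the text uses 1-based i). With these conventions the valid columns of the delayed matrix contain exactly the conventional patches with the row order reversed, so the two matrix products agree once the weight vector is reversed accordingly.

```python
import numpy as np

def im2col(img, sigma):
    """Conventional patching: one flattened sigma x sigma patch per column."""
    L = img.shape[0]
    cols = [img[r:r + sigma, c:c + sigma].ravel()
            for r in range(L - sigma + 1) for c in range(L - sigma + 1)]
    return np.array(cols).T                      # shape (sigma**2, (L-sigma+1)**2)

def delayed_matrix(img, sigma):
    """Delay buffering: row i is the serialized image delayed by D_i samples."""
    L = img.shape[0]
    x = img.ravel()                              # row-by-row serialization
    delays = [L * (i // sigma) + i % sigma for i in range(sigma * sigma)]
    Xd = np.zeros((sigma * sigma, x.size + max(delays)))
    for row, d in enumerate(delays):
        Xd[row, d:d + x.size] = x
    return Xd

def valid_columns(L, sigma):
    """0-based column indices of the valid part of the delayed matrix."""
    return [(sigma - 1) * (L + 1) + n * L + k
            for n in range(L - sigma + 1) for k in range(L - sigma + 1)]

L, sigma = 6, 3
img = np.arange(L * L, dtype=float).reshape(L, L)
X = im2col(img, sigma)
Xd = delayed_matrix(img, sigma)
valid = valid_columns(L, sigma)
w = np.arange(1.0, sigma * sigma + 1)            # arbitrary example weights

# Valid columns of the delayed matrix hold the same patches, rows reversed.
same_patches = np.array_equal(Xd[::-1][:, valid], X)
# Hence the outputs agree when the weights are reversed as well.
same_output = np.allclose(w[::-1] @ Xd[:, valid], w @ X)
print(same_patches, same_output)
```

The row reversal is only a labeling convention: it corresponds to the order in which kernel elements are assigned to the delayed copies and can be absorbed into the weight arrangement.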
Results 
Compared with previous photonic neural network architectures, IPCNN focuses on the convolutional layers.
The major difference is the introduction of optical delay lines, which benefits convolution efficiency in two
respects. First, the delay lines replace the electronic circuits that perform data manipulation before the
matrix multiplication, mitigating power consumption and data latency. Second, the number of power-consuming
E/O interfaces decreases. AlexNet is a classical neural network structure for image classification, and its
hyperparameters are listed in Table 1 [20]. To compare the energy efficiency of the proposed architecture with
that of other integrated photonic neural networks, we evaluate the energy consumption of each architecture on
the convolutional layers CONV3-CONV5 of AlexNet. We limit the modulation rate fm to 5 GHz, which is
practically compatible with on-chip analog-to-digital converters (ADCs) and digital-to-analog converters
(DACs) [17]. The power consumption of the components is assumed to be 100 mW per laser, 26 mW per DAC, 23 mW
per E/O modulator [21, 22], 2 mW per TIA [23, 24], and 76 mW per ADC [25]. From these values we calculate the
power consumption of the architectures (see Methods for details). In Fig. 3(a), we compare the energy our
architecture consumes to perform these GeMMs with that of two typical integrated photonic neural networks
[10, 11]. The proposed architecture is clearly advantageous for convolutional layers. Note that IPCNN mainly
reduces the power of the input ports. Therefore, the improvement in energy budget is more pronounced on CONV4
and CONV5, whose input channel numbers are larger than that of CONV3. The architectures of [11] and IPCNN
both adopt WDM, so their energy budgets are evaluated under a limit of 100 wavelengths. In [10], by contrast,
coherent light is used in the optical hardware; only a single wavelength is adopted, hence no wavelength limit
is imposed. In general, when Q=9, IPCNN is expected to reduce the power consumption of the input ports by ~9
times. Because the output-port power consumption of IPCNN is slightly higher than that of previous integrated
photonic neural networks owing to the multiple TIAs, the overall on-chip energy consumption is 2~8 times
lower than that of previous architectures. It is worth noting that only the power of the photonic components
and their driving electronics is included in this evaluation. The extra power of the electronic data
manipulation circuits would further increase the total consumption of the previous architectures.
In Fig. 3(b), we compare the energy efficiency of IPCNN with that of electronic neural network accelerators.
The performance of a commercial graphics processing unit (GPU), the Nvidia V100 [26], and of state-of-the-art
ASICs [8, 9] is shown for reference. In this evaluation, the input and output channel numbers are fixed at 96,
and Q is fixed at 9. Once the modulation rate exceeds 300 MHz, the energy efficiency of the proposed
architecture surpasses that of state-of-the-art deep learning ASICs. Since conventional optical systems operate
at several GHz, the energy efficiency of photonic neural networks can be several times higher than that of
electronics. Recently, a more aggressive energy budget for photonic neural networks was indicated in [14],
implying that the power consumption of the lasers can be lowered significantly while maintaining the
prediction accuracy. Another feature of the proposed architecture is its cascading ability. If the output
channel number equals the input channel number of the next convolutional layer, the outputs can be injected
directly into the next layer after an analog nonlinear activation function. For example, a simple diode can act
as a rectified linear unit (ReLU), and E/O modulators can serve as tunable nonlinear functions [27]. With
cascading, analog-digital interconversions between two convolutional layers are eliminated, and the energy
efficiency increases further, as shown in Fig. 3(b).

Fig. 4 Setup of the proof-of-concept experiment. a, System configuration of the proof-of-concept experiment. MZM,
Mach-Zehnder modulator; AWG, arbitrary waveform generator; TDL, tunable delay line; DOMZM, dual-output Mach-Zehnder
modulator; DC, direct-current source; OSC, oscilloscope. b, An example of the output waveform after the voltage
adder, recorded by the OSC at 20 GS/s. c, Zoom-in plot of the output waveform. By down-sampling the waveform to 2 GS/s,
we get the experimental samples. Reference samples from the computer results are plotted as blue dots.
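The dependence of energy efficiency on modulation rate can be illustrated with a short calculation. This is a sketch under explicit assumptions rather than the exact procedure behind Fig. 3(b): it uses the component powers listed in the Results, counts Ci input ports and, per output channel, Q TIAs plus one ADC, takes one symbol period 1/fm as the time per output sample, and counts CiCoQ MACs per symbol. Cascading is not modeled. The function name is illustrative.

```python
P_INPUT_MW = 100.0 + 23.0 + 26.0   # laser + E/O modulator + DAC per input port
P_TIA_MW, P_ADC_MW = 2.0, 76.0

def ipcnn_pj_per_mac(f_m, ci=96, co=96, q=9):
    """Energy per MAC in pJ for a single IPCNN layer at modulation rate f_m (Hz)."""
    power_w = 1e-3 * (ci * P_INPUT_MW + co * (q * P_TIA_MW + P_ADC_MW))
    macs_per_symbol = ci * co * q
    return power_w / f_m / macs_per_symbol * 1e12

if __name__ == "__main__":
    for f_m in (100e6, 300e6, 1e9, 5e9):
        print(f"{f_m / 1e9:5.1f} GHz: {ipcnn_pj_per_mac(f_m):.4f} pJ/MAC")
```

Under these assumptions the energy per MAC scales as 1/fm and falls below the 1 pJ/MAC electronic level quoted in the Introduction at a modulation rate of roughly 300 MHz, which is consistent with the crossover stated in the text.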
We conduct proof-of-concept experiments with discrete components to validate the principle of IPCNN.
The experimental setup is shown in Fig. 4(a). Two continuous-wave light beams with different wavelengths are
provided by a 4-channel laser source (Alnair Labs TLG-200). An arbitrary waveform generator (AWG, Keysight
M8195A) generates the input data. After E/O conversion in the Mach-Zehnder modulators (MZMs, Optilab
IMC-1550-20), the two input channels are multiplexed by a WDM (RoHS DWDM-4CH). A tunable delay line (TDL,
General Photonics MDL-002) provides the required delay amount. In the proof-of-concept experiments, we replace
the micro-ring resonators with dual-output Mach-Zehnder modulators (DOMZMs, EOspace AX-1x2-0MSS-20). First,
the combined wavelengths are demultiplexed. Then, for each wavelength, we use a DOMZM and a direct-current
(DC, Keithley 2230-30-1) voltage to tune the optical power of the two output ports. Two BPDs convert the
optical power to electrical voltages, and the two voltages are summed in the voltage adder. The BPDs and the
voltage adder are integrated in a homemade module with a bandwidth of 100 MHz to 3 GHz. An oscilloscope (OSC,
Keysight DSO-S 804A) records the waveforms in the digital domain. With this setup, only 2 convolutional
parameters can be loaded in a single run. Therefore, we repeatedly change the DC voltages and the delay amount
until all convolutional parameters are traversed and the GeMM is equivalently completed. A LabVIEW program
runs the repeated experiments automatically. Here, the modulation rate is set to 2 GHz, so the minimal delay
amount is 0.5 ns. Since the adopted TDL has a maximal delay of 1.12 ns, we only tune the TDL to realize short
delays of less than 1.12 ns. The long delays are implemented in data processing (see Suppl. Table 1 for
details). Note that the adopted BPDs have a low-frequency cut-off, i.e. DC components are filtered out by the
system. To avoid waveform distortion, the input data and network parameters are first coded to remove the DC
components (see Methods and Suppl. for details). Figures 4(b) and 4(c) show an example of the output waveforms
recorded by the OSC. At a modulation rate of 2 GHz, a sampling rate of 2 GS/s is sufficient; to observe more
detail, we set the OSC to a sampling rate of 20 GS/s. The output waveforms are the weighted sums of the two
input channels. In the zoomed waveform of Fig. 4(c), the experimental samples are close to the reference
samples, implying that the experimental convolution results approach the ideal results. Based on the output
waveforms, we reconstruct the convolution results (see Methods and Suppl.).
The proposed architecture is validated with classification tasks in the experiment. We adopt a simple CNN to
perform two classification tasks, MNIST handwritten digits [28] and Fashion-MNIST [29]. The CNN model is shown
in Fig. 5(a). We use the experimental setup to execute the convolutional layers of the CNN, and the rest of
the network is executed in a computer. The network parameters are trained in a computer, and the experimental
setup is used in the inference phase. In Fig. 5(b) and 5(c), we show randomly selected convolution results of
convolutional layer 1 and convolutional layer 2, respectively. The absolute error of these results relative to
the computer references is depicted alongside them. The peak signal-to-noise ratio (PSNR) is marked to
quantify the accuracy of the experimental setup. The error of the experimental results is fairly small. In
total, we executed 200 inferences on the first 200 images of each test set to check the overall CNN
classification accuracy. The experimental prediction distributions are shown in Fig. 5(d) and 5(e). For MNIST
handwritten digits, the experimental accuracy is 99.0% (the computer accuracy is 99.0%). For Fashion-MNIST,
the experimental accuracy is 89.0% (the computer accuracy is 88.5%). These results verify the correctness of
IPCNN. Although the accuracy of the experimental setup slightly exceeds that of the reference computer on
Fashion-MNIST, it would be inappropriate to conclude that the experimental setup is more accurate than
computers, because only 200 images of the Fashion-MNIST dataset are tested and the randomness of the neural
network influences the accuracy.

Fig. 5 Image classification results with the experimental setup. a, The adopted CNN model for image classification.
“Conv. (16)” represents a convolutional layer with 16 output channels. All convolutional kernels are 3×3. “FC (512)”
represents a fully-connected layer with 512 output neurons. ReLU, Sigmoid, and Softmax are different kinds of activation
functions. b, Convolution results of the first convolutional layer (Conv. (16)) via the experimental setup. Alongside each
result, the residual error and its PSNR in dB are also provided. Computer results are used as reference. c, Convolution
results of the second convolutional layer (Conv. (32)). d and e, The experimental prediction distributions of the MNIST
dataset and the Fashion-MNIST dataset, respectively. Correct predictions concentrate on the diagonal line of the
prediction distributions.
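The DC-removal coding mentioned above (alternating 1 and -1 on the serialized input, a mask on the parameters, and decoding by alternating 1 and -1) can be illustrated with a short simulation. The actual parameter mask is given only in the Supplementary material, so the mask used here, a sign flip of each weight according to the parity of its delay, is an assumption that is merely consistent with the stated coding and decoding steps. The helper names are hypothetical.

```python
import numpy as np

def delay_amounts(L, sigma):
    """Delay in samples of each of the sigma**2 copies (0-based copy index)."""
    return np.array([L * (i // sigma) + i % sigma for i in range(sigma * sigma)])

def delayed_weighted_sum(x, w, d):
    """Sum of weighted copies of the serialized input x, copy i delayed by d[i]."""
    y = np.zeros(x.size + d.max())
    for wi, di in zip(w, d):
        y[di:di + x.size] += wi * x
    return y

rng = np.random.default_rng(0)
L, sigma = 8, 3
x = rng.random(L * L) + 1.0                     # serialized image with a large DC level
w = rng.uniform(-1.0, 1.0, sigma * sigma)       # example kernel weights
d = delay_amounts(L, sigma)

x_coded = x * (-1.0) ** np.arange(x.size)       # alternate the sign of the input samples
w_masked = w * (-1.0) ** d                      # assumed mask: sign set by delay parity
y_coded = delayed_weighted_sum(x_coded, w_masked, d)
y_decoded = y_coded * (-1.0) ** np.arange(y_coded.size)   # decode with alternating signs

y_ref = delayed_weighted_sum(x, w, d)           # uncoded reference
print(np.allclose(y_decoded, y_ref), abs(x_coded.mean()) < abs(x.mean()))
```

With this choice of mask, the sign introduced at sample t - D_i by the input coding and the sign (-1)^D_i on the weight combine to (-1)^t for every term, so one multiplication by the alternating sequence recovers the uncoded result while the transmitted waveform carries little DC content.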
Conclusions & Discussions 
In this paper, we propose an integrated photonic convolutional neural network architecture, IPCNN. An
optical delay line bank replaces its electronic counterparts and performs data manipulation without consuming
power, consequently reducing the energy budget of the input ports and electronic circuits. Applying WDM
dramatically reduces the total length of the required optical delay lines. With AlexNet as a benchmark, the
energy efficiency of this architecture is evaluated to be 2~8 times higher than that of current photonic
neural networks. The evaluated energy efficiency also significantly surpasses that of state-of-the-art
electronic deep learning accelerators. Furthermore, the proof-of-concept experiments show that the
classification accuracy is at the same level as that of a 64-bit computer, validating the principle of the
proposed architecture.
In IPCNN, the physical lengths of the delay lines depend on the size of the input image, the size of the
convolutional windows, and the modulation rate. Therefore, these parameters should be determined before the
delay line bank is designed. In prevailing CNN models, small kernels (3×3) are widely adopted in the deep
hidden layers and the image size is on the order of tens of pixels [20, 30, 31]. Assuming a modulation rate of
5 GHz, the required maximal physical length is at the meter level. Since the lengths of the delay lines are
fixed, current technologies for ultra-low-loss long delay lines are well suited to this purpose [32-34]. For
larger image and kernel sizes, embedded fibers can provide additional delay. As with other photonic neural
networks, the mismatch between the hardware and the software-trained parameters is another important issue.
Because of fabrication randomness and unbalanced losses, the software-trained parameters may not be directly
applicable to the hardware. In-situ training methods [35] and weight bank control methods [36] help to solve
this problem.
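The meter-level estimate above can be reproduced with a rough calculation. In this sketch the longest delay is taken as (σ-1)(L+1) symbol periods, and the waveguide length follows from the speed of light divided by a group index. The group index of 2.0 is a placeholder assumption, since the actual value depends on the waveguide platform, and the comparison without WDM assumes one separate delay line bank per input channel.

```python
C0 = 299_792_458.0  # speed of light in vacuum, m/s

def max_delay_line_length_m(L, sigma, f_m, n_group):
    """Length of the longest accumulated delay for an L x L image and sigma x sigma kernel."""
    max_delay_samples = (sigma - 1) * (L + 1)
    tau = max_delay_samples / f_m          # longest physical delay in seconds
    return C0 * tau / n_group

def total_length_without_wdm_m(L, sigma, f_m, n_group, ci):
    """Assumed alternative: one delay line bank per input channel instead of a shared one."""
    return ci * max_delay_line_length_m(L, sigma, f_m, n_group)

if __name__ == "__main__":
    n_g = 2.0                              # hypothetical group index
    for size in (13, 28, 56):
        length = max_delay_line_length_m(size, 3, 5e9, n_g)
        print(f"L = {size:2d}: {length:.2f} m")
```

For image widths of a few tens of pixels at 5 GHz this gives lengths from under one meter to a few meters under the assumed group index, and the shared bank avoids multiplying that length by the number of input channels.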
At present, a plethora of silicon-based photonic integration technologies [37-39] support the full fabrication
of IPCNN on a single chip (except for the lasers). In the future, large-scale photonic-electronic hybrid
integration technologies [40, 41] and heterogeneous integration technologies [42, 43] could provide
opportunities for a fully functional convolutional neural network. With the high parallelism and low power of
integrated optical circuits, the energy efficiency of IPCNN can be exploited in practice. Given that edge
computing terminals are a prevailing trend, this energy-efficient architecture would benefit high-performance
deep learning inference on power-limited remote devices, including smartphones, drones, unmanned vehicles and
smart cameras.
Materials & Methods 
Evaluation of energy consumptions of photonic neural networks. 
Photonic neural networks are generally assumed to consume no power in the matrix multiplication, i.e. the
linear part. Therefore, the major power load lies in the active components of the input and output ports. In
particular, lasers, input E/O modulators, photodetectors with TIAs, and the required electronic circuits (DACs
and ADCs) consume most of the energy. Each input port requires one laser, one DAC, and one E/O modulator, and
each output port requires TIAs and one ADC. In previous architectures based on GeMM, the total number of input
ports is Ciσ². Therefore, the on-chip energy consumption of [10] is calculated by
$$ E = P \cdot t = (1\mathrm{e}{-3}) \times [C_i \sigma^2 \times (100 + 23 + 26) + C_o \times (2 + 76)] \times \frac{L^2}{f_m}, $$
where σ = 3, L = 13 for CONV3-CONV5 in AlexNet, and fm = 5e9. However, for the WDM-based architectures ([11]
and ours), we set a maximal wavelength number of 100. Therefore, the computation over the Ciσ² input ports is
divided into multiple steps. For example, the on-chip energy consumption of [11] on CONV3 (Ciσ² = 2304) is
calculated by
$$ E = P \cdot t = (1\mathrm{e}{-3}) \times [96 \times (100 + 23 + 26) + C_o \times (2 + 76)] \times \frac{L^2}{f_m} \times 24, $$
where the number of input ports is limited to 96 and the time consumption is multiplied by 24. For CONV4
and CONV5, the number of input ports is 96 and the time is multiplied by 36. For IPCNN, despite the limit on
the number of wavelengths, only Ci input ports are needed. For example, the energy consumption of the
proposed architecture on CONV3 is calculated by
$$ E = P \cdot t = (1\mathrm{e}{-3}) \times [86 \times (100 + 23 + 26) + C_o \times (\sigma^2 \times 2 + 76)] \times \frac{L^2}{f_m} \times 3. $$
For CONV4 and CONV5, the number of input ports is set to 96 and the time is multiplied by 4. We then
evaluate the energy efficiency of the proposed architecture at different modulation rates. The energy
consumption is evaluated with the method above, and the total number of MACs is calculated as CiCoσ².
The energy efficiency is therefore
$$ \eta = E / (C_i C_o \sigma^2). $$
For reference, the energy efficiency of the Nvidia V100 is calculated from the FP-16 performance officially
declared by Nvidia Corporation [26].
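The energy budgets of the three architectures on CONV3 to CONV5 can be tabulated with the short script below. It is a sketch under stated assumptions: the (Ci, Co) pairs are those of the standard AlexNet layers (Table 1 is not reproduced in this text), the port counts and step counts for [11] are the ones stated above, the IPCNN step count is taken as the smallest number of passes that respects the 100-wavelength limit, and the IPCNN output term counts σ² TIAs and one ADC per output channel. All function and variable names are illustrative.

```python
import math

# Component powers in mW, as listed in the Results section.
P_LASER, P_DAC, P_EO, P_TIA, P_ADC = 100.0, 26.0, 23.0, 2.0, 76.0
P_INPUT = P_LASER + P_EO + P_DAC      # one laser, one modulator, one DAC per input port
F_M = 5e9                             # modulation rate in Hz
L_OUT, SIGMA, MAX_WAVELENGTHS = 13, 3, 100

# (Ci, Co) of the standard AlexNet layers; an assumption, since Table 1 is not shown here.
LAYERS = {"CONV3": (256, 384), "CONV4": (384, 384), "CONV5": (384, 256)}
# (input ports, time steps) stated in the text for the architecture of [11].
MRR_SCHEDULE = {"CONV3": (96, 24), "CONV4": (96, 36), "CONV5": (96, 36)}

def energy_joule(power_mw, steps=1):
    """E = P * t with t = L^2 / f_m per step."""
    return 1e-3 * power_mw * L_OUT**2 / F_M * steps

def energy_ref10(ci, co):
    return energy_joule(ci * SIGMA**2 * P_INPUT + co * (P_TIA + P_ADC))

def energy_ref11(co, ports, steps):
    return energy_joule(ports * P_INPUT + co * (P_TIA + P_ADC), steps)

def energy_ipcnn(ci, co):
    steps = math.ceil(ci / MAX_WAVELENGTHS)   # passes needed under the wavelength limit
    ports = math.ceil(ci / steps)             # input ports used per pass
    return energy_joule(ports * P_INPUT + co * (SIGMA**2 * P_TIA + P_ADC), steps)

ratios = {}
for name, (ci, co) in LAYERS.items():
    e_ipcnn = energy_ipcnn(ci, co)
    ratios[name] = (energy_ref10(ci, co) / e_ipcnn,
                    energy_ref11(co, *MRR_SCHEDULE[name]) / e_ipcnn)
    print(name, f"{e_ipcnn * 1e6:.2f} uJ", [round(r, 2) for r in ratios[name]])
```

Under these assumptions the ratios of the previous architectures to IPCNN fall roughly between 2 and 8 on all three layers, in line with the 2~8 times improvement reported in the Results.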
 
Neural network implementation in software and on the experimental setup 
We implement a simple CNN both in software and on the experimental setup to validate the principle of the 
proposed architecture. In software, the neural network is implemented in TensorFlow with 32-bit 
floating-point precision, and training for the two image classification tasks (MNIST and Fashion-MNIST) 
is carried out entirely in software. The trained network parameters are saved and used for inference both on 
the computer and on the experimental setup. The experimental setup is used only for the convolutional layers 
in the inference phase. The input image is first serialized row by row to form the 1-D input data. Because of 
the low-frequency cut-off of the experimental setup, the 1-D input data is coded by multiplying it with 
alternating 1 and -1 to remove the DC components (detailed in Suppl.). Because the modulation rate is fixed 
at 2 GHz and the sampling rate of the adopted AWG is 60 GS/s, the input data is then spline-interpolated from 
2 GS/s to 60 GS/s. The AWG generates the interpolated waveforms to modulate the light in the E/O modulators. 
The delay line bank is implemented equivalently by a single TDL, which is tuned to provide the short delays 
required in the experiment (see Suppl. Table 1). To match the input data coding, the convolutional parameters 
are also coded by a mask (detailed in Suppl.). The masked parameters are then mapped from values to voltages 
according to the modulation curves of the DOMZMs. A value of 1 corresponds to the highest intensity on the 
through-port and the lowest intensity on the drop-port; a value of -1 corresponds to the highest intensity on 
the drop-port and the lowest intensity on the through-port. These voltages are generated by the programmable 
DC source and applied to the DC ports of the DOMZMs. The output waveform of the voltage adder is the weighted 
sum of the two input waveforms. Since only 2 parameters are loaded and accumulated within a single output 
waveform, we repeatedly change the loaded parameters and the delay amount to traverse the weight matrix. All 
output waveforms are recorded. Additional accumulations among different output waveforms are necessary to 
complete the VMM; these accumulations are performed on a computer using the recorded waveforms. From the 
accumulated output waveforms, we reconstruct the convolution result by decoding, i.e. multiplying the output 
waveform by alternating 1 and -1 (see Suppl. for details). To evaluate the data accuracy of the experimental 
setup, the PSNR is calculated by  
$$\mathrm{PSNR} = 20\log_{10}\frac{1}{\mathrm{std}(\mathrm{error})},$$
where the error is the difference between the experimental results and the computer results. 
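The PSNR definition can be illustrated with a minimal Python sketch. It assumes the waveforms are normalized so that the peak value is 1 (which is why the numerator is 1); the function name `psnr_db` and the test data are hypothetical and are not the experimental data.

```python
import numpy as np

def psnr_db(experimental, reference):
    """PSNR in dB as 20*log10(1/std(error)); assumes a peak value of 1."""
    error = np.asarray(experimental, dtype=float) - np.asarray(reference, dtype=float)
    return 20.0 * np.log10(1.0 / np.std(error))

# Hypothetical data: a reference waveform plus an alternating error of +/-0.01.
reference = np.linspace(-1.0, 1.0, 1000)
experimental = reference + 0.01 * np.where(np.arange(1000) % 2 == 0, 1.0, -1.0)
print(psnr_db(experimental, reference))
```

With an error of standard deviation 0.01, the sketch gives a PSNR of about 40 dB.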
Acknowledgement  
This work is supported by National Natural Science Foundation of China (grant no. 61822508, 61571292, 
61535006). 
Conflict of interests 
The authors declare no conflict of interest.  
 
References 
1. Y. LeCun, Y. Bengio, and G. Hinton, Deep learning, Nature 521, 436-444 (2015). 
2. M. Jaderberg, W. M. Czarnecki, I. Dunning, L. Marris, G. Lever, A. G. Castañeda, C. Beattie, N. C. Rabinowitz, 
A. S. Morcos, A. Ruderman, N. Sonnerat, T. Green, L. Deason, J. Z. Leibo, D. Silver, D. Hassabis, K. Kavukcuoglu, 
and T. Graepel, Human-level performance in 3D multiplayer games with population-based reinforcement 
learning, Science 364, 859-865 (2019). 
3. J. Peurifoy, Y. Shen, L. Jing, Y. Yang, F. Cano-Renteria, B. G. DeLacy, J. D. Joannopoulos, M. Tegmark, and M. 
Soljačić, Nanophotonic particle simulation and inverse design using artificial neural networks, Science 
Advances 4, eaar4206 (2018). 
4. T. Zahavy, A. Dikopoltsev, D. Moss, G. I. Haham, O. Cohen, S. Mannor, and M. Segev, Deep learning 
reconstruction of ultrashort pulses, Optica 5, 666-673 (2018). 
5. S. Sundaram, P. Kellnhofer, Y. Li, J. Zhu, A. Torralba, and W. Matusik, Learning the signatures of the human 
grasp using a scalable tactile glove, Nature 569, 698-702 (2019). 
6. Y. Rivenson, Y. Zhang, H. Günaydın, D. Teng, and A. Ozcan, Phase recovery and holographic image 
reconstruction using deep learning in neural networks, Light: Science & Applications 7, 17141 (2018). 
7. A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J. P. Strachan, M. Hu, R. S. Williams, and V. 
Srikumar, ISAAC: a convolutional neural network accelerator with in-situ analog arithmetic in crossbars, ACM 
SIGARCH Computer Architecture News 44, 14-26 (2016). 
8. T. Chen, Z. Du, N. Sun, J. Wang, C. Wu, Y. Chen, and O. Temam, DianNao: a small-footprint high-throughput 
accelerator for ubiquitous machine-learning, ACM Sigplan Notices 49, 269-284 (2014). 
9. N. P. Jouppi, C. Young, N. Patil, D. Patterson, G. Agrawal, R. Bajwa, S. Bates, S. Bhatia, N. Boden, A. Borchers, 
R. Boyle, P. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley, M. Dau, J. Dean, B. Gelb, T. V. Ghaemmaghami, R. 
Gottipati, W. Gulland, R. Hagmann, C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, A. Jaffey, A. 
Jaworski, A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar, S. Lacy, J. Laudon, J. Law, D. Le, C. Leary, Z. 
Liu, K. Lucke, A. Lundin, G. MacKean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, R. Narayanaswami, 
R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross, A. Salek, E. Samadiani, C. 
Severn, G. Sizikov, M. Snelham, J. Souter, D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, H. Toma, E. 
Tuttle, V. Vasudevan, R. Walter, W. Wang, E. Wilcox, and D. H. Yoon, In-datacenter performance analysis of a 
tensor processing unit, ACM/IEEE 44th Annual International Symposium on Computer Architecture, 1-12 
(2017). 
10. Y. Shen, N. C. Harris, S. Skirlo, M. Prabhu, T. Baehr-Jones, M. Hochberg, X. Sun, S. Zhao, H. Larochelle, D. 
Englund, and M. Soljačić, Deep learning with coherent nanophotonic circuits, Nature Photonics 11, 441-446 
(2017). 
11. A. N. Tait, T. F. Lima, E. Zhou, A. X. Wu, M. A. Nahmias, B. J. Shastri, and P. R. Prucnal, Neuromorphic photonic 
networks using silicon photonic weight banks, Scientific Reports 7, 7430 (2017). 
12. J. Bueno, S. Maktoobi, L. Froehly, I. Fischer, M. Jacquot, L. Larger, and D. Brunner, Reinforcement learning in 
a large-scale photonic recurrent neural network, Optica 5, 756-760 (2018). 
13. Y. Zuo, B. Li, Y. Zhao, Y. Jiang, Y. Chen, P. Chen, G. Jo, J. Liu, and S. Du, All-optical neural network with 
nonlinear activation functions, Optica 6, 1132-1137 (2019). 
14. R. Hamerly, L. Bernstein, A. Sludds, M. Soljačić, and D. Englund, Large-scale optical neural networks based 
on photoelectric multiplication, Physical Review X 9, 021032 (2019). 
15. X. Lin, Y. Rivenson, N. T. Yardimci, M. Veli, Y. Luo, M. Jarrahi, and A. Ozcan, All-optical machine learning 
using diffractive deep neural networks, Science 361, 1004-1008 (2018). 
16. S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, and J. Tran, cuDNN: efficient primitives for deep learning, 
arXiv preprint arXiv:1410.0759 (2014). 
17. V. Bangari, B. A. Marquez, H. Miller, A. N. Tait, M. A. Nahmias, T. F. Lima, H. Peng, and P. R. Prucnal, Digital 
electronics and analog photonics for convolutional neural networks, arXiv preprint arXiv:1907.01525 (2019). 
18. S. Xu, J. Wang, R. Wang, J. Chen, and W. Zou, High-accuracy optical convolution unit architecture for 
convolutional neural networks by cascaded acousto-optical modulator arrays, Optics Express 27, 19778-19787 
(2019). 
19. H. Bagherian, S. Skirlo, Y. Shen, H. Meng, V. Ceperic, and M. Soljačić, On-chip optical convolutional neural 
network, arXiv preprint arXiv:1808.03303 (2018). 
20. A. Krizhevsky, I. Sutskever, and G. E. Hinton, ImageNet classification with deep convolutional neural networks, 
Advances in Neural Information Processing Systems, 1097-1105 (2012). 
21. M. He, M. Xu, Y. Ren, J. Jian, Z. Ruan, Y. Xu, S. Gao, S. Sun, X. Wen, L. Zhou, L. Liu, C. Guo, H. Chen, S. Yu, L. 
Liu, and X. Cai, High-performance hybrid silicon and lithium niobate Mach–Zehnder modulators for 100 Gbit 
s−1 and beyond, Nature Photonics 13, 359-364 (2019). 
22. C. Wang, M. Zhang, X. Chen, M. Bertrand, A. Shams-Ansari, S. Chandrasekhar, P. Winzer, and M. Lončar, 
Integrated lithium niobate electro-optic modulators operating at CMOS-compatible voltages, Nature 562, 101-
104 (2018). 
23. C. Kromer, G. Sialm, T. Morf, M. L. Schmatz, F. Ellinger, D. Erni, and H. Jäckel, A low-power 20-GHz 52-dBohm 
transimpedance amplifier in 80-nm CMOS, IEEE Journal of Solid-State Circuits 39, 885-894 (2004). 
24. S. Zohoori, M. Dolatshahi, M. Pourahmadi, and M. Hajisafari, A CMOS, low-power current-mirror-based 
transimpedance amplifier for 10 Gbps optical communications, Microelectronics Journal 80, 18-27 (2018). 
25. B. Murmann, ADC performance survey 1997-2019, http://web.stanford.edu/~murmann/adcsurvey.html, 
(2019). 
26. Nvidia Inc., Tesla V100 Tensor Core GPU, https://www.nvidia.cn/data-center/tesla-v100/, (2019). 
27. J. K. George, A. Mehrabian, R. Amin, J. Meng, T. F. Lima, A. N. Tait, B. J. Shastri, T. El-Ghazawi, P. R. Prucnal, 
and V. J. Sorger, Neuromorphic photonics with electro-absorption modulators, Optics Express 27, 5181-5191 
(2019). 
28. Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, Gradient-based learning applied to document recognition, 
Proceedings of the IEEE 86, 2278-2324 (1998). 
29. H. Xiao, K. Rasul, and R. Vollgraf, Fashion-MNIST: a novel image dataset for benchmarking machine learning 
algorithms, arXiv preprint arXiv:1708.07747 (2017). 
30. C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, Going 
deeper with convolutions, Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 
1-9 (2015). 
31. K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, Proceedings of the IEEE 
Conference on Computer Vision and Pattern Recognition, 770-778 (2016). 
32. H. Lee, T. Chen, J. Li, O. Painter, and K. J. Vahala, Ultra-low-loss optical delay line on a silicon chip, Nature 
Communications 3, 867 (2012). 
33. J. F. Bauters, M. L. Davenport, M. J. R. Heck, J. K. Doylend, A. Chen, A. W. Fang, and J. E. Bowers, Silicon on 
ultra-low-loss waveguide photonic integration platform, Optics Express 21, 544-555 (2013). 
34. M. J. R. Heck, J. F. Bauters, M. L. Davenport, D. T. Spencer, and J. E. Bowers, Ultra-low loss waveguide platform 
and its integration with silicon photonics, Laser & Photonics Reviews 8, 667-686 (2014). 
35. T. W. Hughes, M. Minkov, Y. Shi, and S. Fan, Training of photonic neural networks through in situ 
backpropagation and gradient measurement, Optica 5, 864-871 (2018). 
36. A. N. Tait, H. Jayatilleka, T. F. Lima, P. Y. Ma, M. A. Nahmias, B. J. Shastri, S. Shekhar, L. Chrostowski, and P. 
R. Prucnal, Feedback control for microring weight banks, Optics Express 26, 26422-26443 (2018). 
37. H. Xu, X. Li, X. Xiao, Z. Li, Y. Yu, and J. Yu, Demonstration and characterization of high-speed silicon 
depletion-mode Mach–Zehnder modulators, IEEE Journal of Selected Topics in Quantum Electronics 20, 
3400110 (2014). 
38. J. Chiles, S. M. Buckley, S. Nam, R. P. Mirin, and J. M. Shainline, Design, fabrication, and metrology of 10 by 
100 multi-planar integrated photonic routing manifolds for neural networks, APL Photonics 3, 106101 (2018). 
39. S. Tao, Q. Huang, L. Zhu, J. Liu, Y. Zhang, Y. Huang, Y. Wang, and J. Xia, Athermal 4-channel (de-)multiplexer 
in silicon nitride fabricated at low temperature, Photonics Research 6, 686-691 (2018). 
40. C. Sun, M. T. Wade, Y. Lee, J. S. Orcutt, L. Alloatti, M. S. Georgas, A. S. Waterman, J. M. Shainline, R. R. 
Avizienis, S. Lin, B. R. Moss, R. Kumar, F. Pavanello, A. H. Atabaki, H. M. Cook, A. J. Ou, J. C. Leu, Y. Chen, K. 
Asanović, R. J. Ram, M. A. Popović, and V. M. Stojanović, Single-chip microprocessor that communicates 
directly using light, Nature 528, 534-538 (2015). 
41. A. H. Atabaki, S. Moazeni, F. Pavanello, H. Gevorgyan, J. Notaros, L. Alloatti, M. T. Wade, C. Sun, S. A. Kruger, 
H. Meng, K. A. Qubaisi, I. Wang, B. Zhang, A. Khilo, C. V. Baiocco, M. A. Popović, V. M. Stojanović, and R. J. 
Ram, Integrating photonics with silicon nanoelectronics for the next generation of systems on a chip, Nature 
556, 349-354 (2018). 
42. P. Rabiei, J. Ma, S. Khan, J. Chiles, and S. Fathpour, Heterogeneous lithium niobate photonics on silicon 
substrates, Optics Express 21, 25573-25581 (2013). 
43. T. Komljenovic, M. Davenport, J. Hulme, A. Y. Liu, C. T. Santis, A. Spott, S. Srinivasan, E. J. Stanton, C. Zhang, 
and J. E. Bowers, Heterogeneous silicon photonic integrated circuits, Journal of Lightwave Technology 34, 20-
35 (2016). 
  
Supplementary Materials for High-energy-efficiency integrated 
photonic convolutional neural networks 
Authors: Shaofu Xu,1 Jing Wang,1 and Weiwen Zou*,1 
1 State Key Laboratory of Advanced Optical Communication Systems and Networks, Intelligent Microwave Lightwave 
Integration Innovation Center (iMLic), Department of Electronic Engineering, Shanghai Jiao Tong University, 800 
Dongchuan Road, Shanghai 200240, China. 
*Correspondence to: wzou@sjtu.edu.cn. 
 
1. Deciding the valid part of the output matrix 
As depicted in Fig. 1 of the main manuscript, GeMM can also be implemented by serialization and delay 
buffering. The input and output matrices of this scheme, however, are larger than those of the standard GeMM. 
To reconstruct the valid part of the output matrix, this section derives the valid-part equations 
(the equations in the introduction of the main manuscript). 
 
Suppl. Fig. 1 Convolutional kernel W’ shifting on the serialized/delayed matrix X’. For each step, the dot-product is 
calculated by the overlapped positions of W’ and X’. The equivalent plot of the original kernel W moving on the original image 
X is illustrated for each step. The dot-product is valid only if the whole kernel W is inside the image X and its shape stays 3×3. 
We denote the valid steps with “Y” and invalid steps with “N”. 
  Suppl. Fig. 1 illustrates an example of a single convolutional kernel moving over a single input image. The 
kernel W is 3×3, and the image is 6×6. The aim of convolution is to calculate the dot-product of the 
kernel and the corresponding patch of the image. By moving the kernel from left to right, row by row, over 
the image, we obtain the convolution result. In the proposed delay/buffer scheme, the image is first 
serialized and delayed to form the matrix X’, and the kernel is flattened to a vector W’. As the figure shows, 
shifting the flattened kernel W’ is equivalent to moving the kernel row by row over the image. When the kernel 
is inside the image and its shape is not altered, the dot-product is a valid convolution result and the time 
shift at this position is denoted as a “valid step”. There are 16 valid steps after W’ traverses X’. In 
Suppl. Fig. 2(a), we mark these valid steps in a 6×6 matrix row by row. The valid part is a 4×4 area 
in the matrix, and it is equivalent to the 4×4 result of the standard GeMM. The proposed architecture can also 
be cascaded. Suppl. Fig. 2(b) depicts the result after two cascaded convolutional layers, which is equivalent 
to the standard GeMM result. 
 
Suppl. Fig. 2 Valid/invalid steps grouped as a matrix. The matrix is constructed row by row after the flattened kernel W’ 
traverses the delayed/buffered matrix X’. (a) After one convolutional layer, the valid area is 4×4. (b) After two cascaded 
convolutional layers, the valid area is 2×2. 
  Generalizing this example to an arbitrary kernel size σ×σ and image size L×L, the valid steps can be denoted as 
$$\left[(L+1)(\sigma-1)+1+nL,\ (\sigma-1)L+nL+L\right],\quad 0 \le n \le L-\sigma,\ n \in \mathbb{Z}.$$
After K cascaded convolutional layers, the valid steps can be denoted as 
$$\left[K(L+1)(\sigma-1)+1+nL,\ K(\sigma-1)L+nL+L\right],\quad 0 \le n \le L-K(\sigma-1)-1,\ n \in \mathbb{Z}.$$
 
2. Parameter deployment on VMM cores 
In the proposed architecture, VMM cores are adopted to execute the matrix multiplication in the GeMM. Each 
VMM core performs the multiplication for a single output channel (i.e. one row of the weight matrix). Suppl. 
Fig. 3 visualizes how the convolutional kernels are arranged into the weight matrix and how these 
parameters are deployed on the micro-ring resonators of the VMM cores. 
 
Suppl. Fig. 3 Deploying the convolutional kernels on micro-rings. Wij denotes the convolutional kernel from i-th input 
channel to j-th output channel.  
For a convolutional layer with Ci input channels, Co output channels and kernel size Q = σ², we denote 
the convolutional kernel from the i-th input channel to the j-th output channel by Wij. To form the 
weight matrix W, these kernels are first flattened. Then, kernels with the same output channel number are 
arranged to form one row of the weight matrix. Kernels with different output channel numbers are stacked 
vertically in the weight matrix. Therefore, a row of the weight matrix contains all parameters of a VMM core. 
A VMM core has Q input waveguides, and each waveguide carries Ci wavelengths. Each flattened kernel Wij is 
therefore distributed vertically (across the Q waveguides), and the Ci kernels are distributed horizontally 
(across the wavelengths) in the VMM core. 
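The arrangement described above can be sketched with NumPy. This is a minimal illustration under assumptions: the kernels are stored in a hypothetical array `kernels[j, i]` holding Wij, and the input channels are concatenated in increasing order of i within a row (the ordering within a row is an assumption, not stated in the text).

```python
import numpy as np

def build_weight_matrix(kernels):
    """kernels[j, i] is the sigma x sigma kernel W_ij; returns a (Co, Ci*Q) matrix."""
    Co, Ci, s, _ = kernels.shape
    # Row j is the concatenation of the flattened kernels W_0j, W_1j, ...
    return kernels.reshape(Co, Ci * s * s)

def vmm_core_layout(weight_row, Ci):
    """Arrange one weight-matrix row as a (Q waveguides) x (Ci wavelengths) grid."""
    Q = weight_row.size // Ci
    # Column i holds the Q elements of the flattened kernel from input channel i.
    return weight_row.reshape(Ci, Q).T

# Hypothetical layer: Co = 2 output channels, Ci = 3 input channels, 3x3 kernels.
kernels = np.arange(2 * 3 * 3 * 3, dtype=float).reshape(2, 3, 3, 3)
W = build_weight_matrix(kernels)
core = vmm_core_layout(W[1], Ci=3)
print(W.shape, core.shape)
```

Here each row of `W` maps to one VMM core, and each column of `core` corresponds to one wavelength carrying one flattened kernel.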
 
3. Coding the input data and the convolutional kernels 
In the experiment, the adopted BPDs with embedded TIAs exhibit a low-frequency cut-off. If the image data is 
fed directly into the system, the output is distorted, because uncoded image data has large DC 
components. As shown in Suppl. Fig. 4(a), when the DC components are cut off by the system, 
the baseline of the output waveform drifts away from the original, resulting in unacceptable distortions. 
Therefore, to make use of the DC-blocking photodetectors, we remove the DC components of the input waveform 
through coding before it enters the system. 
 
Suppl. Fig. 4 Output waveform before and after input waveform coding. (a) Before input waveform coding, the DC 
components are cut off. The blue line is the baseline of the input waveform, and the orange line is the baseline of the output 
waveform. The baseline of the waveform clearly drifts. (b) After input waveform coding, the DC components are removed 
before entering the system. Therefore, the baseline of the output waveform does not drift. 
 
Suppl. Fig. 5 The kernel coding. Assume the original input data is all positive. After coding, it alternates between positive 
and negative values, giving x(t). We sum the coded input data x(t) and its delayed copy x(t+τ) with weights 1 and 0.5. (a) Because 
of the input data coding, x(t+τ) is negative (coded from a positive value) at the t=0 point. Therefore, the directly weighted sum 
waveform is incorrect. (b) If the weight 0.5 is changed to -0.5, the weighted sum is correct, which shows that kernel coding is 
necessary. (c) and (d) The kernel coding masks with size 3×3. 
The simplest way to code the input waveform is to multiply the input data by alternating 1 and -1. Suppl. 
Fig. 4(b) shows the effect of this ±1 coding: the baseline drift is eliminated. However, input 
data coding introduces errors if the convolutional kernels are not coded as well. Suppl. Fig. 5(a) shows the 
consequence of using coded input data with non-coded kernels. Because of the input data coding, some originally 
positive values are converted to negative values. As a consequence, negative values of the delayed waveform 
x(t+τ) coincide with positive values of the original waveform x(t). Since summing these waveforms 
directly would introduce errors, the kernels should also be coded, as shown in Suppl. Fig. 5(b). All kernels are 
pointwise multiplied by a mask. Each row of the mask contains alternating 1 and -1. If the width of the input 
image L is even, the starting sign of every row is positive (Suppl. Fig. 5(c)). If L is odd, the starting 
sign alternates from row to row (Suppl. Fig. 5(d)). Input data coding and kernel coding enable us to use 
DC-blocking components in the system. After the coding, the output waveform is the weighted sum of the 
coded input waveforms. We can decode the result by multiplying the output waveform by alternating 1 and -1. 
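The coding, masking and decoding steps can be checked with a short Python sketch. It is an illustration under assumptions rather than the experimental code: the delay of a kernel element is modeled as (row offset)·L + (column offset) samples, and the mask sign is taken as +1 for an even delay and -1 for an odd delay, which reproduces the even-L and odd-L 3×3 patterns described above. The function names are hypothetical.

```python
import numpy as np

def alternate_sign(x_serial):
    """Multiply by alternating +1, -1; used both for coding and for decoding."""
    signs = np.where(np.arange(len(x_serial)) % 2 == 0, 1.0, -1.0)
    return x_serial * signs

def kernel_mask(sigma, L):
    """Mask of +1/-1: the sign follows the parity of each element's delay."""
    a, b = np.indices((sigma, sigma))
    delay = (sigma - 1 - a) * L + (sigma - 1 - b)   # delay in samples
    return np.where(delay % 2 == 0, 1.0, -1.0)

def delayed_sum(x_serial, kernel, L):
    """Weighted sum of delayed copies of the serialized input."""
    sigma = kernel.shape[0]
    y = np.zeros(len(x_serial))
    for a in range(sigma):
        for b in range(sigma):
            d = (sigma - 1 - a) * L + (sigma - 1 - b)
            y[d:] += kernel[a, b] * x_serial[:len(x_serial) - d]
    return y

rng = np.random.default_rng(0)
for L in (6, 7):                                    # even and odd image width
    x = rng.random(L * L)                           # all-positive input, large DC
    W = rng.standard_normal((3, 3))
    y_plain = delayed_sum(x, W, L)
    y_coded = delayed_sum(alternate_sign(x), W * kernel_mask(3, L), L)
    print(L, np.allclose(alternate_sign(y_coded), y_plain))
```

For both widths the decoded output matches the uncoded weighted sum, while the coded input has a mean much closer to zero than the raw data.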
  
 
Suppl. Table 1. Experimental implementations of the required delays in the first and second convolutional layers. 
Because the maximal delay of the adopted TDL is 1.12 ns, it cannot provide the long delays required by the architecture. 
For the first layer, the width of the input images is L=28, the size of the convolutional kernel is 3×3, and the modulation rate 
is 2 GHz. The required delays are listed. Delays larger than 1.12 ns are implemented by digital processing. The width of the 
input images of the second layer is 7. The required delays and experimental implementations are also listed. 
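As a rough illustration of how such delays follow from the serialization scheme, the Python sketch below computes the delay of each kernel element from the image width and the modulation rate, and separates those within the 1.12 ns TDL range from those that would need digital processing. It assumes a delay of (row offset)·L + (column offset) symbol periods and a 2 GHz rate for both layers; it is not a reproduction of the values in Suppl. Table 1, and the function names are hypothetical.

```python
def required_delays_ns(L, sigma=3, rate_ghz=2.0):
    """Delay of each kernel element in ns, assuming (i*L + j) symbol periods."""
    period_ns = 1.0 / rate_ghz
    return [[(i * L + j) * period_ns for j in range(sigma)] for i in range(sigma)]

def split_by_tdl(delays_ns, tdl_max_ns=1.12):
    """Separate delays reachable by the TDL from those left to digital processing."""
    flat = [d for row in delays_ns for d in row]
    optical = [d for d in flat if d <= tdl_max_ns]
    digital = [d for d in flat if d > tdl_max_ns]
    return optical, digital

for width in (28, 7):                     # first and second layer image widths
    delays = required_delays_ns(width)
    optical, digital = split_by_tdl(delays)
    print(width, optical, digital)
```

Under these assumptions only the within-row delays fit in the TDL range, and the row-to-row delays exceed it.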
 
