With the proliferation of ultra-high-speed mobile networks and internet-connected devices, along with the rise of artificial intelligence, the world is generating exponentially increasing amounts of data, data that needs to be processed in a fast, efficient and 'smart' way. These developments are pushing the limits of existing computing paradigms, and highly parallelized, fast and scalable hardware concepts are becoming progressively more important. Here, we demonstrate a computationally specific integrated photonic tensor core (the optical analog of an ASIC) capable of operating at Tera-Multiply-Accumulate per second (TMAC/s) speeds. The photonic core achieves parallelized photonic in-memory computing using phase-change memory arrays and photonic chip-based optical frequency combs (soliton microcombs). The computation is reduced to measuring the optical transmission of reconfigurable and non-resonant, i.e. broadband, passive components operating at a bandwidth exceeding 14 GHz, limited only by the speed of the modulators and photodetectors. Given recent advances in hybrid integration of soliton microcombs at microwave line rates, ultra-low loss silicon nitride waveguides, and high speed on-chip detectors and modulators, our approach provides a path towards full CMOS wafer-scale integration of the photonic tensor core. While we focus on convolution processing, more generally our results indicate the major potential of integrated photonics for parallel, fast, efficient and wafer-scale manufacturable computational hardware in demanding AI applications such as autonomous driving, live video processing, and next generation cloud computing services.
limitations of electronic addressing 14, with additional challenges in the manufacturing and implementation due to issues with device variability 15,16, cyclability 17, and drift 18,19.
Integrated photonics benefits from the same modularity and scalable fabrication methods of integrated circuits, but has two key advantages over its electronic counterparts: (1) massively parallel data transfer through wavelength division multiplexing (WDM) in conjunction with multichannel sources (i.e. optical frequency combs); and (2) extremely high data modulation speeds limited only by the bandwidth of on-chip optical modulators and photodetectors. These uniquely photonic advantages have led to the ubiquity of optical networks for information transfer, and are presently revolutionizing data centre interconnects (i.e. server-to-switch communication). However, these developments have yet to seriously challenge digital electronics in the arena of information processing. Despite the current dominance of integrated electronics for computing, an application-specific optical processor not limited by the energy-bandwidth trade-off of electrical interconnects 8 could bring the advantages of optical networking to the field of computing. This would result in very high computational throughput via low-latency (i.e. information processing and propagation at light speed) and parallel operations in a single physical optical processing core using WDM.
However, for this to be practically realised, photonic integration and the use of CMOS compatible manufacturing are of paramount importance: on chip, both energy-efficient optical memory units and a compact, broadband multi-channel laser source must be combined within a scalable photonic architecture. Recent work on integrated photonic processors for MVMs and neuromorphic computing 20-22 has shown the potential advantages of the photonic approach, but key bottlenecks remained 23: large footprints (11,000 µm² per interferometer unit 20), the use of thermo-optic heaters to tune the phase or resonance wavelength of the components (on average from 1 mW per heater for ring resonators to 10 mW per heater for Mach-Zehnder interferometers), and the fact that resonant devices such as add-drop resonators limit the modulation bandwidth. Additionally, while using WDM for processing multiple inputs simultaneously in the same physical hardware has been proposed 24, it has not yet been demonstrated on-chip.
Here, we design and experimentally demonstrate a novel scalable, CMOS compatible, photonic hardware accelerator (which we term "photonic tensor core" in the following) capable of many parallel MVM operations at optical data rates to process images using convolution filters (here, edge detection and emboss filters). In a departure from electronic accelerators (see Fig. 1a ), our photonic processor implements an on-chip matrix multiplication engine capable of performing parallel multiply-accumulate operations using multiple wavelengths, derived from a photonic chip-based optical frequency comb, that are incoherently added within a network of waveguides that exploit phase-change materials. We leverage recent advances in chip-scale microcombs 25, 26 operating in the regime of dissipative Kerr soliton (DKS) states, which enable broadband, low-noise, and fully integrated optical frequency combs with line spacing ranging from GHz to THz domains and that are compatible with wafer scale manufacturing and integration with on-chip lasers 27-29 . These devices have already been employed in system level demonstrations such as massively parallel coherent communications 30 , chip-scale frequency synthesizers 31 , and massively parallel LiDAR 32 .
Key to our approach is the encoding of image data onto the individual comb teeth of an on-chip frequency comb, and subsequently encoding fixed convolutional kernels in the non-volatile configuration (i.e. the amorphous or crystalline phase) of integrated phase-change material cells that couple evanescently to a matrix of interconnected photonic waveguides (shown in Fig. 1c). Our approach minimizes both latency and the movement of data by using non-volatile in-memory photonic MAC operations, and greatly reduces the footprint cost of photonics by multiplexing computations in the same photonic core. Importantly, both the soliton microcombs and the matrix of photonic waveguides can be implemented in silicon nitride 33, an ultra-low loss, CMOS compatible nonlinear integrated photonic platform that is compatible with wafer scale manufacturing and foundry processes. Combined with recent advances in both on chip modulators and hybrid integration of soliton microcombs 27,28, fully integrated custom photonic tensor cores are now a viable proposition.
Realization of parallel 2D convolutions via matrix-vector multiply operations
One prominent class of machine learning models that stands to gain much from high throughput accelerators is convolutional neural networks (CNNs), which are highly effective for applications such as image classification, autonomous navigation, and audio analysis in the frequency domain. In state-of-the-art CNNs, many convolution "hidden layers" are applied to an input signal before feeding the processed data to fully connected layers for classification 34, 35 .
Each of the convolution layers takes in an input image, performs convolutional operations (with a 'filter') to extract features, and generates an output image. To perform each convolution operation, a filter is passed over the input image inspecting a small window of pixels at a time.
A pixel-wise MAC operation between the filter and the current window is carried out to calculate a single pixel of the output image. For the case of a convolution between an input image of dimension n×n with din channels and a filter of dimension k×k×din, the resulting output image is of dimension (n-k+1)×(n-k+1). In CNNs, dout convolution kernels will be applied to the same image, which corresponds to (n-k+1)²×k²×din×dout MAC operations per convolution layer. When performing these operations in the digital domain, a minimum of two clock cycles is typically required for each sequential MAC operation, leading to a significant computational bottleneck and requiring distribution across multiple computing cores, as illustrated in Fig. 1a.
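As a rough illustration of this operation count, the short Python sketch below evaluates (n-k+1)²×k²×din×dout for a single layer. The function name and the example layer (a 128×128 image with three channels and four 3×3 kernels) are chosen here for illustration only and are not taken from the experiments.

```python
def conv_layer_macs(n: int, k: int, d_in: int, d_out: int) -> int:
    """Number of MAC operations for one convolution layer.

    n     : side length of the square n x n input image
    k     : side length of the square k x k filter window
    d_in  : number of input channels
    d_out : number of convolution kernels (output channels)
    """
    if k > n:
        raise ValueError("filter window must not be larger than the image")
    out_side = n - k + 1                 # output image is out_side x out_side
    macs_per_pixel = k * k * d_in        # one MAC per filter weight
    return out_side * out_side * macs_per_pixel * d_out


# Illustrative layer: 128 x 128 RGB image, four 3 x 3 kernels.
example_macs = conv_layer_macs(n=128, k=3, d_in=3, d_out=4)
print(example_macs)
```

Even for this small layer the count is already above 10⁶ MAC operations, which is why sequential execution quickly becomes the bottleneck.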
In order to build efficient hardware to perform the convolution operations, one approach (originally conceived for electronic in-memory computing using memristive crossbar arrays 36, 37 ) is to combine all the convolutional filters into a large filter matrix stored in memory.
As depicted in Fig. 1b, such a filter matrix will be of dimension (k²×din) × dout. It is constructed by stacking the kernel matrices into the columns of the final filter matrix. In the same way, the pixels of the input image are rearranged by stacking the pixels of the filter volume, (k×k×din), into the rows of the input matrix. Hence a single convolution operation involves (n-k+1)² MVM operations between the filter matrix and the input vectors of k²×din dimension. In the electronic domain, these MVM operations are typically multiplexed in time, with parallelization afforded only by physically replicating the filter matrix. In this work, we exploit a photonic integrated soliton microcomb and optical WDM to overcome this fundamental limitation by encoding multiple input vectors of dimension k²×din onto multiple lines of a coherent chip-scale frequency comb, see Fig. 1c. These optical input vectors can then be applied to a single (k²×din) × dout filter matrix simultaneously, thus eliminating duplicated physical hardware and sequential operations. This approach will be employed when designing the photonic tensor core.
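The rearrangement described above can be sketched in a few lines of plain Python. This is a minimal software illustration under stated assumptions (square images, stride 1, no padding, no kernel flip), not the experimental control code; all function names are hypothetical.

```python
from typing import List

Image = List[List[List[float]]]  # indexed as [channel][row][col]


def im2col(image: Image, k: int) -> List[List[float]]:
    """Stack every k x k x d_in window into one row of the input matrix."""
    d_in, n = len(image), len(image[0])
    rows = []
    for r in range(n - k + 1):
        for c in range(n - k + 1):
            rows.append([image[ch][r + i][c + j]
                         for ch in range(d_in)
                         for i in range(k)
                         for j in range(k)])
    return rows  # (n-k+1)^2 rows, each of length k*k*d_in


def filter_matrix(kernels: List[Image]) -> List[List[float]]:
    """Flatten each kernel into one column; result is (k*k*d_in) x d_out."""
    cols = [[w for ch in kern for row in ch for w in row] for kern in kernels]
    return [list(entries) for entries in zip(*cols)]


def mvm(vec: List[float], mat: List[List[float]]) -> List[float]:
    """Matrix-vector multiplication: one output value per matrix column."""
    return [sum(x * mat[i][j] for i, x in enumerate(vec))
            for j in range(len(mat[0]))]


def conv_as_mvm(image: Image, kernels: List[Image]) -> List[List[float]]:
    """One MVM per output pixel; returns (n-k+1)^2 vectors of length d_out."""
    k = len(kernels[0][0])
    mat = filter_matrix(kernels)
    return [mvm(row, mat) for row in im2col(image, k)]


image = [[[1, 2, 3], [4, 5, 6], [7, 8, 9]]]        # one channel, 3 x 3
kernels = [[[[1, 0], [0, -1]]], [[[1, 1], [1, 1]]]]  # two 2 x 2 kernels
print(conv_as_mvm(image, kernels))
```

Each row returned by `im2col` corresponds to one input vector; in the photonic scheme, several such rows would be presented to the same filter matrix at once on different comb lines rather than one after another.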
The photonic tensor core
First, we demonstrate how to perform an MVM operation in the optical domain using photonic integrated circuits employing non-volatile phase-change cells that store analog values of the matrix in situ 38 . Details of using phase-change materials (PCMs) on single devices are described elsewhere 38, 39 . In this work, the PCM (Ge2Sb2Te5 or GST) cells are employed as attenuating matrix elements which absorb a desired amount of light depending on their particular phase configuration. In the crystalline PCM state, most of the incoming light is absorbed, representing for example a "0". In the amorphous state, most of the light is transmitted, thus representing a "1". Intermediate transmission states can be chosen by controllably switching fractions of amorphous and crystalline parts in the PCM cell 38, 40 . To achieve both positive and negative matrix elements, we here define "0" as an intermediate state between the crystalline and amorphous states as described in the Supplementary Information.
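One way to read the signed encoding is as a linear map from a weight in [-1, 1] to a cell transmission, with the "0" level placed halfway between the crystalline and amorphous transmissions. The sketch below makes that reading concrete. The actual mapping is given in the Supplementary Information and is not reproduced here, so the linear form, the transmission values `T_MIN` and `T_MAX`, and all function names are assumptions made for illustration.

```python
from typing import Sequence

# Hypothetical transmission levels, for illustration only.
T_MIN = 0.2                        # assumed fully crystalline cell ("-1")
T_MAX = 0.8                        # assumed fully amorphous cell ("+1")
T_ZERO = (T_MIN + T_MAX) / 2       # intermediate state representing "0"
HALF_RANGE = (T_MAX - T_MIN) / 2   # transmission change per unit weight


def weight_to_transmission(w: float) -> float:
    """Assumed linear map from a signed weight to a cell transmission."""
    if not -1.0 <= w <= 1.0:
        raise ValueError("weight must lie in [-1, 1]")
    return T_ZERO + w * HALF_RANGE


def column_power(inputs: Sequence[float], weights: Sequence[float]) -> float:
    """Non-negative optical power collected at one matrix column."""
    return sum(x * weight_to_transmission(w) for x, w in zip(inputs, weights))


def signed_dot(inputs: Sequence[float], weights: Sequence[float]) -> float:
    """Recover the signed dot product by removing the 'all-zero' offset."""
    reference = T_ZERO * sum(inputs)   # power if every cell stored "0"
    return (column_power(inputs, weights) - reference) / HALF_RANGE


result = signed_dot([0.5, 1.0, 0.25], [1.0, -1.0, 0.0])
print(result)
```

Under these assumptions the measured power is always positive, and the sign of the result appears only after the reference level is subtracted.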
In order to calculate the i×j MVM operation shown at the top of Fig. 2a, the input vector is encoded in the amplitude of the optical signals sent to the different matrix inputs. In addition to amplitude at a given wavelength, the input vector is also encoded at different wavelengths, providing the ability for multiple calculations to be carried out simultaneously. The amplitude of each wavelength represents one of the vector entries (X1, ... Xi). Therefore, the input vectors can be fed to the matrix by modulating the input wavelengths with currently available fast electro-optical modulators, providing access to very high data rates. The matrix itself is designed as a waveguide crossbar array with additional directional couplers that equally distribute the input power to all PCM-cells (more details of the splitting ratios of the directional couplers are given in the supplementary information). By using a soliton microcomb with a mode spacing that exceeds the detector bandwidth, interference inside the waveguides can be avoided and the summation of the individual products (of the matrix-vector multiplications) can be performed by adding the comb teeth to the output waveguides, also by using directional couplers. Each individual matrix cell has three additional grating couplers used to optically address the PCM. By sending pulses (via the middle coupler) to the waveguide directly leading to the PCM cell on the crossing, it can be optically switched for programming each matrix element (in this case the light is coupled to the chip using Bragg-grating couplers because operation at a single wavelength (1550 nm) is sufficient).
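A minimal numerical model of this encoding is sketched below: each vector entry rides on its own wavelength, every PCM cell scales the power passing through it, and each output detector adds the powers of the wavelengths belonging to one input vector. Because the sum is incoherent, powers simply add. The wavelength indexing scheme and all function names are assumptions made for illustration, and losses and coupler splitting ratios are ignored.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]  # cell transmissions, shape n_inputs x n_outputs


def encode(vectors: List[List[float]]) -> Dict[int, Tuple[int, float]]:
    """Assign entry i of vector v to wavelength index v*n + i at input port i."""
    n = len(vectors[0])
    return {v * n + i: (i, x)
            for v, vec in enumerate(vectors)
            for i, x in enumerate(vec)}


def propagate(signals: Dict[int, Tuple[int, float]],
              matrix: Matrix) -> List[Dict[int, float]]:
    """Power per wavelength arriving at each output column."""
    return [{wl: x * matrix[i][j] for wl, (i, x) in signals.items()}
            for j in range(len(matrix[0]))]


def detect(columns: List[Dict[int, float]],
           n_inputs: int, n_vectors: int) -> List[List[float]]:
    """Demultiplex by input vector, then sum the powers incoherently."""
    return [[sum(p for wl, p in col.items() if wl // n_inputs == v)
             for col in columns]
            for v in range(n_vectors)]


matrix = [[0.5, 1.0],
          [0.25, 0.0]]
vectors = [[1.0, 2.0], [4.0, 0.0]]   # two input vectors, four wavelengths
outputs = detect(propagate(encode(vectors), matrix), n_inputs=2, n_vectors=2)
print(outputs)
```

Both input vectors pass through the same `matrix` in a single step, so the number of wavelengths required is the vector length multiplied by the number of vectors processed in parallel.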
In addition to substantial benefits in modulation speed (for changing the vector inputs), an optical implementation of a matrix-vector multiplier allows the harnessing of wavelength division multiplexing (MUX) to execute parallel MVM operations. In particular, as Fig. 2d explains, the same matrix can be used to process several input vectors at the same time when all the individual vectors are encoded on different wavelengths. For the 4×4 matrix example shown in Fig. 2, and the processing of four input vectors per time step, sixteen different wavelengths are needed. In this work, these wavelengths are generated using a single DKS state of a microcomb 26, 44, 45 which is fed into a demultiplexer to split up the individual wavelengths ...
... operation in software is also performed in a post-processing step to subtract a certain reference power from the measured output power in the matrix columns (more details are provided in the supplementary information). In order to avoid this post-processing, the reference convolution operation can also be performed optically in the same on-chip matrix. In this case, one matrix column is left in a reference state (for example all PCM-cells in the crystalline phase state). The output value from this column is then subtracted from all the matrix columns holding the actual image kernels. Figure 3f shows an experimental example of a convolution operation which was performed without electrical post-processing using reference subtraction. Here, a 3×3 kernel (emboss filter) was applied using a 9×2 matrix, with one column for the image kernel and one column for the reference. The original image is shown on the left, while the experimental output image after the convolution operation is shown in the middle panel. From comparison with the calculated expected output on the right, it can be seen that the on-chip matrix also performs well without the need for the post-processing step.
It should be noted, that even though the image has three color channels red, green and blue (din 
Projections to the future
The above data were obtained with matrices up to a size of 9×4, with a maximum of four input vectors per time step and a relatively slow modulation speed resulting from the use of variable optical attenuators (approximately 1 kHz). To estimate the ultimate performance capabilities of the system, we now operate the tensor processor using high speed electro-optical modulators and multiple comb lines. Because the photonic system is designed with broadband input couplers and broadband directional couplers in silicon nitride with a wide optical transparency window, the tensor processor supports many comb lines from the frequency comb source. Figure 5a shows the optical spectrum of the frequency comb after transmission through the matrix, with lines in a wide range of over 100 nm with a spacing of 100 GHz, thus providing access to more than 200 individual wavelengths. The inset depicts a zoom into the 16 frequency lines that were used throughout the experiments discussed above. Besides the spectral width of the frequency comb, the influence of wavelength-dependent parts in the matrix design also has to be considered when estimating the wavelength range exploitable for the calculations. In this case, it is especially the wavelength dependence of the directional couplers that hinders the equal distribution of the input power for all wavelengths. The simple design of the couplers applied here still offers a range of approximately 100 nm (see supplementary information) but can be considerably improved by an adapted design 47 .
Each of the comb lines can be used for encoding vector values by setting the power with electro-optical modulators. Figure 5b shows the frequency response of the matrix for frequencies up to 14 GHz. The data illustrates the influence of the matrix on each input for modulation frequencies up to 14 GHz and was obtained by first characterizing the complete setup and then subtracting the frequency response of the setup with the matrix excluded. The flat response shows that the matrix only acts as a passive element during the convolution operations and does not limit the operational speed. In this experiment, the maximum frequency was determined by the photodiode which is specified (-3 dB bandwidth) up to 12 GHz. The inset shows an open eye-diagram at a rate of 13.5 GHz. Thus, considering a 9×4 matrix, four multiplexed input vectors and a modulation speed of 14 GHz, a processing speed of 2 TMAC/s (9×4 MACs × 4 input vectors × 14 GHz) can be obtained. This, however, is not the ultimate speed, since we are limited here by the modulation and detection bandwidth of our particular experimental setup.
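The throughput figure quoted above follows directly from the product of matrix size, number of multiplexed vectors and modulation rate. The short sketch below reproduces that arithmetic; the function name is hypothetical.

```python
def mac_rate(rows: int, cols: int, n_vectors: int, mod_rate_hz: int) -> int:
    """MAC operations per second for a rows x cols matrix that processes
    n_vectors wavelength-multiplexed input vectors per modulation period."""
    return rows * cols * n_vectors * mod_rate_hz


# 9 x 4 matrix, four multiplexed input vectors, 14 GHz modulation.
rate = mac_rate(rows=9, cols=4, n_vectors=4, mod_rate_hz=14_000_000_000)
print(f"{rate / 1e12:.3f} TMAC/s")
```

The exact product is 2.016 × 10¹² MAC/s, which is rounded to 2 TMAC/s in the text.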
To analyze the accuracy of the optical convolution processor, randomly chosen input vectors with nine entries are processed using a fixed matrix and are compared to the expected analytically calculated multiplication result. The results for 100,000 calculations are scaled to the range [0,1] and plotted in Fig. 5c; the inset shows the corresponding histogram, revealing a standard deviation of 0.008. Figure 5d ...
This is a factor of 4 greater than Google's recently developed custom tensor processing ASIC 6 (operating at 8-bit precision) with a compute density of 150 GMACs/mm². We note that by moving to a silicon-on-insulator platform with a nominal bend radius of 5 µm and using integrated electrical control of the GST 49, 50, it would be straightforward to reduce the area of the MAC unit cell to less than 30 × 30 µm², increasing the compute density to 15.6 TMACs/mm² per input channel. This is a ~100× improvement over digital implementations and scales linearly with the number of input vectors via WDM, a notably different computing paradigm compared to electronic approaches.
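The projected compute density can be checked with a short calculation. The sketch below assumes one MAC per unit cell per modulation period and a 14 GHz modulation rate; both are assumptions made here, chosen because together they reproduce the 15.6 TMACs/mm² figure. The constant and function names are hypothetical.

```python
UM2_PER_MM2 = 1_000_000  # 1 mm^2 = 10^6 um^2


def compute_density(cell_side_um: float, mod_rate_hz: float) -> float:
    """MAC/s per mm^2 for square unit cells, one MAC per cell per period."""
    cells_per_mm2 = UM2_PER_MM2 / (cell_side_um * cell_side_um)
    return cells_per_mm2 * mod_rate_hz


density = compute_density(cell_side_um=30.0, mod_rate_hz=14e9)
ratio_to_asic = density / 150e9   # digital ASIC figure of 150 GMACs/mm^2

print(f"{density / 1e12:.1f} TMAC/s per mm^2 per input channel")
print(f"about {ratio_to_asic:.0f}x the quoted digital compute density")
```

With a 30 × 30 µm² cell there are roughly 1,111 cells per mm², which at 14 GHz gives about 15.6 TMAC/s per mm² and a ratio of roughly 100 to the 150 GMACs/mm² digital figure.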
To estimate the full capabilities of the optical accelerator for convolution operations, the performance of common optical components in foundry services 51, 52 must be considered in combination with the wavelength range of the frequency comb that can be used. The frequency comb clearly shows lines from 1500 nm to 1650 nm (see Fig. 5a), giving a range of 150 nm that could be exploited for computation and that can be extended further by optimizing the setup.
Considering the spacing of the comb lines of 100 GHz (0.8 nm), this leads to approximately 150 nm / 0.8 nm = 187 different wavelengths. Decreasing the spacing to 50 GHz (0.4 nm) and increasing the matrix size to 50×50, the operational speed can reach an unprecedented 1 PMAC/s with a single matrix device, assuming a modulation and detection speed of 50 GHz.
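The wavelength count, and one possible accounting for the projected speed, can be sketched as follows. The text does not spell out how the 1 PMAC/s figure is obtained, so the second half of the sketch is an assumption: each input vector of a 50×50 matrix is taken to occupy 50 comb lines, and the available lines are divided among as many whole vectors as fit. Integer picometre units are used to avoid rounding artefacts; all names are hypothetical.

```python
def comb_lines(range_pm: int, spacing_pm: int) -> int:
    """Number of whole comb-line spacings that fit into the usable range."""
    return range_pm // spacing_pm


# 150 nm usable range, 100 GHz spacing (about 0.8 nm near 1550 nm).
lines_100ghz = comb_lines(range_pm=150_000, spacing_pm=800)

# Assumed accounting for the scaled-up system (not stated in the text):
lines_50ghz = comb_lines(range_pm=150_000, spacing_pm=400)
matrix_side = 50
vectors_in_parallel = lines_50ghz // matrix_side   # 50 lines per vector
rate = matrix_side * matrix_side * vectors_in_parallel * 50_000_000_000

print(lines_100ghz, lines_50ghz, vectors_in_parallel)
print(f"{rate / 1e15:.3f} PMAC/s")
```

Under this accounting the estimate comes out at 0.875 PMAC/s, the same order of magnitude as the quoted 1 PMAC/s.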
Conclusion
We describe the first instance of a photonic tensor core which combines in-memory computing with state-of-the-art photonic integrated frequency combs, enabling parallelized convolution operations in the same physical device. We demonstrate simultaneous data transfer and computing at speeds comparable to fiber networks. Prior optical approaches to computing have largely been limited by a lack of integrated non-volatile photonic memory and the lack of multiplexing capability for such calculations 20, 22, 48 . Our approach overcomes both these limitations by using non-volatile phase-change materials integrated on waveguides to locally store convolution kernels on-chip, and photonic chip-based frequency combs, which enable true in-memory photonic computing with WDM capability. The photonic tensor core demonstrated in this work is capable of operating at a speed of 2 TMAC/s, and promises operation faster by several orders of magnitude through moderate scaling with state-of-the-art foundry processes. A key feature is that, because the convolution operation is a passive transmission measurement, the calculations can in theory be performed at the speed of light at very low power, experimentally limited only by the modulation and detection bandwidths. Making use of the wavelength division multiplexing capabilities inherent to all-optical systems, our fast and parallelized implementation promises higher computational bandwidths when compared to electronic devices, as several pixels or even complete images can potentially be processed in a single time step. Our approach for convolution processing provides an effective method to remove the computing bottleneck in machine learning hardware for applications ranging from live video processing to autonomous driving and AI-aided lifesaving applications.
More importantly, such an approach more broadly suggests that integrated photonics is coming of age and in some cases can begin to match and even challenge electronic computation.
Methods

Device fabrication
The photonic circuits used for the convolution experiments are fabricated using a three-step electron-beam lithography (EBL; Raith EBPG 5150) process on a silicon nitride (325 nm) on silicon oxide (3300 nm) on silicon wafer (Rogue Valley Microdevices). The complete circuit was designed using GDShelpers, a design framework for integrated circuitry 53 .
In the first lithography step, windows in the positive tone resist polymethylmethacrylate (PMMA) are exposed for the deposition of alignment markers made from gold. The resist is developed in 1:3 MIBK:Isopropanol for 120 seconds and a layer stack of 5 nm chromium, 120 nm gold and 5 nm chromium is evaporated via electron-beam physical vapour deposition (EBPVD). By sonicating the chip in acetone, the PMMA is removed and only the gold markers in the exposed positions remain. The markers are used in the second step to align the photonic structures. After spin coating a layer of 300 nm of the resist and prebaking it for 60 seconds at 85°C, an etch mask is exposed in the negative-tone e-beam resist arN 7520.12. The photonic structures are developed in MF-319 for 75 seconds and a post-development bake is performed at 85°C for 60 seconds. Using reactive ion etching with a CHF3/O2 plasma, the mask of the photonic circuits is transferred into the sample. The silicon nitride layer is fully etched, leaving single mode waveguides at telecom wavelengths with a width of 1.2 µm and a height of 325 nm. Subsequently the remaining resist is removed in an oxygen plasma for 10 minutes. In the third EBL step, windows for the deposition of the phase-change material are written using the same markers for alignment as for the photonic structures. The same process as in the first EBL step is used. Finally, 10 nm of the phase-change material GST and 10 nm of indium tin oxide (ITO) are sputter deposited on the sample. Both layers are sputtered using RF sputtering with an argon plasma (5 mTorr pressure, 15 sccm Ar, 30 W RF power and a base pressure of 2×10⁻⁶ Torr). The ITO is used as a protective film to prevent oxidation of the phase-change material. As in the marker deposition, the PMMA is lifted off by sonicating the sample in acetone, leaving the phase-change material only in the desired positions on the photonic circuitry.
Prior to the experiments the GST is crystallized on a hot plate at 220°C for approximately 10 minutes.
Measurement setup
The experimental setups used to perform the convolution experiments are shown in supplementary figures S1-S3. The individual wavelengths are generated using a frequency comb that is operated in the single soliton state and separated using a fibre-based multiplexer.
For the image processing experiments (Fig. 3 and 4) the wavelengths (input vectors) are modulated using variable optical attenuators based on micro-electro-mechanical systems (MEMS), whereas the fast modulation (Fig. 5) was performed with a 20 GHz electro-optic modulator (EOM). The input signal is coupled to the chip using 3D printed broadband total internal reflection couplers capable of operating from the visible to the telecom wavelength regime.
In the multiplexed version of the experiment processing four vectors at the same time, the corresponding wavelengths are multiplexed and demultiplexed accordingly before and after the matrix again using fiber-multiplexers. The convolution results are read using photodetectors (New Focus Model 2011). In the frequency response experiment (Fig. 5 ) a fast photodiode (12 GHz) was utilised.
Realization of high-Q Si3N4 microresonators
The soliton microcombs used in our work are based on Si3N4 microring resonators with a free spectral range (FSR) of 100 GHz, shown in Fig. 2b. The microresonators are fabricated using the photonic Damascene process 45, which provides access to high Q factors reaching 10⁷ and enables the four-wave-mixing based nonlinear frequency conversion processes as well as the formation of DKS states at low pump powers 54 .
The microresonators were designed to have cross-section dimensions of 0.82 × 1.50 µm, which ensure anomalous group velocity dispersion (GVD) of about 1-2 MHz at around 1550 nm, needed for the Kerr comb generation and the formation of DKS states. The light is coupled evanescently to a microresonator via an on-chip bus waveguide with similar dimensions located close to the microring, which is additionally equipped with inverse tapers at the ends for edge chip coupling. The Si3N4 chips employed are furthermore fiber-packaged with an average loss of 4 dB/interface to facilitate the light coupling in and out of the system. The fabricated devices have Q-factors exceeding 5 × 10⁶, which allows for the DKS generation and switching 55,56 even for relatively low input pump powers below 1 W.
Soliton comb generation
For the DKS generation, a Si3N4 microring resonator is driven using a continuous-wave tunable fiber laser which is amplified with an Erbium-doped fiber amplifier (EDFA) to a power level of about 1 W. A high-power bandpass filter is used to suppress the amplified spontaneous emission (ASE) from the EDFA. The light polarization is adjusted using a fiber-based polarization controller to match the TE-polarized fundamental mode of the microresonator, and the light is then launched into the fiber-coupled Si3N4 chip.
In order to launch the DKS state, a standard pump tuning technique is applied 25 where the amplified seed laser is swept over the chosen resonance from the blue-detuned side to the red-detuned side at a speed of approximately 200 GHz/s. This approach allows the generation of multiple-soliton states with several pulses inside the cavity, which, however, usually have a highly structured optical spectrum. In order to achieve the single DKS state with a spectrally smooth sech²-shaped envelope, the soliton switching procedure is employed 55 
