#### **MSc in Photonics**

Universitat Politècnica de Catalunya (UPC) Universitat Autònoma de Barcelona (UAB) Universitat de Barcelona (UB) Institut de Ciències Fotòniques (ICFO)

http://www.photonicsbcn.eu

Master in Photonics

### MASTER THESIS WORK

# Building a photonic tensor core unit with an electronic interface for convolution processing

## **Ravi Pradip**

Supervised by Dr. Simone Ferrari (WWU Münster, Germany); co-supervised by Dr. Crina Maria Cojoacaru (UPC, Barcelona)

Presented on date 25<sup>th</sup> October 2022

Registered at

Escola Tècnica Superior d'Enginyeria de Telecomunicació de Barcelona

UNIVERSITAT POLITÈCNICA DE CATALUNYA







# Building a photonic tensor core unit with an electronic interface for convolution processing

#### **Ravi Pradip**

University of Münster, Physikalisches Institut, AG Pernice, Münster, Germany; Universitat Politècnica de Catalunya · Barcelona Tech – UPC, Barcelona, Spain

**Abstract.** With huge amounts of data being generated every second, the demand for parallelized, high-speed, and efficient computing power is rising rapidly, pushing the limits of existing computing paradigms. In this context, photonic computing hardware is a promising alternative to conventional electronics, offering high speed and remarkable power efficiency when accelerating multiply-accumulate (MAC) operations. Moreover, optical computing enables massive parallelism compared with its electronic counterparts through wavelength division multiplexing. This work involves the design and fabrication of an integrated photonic tensor core (PTC) capable of performing 60 million MAC operations per second. Optical computing hardware makes use of multiple electro-optic and digital-analog converters. This work therefore also involves the design and characterisation of a dedicated electronic interface to feed data to the PTC. To demonstrate the application potential, we perform convolution processing on 2D images in the optical domain with the newly developed hardware.

Keywords: Photonic computing, Neuromorphic computing, FPGA\* programming

#### 1. Motivation

Most powerful computers of recent years are based on a centralized processing architecture. First conceptualized by John von Neumann in 1945, this architecture consists of a central processing unit and a discrete memory, and is suited to performing sequential digital logic. Consequently, von Neumann computers are inefficient when it comes to processing data in a distributed and parallel fashion, a key requirement for performing multiply-accumulate (MAC) operations [1]. This is a significant drawback, since MAC operations are at the heart of the artificial intelligence and deep learning models required to analyze the vast amount of data being generated [2]. Even though recent years have seen rapid development of custom silicon computing hardware such as FPGAs\*, ASICs\* and GPUs\* [3][4], capable of parallel MAC computations and improved data throughput, these devices still suffer from the fundamental limitations of electrical signaling, which lead to high energy consumption [5].

Information processing with light has been gaining popularity in recent years. As photons, the quanta of light, are roughly 300 times faster than electrons and are bosonic in nature, photonic computing hardware offers increased speed along with massive parallelism compared with its electronic counterparts. Even though the linear nature of light makes it inconvenient for digital processing, the same property makes it well suited for implementing MAC hardware [6] [7] [8]. Optical signals are also free from electromagnetic interference, crosstalk and Joule heating, resulting in higher power efficiency [9]. In addition, advancements in integrated photonics have enabled scalable microfabrication of these photonic circuits.

A major limitation arises when it comes to operating photonic computing circuits with sufficient data throughput. All optical networks require certain permutations of digital-to-analog and electronic-to-optical conversions at their edges [10][11]. This immediately calls for a custom electronic interface that can operate and test the functionalities of photonic designs. However, building such a flexible interface with multiple synchronized data channels running at throughputs of at least several megabits per second is challenging. The ideal approach would be to fabricate ASICs that meet the specifications, but these are only viable when mass produced, which makes them prohibitive at the research and development stage. Another approach would be to use general purpose microcontrollers in conjunction with digital-to-analog converter (DAC) modules. Although these devices can be programmed with relative ease, they compromise on the number of synchronous channels and the data rate [12]. In this context, programming a commercially available FPGA is the next best approach. However, most FPGA development boards are limited in terms of analog data channels and have fixed signal specifications and bandwidth.

With these challenges in mind, this work includes the fabrication of an application-specific integrated photonic tensor core (PTC) together with the design of an electronic interface that performs the necessary data logistics. The PTC is capable of optical MAC computations, with which I demonstrate convolution processing on 2D images. The electronic interface has four synchronous channels with modulation frequencies of up to 15 MHz per channel. The digital and analog circuit design of the interface is extensible to several more synchronous channels, making it well suited to testing and operating sophisticated photonic computing networks.

#### 2. Materials and methods

This chapter describes the development of the individual stages required for performing image convolutions in the optical domain. These processing stages are summarized in a flow chart (figure 1). The subsections detail how each stage was implemented and how the measurements were performed.



Figure 1 A flow chart summarizing the necessary signal processing stages to realize convolution processing in the optical domain

#### 2.1 Convolution processing in the optical domain

This section describes the design of a quad-channel, CMOS-compatible integrated photonic tensor core capable of accelerating 2D convolutions. A discrete 2D convolution, especially of the kind performed in image processing, can be broken down into a sequence of matrix multiplications between the input matrix and a kernel (filter) matrix. Matrix multiplication (MM) in turn consists of several multiply-accumulate (MAC) calculations. A MAC calculation unit fundamentally requires scaling elements that perform scalar multiplications, followed by an accumulator that adds the scaled values together.

In this work, scaling of optical signals is achieved with directional couplers (figure 2.A). Generally, directional couplers (DCs) are a set of parallel waveguides that split a certain amount of input signal power from the incoming waveguide into a second waveguide in close proximity to it. A sample structure is shown in figure 2.A. It consists of two curved sections, where the input port A and the output ports C (cross) and D (through) are located, and a straight section in which the waveguides are separated by a narrow gap. If the phase velocity of light is the same in both channels and there is sufficient mode overlap between the two, a certain amount of optical power can be transferred between them. The amount of power transferred can be tuned from 0% to 100% by varying the waveguide separation and the interaction length (L) [13]. Therefore, DCs with a set L act as hard-coded optical memory units that provide a positive scaling factor in the range [0, 1] to the incoming light.
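The dependence of the scaling factor on the interaction length can be illustrated with the idealized sin² behaviour of a lossless, symmetric coupler. The following is a minimal sketch under that assumption; the function names and the 100 µm full-transfer length are hypothetical and are not values from the fabricated devices.

```python
import math

def cross_port_fraction(length_um, full_transfer_length_um):
    """Fraction of input power at the cross port of an ideal lossless coupler.

    full_transfer_length_um is the (assumed) interaction length at which
    100% of the power has moved to the second waveguide.
    """
    phase = math.pi * length_um / (2.0 * full_transfer_length_um)
    return math.sin(phase) ** 2

def length_for_weight(weight, full_transfer_length_um):
    """Shortest interaction length that yields a weight in [0, 1]."""
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return (2.0 * full_transfer_length_um / math.pi) * math.asin(math.sqrt(weight))

# Hypothetical full-transfer length, for illustration only.
L_FULL_UM = 100.0
half_length = length_for_weight(0.5, L_FULL_UM)
print(half_length, cross_port_fraction(half_length, L_FULL_UM))
```

In this idealized picture a weight of 0.5 is reached at half the full-transfer length; real devices deviate from this because of losses and the additional coupling in the curved sections.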

Once scaled, optical powers from multiple DCs are added together by incoherent superposition at photodetectors present at the output to complete an all-optical MAC operation.

To further extend this functionality, an arrangement of DCs (figure 2.B) can be used to perform convolution processing [14]. A 2D image convolution is algorithmically a series of matrix multiplications. A single step of convolving an image matrix with a 2x2 convolution kernel is shown in figure 2.C. The kernel is multiplied element-wise with the 2x2 window on the image matrix, and the elements of the resulting matrix are summed to form a single pixel of the final output image. The 2D window then slides over the whole image one step at a time, repeating the process to acquire the remaining pixels of the convolved image. Depending on their shape and contents, convolutional filters sharpen, blur or perform edge detection on an image. Moreover, convolutional layers form an integral part of the convolutional neural networks (CNNs) used in state-of-the-art artificial intelligence and deep learning models.
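The sliding-window procedure described above can be sketched in software as follows. This is only a reference model of the arithmetic, not of the photonic hardware; the function name, the test image and the vertical-edge kernel are illustrative assumptions. As in the description, the kernel is applied without flipping.

```python
def convolve2d_valid(image, kernel):
    """Slide the kernel over the image; each output pixel is one sum of products."""
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    output = []
    for i in range(out_rows):
        row = []
        for j in range(out_cols):
            acc = 0.0
            # Multiply-accumulate over the current window.
            for u in range(k_rows):
                for v in range(k_cols):
                    acc += image[i + u][j + v] * kernel[u][v]
            row.append(acc)
        output.append(row)
    return output

# Illustrative 3x4 image with a dark-to-bright step in the middle.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
kernel = [[-1, 1],
          [-1, 1]]
result = convolve2d_valid(image, kernel)
print(result)
```

Only the windows that straddle the step produce a non-zero output, which is the edge-highlighting behaviour expected from such a kernel.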





Figure 2.B shows a photonic tensor core design capable of accelerating 2D convolution processing with 2x2 kernels. The design was created using GDShelpers, a framework for developing integrated photonic circuitry. To conveniently perform external modulation and detection, grating couplers are placed at the input and output ports of the design. The photonic circuit hardcodes the elements of the kernel matrix into several DCs to perform efficient convolutions. The elements of the 2x2 input matrix, amplitude modulated onto an optical pulse train, enter ports a, b, c and d simultaneously, where they are weighted by the DCs. To incorporate negative scaling factors with DCs, this design uses balanced detection from two outputs and defines 0 as an intermediate element between the two limits, extending the range of scaling to [-1, 1]. The weighted light amplitudes are then coupled out of the chip through output ports O1 and O2. In this specific design, a DC that transfers most of the light onto output O2 is defined as 1, while one that brings most of the light to O1 is defined as -1. The kernel highlights edges of the image in a certain direction. Similarly, the remaining light from the DC's through port can be passed on to several further kernel matrices to apply multiple separate convolution filters to the same input. In this work, a total of four 2x2 kernels are implemented, highlighting edges along the positive and negative x and y directions.
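The way balanced detection yields signed weights can be illustrated with a simplified model. The sketch below assumes that a weight w in [-1, 1] corresponds to routing a fraction (1 + w)/2 of a channel's power to O2 and (1 - w)/2 to O1; this mapping and the function name are assumptions made for illustration, and the model ignores losses and the light tapped off for subsequent kernels.

```python
def balanced_mac(input_powers, weights):
    """Signed multiply-accumulate from the difference of two detected powers."""
    power_o1 = 0.0
    power_o2 = 0.0
    for x, w in zip(input_powers, weights):
        if not -1.0 <= w <= 1.0:
            raise ValueError("weights must lie in [-1, 1]")
        # Assumed mapping: w = +1 sends all light to O2, w = -1 sends all to O1.
        power_o2 += x * (1.0 + w) / 2.0
        power_o1 += x * (1.0 - w) / 2.0
    # Incoherent powers add at each detector; the subtraction restores the sign.
    return power_o2 - power_o1

# Window values (a, b, c, d) and a kernel flattened in the same order.
window = [0.2, 0.8, 0.2, 0.8]
kernel = [-1.0, 1.0, -1.0, 1.0]
print(balanced_mac(window, kernel))
```

Under this assumption the detector difference equals the ordinary dot product of the window and the kernel, while every individual optical power stays non-negative.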

#### 2.2 Fabricating the integrated photonic chip

The photonic design was fabricated on a *silicon nitride*  $Si_3N_4$  (330 nm) on *silicon oxide*  $SiO_2$  (3300 nm) on silicon (Si) wafer (Rogue Valley microdevices) using a 100 keV Raith EBPG5150 electron beam lithography (EBL) system. First, the 20 x 20 mm wafer was annealed at 1100°C for four hours to ensure sufficient film quality. Next, a negative resist (*Ar-N* 7520.12) was spin coated on the sample. The mask for the photonic design was then written on the sample by EBL and developed, after which the  $Si_3N_4$  layer was fully etched by reactive ion etching using  $CHF_3/O_2$  plasma. Afterwards, the resist was removed with oxygen plasma. The waveguides were also cladded with 800 nm of HSQ. To prepare the cladding, a positive HSQ resist was spin coated on the sample, and the area of the sample containing the newly written waveguides was exposed once more under the EBL to cure the resist, completing the fabrication procedure. The fabricated design was then inspected under a microscope (figure 3) before further testing and measurements.



Figure 3. The photonic circuits are fabricated on silicon nitride on silicon dioxide on silicon wafer (20 x 20 mm). a) The whole circuit with 4 kernel matrices. b) An enlarged image of the design marked with dotted lines in a). The DCs that encode the kernel elements are shown here.

#### 2.3. Building the electronic interface

The photonic tensor core needs to be supplied with synchronous amplitude modulated optical signals to all four of its inputs to facilitate the convolution as described in Section 2.1. Since the image data resides on a computer in electronic digital format it had to be brought out externally and rearranged onto appropriate channels. In addition to the digital circuitry, a digital to analog converter (DAC) with appropriate output specifications was also implemented.

#### 2.3.1 Digital Design

A sketch of the digital layout is shown in figure 4. The input for the PTC resides as a digital image inside a PC. To transfer this data externally as electrical signals, the design uses a *FTDI* FT232H module. Over USB, the module receives 4-bit integers as a first in first out (FIFO) array from a C++ program (provided by the manufacturer) running on the PC, and outputs them sequentially onto physical pins at the rising edge of an internal 60 MHz clock (figure 4). In other words, every time the clock ticks, a new 4-bit number appears on four physical pins of this board. This data bus, along with the clock signal, is then passed on to a field programmable gate array (FPGA) programmed to take care of further data processing. FPGAs contain a matrix of logic blocks connected via programmable interconnects. Compared to general purpose microcontrollers, the structure of FPGAs allows them to execute digital logic continuously and at much higher speeds. The FPGA used in this work was a *Lattice ICE40-hx8k*, programmed with the *ICEstorm* toolkit, which incorporates the Verilog hardware description language along with some graphical layout capabilities.

The logic programmed into the FPGA is governed by the same clock as the USB-FIFO module. The major component of the design consists of two sets of four digital flip-flops (DFFs), primary and secondary, as marked in figure 4. In addition, a ring counter counts the positive edges of the clock. The counter's function is to sequentially activate the primary DFFs so that they store the incoming values on the bus. Signal propagation from input to output of the PTC is several orders of magnitude faster than the electronic signals. Hence, any delay between the output pulses of the individual channels of the electronic interface would result in distorted output pulses from the PTC. Therefore, the output channels need to be synchronized, for which the secondary set of DFFs is used. Together they function as shown in the timing diagram (figure 5).

Figure 4. A schematic of the digital design programmed inside the FPGA. The above design is implemented in ICEstudio software, a programming environment for LATTICE FPGAs.



Figure 5. Timing diagram of the digital circuitry. Each colored block represents a 4-bit value transferred to the FPGA from PC via the USB-FIFO module.

The four primary DFFs are filled by the fourth rising edge of the clock. On the falling edge of the fourth clock tick, all four values are simultaneously passed on to the secondary set of DFFs, whose outputs are mapped to 16 physical general purpose input output (GPIO) pins of the device. On the 5<sup>th</sup> clock cycle, the primary set of DFFs starts to be overwritten with new values while the secondary set of DFFs holds the previous set of values. On the 7<sup>th</sup> falling edge of the clock, the secondary DFFs are reset to 0. This procedure keeps executing as long as there is data on the USB-FIFO bus. In this manner the FPGA divides the 4-bit, 60 MHz single channel bus into a 4-bit quad channel bus running at 15 MHz. The data channels then carry the appropriate information to be fed into the four photonic chip inputs, provided that the subsequent domain crossings are carried out accordingly.
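The serial-to-parallel step performed by the two DFF sets can be modelled behaviourally as below. This is a simplified software sketch rather than the Verilog design itself: the function name is hypothetical, clock edges are not modelled, and the reset of the secondary DFFs to 0 between words is omitted.

```python
def demultiplex(stream, channels=4):
    """Group a serial stream of 4-bit values into synchronous output words."""
    words = []
    primary = []  # stands in for the primary DFFs filled one per clock tick
    for value in stream:
        if not 0 <= value <= 15:
            raise ValueError("each bus value must fit in 4 bits")
        primary.append(value)
        if len(primary) == channels:
            # Stands in for the secondary DFFs latching all values at once.
            words.append(tuple(primary))
            primary = []
    # Values left in a partially filled primary set are never latched.
    return words

serial_bus = [1, 2, 3, 4, 5, 6, 7, 8]
print(demultiplex(serial_bus))
print(60e6 / 4)  # word rate in Hz for a 60 MHz input bus
```

Each output tuple represents one set of four values presented to the four DAC channels at the same instant, so a 60 MHz input stream produces words at 15 MHz.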

#### 2.3.2 Digital to analog converter design

To amplitude modulate the prepared data onto light, the digital data signals had to be converted into analog voltages that drive electro-optic modulators (EOMs). The digital to analog conversion is done with the circuit depicted in figure 6. The primary stage of the circuit consists of a 4-bit R-2R DAC. The R-2R DAC employs an arrangement of passive resistors to convert the digital signals produced by the FPGA into a square pulse train with 16 different amplitude levels [15]. The most significant bit (MSB) through the least significant bit (LSB) are driven with voltages from the GPIO pins of the FPGA. The FPGA switches between 0 V, corresponding to a logic 0, and 3.3 V, corresponding to a logic 1. The ladder network of resistors then weights the contribution of each 3.3 V signal to the output according to the significance of the bit. The circuit uses only two resistor values, R and 2R, and operates as a string of current dividers whose output accuracy depends on how precisely the resistor values are matched. A formal analysis of this circuit shows that both its output impedance and its input impedance are R. To ensure maximal signal transfer, this impedance R needs to be at least ten times the output impedance [16] of the FPGA pins (100  $\Omega$ ). Hence, an R value of 1 k $\Omega$ was chosen for this DAC.
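The 16 amplitude levels can be illustrated with the textbook transfer function of an ideal, unloaded R-2R ladder, V_out = V_high · code / 2^N. The sketch below assumes perfectly matched resistors and ideal 0 V / 3.3 V drivers; the function name is hypothetical, and the values do not include the high pass filter or the amplifier stage.

```python
def r2r_output_voltage(code, bits=4, v_high=3.3):
    """Ideal unloaded R-2R ladder output for an unsigned input code."""
    if not 0 <= code < 2 ** bits:
        raise ValueError("code does not fit in the given number of bits")
    return v_high * code / 2 ** bits

# All 16 levels of a 4-bit ladder driven from 3.3 V logic.
levels = [r2r_output_voltage(code) for code in range(16)]
step = levels[1] - levels[0]
print(levels[-1], step)
```

In this idealized case the step between adjacent codes is 3.3 V / 16, and the largest code reaches 15/16 of the logic-high voltage.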



Figure 6 Circuit diagram of the Digital to Analog converter. The DAC stage and amplifier stage are marked separately.

Further, the converted analog signal is passed through a high pass filter to remove the DC bias from it before entering an amplifier stage. The amplifier stage consists of an LM7171 high speed operational amplifier with a bandwidth of 200 MHz in non-inverting configuration [16]. This stage acts as a buffer between the R-2R DAC and the output load to be connected to it. A buffer stage helps to prevent any kind of unacceptable loading of the DAC circuit that could interfere with its desired operation. The output of the amplifier is then further connected conveniently to an SMA connector with a 50  $\Omega$  series termination resistor in order to match the input impedance of the EOMs, minimizing reflections. Series termination resistors do limit the current output of the circuit. However, since EOMs are voltage driven devices, this was not an issue.

The design was initially simulated and revised in LT Spice circuit simulation software to ensure the intended functionality before being physically assembled on a printed circuit board (PCB). The FPGA connects to the PCB via proprietary board-to-board connectors from the manufacturer. Once on board, the digital signals are routed to four separate DACs as described earlier. Since the board generates frequencies on the order of several megahertz, it was also necessary to use a double layer PCB design with an uninterrupted ground layer directly under the signal traces. Once the functionality of the board had been tested electrically using an oscilloscope, it was ready to be used in combination with the photonic chip.

#### 2.4 Experimental setup

To perform image convolutions using the photonic tensor core, the setup in figure 7 was used. A 1550 nm amplified spontaneous emission (ASE) laser source is divided into 4 coherent channels using a 4-port splitter. Each of these channels is then amplitude modulated using external EOMs to encode the pixel values of the image to be convolved. The modulation signals for the EOMs are provided by the electronic interface.

The modulated optical signals are then passed through polarization controllers to optimize the coupling efficiency to the TE mode of the on-chip waveguides. They enter the integrated waveguides through a fiber array positioned above the on-chip grating couplers. The photonic chip sits on a computerized stage, which is used to precisely align the fiber array above the grating couplers. The same fiber array collects the output light from the integrated waveguides and delivers it to a photodetector. The RF output of the photodetector is monitored using a 4-channel oscilloscope. The data from the oscilloscope is collected and post-processed on a PC to reveal the convolved images.



Figure 7 A block diagram of the experimental setup.

#### 3. Results and discussion

#### 3.1 Directional coupler characteristics



Figure 8 (LEFT) The dependence of light transmission through the cross port of Directional couplers as a function of their coupling length (L). (RIGHT) The same measurement done for adjacent wavelengths to demonstrate potential for wavelength-division multiplexed inputs.

To ensure correct power distributions in the photonic matrix, the transmission at the cross port of the DCs has to be characterized. The transmissions were measured experimentally using the calibration devices included in the fabricated design. Figure 8 depicts the measured data. The transmission follows a  $\sin^2$  dependence, as expected from theory [13]. The range of accessible scaling factors was slightly lower than expected, most probably due to insertion losses at the input and output ports of the DC. The second plot shows the same measurements at adjacent wavelengths. Scaling factors from 0 to about 0.75 were achieved up to 20 nm away from the central wavelength. This plot shows that a broad range of wavelengths can be used for multiplexing input matrices for parallel computing on the same device, an additional technique to improve the data throughput without increasing the modulation frequency. The relative loss in transmission between the curves is most likely due to the wavelength dependent performance of the grating couplers. Hence, adopting wider bandwidth couplers or performing modulation and detection on-chip is a goal for further design revisions.

#### 3.2 Electronic interface performance

A 7.5 MHz square wave generated with the electronic interface is shown in figure 9A. The maximum peak-to-peak voltage was 2.7 V, compared with 3.2 V in the initial simulation. This could be due to insertion losses at the multiple connectors in the transmission path. The rise time of the pulses also appeared to be slightly longer. The stepped rise of the pulses indicates impedance variations in the transmission line. The pulse periods are also not perfectly regular, most likely due to finite jitter of the FPGA clock. Even though the first two effects are less significant for our experiments, the last observation implies that we cannot simply sample the final output signal from the PTC at regular intervals to determine the pulse amplitudes. Instead, the situation called for an algorithm that detects the rising and falling edges of individual pulses before sampling between them.
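One possible way to sample between detected edges is sketched below. It is not the algorithm used in this work, only a minimal threshold-crossing illustration; the function name, the fixed threshold and the choice to average the middle half of each pulse are assumptions.

```python
def pulse_amplitudes(samples, threshold):
    """Return one averaged amplitude per pulse found between a rising and a falling edge."""
    amplitudes = []
    start = None
    for i in range(1, len(samples)):
        rising = samples[i - 1] < threshold <= samples[i]
        falling = samples[i - 1] >= threshold > samples[i]
        if rising:
            start = i
        elif falling and start is not None:
            pulse = samples[start:i]
            # Average only the middle half to stay clear of the edge transients.
            quarter = len(pulse) // 4
            core = pulse[quarter:len(pulse) - quarter]
            amplitudes.append(sum(core) / len(core))
            start = None
    return amplitudes

# Illustrative trace with two pulses of unequal length.
trace = [0, 0, 1, 1, 1, 1, 0, 0, 0, 2, 2, 2, 0]
print(pulse_amplitudes(trace, threshold=0.5))
```

Because the sampling window is tied to the detected edges rather than to a fixed period, irregular pulse spacing does not shift the sampling point. A pulse whose amplitude stays below the threshold would be missed by this simple version, so a real implementation would need a reference or clock-recovery scheme for such cases.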



Figure 9. A) A square wave generated with the electronic interface. The curve in blue is from simulating the same setup in LT Spice circuit simulation software. B) Render of the electronic interface PCB

After ensuring sufficient signal integrity, the pixel values of a grayscale image were encoded onto the square pulse amplitudes. The electrical signal was converted to an optical one using an EOM, whose output was directly monitored with a photodiode. The pulse amplitudes were determined from the photodiode output to reconstruct the original image. Figure 10 shows the original image and the one after reconstruction. The two are nearly identical except for a slight decrease in contrast.



Figure 10 (LEFT) 128 px X 128 px grayscale image with 4-bit color depth. The pixel values of this image are sent as an amplitude modulated pulse train through the electronic interface. (RIGHT) The same image reconstructed after electro-optic, digital-analog domain conversions and back.

#### 3.3 Convolution Processing

Convolution kernels are small matrices used to apply effects such as blurring, sharpening, outlining or embossing to images. They are also used in machine learning for 'feature extraction', a technique for determining the most important portions of an image. The 2x2 kernels hardcoded in our PTC detect the edges of an image in certain directions. Edge detecting kernels help draw boundaries around objects in an image. Since our kernel matrix dimensions are limited to 2x2, a single kernel can detect edges in only one direction. The four kernels are shown below:

$$\text{1)}\ \begin{pmatrix} -1 & 1 \\ -1 & 1 \end{pmatrix} \qquad \text{2)}\ \begin{pmatrix} -1 & -1 \\ 1 & 1 \end{pmatrix} \qquad \text{3)}\ \begin{pmatrix} 1 & 1 \\ -1 & -1 \end{pmatrix} \qquad \text{4)}\ \begin{pmatrix} 1 & -1 \\ 1 & -1 \end{pmatrix}$$

A *64px* x *64px* grayscale image, shown in figure 11, was chosen as the input. Since all kernels are inscribed in the same photonic matrix, the pixel values of all four convolved images can be obtained simultaneously. This involves 16 MAC operations in total (four kernels with four elements each), limited only by the modulation frequency of the electronic interface. The interface modulates the input vectors at a rate of 15 MHz, which corresponds to 4 image convolutions in 1 millisecond.

As for the quality of the images after convolution, the first two kernels produced the best results. The last two do highlight the expected edges but do not completely suppress the original image, revealing slight traces of it. This could stem from unbalanced optical input power in the individual channels, resulting from the insertion losses of the fiber connections at all components from the external EOMs to the polarization controllers. This imbalance results in unequal weighting by certain kernel elements. Fabrication imperfections could also cause variations in the splitting ratios of individual DCs, adding to this imbalance. Additionally, in this design successive kernels receive only a part of the light from the previous ones, which worsens the signal-to-noise ratio at the outputs of the later kernels. In any case, irrespective of these power fluctuations, the convolved images, when added together, successfully highlighted most edges of the original image, confirming the functionality of our PTC and its electronic interface.



Figure 11. The upper row shows the reconstructed images after convolving with edge highlighting kernels. Kernels 1 and 4 highlight opposite vertical edges while the other two detect the horizontal ones. The lower row shows the original 64px x 64px image and the result after combining all the convolved images from the upper row.

The power consumed by the PTC to perform the image convolutions is dominated by the control electronics and photodetectors. However, an estimate of the optical energy consumption of the PTC alone can be determined from the cumulative losses introduced by the integrated photonic devices. The maximum output pulse power from the PTC was about 8  $\mu$ W. This is after a 6 dB loss from the grating couplers and a 9 dB loss from the arrangement of DCs. These losses, along with the input pulse duration (133 ns), were used to estimate the energy consumed by the PTC per MAC operation, which came out to be approximately 8 pJ/MAC. This energy consumption can be significantly improved by implementing DCs with lower insertion losses [14]. The coupling losses can also be brought down to 1.5 dB per coupler with optimized low loss couplers [17]. Integrating the laser source and modulators on the same chip would further minimize the cumulative loss.
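The order of magnitude of this estimate can be reproduced with the short calculation below. It is one plausible reconstruction rather than the exact calculation performed in this work: the input power is inferred by undoing the 15 dB total loss, and the pulse energy is assumed to be shared among the four MAC operations of one 2x2 kernel. The function name and the division by four are assumptions.

```python
def energy_per_mac_pj(output_power_uw, total_loss_db, pulse_duration_ns, macs_per_pulse):
    """Optical energy per MAC, inferred from output power, loss and pulse length."""
    # Undo the on-chip loss to recover the optical power entering the chip.
    input_power_uw = output_power_uw * 10 ** (total_loss_db / 10.0)
    # 1 uW x 1 ns = 1 fJ, so divide by 1000 to express the pulse energy in pJ.
    pulse_energy_pj = input_power_uw * pulse_duration_ns / 1000.0
    return pulse_energy_pj / macs_per_pulse

# 8 uW output, 6 dB + 9 dB loss, 133 ns pulses; 4 MACs per kernel is an assumption.
estimate = energy_per_mac_pj(8.0, 6.0 + 9.0, 133.0, 4)
print(round(estimate, 2))
```

Under these assumptions the result is about 8.4 pJ per MAC, consistent with the approximate figure quoted above. Lower coupler and DC losses reduce the inferred input power, and hence the energy per MAC, proportionally.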

#### 4. Conclusion and outlook

This work demonstrates the fabrication of a CMOS-compatible, application-specific integrated photonic tensor core capable of performing 2D convolutions in the optical domain. The device performs efficient in-memory MAC computations by hardcoding the matrix elements of convolution kernels in the splitting ratios of photonic directional couplers. Since optical computing architectures use analog signals to perform calculations, a custom interface was built to convert digital electronic data stored on a PC into analog optical signals. The interface has four synchronous channels, each with a modulation frequency of up to 15 MHz. The digital and analog design is also extensible, allowing several more channels to be implemented, limited only by the number of GPIO pins on the FPGA. To establish the functionality of the newly developed hardware, a set of image convolutions was successfully performed with edge highlighting kernels.

Even though the electronic interface presented here can modulate several channels at 15 MHz, it still bottlenecks the data throughput of the optical circuit many times over. To resolve this bottleneck, we need an optoelectronic co-design that takes opto-electronic converters and digital/analog converters into account [17]. However, integrating CMOS electronics with optical circuits and fabricating a high density opto-electronic interface that includes several on-chip optical sources, modulators and detectors requires advancements in photonic computing architecture as well as in hybrid fabrication [18]. Another promising approach to bypassing the parasitic effects of digital to analog domain conversion is to implement optical DACs and ADCs [19][20]. This would mean that photonic processors can take in digital optical signals, convert them to analog before processing, and convert them back into digital data at their outputs [21]. Such optical processing networks would then be better suited to implementation alongside existing high speed CMOS technology, allowing for maximum data throughput.

#### Acknowledgment

I would like to express my sincere gratitude to Prof. Dr. Wolfram Pernice for giving me the opportunity to perform my final master project in his research group. I would also like to thank Dr. Simone Ferrari for his supervision, without which this work would not have been possible. Further, I am grateful to all the members of AG-Pernice, especially Mr. Akhil Varri and Mr. Frank Brückerhoff-Plückelmann, both of whom actively provided guidance during my experiments.

#### 5. References

[1] Ben-Nun, Tal and Hoefler, Torsten. Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis. New York, NY, USA : Association for Computing Machinery, August 2019. Vol. 52. ISSN: 0360-0300.

[2] Pacheco, Peter S. and Malensek, Matthew. Chapter 1 - Why parallel computing. Second Edition. Philadelphia : Morgan Kaufmann, 2022. pp. 1-15. ISBN: 978-0-12-804605-0.

[3] Optimizing FPGA-Based Accelerator Design for Deep Convolutional Neural Networks. Zhang, Chen, et al. New York, NY, USA : Association for Computing Machinery, 2015. Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. pp. 161–170. ISBN: 9781450333153.

[4] In-Datacenter Performance Analysis of a Tensor Processing Unit. Jouppi, Norman P., et al. New York, NY, USA : Association for Computing Machinery, 2017. Proceedings of the 44th Annual International Symposium on Computer Architecture. pp. 1–12. ISBN: 9781450348928.

[5] Miller, D. A. B. Attojoule optoelectronics for low-energy information processing and communications. *J. Lightwave Technol.* **35**, 346–396 (2017).

[6] Shen, Y. et al. Deep learning with coherent nanophotonic circuits. *Nat. Photon.* **11**, 441–446 (2017).

[7] Feldmann, Johannes & Youngblood, Nathan & Karpov, Maxim & Gehring, Helge & Li, Xuan & Gallo, Manuel & Fu, Xin & Lukashchuk, Anton & Raja, Arslan & Liu, Junqiu & Wright, David & Sebastian, Abu & Kippenberg, Tobias & Pernice, Wolfram & Bhaskaran, Harish. (2020). Parallel convolution processing using an integrated photonic tensor core.

[8] Pérez, D. et al. Multipurpose silicon photonics signal processor core. *Nat. Commun.* **8**, 636 (2017).

[9] Brunner, Daniel & Marandi, Alireza & Bogaerts, Wim & Ozcan, Aydogan. (2020). Photonics for computing and computing for photonics. Nanophotonics. 9. 10.1515/nanoph-2020-0470.

[10] Drenski, Tom & Rasmussen, Jens. (2018). ADC & DAC - Technology Trends and Steps to Overcome Current Limitations. M2C.1. 10.1364/OFC.2018.M2C.1.

[11] Laperle, Charles & O'Sullivan, Maurice. (2014). High-Speed DACs and ADCs for Next Generation Flexible Transceivers. Optics InfoBase Conference Papers. SM3E.1. 10.1364/SPPCOM.2014.SM3E.1.

[12] Al-Dhaher, A.H.G. (2004). Development of Microcontroller/FPGA-based systems. International Journal of Engineering Education. 20. 52-60.

[13] Lifante, G. (2003). Introduction to Integrated Photonics. In Integrated Photonics: Fundamentals, G. Lifante (Ed.). https://doi.org/10.1002/0470861401.ch1

[14] Brückerhoff-Plückelmann, Frank & Feldmann, Johannes & Gehring, Helge & Zhou, Wen & Wright, David & Bhaskaran, Harish & Pernice, Wolfram. (2022). Broadband photonic tensor core with integrated ultra-low crosstalk wavelength multiplexers. Nanophotonics. 11. 10.1515/nanoph-2021-0752.

[15] I. Sinclair, J. Dunton (1988). Practical electronics handbook — Second edition. Microelectronics Reliability. 28. 993. 10.1016/0026-2714(88)90301-0.

[16] Fink, Donald & McKenzie, Alexander. (1975). Electronics engineers' handbook. New York: McGraw-Hill, 1975, edited by Fink, Donald G.; McKenzie, Alexander A

[17] Chmielak, Bartos & Suckow, Stephan & Parra, Jorge & Duarte, Vanessa & Mengual, Teresa & Piqueras, Miguel & Giesecke, Anna & Lemme, Max & Kilders, Pablo. (2022). High-efficiency grating coupler for ultra-low loss Si3N4 based platform. Optics Letters. 47. 10.1364/OL.455078.

[18] Xu, Yi & Pasricha, Sudeep. (2014). Silicon Nanophotonics for Future Multicore Architectures: Opportunities and Challenges. Design & Test, IEEE. 31. 9-17. 10.1109/MDAT.2014.2332153.

[19] Meng, Jiawei & Miscuglio, Mario & George, Jonathan & Babakhani, Aydin & Sorger, Volker. (2020). Electronic Bottleneck Suppression in Next-Generation Networks with Integrated Photonic Digital-to-Analog Converters. Advanced Photonics Research. 2. 2000033. 10.1002/adpr.202000033.

[20] Kim, Nak & Dagli, Nadir. (2012). A subranging photonic ADC based on cyclic code. 222-223. 10.1109/IPCon.2012.6358572.

[21] Nezami, Mohammadreza & Ferreira de Lima, Thomas & Mitchell, Matthew & Yu, Shangxuan & Wang, Jing & Bilodeau, Simon & Zhang, Weipeng & Al-Qadasi, Mohammed & Taghavi, Iman & Tofini, Alexander & Lin, Stephen & Shastri, Bhavin & Prucnal, Paul & Chrostowski, Lukas & Shekhar, Sudip. (2022). Packaging and Interconnect Considerations in Neuromorphic Photonic Accelerators. IEEE Journal of Selected Topics in Quantum Electronics. PP. 1-12. 10.1109/JSTQE.2022.3200604.