Convolutional neural networks (CNNs) provide state-of-the-art results in a wide variety of machine learning (ML) applications, ranging from image classification to speech recognition. However, they are computationally intensive and require large amounts of storage. Recent work has aimed at reducing the size of CNNs: [1] proposes a binary-weight network (BWN), in which the filter weights (w_i's) are ±1, with a common scaling factor α per filter. This significantly reduces the storage required for the w_i's, making it possible to store them entirely on-chip. However, in a conventional all-digital implementation [2, 3], reading the w_i's and the partial sums from the embedded SRAMs requires a lot of data movement per computation, which is energy-hungry. To reduce data movement and the associated energy, we present an SRAM-embedded convolution architecture (Fig. 31.1.1), which does not require reading the w_i's explicitly from the memory. Prior work on embedded ML classifiers has focused on 1b outputs [4] or a small number of output classes [5], neither of which is sufficient for CNNs. This work uses 7b inputs/outputs, which is sufficient to maintain good accuracy for most popular CNNs [1]. Because the w_i's are binary, the convolution operation is implemented as voltage averaging (Fig. 31.1.1): the averaging factor (1/N) implements the weight coefficient α, with a new scaling factor, M, implemented off-chip.
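The voltage-averaging formulation from the introduction can be illustrated with a small numerical sketch. It assumes that the off-chip factor is M = α·N, which the text does not state explicitly. The function names, the value of α and N = 64 are illustrative choices only.

```python
import random

def bwn_dot(x, w, alpha):
    """Reference binary-weight dot product: alpha * sum(w_i * x_i), w_i in {-1, +1}."""
    return alpha * sum(wi * xi for wi, xi in zip(w, x))

def averaged_dot(x, w, m):
    """Same quantity written as a signed average of the inputs, rescaled by M.

    Assumption: M = alpha * N, so that (1/N) * M reproduces alpha.
    """
    n = len(x)
    signed_avg = sum(xi if wi == 1 else -xi for wi, xi in zip(w, x)) / n
    return m * signed_avg

random.seed(0)
N = 64          # illustrative number of inputs averaged together
alpha = 0.37    # hypothetical per-filter scaling factor
x = [random.randint(-31, 31) for _ in range(N)]   # 7b signed inputs
w = [random.choice((-1, 1)) for _ in range(N)]    # binary weights

M = alpha * N   # assumed off-chip scaling factor
ref = bwn_dot(x, w, alpha)
avg = averaged_dot(x, w, M)
print(ref, avg)
```

Under this assumption the two expressions agree to within floating-point rounding. This is why only a signed average needs to be computed on the bit-lines.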
Figure 31.1.2 shows the overall architecture of the 256×64 conv-SRAM (CSRAM) array. It is divided into 16 local arrays, each with 16 rows, to reduce the area overhead of the ADCs and the local analog multiply-and-average (MAV_a) circuits. Each local array stores, in its 10T bit-cells, the binary weights (w_i's) of an individual 3D filter in a conv-layer (logic-0 for +1 and logic-1 for -1). Hence, each local array has a dedicated ADC to compute its partial convolution output (Y_OUT). The input-feature-map values (X_IN) are fed into column-wise DACs (GBL_DAC), which pre-charge the global read bit-lines (GRBL) and the local bit-lines (LBL) to an analog voltage (V_a) that is proportional to the digital X_IN code. The GRBLs are shared by all of the local arrays, since in CNNs each input is shared and processed in parallel by multiple filters.
Figure 31.1.3 shows the schematic of the proposed GBL_DAC circuit. It consists of a cascoded PMOS constant-current source. The GRBL is charged with this current for a duration t_ON, which is directly proportional to the X_IN code. For better t_ON vs. X_IN linearity, there should be only one ON pulse per code, which avoids multiple charging phases. Such a pulse cannot be generated from signals with binary-weighted pulse-widths. Hence, we propose an implementation where the 3 MSBs of X_IN select the ON pulse-width for the first half of charging (TD_56 is high) and the 3 LSBs select it for the second half (TD_56 is low). An 8:1 mux with 8 timing signals is shared between both phases to reduce the area overhead and the signal routing. This makes it possible to generate a single ON pulse for each X_IN code, as shown for codes 63 and 24 in Fig. 31.1.3. This DAC architecture has better mismatch and linearity than binary-weighted PMOS charging DACs [4], since the same PMOS stack is used to charge the GRBL for all input codes. Furthermore, the pulse-widths of the timing signals typically have less variation than that arising from PMOS V_t mismatch.
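The two-phase pulse selection can be sketched as a timeline model. The sketch assumes a 6b magnitude code and a unit time step equal to one LSB. It also assumes that the MSB-selected pulse ends exactly at the TD_56 transition, while the LSB-selected pulse starts there. None of these timing details are given in the text, and `binary_weighted_pulse` is one hypothetical arrangement included only for contrast.

```python
UNITS_PER_MSB = 8                 # one MSB step spans 8 LSB time units
FIRST_HALF = 7 * UNITS_PER_MSB    # assumed: 56 units while TD_56 is high
SECOND_HALF = 7                   # assumed: up to 7 units while TD_56 is low

def on_pulse(code):
    """ON/OFF timeline (one entry per unit time) for a 6b magnitude code."""
    if not 0 <= code <= 63:
        raise ValueError("code must be a 6b magnitude (0..63)")
    msb, lsb = code >> 3, code & 0b111
    timeline = [0] * (FIRST_HALF + SECOND_HALF)
    # First half: pulse of 8*msb units, assumed to end at the TD_56 transition.
    for t in range(FIRST_HALF - UNITS_PER_MSB * msb, FIRST_HALF):
        timeline[t] = 1
    # Second half: pulse of lsb units, assumed to start at the TD_56 transition.
    for t in range(FIRST_HALF, FIRST_HALF + lsb):
        timeline[t] = 1
    return timeline

def binary_weighted_pulse(code):
    """Contrast case: consecutive slots of width 32, 16, ..., 1, each ON if its bit is set."""
    timeline = []
    for bit in range(5, -1, -1):
        timeline.extend([(code >> bit) & 1] * (1 << bit))
    return timeline

def rising_edges(timeline):
    """Count OFF-to-ON transitions, i.e. the number of separate ON pulses."""
    edges, prev = 0, 0
    for level in timeline:
        if level and not prev:
            edges += 1
        prev = level
    return edges

for code in (63, 24, 40):
    print(code, sum(on_pulse(code)), rising_edges(on_pulse(code)),
          rising_edges(binary_weighted_pulse(code)))
```

In this model the total ON time equals the code, and the two sub-pulses always abut at the phase boundary. Every code therefore yields at most one charging pulse, whereas the binary-weighted arrangement splits a code such as 40 into two pulses.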
After the DAC pre-charge phase, the w_i's in a local array are evaluated locally by turning on an RWL, as shown in Fig. 31.1.4. One of the local bit-lines (LBLF or LBLT) is discharged to ground, depending on the stored w_i (0 or 1). This is done in parallel for all 16 local arrays. Next, the RWLs are turned off and the appropriate local bit-lines are shorted together horizontally to evaluate the average via the local MAV_a circuit. MAV_a passes the voltages of LBLT and LBLF to the positive (V_p-AVG) and negative (V_n-AVG) voltage rails, depending on the sign of the input X_IN (EN_P is ON for X_IN>0, EN_N is ON for X_IN<0). The difference between V_p-AVG and V_n-AVG is fed to a charge-sharing-based ADC (CSH_ADC) to obtain the digital value of the computation (Y_OUT). Algorithm simulations (Fig. 31.1.1) show that the distribution of Y_OUT peaks around 0 and is typically limited to ±7 for a full-scale input of ±31. Hence, a serial integrating ADC architecture is better suited than more area-intensive (e.g., SAR) or more power-hungry (e.g., flash) ADCs. A PMOS-input sense-amplifier (SA) is used to compare V_p-AVG and V_n-AVG, and its output is fed to the ADC logic. The first comparison determines the sign of Y_OUT, then capacitive [...]
To demonstrate the functionality for a real CNN architecture, the LeNet-5 CNN is run on the MNIST handwritten-digit recognition dataset. 100 test images are run through the 2 convolutional and 2 fully-connected layers (implemented by the CSRAM array). We achieve a classification error rate of 1% after the first 2 convolutional layers and 4% after all 4 layers, which demonstrates the ability of the CSRAM architecture to compute convolutions. The distributions of Y_OUT in Fig. 31.1.6 for the first 2 computation-intensive convolutional layers (C1, C3) show that both layers have a mean of ~1LSB, justifying the use of a serial ADC topology.
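The argument for a serial ADC can be illustrated with a generic counting model. This is not the capacitive charge-sharing circuit of the CSH_ADC. It assumes one comparison for the sign, followed by one comparison per 1-LSB step, with half-LSB decision thresholds and clipping at ±7. The sample outputs are hypothetical values chosen to be concentrated near 0.

```python
def serial_adc(v_diff, lsb=1.0, max_code=7):
    """Generic serial (counting) conversion of a differential voltage.

    Returns (signed code, number of comparisons). Assumed behavior, for
    illustration only: sign first, then step one LSB at a time.
    """
    sign = 1 if v_diff >= 0 else -1
    comparisons = 1                      # first comparison resolves the sign
    residue = abs(v_diff)
    code = 0
    while code < max_code:
        comparisons += 1                 # compare against the next threshold
        if residue < (code + 0.5) * lsb:
            break
        code += 1
    return sign * code, comparisons

# Hypothetical outputs (in LSBs) clustered around 0, as the text describes.
samples = [0, 1, -1, 0, 2, -1, 1, 0, -3, 1]
results = [serial_adc(float(y)) for y in samples]
avg_comparisons = sum(n for _, n in results) / len(results)
worst_case = serial_adc(7.0)[1]
print(avg_comparisons, worst_case)
```

In this model the number of comparisons grows with the output magnitude, so outputs near 0 convert in far fewer steps than the full-scale case.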
Figure 31.1.6 also shows the overall computational energy, annotated with its different components. Layers C1 and C3 consume 4.23pJ and 3.56pJ per convolution, computing 25 and 50 MAV operations per cycle, respectively. Layer C3 achieves the best energy efficiency, 28.1TOPS/W, compared to 11.8TOPS/W for layer C1, since C1 uses only 6 of the 16 local arrays. Compared to prior digital accelerator implementations for MNIST, we achieve a >16× improvement in energy efficiency and a >60× higher FOM (energy-efficiency × throughput/SRAM size), due to the massively parallel in-memory analog computations. This demonstrates that the proposed SRAM-embedded architecture is capable of highly energy-efficient convolution computations that could enable low-power, ubiquitous ML applications for a smart Internet-of-Everything.
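The quoted efficiency figures can be cross-checked from the energy and operation counts given above. The sketch assumes that each MAV operation is counted as 2 operations (one multiply and one average). The text does not state this convention, but it reproduces both reported numbers.

```python
OPS_PER_MAV = 2  # assumption: one multiply plus one average per MAV operation

def tops_per_watt(mav_ops, energy_pj):
    """Energy efficiency in TOPS/W; 1 op/pJ equals 1e12 ops/J, i.e. 1 TOPS/W."""
    return mav_ops * OPS_PER_MAV / energy_pj

c1 = tops_per_watt(mav_ops=25, energy_pj=4.23)
c3 = tops_per_watt(mav_ops=50, energy_pj=3.56)
print(f"C1: {c1:.1f} TOPS/W, C3: {c3:.1f} TOPS/W")
```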
