Abstract-This paper presents an energy-efficient static random access memory (SRAM) with embedded dot-product computation capability, for binary-weight convolutional neural networks. A 10T bit-cell-based SRAM array is used to store the 1-b filter weights. The array implements dot-product as a weighted average of the bitline voltages, which are proportional to the digital input values. Local integrating analog-to-digital converters compute the digital convolution outputs, corresponding to each filter. We have successfully demonstrated functionality (>98% accuracy) with the 10 000 test images in the MNIST hand-written digit recognition data set, using 6-b inputs/outputs. Compared to conventional full-digital implementations using small bitwidths, we achieve similar or better energy efficiency, by reducing data transfer, due to the highly parallel in-memory analog computations.
Fig. 1. Basics of a typical CNN for a classification problem, showing the structure for the CONV and FC layers [4], [5].
traffic to the "cloud," by only sending the critical/relevant information and filtering out the rest of the massive amount of data the edge devices may collect. Furthermore, "edge computing" helps in improving the security of the data by keeping it local (within the devices), rather than having to send sensitive information to the "cloud." While "edge computing" promises significant benefits for IoT devices, it also has certain requirements. The circuits to run the compute algorithms must be very energy efficient to extend the battery life of these IoT devices, most of which have a very limited energy budget and would be "always-ON." In addition, in many applications, the local decision-making has to be done in real time (e.g., self-driving cars), to make them practical.
Convolutional neural networks (CNNs) provide state-of-the-art results in a wide variety of AI/ML applications, ranging from image classification [3] to speech recognition [1]. However, they are highly computation-intensive and require huge amounts of storage. Hence, they consume a lot of energy when implemented in hardware and are not suitable for energy-constrained applications, e.g., "edge computing."
0018-9200 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

CNNs typically consist of a cascade of convolutional (CONV) and fully connected (FC) layers (Fig. 1) [4], [5], with some non-linear layers in between (not shown in the figure). The CONV layers extract different features of the input and the FC layers combine these features to finally assign the input to one of the many pre-determined output classes. For each of the CONV/FC layers, there is a set of 3-D filters (W k ), which are applied on the 3-D input feature map (IFMP) to that layer and generate its 3-D output feature map (OFMP). Each 3-D filter/input consists of multiple 2-D arrays, each of which corresponds to a different "channel" (1 to C). When a 3-D filter (W k ) is applied on the input (X), an elementwise multiplication is performed, followed by an addition of the partial products to compute the convolution output (Y k ). For CONV layers, the 3-D filter is applied to the shifted input to compute the next element in the 2-D OFMP. Each individual filter corresponds to a different channel in the 3-D OFMP. Therefore, the fundamental operation for both the CONV and FC layers can be simplified to a dot product or a multiply-and-accumulate (MAC) operation, as shown in the following equation:
$$Y_k(x, y) = \sum_{c=1}^{C} \sum_{i=0}^{R-1} \sum_{j=0}^{R-1} W_k(i, j, c)\, X(xS + i,\ yS + j,\ c), \qquad 0 \le x, y < E,\quad 1 \le k \le M,\quad E = \frac{H - R + S}{S} \tag{1}$$
where H is the width/height of the IFMP (with padding), E is the OFMP width/height (for a stride S), R is the filter width/height, C is the number of IFMP/filter channels, and M is the number of filters/OFMP channels for a given CONV/FC layer. The width and height of the feature maps/filters are assumed to be the same for simplicity, and also, because it is very common in most of the popular CNNs. In general, CNNs use real-valued inputs and weights. However, in order to reduce their storage and compute complexity recent works have strived toward using small bit widths to represent the input/filter weight values. Rastegari et al. [6] proposed a binary-weight-network (BWN), where the filter weights (w i s) can be trained to be +1/−1 (with a common scaling factor per 3-D filter: α). This leads to a significant reduction in the amount of storage required for w i s, making it possible to store them entirely on-chip. BWNs also simplify the MAC operation to an add/subtract operation, since α is common for a given 3-D filter and it can be incorporated after finishing the entire convolution computation for that filter. As shown in [6] , this algorithm does not compromise much on the original classification accuracy of the CNN, obtained using full precision weights. BWN performs better than binary connect [7] , which does not incorporate the scaling factor of α per filter, and also binarized neural networks [8] , where both weights and activations are constrained to ±1.
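The simplification that a BWN brings to the MAC operation can be illustrated with a minimal Python sketch. The function names (`bwn_dot`, `full_precision_dot`) and the example values are hypothetical and are not taken from [6]; the sketch only assumes what the paragraph above states, namely that the weights are +1/−1 and that a single scaling factor α is applied once per filter after the add/subtract accumulation.

```python
def bwn_dot(x, w_sign, alpha):
    """Binary-weight dot product: add/subtract first, scale by alpha once.

    x      -- list of input values
    w_sign -- list of binary weights, each +1 or -1
    alpha  -- common scaling factor for the whole filter (assumed given)
    """
    if len(x) != len(w_sign):
        raise ValueError("input and weight vectors must have equal length")
    acc = 0
    for xi, wi in zip(x, w_sign):
        if wi not in (1, -1):
            raise ValueError("binary weights must be +1 or -1")
        # No multiplier is needed: the weight only selects add or subtract.
        acc += xi if wi == 1 else -xi
    # The scaling factor is folded in after the whole accumulation.
    return alpha * acc


def full_precision_dot(x, w):
    """Reference multiply-and-accumulate with real-valued weights."""
    return sum(xi * wi for xi, wi in zip(x, w))


# Illustrative values only.
x = [3, -1, 4, 2]
w_sign = [1, -1, -1, 1]
alpha = 0.5
y_bwn = bwn_dot(x, w_sign, alpha)
y_ref = full_precision_dot(x, [alpha * w for w in w_sign])
```

Both paths give the same result for weights of the form α·(±1), while the binary path needs only one multiplication per filter.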
In the conventional all-digital implementation of CNNs [4], [5], [9], with the memory and the processing elements being physically separate, reading the w i s and the partial sums from the on-chip SRAMs leads to a lot of data movement per computation [10] and, hence, makes these implementations energy-hungry. This is because, in modern CMOS processes, the energy required to access data from memory can be much higher than the energy needed for a compute operation with that data [11]. To address this problem, we present an SRAM-embedded convolution computation architecture [12], conceptually shown in Fig. 2. Embedding computation inside memory has two significant benefits. First, data transfer to/from the memory is greatly reduced, since the filter weights are not explicitly read and only the computed output is sent outside the memory. Second, we can take advantage of the massively parallel nature of CNNs to access multiple memory addresses simultaneously. This is because we are only interested in the result of the computation using the memory data and not the individual stored bits. Therefore, a much higher memory bandwidth can be achieved with this approach, overcoming some of the major limitations posed by the conventional "von-Neumann bottleneck."

This paper is organized as follows. Section II explains the concept of memory-embedded convolution computation as voltage averaging in SRAMs. Section III presents the overall architecture. Section IV highlights the key contributions of this paper, compared to prior in-memory computing approaches. Section V discusses the details of the different circuitry involved in the embedded convolution computation. Section VI presents the measurement results. Finally, concluding remarks are discussed in Section VII.
II. CONCEPT OF SRAM-EMBEDDED COMPUTATION
The basic operation involved in evaluating convolutions (Y ) for CNNs is the dot product of the 3-D IFMP (X) and the filter weights (W ), as shown in (1) . It can be rewritten by flattening the 3-D tensor into a 1-D vector to obtain the following equation, where the 2-D subscripts (x, y) have been omitted for simplicity:
$$Y_k = \sum_{i=1}^{R \times R \times C} W_{k,i}\, X_i \tag{2}$$
Equation (2) can be further simplified for the case of binary filter weights (w i s) to get (3a), where α k is the common coefficient for the kth filter. If α k is expressed as a ratio of two integers (M k , N: number of elements added per dot product in 1 clock cycle), then we get (3b). Now, if we separate out the scaling factor of M k (which can be incorporated after computing the entire dot product), we get the expression for the effective convolution output (Y OUT ) as
$$Y_{OUT} = \frac{1}{N} \sum_{i=1}^{N} w_i\, X_{IN,i} \tag{4}$$
where X IN is the effective convolution input, i.e., a scaled version of the original input X, limited to 7-b (including the 1-b sign).
For energy-efficient computation with multi-bit values inside the memory, (4) has to be implemented in the analog domain, as shown in the following equation:
$$V_{Y\_AVG} = \frac{1}{N} \sum_{i=1}^{N} w_i\, V_{a,i} \tag{5}$$
The equivalence of (4) and (5), conceptually shown in Fig. 3, becomes apparent in three key steps. First, the digital inputs (X IN s) are converted into analog voltages (V a s) using digital-to-analog converters (DACs). Then, the analog voltages are multiplied by the corresponding 1-bit filter weights (w i s), which are stored in a memory array. This is followed by averaging over N terms to get the analog-averaged convolution output voltage (V Y_AVG ). These constitute the second step: multiply-and-average (MAV). Finally, in the last step, the analog-averaged voltage is converted back into the digital domain (Y OUT ) using an analog-to-digital converter (ADC), for further processing. It may be noted that if the 3-D filter size (R × R × C) is greater than N, the above-mentioned three-step process is repeated multiple (N r ) times using R × R × C′ (≤N) elements in each cycle, where N r = C/C′. The partial outputs (from the ADC) can then be further added digitally (outside the memory) to get the final convolution output.

III. OVERALL ARCHITECTURE

Fig. 4 shows the overall architecture of the 16-Kb CONV-SRAM (CSRAM) array, consisting of 256 rows × 64 columns of SRAM bit-cells. It is divided into 16 local arrays, each with 16 rows. Each local array is meant to store the binary filter weights (w i s) for a different 3-D filter in a CONV/FC layer. w i is stored in a 10T SRAM bit-cell as either a digital "0" or a digital "1," depending on whether its value is +1 or −1, respectively. The 10T bit-cell consists of a regular 6T bit-cell and two decoupled read-ports. Each local array has its analog averaging circuits (MAV a s) and a dedicated ADC to compute the partial convolution outputs (Y OUT s). Sharing these circuits for 16 rows in a local array reduces the area overhead. The IFMP values (X IN s) are fed into columnwise DACs, which convert the digital X IN codes to analog input voltages on the global read bitlines (GRBLs). The GRBLs are shared by all the local arrays, implementing the fact that in CNNs each input is shared/processed in parallel by multiple filters. With this architecture, the 16-Kb CSRAM array can process a maximum of 64 convolution inputs and compute 16 convolution outputs in parallel.

Fig. 5(a) shows the simulated test error-rates for the MNIST data set with the LeNet-5 CNN, consisting of two CONV layers (C1, C3) and two FC layers (F5, F6). The number of bits to represent the IFMP/OFMP values is varied from 8 to 4. Lower bitwidth helps in reducing the area/power costs of the DAC and ADC circuits involved for the convolution computations. However, as shown in Fig. 5(a), the error rate starts to increase steeply for <7-b. Hence, 7-b is chosen as the target bitwidth for the DAC/ADC circuits. With 7-b (including the sign bit), the voltage resolution needed on a 1-V scale is 1 LSB = 1/2^6 V ≈ 15.6 mV. Next, the effect of the averaging factor ("N") on the test error-rate is observed. A high value of "N" would decrease the area/power overhead of the ADC by amortizing it over more MAV operations per clock cycle. However, higher "N" can also degrade the computation accuracy due to increased quantization by averaging. This is more critical for CNN layers with smaller filter sizes. As shown in Fig. 5(b), for layer F6, with a 3-D filter size of 120, the error-rate steeply increases as "N" is varied from 15 to 120. For the other three layers of LeNet-5, the 2-D filter size is 5 × 5. Hence, a minimum N = 25 is required to fit at least one full filter channel per CSRAM row. We chose N = 64 to fit two channels for 5 × 5 filters, without sacrificing much on the error rate.
The number of rows (N rows ) per local array in the CSRAM determines the unit capacitance (C LBL ), which is used for all the analog operations required for the in-memory convolution computation. For every column in a local array, there is a corresponding MAV a circuit. Hence, a higher value of N rows would decrease the area overhead of MAV a , by amortizing it over multiple rows. It also reduces the variation of the C LBL value, which helps in improving the accuracy of the computations. However, a high value of N rows means a high C LBL , which translates to increased energy costs. It would also lead to less throughput for a given SRAM size, since fewer outputs would be computed per cycle. Therefore, N rows = 16 is chosen as a tradeoff. It may be noted that with N rows = 16, the thermal noise (kT/C) is <1 mV, which is well below 1 LSB = 15.6 mV.
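The sizing figures quoted in this section follow from simple arithmetic, which the short Python sketch below reproduces. The constant and function names are hypothetical; the numerical values (256 × 64 bit-cells, 16 local arrays, 7-b signed values on a 1-V scale, 5 × 5 filters) are the ones stated in the text.

```python
ROWS, COLS = 256, 64      # CSRAM array dimensions from the text
LOCAL_ARRAYS = 16         # one local array per 3-D filter
SIGNED_BITS = 7           # target bitwidth, including the sign bit
V_FULL_SCALE = 1.0        # analog full scale in volts


def lsb_volts(signed_bits, v_full_scale):
    """Voltage of 1 LSB when one bit is spent on the sign."""
    magnitude_bits = signed_bits - 1
    return v_full_scale / (2 ** magnitude_bits)


def channels_per_row(n_columns, filter_side):
    """Number of complete filter_side x filter_side channels that fit in one row."""
    return n_columns // (filter_side * filter_side)


capacity_bits = ROWS * COLS                   # total storage of the array
rows_per_local_array = ROWS // LOCAL_ARRAYS   # N_rows
lsb_mv = 1e3 * lsb_volts(SIGNED_BITS, V_FULL_SCALE)
channels_at_n64 = channels_per_row(COLS, 5)   # 5 x 5 filters, N = 64
```

With these values the array holds 16 Kb, each local array has 16 rows, 1 LSB is about 15.6 mV, and two 5 × 5 channels fit in one 64-column row.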
IV. KEY CONTRIBUTIONS OF THIS PAPER
While there are a few different approaches [13]-[17] for in/near-memory computing, the proposed architecture has some key contributions, which provide significant benefits over prior work. The first key feature of our approach is the robustness to SRAM bit-cell V t variations. SRAM bit-cells use near-minimum transistor sizes available in a given CMOS process and, hence, suffer from transistor mismatch and variation. For example, if we consider the discharge current (I cell ) through an SRAM bit-cell (shown in Fig. 6), we can observe that it has a significant spread from its mean value (σ ≈ 30% of μ). Now, when I cell is used to modulate the analog voltage (V a ) on the bitline [13]-[15], [17], there is a wide variation in the V a value and it cannot be controlled very well. This compromises the computation accuracy and extra algorithmic techniques might be required to compensate for that. Zhang et al. [13] use the "AdaBoost" technique, in which the results of many weak classifiers are combined to get a more accurate final result. However, this would lead to an increase in the number of computations and the energy required. Gonugondla et al. [15] proposed an on-chip training to compensate for chip-to-chip variations. However, this would incur the energy and timing penalty required to re-train the network corresponding to every single chip. In our approach (Fig. 6), the analog voltage (V a ) is directly sent to the bitlines using global DACs at the periphery. Since the global DACs can be upsized, with their area being amortized over multiple rows (256 in this case), the variation due to them is significantly less compared to that of the bit-cell. Furthermore, the SRAM bit-cell is only used to multiply V a by the 1-b filter weight (w i ) stored in it, using full signal swing locally. That is, the purpose of the SRAM bit-cell is to discharge one of its local bitlines to 0; it is not used to control V a . Hence, given enough time for the worst case bit-cell discharge, the computation accuracy does not suffer from local bit-cell V t variations.
The second key feature of our approach is the improvement of the dynamic voltage range for the analog computations without disturbing any bit-cell. In the conventional approach (with 6T SRAM bit-cells) [13], [14], [17], where multiple word-lines (WLs) are activated for the same bitline, there might be a situation where one of the accessed bit-cells in that column is in pseudo-write mode (Fig. 7). This is because multiple activated bit-cells in that column can discharge a bitline to a very low voltage, which could overwrite the data stored (Q k = "1") in the disturbed bit-cell. Hence, the bitline voltage range has to be limited to prevent any write disturb. In our approach, 10T bit-cells are used which de-couple the read and the write ports, to prevent any write disturb. Furthermore, each bit-cell is read independently in parallel without sharing any bitlines. Hence, the discharge on one bitline cannot affect another accessed bit-cell. Thus, we can utilize a wide voltage range (close to full rail) for the analog computations, without disturbing any bit-cell. It may be noted that, although a 10T bit-cell has more transistors than a 6T, it can be designed using smaller sized devices, compared to a 6T bit-cell. This is because, unlike a 6T, a 10T bit-cell does not have conflicting sizing requirements to achieve high margins for both read and write operations. In addition, due to the limitations of 6T bit-cells for in-memory analog computations, network augmentation, i.e., larger sized neural networks, might be required to compensate for lower computation accuracy. Larger neural networks translate to increased storage requirement for filter-weights on-chip and, hence, increased SRAM size. Therefore, overall, our 10T bit-cell-based in-memory architecture does not necessarily occupy more area than a 6T-based design.
The third key feature of our work, which distinguishes it from other "in-memory" computing approaches [14], [16], is the use of the inherent bitline capacitances in the SRAM array to implement the computations. This precludes the need for extra area-intensive capacitors, which would be otherwise required at the SRAM periphery [14] to implement some of the analog computations.
Finally, this paper supports multi-bit resolution for the inputs and outputs of the dot products, compared to [13] (output: 1-b) and [16] and [17] (both input-output: 1-b). This helps in achieving higher classification accuracy for a neural network of a given size.
All the key features, described earlier, make our proposed architecture scalable, i.e., multiple CSRAM arrays can operate in parallel to run larger neural networks.
V. CIRCUITS FOR THE THREE-PHASE CONV-SRAM OPERATION

A. Phase-1: DAC
During the first phase of the CSRAM operation, the digital convolution input (X IN ) is converted into an analog voltage (V a ) using a columnwise DAC (GBL_DAC). The analog voltage is used to pre-charge the global read bitline (GRBL) and the local bitlines (not shown in Fig. 4) in the SRAM array. Each GRBL is shared by all the 16 local arrays, and hence, they get the same value of the analog pre-charge voltage. This implements the fact that in a given CNN layer (CONV/FC) each input is processed simultaneously by multiple filters. Furthermore, all the 64 columnwise GBL_DACs operate in parallel and can send a maximum of 64 analog inputs to the CSRAM array in one clock cycle. Fig. 8 shows the schematic of the proposed GBL_DAC circuit. It consists of a cascode pMOS stack biased in the saturation region to act as a constant current source. The GRBL is charged with this fixed current for a time t ON , which is determined by the ON pulsewidth. t ON is modulated based on the digital input code (X IN [5 : 0]), using a digital-to-time converter. To achieve a very good linearity of V a versus X IN or t ON versus X IN , there should be a single continuous ON pulse for every input code, to avoid non-linearities due to multiple charging phases. Such a pulse cannot be generated by simply using six timing signals with binary-weighted pulsewidths. It could be generated using 2^6 = 64 timing signals and a 64:1 mux, but that would consume a lot of area, which is not ideal for a circuit that needs to be replicated for each column of the SRAM array. To address this issue, we present a two-phase architecture in which the three MSBs of X IN are used to select the ON pulsewidth for the first half of charging and the three LSBs for the second half. A control signal (TD 56 ) is used to choose between the two phases. In this way, an 8-to-1 mux, with eight timing signals, can be shared during both the phases, to reduce the area overhead and the number of timing signals to route. A tree-based architecture, using 2:1 unit muxes, is used for the 8:1 mux to equalize the mux delay for different control bits.
To design the pulsewidths of the eight timing signals, we need to express X IN in terms of its two components
$$X_{IN} = 8\,k_A + k_B \tag{6}$$
where k A and k B are the decimal values for the three MSBs and the three LSBs of X IN , respectively. Since k A and k B can have any integer values from 0 to 7, the pulsewidths of the timing signals (TDs) are chosen as
where t 0 is the minimum time resolution. A delay-line architecture, with a controllable unit delay of t 0 , is used to generate 64 time-delayed signals from the input clock. Then, the appropriate signals are combined using NOR gates to generate the TDs. This is done at the global level and the generated TDs are buffered and routed to all the GBL_DACs. A higher value of t 0 reduces the non-linearities from the timing generation circuitry, at the cost of increased clock cycle time.
To understand how the two-phase charging technique works, let us consider two X IN values of 24 and 63, as shown in Fig. 8 . For X IN = 24 = 8 × 3 + 0, k A is 3 and k B is 0. Hence, TD 9×3 or TD 27 is used in phase A and TD 0 is used in phase B, to select the pulsewidth of the ON timing signal. Similarly, for the code X IN = 63 = 8 × 7 + 7, both k A and k B are 7, and hence, TD 63 is used in both the charging phases.
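The code-to-timing mapping in the example above can be sketched in a few lines of Python. The function names are hypothetical. The split into k A and k B and the selection of TD 9k follow the worked example in the text; the per-phase ON times in `total_on_time` (8·k A·t 0 in phase A and k B·t 0 in phase B) are an assumption made here so that the total charging time is proportional to X IN , and are not taken from the source.

```python
def split_code(x_in):
    """Split a 6-b magnitude code into (k_a, k_b) = (three MSBs, three LSBs)."""
    if not 0 <= x_in <= 63:
        raise ValueError("x_in must be a 6-b magnitude code (0..63)")
    return x_in >> 3, x_in & 0b111


def timing_signal_index(k):
    """Index of the timing signal selected by a 3-b value k, i.e. TD_{9k}."""
    if not 0 <= k <= 7:
        raise ValueError("k must be a 3-b value (0..7)")
    return 9 * k


def total_on_time(x_in, t0=1.0):
    """Total charging time over both phases, in units of t0.

    Assumption (not from the source): phase A contributes 8*k_a*t0 and
    phase B contributes k_b*t0, so the sum equals x_in*t0.
    """
    k_a, k_b = split_code(x_in)
    return (8 * k_a + k_b) * t0


# The two codes discussed in the text.
k_a_24, k_b_24 = split_code(24)
k_a_63, k_b_63 = split_code(63)
```

Under this assumption the total ON time is linear in the input code, which is the property the two-phase scheme is designed to preserve.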
In addition to the linearity aspect of the DAC transfer function, this architecture also performs better in terms of device mismatch, compared to binary-weighted pMOS charging DACs [13] . This is because, here, the same pMOS stack is used to charge the global bitline for all input codes, rather than having to use smaller pMOS devices for small input values. Furthermore, the pulsewidths of the globally generated timing signals have less variations typically, compared to those arising from local V t mismatch in the pMOS devices [13] .
It may be noted that a one-time calibration is required to set the maximum value of the analog pre-charge voltage for the maximum input code (X IN,max ). The maximum precharge voltage should be kept lower than the supply voltage of the GBL_DAC, to ensure that the pMOS cascode stack is operating in the saturation region as a constant current source. For a given t 0 , the calibration can be achieved by tuning the externally provided bias voltage (V biasp ) of the pMOS stack. During calibration, all DACs are fed the same input value of X IN,max . In a given clock cycle, first, the GBL_DAC precharges the GRBL to an analog voltage (V a ). Then, V a is compared to an externally provided reference voltage V ref (typically kept at 1 V in this paper). The comparison is done by the columnwise sense amplifiers (SAs), which are already present for normal readout of the SRAM. All the 64 SAs operate in parallel and use the same V ref to provide 64 comparison outputs simultaneously. V biasp is monotonically increased from 0 V until majority of the SAs (>50%) flip their outputs ("1" to "0"), at which point the calibration is achieved. In this paper, a 5-mV step size is used to tune V biasp .
B. Phase-2: Multiply-and-Average
The second phase of the CSRAM operation involves the multiplication of the analog input voltages (V a s) with the 1-b filter weights (w i s) and averaging over N values. This MAV operation is done in parallel for all the 16 local arrays, each storing the w i s for a different 3-D filter when running a CONV/FC layer. Fig. 9 shows the details for the MAV operation for one local array. It starts by turning on the read WL (RWL) for the selected row in the local array. This leads to discharging of one of the local bitlines (LBLT, LBLF) in each column, depending on w i stored in the corresponding 10T bit-cell. A positive w i (+1) is stored as a digital "0" and a negative w i (−1) as a digital "1." It may be noted that the local bitlines have been pre-charged to the same analog voltage (V a,i ) as their corresponding global bitline (GRBL) during phase-1. Therefore, at the end of weight evaluation, the difference between the local bitline voltages represents the product of the analog voltage (V a,i ) and the 1-b weight (w i ). For example, the bit-cell in the "0th" column stores a −1, and hence,
The weight multiplication/evaluation step is completed by turning off the RWL. After that, the appropriate local bitlines are shorted together horizontally to evaluate the average. The positive and negative parts of the average are obtained on two separate voltage rails, V p AVG and V n AVG , respectively. This is implemented by the local MAV a circuits, which pass the voltages of the LBLTs and LBLFs to either V p AVG or V n AVG voltage rails, depending on the sign of the input X IN . If the input for the particular column is positive (X IN > 0), EN P is turned on; otherwise, EN N is on. EN P and EN N are digital control signals which are globally routed and shared columnwise by all the 16 local arrays. The switches controlled by EN P and EN N are implemented using nMOS pass transistors, since the final V p AVG , V n AVG voltages would be closer to 0 V than V ref . On the other hand, the switches controlled by PCH G use CMOS transmission gates, since they need to pass a wide range of voltages from 0 V to V ref (∼1 V), during phase-1 (DAC pre-charge).
The fully differential nature of the averaging architecture helps in mitigating many common-mode noise issues, e.g., clock coupling noise from the control switches, capacitance variation of the local bitlines and the voltage rails due to different process corners and so on. This helps in improving the accuracy of the dot-product computations with our approach.
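A behavioral view of the MAV phase can be sketched as follows. This is an idealized model under stated assumptions, not the circuit itself: all local bitlines are taken to have equal capacitance, switches are ideal, and a column whose product w i × X IN,i is positive is assumed to contribute its pre-charge voltage to the positive rail and 0 V to the negative rail (and vice versa). The names `multiply_and_average` and `v_lsb` are hypothetical.

```python
def multiply_and_average(x_in, weights, v_lsb=1.0 / 64):
    """Idealized multiply-and-average for one row of a local array.

    x_in    -- signed integer inputs, one per column
    weights -- binary weights (+1 or -1), one per column
    v_lsb   -- DAC voltage per input LSB (assumed 1/64 V here)
    Returns (v_p_avg, v_n_avg), the voltages on the two averaging rails.
    """
    if len(x_in) != len(weights):
        raise ValueError("one weight per input column is required")
    n = len(x_in)
    v_p_sum = 0.0
    v_n_sum = 0.0
    for x, w in zip(x_in, weights):
        v_a = abs(x) * v_lsb            # pre-charge voltage from the DAC
        sign_x = 1 if x > 0 else -1     # zero inputs use the negative path
        if sign_x * w > 0:
            v_p_sum += v_a              # column adds V_a to the positive rail
        else:
            v_n_sum += v_a              # column adds V_a to the negative rail
    # Charge sharing over n equal capacitances averages each rail.
    return v_p_sum / n, v_n_sum / n


# Illustrative values only.
x = [4, -2, 0, 6]
w = [1, 1, -1, -1]
v_p, v_n = multiply_and_average(x, w)
v_diff = v_p - v_n
```

In this model the rail difference equals (1/N) Σ w i × X IN,i scaled by the LSB voltage, which is the analog form of the dot product in (5).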
It may be also noted that during this phase, when the SRAM bit-cell is actually used for weight evaluation, the time required does not have a huge variation. Fig. 10 shows the simulated local bitline discharge time (t dis,LBL ) in the slowest process corner (SS). As seen from the figure, even the 6σ value of t dis,LBL is merely 500 ps, which is much smaller than the total clock period (≈100 ns). This shows that bit-cell V t variations do not dominate the overall computation time. The longer clock period is justified due to the highly parallel processing in the compute mode.
C. Phase-3: ADC
The third and last phase of the CSRAM operation is the analog-to-digital conversion of the dot-product outputs, with multi-bit resolution. The difference of the analog average voltages (V p AVG and V n AVG ) is fed to an ADC to get the digital value of the computation (Y OUT ). This is done in parallel for all the 16 local arrays, producing outputs corresponding to 16 different filters simultaneously.
Choosing the ADC architecture is crucial since it would be replicated multiple times in the CSRAM array. Hence, area and power consumption are key metrics to consider. In addition, the typical distribution of the ADC outputs (Y OUT s) should also be considered to find the more appropriate architecture.
As seen from simulation results in Fig. 11 , for a typical CONV layer with a full-scale input range of ±31, Y OUT has an absolute mean value of ±1.3 and is typically limited to ±7. Hence, a serial integrating ADC architecture is more suitable in this scenario, compared to other area intensive (e.g., SAR) and more power-hungry ones (e.g., flash). In spite of its serial nature, in most cases, we can expect the ADC to finish its operation within a few cycles, due to the particular Y OUT distribution. Fig. 12 shows the architecture of the proposed integrating ADC (charge-sharing-based ADC [CSH_ADC]). It consists of three main parts: a CSH-based integrator, an SA, and a logic block. Capacitive CSH with replica bitlines is used to implement the integration. The use of replica bitlines helps to track the local bitline capacitance better in the presence of process and temperature variations. The SA has a standard StrongARM latch-type architecture. pMOS devices are chosen for the input differential pair of the SA, since the common mode voltages of V p AVG and V n AVG signals are expected to be closer to the GND rail. The logic block provides the timing signals for the CSH (PCH R , EQ P , EQ N ) and the SA comparison (SA_EN), using the globally provided timing signals (φ 1 , φ 2 ). It also has a counter to count the number of cycles it takes to finish the ADC operation and that provides the digital output of the dot-product computation. Fig. 12 also shows the waveforms for a typical CSH_ADC operation. It starts by sending a SA_EN pulse from the ADC logic block to the SA. The SA compares V p AVG and V n AVG and sends its outputs (SAO P , SAO N ) to the ADC logic block. The first comparison determines the sign of the output, e.g., for the case shown in Fig. 12 , Y OUT is positive since V p AVG is higher than V n AVG . 
After the first comparison, the lower of the two voltage rails (V n AVG ) is integrated by charge-sharing it with a reference local bitline (BLN ref ), using the equalize signal (EQ N in this case). The reference bitline, which replicates the local bitline capacitance, was pre-charged during the SA comparison using the PCH R signal to V ref (=1 V in this paper). Therefore, the step size of the integration is ≈ (V ref /N), where N is the number of SRAM local columns that were averaged. The pre-charge and equalize/integrate operations, along with the SA comparison, continue until the lower voltage rail (V n AVG ) exceeds the higher one (V p AVG ). When this happens, the SA outputs flip, indicating the end-of-conversion (EOC). After this, no more timing pulses are generated. A counter in the ADC logic block counts the number of equalize pulses (EQ N ) it takes to reach EOC and that generates the digital value of the convolution/dot-product output (Y OUT ), which is +4 for the example shown in Fig. 12.
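The conversion sequence described above can be approximated with a small behavioral model. The sketch below is a simplification under stated assumptions: each equalize pulse is modeled as a constant voltage step (the text gives the step only as approximately V ref /N), the SA is ideal with no offset, a tie between the rails is treated as end-of-conversion, and `max_count` is a hypothetical safety limit. The function name `csh_adc` is likewise hypothetical.

```python
def csh_adc(v_p_avg, v_n_avg, v_step, max_count=63):
    """Idealized integrating ADC: count steps until the lower rail catches up.

    v_p_avg, v_n_avg -- averaged rail voltages from the MAV phase
    v_step           -- voltage added to the lower rail per equalize pulse
    max_count        -- assumed upper bound on the counter
    Returns the signed digital output.
    """
    if v_step <= 0:
        raise ValueError("v_step must be positive")
    # The first comparison determines the sign of the output.
    sign = 1 if v_p_avg >= v_n_avg else -1
    high = max(v_p_avg, v_n_avg)
    low = min(v_p_avg, v_n_avg)
    count = 0
    # Integrate the lower rail until it reaches or exceeds the higher one.
    while low < high and count < max_count:
        low += v_step
        count += 1
    return sign * count


# Illustrative values only: a difference of 3.5 steps resolves to a count of 4.
step = 1.0 / 64
y_pos = csh_adc(0.5, 0.5 - 3.5 * step, step)
y_neg = csh_adc(0.5 - 3.5 * step, 0.5, step)
```

Because the conversion time grows with the magnitude of the output, small outputs finish in only a few cycles, which is why the serial topology suits the Y OUT distribution reported in the text.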
It may be noted that Y OUT is directly affected by the SA offset voltage (V OS ), which can degrade the overall computation accuracy due to incorrect ADC outputs. To address this issue, we propose a simple two-cycle offset-cancellation (OC) technique, using a flipping mux at the input of the SA (Fig. 13). During the first/even cycle of this two-cycle period, FLIP = "0." Hence, V p AVG and V n AVG are passed to the positive and negative input terminals of the SA, respectively. Therefore, Y OUT,0 = ADC(V y AVG,0 − V OS ). The output in this cycle is exactly the same as in the conventional
On the other hand, for the conventional case, the effect of V OS adds up since
and this makes the accumulation result further inaccurate. It may be noted that the benefits of this OC technique come without any extra timing and power penalty, as long as an even number of cycles are required to finish a full convolution computation. This can be easily expected for most CNNs.
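The intent of the two-cycle OC scheme can be illustrated with a simplified numerical sketch. Several points are assumptions made here rather than details taken from the source: the ADC is treated as ideal and unquantized, the offset is a constant V OS subtracted at the SA input, and in the flipped cycle the rails are swapped at the SA input and the resulting output is negated digitally. The function name `accumulate` is hypothetical.

```python
def accumulate(v_diffs, v_os, use_flip):
    """Sum per-cycle outputs with or without input flipping on odd cycles.

    v_diffs  -- true rail differences (V_p_AVG - V_n_AVG), one per cycle
    v_os     -- sense-amplifier offset voltage (assumed constant)
    use_flip -- True to model the offset-cancellation scheme
    """
    total = 0.0
    for cycle, v in enumerate(v_diffs):
        flipped = use_flip and (cycle % 2 == 1)
        if flipped:
            # Inputs are swapped, so the SA sees -v - v_os; the digital
            # output is negated again (assumption), giving v + v_os.
            total += -(-v - v_os)
        else:
            # Conventional cycle: the offset subtracts from the result.
            total += v - v_os
    return total


# Illustrative values only.
diffs = [0.25, 0.5]
offset = 0.125
with_oc = accumulate(diffs, offset, use_flip=True)
without_oc = accumulate(diffs, offset, use_flip=False)
```

In this model the offset contributions of an even/odd cycle pair cancel in the accumulated sum, whereas without flipping they add up to 2·V OS per pair.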
VI. MEASURED RESULTS
The 16-Kb CSRAM array was implemented in a 65-nm low-power (LP) CMOS process. The die photograph in Fig. 14 shows the relative area occupied by the different key blocks. The bit-cell array (ARY) along with its peripheral circuitry occupies 73.1% of the total CSRAM area, 8.2% is occupied by the GBL_DACs, 8.6% by the local MAV a circuits, 7.3% by the CSH_ADCs and the rest by global timing circuits. The test-chip summary is shown in Table I.

A. Circuit Characterizations

... lower than transistor V t variation. It can also be seen in Fig. 16 that the energy/ADC scales linearly with the input-output value, which is expected for the integrating ADC topology.

The effect of the OC technique for the SA (in the CSH_ADC) is also characterized, as shown in Fig. 17 for two different input codes. It can be clearly seen that the OC helps in reducing the variation of the Y OUT values, leading to a better computation accuracy for the dot products/convolutions.
B. Test Case: MNIST Data Set
To demonstrate the functionality for a real CNN architecture, the MNIST hand-written digit recognition data set is used with the LeNet-5 CNN [18]. As shown in Fig. 18, LeNet-5 consists of two CONV layers (C1, C3) and two FC layers (F5, F6). In addition, there are two sub-sampling or max-pooling layers (S2 and S4, following layers C1 and C3, respectively) and a non-linear ReLU layer (R5 after layer F5). Only the CONV/FC layers, which involve the majority of the computations, are implemented on-chip by the CSRAM array. The non-linear layers are implemented in software. Fig. 19 shows the test setup used to automatically run the four CONV/FC layers, one after the other, on the testchip. Data are transferred back and forth between MATLAB (running on a host PC) and the testchip, via a field-programmable gate array board. Table II shows ... For layer F5, the entire filter cannot fit at once in the CSRAM array (due to its limited 16-Kb size in the testchip). Hence, the entire process, explained earlier, is repeated multiple times to finish all the computations. However, having multiple CSRAM arrays operating in parallel can easily alleviate this problem, by fitting all the filter weights together on-chip.

Fig. 20 shows the measured error rate for the 10 000 test images in the MNIST data set, with the four CONV/FC layers being successively implemented on-chip. Three different chips are measured, each experiment is repeated multiple times, and the average value of the error rate is reported. We tested two different versions of LeNet-5: with and without batch-normalization (BN) layers preceding the CONV/FC layers. Without BN layers ("v1"), we achieve a classification error rate of 2.5% after all the four layers. The error rate is improved to 1.7% by using the BN layers ("v2"). This is mostly because BN normalizes the convolution inputs for every layer, with a mean around 0, and also limits the maximum value of the inputs.
Hence, after quantization to 6-b, the input features are better preserved than with an un-normalized input distribution. The measured error rate, which is close to the expected value from an ideal digital implementation, shows the robustness of the CSRAM architecture in computing convolutions. The error rate for the MNIST data set is improved by 8.3 percentage points compared to prior work on in/near-memory compute [13], [16], where a 10% error rate was achieved. Next, we tested functionality at a lower voltage setting of V dd,DAC = 1 V, with the rest of the circuits operating at V dd (rest) = 0.8 V and a clock period of 400 ns. The maximum DAC precharge voltage (V a,max), corresponding to the maximum input code, is calibrated to 0.8 V. Hence, the magnitude of 1 LSB is ∼26 mV (instead of 32 mV for the previous case with V a,max = 1 V). Fig. 21 shows the measured error rate for this set of voltages. Due to reduced analog voltage precision, the error rates are slightly higher, with "v1" achieving 3.4% and "v2" achieving 1.9% for the MNIST test data set.
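The two LSB values quoted above are consistent with dividing the maximum DAC precharge voltage by the 31 non-zero magnitude levels of a 6-b code that includes a sign bit. The short Python sketch below checks that arithmetic. The formula and the function name `lsb_mv` are assumptions made for illustration; they are not the calibration procedure described in the paper.

```python
def lsb_mv(v_a_max: float, total_bits: int = 6) -> float:
    """Return the assumed LSB size in millivolts.

    Assumption: one of the bits is a sign bit, so the magnitude is
    spread over 2**(total_bits - 1) - 1 equal steps up to v_a_max.
    """
    magnitude_levels = 2 ** (total_bits - 1) - 1  # 31 for 6-b codes
    return 1000.0 * v_a_max / magnitude_levels


lsb_nominal = lsb_mv(1.0)  # V_a,max = 1 V
lsb_low = lsb_mv(0.8)      # V_a,max = 0.8 V
print(f"{lsb_nominal:.1f} mV, {lsb_low:.1f} mV")  # prints "32.3 mV, 25.8 mV"
```

Under this assumption, lowering V a,max from 1 V to 0.8 V shrinks each analog step by 20%, which matches the reported loss of analog voltage precision.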
The distributions of the partial convolution outputs from the ADC (Y OUT s) are shown in Fig. 22, for all four CONV/FC layers. For each of these layers, Y OUT has a mean of ≈1 LSB, which justifies the use of the serial ADC topology to compute it. Fig. 23 shows the distributions of the convolution inputs (X IN s) for the four layers. X IN s have been properly scaled and quantized to 6-b (including sign bit) before being sent to the CSRAM array to compute the convolutions. As seen from the figure, all the layers have a high proportion of 0s for the X IN s. This helps in reducing the GBL_DAC energy needed to convert and send them to the columns of the CSRAM array. Fig. 24 shows the overall energy consumption of the CSRAM array for running the different layers of LeNet-5, at f clk = 5 MHz. Of the four CONV/FC layers in LeNet-5, the energy consumption while running layers C1 and F6 is lower than that for layers C3 and F5. This is because layers C1 and F6 do not fully utilize the entire CSRAM array, due to their small filter sizes. However, that also translates to a lower energy efficiency for these layers (Table III), since the energy Fig. 24 also shows the energy breakdown for the three major components: GBL_DAC, ARY + MAV a, and CSH_ADC. The energy for the GBL_DACs is limited by the bit-precision requirement for representing the IFMP values. In contrast, the energy for the ARY, MAV a, and CSH_ADC circuits can be scaled down by scaling their supply voltages, at the cost of speed. Fig. 25 shows the measured energy consumption of the CSRAM array, with V dd,DAC = 1 V, V dd (rest) = 0.8 V, and f clk = 2.5 MHz. The reduced supply voltages help in decreasing the energy consumption, leading to better energy-efficiency numbers (Table IV).
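To make the input scaling and the observation about zero-valued inputs concrete, the following Python sketch quantizes a vector to 6-b signed codes and measures the fraction of zeros. It is only an illustration: the peak-based scaling rule, the helper names `quantize_signed` and `zero_fraction`, and the synthetic sparse data are all assumptions, since the text does not spell out the exact scaling used for the measurements.

```python
import random


def quantize_signed(values, total_bits=6):
    """Scale values by their peak magnitude and round to signed codes.

    Assumption: sign-magnitude style range of +/-(2**(total_bits-1) - 1),
    i.e. -31..31 for 6-b codes including the sign bit.
    """
    max_code = 2 ** (total_bits - 1) - 1
    peak = max((abs(v) for v in values), default=0.0)
    if peak == 0.0:
        return [0 for _ in values]
    return [
        max(-max_code, min(max_code, round(v / peak * max_code)))
        for v in values
    ]


def zero_fraction(codes):
    """Fraction of codes equal to 0 (these need no DAC conversion energy)."""
    return sum(1 for c in codes if c == 0) / len(codes)


# Hypothetical sparse inputs: about 60% of the entries are exactly zero.
rng = random.Random(0)
inputs = [rng.gauss(0.0, 1.0) if rng.random() < 0.4 else 0.0
          for _ in range(10_000)]
codes = quantize_signed(inputs)
print(f"zero fraction: {zero_fraction(codes):.2f}")
```

Small non-zero inputs also round to code 0, so the zero fraction after quantization is at least as large as the fraction of exact zeros in the original data.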
Recent hardware implementations [5], [9], [16], [19]-[21] for NNs have focused on reduced bit-precisions to achieve higher energy efficiency. Table V presents a comparison with prior work, both conventional digital [5], [9], [19], [20] and in-memory approaches [14], [16]. It should be noted that, while [5], [9], [16], [19], [20] are full systems, the main focus of this paper was to demonstrate in-memory computation capability for CNNs. Hence, ours does not include the energy for IFMP/OFMP memories. However, for the MNIST data set with the LeNet-5 CNN, we estimate [22] that these make only small contributions to the overall energy efficiency per MAV operation, due to the high parallelism supported by our in-memory approach. Furthermore, as shown in Figs. 22 and 23, both the inputs (X IN s) and the partial outputs (Y OUT s) have a high proportion of "0"s. Hence, in future work, data-dependent memory architectures, e.g., 8T SRAMs [23], [24], can be used to store and access the inputs/outputs. References [23], [24] take advantage of data properties to significantly reduce memory-access energy, which would be highly useful here. Compared to [9] and [5], we achieve >27× improvement in energy efficiency, due to the massively parallel in-memory analog computations. Our work achieves energy-efficiency numbers similar to [19] (considering a simplified technology scaling model), while using 6-b for IFMP/OFMP, compared to 1-b in [19]. Moreover, we achieve classification accuracy similar to [19] on MNIST, while using ∼37× fewer MAC/MAV operations per classification. Our numbers are also comparable to the energy efficiency of [20] (not quoted for MNIST), which uses 1-b for weights. Next, we compare our results to an in-memory mixed-signal computing approach [13], which implements a support vector machine (SVM) algorithm with 45 binary classifiers, for a 10-way MNIST digit classification.
We achieve energy efficiency similar to [13], while improving the classification accuracy from 90% to >98%. This is because we mitigate the degradation in computation accuracy caused by SRAM bit-cell variation. Although [13] can run at a higher speed, it only supports 1-b output resolution, compared to 6-b in our case. When compared to a near-memory approach [16], which uses only 1-b for IFMP/OFMP, we still achieve 8.5× improvement in energy efficiency. This is because our approach exploits the high parallelism of accessing multiple memory addresses simultaneously, without the need to sequentially and explicitly read out the data (filter weights) from the memory. We also achieve a higher classification accuracy compared to [16], because of using 6-b for inputs/outputs. Finally, our work also achieves better classification accuracy than the prior in-memory approach [14], even though [14] supports 8-b weights. This is because we reduce the effect of bit-cell variation when evaluating the weights. In addition, our approach benefits from more parallelism, by supporting 16 different dot-product computations per array per cycle, compared to 1 for [14].
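As a rough functional picture of the parallelism described above, the Python sketch below models one array cycle as 16 binary-weight multiply-and-average (MAV) operations applied to the same input vector. It is an idealized digital model only: it ignores all analog effects (DAC precharge, bitline averaging, ADC quantization), and the vector length of 64 as well as the names `binary_weight_mav` and `array_cycle` are assumptions for illustration rather than details taken from this section.

```python
import random

NUM_FILTERS = 16  # dot products per array per cycle (from the text)
VECTOR_LEN = 64   # hypothetical number of inputs per dot product


def binary_weight_mav(x_codes, weights):
    """Average of x_i * w_i, where each weight w_i is +1 or -1."""
    if len(x_codes) != len(weights):
        raise ValueError("inputs and weights must have the same length")
    total = sum(x if w > 0 else -x for x, w in zip(x_codes, weights))
    return total / len(x_codes)


def array_cycle(x_codes, filters):
    """Apply every stored filter to the same inputs, as one 'cycle'."""
    return [binary_weight_mav(x_codes, w) for w in filters]


rng = random.Random(0)
x = [rng.randint(-31, 31) for _ in range(VECTOR_LEN)]        # 6-b signed codes
filters = [[rng.choice((1, -1)) for _ in range(VECTOR_LEN)]  # 1-b weights
           for _ in range(NUM_FILTERS)]
outputs = array_cycle(x, filters)
print(len(outputs))  # prints 16
```

Because each output is an average rather than a sum, its magnitude never exceeds the largest input code, which is one reason a small output word length can be sufficient in this kind of scheme.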
VII. CONCLUSION
This paper presents an SRAM-embedded convolution (dot-product) computation architecture for running binary-weight neural networks. We demonstrated functionality with the LeNet-5 CNN on the MNIST hand-written digit recognition data set, achieving classification accuracy close to digital implementations and much better than prior in-memory approaches. This is made possible by our variation-tolerant architecture and by the support for multi-bit resolution of the input/output values. Compared to conventional digital accelerator approaches using small bitwidths, we achieve similar or better energy efficiency, by overcoming some of the major limitations of memories in traditional computing paradigms. This is because our architecture can significantly reduce data transfer by running massively parallel analog computations inside the memory. The results indicate that the proposed energy-efficient, SRAM-embedded dot-product computation architecture could enable low-power ML applications (e.g., "always-ON" sensing) for "smart" devices in the Internet of Everything.
