Large-scale digital computing almost exclusively relies on the von Neumann architecture, which comprises separate units for storage and computations. The energy-expensive transfer of data from the memory units to the computing cores results in the well-known von Neumann bottleneck. Various approaches aimed toward bypassing the von Neumann bottleneck are being extensively explored in the literature. These include in-memory computing based on CMOS and beyond CMOS technologies, wherein by making modifications to the memory array, vector computations can be carried out as close to the memory units as possible. Interestingly, in-memory techniques based on CMOS technology are of special importance due to the ubiquitous presence of field-effect transistors and the resultant ease of large-scale manufacturing and commercialization. On the other hand, perhaps the most important computation required for applications such as machine learning, etc., comprises the dot-product operation. Emerging nonvolatile memristive technologies have been shown to be very efficient in computing analog dot products in an in situ fashion. The memristive analog computation of the dot product results in much faster operation as opposed to digital vector in-memory bitwise Boolean computations. However, challenges with respect to large-scale manufacturing coupled with the limited endurance of memristors have hindered rapid commercialization of memristive-based computing solutions. In this paper, we show that the standard 8 transistor (8T) digital SRAM array can be configured as an analoglike in-memory multibit dot-product engine (DPE).
I. INTRODUCTION
STATE-OF-THE-ART computing platforms are widely based on the von Neumann architecture [1]. The von Neumann architecture is characterized by distinct spatial units for computing and storage. Such physically separated memory and compute units result in huge energy consumption due to the frequent data transfer between the two entities. Moreover, the transfer of data through a dedicated limited-bandwidth bus limits the overall compute throughput. The resulting memory bottleneck is the major throughput concern for hardware implementations of data-intensive applications such as machine learning and artificial intelligence.
A possible approach geared toward high throughput beyond von Neumann machines is to enable distributed computing characterized by tightly intertwined storage and compute capabilities. If computing can be performed inside the memory array, rather than in a spatially separated computing core, the compute throughput can be considerably increased. As such, one could think of ubiquitous computing on the silicon chip, wherein both the logic cores and the memory unit partake in computing operations. Various proposals for "in-memory" computing with respect to emerging nonvolatile technologies have been presented for both dot-product computations [2] , [3] as well as vector Boolean operations [4] . Prototypes based on emerging technologies can be found in [3] and [5] .
With respect to the CMOS technology, Boolean in-memory operations have been presented in [6] and [7]. Dong et al. [6] have presented vector Boolean operations using 6T SRAM cells. In addition, Agrawal et al. [7] have demonstrated that the 8 transistor (8T) SRAM cells lend themselves easily as vector compute primitives due to their decoupled read and write ports. Both works [6] and [7] are based on vector Boolean operations. Interestingly, by adding peripheral circuits, more complex functions such as addition and multiplication can be implemented using bulk-bitwise computations as shown in [8]. However, perhaps the most frequent and compute-intensive function required for numerous applications like machine learning is the dot-product operation. Memristors based on resistive RAMs (Re-RAMs) have been reported in many works as an analog dot-product compute engine [4], [9]. Few works based on analog computations in SRAM cells can be found in [10]-[13]. These works use 6T SRAM cells and rely on the resultant accumulated voltage on the bitlines (BLs). Not only are 6T SRAMs prone to read-disturb failures, but the failures are also a function of the voltage on the BLs. This leads to a tightly constrained design space for the proposed 6T SRAM-based analog computing.
Fig. 1. (a) Schematic of a standard 8T-SRAM bit-cell. It consists of two decoupled ports for reading and writing, respectively. (b) First proposed configuration (Config-A) for implementing the DPE using the 8T-SRAM bit-cell. The SL is connected to the input analog voltage v_i, and the RWL is turned on. The current I_RBL through the RBL is sensed and is proportional to the dot product v_i · g_i, where g_i is the ON-/OFF-conductance of the transistors M1 and M2. (c) Second proposed configuration (Config-B). The input analog voltages are applied to the RWL, while the SL is supplied with a constant voltage V_bias. The current through the RBL is sensed in the same way as in Config-A.
More recently, there have been works on modified SRAM cells such as 8T or 10T cells for more robust in-memory computing [14]-[17]. These works use a modified SRAM cell enabling parallel computations of binary dot products. In general, the existing analog computing works in SRAM cells have either implemented multibit multiplication or have been limited to binary dot products. Interestingly, using current-based computations allows highly parallel multibit dot-product operation. 8T SRAM cells as current-based dot-product accelerators were first proposed in [18]. Subsequently, the work in [19] used a twin 8T cell to enable multibit dot products using current-based computations. Building on [18], in this paper, we employ 8T cells, which are much more robust than 6T cells due to their isolated read port. We show that without modifying the basic bit-cell of the 8T SRAM, it is possible to configure the 8T cell for in-memory multibit dot-product computations. Note that in sharp contrast to the previous works on in-memory computing with the CMOS technology, we enable current-based, analoglike dot-product computations using robust digital 8T bit-cells.
The key highlights of this paper are as follows.
1) We show that the conventional 8T SRAM cell can be used as a primitive for analoglike dot-product computations, without modifying the bit-cell circuitry. In addition, we present two different configurations for enabling dot-product computation using the 8T cell.
2) Apart from the sizing of the individual transistors constituting the read port of the 8T cell, the basic bit-cell structure remains unaltered. Thus, the 8T SRAM array can also be used for usual digital memory read and write operations. As such, the presented 8T cell array can act as a dedicated dot-product engine (DPE) or as an on-demand dot-product accelerator.
3) We report a detailed simulation analysis using 45-nm predictive technology models (PTMs), including layout analysis and the effect of nonidealities such as line resistances and variation in transistor threshold voltages, highlighting the tradeoffs presented by each of the two proposed configurations.
II. 8T-SRAM AS A DOT-PRODUCT ENGINE
A conventional 8T bit-cell is schematically shown in Fig. 1(a). It consists of the well-known 6T-SRAM bit-cell with two additional transistors that constitute a decoupled read port. To write into the cell, the write word-line (WWL) is enabled and the write BLs (WBL/WBLB) are driven to V_DD or ground depending on the bit to be stored. To read a value from the cell, the read BL (RBL) is precharged to V_DD and the read WL (RWL) is enabled. Note that the source-line (SL) is connected to ground. Depending on whether the bit-cell stores a logic "1" or "0," the RBL discharges to 0 V or stays at V_DD, respectively. The resulting voltage at the RBL is read out by the sense amplifiers. Although 8T cells incur a ∼30% increase in bit-cell area compared to the 6T design, they are read-disturb free and more robust due to separate read and write path optimizations [20].
We now show how such 8T-SRAMs, with no modification to the basic bit-cell circuit (except for the sizing of the read transistors), can behave as a DPE, without affecting the stability of the bits stored in the SRAM cells. We propose two configurations, Config-A and Config-B, for enabling dot-product operations in the 8T-SRAMs. Config-A is shown in Fig. 1(b). The inputs v_i (encoded as analog voltages) are applied to the SLs of the SRAM array, and the RWL is also enabled. The RBL is connected to a sensing circuitry, which we will describe later. Thus, there is a static current flow from the SL to the RBL, which is proportional to the input v_i and the conductance of the two transistors M1 and M2. For simplicity, assume that the weights (stored in the SRAM) have a single-bit precision. If the bit-cell stores "0," the transistor M1 is off, and the output current through the RBL is close to 0. On the other hand, if the bit-cell stores a "1," the current is proportional to v_i · g_ON, where g_ON is the series ON-conductance of the transistors. Assume that similar inputs v_i are applied on the SLs for each row of the memory array. Since the RBL is common throughout the column, the currents from all the inputs v_i are summed into the RBL. Moreover, since the SL is common throughout each row, the same inputs v_i are supplied to multiple columns.
Fig. 2. 8T-SRAM memory array for computing dot products with 4-bit weight precision. Only the read port is shown; the 6T storage cell and the write port are not shown. The array columns are grouped in four, and the transistors M1 and M2 are sized in the ratio 8 : 4 : 2 : 1 for the four columns. The output current I_OUT^j represents the weighted sum of the I_RBL of the four columns, which is approximately equal to the desired dot product.
Thus, the final output current through the RBL of each column is $I_{RBL}^{j} = \sum_i (v_i \cdot g_i^{j})$, where $g_i^{j}$ is the ON- or OFF-conductance of the transistors, depending on whether the bit-cell in the ith row and jth column stores a "1" or "0," respectively. The output current vector thus resembles the vector-matrix dot product, where the vector is $v_i$ in the form of input analog voltages, and the matrix is $g_i^{j}$ stored as digital data in the SRAM.
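The column-wise current summation described above can be illustrated with a minimal, idealized sketch. It assumes perfectly linear cells with a fixed ON-conductance and zero OFF-conductance; the value of `G_ON`, the function name `rbl_currents`, and the example data are hypothetical and are not taken from the simulations reported in this paper.

```python
# Idealized model of Config-A with single-bit weights.
# G_ON and G_OFF are illustrative assumptions, not values from the paper.
G_ON = 1.0e-4   # assumed series ON-conductance of M1 and M2 (siemens)
G_OFF = 0.0     # OFF-conductance, idealized as zero


def rbl_currents(v, bits):
    """Return the current summed on each RBL, one value per column.

    v    -- input voltages, one per row (applied on the SLs)
    bits -- stored bits, bits[i][j] for row i and column j
    """
    n_cols = len(bits[0])
    currents = [0.0] * n_cols
    for v_i, row in zip(v, bits):
        for j, bit in enumerate(row):
            g = G_ON if bit else G_OFF
            currents[j] += v_i * g   # each cell contributes v_i * g_i^j
    return currents


v = [0.10, 0.20, 0.30]          # three rows
bits = [[1, 0],                 # two columns
        [1, 1],
        [0, 1]]
i_rbl = rbl_currents(v, bits)   # one summed current per column
```

Each entry of `i_rbl` is the dot product of the input vector with one column of stored bits, scaled by the assumed ON-conductance.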
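Let us now consider a 4-bit precision for the weights. If the weight $W_i^{j} = w_3 w_2 w_1 w_0$, where $w_k$ ($k = 0, \ldots, 3$) are the bits of the 4-bit weight, the vector-matrix dot product becomes
$$\sum_i (v_i \cdot W_i^{j}) = \sum_i \left[ v_i \cdot (2^3 w_3 + 2^2 w_2 + 2^1 w_1 + 2^0 w_0) \right] = \sum_i (v_i \cdot 2^3 w_3) + \sum_i (v_i \cdot 2^2 w_2) + \sum_i (v_i \cdot 2^1 w_1) + \sum_i (v_i \cdot 2^0 w_0).$$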
Now, if we size the read transistors M1 and M2 of the SRAM bit-cells in columns 1 through 4 in the ratio 2^3 : 2^2 : 2^1 : 1, as shown in Fig. 2, the transistor conductances in the ON-state would also be in the ratio 2^3 : 2^2 : 2^1 : 1. Thus, summing the currents through the RBLs of the four columns yields the required dot product in accordance with the equation shown above. This sizing pattern can be repeated throughout the array. In addition, one could also use transistors having different threshold voltages to mimic the required ratio of conductances, 2^3 : 2^2 : 2^1 : 1. Note that the currents through the RBLs of the four consecutive columns are summed together; thus, we obtain one analog output current value for every group of four columns. In other words, the digital 4-bit word stored in the SRAM array is multiplied by the input voltage v_i and summed up by analog addition of the currents on the RBLs. This one-go computation of vector multiplication and summation in a digital memory array would result in high-throughput computations of the dot products.
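As a quick check of the binary-weighted column grouping, the following sketch sums the contributions of four columns whose relative conductances are 8 : 4 : 2 : 1 and compares the result with a direct multibit dot product. It assumes ideal, unit-normalized conductances; the helper names `split_bits` and `grouped_output` and the example values are hypothetical.

```python
COLUMN_RATIO = (8, 4, 2, 1)  # relative ON-conductance of the four columns


def split_bits(w):
    """Return the bits (w3, w2, w1, w0) of a 4-bit weight, MSB first."""
    return [(w >> k) & 1 for k in (3, 2, 1, 0)]


def grouped_output(v, weights):
    """Sum the currents of the four columns in one 4-bit group.

    Conductances are normalized so that the LSB column has unit
    ON-conductance (an idealization, not a simulated value).
    """
    total = 0.0
    for v_i, w in zip(v, weights):
        for ratio, bit in zip(COLUMN_RATIO, split_bits(w)):
            total += v_i * ratio * bit
    return total


v = [0.5, 0.25, 1.0]     # normalized inputs, one per row
weights = [15, 10, 4]    # 4-bit weights "1111", "1010", "0100"
analog_sum = grouped_output(v, weights)
direct = sum(v_i * w for v_i, w in zip(v, weights))
```

Under these assumptions `analog_sum` and `direct` agree, which is the property the column sizing is meant to provide.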
It is worth mentioning that the way the input v_i is multiplied by the stored weights and summed up is reminiscent of memristive dot-product computations [9]. However, a concern with the presented SRAM-based computation is the fact that the ON-resistance of the transistors (a few kilohms) is much lower than a typical memristor ON-resistance, which is in the range of a few tens of kilohms [21]. As such, the static current flowing through the ON-transistors M1 and M2 would typically be much higher in the presented proposal. In order to reduce the static current flow, we propose scaling down the supply voltage of the SRAM cell. Note that, interestingly, 8T cells are known to retain their robust operation even at highly scaled supply voltages [22]. In the next section, we use a V_DD lower than the nominal V_DD of 1 V. We now describe another way of reducing the current, although with tradeoffs, as detailed in the following.
Config-B is shown in Fig. 1(c). Here, the SLs are connected to a constant voltage V_bias. The input vector v_i is connected to the RWLs, i.e., the gate of M2. Similar to Config-A, the output current I_RBL is proportional to v_i. We will later show from our simulations that for a certain range of input voltage values, we get a linear relationship between I_RBL and v_i, which can be exploited to calculate the approximate dot product. To implement multibit precision, the transistor sizing is done in the same way as in Config-A, as represented in Fig. 2, so that I_RBL is directly proportional to the transistor conductances. The key features of the proposed Config-B are as follows. The input voltages v_i have a capacitive load, as opposed to a resistive load in Config-A. This relaxes the constraints on the input voltage generator circuitry and is useful while cascading two or more stages of the DPE. However, as presented in the next section, Config-B has a small nonzero current corresponding to zero input, as opposed to Config-A, which has zero current for zero input.
In order to sense the output current at the RBLs, we use a current-to-voltage converter. This can most simply be a resistor, as shown in Fig. 1. However, there are a few constraints. As the output current increases, the voltage drop across the output resistor increases, which, in turn, changes the desired current output. A change in the voltage on the RBL would also change the voltage across the transistors M1 and M2, thereby making their conductance a function of the voltage on the RBL. Thus, at higher currents corresponding to multiple rows of the memory array, I_RBL does not approximate the vector-matrix dot product but deviates from the ideal output. This dependence of the RBL voltage on the current I_RBL will be discussed in detail in the next section, along with possible solutions.
III. RESULTS
The operation of the proposed configurations (Config-A and Config-B) for implementing a multibit DPE was simulated using HSPICE on the 45-nm PTM technology [23]. For the entire analysis, we have used a scaled-down V_DD of 0.65 V for the SRAM cells. The main components of the DPE implementation are the input voltages and the conductances of the transistors for different states of the cells. A summary of the analysis for the two configurations is presented in Fig. 3. In Fig. 3, we have assumed a sensing resistance of 50 Ω connected to the RBL. Note that a small sense resistance is required to ensure that the voltage across the sensing resistance is not high enough to drastically alter the conductances of the connected transistors M1 and M2.
In Fig. 3(a) and (b), we plot the output current in the RBL (I_RBL) as a function of the input voltage for three 4-bit weight combinations "1111," "1010," and "0100" for the two different configurations described in the previous section. The results presented are for a single 4-bit cell. To preserve the accuracy of a dot-product operation, it is necessary to operate the cell in the voltage ranges such that the current is a linear function of the applied voltage v_i. These voltage ranges are marked as a linear region in Fig. 3(a) and (b). The slope of the linear section of the I_RBL versus V_in plot varies with weight, thus signifying a dot-product operation. Furthermore, at the left voltage extremity of the linear region, I_RBL tends to zero irrespective of the weight, thus satisfying the constraint that the output current is zero for zero V_in. It is to be noted that the two configurations show significantly different characteristics due to the different point of application of the input voltages.
To expand the dot-product functionality to multiple rows, we performed an analysis for up to 64 rows in the SRAM array, driven by 64 input voltages. In the worst case condition, when the 4-bit weight stores "1111," maximum current flows through the RBLs, thereby increasing the voltage drop across the output resistance. Fig. 3(e) and (f) indicates that the total current I_RBL deviates from its ideal value with an increasing number of rows in the worst case condition. The deviation in Fig. 3(e) and (f) arises because we sense the output current with an equivalent sensing resistance (R_sense), and hence, the final voltage on the BL (V_BL) is dependent on the current I_RBL. At the same time, I_RBL is also dependent on V_BL, and as a result, the effective conductance of the cell varies as V_BL changes as a function of the number of rows. It was also observed that the deviation reduces with decreasing sensing resistance, as expected. Another concern with respect to Fig. 3 is the fact that the total summed-up current reaches almost 6 mA for 64 rows in the worst case condition (all the weights are "1111").
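The coupling between I_RBL and V_BL through the sensing resistance can be illustrated with a first-order toy model. It assumes every activated row contributes a constant conductance `G_CELL` (a hypothetical value) between the input and the RBL, so it captures only the voltage-divider effect of R_sense and not the additional change in transistor conductance seen in the HSPICE results. Solving I = N·g·(v − I·R) for I gives the expression used below.

```python
R_SENSE = 50.0     # sensing resistance in ohms (the value assumed for Fig. 3)
G_CELL = 2.5e-4    # assumed constant per-row conductance (siemens), hypothetical


def sensed_current(n_rows, v_in, r_sense=R_SENSE, g_cell=G_CELL):
    """Current through the sense resistor for n_rows identical rows.

    Derived from I = G * (v_in - I * r_sense) with G = n_rows * g_cell.
    """
    g_total = n_rows * g_cell
    return g_total * v_in / (1.0 + g_total * r_sense)


def relative_deviation(n_rows, v_in=0.3):
    """Fractional shortfall of the sensed current from the ideal n*g*v."""
    ideal = n_rows * G_CELL * v_in
    return (ideal - sensed_current(n_rows, v_in)) / ideal


deviations = {n: relative_deviation(n) for n in (1, 8, 16, 64)}
```

In this simplified picture the shortfall grows with the number of activated rows and shrinks when `r_sense` is reduced, which is the same qualitative trend reported for Fig. 3(e) and (f).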
There are several ways to circumvent the deviation from ideal behavior with an increasing number of simultaneous row accesses and also reduce the maximum current flowing through the RBLs. One possibility is to use an operational amplifier (Opamp) at the end of each 4-bit column, where the negative differential input of the Opamp is fed by the BL corresponding to a particular column. The positive input, on the other hand, is supplied with a combination of the Opamp offset voltage and any desired voltage required for the suitable operation of the dot product, as shown in the left-hand side of Fig. 4. The Opamp provides a means of sensing the summed-up current at the RBL while maintaining a constant voltage at the RBL. Opamps in the configuration shown in Fig. 4 have traditionally been used for sensing in memristive crossbars, as in [3].
We performed the same analysis as previously described in Fig. 3 for the two proposed configurations with the BL terminated by an Opamp. For our analysis, we have set V_pos = 0.1 V for the positive input of the Opamp, and thus, the analysis is limited to input voltages above V_pos to maintain unidirectional current. Note that we have used an ideal Opamp for our simulations, where the voltage V_pos can account for a combination of the nonideal offset voltage of the Opamp and an externally supplied voltage. Fig. 4(a) and (b) shows the plot of I_RBL versus the input voltage V_in for the two configurations. Behavior similar to that in Fig. 3(a) and (b) is observed even in the presence of the Opamp. However, note that the current ranges have decreased since the RBL is now clamped at V_pos. Furthermore, the dot-product operation is only valid for V_in > V_pos, and thus, the acceptable input range is shifted in the presence of an Opamp. Fig. 4(c) and (d) shows the behavior of I_RBL versus weight levels for the two configurations and, desirably, linearity is preserved. Fig. 4(e) and (f) shows the current through the RBL as a function of the number of rows. As expected, due to the high input impedance of the Opamp and the clamping of V_BL at a voltage V_pos, the deviation of the summed-up current from the ideal value has been largely mitigated. Although the current levels have reduced significantly as compared to Fig. 3, the resultant current for 64 rows would still be higher than the electromigration limit for the metal lines constituting the RBL [24]. One possible solution is to sequentially access a smaller section of the crossbar (say 16 or 8 rows at a time), convert the analog current into its digital counterpart each time, and finally, add all accumulated digital results. In addition, the use of high-threshold transistors for the read port of the SRAM would also help to reduce the maximum current values. Furthermore, the maximum current is obtained only when all the weights are "1111," which is usually not the case due to the sparsity of matrices involved in various applications, as in [25] and [26].
Fig. 5. Fully connected network topology consisting of three layers: the input layer, the hidden layer, and the output layer [21]. We have used M = 784, N = 500, and P = 10.
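The sequential-access scheme can be sketched as a blocked dot product in which each partial sum is digitized before accumulation. The sketch is behavioral only: the uniform rounding in `quantize` stands in for an unspecified analog-to-digital converter, and `blocked_dot`, `adc_step`, and the example data are hypothetical names and values.

```python
def quantize(x, step):
    """Model an ADC by rounding to the nearest multiple of `step`."""
    return round(x / step) * step


def blocked_dot(v, w, block_rows=16, adc_step=None):
    """Accumulate a dot product block_rows rows at a time.

    Each block models one simultaneous activation of block_rows rows;
    its partial sum is optionally digitized before being added.
    """
    total = 0.0
    for start in range(0, len(v), block_rows):
        stop = start + block_rows
        partial = sum(a * b for a, b in zip(v[start:stop], w[start:stop]))
        if adc_step is not None:
            partial = quantize(partial, adc_step)
        total += partial
    return total


v = [0.5] * 64                    # normalized inputs for 64 rows
w = [i % 16 for i in range(64)]   # 4-bit weight levels 0..15
full = sum(a * b for a, b in zip(v, w))
by_16 = blocked_dot(v, w, block_rows=16)
by_8_adc = blocked_dot(v, w, block_rows=8, adc_step=1.0)
```

Without quantization the blocked result matches the full dot product; with a finite `adc_step`, each block adds at most half a step of rounding error, so the choice of block size trades peak RBL current against the number of conversions.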
We also performed functional simulations using the proposed DPE based on Config. A in a fully connected artificial neural network consisting of three layers as shown in Fig. 5 . The main motivation behind this analysis is to evaluate the impact of the nonlinearity in the I -V characteristics on the inference accuracy of the neural network. We chose an input voltage range of 0.1-0.22 V. As can be observed in Fig. 4(a) , the I -V characteristics are not exactly linear within this range, as such a network-level functional simulation is required to ascertain the impact of the nonlinearity on classification accuracy. The network details are as follows. The hidden layer consisted of 500 neurons. The network was trained using the backpropagation algorithm [27] on the MNIST digit recognition data set under ideal conditions using MATLAB Deep Learning Toolbox [28] .
During inferencing, we incorporated the proposed 8T-SRAM-based DPE in the evaluation framework by discretizing and mapping the trained weights proportionally to the conductances of the 4-bit synaptic cell. The linear range of the voltage was chosen to be [0.1-0.22 V] and normalized to a range of [0, 1]. The dot-product operation was ensured by normalizing the I -V characteristics for all the weight levels such that current corresponding to the highest input voltage and highest weight level is
The activation function of the neuron was considered to be a behavioral satlin function, scaled according to the scaling factor of the weights to preserve the mathematical integrity of the network. Note that the normalization of the current and the input voltage simplifies the scaling of the neuron activation function. The accuracy of the digit recognition task was calculated to be merely 0.11% lower than the ideal case (98.27%), thus indicating that the proposed DPE can be seamlessly integrated into the neural network framework without significant loss in performance.
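The mapping used during inference can be summarized with a small behavioral sketch: input voltages in the chosen linear range are normalized to [0, 1], trained weights are discretized to the 16 levels of a 4-bit cell, and the neuron applies a saturating linear activation rescaled by the weight scaling factor. The handling of weights here (non-negative magnitudes up to a hypothetical `w_max`) and all function names are assumptions for illustration, not the evaluation framework used in this paper.

```python
V_MIN, V_MAX = 0.10, 0.22   # linear input range chosen in the text (volts)
LEVELS = 15                 # highest level of a 4-bit weight


def normalize_voltage(v):
    """Map an input voltage in [V_MIN, V_MAX] to the range [0, 1]."""
    return (v - V_MIN) / (V_MAX - V_MIN)


def quantize_weight(w, w_max):
    """Map a weight in [0, w_max] to an integer level 0..15 (assumed scheme)."""
    level = round(w / w_max * LEVELS)
    return max(0, min(LEVELS, level))


def satlin(x):
    """Saturating linear activation: 0 below 0, x on [0, 1], 1 above 1."""
    return max(0.0, min(1.0, x))


def neuron_output(voltages, levels, w_max):
    """Ideal dot product of normalized inputs and weight levels, rescaled."""
    x = [normalize_voltage(v) for v in voltages]
    acc = sum(level * x_i for level, x_i in zip(levels, x))
    return satlin(acc * w_max / LEVELS)   # undo the weight scaling factor


out = neuron_output([0.22, 0.10], [15, 15], w_max=1.0)
```

The rescaling by `w_max / LEVELS` is what keeps the activation consistent with the discretized weights in this simplified model.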
Furthermore, it is to be noted that, in many cases, the inherent resilience of the applications that require dot-product computations can be leveraged to circumvent some of the circuit level nonidealities. For example, for cases like training and inference of an artificial or a spiking neural network, various algorithmic resilience techniques can be applied where modeling circuit nonidealities and modifying the standard training algorithms [21] , [29] can help to preserve the ideal accuracy of the classification task concerned. In addition, the proposed technique can either be used as a dedicated CMOS-based dot-product compute engine or as an on-demand dot-product accelerator, wherein the 8T array acts as usual digital storage and can also be configured as a compute engine as and when required. It is also worth mentioning that the 8T cell has also been demonstrated in [7] as a primitive for vector Boolean operations. This work significantly augments the possible use cases for the 8T cells by adding analoglike dot-product acceleration.
Due to the different sizing of the read transistors and the additional metal line routing for the SL, the proposed configurations incur an area penalty compared to the standard 8T-SRAM bitcell used for storage. Fig. 6 shows the thin-cell layout for a standard 8T-SRAM bitcell [20]. Note that the rightmost diffusion with width W constitutes the read transistors (M1 and M2). To implement the 4-bit precision dot product, we size the width of the read transistors in the ratio 8 : 4 : 2 : 1, as described earlier. Thus, the width of the rightmost diffusion is increased to 8W, 4W, and 2W, increasing the bitcell length (horizontal dimension) by ∼39.6%, 17.1%, and 5.7% for bits 3, 2, and 1, respectively, compared to the standard minimum-sized 8T bitcell with diffusion width W. Moreover, to incorporate an extra metal line (SL), which runs parallel to the V_DD and ground lines, the cell width (vertical dimension) increases by ∼12.5%. The resulting layout of the first four columns for two consecutive rows in the proposed array is shown in Fig. 7. The overall area overhead for the whole SRAM array with 4-bit weight precision amounts to ∼29.4% compared to the standard 8T SRAM array. Note that this low area overhead results from the fact that both read transistors M1 and M2 share a common diffusion layer, and hence, an increase in transistor width can be easily accomplished by having a longer diffusion, without worrying about spacing between metal or poly layers. In addition, instead of progressively sizing the read transistors, one could also use a multi-V_T design, wherein the LSBs consist of high-V_T read transistors and the MSBs consist of nominal (or low) V_T read transistors. The use of a multi-V_T design can significantly reduce the reported area overhead. As such, the reported area overhead is close to the worst case impact on the bitcell area without resorting to additional circuit tricks like multi-V_T design.
IV. VARIATION ANALYSIS
To ascertain the robustness of the presented dot-product computations, in this section, we analyze the effects of nonidealities on the output current. The nonidealities considered are SL and BL line resistances and transistor threshold voltage variations.
A. Corner Analysis
Since the currents through the read transistors depend on the effective conductance of the read port, one would expect that the output current on the BL would be a strong function of global threshold voltage of the transistor (VT) due to process corners or temperature changes. We reproduced [ Fig. 8(a) ] the leftmost figure of Fig. 4(b) assuming global changes in the threshold voltage of ±90 mV% as compared to the nominal threshold voltage. As can be seen from the figure, the output currents indeed are a strong function of global variations in VT. However, interestingly, such global effects can be easily taken care of by compensating/calibrating circuits. An initial calibration based on on-chip tracking of process corner or temperature can be used to either calibrate the input voltage ranges or the offset voltage (Vpos) applied to the OPAMP. As an illustration, in Fig. 8(b) , we show that by adding or subtracting voltages from Vpos and Vbias, one can easily compensate the deviation in current due to global changes in VT. Note that the plots in Fig. 8(b) were generated in a similar manner to Fig. 4(b) except that Vpos and Vbias were added or subtracted with a constant compensating voltage. One can also compensate digitally, by adding or subtracting a constant number (a bias) from the resultant dot product (after converting it to a digital value). It is worth mentioning that since process corners and temperatures have global effects, a single bias estimator circuit can be shared among multiple subbanks of the SRAM array, amortizing its energy and area overhead. For the rest of the analysis, we have assumed the nominal corner case.
B. Effect of Line Resistances
Both the SL and BL line resistances add parasitic voltage drops along the rows and the columns. Moreover, to complicate the analysis, the error in the output current would be a function of both the spatial dependence due to distributed line resistances and the data dependence on the stored weights in the memory array. We, therefore, resort to worst case analysis. The worst case arises when all the weights and all the inputs are at the highest value. This scenario results in maximum current flow through the BLs and SLs and, hence, has the maximum impact of parasitic line resistances. To analyze the impact, we consider a line resistance of 1.3 Ω/μm [30]. Based on the layout, the average line resistance between adjacent bitcells was found to be 1.25 Ω in the BL direction and 2.5 Ω in the SL direction. We explore both configurations (Config. A and Config. B) to analyze the impact of the line resistances and ways to compensate for the voltage degradation along the metal lines. In addition, for Config. B, we explore two variants to minimize the effect of line resistances. Note that in Config. B, the inputs are connected to the WLs, i.e., to the gates of the transistors. As such, the inputs drive a capacitive load and there is no voltage degradation due to line resistances. On the other hand, the bias voltage is connected to the SL, which would degrade due to line resistances and induce error in the final output current flowing through each column. To minimize this error, the two variants of Config. B presented in this paper are as follows.
1) Config. B with the bias voltage driving the SL from both ends (i.e., from the extreme right and extreme left ends, as shown in Fig. 9).
2) Config. B with the SL tapped every 16 bits with regenerated values of the bias voltage in the horizontal direction, as depicted in Fig. 9.
Fig. 10 shows the worst case impact (when all inputs are at the highest value and all the weights are "1111") of the line resistances in terms of percentage error in the output current for various configurations, for simultaneous activation of 16 and 8 rows, respectively. Note that this is the error with respect to the current values; it should not be confused with the error corresponding to the classification accuracy. As observed, Config. A has a higher error than all the variants of Config. B. Note that tapping is infeasible in the case of Config. A because the input voltages are connected to the SL; tapping would therefore require regeneration of the input voltage along the horizontal direction. In contrast, in the case of Config. B, the SL is supplied by a global bias voltage and, hence, is easy to regenerate. We have assumed an array size of 64 × 128 (64 rows and 128 columns). Furthermore, for our analysis, we assume that the "farthest" 16 rows are simultaneously activated. SL and BL distributed resistances were included for all the activated rows, while the unactivated rows were modeled by an equivalent lumped resistance.
Fig. 10. Percentage error in output current for the worst case combination (highest input values and all weights = "1111"). The left set of bar graphs represents the error for various combinations assuming 16 rows are activated simultaneously for the dot-product computation, while the right set of bar graphs corresponds to simultaneous activation of eight rows.
For the rest of the analysis, we choose Config. B with tapping every 16 bits. We now analyze the error percentage across all voltage and weight combinations to understand the impact of the degradation in light of the applications discussed in this paper. Fig. 11(a) and (b) shows the 2-D error map due to line resistances for various combinations of input voltages and weights for 16 and 8 activated rows, respectively. Note that for each combination of input voltage and weight, all rows are supplied with the same input voltage, and all weights in the array are the same. In addition, Fig. 11(c) shows the weight-level distribution of a neural network layer trained on the MNIST data set. As observed from Fig. 11(a) and (b), the error above 6% for 16 rows and above 4% for 8 rows is concentrated in the top 25% of the map, corresponding to the highest weights and inputs. However, from Fig. 11(c), we observe that for relevant applications such as neural networks, the trained weights are mostly concentrated in the weight levels 1-6, where the error is close to 0%-5% for 16 rows and 0%-3.4% for 8 rows.
Fig. 11. (a) and (b) Percentage error map arising due to line resistance for different weight levels ranging from "0000" to "1111" and input voltages ranging from 0.35 to 0.675 V for 16 and 8 activated rows, respectively. For example, the data point corresponding to V = 0.35 V and weight level = "0000" means the test case where all the 4-bit weight elements in the memory array are considered to be at weight "0000," and the input voltages to all rows are 0.35 V. The percentage error decreases with decreasing weight and input value. (c) Probability of occurrence of weight levels in a neural network trained on the MNIST data set, showing that the lowest weight levels have the highest frequency, thus indicating a low impact due to line resistance.
From this analysis, we can conclude that using the circuit techniques presented, i.e., driving the SL from both sides and tapping every 16 columns, and also leveraging the weight distribution of a trained neural network, the effect of line resistances for simultaneous activation of 8 and 16 rows can be substantially mitigated. For example, for Config. B with tapping every 16 columns and with the SL driven from both ends, the worst case error is contained within 9%. Furthermore, it was observed that the error improves rapidly when the input voltages or the programmed weights are less than their maximum possible values.
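The qualitative benefit of driving the SL from both ends and tapping it periodically can be illustrated with a minimal sketch, assuming a purely resistive ladder in which every column sinks the same constant current. This is a simplification and not the circuit-level model behind Fig. 10: the segment resistance `r_seg`, cell current `i_cell`, bias `v_bias`, and all function names are hypothetical and chosen only for illustration.

```python
def solve_tridiagonal(lower, diag, upper, rhs):
    """Thomas algorithm: lower[k] multiplies x[k-1], upper[k] multiplies x[k+1]."""
    n = len(diag)
    c = [0.0] * n
    d = [0.0] * n
    c[0] = upper[0] / diag[0]
    d[0] = rhs[0] / diag[0]
    for k in range(1, n):
        denom = diag[k] - lower[k] * c[k - 1]
        c[k] = upper[k] / denom
        d[k] = (rhs[k] - lower[k] * d[k - 1]) / denom
    x = [0.0] * n
    x[-1] = d[-1]
    for k in range(n - 2, -1, -1):
        x[k] = d[k] - c[k] * x[k + 1]
    return x


def line_voltages(n_cols, r_seg, i_cell, v_bias, taps):
    """Node voltages on a resistive line; nodes listed in taps are held at v_bias."""
    g = 1.0 / r_seg
    lower = [0.0] * n_cols
    diag = [0.0] * n_cols
    upper = [0.0] * n_cols
    rhs = [0.0] * n_cols
    tap_set = set(taps)
    for k in range(n_cols):
        if k in tap_set:
            diag[k] = 1.0          # tapped node: pinned to the bias voltage
            rhs[k] = v_bias
            continue
        if k > 0:
            lower[k] = -g
            diag[k] += g
        if k < n_cols - 1:
            upper[k] = -g
            diag[k] += g
        rhs[k] = -i_cell           # KCL: the cell sinks i_cell from this node
    return solve_tridiagonal(lower, diag, upper, rhs)


def worst_drop(n_cols, taps, r_seg=1.0, i_cell=1e-6, v_bias=0.5):
    """Largest voltage drop below v_bias anywhere on the line (hypothetical values)."""
    volts = line_voltages(n_cols, r_seg, i_cell, v_bias, taps)
    return v_bias - min(volts)


N_COLS = 128
drop_one_end = worst_drop(N_COLS, [0])
drop_both_ends = worst_drop(N_COLS, [0, N_COLS - 1])
drop_tapped = worst_drop(N_COLS, list(range(0, N_COLS, 16)) + [N_COLS - 1])
print(drop_one_end, drop_both_ends, drop_tapped)
```

Under these assumptions, the worst-case drop shrinks when the line is driven from both ends and shrinks further with a tap every 16 columns, which is the trend discussed above. The sketch reports a voltage drop only; translating it into a percentage error in output current would require the actual device characteristics.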
C. V_T Variations
The variations in transistors can result in errors in the dot-product operations. To analyze this, we perform 1000 Monte Carlo simulations to assess the variation of the output current for various combinations of input voltages and weights. We considered a 30-mV σ variation of the threshold voltage (V_T) for the minimum-sized transistor and scaled the variation with width as $\sigma_L = \sigma_{\min}\,(W_{\min}L_{\min}/WL)^{1/2}$. Note that for random variations, it is customary to lump the various sources of variation into an effective variation in the transistor threshold voltage [31]. We ran 1000 Monte Carlo simulations for each voltage value ranging from 0.35 to 0.675 V in steps of 0.025 V and each weight level ranging from 0 to 15, and obtained the standard deviation in output current for each case. This captures the impact of V_T variations for a considerable precision of gate voltages. We calculated the standard deviation about the mean current for the entire range of output current from the cases described above for 16 activated rows of the memory array. The minimum current on the x-axis in Fig. 12(a) arises when the input voltages and (or) the stored weights in the memory array are zero. The next higher level of current is obtained when either the weight or the input voltage is incremented. It is worth noting that Fig. 12(a) corresponds to 16 activated rows in an array of 64 rows and 128 columns. Furthermore, for the analysis in Fig. 12(a), we have neglected the effect of line resistances for the following reasons.
1) Adding line resistances makes the standard deviation in Fig. 12(a) a function of not only the random V_T variation and the weights but also makes the deviation in current spatially dependent. This leads to a nontrivial analysis problem that can quickly become intractable.
2) As shown in Section IV-B, even the worst case error due to line resistances was well within acceptable limits.
Fig. 12(a) shows that the absolute standard deviation is higher for a higher value of output current. We fit a representative standard deviation for each current value using a polynomial fit, as shown by the fitted line in Fig. 12. In the functional simulations to evaluate the classification accuracy with such V_T variations, we calculated the output current from every 16 rows and replaced it with a random value from a Gaussian distribution with the standard deviation corresponding to that particular output current. The final classification accuracy was 0.01% lower than the case without random variations. Table I shows the inference accuracy on the MNIST data set for the 8T DPE in comparison to the ideal accuracy. This error resiliency is mainly due to the inherent robustness of neural networks. If we look at the standard deviation as a percentage of the total current in Fig. 12(b), we observe that the errors are the highest for low currents. However, device-to-device variations can have both additive and subtractive effects on the currents through each cell. This means that for simultaneously activated rows, the resulting current deviations in individual synapses due to variations tend to average out, as we are primarily interested in the sum of the currents. This results in a lower error due to variations. As such, for generic dot-product operations, there exists a tradeoff: a lower voltage implies a lower current (below the electromigration limit), allowing more rows to be turned on simultaneously and, hence, leading to higher parallelism.
At the same time, lower voltage also results in higher variation, and hence more approximation in the dot-product calculations. For neural networks/machine learning applications and specifically for the case of on-line learning, such approximations can be taken care of by relying on the inherent error resiliency of the target applications.
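The variation analysis above can be mimicked with a minimal Monte Carlo sketch. It follows the same steps (area scaling of σ, 1000 trials per input voltage, a fit of standard deviation against mean current, and Gaussian replacement of the ideal output current), but under loudly stated assumptions: the square-law `cell_current` model, the nominal threshold `v_t0`, and the gain `k` are hypothetical stand-ins for the actual device model, all cells store the same weight, and a first-order least-squares fit replaces the polynomial fit, whose order is not stated here.

```python
import math
import random
import statistics


def sigma_vt(sigma_min, w_min, l_min, w, l):
    """Area scaling of the V_T sigma: sigma_min * sqrt(Wmin*Lmin / (W*L))."""
    return sigma_min * math.sqrt((w_min * l_min) / (w * l))


def cell_current(v_in, v_t, k=1e-4):
    """Hypothetical square-law read current (not the paper's device model)."""
    overdrive = max(v_in - v_t, 0.0)
    return k * overdrive ** 2


def summed_current_stats(v_in, n_rows, sigma, rng, v_t0=0.3, n_trials=1000):
    """Monte Carlo mean and standard deviation of the current summed over n_rows."""
    sums = []
    for _ in range(n_trials):
        total = sum(cell_current(v_in, rng.gauss(v_t0, sigma)) for _ in range(n_rows))
        sums.append(total)
    return statistics.fmean(sums), statistics.stdev(sums)


def fit_line(xs, ys):
    """Least-squares first-order fit y = a*x + b (stand-in for the polynomial fit)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def inject_variation(i_out, a, b, rng):
    """Replace an ideal output current by a Gaussian sample with the fitted sigma."""
    return rng.gauss(i_out, max(a * i_out + b, 0.0))


rng = random.Random(0)
sigma = sigma_vt(0.030, 1.0, 1.0, 1.0, 1.0)        # minimum-sized device: 30 mV
voltages = [0.35 + 0.025 * n for n in range(14)]   # 0.35 V to 0.675 V
stats16 = [summed_current_stats(v, 16, sigma, rng) for v in voltages]
a, b = fit_line([m for m, _ in stats16], [s for _, s in stats16])

# Relative spread for a single row versus the sum over 16 rows at one input voltage.
mean1, std1 = summed_current_stats(0.5, 1, sigma, rng)
mean16, std16 = summed_current_stats(0.5, 16, sigma, rng)
rel1, rel16 = std1 / mean1, std16 / mean16
noisy = inject_variation(mean16, a, b, rng)
print(rel1, rel16)
```

In this toy model, the relative spread of the summed current over 16 rows is smaller than that of a single row, which illustrates the averaging argument made above. The absolute numbers depend entirely on the assumed device model and carry no quantitative meaning for the actual array.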
V. DISCUSSION
We would like to emphasize that the present proposal aims at providing a means to enable in situ dot-product computations in standard 8T SRAM cells by exploiting the isolated read port. We believe that a wide range of applications can be accelerated using the present proposal. As such, the presented DPE should not be seen only in the context of machine learning and neural network applications. In general, any application that can benefit from approximate vector addition and multiplication is a possible use case for the presented proposal. This wide spectrum of possible use cases implies that the exact details of the required peripheral circuits and their complexity would depend heavily on the target application. For example, error-resilient applications like neural networks can rely on low-cost peripherals, whereas more traditional dot-product computations, as in image processing, might require sophisticated circuitry. Moreover, one could think of a hybrid significance-driven peripheral design such that the less significant computations are associated with low-overhead peripherals, while more significant operations are enabled by high-accuracy circuits or a full digital computation without resorting to dot-product acceleration. The target application would also dictate the constraints on the OPAMP specification and the required precision of the resistance Rf shown in Fig. 8. In addition, the choice of Config. A versus Config. B would also depend on the target application. For example, Config. A shows better linearity as opposed to Config. B. However, the input voltages in Config. A drive a resistive load requiring complex driving circuits as opposed to Config. B, which has a capacitive load. The authors would also like to point out that a detailed analysis of the appropriate peripherals and the associated architecture for each individual use case requires a case-by-case analysis and is not the focus of this paper.
This paper is more of a generic proposal and a study of the effect of intrinsic nonidealities, for example, the nonlinearity, the line resistances, and the transistor threshold voltage variations with respect to the present DPE.
Advantageously, the presented proposal resembles dot-product computations in memristive crossbar arrays [9] . Memristive dot products have been extensively studied and techniques like segregation of positive and negative weights in different subarrays, bit-slicing, and efficient sensing schemes for computations have been widely proposed [32] - [34] . Such memristive techniques can also be used in conjunction with the presented proposal. We would, however, like to highlight the fact that the present proposal should not be considered as a replacement for memristive crossbar arrays. Rather, we feel that memristors are more suitable for spatial architectures [33] , [34] , whereas the presented proposal can be used to augment on-chip storage (cache or GPU register files) with dot-product acceleration.
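As an illustration of how two of the memristive techniques mentioned above, namely segregation of positive and negative weights and bit-slicing, could carry over to arrays of 4-bit words, consider the following minimal functional sketch. It operates on plain integers and ignores all analog effects. The function names, the 8-bit weight magnitude, and the choice of two 4-bit slices are assumptions made for illustration and are not taken from [32]-[34].

```python
def dot(weights, inputs):
    """Ideal dot product of two equally long sequences."""
    return sum(w * x for w, x in zip(weights, inputs))


def split_signed(weights):
    """Segregate signed weights into non-negative positive and negative parts."""
    pos = [w if w > 0 else 0 for w in weights]
    neg = [-w if w < 0 else 0 for w in weights]
    return pos, neg


def bit_slices(weights, bits_per_slice=4, n_slices=2):
    """Split non-negative weights into slices, least significant slice first."""
    mask = (1 << bits_per_slice) - 1
    return [[(w >> (s * bits_per_slice)) & mask for w in weights]
            for s in range(n_slices)]


def sliced_signed_dot(weights, inputs, bits_per_slice=4, n_slices=2):
    """Recombine per-slice dot products, as separate subarrays would provide them."""
    pos, neg = split_signed(weights)
    total = 0
    for sign, part in ((1, pos), (-1, neg)):
        for s, sl in enumerate(bit_slices(part, bits_per_slice, n_slices)):
            # Each slice holds 4-bit words; its partial sum is weighted by 2^(4*s).
            total += sign * dot(sl, inputs) * (1 << (s * bits_per_slice))
    return total


weights = [37, -120, 5, -1, 255, 0, -200, 16]   # hypothetical signed 8-bit magnitudes
inputs = [3, 1, 4, 1, 5, 9, 2, 6]               # hypothetical input codes
result = sliced_signed_dot(weights, inputs)
print(result == dot(weights, inputs))
```

Each inner call to `dot` corresponds to one subarray computation on 4-bit words. The shift-and-add recombination and the subtraction of the negative part would be handled outside the array.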
Fig. 13. Average energy comparison between conventional digital sequential implementation and the proposed DPE. The energy is reported for 16 × 16 dot-product computations, wherein 16 rows are simultaneously activated and each row consists of 16 4-bit words.
We now present the estimates of energy consumption for performing a 16 × 16 dot-product operation with and without the proposed DPE. It is worth mentioning that the overall DPE consists of digital-to-analog converters (DACs) to generate the analog inputs fed to the 8T-SRAM crossbar array, along with analog-to-digital converters (ADCs) to detect the analog outputs and convert them back to digital bits. A cache memory of size 256 kB with a basic subarray size of 64 × 128 bits was modeled using the CACTI [35] simulator. The energy consumption and latency of the peripheral circuitry (ADCs and DACs) were appropriately incorporated in the CACTI model, referring to [36]. We assume a 16 × 16 crossbar operation (i.e., activating 16 rows at a time with each row containing 16 4-bit words) at any given time, thus requiring 16 ADCs in the peripheral circuitry per subarray. The conversion time for the ADC operation was assumed to be 10 ns, and the energy estimates for the ADCs were adopted from [36]. This framework was used to evaluate the total energy consumption and latency of the proposal for a test vector of a 16 × 16 dot product, compared to the pure digital approach, wherein the dot product was computed by sequential memory access, and the multiply and add operations were performed in dedicated adders and multipliers synthesized separately. Fig. 13 shows the energy for performing a 16 × 16 dot product with the proposed DPE and the conventional digital approach.
The energy overhead of the digital approach stems from the fact that row-by-row access to the data from memory, followed by multiply-accumulate operations, is performed sequentially to compute the same 16 × 16 matrix-vector dot product, which the proposed DPE can do in a single instruction. Also, it was noted that the total energy consumption of the DPE had a dominant contribution from the peripheral circuitry. Nevertheless, in general, the energy and latency overheads associated with DACs and ADCs in similar DPEs based on memristors have been extensively studied and can be found in works such as [33] and [34].
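The structural difference between the two approaches can be made concrete with a minimal functional sketch that computes the same 16 × 16 product both ways and counts the operations involved. It is an idealized integer model with hypothetical function names and test data. It assigns no energy values, since those come from the CACTI-based framework described above and are not reproduced here.

```python
def digital_sequential(matrix, vector):
    """Baseline: read one row at a time, then multiply-accumulate each word."""
    n_cols = len(matrix[0])
    acc = [0] * n_cols
    row_reads = 0
    macs = 0
    for row, x in zip(matrix, vector):
        row_reads += 1                  # one memory access per stored row
        for c, w in enumerate(row):
            acc[c] += w * x             # one multiply-accumulate per word
            macs += 1
    return acc, {"row_reads": row_reads, "macs": macs}


def dpe_single_step(matrix, vector):
    """Idealized DPE: all rows active at once, each column sums its contributions."""
    n_cols = len(matrix[0])
    out = [sum(row[c] * x for row, x in zip(matrix, vector)) for c in range(n_cols)]
    counts = {
        "array_activations": 1,
        "dac_conversions": len(vector),  # one analog input per activated row
        "adc_conversions": n_cols,       # one ADC per 4-bit word column
    }
    return out, counts


# Hypothetical test data: 16 rows of 16 4-bit words and 16 4-bit input codes.
matrix = [[(3 * r + 5 * c) % 16 for c in range(16)] for r in range(16)]
vector = [(7 * r) % 16 for r in range(16)]

digital_out, digital_counts = digital_sequential(matrix, vector)
dpe_out, dpe_counts = dpe_single_step(matrix, vector)
print(digital_counts, dpe_counts)
```

Both functions return the same 16 outputs. The sequential baseline requires 16 row reads and 256 multiply-accumulate operations, whereas the idealized DPE requires a single array activation plus the DAC and ADC conversions, consistent with the observation that the peripherals dominate the DPE energy.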
VI. CONCLUSION
In the quest for novel in-memory techniques for beyond von Neumann computing, we have presented the 8T-SRAM as a vector-matrix dot-product compute engine. Specifically, we have shown two different configurations of the 8T SRAM cell for enabling analoglike multibit dot-product computations. We have also highlighted the tradeoffs presented by each of the proposed configurations. The usual 8T SRAM bit cell circuit remains unaltered, and as such, the 8T cell can still be used for normal digital memory read and write operations. The proposed scheme can be used either as a dedicated dot-product compute engine or as an on-demand compute accelerator. This paper extends the applicability of 8T cells as a compute accelerator, given that dot products find wide applicability in multiple data-intensive applications and algorithms, including efficient hardware implementations for machine learning and artificial intelligence.
