Emerging mobile devices in this era of the internet-of-things (IoT) require a dedicated processor to enable computationally intensive applications such as neuromorphic computing and signal processing. Vector-by-matrix multiplication (VMM) is the most prominent operation in these applications. Therefore, compact and power-efficient VMM blocks are required to perform resource-intensive computations. To this end, in this work, for the first time, we propose a time-domain mixed-signal VMM exploiting a modified configuration of the 1 MOSFET-1 RRAM (1T-1R) array which overcomes the energy inefficiency of the current-mode VMM approaches based on RRAMs. In the proposed approach, the inputs and outputs are encoded in the digital domain as pulse durations while the weights are realized as programmable current sinks utilising the modified 1T-1R blocks in the analog domain. We perform a rigorous analysis of the different factors, such as channel length modulation (CLM), drain-induced barrier lowering (DIBL), capacitive coupling, etc., which may degrade the compute precision of the proposed VMM approach. We show that there exists a trade-off between the compute precision, dynamic range and the energy efficiency in the modified 1T-1R array based VMM approach. Therefore, we also provide the necessary design guidelines for optimising the performance of this implementation. The preliminary results show that an effective compute precision > 8-bits is achievable owing to the inherent compensation effect with an energy efficiency of ~323.5 TOps/J and a throughput of 1.25 TOps/s considering the input/output (I/O) circuitry for a 200×200 VMM utilising the proposed approach.
I. INTRODUCTION
The widespread use of computationally intensive applications such as deep neural networks (DNNs)/recurrent neural networks (RNNs), real-time signal processing and optimization algorithms in this era of the internet-of-things (IoT) necessitates the development of dedicated processing blocks within mobile devices, since traditional digital processors are extremely energy inefficient while handling high-dimensional data from operations such as object/speech recognition, image processing, probabilistic inference, etc. [1]-[2]. The vector-by-matrix multiplication (VMM) forms the most integral part (and often the bottleneck) of these computationally intensive systems.
Therefore, the development of a compact, highly precise and energy-efficient VMM engine is essential [3]-[16].
The authors are with the California NanoSystems Institute (CNSI) and also with the Department of Electrical and Computer Engineering, University of California, Santa Barbara, California, 93106, U.S.A.
Analog-domain VMM implementations are more compact and energy-efficient than their digital counterparts for computational tasks such as inference, classification, recognition, etc., which are robust to low-resolution (reduced-precision) VMM operations and can be trained effectively to handle hardware imperfections without compromising accuracy [3], [6]-[10]. Recently, VMMs based on emerging non-volatile memories, RRAMs in particular, have attracted considerable attention since the VMM operation is simplified to current accumulation through programmable resistances in the analog domain [5]-[6], [10]. However, the current-mode VMM implementations based on RRAM require high current levels [6], [16] and bulky transimpedance amplifiers at each column of the crossbar [6], degrading their energy and area efficiency. Moreover, the compute precision is also limited and may be improved only at the cost of an increased area for complex peripheral circuitry to implement sophisticated tuning algorithms or complex mapping techniques [6].
Recently, time-domain VMMs [4], [9]-[15] exploiting flash memory [15], post-synaptic pulse (PSP) emulators [11], and SRAM (binary) output [13] as programmable weights have been proposed. Moreover, the energy efficiency of the RRAM based VMM approaches could be significantly improved if a time-domain switched-capacitor based approach [8] is followed as opposed to the power-hungry current-mode approach. To this end, in this work, for the first time, we propose a time-domain mixed-signal VMM exploiting a modified 1MOSFET-1RRAM (1T-1R) array. In the proposed VMM approach, the weights are realized as programmable current sinks via tuning the RRAM conductance state in the modified 1T-1R blocks in the analog domain while the inputs and outputs are encoded as pulse durations in the digital domain. Contrary to the conventional 1T-1R blocks, where the RRAM is connected to the drain of the MOSFET, the RRAM is attached to the source in this approach, which leads to a self-compensation effect and significantly improves the compute precision. A rigorous analysis of the different non-ideal factors affecting the compute precision of the proposed VMM, such as channel length modulation (CLM), drain-induced barrier lowering (DIBL), capacitive coupling, etc., was performed. It was found that there exists a trade-off between the compute precision, the dynamic range and the energy dissipation in this implementation. Therefore, we also provide the necessary design guidelines for optimizing the performance of the proposed architecture. The preliminary results show that an effective precision > 8-bits may be obtained utilizing this approach with an energy efficiency of ~323.5 TOps/J for a 200×200 VMM. The paper is organized as follows: the proposed VMM approach is discussed in section II. The load-line characteristics of the modified 1T-1R block and the different factors which may affect the performance of the proposed approach are discussed in section III. The design guidelines for optimizing the performance of the proposed 1T-1R VMM are discussed in section IV and the area, energy and throughput estimates are discussed in section V. The conclusions are drawn in section VI.
(Paper title: Time-Domain Mixed-Signal Vector-by-Matrix Multiplier Exploiting 1T-1R Array; e-mail: shubhamsahay@ucsb.edu)
II. PROPOSED VMM APPROACH
A generalized ($N \times N$) VMM operation may be represented as:
$$y_i = \frac{1}{N}\sum_{j=1}^{N} w_{ij}\, x_j \qquad \text{(i)}$$
where the inputs $x_j$, outputs $y_i$ and weights $w_{ij}$ are normalized such that $(x_j, y_i, w_{ij}) \in [0,1]$. The proposed time-domain VMM approach exploiting the modified 1T-1R array is shown in Fig. 1. In the time-domain VMM [9]-[15], the inputs and outputs are encoded as the durations of digital pulses such that:
$$t_{j,x} = x_j T \qquad \text{(ii)}$$
$$t_{i,y} = y_i T \qquad \text{(iii)}$$
where $T$ is the time window for the VMM operation. In the proposed approach, the modified 1T-1R block acts as a programmable current sink and the digital inputs applied to the gates of the MOSFETs ($V_{j,x}$) enable the $j^{th}$ current sink for a duration $t_{j,x}$. It may be noted that unlike conventional 1T-1R arrays where the RRAMs are connected to the drain of the MOSFETs, in this approach, the RRAMs are connected to the source of the MOSFETs to mitigate non-idealities such as channel length modulation (CLM) and drain-induced barrier lowering (DIBL) owing to the compensation effect discussed in section III.C. The weights ($w_{ij} \in [0, w_{max}]$) are mapped to the currents ($I_{ij} \in [I_{min}, I_{max}]$) through the programmable current sinks.
Each column of the programmable current sinks is connected to a load capacitor $C$. A threshold (neuron) circuit proposed in [14], with a transfer function given as:
$$V_{i,y}(t) = H\big(V_{th} - V_C(t)\big) \qquad \text{(v)}$$
where $H(\cdot)$ is the Heaviside function, encodes the voltage $V_C$ on the load capacitor into the duration of the output digital pulse.
The entire VMM operation is completed in two cycles (phase-I and phase-II) of duration $T$ each. The load capacitor is initially pre-charged to a voltage $V_{reset}$ at the beginning of phase-I ($t = 0$). The inputs are activated only in phase-I (integration phase) and the current sinks start discharging $C$. At the end of phase-I ($t = T$), the voltage across $C$ can be given as:
$$V_C(T) = V_{reset} - \frac{1}{C}\sum_{j=1}^{N} I_{ij}\, t_{j,x} \qquad \text{(vi)}$$
In the proposed scheme, the weighted sum is mapped to the voltage across the load capacitor at the end of phase-I, i.e. $V_C(T) \in [V_{th}, V_{reset}]$. To ensure this condition, the load capacitor must be designed such that:
$$C = \frac{N I_{max} T}{V_{reset} - V_{th}} \qquad \text{(vii)}$$
In phase-II (evaluation phase), all the inputs are deactivated and the load capacitor is discharged through a constant current $N I_{max}$. This discharging current may be generated either via a current mirror or by adding a similar 1T-1R array at the load capacitor with all the inputs activated for the entire duration $T$ during phase-II and the current sinks programmed to $I_{max}$. In this work, we have followed the latter approach to implement the constant current source during phase-II. The neuron circuit generates an output pulse when the voltage on the load capacitor reaches the threshold voltage, i.e. $V_C(t) = V_{th}$. The time instance ($t_{i,th}$) at which $V_C(t) = V_{th}$ can be given as:
$$t_{i,th} = T + \frac{C\,\big(V_C(T) - V_{th}\big)}{N I_{max}} = 2T - \frac{1}{N I_{max}}\sum_{j=1}^{N} I_{ij}\, t_{j,x} \qquad \text{(viii)}$$
From equation (viii), the output pulse duration ($t_{i,y}$), measured from $t_{i,th}$ to the end of phase-II at $t = 2T$, can be simply obtained as:
$$t_{i,y} = 2T - t_{i,th} = \frac{1}{N I_{max}}\sum_{j=1}^{N} I_{ij}\, t_{j,x} \qquad \text{(ix)}$$
In the subsequent section, we shall discuss the non-idealities and provide the design guidelines for optimizing the performance of the proposed 1T-1R time-domain VMM.
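The two-phase operation described in this section can be checked numerically with a minimal behavioural sketch. It assumes ideal (voltage-independent) current sinks, a linear weight-to-current map $I_{ij} = w_{ij} I_{max}$, a constant phase-II discharge current of $N I_{max}$, and an output pulse that lasts from the threshold crossing to the end of phase-II. The function name `time_domain_vmm` and the value of `i_max` are hypothetical and chosen only for illustration; under these assumptions the normalized pulse duration should reproduce $\frac{1}{N}\sum_j w_{ij} x_j$.

```python
import random


def time_domain_vmm(weights, x, t_window, i_max=1e-6, v_reset=0.9, v_th=0.7):
    """Ideal two-phase time-domain VMM; returns one output pulse duration per row.

    weights: list of rows, each holding len(x) normalized weights in [0, 1]
    x:       list of normalized inputs in [0, 1]
    i_max:   hypothetical maximum sink current (not specified in the text)
    """
    n = len(x)
    # Load capacitor sized so that the worst-case discharge just reaches V_th.
    c_load = n * i_max * t_window / (v_reset - v_th)
    durations = []
    for row in weights:
        # Phase I: sink j removes a charge I_ij * t_j, with t_j = x_j * T.
        charge = sum(w * i_max * xj * t_window for w, xj in zip(row, x))
        v_end = v_reset - charge / c_load
        # Phase II: constant discharge at N * I_max until V_th is crossed.
        t_cross = t_window + c_load * (v_end - v_th) / (n * i_max)
        # The output pulse lasts from the crossing to the end of phase II (2T).
        durations.append(2 * t_window - t_cross)
    return durations


random.seed(0)
N, T = 8, 32e-9
W = [[random.random() for _ in range(N)] for _ in range(4)]
X = [random.random() for _ in range(N)]

pulses = time_domain_vmm(W, X, T)
reference = [sum(w * xj for w, xj in zip(row, X)) / N for row in W]
max_err = max(abs(p / T - r) for p, r in zip(pulses, reference))
```

Here `max_err` only reflects floating-point rounding, since the sketch contains none of the non-idealities analysed in the following section.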
III. 1T-1R VMM DESIGN GUIDELINES
The performance of the proposed 1T-1R VMM was evaluated using the 55-nm CMOS technology in HSPICE (version N-2017.12 [17]). The minimum-sized transistor from the 55-nm technology node with a length of 60 nm and a width of 120 nm was used. Furthermore, a rather simplistic compact model was used for the RRAM with the current-voltage relationship expressed as $I = G_0 \sinh(\beta V)$, where $G_0$ is the conductance in the initial state and $\beta$ is the non-linearity factor [18]. An ON-state conductance of 0.1 mS and an OFF-state conductance of 0.1 μS were considered for the RRAM, similar to [6]. Under these assumptions, we evaluated the potential of the proposed 1T-1R time-domain VMM under different operating conditions and different parameters for the RRAM. In the subsequent sections, we discuss the operating conditions and provide the necessary design guidelines to extract the optimum performance from the proposed architecture. It may be noted that the optimal conditions also differ with the input constraints such as VMM size, input voltage, time window, dynamic range ($DR = I_{max}/I_{min}$), targeted precision, etc.
A. Load-line characteristics
The load-line characteristics of the modified 1T-1R block (with the RRAM attached to the source of the MOSFET) are shown in Fig. 2 for different non-linearity factors (β). The reset voltage was chosen as 0.9 V to reduce the error induced by non-idealities such as CLM and DIBL (discussed in section III.C). From Fig. 2, we observe that increasing the non-linearity factor of the RRAM results in a reduction of the operating range of the drain voltage for low gate voltages (< 0.6 V). A reduced operating range of the drain voltage leads to a lower dynamic range of the current values which may be obtained from the modified 1T-1R block via tuning the conductance state of the RRAM, as shown in Table I. Also, the ON-state to OFF-state conductance ratio of the RRAM should be high to obtain an appreciable dynamic range.
Moreover, the operating range of drain voltages is also degraded when an RRAM with a lower ON-state conductance or a higher OFF-state conductance is used, as shown in Fig. 2. Furthermore, unlike the current-mode VMM approach based on RRAM, where the accumulated current depends exclusively on the conductance state of the RRAMs, the current from the modified 1T-1R block depends both on the conductance state of the RRAM and on the channel conductance of the MOSFET (which depends on the input voltage). Therefore, even if the ON-state conductance of the RRAM increases tenfold, as shown in Fig. 2, the drain current increases only slightly (< 2 times) and does not degrade the energy efficiency of the proposed VMM approach considerably, as opposed to the current-mode VMM where the accumulated current would increase by a decade and limit the energy efficiency. However, the dynamic range also increases for lower ON-state resistances of the RRAM in the 1T-1R configuration.
B. Precision
The effective weight precision (i.e. the programmability of the current sinks) depends on the accuracy of tuning the conductance states of the RRAM and degrades due to the drift in the analog conductance state with cycling and temperature and due to the intrinsic noise, such as random telegraph noise (RTN), exhibited by the RRAM. Previous works have already shown an effective weight precision greater than 7-bits based on a simple tuning algorithm [19]. The weight precision may be further improved by oxide material engineering or by utilizing more efficient tuning algorithms.
As discussed in [15], the compute error (or output error, $E_{out}$) may be decoupled from the weight error and defined separately as the maximum difference between the theoretically calculated output time period ($t_{ideal}$), obtained considering ideal current sinks, and the simulated output time period ($t_{sim}$), i.e.
$$E_{out} = \max \left| t_{ideal} - t_{sim} \right|$$
The compute precision is then derived from this worst-case output error.
Considering the efficacy of the differential scheme in improving the noise immunity and enhancing the output precision while enabling the inclusion of bipolar weights [8], two adjacent columns of the 1T-1R array were tuned to implement the positive and negative weight components of the bipolar weight matrix. Furthermore, the adjacent neuron circuits were used to calculate the positive and negative components of the output in this differential implementation. The final output was obtained as the time difference between the rising edges of the neuron circuits used for obtaining the positive ($V_{i,y}$) and negative ($V_{(i+1),y}$) components of the output. This rectified linear (ReLU) operation may be implemented utilizing a digital gate for $V_{out} = V_{i,y} \cdot \overline{V_{(i+1),y}}$.
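The differential scheme can be illustrated with a short behavioural sketch. It assumes ideal current sinks, that each neuron fires at $2T - T\cdot\frac{1}{N}\sum_j w_{ij} x_j$ (the ideal two-phase timing), and that the output gate is high only between the rising edge of the positive column and that of the negative column. The helper names `split_bipolar`, `rising_edge` and `differential_relu` are hypothetical.

```python
def split_bipolar(weights):
    """Split a bipolar weight matrix into non-negative positive/negative parts."""
    pos = [[max(w, 0.0) for w in row] for row in weights]
    neg = [[max(-w, 0.0) for w in row] for row in weights]
    return pos, neg


def rising_edge(row, x, t_window):
    """Ideal instant at which a column's neuron fires during phase II (assumed timing)."""
    weighted_sum = sum(w * xj for w, xj in zip(row, x)) / len(x)
    return 2 * t_window - weighted_sum * t_window


def differential_relu(weights, x, t_window):
    """Output pulse widths of the AND-NOT gate for each pair of adjacent columns."""
    pos, neg = split_bipolar(weights)
    widths = []
    for pos_row, neg_row in zip(pos, neg):
        t_pos = rising_edge(pos_row, x, t_window)
        t_neg = rising_edge(neg_row, x, t_window)
        # The gate is high only after the positive neuron has fired and before
        # the negative one does; a negative net sum therefore yields no pulse.
        widths.append(max(t_neg - t_pos, 0.0))
    return widths


# Two outputs, two inputs, with the time window normalized to 1.
out = differential_relu([[0.5, -0.25], [-1.0, 0.5]], [1.0, 1.0], 1.0)
```

The first row has a net weighted sum of 0.125 and produces a pulse of that normalized width, while the second row has a negative net sum and is clipped to zero, which is the ReLU behaviour described above.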
C. Non-ideal factors
The compute precision is degraded by several factors which tend to prevent the current sink from yielding a constant current. While CLM leads to a linear dependence of the MOSFET's drain current on the drain voltage and restricts its action as a constant current sink, the DIBL effect induces a threshold voltage shift which further increases the variation in the drain current with the drain voltage. Therefore, the current through the programmable current sink depends on the output voltage at the load capacitor. To minimize the dependency of the current sink on the output voltage, we modified the conventional 1T-1R array architecture. While the RRAM is connected to the drain terminal of the MOSFET in the conventional 1T-1R array, one terminal of the RRAM is connected to the source of the MOSFET and the other terminal is grounded in this implementation, as shown in Fig. 1. When the drain voltage increases in the modified 1T-1R configuration with the RRAM connected to the source, an enhanced current flows through the RRAM, leading to a larger voltage drop across it. The increased voltage drop across the RRAM effectively boosts the source potential, leading to a reduction in the effective gate-to-source voltage (VGS). This leads to a significant reduction in the drain current. Therefore, the increment in the drain current due to the application of a larger drain voltage is compensated by a reduction in the effective gate overdrive voltage in the modified 1T-1R array. This inherent compensation effect leads to a diminished dependency of the current through the programmable current sink on the output voltage at the load capacitor. The error due to CLM and DIBL is evaluated from the change in the sink current produced by a small increment $\Delta V_D$ in the drain voltage,
where $\Delta V_D$ is chosen as 1 mV to estimate the local error contours accurately. We performed a rigorous analysis of the CLM and DIBL error for different gate (input) voltages within the operating regime of the modified 1T-1R configuration. The error contour plots for different input voltages and non-linearity factors of the RRAM are shown in Fig. 3. For all the input voltages, we found that the programmable current sink is relatively independent of the drain voltage for high drain voltages. Therefore, we selected a high reset voltage, $V_{reset}$ = 0.9 V, and designed the neuron circuit to have a threshold voltage $V_{th}$ = 0.7 V to ensure a non-disturbing maximum voltage swing of 0.2 V across the RRAM [6]. Furthermore, we also observe that the DIBL/CLM error increases slightly as we reduce the input voltage and operate with a smaller maximum current ($I_{max}$) to limit the load capacitance (see equation (vii)), as shown in Fig. 3(c).
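The compensation effect can be reproduced qualitatively with a toy model. The sketch below is not the HSPICE setup used in this work: it assumes a square-law saturation model with a CLM factor `lam` and a DIBL coefficient `eta` (all values hypothetical), replaces the sinh-type RRAM model by a linear resistor of 10 kΩ (the 0.1 mS ON-state conductance), and takes the error metric to be the relative current change for a 1 mV drain-voltage step, which is an assumed form.

```python
def mosfet_current(v_gs, v_ds, k=2e-4, v_t0=0.3, lam=0.2, eta=0.05):
    """Toy saturation-region drain current with CLM and DIBL (hypothetical values)."""
    v_t = v_t0 - eta * v_ds              # DIBL lowers the threshold as V_DS rises
    overdrive = v_gs - v_t
    if overdrive <= 0.0:
        return 0.0
    return k * overdrive ** 2 * (1.0 + lam * v_ds)   # CLM term


def sink_current(v_g, v_d, r_source):
    """Solve I = f(V_G - I*R, V_D - I*R) by bisection; R = 0 is the bare MOSFET."""
    lo, hi = 0.0, mosfet_current(v_g, v_d)   # the bare current is an upper bound
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        # The drop I*R across the source resistor reduces both V_GS and V_DS.
        if mosfet_current(v_g - mid * r_source, v_d - mid * r_source) > mid:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def relative_change(v_g, v_d, r_source, dv=1e-3):
    """Relative change of the sink current for a drain-voltage step dv."""
    base = sink_current(v_g, v_d, r_source)
    return (sink_current(v_g, v_d + dv, r_source) - base) / base


bare = relative_change(0.6, 0.8, 0.0)          # no source resistor
degenerated = relative_change(0.6, 0.8, 1e4)   # resistor at the source
```

With these illustrative parameters, `degenerated` comes out smaller than `bare`: the source resistor feeds part of any current increase back as a reduction in gate overdrive, which is the mechanism described above.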
Apart from the error induced by CLM and DIBL, the capacitive coupling between the load capacitor and the gate-drain capacitance of the MOSFET could be another possible source of charge disturbance. However, in the proposed architecture, the load capacitor is large compared to the gate-drain capacitance of the minimum-sized MOSFETs owing to the higher maximum current $I_{max}$. This diminishes the charge disturbance due to capacitive coupling.
IV. DESIGN SPACE EXPLORATION
We also performed a rigorous analysis to explore the design space for optimizing the performance of the proposed VMM architecture. The input voltage (VGS) and the time window ($T$) are the important design parameters for tuning the performance of the proposed VMM. The performance parameters of the VMM architecture for different input voltages (VGS), time windows ($T$), VMM sizes ($N$ in an $N \times N$ VMM) and non-linearity factors ($\beta$) of the RRAM are listed in Table I. The output (worst-case) error was found by simulating multiple runs of the VMM operation in HSPICE with a different combination of random inputs and random weights in each run. The line parasitics such as line resistances and capacitances and the corresponding process variations pertinent to the 55-nm technology node were also considered in the HSPICE simulations. The total energy dissipated in the load capacitor (which is the dominant energy dissipation mechanism, as discussed later in section V) for the VMM operation has also been included in Table I. The compute error is significantly low and further reduces with increasing VMM size as long as $N$ < 100. However, as the VMM size increases above 100, the line parasitics and their process variations lead to a non-negligible increase in the compute error. While the line resistances lead to a drop in the effective input (gate) voltage of the MOSFETs at the far end of the 1T-1R array, leading to a reduced drain current, the line capacitances add to the latency. Although the differential configuration is effective in mitigating the impact of fixed line parasitics, the process variations cannot be compensated even by exploiting a differential configuration, and they escalate the compute error. From Table I, it can also be observed that there is a trade-off between the dynamic range, compute precision and the energy dissipated in the load capacitor.
For instance, to achieve a high compute precision of ~10-bits for large-sized VMMs ($N$ > 100) without incurring large energy dissipation, a low value of input voltage (VGS = 0.3 V) should be used. A low input voltage facilitates the VMM operation with a lower maximum current and hence, a smaller load capacitance. However, the dynamic range is also low for such operating conditions and the weight precision may limit the compute precision in such cases. Still, the preliminary results indicate that an effective compute precision of 12-bits is achievable for a time window of 32 ns for a VMM size $N$ > 100 using the proposed approach. In addition, depending on the targeted compute precision, input time window, VMM size, area, energy efficiency, voltage swing across the RRAM, etc., we may optimize the design parameters to achieve the optimum performance of the proposed VMM architecture.
Since the conductance state of the RRAM is highly sensitive to the voltage drop across it, we have also analyzed the performance of the proposed VMM approach for neuron circuits with different threshold voltages ($V_{th}$ > 0.5 V) to limit the maximum voltage swing across the RRAM ($V_{reset} - V_{th}$). As can be observed from Fig. 4, a reduction in the maximum voltage swing across the RRAM leads to a higher compute error. However, the compute precision is still high (> 10-bit) even when the voltage drop is reduced to 0.05 V. Although a reduction in the voltage drop across the RRAM increases the load capacitor size according to equation (vii), the energy dissipated in the load capacitor decreases owing to the lower voltage swing, as shown in Fig. 4.
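The dependence of the load capacitor on the voltage swing can be made concrete with a small worked example. It assumes that equation (vii) has the form $C = N I_{max} T / (V_{reset} - V_{th})$ and uses a hypothetical maximum sink current of 1 µA, since the actual value of $I_{max}$ is not quoted in the text; $N$ = 200, $T$ = 32 ns and $V_{reset}$ = 0.9 V follow the operating point discussed in this work.

```python
import math


def load_capacitance(n_rows, i_max, t_window, v_reset, v_th):
    """Load capacitor from the assumed form C = N * I_max * T / (V_reset - V_th)."""
    swing = v_reset - v_th
    if swing <= 0:
        raise ValueError("V_reset must exceed V_th")
    return n_rows * i_max * t_window / swing


I_MAX = 1e-6  # hypothetical maximum sink current in amperes

# Threshold voltages giving swings of 0.2 V, 0.1 V and 0.05 V at V_reset = 0.9 V.
caps = {
    v_th: load_capacitance(200, I_MAX, 32e-9, 0.9, v_th)
    for v_th in (0.7, 0.8, 0.85)
}
ratio = caps[0.85] / caps[0.7]  # growth in capacitor size for a 4x smaller swing
```

Under these assumptions the capacitor grows in inverse proportion to the swing, from about 32 pF at a 0.2 V swing to about 128 pF at 0.05 V, so limiting the voltage across the RRAM is paid for in capacitor area.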
Furthermore, the impact of variation in the ambient temperature on the compute error of the proposed VMM architecture is shown in Fig. 5. The temperature dependence of the RRAM was taken into account by changing the initial conductance values following [20]-[21]. As can be observed from Fig. 5, the compute error increases with a variation in the ambient temperature. However, a compute precision of 10-bits is achievable utilizing the proposed VMM approach for a VMM size $N$ = 200. Moreover, the variations in the input voltage may be compensated by re-tuning the conductance states of the RRAM to achieve the same drain current level.
V. PERFORMANCE ESTIMATION
From Table I, it can be observed that the proposed VMM approach yields a compute precision of 5-bits to 13-bits depending on the design parameters. Targeting a compute precision of 8-bits, which is sufficient for several applications including neuromorphic computing [8], [10], we select an input voltage of 0.3 V and a time window of 32 ns for estimating the energy and area efficiency of the proposed approach. Fig. 6 shows the area and energy efficiency breakdown of the proposed VMM approach taking into account the input/output (I/O) peripheral circuitry and the neuron circuit for different VMM sizes. The basic components of the VMM I/O circuitry are digital input to time-domain pulse converters (DTC), which consist of an 8-bit shared counter and an 8-bit digital comparator followed by an S-R latch for each input, and time-domain pulse to digital output converters (TDC), which consist of an 8-bit accumulator for each neuron output. The 8-bit accumulator is realized using an 8-bit full adder and an 8-bit register based on D flip-flops. A shared clock enables conversion of the pulse duration of the neuron output to digital outputs. The neuron circuit consists of an S-R latch realized using a pair of NAND gates followed by an AND gate and a NOT gate for implementing the differential scheme. The load capacitors are realized using MOSCAPs from the 55-nm technology node. It can be observed from Fig. 6 that the I/O circuitry consumes a significant portion of the energy and area landscape of the proposed VMM architecture when the VMM size is small. However, the load capacitor ($C$) tends to dominate the area and energy landscape as the VMM size increases. The preliminary results indicate an effective compute precision of 8-bits with an energy efficiency of ~323.5 TOps/J and a throughput of 1.25 TOps/s for a VMM size $N$ = 200 utilizing the proposed approach.
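The quoted throughput can be cross-checked with simple arithmetic. The sketch below assumes that each multiply-accumulate is counted as two operations and that one $N \times N$ VMM takes the two phases of duration $T$ each; both are assumptions, although they reproduce the 1.25 TOps/s figure for $N$ = 200 and $T$ = 32 ns. The energy per VMM is simply implied by the quoted efficiency rather than independently simulated. The function names are hypothetical.

```python
def vmm_operations(n):
    """Operations in one N x N VMM, counting a multiply and an add per weight."""
    return 2 * n * n


def vmm_throughput(n, t_window):
    """Operations per second when one VMM takes two phases of duration T."""
    return vmm_operations(n) / (2 * t_window)


N_SIZE = 200
T_WINDOW = 32e-9          # seconds
EFFICIENCY = 323.5e12     # operations per joule, as quoted in the text

throughput = vmm_throughput(N_SIZE, T_WINDOW)            # about 1.25e12 ops/s
energy_per_vmm = vmm_operations(N_SIZE) / EFFICIENCY     # joules per 200x200 VMM
energy_per_op = 1.0 / EFFICIENCY                         # joules per operation
```

Under these counting assumptions, the quoted efficiency corresponds to roughly 247 pJ per 200×200 VMM, or about 3.1 fJ per operation.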
It may be noted that the area and energy efficiency may be further improved since we are not targeting the compute precision of 10-bits which may be achieved utilizing these design parameters. Since applications such as inference, classification, recognition, etc. can be performed with high accuracy utilizing even low-precision (~4 bits) VMM operations [8], we also analyze the efficacy of the proposed approach for different target precisions lower than 8-bits, as shown in Fig. 7. A reduction in the targeted bit precision allows utilization of a smaller time window ($T$) to encode the inputs while operating at the same frequency. Therefore, the capacitor and I/O circuit area and energy consumption decrease significantly with a reduced target precision. This leads to a considerable improvement in the area and energy efficiency when targeting lower-precision VMM operations, as shown in Fig. 7. Moreover, we may utilize a lower conductance value for the ON-state of the RRAM to reduce $I_{max}$ and decrease the load capacitance for enhancing the energy and area efficiency while operating with a reduced precision (8-bits) as compared to the calculated compute precision (10-bits). Similarly, a lower reset voltage ($V_{reset}$) may further increase the energy and area efficiency while enabling a compute precision of 8-bits.
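The link between target precision and time window can be sketched as follows, assuming that the counter-based DTC spends one clock period per input code, so that a $p$-bit input needs $2^p$ clock periods. The clock period is inferred from the 8-bit, 32 ns operating point and is therefore an assumption, as is the premise that $I_{max}$ and the voltage swing are held fixed when the window shrinks. The function name `time_window` is hypothetical.

```python
def time_window(bits, t_clock):
    """Time window needed to encode a `bits`-wide input at one clock period per code."""
    return (2 ** bits) * t_clock


# Clock period inferred from the 8-bit, 32 ns operating point (an assumption).
T_CLOCK = 32e-9 / 2 ** 8

windows = {bits: time_window(bits, T_CLOCK) for bits in (8, 6, 4)}

# With I_max and the voltage swing held fixed, the load capacitor scales
# linearly with the time window, so it shrinks by the same factor.
cap_scale = {bits: windows[bits] / windows[8] for bits in windows}
```

Under these assumptions, moving from 8-bit to 4-bit inputs shortens the time window from 32 ns to 2 ns and scales the load capacitor by 1/16, which is consistent with the trend described above.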
Furthermore, the intrinsic thermal noise of the MOSFET and the random telegraph noise (RTN) in the RRAM may also affect the compute precision. Therefore, analysis of the proposed VMM approach under noise is an important future work.
VI. CONCLUSION
An energy-efficient time-domain VMM exploiting a modified configuration of the 1T-1R array has been proposed in this work. The different mechanisms, such as CLM, DIBL, capacitive coupling, etc., which may degrade the performance and precision of the proposed architecture are discussed in detail. Furthermore, we show that there exists a trade-off between the compute precision, dynamic range and the energy efficiency of the proposed VMM approach. Therefore, we also provide the necessary design guidelines to further optimize the performance of the 1T-1R VMM. The preliminary results indicate an effective compute precision of 8-bits with an energy efficiency of ~323.5 TOps/J and a throughput of 1.25 TOps/s for a VMM size $N$ = 200 using the proposed approach, which is one of the highest precisions achieved to date with comparable energy efficiency, as shown in Table II. Our results may provide an incentive for experimental realization of the VMM approach based on the 1T-1R array.
