Energy-Efficient Moderate Precision Time-Domain Mixed-signal
  Vector-by-Matrix Multiplier Exploiting 1T-1R Arrays by Sahay, Shubham et al.
 
Abstract— The emerging mobile devices in this era of internet-
of-things (IoT) require a dedicated processor to enable 
computationally intensive applications such as neuromorphic 
computing and signal processing. Vector-by-matrix multiplication 
(VMM) is the most prominent operation in these applications. 
Therefore, compact and power-efficient VMM blocks are required 
to perform resource-intensive computations. To this end, in this 
work, for the first time, we propose a time-domain mixed-signal 
VMM exploiting a modified configuration of 1 MOSFET-1 RRAM 
(1T-1R) array which overcomes the energy inefficiency of the 
current-mode VMM approaches based on RRAMs. In the 
proposed approach, the inputs and outputs are encoded in digital 
domain as duration of the pulses while the weights are realized as 
programmable current sinks utilising the modified 1T-1R blocks 
in the analog domain. We perform a rigorous analysis of the 
different factors which may degrade the compute precision of the 
proposed approach. We show that there exists a trade-off between 
the compute precision, dynamic range and the area and energy 
efficiency of the proposed VMM implementation. Therefore, we 
also provide the necessary design guidelines for optimising the 
performance. The preliminary results show that an effective 
compute precision of 6 bits is achievable owing to the inherent 
compensation effect. A 4-bit 200×200 VMM utilising the proposed 
approach exhibits a high energy efficiency of ~1.5 
POps/J and a throughput of 2.5 TOps/s, including the 
contribution of the input/output (I/O) circuitry. 
 
Index Terms— Vector-by-matrix multiplication, 1T-1R 
array, time-domain encoding, mixed-signal VMM.  
I. INTRODUCTION 
The widespread use of computationally intensive applications 
such as deep neural networks (DNNs), recurrent neural networks 
(RNNs), real-time signal processing and optimization 
algorithms in this era of internet-of-things (IoT) necessitates the 
development of dedicated processing blocks within mobile 
devices, since traditional digital processors are extremely 
energy inefficient when handling high-dimensional data in 
operations such as object/speech recognition, image processing 
and probabilistic inference [1]-[2]. Vector-by-matrix 
multiplication (VMM) is the most integral part (and often the 
bottleneck) of these computationally intensive systems. 
Therefore, the development of a compact, precise and 
energy-efficient VMM engine is essential [3]-[16]. 
                                                          
The authors are with the California Nano Systems Institute (CNSI) and also 
with the Department of Electrical and Computer Engineering, University of 
California, Santa Barbara, California, 93106, U.S.A. 
Analog-domain VMM implementations are more 
compact and energy-efficient than their digital 
counterparts for computational tasks such as inference, 
classification and recognition, which are robust to 
low-resolution (reduced-precision) VMM operations and can be 
trained effectively to handle hardware imperfections without 
compromising accuracy [3], [6]-[10]. Recently, VMMs 
based on emerging non-volatile memories, RRAMs in 
particular, have attracted considerable attention since the VMM 
operation reduces to current accumulation through 
programmable resistances in the analog domain [5]-[6], [10]. 
However, current-mode VMM implementations based on 
RRAM require high current levels [6], [16] and bulky 
transimpedance amplifiers at each column of the crossbar [6], 
which degrades their energy and area efficiency. Moreover, the 
compute precision is limited and may be improved only at 
the cost of increased area for complex peripheral circuitry that 
implements sophisticated tuning algorithms or complex 
mapping techniques [6]. 
Recently, time-domain VMMs [4], [9]-[15] that use 
flash memory [15], post-synaptic pulse (PSP) emulators [11] 
and SRAM (binary) output [13] as programmable weights have 
been proposed. Moreover, the energy efficiency of 
RRAM-based VMM approaches could be significantly improved if a 
time-domain switched-capacitor approach [8] is followed 
instead of the power-hungry current-mode approach. To this 
end, in this work, for the first time, we propose a time-domain 
mixed-signal VMM exploiting a modified 1 MOSFET-1 RRAM 
(1T-1R) array. In the proposed VMM approach, the weights are 
realized in the analog domain as programmable current sinks by 
tuning the RRAM conductance state in the modified 1T-1R blocks, 
while the inputs and outputs are encoded as pulse 
durations in the digital domain. In contrast to conventional 
1T-1R blocks, where the RRAM is connected to the drain of the 
MOSFET, the RRAM is attached to the source in this approach, 
which leads to a self-compensation effect and significantly 
improves the compute precision. A rigorous analysis of the 
non-ideal factors affecting the compute precision of 
the proposed VMM indicates that channel length modulation 
(CLM) and drain-induced barrier lowering (DIBL) are the 
dominant mechanisms degrading the compute precision.  
 
Time-Domain Mixed-Signal Vector-by-Matrix 
Multiplier Exploiting 1T-1R Array 
Shubham Sahay, Member, IEEE, Mohammad Bavandpour, Mohammad Reza Mahmoodi, and Dmitri 
Strukov, Senior Member, IEEE 
(e-mail: shubhamsahay@ucsb.edu)  
 
Fig. 1 Schematic view of (a) the VMM circuit utilizing the 1T-1R array and the timing diagram of the inputs, outputs and the voltage across the load capacitor, (b) the 
modified 1T-1R block which acts as a programmable current sink and (c) the peripheral circuit within the neuron block implementing the Heaviside function. 
 
Furthermore, we also show that there exists a trade-off between 
the compute precision, dynamic range and the area and energy 
dissipation in this implementation. Therefore, we also provide 
the necessary design guidelines for optimizing the performance 
of the proposed architecture. The preliminary results show that 
an effective precision of 6 bits may be obtained using this 
approach with an energy efficiency of ~498.5 TOps/J and a 
throughput of 1.9 TOps/s for a 200×200 VMM. 
The paper is organized as follows: the proposed VMM 
approach is discussed in section II. The load-line characteristics 
of the modified 1T-1R block and the different factors which may 
affect the performance of the proposed approach are discussed 
in section III. The design guidelines for optimizing the 
performance of the proposed 1T-1R VMM are discussed in 
section IV, and the area, energy and throughput estimates are 
discussed in section V. The conclusions are drawn in section 
VI. 
II.  PROPOSED VMM APPROACH 
A generalized 𝑀 × 𝑁 VMM operation may be represented as: 
   $$ y_j = \frac{1}{M} \sum_{i=1}^{M} w_{ij} x_i, \quad j = 1, 2, \ldots, N \tag{i} $$
where the inputs 𝑥𝑖, outputs 𝑦𝑗 and weights 𝑤𝑖𝑗  are normalized 
such that (𝑥𝑖 , 𝑦𝑗 , 𝑤𝑖𝑗) ∈ [0,1]. The proposed time-domain 
VMM approach exploiting the modified 1T-1R array is shown 
in Fig. 1. In the time-domain VMM [9]-[15], the inputs are 
encoded as duration of the digital pulses such that: 
   $$ t_{in,i} = x_i T \tag{ii} $$
where $T$ is the time window for the VMM operation. In the 
proposed approach, the modified 1T-1R block acts as a 
programmable current sink as shown in Fig. 1(b), and the digital 
inputs applied to the gates of the MOSFETs ($V_{in,i}$) enable the 
$i$-th current sink for a duration $t_{in,i}$. Note that, unlike 
conventional 1T-1R arrays where the RRAMs are connected to 
the drains of the MOSFETs, in this approach the RRAMs are 
connected to the sources of the MOSFETs. This suppresses 
non-idealities such as channel length modulation (CLM) and 
drain-induced barrier lowering (DIBL) through the 
self-compensation effect discussed in section III.C. The 
weights ($w_{ij} \in [0,1]$) are mapped to the currents 
($I_{ij} \in [I_{min}, I_{max}]$) through the programmable current sink as: 
   $$ I_{ij} = I_{min} + w_{ij} (I_{max} - I_{min}) \tag{iii} $$
Each column of programmable current sinks is connected to 
a load capacitor $C_j$. A threshold (neuron) circuit proposed in 
[14] encodes the voltage on the load capacitor $C_j$ into the 
output digital pulse duration. Its transfer function is: 
   $$ V_{out,j} = V_{DD}\, H\big(V_{TH} - V(C_j)\big) \tag{iv} $$
where $H(\cdot)$ is the Heaviside function.  
 The entire VMM operation is completed in two cycles 
(phase-I and phase-II) of duration $T$ each. The load capacitor $C_j$ 
is initially pre-charged to a voltage $V_{RESET}$ at the beginning of 
phase-I ($t = 0$). The inputs are activated only in phase-I 
(integration phase), during which the current sinks discharge $C_j$. 
At the end of phase-I ($t = T$), the voltage across the load 
capacitor $V(C_j)$ has reduced by $\Delta V(C_j)$, where: 
   $$ \Delta V(C_j)\big|_{t=T} = \frac{1}{C_j} \sum_{i=1}^{M} I_{ij} t_{in,i} \tag{v} $$
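Using the expression for $I_{ij}$ from equation (iii) and 
$t_{in,i}$ from equation (ii) in equation (v), we get: 
 
   $$ \Delta V(C_j)\big|_{t=T} = \frac{T (I_{max} - I_{min})}{C_j} \sum_{i=1}^{M} w_{ij} x_i + \frac{T I_{min}}{C_j} \sum_{i=1}^{M} x_i \tag{vi} $$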
 
As evident from equation (vi), the change in the voltage across 
the load capacitor at the end of phase I is a linear 
function of the weighted sum in this scheme. To ensure that 
the voltage on the load capacitor stays within the 
targeted operating regime, i.e. $V(C_j)|_{t=T} \in [V_{TH}, V_{RESET}]$, the 
load capacitor $C_j$ must be designed such that: 
   $$ C_j = \frac{M I_{max} T}{V_{RESET} - V_{TH}} \tag{vii} $$
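
As a quick numerical illustration of equation (vii), the sketch below sizes the load capacitor for one operating point. The function name `load_capacitance` is hypothetical, and the operating point (M = 200, I_max = 125.9 nA, T = 16 ns, V_RESET = 0.9 V, V_TH = 0.7 V) is assembled from values quoted elsewhere in this paper purely for illustration.

```python
def load_capacitance(m_rows, i_max, t_window, v_reset, v_th):
    """Load capacitance from equation (vii), in farads."""
    if v_reset <= v_th:
        raise ValueError("v_reset must be larger than v_th")
    return m_rows * i_max * t_window / (v_reset - v_th)


# Illustrative operating point (assumed combination of values from the text).
c_j = load_capacitance(m_rows=200, i_max=125.9e-9, t_window=16e-9,
                       v_reset=0.9, v_th=0.7)
print(f"C_j = {c_j * 1e12:.2f} pF")  # about 2.01 pF
```

The capacitance scales linearly with M, I_max and T, which is why a larger maximum current or a longer time window directly costs capacitor area.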
 
In phase-II (evaluation phase), all the inputs are deactivated and 
the load capacitor is discharged through a constant current 
$M I_{max}$. This discharging current may be generated either via a 
current mirror or by adding a similar 1T-1R array at the load 
capacitor with all the inputs activated for the entire duration $T$ 
during phase-II and the current sinks programmed to $I_{max}$. In 
this work, we have followed the latter approach to implement 
the constant current source during phase-II. The neuron circuit 
generates an output pulse when the voltage on the load capacitor 
reaches the threshold voltage, i.e. $V(C_j) = V_{TH}$. The time 
instant ($t_{r,j}$) at which $V(C_j) = V_{TH}$ is given by: 
   $$ t_{r,j} = T - t_{out,j} = T \left[ 1 - \frac{\sum_{i=1}^{M} I_{ij} t_{in,i}}{M I_{max} T} \right] \tag{viii} $$
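
Equation (viii) follows from equations (v) and (vii) in a few steps. The sketch below assumes that $t_{r,j}$ is measured from the start of phase-II and that the phase-II discharge current is exactly $M I_{max}$:

```latex
\begin{aligned}
V(C_j)\big|_{t=T} &= V_{RESET} - \Delta V(C_j)\big|_{t=T}, \\
t_{r,j} &= \frac{C_j \left[ V(C_j)\big|_{t=T} - V_{TH} \right]}{M I_{max}}
         = \frac{C_j (V_{RESET} - V_{TH})}{M I_{max}}
           - \frac{C_j \, \Delta V(C_j)\big|_{t=T}}{M I_{max}} \\
        &= T - \frac{\sum_{i=1}^{M} I_{ij} t_{in,i}}{M I_{max}}
         = T \left[ 1 - \frac{\sum_{i=1}^{M} I_{ij} t_{in,i}}{M I_{max} T} \right].
\end{aligned}
```

The first term reduces to $T$ by equation (vii) and the second term is the phase-I charge from equation (v) divided by the discharge current.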
The output pulse duration ($t_{out,j}$) is obtained by 
substituting equations (i)-(iii) into equation (viii): 
   $$ t_{out,j} = a\, y_j T + b \tag{ix} $$
 
where 
   $$ a = \frac{I_{max} - I_{min}}{I_{max}}, \qquad b = \frac{I_{min} T}{M I_{max}} \sum_{i=1}^{M} x_i. \tag{x} $$
 
Equation (ix) clearly indicates that the output obtained 
using the proposed scheme differs from the targeted ideal 
output ($t_{out,j} = y_j T$) because of the non-zero minimum 
current ($I_{min}$) of the 1T-1R cells, which leads to the undesirable 
multiplicative coefficient ($a$) and the input-dependent additive 
coefficient ($b$).  
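
The relation between the two-phase operation and the closed form of equation (ix) can be checked numerically. The following is a minimal sketch of an ideal column (perfect current sinks, no parasitics); the function names are hypothetical, the current and voltage values are borrowed from elsewhere in the paper for illustration, and the additive term $b$ is expressed in units of time (that is, it carries the factor $T$).

```python
import math
import random


def simulate_column(weights, inputs, i_min, i_max, t_window, v_reset, v_th):
    """Ideal two-phase operation of one column; returns t_out in seconds."""
    m = len(weights)
    c_j = m * i_max * t_window / (v_reset - v_th)                # equation (vii)
    currents = [i_min + w * (i_max - i_min) for w in weights]    # equation (iii)
    t_in = [x * t_window for x in inputs]                        # equation (ii)
    delta_v = sum(i * t for i, t in zip(currents, t_in)) / c_j   # equation (v)
    v_after_phase1 = v_reset - delta_v
    # Phase II: constant discharge current M * I_max until V(C_j) = V_TH.
    t_r = (v_after_phase1 - v_th) * c_j / (m * i_max)
    return t_window - t_r


def closed_form(weights, inputs, i_min, i_max, t_window):
    """t_out = a * y * T + b, with b in units of time."""
    m = len(weights)
    y = sum(w * x for w, x in zip(weights, inputs)) / m          # equation (i)
    a = (i_max - i_min) / i_max
    b = i_min * t_window * sum(inputs) / (m * i_max)
    return a * y * t_window + b


rng = random.Random(0)
weights = [rng.random() for _ in range(50)]
inputs = [rng.random() for _ in range(50)]
params = dict(i_min=25.2e-9, i_max=125.9e-9, t_window=16e-9)
t_sim = simulate_column(weights, inputs, v_reset=0.9, v_th=0.7, **params)
t_cal = closed_form(weights, inputs, **params)
print(math.isclose(t_sim, t_cal, rel_tol=1e-9))
```

Setting `i_min` to zero in this sketch gives `a = 1` and `b = 0`, which recovers the ideal output $y_j T$.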
 However, the input-dependent additive 
coefficient ($b$) can be cancelled by using a differential 
scheme. In the differential implementation, each weight is 
realized using two sub-weights $w_{ij}^{+}$ and $w_{ij}^{-}$ such that  
   $$ w_{ij} = w_{ij}^{+} - w_{ij}^{-} \tag{xi} $$
and two sub-neurons are dedicated to calculating the dot products 
of the inputs with each sub-weight vector, giving $t_{out,j}^{+}$ 
and $t_{out,j}^{-}$:  
   $$ t_{out,j}^{+} = \frac{aT}{M} \sum_{i=1}^{M} w_{ij}^{+} x_i + b \tag{xii} $$
 
   $$ t_{out,j}^{-} = \frac{aT}{M} \sum_{i=1}^{M} w_{ij}^{-} x_i + b \tag{xiii} $$
 
A simple logic circuit is then employed to generate the final 
differential output pulse, in which the common term $b$ cancels: 
   $$ t_{out,j} = t_{out,j}^{+} - t_{out,j}^{-} \tag{xiv} $$
  
On the other hand, the multiplicative coefficient ($a$) leads to a 
reduction in the output time window. This shrinkage can be 
compensated either by lowering the constant current during the 
evaluation phase (which extends the time window for phase II) 
or by increasing the counter frequency of the output 
time-to-digital converter (TDC).   
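
The cancellation of the additive term and the effect of the multiplicative coefficient can be illustrated with a small sketch of ideal current sinks. The function names and the example weights are hypothetical; the currents and time window are taken from values quoted in this paper. Dividing by $a$ at the end stands in for either compensation option and is an idealization.

```python
import math


def t_out_single(weights, inputs, i_min, i_max, t_window):
    """Output duration of one ideal column: sum(I_ij * t_in_i) / (M * I_max)."""
    m = len(weights)
    charge = sum((i_min + w * (i_max - i_min)) * x * t_window
                 for w, x in zip(weights, inputs))
    return charge / (m * i_max)


def t_out_differential(w_pos, w_neg, inputs, i_min, i_max, t_window):
    """Difference of the two sub-neuron outputs, as in equation (xiv)."""
    return (t_out_single(w_pos, inputs, i_min, i_max, t_window)
            - t_out_single(w_neg, inputs, i_min, i_max, t_window))


I_MIN, I_MAX, T = 25.2e-9, 125.9e-9, 16e-9
w_pos = [0.8, 0.1, 0.5]
w_neg = [0.2, 0.4, 0.5]
x = [1.0, 0.5, 0.25]

a = (I_MAX - I_MIN) / I_MAX
y = sum((wp - wn) * xi for wp, wn, xi in zip(w_pos, w_neg, x)) / len(x)
t_diff = t_out_differential(w_pos, w_neg, x, I_MIN, I_MAX, T)

print(math.isclose(t_diff, a * y * T, rel_tol=1e-9))      # b has cancelled
print(math.isclose(t_diff / a, y * T, rel_tol=1e-9))      # rescaling removes a
```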
III. 1T-1R VMM DESIGN GUIDELINES 
The performance of the proposed 1T-1R VMM was evaluated 
at the 55-nm technology node using the GlobalFoundries PDK 
in HSPICE (version N-2017.12 [17]). MOSFETs 
with minimum width (120 nm) were used for all the analyses. 
Furthermore, a rather simplistic compact model was used for 
the RRAM, with the current-voltage relationship expressed as: 
   $$ I_{mem} = g_0 \sinh(\beta V_{mem}) \tag{xv} $$
where $g_0$ is the conductance in the initial state and $\beta$ is the 
non-linearity factor [18]. A maximum ON-state conductance 
($g_0 = g_{max}$) of 0.1 mS (RON = 10 kΩ) and a minimum OFF-state 
conductance ($g_0 = g_{min}$) of 0.1 μS (ROFF = 10 MΩ) were 
considered for the RRAM, similar to [6]. Furthermore, a maximum 
permissible read voltage of 0.5 V without disturbing the 
programmed state of the RRAM was assumed. Under these 
assumptions, we evaluated the potential of the proposed 1T-1R 
time-domain VMM under different operating conditions and 
different parameters for the RRAM. In the subsequent sections, 
we discuss the operating conditions and provide the necessary 
design guidelines to extract the optimum performance from the 
proposed VMM architecture. It may be noted that the optimal 
conditions also differ with the input constraints such as VMM 
size, input voltage, time window, dynamic range 
($DR = I_{max} - I_{min}$), targeted precision, etc. 
 
Fig. 2 The load-line characteristics of the modified 1T-1R block for RRAMs 
with different non-linearity factors: (a) β = 4 and (b) β = 8. 
A. Load-line characteristics 
The load-line characteristics of the modified 1T-1R block 
shown in Fig. 1(b) (MOSFET with minimum gate length, Lg = 
60 nm) are shown in Fig. 2 for different non-linearity factors (β). 
The reset voltage 𝑉𝑅𝐸𝑆𝐸𝑇  was chosen as 0.9 V to reduce the error 
induced due to non-idealities such as CLM and DIBL (as 
discussed in section III.C). From Fig. 2, we observe that 
increasing the non-linearity factor of the RRAM results in a 
reduction of the operating range of drain voltage for low gate 
voltages (< 0.6 V). A reduced operating range of drain voltage 
leads to a lower dynamic range of current values which may be 
obtained from the modified 1T-1R block via tuning the 
conductance state of the RRAM (as can also be observed from 
Table I). Also, the ON-state to OFF-state conductance ratio of 
the RRAM should be high to obtain an appreciable 𝐷𝑅. 
 Moreover, the operating range of drain voltages is also 
degraded when a RRAM with lower ON-state conductance or a 
higher OFF-state conductance is used as shown in Fig. 2. 
Furthermore, unlike the current-mode VMM approach based on 
RRAM where the accumulated current depends exclusively on 
the conductance state of the RRAMs, the current from the 
modified 1T-1R block depends both on the conductance state 
of the RRAM and the channel conductance of the MOSFET 
(which depends on the input voltage). Therefore, even if the 
ON-state conductance of the RRAM increases tenfold, as shown 
in Fig. 2, the drain current increases only slightly (< 2 times) 
and does not degrade the energy efficiency of the proposed 
VMM approach considerably, as opposed to the current-mode 
VMM where the accumulated current would increase by a 
decade and limit the energy efficiency. However, the $DR$ also 
increases for lower ON-state resistances of the RRAM in the 
1T-1R configuration. 
B. Precision 
The effective weight precision (i.e. programmability of the 
current sinks) depends on the accuracy of tuning the 
conductance states of the RRAM and degrades due to the drift in 
the analog conductance state with cycling and temperature and 
the intrinsic noise, such as random telegraph noise (RTN), 
exhibited by the RRAM. Previous works have already shown an 
effective weight precision greater than 7 bits based on a simple 
tuning algorithm [19]. The weight precision may be further 
improved by oxide material engineering or by using more 
efficient tuning algorithms.  
 As discussed in [15], the compute error (or output error, 
$e_{out,j}$) may be decoupled from the weight error and defined 
separately as the maximum difference between the theoretically 
calculated output time period considering ideal current sinks 
($t_{out,j}^{cal}$) and the output time period obtained via transient 
simulation of the entire VMM circuit ($t_{out,j}^{sim}$), spanning over the 
entire sample space of the weights and inputs, i.e.  
   $$ e_{out,j} = \max_{t_{out,j}} \frac{\left| t_{out,j}^{cal} - t_{out,j}^{sim} \right|}{T} \tag{xvi} $$
The compute precision ($P_{out,j}$) can then be defined as: 
   $$ P_{out,j} = -\log_2 e_{out,j} - 1 \tag{xvii} $$
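
Equation (xvii) is equivalent to requiring the worst-case error to stay within $2^{-(P+1)}$ of the time window, i.e. half a least-significant bit at $P$ bits. The sketch below evaluates it for a few error values that appear in Table I. The function name is hypothetical, and rounding down to an integer number of bits is an assumption made here; it is not stated in the text, although it agrees with the integer precisions tabulated for these four entries.

```python
import math


def compute_precision_bits(e_out):
    """Equation (xvii); e_out is the error as a fraction of T (not percent)."""
    if not 0.0 < e_out < 1.0:
        raise ValueError("e_out must lie strictly between 0 and 1")
    return -math.log2(e_out) - 1.0


errors_percent = (4.5, 2.8, 1.44, 0.74)
for pct in errors_percent:
    bits = compute_precision_bits(pct / 100.0)
    print(f"e_out = {pct}% -> {bits:.2f} bits, floor = {math.floor(bits)}")
```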
Considering the efficacy of the differential scheme in 
cancelling the impact of the input-dependent additive 
coefficient ($b$), as discussed in section II, improving the 
noise immunity and enhancing the output precision while 
enabling the inclusion of bipolar weights [8], two adjacent columns 
of the 1T-1R array were tuned to implement the positive 
and negative weight components of the bipolar weight matrix. 
Furthermore, adjacent neuron circuits were used to calculate 
the positive ($V_{out,j}$, with pulse duration $t_{out,j}^{+}$) and 
negative ($V_{out,(j+1)}$, with pulse duration $t_{out,j}^{-}$) 
components of the output in this differential implementation. 
The final output was obtained as the time difference 
between the rising edges of the neuron circuits used for obtaining 
the positive ($V_{out,j}$) and negative ($V_{out,(j+1)}$) components of the 
output. This rectified linear (ReLU) operation may be 
implemented using a digital gate for 
$V_{final,j} = V_{out,j} \cdot V_{out,(j+1)}$. 
C. Non-ideal factors 
The compute precision is degraded by several factors which 
cause the current sink to deviate from providing a constant 
current. While CLM leads to a linear dependence of the 
MOSFET’s drain current on the drain voltage and limits its 
action as a constant current sink, the DIBL effect induces a 
threshold voltage shift which further increases the variation in 
the drain current with the drain voltage. Therefore, in addition 
to the input voltage (VGS), the current through the 
programmable 1T-1R current sink also depends on the drain 
voltage, i.e. the output voltage at the load capacitor.  
 To minimize this dependency of the current sink on the 
output voltage, we modified the conventional 1T-1R array 
architecture. While the RRAM is connected to the drain 
terminal of the MOSFET in the conventional 1T-1R array, one 
terminal of RRAM is connected to the source of the MOSFET 
and the other terminal is grounded in this implementation as 
shown in Fig. 1(b). An increase in the drain voltage in the 
modified 1T-1R configuration with RRAM connected to the 
source leads to an enhanced current flowing through the 
RRAM. This results in a larger voltage drop across the RRAM. 
The increased voltage drop across the RRAM effectively boosts 
the source potential leading to a reduction in the effective gate 
to source voltage (VGS) which in turn suppresses the increment 
in the drain current. Therefore, the increment in the drain 
current due to application of a larger drain voltage is 
compensated by a reduction in the effective gate overdrive 
voltage in the modified 1T-1R array. This inherent 
compensation effect leads to a diminished dependency of the 
current through the programmable current sink on the output 
voltage at the load capacitor. 
 The error due to CLM and DIBL can be defined as: 
   $$ e_{CLM/DIBL} = 1 - \frac{I(V - \Delta V)}{I(V)} \tag{xviii} $$
where $\Delta V$ is chosen as 1 mV to estimate the local error contours 
accurately. We performed a rigorous analysis of the CLM 
and DIBL error for different gate (input) voltages and 
non-linearity factors within the operating regime of the modified 
1T-1R configuration. The error contour plots for different input 
voltages and non-linearity factors of the RRAM are shown in 
Fig. 3. For all the input voltages, we found that the programmable 
current sink is relatively independent of the drain voltage (i.e. 
the CLM/DIBL error is low) for high drain voltages. Therefore, 
we selected a high reset voltage, $V_{RESET}$ = 0.9 V, and designed 
the neuron circuit to have a threshold voltage $V_{TH}$ = 0.7 V to 
ensure a non-disturbing maximum voltage swing of 0.2 V 
across the RRAM. Furthermore, we also observe from Fig. 3 
that the DIBL/CLM error increases as we reduce the input 
voltage and operate with a smaller maximum current ($I_{max}$) to 
limit the load capacitance (see equation (vii)). Moreover, an 
increased non-linearity factor of the RRAM leads to a 
significant reduction in the dynamic range, as shown in 
Fig. 3(b). 
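
The self-compensation mechanism can be illustrated with a toy model. The sketch below is not the HSPICE/PDK setup used in this work: it assumes a square-law MOSFET that is always in saturation, with hypothetical parameters `K_N`, `V_TH_N` and `LAMBDA` and a hypothetical gate voltage of 0.8 V, and it models only CLM (no DIBL). The RRAM follows equation (xv) and the error follows equation (xviii). The source voltage of the modified 1T-1R block is found by bisection.

```python
import math

K_N = 2e-4      # hypothetical transconductance parameter, A/V^2
V_TH_N = 0.3    # hypothetical MOSFET threshold voltage, V
LAMBDA = 0.2    # hypothetical channel-length-modulation coefficient, 1/V


def mosfet_current(v_gs, v_ds):
    """Toy square-law saturation current with CLM."""
    v_ov = v_gs - V_TH_N
    if v_ov <= 0.0:
        return 0.0
    return 0.5 * K_N * v_ov ** 2 * (1.0 + LAMBDA * v_ds)


def rram_current(v_mem, g0, beta):
    """RRAM compact model of equation (xv)."""
    return g0 * math.sinh(beta * v_mem)


def sink_current(v_g, v_d, g0, beta):
    """Current of the block with the RRAM between source and ground."""
    lo, hi = 0.0, max(v_g - V_TH_N, 0.0)
    for _ in range(200):  # bisection on the source voltage
        v_s = 0.5 * (lo + hi)
        if mosfet_current(v_g - v_s, v_d - v_s) > rram_current(v_s, g0, beta):
            lo = v_s
        else:
            hi = v_s
    return rram_current(0.5 * (lo + hi), g0, beta)


def clm_error(current_fn, v, dv=1e-3):
    """Local error of equation (xviii) at drain voltage v."""
    return 1.0 - current_fn(v - dv) / current_fn(v)


V_G, G0, BETA = 0.8, 1e-4, 4.0
e_bare = clm_error(lambda v: mosfet_current(V_G, v), 0.9)
e_1t1r = clm_error(lambda v: sink_current(V_G, v, G0, BETA), 0.9)
print(f"MOSFET alone: {e_bare:.2e}, with source RRAM: {e_1t1r:.2e}")
```

In this toy model the relative sensitivity to the drain voltage is lower with the RRAM at the source than for the MOSFET alone, which is the qualitative effect described above; the numbers themselves carry no quantitative meaning for the 55-nm devices.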
 Fig. 3 The error contour plot due to the DIBL and CLM effect for different input voltages and non-linearity factors: (a) VGS = 0.3 V, β = 4, (b) VGS = 0.3 V, β = 8 and 
(c) VGS = 0.5 V, β = 4 for MOSFETs with Lg = 120 nm. 
 
Fig. 4 The total error due to the DIBL and CLM effect for different input 
voltages and gate lengths (Lg) of the MOSFET with non-linearity factor β = 4.  
 
Since the DIBL and CLM effects are extremely sensitive to the 
gate length (Lg) of the MOSFETs, we also performed a 
thorough investigation of the CLM/DIBL error for 1T-1R 
blocks with MOSFETs of different gate lengths biased at 
different input voltages as shown in Fig. 4. We found that 
despite the self-compensation effect, the CLM/DIBL error is 
significantly high in the MOSFETs with minimum gate length 
(Lg = 60 nm) and reduces drastically by ~5 times when the gate 
length is quadrupled to Lg = 240 nm for the same input voltage. 
Moreover, the CLM/DIBL error can be further reduced while 
obtaining a higher 𝐷𝑅 by increasing the input voltage at the cost 
of an increased capacitor area and energy owing to larger 𝐼𝑚𝑎𝑥 . 
Therefore, there exists a trade-off between area and energy 
efficiency, the dynamic range and the compute error in the 
proposed approach.  
 Apart from the error induced by CLM and DIBL, the 
capacitive coupling between the load capacitor and the gate-drain 
capacitance of the MOSFET could be another possible 
source of charge disturbance. However, in the proposed 
architecture, the load capacitor is large compared to the 
gate-drain capacitance of the minimum-sized MOSFETs owing to 
the higher maximum current $I_{max}$. This diminishes the charge 
disturbance due to capacitive coupling. 
IV.  DESIGN SPACE EXPLORATION  
We also performed a rigorous analysis to explore the design 
space for optimizing the performance of the proposed VMM 
architecture. The input voltage (VGS) and the time window (𝑇) 
are the important design parameters for tuning the performance 
of the proposed VMM for a particular gate length of the 
MOSFET utilized in the 1T-1R block. The performance 
parameters of the VMM architecture for different input voltages 
(VGS), time window (𝑇), gate lengths (Lg), VMM sizes (𝑀 in 
𝑀 × 𝑀 VMM) and non-linearity factor (𝛽) of RRAM are listed 
in Table I. The output (worst-case) error ($e_{out}$) was found by 
simulating multiple runs of the VMM operation in HSPICE with 
a different combination of random inputs and random weights in 
each run, in an attempt to span the entire sample space of 
possible input and weight combinations. The line parasitics 
such as line resistances and capacitances and the corresponding 
process variations pertinent to the 55-nm technology node were 
also considered in the HSPICE simulations. The total energy 
dissipated in the load capacitor, 𝐸𝐶𝑙  (which is the dominant 
energy dissipation mechanism as discussed later in section V) 
for the VMM operation has also been included in Table I. As 
can be observed from Table I, the compute error is low and 
decreases further with increasing VMM size up to $M$ = 100. 
However, as the VMM size increases above 100, the line 
parasitics and their process variations lead to a non-negligible 
increase in the compute error. While the line resistances lead to 
a drop in the effective input (gate) voltage of the MOSFETs at 
the far end of the 1T-1R array, and hence to a reduced drain 
current, the line capacitances add to the latency. Although the 
differential configuration is effective in mitigating the impact 
of fixed line parasitics, the process variations cannot be 
compensated even with a differential configuration and 
increase the compute error.  
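
The random-sampling procedure for the worst-case error of equation (xvi) can be sketched with a behavioral stand-in for the transient simulation. This is not the HSPICE flow described above: the model below uses forward-Euler integration and assumes, purely for illustration, that every current sink scales linearly with the capacitor voltage through a hypothetical coefficient `eta` (in 1/V). Line parasitics and process variations are not modeled, and all function names are hypothetical.

```python
import random


def t_out_ideal(weights, inputs, i_min, i_max, t_window):
    """Calculated output duration with ideal (voltage-independent) sinks."""
    m = len(weights)
    charge = sum((i_min + w * (i_max - i_min)) * x * t_window
                 for w, x in zip(weights, inputs))
    return charge / (m * i_max)


def t_out_behavioral(weights, inputs, i_min, i_max, t_window,
                     v_reset, v_th, eta, steps=400):
    """Forward-Euler model; each sink scales as 1 - eta * (v_reset - v)."""
    m = len(weights)
    c_j = m * i_max * t_window / (v_reset - v_th)      # equation (vii)
    dt = t_window / steps
    currents = [i_min + w * (i_max - i_min) for w in weights]
    on_steps = [round(x * steps) for x in inputs]
    v = v_reset
    for k in range(steps):                             # phase I: integration
        i_on = sum(i for i, n in zip(currents, on_steps) if k < n)
        v -= i_on * (1.0 - eta * (v_reset - v)) * dt / c_j
    for k in range(steps):                             # phase II: evaluation
        dv = m * i_max * (1.0 - eta * (v_reset - v)) * dt / c_j
        if v - dv <= v_th:
            t_r = (k + (v - v_th) / dv) * dt           # interpolate the crossing
            return t_window - t_r
        v -= dv
    return 0.0                                         # threshold not reached within T


def worst_case_error(eta, m=16, trials=40, seed=1):
    """Maximum normalized |t_cal - t_sim| over random 4-bit inputs and weights."""
    rng = random.Random(seed)
    t_window = 16e-9
    worst = 0.0
    for _ in range(trials):
        weights = [rng.random() for _ in range(m)]
        inputs = [rng.randrange(16) / 16 for _ in range(m)]
        ideal = t_out_ideal(weights, inputs, 25.2e-9, 125.9e-9, t_window)
        model = t_out_behavioral(weights, inputs, 25.2e-9, 125.9e-9, t_window,
                                 0.9, 0.7, eta)
        worst = max(worst, abs(ideal - model) / t_window)
    return worst


print(f"eta = 0.00 /V: e_out = {worst_case_error(0.0):.2e}")
print(f"eta = 0.05 /V: e_out = {worst_case_error(0.05):.2e}")
```

With `eta = 0` the behavioral model reproduces the calculated output up to numerical round-off, so any error reported for a non-zero `eta` comes from the assumed voltage dependence of the sinks.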
 From Table I, it can also be observed that there is a trade-
off between the compute precision, dynamic range and the 
energy dissipated. For instance, to achieve a high compute 
precision of ~6-bits for large sized VMMs (𝑀 > 100), a higher 
value of input voltage (VGS = 0.5 V) should be used. A higher 
input voltage results in a higher maximum current (𝐼𝑚𝑎𝑥) 
leading to a larger load capacitance. Although the dynamic 
range is also high for such operating conditions, the area and 
energy efficiency is limited by the load capacitor, which 
dominates the area and energy landscape (as discussed in 
section V). 
 
TABLE I 
DESIGN SPACE EXPLORATION 
Column groups (each group lists T = 16, 32 and 64 ns): 
  A: VGS = 0.3 V, Lg = 120 nm, RON = 10 kΩ, ROFF = 10 MΩ, β = 4; Imax = 136.9 nA, Imin = 25.8 nA 
  B: VGS = 0.3 V, Lg = 120 nm, RON = 10 kΩ, ROFF = 10 MΩ, β = 8; Imax = 137.5 nA, Imin = 39.8 nA 
  C: VGS = 0.3 V, Lg = 240 nm, RON = 10 kΩ, ROFF = 10 MΩ, β = 4; Imax = 125.9 nA, Imin = 25.2 nA 
  D: VGS = 0.3 V, Lg = 240 nm, RON = 10 kΩ, ROFF = 10 MΩ, β = 8; Imax = 126.3 nA, Imin = 38.7 nA 
  E: VGS = 0.5 V, RON = 1 MΩ, Lg = 120 nm, ROFF = 9 MΩ, β = 4; Imax = 497 nA, Imin = 94.6 nA 
  F: VGS = 0.5 V, RON = 1 MΩ, Lg = 240 nm, ROFF = 9.5 MΩ, β = 4; Imax = 496.5 nA, Imin = 94.1 nA 
 
Group                  A                 B                 C                 D                 E                 F
T (ns)          16    32    64    16    32    64    16    32    64    16    32    64    16    32    64    16    32    64
VMM size, M = 10
E_Cl (pJ)     0.09  0.19  0.39  0.09  0.19  0.39  0.09  0.18  0.36  0.09  0.18  0.36   0.3   0.7   1.4   0.3   0.7   1.4
e_out (%)      4.5   2.8   2.8   4.4   2.8   2.8   2.6   2.5   2.4   2.6   2.5   2.4  1.44   1.0  0.74  0.88  0.74  0.55
P_out            3     4     4     3     4     4     4     4     4     4     4     4     5     5     6     5     6     6
VMM size, M = 50
E_Cl (pJ)     2.45  4.92  9.85  2.47  4.95   9.9  2.25  4.53  9.06  2.27   4.5  9.09  8.93  17.8    36  8.95  17.8    36
e_out (%)      4.3   2.6   2.7   4.2   2.7   2.6   2.4   2.4   2.3   2.5   2.4   2.2   1.3  0.94  0.72  0.77  0.68  0.48
P_out            3     4     4     3     4     4     4     4     4     4     4     4     5     5     6     5     6     6
VMM size, M = 100
E_Cl (pJ)     9.81  19.7  39.4   9.9  19.8  39.6   9.0  18.1  36.2  9.09  18.2  36.3  35.7  71.5   144  35.6  71.4   144
e_out (%)      4.2   2.6   2.6   4.2   2.6   2.5   2.3   2.3   2.3   2.3   2.4   2.2   1.2  0.92  0.66  0.75  0.64  0.46
P_out            3     4     4     3     4     4     4     4     4     4     4     4     5     5     6     6     6     6
VMM size, M = 200
E_Cl (pJ)     39.2  78.4   157  39.6  79.2   158    36  72.5   145  36.3  72.7   145   142   286   576   142   285   576
e_out (%)      4.3   2.7   2.7   4.4   2.7   2.7   2.4   2.4   2.3   2.4   2.3   2.3   1.3  0.93  0.67  0.77  0.64  0.46
P_out            3     4     4     3     4     4     4     4     4     4     4     4     5     5     6     6     6     6
 
Fig. 5 Impact of variation in the threshold voltage of the neuron circuit ($V_{TH}$), used to 
limit the maximum voltage swing across the RRAM ($V_{RESET} - V_{TH}$), on the 
compute error and the capacitor energy of the proposed VMM approach. 
 
 Moreover, to achieve a higher area and energy efficiency 
by limiting the size of the load capacitor, a lower input voltage 
may be used to reduce the maximum current ($I_{max}$). However, 
the compute precision and the dynamic range reduce 
significantly under such operating conditions. The weight precision 
may limit the compute precision in such cases. 
 Still, the preliminary results indicate that an effective 
compute precision of 6 bits is achievable for a VMM size $M$ > 
100 using the proposed approach. In addition, depending on the 
targeted compute precision, input time window, VMM size, 
area, energy efficiency, voltage swing across the RRAM, etc., we 
may optimize the design parameters to achieve optimum 
performance of the proposed VMM architecture. 
 Since the conductance state of the RRAM is sensitive to 
the voltage drop across it, we have also analyzed the 
performance of the proposed VMM approach for neuron circuits 
with different threshold voltages ($V_{TH}$ > 0.5 V), which limit the 
maximum voltage swing across the RRAM ($V_{RESET} - V_{TH}$). As can 
be observed from Fig. 5, a reduction in the maximum voltage 
swing across the RRAM leads to a higher compute precision owing 
to the lower CLM/DIBL error. Although a reduction in the 
voltage drop across the RRAM increases the load capacitor size 
according to equation (vii), the energy dissipated in the load 
capacitor, $E_{Cl}$, decreases owing to the reduced voltage swing, as 
shown in Fig. 5.  
V.  PERFORMANCE ESTIMATION 
From Table I, it can be observed that the proposed VMM 
approach yields a compute precision of 3 bits to 6 bits 
depending on the design parameters. Targeting a compute 
precision of 4 bits, which is sufficient for several applications 
including neuromorphic computing [8], [10], we select an input 
voltage of 0.3 V, a time window of 16 ns and a gate length Lg = 
240 nm for estimating the energy and area efficiency of the 
proposed approach. Fig. 6 shows the area and energy efficiency 
breakdown of the proposed VMM approach taking into account 
the input/output (I/O) peripheral circuitry and the neuron circuit 
for different VMM sizes. 
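
The throughput implied by this operating point can be estimated with a short sketch. The counting convention is an assumption that is not stated in the text: two operations (one multiply and one add) per weight, and one complete VMM every 2T (phase-I plus phase-II) with no pipelining. The function name is hypothetical. Under these assumptions, M = 200 and T = 16 ns give 2.5 TOps/s, which agrees with the throughput quoted for the 4-bit 200×200 VMM.

```python
def throughput_ops_per_s(m_rows, n_cols, t_window, ops_per_mac=2, phases=2):
    """Operations per second for one M x N VMM completed every phases * T."""
    return ops_per_mac * m_rows * n_cols / (phases * t_window)


tp = throughput_ops_per_s(m_rows=200, n_cols=200, t_window=16e-9)
print(f"{tp / 1e12:.2f} TOps/s")  # 2.50 TOps/s
```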
 The basic components of the VMM I/O circuitry are 
digital-input to time-domain pulse converters (DTCs), which 
consist of a 4-bit shared counter and a 4-bit digital comparator 
followed by an S-R latch for each input, and time-domain pulse 
to digital-output converters (TDCs), which consist of a 4-bit 
accumulator for each neuron output. The 4-bit accumulator is 
realized using a 4-bit full adder and a 4-bit register based on 
D flip-flops. A shared clock enables conversion of the pulse 
duration of the neuron output to digital outputs. The neuron 
circuit consists of an S-R latch realized using a pair of NAND 
gates, followed by an AND gate and a NOT gate for implementing 
the differential scheme. The load capacitors are realized using 
MOSCAPs from the 55-nm technology node.  
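
The conversion between digital codes and pulse durations can be sketched at the behavioral level. The model below is an assumption-laden illustration, not the gate-level circuit: it assumes the pulse stays high while the shared counter value is below the input code, and that the TDC adds one count per clock cycle during which the pulse is high. The function names are hypothetical.

```python
def dtc_pulse(code, n_bits=4):
    """Digital-to-time: one sample per clock cycle, high while counter < code."""
    if not 0 <= code < 2 ** n_bits:
        raise ValueError("code does not fit in n_bits")
    return [1 if count < code else 0 for count in range(2 ** n_bits)]


def tdc_count(pulse):
    """Time-to-digital: accumulate one count per clock cycle the pulse is high."""
    total = 0
    for level in pulse:
        total += level
    return total


# A 4-bit code is converted to a pulse and recovered from its duration.
pulse = dtc_pulse(11)
print(pulse, tdc_count(pulse))
```

In this picture the time window T spans 2^4 clock cycles, so a higher target precision at a fixed clock frequency lengthens T.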
 It can be observed from Fig. 6 that the I/O circuitry 
consumes a significant portion of the energy and area landscape 
of the proposed VMM architecture when the VMM size is 
small. However, the load capacitor (𝐶𝑗) tends to dominate the 
area and energy landscape as the VMM size increases. The 
preliminary results indicate an effective compute precision of 
4 bits with an energy efficiency of ~1.5 POps/J and a 
throughput of 2.5 TOps/s for VMM size = 200 using the 
proposed approach.  
 
Fig. 6 The variation of (a) area efficiency, (b) energy efficiency and (c) 
throughput of the proposed VMM with VMM size ($M$) for a ReLU neuron.  
 
Fig. 7 The variation of area efficiency and energy efficiency of the proposed 
VMM for VMM size M = 200 and different targeted precisions.  
 
TABLE II 
PERFORMANCE BENCHMARKING 
Reference            [3]    [6]    [7]    [8]   [10]   [11]   [14]   This work
Approach              CM     CM     CM     SC     TD     TD     TD          TD
Process (nm)         180     22    180     40     14    250     55          55
Precision (bits)       3     ~4     ~5      3     <8     ~7     ~6           6
EE (TOps/J)          6.4     60    5.7      8     18   <290     80         498
I/O included         Yes     No    Yes    Yes     No     No    Yes         Yes
Results              Sim    Sim    Exp    Exp    Sim    Sim    Sim         Sim
CM: current-mode   SC: switched-capacitor   TD: time-domain 
 
 Although applications such as inference, classification and 
recognition may be performed with high accuracy using 
even low-precision (~4 bits) VMM operations [8], we also 
analyze the efficacy of the proposed approach for 
target precisions higher than 4 bits with different operating 
conditions, as shown in Fig. 7. An increase in the targeted bit 
precision effectively increases the time window (𝑇) to encode 
the inputs while operating at the same frequency. Therefore, the 
capacitor and I/O circuit area and energy consumption increase 
significantly with an increased target precision. This leads to a 
considerable degradation in the area and energy efficiency 
when targeting higher-precision VMM operations, as shown in 
Fig. 7. Moreover, we may use a lower conductance value for 
the ON-state of the RRAM to reduce $I_{max}$ further and decrease 
the load capacitance $C_j$, enhancing the energy and area efficiency 
while operating with a reduced precision (4 bits) compared 
to the calculated compute precision (6 bits). Similarly, a lower 
reset voltage ($V_{RESET}$) may further increase the energy and area 
efficiency while enabling a compute precision of 4 bits. The 
capacitor area may also be reduced by using a different input 
encoding scheme, whereby the individual input bits are encoded 
as discrete pulses, and employing the switched-capacitor scheme 
to reduce the charge integrated on the load capacitor [TBD]. 
 Furthermore, the intrinsic thermal noise of the MOSFET 
and the random telegraph noise (RTN) in the RRAM may also 
affect the compute precision. Therefore, analysis of the 
proposed VMM approach under noise is important future 
work.    
VI.  CONCLUSION 
An energy-efficient time-domain VMM exploiting a modified 
configuration of 1T-1R array has been proposed in this work. 
The dominant mechanisms such as CLM, DIBL, etc. which 
degrade the performance of the proposed architecture are 
discussed in detail. Furthermore, we show that there exists a 
trade-off between the compute precision, dynamic range and 
the area and energy efficiency of the proposed VMM approach. 
Therefore, we also provide necessary design guidelines to 
further optimize the performance of the 1T-1R VMM. The 
preliminary results indicate an effective compute precision of 
6 bits with an energy efficiency of ~498.5 TOps/J, which is 
significantly higher than that of the other proposed VMM 
approaches (Table II), and a throughput of 1.9 TOps/s for VMM 
size = 200 using the proposed approach. Our results may provide an 
incentive for experimental realization of the VMM approach 
based on 1T-1R array. 
REFERENCES 
[1] Y. LeCun et al., “Deep learning,” Nature, vol. 521, pp. 436-444, May 2015. 
[2] M. Mohammadi et al., “Deep Learning for IoT Big Data and Streaming 
Analytics: A Survey,” IEEE Commun. Surv. Tut., vol. 20, no. 4, 
pp. 2923-2960, 2018. 
[3] J. Binas, D. Neil, G. Indiveri, S. C. Liu, and M. Pfeiffer, “Precise deep 
neural network computation on imprecise low-power analog 
hardware,” arXiv preprint arXiv:1606.07786, 2016. 
[4] T. Tohara, H. Liang, H. Tanaka, M. Igarashi, S. Samukawa, K. Endo, Y. 
Takahashi and T. Morie, “Silicon nanodisk array with a fin field-effect 
transistor for time-domain weighted sum calculation toward massively 
parallel spiking neural networks,” Appl. Phys. Expr., vol. 9, no. 3, 
p.034201, 2016. 
[5] G. Burr et al., “Experimental Demonstration and Tolerancing of a Large-
Scale Neural Network (165,000 Synapses), using Phase-Change Memory 
as the Synaptic Weight Element,” in Proc. IEDM, CA, Dec. 2014. 
[6] M. Hu et al., “Dot-product engine for neuromorphic computing: 
programming 1T1M crossbar to accelerate matrix-vector multiplication,” 
in Proc. DAC, Austin, TX, pp.1-6, 2016. 
[7] X. Guo, F. M. Bayat, M. Bavandpour, M. Klachko, M. R. Mahmoodi, M. 
Prezioso, K. K. Likharev, and D. B. Strukov, “Fast, energy-efficient, 
robust, and reproducible mixed-signal neuromorphic classifier based on 
embedded NOR flash memory technology,” in Proc. IEDM, pp. 6-11, Dec. 
2017. 
[8] E. H. Lee, and S. S. Wong, “Analysis and design of a passive switched-
capacitor matrix multiplier for approximate computing,” IEEE J. Solid-
State Circuits, vol. 52, no. 1, pp.261-271, 2017.  
[9] D. Miyashita, S. Kousai, T. Suzuki, and J. Deguchi, “A neuromorphic chip 
optimized for deep learning and CMOS technology with time-domain analog 
and digital mixed-signal processing,” IEEE J. Solid-State Circuits, vol. 52, 
no. 10, pp.2679-2689, 2017. 
[10] M. J. Marinella, S. Agarwal, A. Hsia, I. Richter, R. Jacobs-Gedrim, J. 
Niroula, S. J. Plimpton, E. Ipek, and C. D. James, “Multiscale co-design 
analysis of energy, latency, area, and accuracy of a ReRAM analog neural 
training accelerator,” IEEE J. Emerging and Selected Topics in Circuits 
and Systems, vol. 8, no. 1, pp.86-101, 2018. 
[11] Q. Wang, H. Tamukoh, and T. Morie, “A Time-domain Analog Weighted-
sum Calculation Model for Extremely Low Power VLSI Implementation 
of Multi-layer Neural Networks,” arXiv preprint arXiv:1810.06819, 2018. 
[12] T. Morie, H. Liang, T. Tohara, H. Tanaka, M. Igarashi, S. Samukawa, K. 
Endo, and Y. Takahashi, “Spike-based time-domain weighted-sum 
calculation using nanodevices for low power operation,” in Proc. 
IEEE-NANO, pp. 390-392, 2016. 
[13] M. Yamaguchi, G. Iwamoto, H. Tamukoh, and T. Morie, “An Energy-
efficient Time-domain Analog VLSI Neural Network Processor Based on 
a Pulse-width Modulation Approach,” arXiv preprint arXiv:1902.07707, 
Feb. 2019. 
[14] M. Bavandpour et al., “Mixed-signal neuromorphic inference accelerators: 
Recent results and future prospects,” in Proc. IEDM’18, San Francisco, 
CA, Dec. 2018. 
[15] M. Bavandpour et al., “Energy-Efficient Time-Domain Vector-by-Matrix 
Multiplier for Neurocomputing and Beyond,” IEEE Trans. Circuits and 
Systems II, 2019. doi:10.1109/TCSII.2019.2891688. 
[16] B. Chakrabarti, M. A. Lastras-Montaño, G. Adam, M. Prezioso, B. 
Hoskins, M. Payvand, A. Madhavan, A. Ghofrani, L. Theogarajan, K. T. 
Cheng, and D. B. Strukov, “A multiply-add engine with monolithically 
integrated 3D memristor crossbar/CMOS hybrid circuit,” Sci. Rep., 
vol. 7, p. 42429, 2017. 
[17] HSPICE User Guide: Basic Simulation and Analysis, Synopsys, Inc., 
Mountain View, CA, USA, 2018. 
[18] B. Li et al., “RRAM-based analog approximate computing,” IEEE Trans. 
Computer-Aided Design of Integrated Circuits and Systems, vol. 34, no. 
12, pp. 1905-1917, 2015.   
[19] F. Alibart et al., “High precision tuning of state for memristive devices by 
adaptable variation-tolerant algorithm,” Nanotechnology, vol. 23, no. 7, 
2012. 
 
 
