Parallel Programming of Resistive Cross-point Array for Synaptic Plasticity  by Xu, Zihan et al.
Parallel Programming of Resistive Cross-point Array  
for Synaptic Plasticity 
Zihan Xu1*, Abinash Mohanty1, Pai-Yu Chen1, Deepak Kadetotad1, Binbin 
Lin2, Jieping Ye2, Sarma Vrudhula2, Shimeng Yu2, Jae-sun Seo1, Yu Cao1 
1School of Electrical, Computer and Energy Engineering, 
 Arizona State University, Tempe, AZ 85287, USA 
2School of Computing, Informatics, and Decision Systems Engineering, 
 Arizona State University, Tempe, AZ 85281, USA 
{zihanxu,amohant4, pchen72, dkadetot, blin16, 
 jieping.ye, vrudhula, shimeng.yu, jaesun.seo, yu.cao}@asu.edu  
 
 
 
 
Abstract 
This paper proposes a parallel programming scheme for the cross-point array with resistive random 
access memory (RRAM). Synaptic plasticity in unsupervised learning is realized by tuning the 
conductance of each RRAM cell. Inspired by the spike-timing-dependent-plasticity (STDP), the 
programming strength is encoded into the spike firing rate (i.e., pulse frequency) and the overlap time 
(i.e., duty cycle) of the pre-synaptic node and post-synaptic node, and simultaneously applied to all 
RRAM cells in the cross-point array. Such an approach achieves parallel programming of the entire 
RRAM array, only requiring local information from pre-synaptic and post-synaptic nodes to each 
RRAM cell. As demonstrated by digital peripheral circuits implemented in 65nm CMOS, the 
programming time of a 40kb RRAM array is 84 ns, indicating a 900X speedup compared to a 
state-of-the-art software approach to sparse coding in image feature extraction. 
 
Keywords: Resistive cross-point array, Parallel programming, Synaptic plasticity, Dictionary learning 
1 Introduction 
Motivated by the remarkable computational capability of the human brain, neuroscience-inspired 
cognitive computing and learning have become an increasingly attractive paradigm for future 
computation beyond the von Neumann architecture. Along this path toward machine intelligence, 
learning compact representations on data-adaptive dictionaries is the state-of-the-art method for 
analyzing big data [1]. It aims to minimize the reconstruction error $\|x - DZ\|^2$, where $x$ is an 
input vector, $D$ is called the dictionary and $Z$ is the coefficient vector, which is usually assumed 
to be sparse in many problems. Such an optimization target is motivated by the sparseness in the 
visual cortex, minimizing both the error and energy consumption in learning.  
Procedia Computer Science
Volume 41, 2014, Pages 126–133
BICA 2014. 5th Annual International Conference on Biologically
Inspired Cognitive Architectures
Selection and peer-review under responsibility of the Scientific Programme Committee of BICA 2014
© The Authors. Published by Elsevier B.V.
doi: 10.1016/j.procs.2014.11.094 
However, when the data set is big, which is often the case, optimizing the dictionary is a 
computationally challenging problem. Stochastic Gradient Descent (SGD) [2] is one of the most 
efficient algorithms to solve this problem. Instead of updating the dictionary by full gradient descent, 
SGD updates the dictionary by using a randomly selected gradient as follows: 
$$D \leftarrow D - \eta\, r Z^T$$
where $\eta$ is the learning rate, $r Z^T$ is the gradient of the reconstruction error with respect 
to $D$ (up to a constant factor), and the residual error of data representation (r) is 
$r = DZ - x$.  
An analogy to this dictionary learning can be found in the neural networks of our brain, which 
consist of spiking neurons and synapses that connect the neurons. During the training process, a 
spiking neural network learns through plastic synapses that change their weights based on the spike 
timing of the pre-synaptic neuron and the post-synaptic neuron. This learning rule is known as spike-
timing-dependent-plasticity (STDP) [3, 4], as illustrated in Fig. 1(a). 
When these learning algorithms are implemented in hardware to accelerate the learning beyond 
software limitations, the cross-point array was recently proposed as an effective way to represent 
synapses with large fan-in and fan-out [5, 6], where each cross-point is implemented with a memory 
cell. Since scaling conventional on-chip memories (SRAM or eDRAM) becomes more difficult with every 
new technology node, resistive random access memory (RRAM) has emerged as an alternative choice 
for next-generation memory designs due to its non-volatility, integration density, and low power 
consumption [7].  
An RRAM cell structure is shown in Fig. 1(b); it consists of two metal layers and an oxide layer. 
The conductance of the oxide layer is determined by the length of the conductive filament. To change 
the conductance, a voltage pulse needs to be applied across the RRAM cell. Figure 1(b) also shows 
simulation results on how the RRAM conductance is changed by different voltage pulses. Positive 
pulses increase the conductance, while negative pulses decrease it. It can be seen that the 
conductance change is very sensitive to the voltage amplitude and comparatively less sensitive to the 
pulse width, which is another reason to use timing to control the programming at fine granularity in 
this paper. We use 1.5 V (Vdd) as the programming voltage across the two terminals and 0.75 V (Vdd/2) 
to prevent programming. 
The use of resistive devices as synapses in neuromorphic applications has been actively explored [5, 
8]. However, updating all the resistive devices in a large cross-point array is still very 
time-consuming in previous approaches, since it requires sequential operation (row-by-row, 
column-by-column, or even bit-by-bit). In this paper, we focus on a resistive cross-point array which 
holds the dictionary values (D), and connects Z (sparse data representation) on one side and r 
(residual error of data representation on inputs) on the other side. We seek an efficient way to 
update all the dictionary values stored in a resistive cross-point array by an amount proportional to 
the product of Z and r (i.e., $Z \cdot r$). Specific write circuits are designed for Z and r on the 
periphery of the cross-point array, such that the entire resistive cross-point array can be 
programmed in parallel and thus the programming speed is no longer limited by the scale of the 
dictionary. 
The remainder of the paper is organized as follows. Section 2 describes the programming 
principles and the circuits for Z and r. To be robust under noise and compatible with other units in a 
VLSI system, this design utilizes a digital scheme, with Z and r as 4-bit binary numbers (i.e., 16-bit in 
the natural representation) and D as a 16-bit binary number for high precision. Section 3 implements 
this digital system at the 65nm node, and applies it to a realistic learning case to demonstrate its 
efficacy. The paper is concluded in Section 4. 
 
2 Parallel Programming Scheme and Circuit Design 
In this section, we propose a parallel programming scheme to adapt the synaptic weight (i.e., the 
conductance of an RRAM cell) of the entire cross-point array based on the values of Z and r. Digital write 
circuits for Z and r are designed to generate the programming voltage for the cross-point array. 
2.1 Programming Method 
Conventionally, programming a resistive memory array is performed sequentially column by 
column as shown in Fig. 2(a), or even bit by bit as usually implemented in software. In our 
learning application, to change the D value by an amount proportional to $Z \cdot r$, one first needs 
to calculate $Z \cdot r$ for each column. To program one column of the array, programming pulses that 
represent the $Z \cdot r$ values of this column are applied on the left side of the array, while this 
column is connected to ground. The rest of the columns are kept at Vdd/2 to prevent programming. 
After one column has been programmed, the next column can be programmed by applying the programming 
pulses and voltages that correspond to it. Therefore, the total time to program the resistive 
cross-point array using this method is O(N), where N is the number of columns of the 
array, and its value ranges from 100 to several thousand, depending on the application.  
Figure 1: The similarity of programming a biophysical synapse and a RRAM cell. (a) STDP based on the time 
interval between pre- and post-synaptic spikes; (b) Tuning of RRAM conductance with a voltage pulse across 
both ends. 
[Panel (a): conductance change ΔG (%) versus spike timing Δt (ms), with LTD for Δt < 0 and LTP for 
Δt > 0, experimental data from [4]. Panel (b): conductance change ΔG (μΩ⁻¹) versus voltage pulse 
width t (ns) for pulse amplitudes of 1.5 V, 1.46 V, -1.46 V and -1.5 V, with the conductance read at 
V = 0.3 V.] 
Exploiting the specific property of resistive cross-point arrays that one can simultaneously apply 
different voltage pulses on each row and column, a parallel programming method is proposed in order 
to parallelize and accelerate the entire programming process, as illustrated in Fig. 2(b). In this 
method, we do not calculate $Z \cdot r$ before programming; instead, pulses that represent Z and 
pulses that represent r are simultaneously applied on the rows and on the columns of the cross-point 
array, respectively. We overlap the Z pulses and r pulses over the write enable period to effectively 
realize the multiplication function and thereby increase or decrease the conductance of the RRAM. 
Specifically, we encode the r value into a train of 1 ns pulses in a fixed time period, and encode 
the Z value into the duty cycle of the write period during which the r pulses can be applied to each 
RRAM cell. Thus, in such a synchronous design, the accumulated overlap time of these two pulses in 
each write cycle represents the product $Z \cdot r$.   
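The overlap principle can be illustrated with a short Python sketch. It relies on a simplified 
timing model that is an assumption of this example rather than a description of the actual circuit: 
one write phase is divided into 16 equal time slots, the Z window covers the first Z slots, and the 
|r| pulses are spread evenly over the phase. The function name and the slot model are hypothetical. 

```python
def overlap_count(z, r_mag, slots=16):
    """Count the r pulses that fall inside the Z window in one write phase.

    Assumed model (not taken from the circuit): the phase has `slots` equal
    time slots, the Z window covers slots 0 .. z-1, and pulse k of the
    r_mag evenly spread pulses lands in slot (k * slots) // r_mag.
    """
    pulse_slots = [(k * slots) // r_mag for k in range(r_mag)]
    return sum(1 for s in pulse_slots if s < z)


# Example with the values shown in Fig. 5(a): Z = 6, |r| = 9.
count = overlap_count(6, 9)
ideal = 6 * 9 / 16  # ideal product scaled to the 16-slot phase
```

Under this model the overlap count differs from the scaled ideal product by less than one pulse, 
which is the quantization effect examined in Section 3. 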
2.2 Circuit Design of Z Pulses 
The write circuit for Z generates the programming pulse with a duty cycle proportional to the 
value of Z in a fixed clock period. To program an RRAM cell, the voltage across the cell should be 
Vdd, while Vdd/2 is not able to change its conductance. Since the programming voltage of r is in the 
range of 0 to Vdd, the effective programming period of the Z pulse should supply either 0 or Vdd 
voltage, and the rest should be Vdd/2.  
Z is always a positive number while r can be positive or negative, depending on the residual error. 
Therefore whether D will increase or decrease depends on the sign of r, but not Z. When r is positive, 
D decreases, and vice versa. Since we do not calculate $Z \cdot r$ up front, the programming voltage of Z 
has to prepare for both positive r and negative r. In our synchronous design, we divide the write 
period into two phases, controlled by the clock. The first phase deals with the condition of r > 0 
(positive period), and the second phase deals with the condition of r < 0 (negative period). In the 
positive period, the effective programming voltage is 0 in a certain time proportional to Z. After this 
time, the programming voltage switches to Vdd/2 to prevent further programming. Similarly in the 
negative period, the effective programming voltage is Vdd, and then the voltage switches to Vdd/2. 
To program an RRAM cell, we can keep the voltage of r as Vdd in the positive period (for r > 0) and 
as 0 in the negative period (for r < 0). Consequently the voltage across the RRAM cell during the 
overlap time of Z and r will be –Vdd and Vdd for r > 0 and r < 0, respectively. Such a voltage overlap 
serves as the basis to tune the RRAM conductance for D. 
To generate such a pulse pattern, a digital circuit is designed, as shown in Fig. 3. The inputs 
include Z[15:0], WE, PN and clock. Z[15:0] is a pre-decoded natural number, representing the value 
of Z from 0 to 16 by the number of ‘1’s in these 16 bits. The ‘1’s are placed contiguously on the 
right side of Z[15:0]. WE is the global write enable signal; writing is performed when WE = 1. PN 
is the signal that distinguishes the positive period from the negative period: PN = 0 means positive 
period and PN = 1 means negative period. The clock signal is an internal clock. There are 32 clock 
cycles in the whole write period. 
		



  
Figure 2: The parallel scheme achieves O(1) in programming speed, independent of the array dimension.  
[Panel (a): traditional sequential programming of the array D, in which the products z[0].r[0], 
z[1].r[0] and then z[0].r[1], z[1].r[1] are applied one column at a time while the other columns are 
held at Vdd/2. Panel (b): the proposed parallel programming.] 
In Fig. 3, the left part of the circuit is a 16-bit shift register. It converts the parallel input 
Z[15:0] into a sequential output. Thus, the time during which the output is 1 is proportional to the 
value of Z. Note that the output of the shift register is connected back to its first stage in order 
to recycle the data Z. With 32 clock cycles in one write period, the shift register generates two 
identical pulses with a duty cycle proportional to the value of Z. These two identical pulses are 
fed to the mux, which is controlled by both WE and PN, to generate different programming voltage 
levels for the positive period and the negative period. With the whole circuit above, we are able to 
convert the value of Z into the duty cycle of pulses for both cases of r > 0 and r < 0. 
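The behavior of the Z write circuit can be sketched in a few lines of Python. The sketch is a 
behavioral model, not the gate-level design, and it makes several assumptions that the text does not 
state: the ‘1’ bits are shifted out first, PN = 0 during clock cycles 0 to 15 and PN = 1 during 
cycles 16 to 31, and the mux outputs Vdd/2 whenever the serial bit is 0 or WE = 0. The function name 
is hypothetical. 

```python
VDD = 1.5  # programming voltage used in the paper (V)


def z_line_waveform(z, we=1):
    """Return the Z-line voltage for each of the 32 clock cycles.

    Assumptions for illustration: the thermometer-coded word is shifted out
    from the '1' end first, PN = 0 in cycles 0-15 and PN = 1 in cycles 16-31,
    and the mux outputs Vdd/2 whenever the bit is 0 or WE = 0.
    """
    bits = [1] * z + [0] * (16 - z)   # z ones, emitted first
    waveform = []
    for cycle in range(32):
        bit = bits[0]
        bits = bits[1:] + [bit]       # serial output fed back to the first stage
        pn = 0 if cycle < 16 else 1
        if we and bit:
            # Positive period drives 0 V, negative period drives Vdd.
            waveform.append(0.0 if pn == 0 else VDD)
        else:
            waveform.append(VDD / 2)
    return waveform


wave = z_line_waveform(6)
```

For Z = 6 the model holds the line at 0 V for the first 6 cycles of the positive period and at Vdd 
for the first 6 cycles of the negative period, with Vdd/2 everywhere else. 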
2.3 Circuit Design of r Pulses 
The write circuit for r generates a train of pulses, where (1) the number of pulses is proportional to 
the value of r, (2) each pulse has a fixed width (for fixed RRAM programming period) and (3) the 
pulses are evenly distributed across a constant write period. Whenever there is overlap between the Z 
window and an r pulse, the fixed pulse width ensures that the conductance of RRAM is changed by a 
fixed amount. The uniform distribution of pulses is important to minimize the quantization error in 
our method, which effectively multiplies Z and r. r can be positive or negative, which decreases or 
increases the RRAM conductance, respectively. Since the required voltage values 
for increasing and decreasing the RRAM conductance are different, each write cycle is separated 
into two phases, where the first phase generates signals for positive r values and the second phase for 
negative r values. 
In order to increase the resistance of the RRAM, a positive voltage of Vdd is required between the 
r and Z nodes. Thus in the first phase, if r is positive, active-high pulses (with the number of 
pulses proportional to r) are generated with a fixed pulse width of 1 ns, while Z is driven low (the 
time at low is proportional to the Z value). Through this operation in the first phase, a fixed 
voltage (VR – VZ = 1.5 V) is applied to each RRAM cell for the accumulated overlap time that 
represents $Z \cdot r$. If r is negative, the output signal is kept low during the first phase, 
ensuring no change in the resistance of the RRAM cells. Similarly in the second phase, if r is 
positive, the output is kept high to ensure no change in the resistance of the RRAM cells. On the 
other hand, if r is negative, active-low pulses are generated in the second phase with a fixed pulse 
width of 1 ns while Z is driven high. Thus a fixed voltage in the direction opposite to that of the 
positive-r case (VZ – VR = 1.5 V) is applied to the RRAM cells for the accumulated overlap time that 
represents $Z \cdot r$. After each write cycle, the RRAM conductance has increased or decreased by an 
amount proportional to $Z \cdot r$.  
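The voltage levels described for the two phases can be checked with a small Python sketch that 
enumerates every combination of phase, sign of r, pulse state and Z window state. The function names 
are hypothetical, the cell voltage is taken as VZ – VR, and the handling of r = 0 (low in the first 
phase, high in the second) is an assumption of this sketch. 

```python
VDD = 1.5  # programming voltage used in the paper (V)


def z_line_level(pn, in_window):
    """Z-line voltage: 0 V (phase 0) or Vdd (phase 1) inside the window, else Vdd/2."""
    if not in_window:
        return VDD / 2
    return 0.0 if pn == 0 else VDD


def r_line_level(r_sign, pn, pulse):
    """r-line voltage for r_sign in {+1, 0, -1}; `pulse` is True during a 1 ns pulse."""
    if pn == 0:
        # First phase: active-high pulses for r > 0, otherwise held low.
        return VDD if (r_sign > 0 and pulse) else 0.0
    # Second phase: active-low pulses for r < 0, otherwise held high.
    return 0.0 if (r_sign < 0 and pulse) else VDD


def cell_voltage(r_sign, pn, pulse, in_window):
    """Voltage across the cell, taken here as V_Z - V_R."""
    return z_line_level(pn, in_window) - r_line_level(r_sign, pn, pulse)


# Enumerate every condition and keep those that see the full programming voltage.
full_bias = {
    (r_sign, pn): cell_voltage(r_sign, pn, pulse, in_window)
    for r_sign in (+1, 0, -1)
    for pn in (0, 1)
    for pulse in (False, True)
    for in_window in (False, True)
    if abs(cell_voltage(r_sign, pn, pulse, in_window)) == VDD
}
```

Only two conditions reach the full magnitude of Vdd: a positive r pulse overlapping the Z window in 
the first phase (–Vdd) and a negative r pulse overlapping the Z window in the second phase (+Vdd). 
Every other combination leaves at most Vdd/2 across the cell. 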
Figure 3: Circuit schematic to generate programming voltage Z. The inset illustrates the pulse pattern 
for both phases. 
[Schematic: a 16-bit shift register of D flip-flops loaded with Z[15] to Z[0] and driven by clock and 
WE, followed by a mux controlled by WE and PN that outputs the programming voltage versus time.] 
The circuit implementation consists of various delay elements forming a configurable ring 
oscillator (RO) with a start and polarity control. Write Enable (WE) and the sign bit of r determine 
the phase in which the pulses are generated and their polarity. The number of pulses during the 
write period is varied by changing the length of the ring oscillator and thus its frequency. This is 
achieved by switches that determine the total gate delay in the ring oscillator. The 
control of the switches is generated from the r value, ensuring that only one switch is on for a 
particular value of r. When r = 0, no change in the RRAM conductance is allowed. In total, 15 buffer 
stages (d1-d15) in Fig. 4 are implemented with different delay values, such that the number of pulses 
generated in each write cycle is proportional to the r value. The fixed-width pulse (1 ns) is 
generated after each rising edge of the RO output. Based on the sign bit of r and the write phase 
(PN), the final mux stages select among Vdd, 0, the pulse generator output, or the inverted pulse 
generator output.   
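The relation between the r value and the oscillator setting can be sketched as simple arithmetic. 
The snippet below assumes that one write phase lasts 42 ns (half of the 84 ns write period reported 
in Section 3) and that one pulse is produced per oscillator period. Both are assumptions of this 
sketch, and the function name is hypothetical. 

```python
PHASE_NS = 42.0  # assumed duration of one write phase (half of 84 ns)


def ro_period_ns(r_mag, phase_ns=PHASE_NS):
    """Oscillator period that yields r_mag rising edges in one phase.

    Returns None for r_mag = 0, where no pulse is generated and the
    conductance must not change.
    """
    if r_mag == 0:
        return None
    return phase_ns / r_mag


# One setting per non-zero magnitude of a 4-bit r value.
periods = {r_mag: ro_period_ns(r_mag) for r_mag in range(1, 16)}
```

Under these assumptions the shortest period, for |r| = 15, is about 2.8 ns, which still leaves room 
for the fixed 1 ns pulse after each rising edge. 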
Figure 4: Circuit schematic to generate the programming pulses of r. 
[Schematic: the r[0:3] input configures a ring oscillator built from delay stages d1 to d15; a pulse 
generator and output muxes controlled by sign, WE and PN produce the programming voltage for 
positive r and negative r.] 
3 Experimental Results 
In this section, we show the simulation results of the overall system that consists of parallel 
programming circuits and RRAM cells. The write circuitries for Z and r are implemented in 65nm 
CMOS technology, and we used a SPICE model for the RRAM device that was generated from device 
measurements. 
Figure 5: Timing diagram of the programming system, from SPICE simulation of a 65nm technology. 
Through the overlap in time between Z and r pulses, it demonstrates that (a) D decreases when r > 0 
(Z = 6, r = 9). (b) D increases when r < 0 (Z = 10, r = -7). 
[Each panel plots the WE, Z and r waveforms between 0 and 1.5 V together with the resulting D over a 
time axis of 0 to 100 ns.] 
Figure 5 shows the timing diagram of the parallel programming system. When the write enable 
(WE) signal turns on, both Z and r write circuitries start pulse programming based on the values of Z 
and r, and thus change the value of D during the overlap time of the two pulses. In Figure 5(a), it is 
shown that when r is positive, the programming occurs in the first half of the write period and the 
value of D decreases. Figure 5(b) illustrates that when r is negative, the programming happens in the 
second half of the write period and the value of D increases. Independent of the array size, the 
parallel programming of the entire array requires only 84 ns, while the sequential programming 
requires 1.6 μs for a 400 x 100 array. The simulation also shows that the energy consumption of 
parallel programming of the 400 x 100 array is about 13.9 nJ. The layout areas of Z and r circuitries 
are 850 μm2 and 1154 μm2, respectively.  
The method of using the overlap time of Z and r pulses with a certain granularity to calculate the 
multiplication introduces quantization error. To analyze this, we performed simulations for all 
combinations of Z and r values (both from 0 to 16). Figure 6 compares the simulated 
results of $Z \cdot r$, namely the accumulated overlap time of Z and r pulses, to an ideal $Z \cdot r$ 
multiplication. It is observed that the digital programming mostly follows the ideal multiplication 
closely, while producing a maximum error of 1 bit (out of 16 bits) when both Z and r are small. 
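A quantization sweep of this kind can be sketched in Python under a simplified model. The model is 
an assumption of this example and not the SPICE setup: the |r| pulses are spread uniformly over 16 
slots and the Z window spans the first Z slots, which gives an overlap count equal to Z·|r|/16 
rounded up. The error figures produced by the sketch are therefore properties of the model and are 
not expected to match the reported RMS value. 

```python
import math


def programmed_product(z, r, slots=16):
    """Signed overlap count under the assumed uniform-pulse model."""
    mag = -(-(z * abs(r)) // slots)   # integer form of ceil(z * |r| / slots)
    return mag if r >= 0 else -mag


# Sweep all Z magnitudes and signed r values (ranges chosen for illustration).
errors = [
    programmed_product(z, r) - z * r / 16
    for z in range(17)
    for r in range(-16, 17)
]
max_error = max(abs(e) for e in errors)
rms_error = math.sqrt(sum(e * e for e in errors) / len(errors))
```

In this model the error always stays below one pulse and is largest when the product Z·|r| is small, 
for example at Z = 1 and |r| = 1. 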
We compared the proposed system against a software implementation on the task of updating the 
dictionary D. For this purpose, we used the MNIST [9] data set to extract image features with the 
Stochastic Gradient Descent [2] algorithm. For the software approach, we used an Intel Core i5 
2.4 GHz dual-core processor with 4 GB of memory. Figure 7 shows the dictionary D before and after 
the feature extraction. The computation time consumed to update D for this entire dictionary 
learning process is 750 μs per image patch (10 iterations). With our proposed hardware approach, a 
400 x 100 resistive cross-point array is used to achieve a computation time of 840 ns per image 
patch, which is a 900X improvement over the software implementation for the identical dictionary 
learning. 
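The reported speedup follows from the two timing figures, as the short calculation below shows. 
Reading the 840 ns as ten write cycles of 84 ns each, one per iteration, is an interpretation of 
the numbers given in this section rather than a statement taken from the measurements. 

```latex
\[
t_{\mathrm{HW}} = 10 \times 84\,\mathrm{ns} = 840\,\mathrm{ns}, \qquad
\frac{t_{\mathrm{SW}}}{t_{\mathrm{HW}}}
  = \frac{750\,\mu\mathrm{s}}{840\,\mathrm{ns}}
  \approx 893 \approx 900\times
\]
```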
Figure 6: The quantization error of the parallel programming method, with the maximum error at 1 bit 
(6.25%).  
[Plot: digitally programmed (Zr) versus theoretical (Zr), both axes from -15 to 15, out of 16-bit 
data. Maximum error = 1 bit; RMS error = 0.326 bit.] 
Figure 7: Demonstration of the proposed method in updating the dictionary, showing D before learning 
and after learning. Current software approach: Processor: Intel Core i5 2.4 GHz, 2 cores; Memory: 
4 GB; Computing time: 750 μs per image patch. Proposed parallel programming hardware approach: RRAM 
array dimension: 400 x 100; Computing time: 840 ns per image patch. 
4 Summary 
In this paper, we proposed a parallel programming scheme for dictionary learning applications that 
can update the entire resistive cross-point array during one write cycle, where each dictionary value is 
represented by the conductance of a cross-point RRAM cell. Inspired by plastic synapses in biology, 
the programming strength of resistive devices is encoded into the number of pulses (spikes) and duty 
cycle of the pre-synaptic and post-synaptic terminals, which are simultaneously applied to all RRAM 
cells in the array. Peripheral write circuitries were implemented by digital circuits in 65nm CMOS, 
and the simulation with RRAM device models was performed to demonstrate conductance tuning of 
the RRAM cells. With the proposed approach, programming a 40kb RRAM array is completed in 84 
ns, which demonstrates 900X acceleration for an image feature extraction task when compared to 
a state-of-the-art software approach to sparse coding. 
Acknowledgements 
 This research was supported in part by the following sources: (1) NSF PFI-BIC award no. 1237856, 
(2) NSF FRP award no. 1230401, (3) Raytheon through CES, (4) Samsung GRO program. 
References 
1. I. Tosic and P. Frossard, “Dictionary learning,” IEEE Signal Process. Mag., vol. 28, no. 2, pp. 
27–38, Mar. 2011. 
2. L. Bottou and O. Bousquet, “The tradeoffs of large-scale learning,” Optimization for Machine 
Learning, page 351, 2011. 
3. S. Song, et al., “Competitive Hebbian learning through spike-timing-dependent synaptic 
plasticity,” Nature Neuroscience, pp. 919-926, 2000. 
4. G. Bi and M. Poo, “Synaptic Modifications in Cultured Hippocampal Neurons: Dependence on 
Spike Timing, Synaptic Strength, and Postsynaptic Cell Type,” Journal of Neuroscience, vol. 18, no. 
24, pp. 10464-10472, Dec. 1998. 
5. S. H. Jo, et al., “Nanoscale memristor device as synapse in neuromorphic systems,” Nano 
Letters, vol. 10, no. 4, pp. 1297–1301, 2010. 
6. J. Seo, et al., “A 45nm CMOS neuromorphic chip with a scalable architecture for learning in 
networks of spiking neurons,” IEEE Custom Integrated Circuits Conference, 2011. 
7. H.-S. P. Wong, et al., “Metal Oxide RRAM,” Proceedings of the IEEE, vol. 100, no. 6, pp. 
1951-1970, 2012. 
8. S. Yu, et al., “A low energy oxide-based electronic synaptic device for neuromorphic visual 
system with tolerance to device variation,” Advanced Materials, vol. 25, no. 12, pp. 1774-1779, 2013. 
9. http://yann.lecun.com/exdb/mnist/ 
 
