Efficient Mini-batch Training on Memristor Neural Network Integrating Gradient Calculation and Weight Update by YAMAMORI, Satoshi et al.
Title: Efficient Mini-batch Training on Memristor Neural Network Integrating Gradient Calculation and Weight Update
Author(s): YAMAMORI, Satoshi; HIROMOTO, Masayuki; SATO, Takashi
Citation: IEICE Transactions on Fundamentals of Electronics, Communications and Computer Sciences (2018), E101-A(7): 1092-1100
Issue Date: 2018-07-01
URL: http://hdl.handle.net/2433/242229
Right: © 2018 The Institute of Electronics, Information and Communication Engineers. Posted in accordance with the terms of permission.
Type: Journal Article
Textversion: publisher
Kyoto University
IEICE TRANS. FUNDAMENTALS, VOL.E101-A, NO.7 JULY 2018
PAPER 
Efficient Mini-Batch Training on Memristor Neural Network 
Integrating Gradient Calculation and Weight Update 
Satoshi YAMAMORI†a), Student Member, Masayuki HIROMOTO†, and Takashi SATO†, Members
SUMMARY We propose an efficient training method for memristor
neural networks. The proposed method is suitable for mini-batch-based
training, which is a common technique for various neural networks. By
integrating the two processes of gradient calculation in the backpropagation
algorithm and weight update in the write operation to the memristors, the
proposed method accelerates the training process and also eliminates the
external computing resources required by the existing method, such as
multipliers and memories. Through numerical experiments, we demonstrate
that the proposed method converges twice as fast as the existing method
while retaining the same level of classification accuracy.
key words: memristor, neural network, mini-batch training, stochastic
gradient descent
1. Introduction 
Recent progress in neural networks is remarkable. The performance
of neural networks has become comparable to, or has even surpassed,
human ability in various fields such as image recognition, speech
recognition, and automatic translation [1]-[4]. This progress owes
greatly to the advance of computer technologies in line with Moore's
law. However, recent processors, such as CPUs and GPUs, are facing a
serious obstacle: the computation in state-of-the-art neural network
architectures demands a large amount of memory access for input data
and network parameters.
A memristor neural network (MNN) [5], [6] is one of
the emerging computing technologies to resolve the above
problem. A memristor [7]-[9], which is also called a
ReRAM, is a passive electric element whose resistance is
changed by the current passing through the device, and can
be used as a low-power non-volatile memory [6], [10]. By
employing a crossbar array structure, the memristors are
also capable of performing multiply-and-accumulate (MAC)
operations [5], [11]-[13]. This enables so-called in-memory
calculation, which has the potential to overcome the von
Neumann bottleneck of modern processors. Various kinds of
MNNs have been proposed [13]-[16], most of which
are reported to realize much better energy efficiency than
conventional digital processing.
The training of the MNNs is realized by applying high
voltages to the memristors to update their conductances according
to the parameters of the neural networks. There are two approaches
to train the MNN: ex-situ and in-situ methods [17].
In the ex-situ method, the parameters, or weights, of the
neural networks are calculated separately from the MNN by
executing a training algorithm on ordinary digital computers.
This method is simple and is therefore used in various existing
works [11], [17]. However, it is difficult to accurately write
the weights to the memristors due to device variation. Since
errors in the weights may deteriorate the classification
performance of the neural networks, additional adjustment
of the memristor resistance is necessary. On the
other hand, the in-situ method performs network training on
the MNN itself. Although this method requires additional
peripheral circuits dedicated to training, it has advantages
in terms of robustness against device variation
and the capability to realize on-line training.
Manuscript received October 31, 2017.
Manuscript revised March 7, 2018.
†The authors are with the Department of Communications and
Computer Engineering, School of Informatics, Kyoto University,
Kyoto-shi, 606-8501 Japan.
a) E-mail: paper@easter.kuee.kyoto-u.ac.jp
DOI: 10.1587/transfun.E101.A.1092
Kataeva et al. [18] proposed an in-situ training method
for the MNN, in which the backpropagation algorithm is
efficiently combined with the weight update process on the
MNN. This method achieved a good training result on an
MNN having variations of the memristor devices, as
reported by Prezioso et al. [13]. However, this method is
not suitable for algorithms that use a mini-batch technique.
Mini-batch-based training is known to be fast and robust,
and it is used in combination with various optimization
algorithms such as the stochastic gradient descent (SGD)
method [19]. Kataeva's method is unsuitable because it
requires not only a large number of update iterations
in proportion to the size of the MNN, but also additional
computation resources, such as multipliers and memories,
to calculate and store the intermediate data.
In this paper, we propose the weight dividing update
(WDU) method, an efficient in-situ training method for
MNNs with the mini-batch technique. The proposed method
divides the weight update into discrete processes for each
sample in a mini-batch. As a result, we can combine two
processes into one: gradient calculation in the backpropagation
algorithm, and weight update in the write operation of
the memristor. This integration eliminates the external
multipliers and memories that have been indispensable for
the existing method, and also shortens the training process
by parallel execution of the weight update for all the
memristors on the crossbar array.
Copyright © 2018 The Institute of Electronics, Information and Communication Engineers

Algorithm 1 Stochastic gradient descent (SGD)
1: Choose an initial weight matrix W
2: for i = 1, ..., N_epochs do
3: Randomly sample a mini-batch B from the training set
4: Calculate gradient ΔW on the mini-batch B
5: Update weight: W ← W + ΔW
6: end for

The contribution of this paper is summarized as follows.
• A weight dividing update (WDU) method for efficient
mini-batch training on memristor neural networks
(MNNs) is newly proposed.
• WDU eliminates the external multipliers and reduces 
memory usage compared to the existing method [18] . 
• WDU can accelerate the training process by parallel 
weight update of the memristors. 
• The training accuracy is evaluated through circuit sim-
ulation. The result shows the proposed WDU achieves 
M / K times faster convergence of the training compared 
to the existing method [18] , while keeping the same 
classification accuracy. Here, M and K are a network 
size and a mini-batch size, respectively. The mini-batch 
is a subset of the training dataset, from which the gra-
dient is calculated for SGD to update weights of the 
neural network. The mini-batch size K is the number 
of the data elements in the mini-batch. 
The remainder of the paper is organized as follows. 
First, the fundamentals of conventional MNNs are briefly 
reviewed in Sect. 2. Then the proposed WDU method is 
described in Sect. 3, and the performance evaluation through
circuit simulation is conducted in Sect. 4. Finally, the paper is
concluded in Sect. 5. 
2. Memristor Neural Network 
2.1 Neural Network 
A neural network is one of the brain-inspired mathematical 
models that have an ability to approximate functions through 
learning on a large training data [19]. The neural network 
has a hierarchical network structure that consists of simple 
units called neurons. Figure 1 shows a neuron model and 
an example of the multi-layer neural network. The output of 
the neuron is given by 
y = f(Wz), \qquad (1)
where z, W, and f(·) are an input vector, a weight matrix,
and an activation function, respectively.
In training the neural networks, stochastic gradient
descent (SGD) [19] is widely used. Algorithm 1 shows a
pseudocode of the SGD, in which gradient calculation (Line 4)
and weight update (Line 5) are iteratively executed for each
epoch until the iteration reaches a predefined number of
epochs, N_epochs. The weight gradient ΔW is obtained by the
backpropagation method as follows:
e_L(k) = t(k) - y(k), \qquad (2)
e_l(k) = \left( (W_{l+1})^{T} e_{l+1}(k) \right) \odot f'(W_l z_l(k)), \qquad (3)
\Delta_l(k) = e_l(k) \, (z_l(k))^{T}, \qquad (4)
\Delta W_l = \eta \frac{1}{K} \sum_{k=1}^{K} \Delta_l(k). \qquad (5)
Fig. 1 A neuron model and an example of a multi-layer neural network.
A neuron works as a scalar function that calculates an inner product of the
input z and the weight w and outputs y through an activation function f(·).
The neural network consists of multiple neurons connected to each other.
The red and blue arrows represent the directions of the forward and back
propagation, respectively.
Here we assume an L-layer neural network, and z_l, W_l, and
e_l are the input, weights, and errors for the l-th layer,
respectively. y is the output of the final layer, i.e., the result
of the forward propagation of the neural network. t is the target
value of the output, and the error at the final layer e_L is
calculated by Eq. (2). Note that the argument k for each
variable indicates that those values correspond to the k-th training
sample in a mini-batch B whose size is K. The errors at the
l-th layer can be obtained from the errors at the (l + 1)-th
layer as shown in Eq. (3). The operator symbols ⊙, T, and '
mean Hadamard product, matrix transpose, and derivative,
respectively. By repeating this process, errors for all the
layers are obtained sequentially from the last layer to the first
layer. This is why this method is called backpropagation.
After having all the errors e_l, the gradient of the weights to
be updated is calculated by Eqs. (4) and (5), where η is a
constant called the learning rate.
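As an illustration of Eqs. (2)-(5), the following is a minimal sketch in plain Python for a two-layer network. The function names `matvec` and `batch_gradients`, the use of tanh in both layers, and the list-of-rows matrix layout are assumptions made only for this example and are not taken from the paper.

```python
import math

def matvec(W, v):
    """Matrix-vector product for a list-of-rows matrix (hypothetical helper)."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def batch_gradients(W1, W2, batch, eta):
    """Return (dW1, dW2) for a two-layer tanh network, following Eqs. (2)-(5)."""
    K = len(batch)
    dW1 = [[0.0] * len(W1[0]) for _ in W1]
    dW2 = [[0.0] * len(W2[0]) for _ in W2]
    for z1, t in batch:
        a1 = matvec(W1, z1)                        # pre-activation of layer 1
        z2 = [math.tanh(a) for a in a1]            # input to layer 2
        y = [math.tanh(a) for a in matvec(W2, z2)]
        e2 = [ti - yi for ti, yi in zip(t, y)]     # Eq. (2): error at the final layer
        back = matvec([list(c) for c in zip(*W2)], e2)   # W2^T e2
        # Eq. (3): Hadamard product with the derivative of tanh
        e1 = [b * (1.0 - math.tanh(a) ** 2) for b, a in zip(back, a1)]
        for dW, e, z in ((dW1, e1, z1), (dW2, e2, z2)):
            for i, ei in enumerate(e):
                for j, zj in enumerate(z):
                    dW[i][j] += eta * ei * zj / K  # Eqs. (4) and (5)
    return dW1, dW2
```

Calling `batch_gradients` with a list of `(input, target)` pairs returns the averaged updates that Line 5 of Algorithm 1 would add to the weights.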
2.2 Memristor Neural Network 
2.2.1 Memristor Device 
A memristor is a passive two-terminal component theoretically
predicted by Chua in the 1970s [7] and found by Strukov et
al. in 2008 [9]. Figure 2(a) shows a typical structure of the
memristor, which consists of a metal oxide layer between
two metal layers. The conductance of the memristor changes
depending on the current passing through the device. The
I-V characteristic of the memristor is shown in Fig. 2(b), in
which a hysteresis loop is found. The memristor works in
two operation modes: read and write. In the read mode, a
low read voltage V_r is applied to the memristor so that its
conductance does not change and the device behaves as a
constant conductance. The read current is given by
read: I = G V_r, \qquad (6)
where G is the conductance of the memristor. On the other
hand, in the write mode, the conductance of the memristor
is altered by applying a high write voltage V_w for a certain
period, ΔT. The relative change of the conductance is given
by
write: \frac{\Delta G}{G} = \delta_G \propto \Delta T \exp(\alpha V_w), \qquad (7)
where α is a device-dependent constant [20].
Fig. 2 The physical abstraction model and the I-V characteristic of the
memristor device: (a) physical model, (b) I-V characteristic.
Fig. 3 Schematics of the memristor crossbar array operating in three
modes: (a) forward propagation, (b) back propagation, and (c) weight
update by V/2 scheme.
2.2.2 Crossbar Array 
A crossbar array is a typical circuit structure to implement 
neural networks using memristors [12] . Figure 3 shows an 
example of the memristor crossbar array, in which the mem-
ristors are connected at each intersection of the horizontal 
input line and the vertical output line. A pair of the mem-
ristors is used to represent the positive and negative weights 
of the neural network. The crossbar array works in three 
modes: forward propagation, back propagation, and weight 
update. 
First, the forward propagation mode is explained using
Fig. 3(a). Let the conductances of the positive and negative
memristors at the intersection of the i-th column and the j-th
row be G_ij^+ and G_ij^- (i = 1, ..., M, j = 1, ..., N),
respectively. When voltages V_j are applied to the input terminals
in_j, the currents flowing through the output terminals out_i^+
and out_i^- become
I_i^{+} = \sum_{j=1}^{N} G_{ij}^{+} V_j, \quad I_i^{-} = \sum_{j=1}^{N} G_{ij}^{-} V_j. \qquad (8)
In matrix representation, this operation can be written as
I_{fwd} = G V_{fwd}, \qquad (9)
where I_fwd is a vector of differential output currents I_i =
I_i^+ - I_i^-, and V_fwd is a vector of the input voltages V_j. Each
element in the conductance matrix G represents the difference
of the pairwise conductances, G_ij = G_ij^+ - G_ij^-. When
we associate G and V_fwd with the weights W and the inputs
z of neurons in Eq. (1), respectively, the crossbar array
realizes the dot product operations to obtain the output y
as the output currents I_fwd. As such, in the crossbar array,
the forward propagation of the neural network is conducted
using only passive components, without costly
digital multipliers.
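The differential read-out of Eqs. (8) and (9) can be sketched numerically as follows. This is only an idealized illustration: the function name `crossbar_forward` and the nested-list layout (one inner list per output column) are assumptions for this example, and wire resistance and device non-idealities are ignored.

```python
def crossbar_forward(G_pos, G_neg, v_in):
    """Differential output currents I_i = I_i^+ - I_i^- of an ideal crossbar.

    G_pos[i][j] and G_neg[i][j] are the conductances (in siemens) of the
    positive and negative memristors connecting input row j to output column i;
    v_in[j] is the read voltage on row j (in volts).
    """
    i_pos = [sum(g * v for g, v in zip(row, v_in)) for row in G_pos]  # Eq. (8), I_i^+
    i_neg = [sum(g * v for g, v in zip(row, v_in)) for row in G_neg]  # Eq. (8), I_i^-
    return [p - n for p, n in zip(i_pos, i_neg)]                      # Eq. (9)
```

Each output current is the inner product of one row of the effective weight matrix G = G^+ - G^- with the input voltages, which is the multiply-and-accumulate operation of Eq. (1) before the activation function.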
Next, the backpropagation is also realized as shown in
Fig. 3(b). Compared to the forward propagation mode, the
inputs and the outputs of the crossbar array are swapped. By
applying voltages V_bwd to the bottom terminals (which were
the output terminals in the forward mode) of the crossbar
array, the currents flowing through the left terminals (input
terminals in the forward mode) become
I_{bwd} = G^{T} V_{bwd}, \qquad (10)
which can realize the dot product operations of W_{l+1}^T e_{l+1}(k)
in Eq. (3), by associating G and V_bwd with W_{l+1} and e_{l+1}(k),
respectively.
Finally, the weight update mode is explained. In this
paper, the "V/2 scheme" [6], [13] is used to update the conductance
of the memristors in the crossbar array. This method makes it
possible to change the conductance of only the selected memristor
while retaining those of the other memristors. Figure 3(c)
shows an example of the voltage application to update the
conductances G_00^+ and G_00^- only. The terminals connected to
the target memristors are set to V_w/2 and -V_w/2, while the
other terminals are set to 0 V. Here, V_w is a write voltage
that is determined to satisfy
\frac{V_w}{2} < V_{th} < V_w, \qquad (11)
where V_th is a threshold voltage under which the conductance
of the memristor is almost unchanged. When the voltages
are applied, the conductances G_00^+ and G_00^- are changed
since the applied voltage V_w is higher than the threshold V_th,
whereas the conductances of the other memristors remain the
same since the applied voltage does not reach the threshold
voltage. In this way, by controlling the voltages applied to the
crossbar array, only the desired memristors can be updated.
The determination of the write voltage V_w and the efficient
weight update method on the crossbar array are detailed in
the next section.
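The selectivity of the V/2 scheme can be illustrated with a small sketch. It assumes the usual half-select condition V_w/2 < V_th < V_w and an ideal array; the function names `v2_cell_voltages` and `updated_cells` are hypothetical and are not part of the paper.

```python
def v2_cell_voltages(n_rows, n_cols, sel_row, sel_col, v_w):
    """Voltage across every cell when one row is driven to +V_w/2,
    one column to -V_w/2, and all other terminals are held at 0 V."""
    v_row = [v_w / 2 if j == sel_row else 0.0 for j in range(n_rows)]
    v_col = [-v_w / 2 if i == sel_col else 0.0 for i in range(n_cols)]
    return [[v_row[j] - v_col[i] for i in range(n_cols)] for j in range(n_rows)]

def updated_cells(voltages, v_th):
    """Cells whose applied voltage exceeds the threshold, i.e. whose conductance changes."""
    return [(j, i) for j, row in enumerate(voltages)
            for i, v in enumerate(row) if abs(v) > v_th]
```

With V_w = 2.0 V and an assumed threshold of 1.5 V, the selected cell sees the full 2.0 V, cells sharing its row or column see 1.0 V, and all others see 0 V, so only the selected cell is reported by `updated_cells`.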
2.2.3 Variable-Amplitude Scheme 
In this section, we describe the "variable-amplitude" scheme
proposed by Kataeva et al. [18] for in-situ training of
neural networks on the memristor crossbar array. From Eq. (7),
the conductance change ΔG is determined by the period
ΔT and the amplitude V_w of the voltage application. In the
variable-amplitude scheme, the write voltage V_w is changed
to control the conductance, while the period ΔT is kept
constant. The advantage of this method is that it can substitute
multiplication of two variables with subtraction of two
voltages applied to the memristor. For example, assume that we
want to update the conductance by ΔG ∝ XY, where X and
Y are two variables whose product is proportional to ΔG and
satisfying |X|, |Y| > 1. Let the two voltages V_X and V_Y be
V_X = \mathrm{Sign}(XY) \ln |X|, \qquad (12)
V_Y = -\mathrm{Sign}(XY) \ln |Y|. \qquad (13)
When these two voltages are applied to the two terminals of
the memristor, its conductance changes by
\Delta G \propto \exp(V_w) \qquad (14)
= \exp(V_X - V_Y) \qquad (15)
= \exp(\mathrm{Sign}(XY) \ln |XY|) \qquad (16)
= XY. \qquad (17)
This means that the multiplication of X and Y can be realized
just by applying the corresponding voltages V_X and V_Y.
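The core idea, that a difference of logarithmic voltages turns into a product through the exponential write characteristic, can be checked with a short sketch. As a simplifying assumption for this example, the magnitude is computed from ln|X| and ln|Y| and the sign of the product is applied separately as the polarity of the update; the function name `log_voltage_product` is hypothetical.

```python
import math

def log_voltage_product(x, y):
    """Emulate the variable-amplitude multiplication for nonzero x and y.

    The magnitude follows exp(V_X - V_Y) with V_X = ln|x| and V_Y = -ln|y|;
    the sign of x*y is handled separately (an assumption of this sketch).
    """
    v_x = math.log(abs(x))
    v_y = -math.log(abs(y))
    magnitude = math.exp(v_x - v_y)      # equals |x| * |y|
    sign = 1.0 if x * y > 0 else -1.0
    return sign * magnitude
```

For example, `log_voltage_product(2.0, 3.0)` evaluates exp(ln 2 + ln 3), which reproduces the product up to floating-point rounding.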
This property of the variable-amplitude scheme enables
effective in-situ training on the memristor crossbar array
when not using a mini-batch, i.e., K = 1. In this case, the
conductance of the memristor at the intersection of the i-th
column and the j-th row should be updated by ΔG_ij ∝
ΔW_ij = η e_i z_j according to Eq. (5). This update is realized
in a fully parallel manner on the crossbar array by applying
voltages V_j ∝ ln|z_j| to the j-th row and V_i ∝ ln|e_i| to the
i-th column, respectively, as shown in Fig. 4. Note that V_i and
V_j must be chosen to satisfy the V/2 scheme condition in
Eq. (11). Since the signs of the two values z_j and e_i have
four combinations of (+, +), (+, -), (-, -), and (-, +), the
update process must be separately executed in four phases
corresponding to these sign combinations, which is detailed
in [18]. Thus the number of voltage applications for
each weight update becomes N_va = 4. The advantage of this
method is that it does not require external circuits to calculate
the multiplication z_j × e_i; it requires only N + M memories
to store z_j and e_i during the backpropagation process.
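The four-phase update for K = 1 can be sketched as follows. This is a purely numerical illustration: each phase updates only the cells whose (z_j, e_i) signs match that phase, and the voltage-level details described in [18] are not modeled. The function name `four_phase_update` and the in-place list update are assumptions for this example.

```python
def four_phase_update(W, e, z, eta):
    """Apply W[i][j] += eta * e[i] * z[j] in four sign phases (modifies W in place).

    Returns the number of voltage applications, N_va = 4.
    """
    def sign(v):
        return (v > 0) - (v < 0)

    phases = [(1, 1), (1, -1), (-1, -1), (-1, 1)]   # (sign of z_j, sign of e_i)
    for s_z, s_e in phases:
        for i, e_i in enumerate(e):
            for j, z_j in enumerate(z):
                # Within one phase all matching cells are updated in parallel
                # on the crossbar; the loops here only emulate that.
                if sign(z_j) == s_z and sign(e_i) == s_e:
                    W[i][j] += eta * e_i * z_j
    return len(phases)
```

Cells for which z_j or e_i is zero are never selected, which is consistent with their update value being zero.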
In the case of mini-batch training with K > 1,
however, the above method does not work efficiently. From
Eq. (5), the conductance update of the memristor should
be
\Delta G_{ij} \propto \Delta W_{ij} = \eta \frac{1}{K} \sum_{k=1}^{K} e_i(k) z_j(k),
which cannot be decomposed into a simple multiplication of two
variables, such as e_i(k) and z_j(k). Therefore, we must calculate
ΔW_ij off the crossbar, and apply voltages V_j ∝ ln(|ΔW_ij|/C) to the
j-th row and V_i ∝ ln(C) to the i-th column, respectively.
Note that C > 0 is a constant such that V_i and V_j satisfy the
V/2 scheme condition in Eq. (11). Because the values of V_j
are different for each i-th column, the update process requires
M iterations as shown in Alg. 2 and Fig. 5. Each iteration is
executed in two phases since C is a positive constant and only
the sign of ΔW_ij must be considered. Thus, the total number
of voltage applications becomes N_va = 2M. This means
that the time required for the weight update increases in
proportion to the size of the crossbar. In
addition, unlike the case of K = 1, this method requires
KMN multiplications and MN memories in the external
circuit to calculate and store ΔW_ij, which may deteriorate
the overall efficiency of a system that is based on the MNN.

Fig. 4 Weight update by existing method with K = 1. Row voltages V_j ∝ ln(z_j(1)) and column voltages V_i ∝ -ln(e_i(1)) are applied to the crossbar array.

Algorithm 2 Weight update by existing method with K > 1
K ← mini batch size
M ← row sizes
k ← 1
while k ≤ K do
    Do forward propagation
    Store the input data z_l^k for each layer l
    Do back propagation
    Store the update value ΔW_l for each layer l
    k ← k + 1
end while
m ← 1
while m ≤ M do
    Update weights in the m-th column by ΔW_l
    m ← m + 1
end while
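The cost of the existing mini-batch procedure can be made concrete with a small counting sketch. It mirrors the structure of Algorithm 2 for a single layer: the gradient is accumulated off the crossbar (counting one multiplication per sample and weight), and the weights are then written one column at a time with two voltage applications each. The function name `existing_minibatch_update` and the data layout are assumptions for this example.

```python
def existing_minibatch_update(W, errors, inputs, eta):
    """Column-by-column mini-batch update for one layer (modifies W in place).

    W is M x N (M output columns, N input rows); errors holds K vectors of
    length M and inputs holds K vectors of length N.
    Returns (external multiplications, voltage applications).
    """
    K, M, N = len(errors), len(W), len(W[0])
    dW = [[0.0] * N for _ in range(M)]     # MN external memory cells
    mults = 0
    for e, z in zip(errors, inputs):       # external multiply-accumulate
        for i in range(M):
            for j in range(N):
                dW[i][j] += e[i] * z[j]
                mults += 1
    n_va = 0
    for i in range(M):                     # one crossbar column per iteration
        for j in range(N):
            W[i][j] += eta * dW[i][j] / K
        n_va += 2                          # two phases: positive and negative updates
    return mults, n_va
```

For K = 3, M = 2, and N = 4 the sketch reports 24 multiplications (KMN) and 4 voltage applications (2M), in line with the counts stated above.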
3. Weight Dividing Update 
In this section, we propose a novel weight update method
called "weight dividing update (WDU)," which realizes
efficient mini-batch training on the memristor crossbar array. As
described in the previous section, the existing method [18]
does not work efficiently for mini-batch training since
the multiplication and the summation for the K samples in
Eqs. (4) and (5) must be calculated in the external system.
In contrast, our proposed method performs this calculation
inside the crossbar array in addition to the weight update.

Fig. 5 Weight update by the conventional method when K > 1. One
column in each iteration is selected to apply column voltage, and the
memristors in the selected column are simultaneously updated. Row
voltages are proportional to ln(ΔW_ij/C); the selected column receives
-ln(C) and the other columns 0. Memory usage: O(NM). Number of voltage
applications: O(M).

Algorithm 3 Weight dividing update method
K ← mini batch size
k ← 1
while k ≤ K do
    Do forward propagation
    Store the input data z_l^k for each layer l
    Do back propagation
    Store the errors e_l^k for each layer l
    k ← k + 1
end while
k ← 1
while k ≤ K do
    Update weights using the input data z_l^k and the errors e_l^k
    k ← k + 1
end while
Algorithm 3 shows the procedure of the weight update
by the proposed method. First, the input data z(k) and the
corresponding errors e(k) are calculated through forward
and backward propagation for each training sample k in the
mini-batch. Then the weight update is executed for each k,
without calculating the summation over the K samples in Eq. (5).
In other words, the weight update ΔW is divided into K values of
z_j(k) · e_i(k), and they are sequentially applied through K
iterations. This method brings another great advantage:
the multiplications of z_j(k) and e_i(k) can be realized on the
crossbar array in a fully parallel manner, as in the case of
the existing method [18] with K = 1. Figure 6 illustrates
the update flow of the proposed method, in which, for each
iteration, all the memristors are updated in parallel. Note that
each update is identical to that of the existing method [18]
with K = 1, and it requires four phases due to the sign
combination of the values. Thus the total number of
voltage applications is N_va = 4K for the proposed WDU.
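A minimal numerical sketch of the divided update for one layer is given below. It assumes an idealized, linear weight update in which each of the K partial updates adds (η/K) e_i(k) z_j(k) to the weight; under that assumption the accumulated result equals the mini-batch update of Eq. (5). The function name `wdu_update` is hypothetical.

```python
def wdu_update(W, errors, inputs, eta):
    """Apply K per-sample rank-1 updates to W in place (idealized WDU).

    errors holds K error vectors e(k), inputs holds K input vectors z(k).
    Returns the number of voltage applications, N_va = 4K.
    """
    K = len(errors)
    n_va = 0
    for e, z in zip(errors, inputs):
        # On the crossbar, all cells are updated in parallel for one sample;
        # no per-weight sum over the mini-batch is stored externally.
        for i, e_i in enumerate(e):
            for j, z_j in enumerate(z):
                W[i][j] += (eta / K) * e_i * z_j
        n_va += 4          # four sign phases per sample
    return n_va
```

Only the K input vectors and K error vectors need to be kept, which corresponds to the K(M + N) external memory of the proposed method, and no external multiplier appears in the loop that a crossbar would execute.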
Fig. 6 Weight update by the proposed method. The memristors within a
colored rectangle are updated in parallel for each iteration. In iteration k,
row voltages proportional to ln(z_j(k)) and column voltages proportional
to -ln(e_i(k)) are applied. Memory usage: O(K(N + M)). Number of voltage
applications: O(K).

Table 1 Comparison of the existing methods and the proposed method WDU.
                                  Existing [18]   Existing [18]   WDU
                                  (K = 1)         (K > 1)         (All K)
# Voltage applications, N_va      4               2M              4K
Clocks per data, CPE/D            O(1)            O(M/K)          O(1)
External memory                   M + N           MN              K(M + N)
External multipliers              0               KMN             0
# Voltage sources for Forward     M               M               M
# Voltage sources for Back        N               N               N
# Voltage sources for Update      M + N           N + 1           M + N

Table 1 summarizes the comparison of the existing
methods [18] and the proposed method. The number of
voltage applications of WDU is N_va = 4K, whereas that
of the existing method with K > 1 is N_va = 2M. This
means that WDU outperforms the existing method when
K < M/2, which is usually the case in recent practical deep
neural networks. Here, we define a metric called required
clocks per epoch (CPE), which measures the efficiency of setting
weights:
CPE = C_{fwd} \times D + C_{bwd} \times D + C_{upd} \times D/K, \qquad (18)
where D is the total number of training data, and K is the
mini-batch size. C_fwd and C_bwd are the numbers of clocks
required to execute the forward and backward propagation for
a single training datum, respectively. In general, they are
constant values. C_upd is the number of clocks required for a weight
update, i.e., the number of voltage applications, N_va. From
this definition, the required clocks per data (CPE/D) can be
calculated as shown in Table 1. The order of CPE/D
for the proposed method is O(1), which is independent of
the size of the crossbar array. Another advantage of WDU
is a smaller footprint of external memories than the existing
method when the crossbar size M is sufficiently larger than
the batch size K. In addition, WDU requires no multipliers
for any M and K, while the existing method requires KMN
multipliers when K > 1.
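Eq. (18) is simple enough to evaluate directly. The sketch below computes CPE and the per-datum clock count for both methods; the function names and the choice C_fwd = C_bwd = 1 in the usage example are assumptions made only for illustration.

```python
def cpe(c_fwd, c_bwd, c_upd, D, K):
    """Required clocks per epoch, Eq. (18)."""
    return c_fwd * D + c_bwd * D + c_upd * D / K

def clocks_per_datum(n_va, K, c_fwd=1, c_bwd=1):
    """CPE / D with C_upd set to the number of voltage applications N_va."""
    return c_fwd + c_bwd + n_va / K

# Example with D = 150 training samples, M = 32, K = 8 (C_fwd = C_bwd = 1 assumed)
existing = cpe(1, 1, 2 * 32, 150, 8)   # existing method: N_va = 2M
proposed = cpe(1, 1, 4 * 8, 150, 8)    # WDU: N_va = 4K
```

With these example values, the per-datum update cost of the existing method is 2M/K clocks and grows with the crossbar size, whereas for WDU it stays at 4 clocks regardless of M.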
Table 1 also shows the number of voltage sources that
are connected to the row/column terminals. In the weight
update mode, the proposed method requires M + N voltage
sources, whereas the existing method for K > 1 requires only
N + 1. This means that the circuit area of the proposed
method becomes larger than that of the existing method,
since each voltage source is typically realized by a
digital-to-analog converter (DAC). However, as the size of the crossbar
(M and N) becomes larger, the area of the DACs becomes
relatively smaller than that of the crossbar array since the
former grows linearly with M and N, while the latter grows
in the order of O(MN). Therefore, the increase of the
circuit area is mitigated when a sufficiently large crossbar
array is used, as in our research target. In addition,
the additional DACs required for the proposed method do
not deteriorate the operation frequency, since all the DACs
operate in parallel and thus their conversion time is equal to
that of a single DAC.
The training results of the MNNs by WDU and the
existing method are basically identical since both methods are
based on the same equation, Eq. (5), to determine the weight
update. However, this claim is true only under the
assumption that the conductance of the memristor G in Eq. (7) is
constant. In reality, the conductance G changes gradually
during the WDU process, which may cause a difference
between the final training results of the two methods. In
this work, however, we assume that the conductance change
is negligibly small and does not affect the training process.
This assumption is validated through the experiments in the
next section, which show that the training results of the
proposed method achieve almost the same accuracy as the existing
method.
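The size of the effect neglected by this assumption can be estimated with a rough sketch. It assumes, for illustration only, that each write changes the conductance by a fraction of its current value as in Eq. (7), so that K sequential writes compound multiplicatively, whereas a single combined write uses the initial conductance throughout. The function name `sequential_vs_single` is hypothetical.

```python
def sequential_vs_single(g0, deltas):
    """Compare K sequential relative updates with one combined update.

    g0 is the initial conductance and deltas are the per-sample relative
    changes. Returns (g_sequential, g_single).
    """
    g_seq = g0
    for d in deltas:
        g_seq *= (1.0 + d)                 # each write scales the current conductance
    g_single = g0 * (1.0 + sum(deltas))    # conductance treated as constant
    return g_seq, g_single
```

For eight updates of 1% each, the two results differ by well under 1% of the conductance, since the discrepancy is second order in the individual changes. This is only an idealized estimate and not a substitute for the circuit simulation in the next section.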
4. Experiments 
In this section, through circuit simulations using a memristor
device model, we demonstrate that the proposed WDU
method can achieve accuracy equal to that of the conventional
method in training MNNs.
4.1 Memristor Model 
In the following experiments, we use a memristor model
by Chen et al. [21], which is a behavioral model written in
the Verilog-A language for analog circuit simulation using
SPICE. We use the parameters of this model, which
are based on measurements of a HfOx-based ReRAM.
The I-V characteristic of the memristor model follows Eq. (7)
in the region where δ_G is small. The write voltage to change
the conductance of the memristor by δ_G can be determined
as
V_w = A \ln(K \delta_G) + B, \qquad (19)
where A = 0.03864, B = 2.030, and K = 0.05 are the
constants calculated from the model parameters for the voltage
application time of ΔT = 3.5 ns (here K denotes a model
constant, not the mini-batch size). Note that the signs of V_w
and δ_G do not match if δ_G < exp(-B/A)/K < 2.1 × 10^-22,
but this case rarely occurs in practical use.
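Eq. (19) can be evaluated directly with the constants quoted above. In the sketch below the model constant K = 0.05 is renamed `KAPPA` to avoid confusion with the mini-batch size; this name and the function name `write_voltage` are assumptions of the example.

```python
import math

A = 0.03864      # constant from Eq. (19)
B = 2.030        # constant from Eq. (19)
KAPPA = 0.05     # the model constant written as K in Eq. (19)

def write_voltage(delta_g):
    """Write voltage V_w (in volts) for a target relative conductance change delta_g > 0."""
    return A * math.log(KAPPA * delta_g) + B

# Write voltages for relative changes of 1%, 5%, and 10%
voltages = [write_voltage(d) for d in (0.01, 0.05, 0.10)]
```

Because V_w depends only logarithmically on δ_G, a tenfold change in the target conductance change shifts the write voltage by A ln 10, which is roughly 0.09 V.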
On the other hand, if δ_G takes a large value, Eq. (7)
does not hold, and the variable-amplitude method is not
applicable to train the MNN. Therefore, we must operate the
memristor device under a write voltage V_w such that δ_G is
kept small.
In order to determine an adequate V_w, we first conduct
a preliminary experiment for a single memristor by SPICE
simulation. Given a memristor having a certain initial
conductance G_ini, a write voltage V_w, which is determined by
Eq. (19) to increase the conductance by δ_G, is applied, and
the error ε = G - G_ini(1 + δ_G) is observed, where G is the new
conductance after the application of the write voltage.
Figure 7(a) shows conductance errors for various combinations
of G_ini (x-axis) and δ_G (y-axis). Note that δ_G is presented as
a relative percentage against G_ini. From the result, the error
is suppressed below 10% if the conductance change δ_G
is less than 10%, whereas intolerable errors are found in
the region with large δ_G. Thus the training parameters are
determined so that δ_G becomes less than 10% of the initial
conductance.
4.2 Simulation Setups 
As a benchmark to evaluate the memristor neural network, a
binary classification problem called "circle" [22] is utilized.
The classification task is to separate the two-dimensional input
vectors (x1, x2) into two classes. Example input vectors
are shown in Fig. 7(b), in which the red and blue points represent
the members of the two classes. All the input vectors are
distributed in the range of [-1.0, 1.0]. Among the 200 samples
in the dataset, we utilize 150 samples for training and 50
samples for testing.
The configuration of the neural network to be
implemented on the memristor crossbar array is a two-layer fully
connected network with a 2-M first layer and an M-1 second
layer. The numbers of the input and output channels are two
and one, respectively. The total number of memristors
used is ((2 + 1) × M + (M + 1) × 1) × 2 = 8M + 2. Note that
"+1" is for the bias term, and "×2" is for the pair of
memristors to represent positive and negative weights as described in
Sect. 2.2.2. The initial conductances of the memristors G_ini
are determined by changing the model parameter g, a gap,
which is the length of the high-impedance region in a memristor
device. For each memristor, g is sampled from a normal
distribution with a mean of 1.37 nm and a relative variance
of 0.01. For the activation functions, a hyperbolic tangent
f(x) = tanh(βx) for the first layer, where β is a
hyperparameter to normalize the input, and a sigmoid function
f(x) = 1/(1 + exp(-βx)) for the output layer are used.

Fig. 7 (a) Errors of conductance change of a single memristor. G_ini is
an initial conductance [µS] and δ_G is an increase rate of the target
conductance. ε is a relative error of the simulation result against the
ideal value from Eq. (7). (b) An example binary classification problem
"circle." The red and blue points (x1, x2) are the samples belonging to
two classes. They are distributed within the range of [-1.0, 1.0].

Fig. 8 Simulation circuit for training of the two-layer memristor neural
network. Arrows indicate the forward, back, and update flows. U: Update
module, F: Forward module, B: Back module.

Table 2 Applied voltage for weight update.
Applied voltage          Existing method [18]   Proposed method
V_w = V_row - V_col      A ln(ΔW) + B           A ln(x_i(k) e_j(k)) + B
Rows: V_row              A ln(ΔW/C) + B         A ln(x_i(k)) + B/2
Columns: V_col           -A ln(C)               -A ln(e_j(k)) - B/2
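The memristor count of 8M + 2 stated for the two-layer network can be checked with a one-line function. The function name `memristor_count` and its default arguments (two inputs, one output) are assumptions for this example, chosen to match the 2-M and M-1 layer sizes.

```python
def memristor_count(M, n_in=2, n_out=1):
    """Number of memristors for a two-layer network with M hidden units.

    Each layer has one extra input for the bias term, and every weight is
    represented by a pair of memristors (positive and negative).
    """
    first_layer = (n_in + 1) * M
    second_layer = (M + 1) * n_out
    return (first_layer + second_layer) * 2
```

For M = 16 this gives 130 memristors, and for any M the result equals 8M + 2.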
For circuit simulation, we implement the crossbar array 
and the other peripheral circuits with Verilog-A language. 
The memristor model described in Sect. 4.1 is used for the 
crossbar array, and the peripheral circuits such as current 
controlled voltage sources (CCVS), activation functions, and 
a control circuit are implemented as ideal behavioral mod-
els. In a practical use, the peripheral circuits are realized 
by AID and DIA converters or mixed-signal circuits with 
similar functions [13]- [16] . Figure 8 shows a diagram of 
the simulation circuit and its behavior. The three modes 
are operated on the same circuit: forward propagation (red 
arrows), back propagation (blue arrows), and weight update 
(green arrows). "Cl" and "C2" are the crossbar arrays for 
the first and the second layers, respectively. "F," "B," and 
"U" units are the modules to control the operations for the 
forward propagation, back propagation, and weight update 
modes. Each module includes CCVSs to provide the voltage 
to the corresponding crossbar array and memories to store 
the data such as z and e. In the forward and back propagation 
modes, the voltage applied to the memristor is limited to 
|V_w| < 0.1 V so that the conductance of the memristor does 
not change. In the weight update mode, the voltages applied 
by the existing method [18] and the proposed method are 
summarized in Table 2. 
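The row and column voltages in Table 2 can be checked numerically. The following is a minimal sketch, not the authors' implementation: the values of A, B, and C are hypothetical placeholders (the fitted constants are not given in this section), and only positive magnitudes are used, so sign handling is left out. It confirms that, for both methods, V_row - V_col reproduces the write voltage in the first row of the table.

```python
import math

# Hypothetical constants, chosen only for illustration (assumed values).
A = 0.1
B = 1.2


def existing_voltages(delta_w, c):
    """Row and column voltages of the existing method [18] in Table 2."""
    v_row = A * math.log(delta_w / c) + B
    v_col = -A * math.log(c)
    return v_row, v_col


def proposed_voltages(x_i, e_j):
    """Row and column voltages of the proposed method in Table 2."""
    v_row = A * math.log(x_i) + B / 2
    v_col = -A * math.log(e_j) - B / 2
    return v_row, v_col


def write_voltage(v_row, v_col):
    """Voltage across the memristor: V_w = V_row - V_col."""
    return v_row - v_col


# Example with positive magnitudes; x_i, e_j, and c are made-up numbers.
x_i, e_j, c = 0.5, 0.2, 0.3
v_existing = write_voltage(*existing_voltages(x_i * e_j, c))
v_proposed = write_voltage(*proposed_voltages(x_i, e_j))
target = A * math.log(x_i * e_j) + B
print(v_existing, v_proposed, target)
```

In the existing method the product ΔW must be computed before the voltage is generated, whereas in the proposed method each line driver only needs its own operand, x_i(k) for a row or e_j(k) for a column.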
The hyperparameters used for the training are determined 
experimentally by a grid search. We set the learning 
rate to η = 0.01 and the normalization parameter for the tanh 
activation to β = 1 × 10^8 A^-1. 
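To illustrate the role of β, the sketch below evaluates the two activation functions on a few column currents. The current values are hypothetical and chosen only to show that β = 1 × 10^8 A^-1 maps currents of the order of 10 nA to arguments of order one. The function names are likewise illustrative.

```python
import math

BETA = 1e8  # normalization parameter in 1/A, as reported in the text


def hidden_activation(current):
    """tanh(beta * x), used for the first layer."""
    return math.tanh(BETA * current)


def output_activation(current):
    """Sigmoid 1 / (1 + exp(-beta * x)), used for the output layer."""
    return 1.0 / (1.0 + math.exp(-BETA * current))


# Hypothetical column currents in amperes (illustrative magnitudes only).
for current in (-2e-8, 0.0, 5e-9, 2e-8):
    print(current, hidden_activation(current), output_activation(current))
```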
4.3 Result 
The classification accuracies of the existing method [18] and 
the proposed method are compared with different sets of the 
parameters K (mini-batch size) and M (crossbar array size). 
Table 3 shows the classification errors on the test set by the 
Table 3 Classification errors by the existing method [18] / the proposed 
method (%). 
K \ M    4             8              16           32 
4        8.0 / 10.0    22.0 / 12.0    8.0 / 8.0    6.0 / 6.0 
8        8.0 / 8.0     12.0 / 8.0     6.0 / 6.0    6.0 / 6.0 
16       22.0 / 8.0    20.0 / 16.0    6.0 / 6.0    8.0 / 6.0 
[Fig. 9 panels: (K, M) = (8, 16), (8, 32), (16, 16), (16, 32); x-axis: epoch, 0 to 100; y-axis: error, 0.0 to 0.5.]
Fig. 9 Learning curves by the existing method and the proposed method. 
The classification error is shown as a function of the number of elapsed 
epochs. 
[Fig. 10 panels: (K, M) = (8, 16), (8, 32), (16, 16), (16, 32); x-axis: time [µs], 0 to 200; y-axis: error, 0.0 to 0.5.]
Fig. 10 Learning curves by the existing method and the proposed method. 
The classification error is shown as a function of the elapsed time. 
existing and the proposed methods. The result shows that the 
proposed method achieves error rates comparable to or lower 
than those of the existing method. The minimum error is 
6.0%, which is almost equal to the software classification 
results. Thus, the proposed method trains the neural network 
to a sufficient level. 
Figure 9 shows the learning curves for (K, M) = 
(8, 16), (8, 32), (16, 16), and (16, 32), which plot the 
classification error on the test set as a function 
of training epochs. These results indicate that the proposed 
method trains the neural networks as well as the existing 
method does. Figure 10 shows the same 
Table 4 Clocks per data (CPE/D) by the existing method [18] / the 
proposed method. 
K \ M    4          8        16        32 
4        4 / 6      6 / 6    10 / 6    18 / 6 
8        3 / 6      4 / 6    6 / 6     10 / 6 
16       2.5 / 6    3 / 6    4 / 6     6 / 6 
learning curves as in Fig. 9, but with the x-axis changed to the 
elapsed time. The elapsed time is calculated from CPE/D 
in Table 4 and the clock period. From Eq. (18), CPE/D for 
the existing and the proposed methods are (2 × M/K + 2) 
and 6, respectively. Here, we set C_fwd = C_bwd = 1 
for both methods, and C_upd = 2M for the existing method 
and C_upd = 4 for the proposed method. In this experiment, 
the same clock period of 3.5 ns is used for both methods 
since their peripheral circuits are almost the same and the 
maximum delay of both methods is equal. The result for 
(K, M) = (8, 32), which satisfies the condition K < M/2 
described in Sect. 3, shows that the proposed method converges 
twice as fast as the existing method. For example, 
the error rate of the proposed method reaches 0.1 at 50 µs, 
while that of the existing method reaches the same error rate 
only after 100 µs. 
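The CPE/D values quoted above can be reproduced with a few lines of arithmetic. This is a sketch under one assumption: Eq. (18) is not reproduced in this section, so the way the update cost is combined with the forward and backward costs (amortized over K samples for the existing method, paid per sample for the proposed method) is inferred from the stated results 2 × M/K + 2 and 6. The function names are illustrative.

```python
CLOCK_PERIOD_NS = 3.5  # clock period used for both methods
C_FWD = 1              # clocks for forward propagation
C_BWD = 1              # clocks for back propagation


def cpd_existing(k, m):
    """Existing method [18]: one 2M-clock update shared by K samples."""
    c_upd = 2 * m
    return C_FWD + C_BWD + c_upd / k


def cpd_proposed():
    """Proposed method: a 4-clock update for every sample."""
    c_upd = 4
    return C_FWD + C_BWD + c_upd


def time_per_sample_ns(cpd):
    """Elapsed time per training sample in nanoseconds."""
    return cpd * CLOCK_PERIOD_NS


# Print the existing/proposed clock counts for each (K, M) pair.
for k in (4, 8, 16):
    row = [f"{cpd_existing(k, m):g}/{cpd_proposed()}" for m in (4, 8, 16, 32)]
    print(k, row)
```

The existing method needs more clocks per sample than the proposed method exactly when 2M/K + 2 > 6, that is, when K < M/2.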
5. Conclusion 
This paper proposed an efficient mini-batch-based training 
method for MNNs. The proposed method integrates 
the two essential processes of training, i.e., the gradient 
calculation and the weight update, into one process. As a 
result, the required number of voltage applications to the 
memristor crossbar array becomes proportional to the size 
of the mini-batch, whereas that of the existing method is 
proportional to the size of the crossbar array. The proposed 
method is therefore suitable when large crossbar arrays are used. 
In addition, the proposed method greatly reduces the external 
multipliers and memories required by the existing method. 
Circuit simulation experiments show that the proposed method 
trains twice as fast as the existing method when relatively 
large networks such as (K, M) = (8, 32) are used, while 
retaining the same level of classification accuracy. 
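The idea of folding the gradient calculation into the weight update can be illustrated at the level of plain arithmetic. The sketch below is an abstraction, not the circuit: it ignores memristor nonlinearity and write errors, assumes the usual gradient-descent sign convention (weights decrease by η times the gradient), and uses hypothetical function names and sample values. It shows that K per-sample rank-one updates give the same weights as accumulating the mini-batch gradient in external memory and writing it once.

```python
def outer(x, e):
    """Outer product x e^T as a list of rows."""
    return [[xi * ej for ej in e] for xi in x]


def add_inplace(w, d, scale=1.0):
    """Add scale * d to w element by element."""
    for i, row in enumerate(d):
        for j, value in enumerate(row):
            w[i][j] += scale * value


def update_existing(w, xs, es, eta):
    """Accumulate the mini-batch gradient externally, then write it once."""
    grad = [[0.0] * len(w[0]) for _ in w]  # stands in for external memory
    for x, e in zip(xs, es):
        add_inplace(grad, outer(x, e))     # stands in for external multipliers
    add_inplace(w, grad, scale=-eta)


def update_proposed(w, xs, es, eta):
    """Apply one rank-one update per sample directly to the weights."""
    for x, e in zip(xs, es):
        add_inplace(w, outer(x, e), scale=-eta)


xs = [[0.5, -1.0], [0.25, 0.75]]           # K = 2 samples, 2 inputs each
es = [[0.1, -0.2, 0.3], [-0.4, 0.5, 0.6]]  # matching error vectors
w_existing = [[0.0] * 3 for _ in range(2)]
w_proposed = [[0.0] * 3 for _ in range(2)]
update_existing(w_existing, xs, es, eta=0.01)
update_proposed(w_proposed, xs, es, eta=0.01)
max_gap = max(abs(a - b)
              for ra, rb in zip(w_existing, w_proposed)
              for a, b in zip(ra, rb))
print(max_gap)
```

In this abstraction the per-sample variant needs no buffer for the accumulated gradient, which mirrors the reduction in external multipliers and memories described above.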
In future work, we will consider the design of 
peripheral circuits that are suitable for training neural 
networks while taking the analog characteristics of memristors into account. 
Accelerating the circuit simulation is another topic for future work, 
so that we can accurately evaluate much larger neural 
networks. 
Acknowledgments 
This work was partially supported by JSPS KAKENHI Grant 
Nos. 26730027 and 17H01713. This work was also supported 
by the VLSI Design and Education Center (VDEC), the University 
of Tokyo, in collaboration with Synopsys, Inc. 
References 
[1] A. Krizhevsky, I. Sutskever, and G.E. Hinton, "ImageNet classifica-
tion with deep convolutional neural networks," Proc. Neural Infor-
mation Processing Systems, pp.1097-1105, 2012. 
[2] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for 
image recognition," Proc. Computer Vision and Pattern Recognition, 
pp.770-778, 2016. 
[3] G. Hinton, L. Deng, D. Yu, G.E. Dahl, A.r. Mohamed, N. Jaitly, 
A. Senior, V. Vanhoucke, P. Nguyen, T.N. Sainath, and B. Kingsbury, 
"Deep neural networks for acoustic modeling in speech recognition: 
The shared views of four research groups," IEEE Signal Process. 
Mag., vol.29, no.6, pp.82-97, 2012. 
[4] Y. Wu, M. Schuster, Z. Chen, Q.V. Le, M. Norouzi, W. Macherey, 
M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah, M. 
Johnson, X. Liu, L. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, 
K. Stevens, G. Kurian, N. Patil, W. Wang, C. Young, J. Smith, 
J. Riesa, A. Rudnick, O. Vinyals, G. Corrado, M. Hughes, and J. 
Dean, "Google's neural machine translation system: Bridging the 
gap between human and machine translation," Computing Research 
Repository, vol.abs/1609.08144, 2016. 
[5] K.K. Likharev, "CrossNets: Neuromorphic hybrid CMOS/nanoelec-
tronic network," Science of Advanced Materials, vol.3, no.3, pp.322-
331, June 2011. 
[6] J.J. Yang, D.B. Strukov, and D.R. Stewart, "Memristive devices for 
computing," Nature Nanotechnology, vol.8, pp.13-24, 2013. 
[7] L.O. Chua, "Memristor-The missing circuit element," IEEE Trans. 
Circuit Theory, vol.18, no.5, pp.507-519, Sept. 1971. 
[8] L.O. Chua and S.M. Kang, "Memristive devices and systems," Proc. 
IEEE, vol.64, no.2, pp.209-223, 1976. 
[9] D.B. Strukov, G.S. Snider, D.R. Stewart, and R.S. Williams, "The 
missing memristor found," Nature, vol.453, no.7191, pp.80-83, 
2008. 
[10] H.S.P. Wong, H.Y. Lee, S. Yu, Y.S. Chen, Y. Wu, P.S. Chen, B. Lee, 
F.T. Chen, and M.J. Tsai, "Metal-oxide RRAM," Proc. IEEE, vol.100, 
no.6, pp.1951-1970, June 2012. 
[11] F. Alibart, L. Gao, B.D. Hoskins, and D.B. Strukov, "High precision 
tuning of state for memristive devices by adaptable variation-tolerant 
algorithm," Nanotechnology, vol.23, no.7, p.075201, 2012. 
[12] M. Hu, H. Li, Y. Chen, Q. Wu, G.S. Rose, and R.W. Linderman, 
"Memristor crossbar-based neuromorphic computing system: A case 
study," IEEE Trans. Neural Netw. Learning Syst., vol.25, no.10, 
pp.1864-1878, Oct. 2014. 
[13] M. Prezioso, F. Merrikh-Bayat, B. Hoskins, G. Adam, K.K. Likharev, 
and D.B. Strukov, "Training and operation of an integrated neuro-
morphic network based on metal-oxide memristors," Nature, vol.521, 
no.7550, pp.61-64, 2015. 
[14] A. Shafiee, A. Nag, N. Muralimanohar, R. Balasubramonian, J.P. 
Strachan, M. Hu, R.S. Williams, and V. Srikumar, "ISAAC: A 
convolutional neural network accelerator with in-situ analog arithmetic 
in crossbars," ACM/IEEE 43rd ISCA, pp.14-26, June 2016. 
[15] P. Chi, S. Li, C. Xu, T. Zhang, J. Zhao, Y. Liu, Y. Wang, and Y. Xie, 
"PRIME: A novel processing-in-memory architecture for neural 
network computation in ReRAM-based main memory," ACM/IEEE 
43rd ISCA, pp.27-39, June 2016. 
[16] L. Song, X. Qian, H. Li, and Y. Chen, "PipeLayer: A pipelined 
ReRAM-based accelerator for deep learning," IEEE International Symposium 
on HPCA, pp.541-552, Feb. 2017. 
[17] F. Alibart, E. Zamanidoost, and D.B. Strukov, "Pattern classification 
by memristive crossbar circuits using ex situ and in situ training," 
Nature Communications, vol.4, p.2072, June 2013. 
[18] I. Kataeva, F. Merrikh-Bayat, E. Zamanidoost, and D. Strukov, "Ef-
ficient training algorithms for neural networks based on memristive 
crossbar circuits," Proc. International Joint Conference on Neural 
Networks, pp.1-8, July 2015. 
[19] C.M. Bishop, Pattern Recognition and Machine Learning, Springer, 
2006. 
[20] F.M. Bayat, B. Hoskins, and D.B. Strukov, "Phenomenological 
modeling of memristive devices," Appl. Phys. A, vol.118, pp.779-786, 
March 2015. 
[21] P.Y. Chen and S. Yu, "Compact modeling of RRAM devices and its 
applications in 1T1R and 1S1R array design," IEEE Trans. Electron 
Devices, vol.62, no.12, pp.4022-4028, Dec. 2015. 
[22] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, 
O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, 
J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and 
E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal 
of Machine Learning Research, vol.12, pp.2825-2830, 2011. 
Satoshi Yamamori received the B.E. degree 
in Electrical and Electronic Engineering from 
Kyoto University in 2017. He is a master's course 
student at the Department of Systems Science, Kyoto 
University. He is a student member of the 
Institute of Electronics, Information and 
Communication Engineers (IEICE). 
Masayuki Hiromoto received B.E. degree 
in Electrical and Electronic Engineering and 
M.Sc. and Ph.D. degrees in Communications 
and Computer Engineering from Kyoto Univer-
sity in 2006, 2007, and 2009 respectively. He 
was a JSPS research fellow from 2009 to 2010, 
and with Panasonic Corp. from 2010 to 2013. 
In 2013, he joined the Graduate School of In-
formatics, Kyoto University, where he is cur-
rently a senior lecturer. His research interests 
include VLSI design methodology, image pro-
cessing, and pattern recognition. He is a member of IEEE, IEICE, and 
IPSJ. 
Takashi Sato received B.E. and M.E. de-
grees from Waseda University, Tokyo, Japan, and 
a Ph.D. degree from Kyoto University, Kyoto, 
Japan. He was with Hitachi, Ltd., Tokyo, Japan, 
from 1991 to 2003, with Renesas Technology 
Corp., Tokyo, Japan, from 2003 to 2006, and 
with the Tokyo Institute of Technology, Yoko-
hama, Japan. In 2009, he joined the Graduate 
School of Informatics, Kyoto University, Kyoto, 
Japan, where he is currently a professor. He was 
a visiting industrial fellow at the University of 
California, Berkeley, from 1998 to 1999. His research interests include 
CAD for nanometer-scale LSI design, fabrication-aware design methodol-
ogy, and performance optimization for variation tolerance. Dr. Sato is a 
member of the IEEE, ACM, and IEICE. He received the Beatrice Winner 
Award at ISSCC 2000 and the Best Paper Award at ISQED 2003. 
