Design of CMOS-memristor Circuits for LSTM architecture by Smagulova, Kamilya et al.
 
Design of CMOS-memristor Circuits for LSTM architecture  
 
Kamilya Smagulova, Kazybek Adam, Olga Krestinskaya and Alex Pappachen James  
 
Nazarbayev University 
Astana, Kazakhstan, apj@ieee.org 
 
 
Abstract 
 
Long short-term memory (LSTM) is a well-known architecture for
building recurrent neural networks (RNNs) and is useful for
sequential data processing, for example in natural language
processing. Near-sensor hardware implementation of LSTM is
challenging because of the high degree of parallelism and the
complexity of the architecture. We propose an LSTM hardware
architecture for near-sensor processing based on 0.18 μm CMOS
technology and GST memristors. The proposed system is validated
on a forecasting problem based on a Keras model.
 
Key words: LSTM, crossbar, prediction, analog circuit
  
Introduction 
 
Numerous applications use sequentially ordered data for
prediction and classification with recurrent neural networks
(RNNs). Feedback connections in an RNN enable its units to
retain previous information when computing the current state,
making the RNN an efficient tool for processing sequential data
where maintaining the order of information is important. RNN
training is performed using 'backpropagation through time'
(BPTT), which can lead to either the vanishing or the exploding
gradient problem. Long short-term memory (LSTM) (see Fig. 1) is
an extension of the RNN that overcomes the vanishing gradient
problem [1,2].
 
LSTM architecture 
In Fig. 1, the inputs and the outputs of the LSTM cell are
vectors. The input $x^{(t)}$ at time $t$ is of size $N$, and the
rest of the inputs and outputs are of size $M$, which is the
number of hidden LSTM units. The index $m'$ refers to a single
element of these vectors. For instance, $f_t^{m'}$ is the output
of the forget gate for the $m'$-th element; $M$ such elements
form a vector [3]. The parameter $n$ is an input iterator index
($n \in [1, N]$), $m'$ is the current hidden unit index
($m' \in [1, M]$), and $m$ is a hidden unit iterator index
($m \in [1, M]$).
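For reference, the index conventions above can be made concrete with the conventional LSTM equations with a forget gate [2]. This is a sketch that assumes Fig. 1 follows the standard formulation; the weight symbols $W$ and $U$ and the sigmoid symbol $\sigma$ are labels introduced here for illustration and do not come from the figure.

```latex
\begin{aligned}
i_t^{m'} &= \sigma\Big(\sum_{n=1}^{N} W^{i}_{m'n}\, x^{(t)}_{n}
            + \sum_{m=1}^{M} U^{i}_{m'm}\, h_{t-1}^{m} + b_i^{m'}\Big) \\
f_t^{m'} &= \sigma\Big(\sum_{n=1}^{N} W^{f}_{m'n}\, x^{(t)}_{n}
            + \sum_{m=1}^{M} U^{f}_{m'm}\, h_{t-1}^{m} + b_f^{m'}\Big) \\
o_t^{m'} &= \sigma\Big(\sum_{n=1}^{N} W^{o}_{m'n}\, x^{(t)}_{n}
            + \sum_{m=1}^{M} U^{o}_{m'm}\, h_{t-1}^{m} + b_o^{m'}\Big) \\
c_t^{m'} &= \tanh\Big(\sum_{n=1}^{N} W^{c}_{m'n}\, x^{(t)}_{n}
            + \sum_{m=1}^{M} U^{c}_{m'm}\, h_{t-1}^{m} + b_c^{m'}\Big) \\
C_t^{m'} &= f_t^{m'}\, C_{t-1}^{m'} + i_t^{m'}\, c_t^{m'} \\
h_t^{m'} &= o_t^{m'}\, \tanh\big(C_t^{m'}\big)
\end{aligned}
```

Here $c_t$ is the intermediary cell state and $C_t$ the cell state. When $C_{t-1}$ and $h_{t-1}$ are zero, the forget-gate term vanishes and $C_t$ depends only on the input path.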
 
A. Implementation of LSTM network for prediction problem 
In [4], an LSTM network is used to predict the number of
international airline passengers. The network model for this
regression problem is shown in Fig. 2; it consists of an input,
a hidden layer comprising a single LSTM cell with 4 units, and
an output layer with no activation function.
A single input value $x_i$ is applied to the LSTM cell, and its
input gate forms the current intermediary cell state $c_t$.
Initially, in the absence of a previous cell state $C_{t-1}$ and
a previous cell output vector $h_{t-1}$, the present cell state
$C_t$ becomes equal to $c_t$. The resulting cell state vector
contributes to the values of the final output vector $h_t$ of the
LSTM unit. The obtained vector $h_t$ passes through an output
layer, which produces a prediction of the next value.
 
B. LSTM CMOS-memristive hardware implementation 
The LSTM cell architecture is designed using the memristive
crossbar and CMOS circuits. The proposed architecture computes
the gate output values and the intermediary cell state $c_t$ for
one LSTM hidden unit $m'$ at a time. After $M$ cycles we obtain
the full output vector for an LSTM cell. At each cycle the
obtained values are saved in the memory units.
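The unit-by-unit schedule described above can be sketched in software as follows. This is a minimal NumPy illustration, not the circuit itself: it assumes the conventional LSTM equations, and the function names, dictionary layout and random weights are hypothetical. It shows that evaluating one hidden unit per cycle for $M$ cycles yields the same output vector as evaluating the whole cell at once.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step_per_unit(x, h_prev, C_prev, W, U, b):
    """Evaluate one LSTM time step, one hidden unit (m') per cycle.

    W[g]: (N, M) input weights, U[g]: (M, M) recurrent weights,
    b[g]: (M,) biases, for the gates g in "i", "f", "o", "c".
    """
    M = h_prev.shape[0]
    h = np.zeros(M)  # stands in for the memory units holding h_t
    C = np.zeros(M)  # stands in for the memory units holding C_t
    for m in range(M):  # cycle m handles hidden unit m' = m
        pre = {g: x @ W[g][:, m] + h_prev @ U[g][:, m] + b[g][m] for g in "ifoc"}
        i_t, f_t, o_t = sigmoid(pre["i"]), sigmoid(pre["f"]), sigmoid(pre["o"])
        c_t = np.tanh(pre["c"])  # intermediary cell state
        C[m] = f_t * C_prev[m] + i_t * c_t
        h[m] = o_t * np.tanh(C[m])
    return h, C

def lstm_step_vectorized(x, h_prev, C_prev, W, U, b):
    """Reference: the same time step computed for all units at once."""
    pre = {g: x @ W[g] + h_prev @ U[g] + b[g] for g in "ifoc"}
    i_t, f_t, o_t = sigmoid(pre["i"]), sigmoid(pre["f"]), sigmoid(pre["o"])
    c_t = np.tanh(pre["c"])
    C = f_t * C_prev + i_t * c_t
    return o_t * np.tanh(C), C

# Illustrative sizes and random weights (not values from the paper).
rng = np.random.default_rng(0)
N, M = 1, 4
W = {g: rng.uniform(-1, 1, (N, M)) for g in "ifoc"}
U = {g: rng.uniform(-1, 1, (M, M)) for g in "ifoc"}
b = {g: rng.uniform(-1, 1, M) for g in "ifoc"}
x = rng.uniform(0, 1, N)
h0, C0 = np.zeros(M), np.zeros(M)

h_seq, C_seq = lstm_step_per_unit(x, h0, C0, W, U, b)
h_vec, C_vec = lstm_step_vectorized(x, h0, C0, W, U, b)
```

Each unit's result depends only on the previous-step vectors, so the units can be processed in any order; the hardware trades $M$ cycles per time step for a single shared set of gate circuits.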
A memristive crossbar implementation of matrix-vector
multiplication [5] for a single LSTM cell with $N$ inputs and
$M$ LSTM hidden units per cell is shown in Fig. 3. The inputs
$x_i$, where $i \in [1, N]$, belong to the time step $t$, and
the inputs $h_j$, where $j \in [1, M]$, are the outputs of the
corresponding LSTM blocks at time $t-1$. The biases for the
input, forget, and output gates are shown as $b_i$, $b_f$, and
$b_o$, respectively. The bias $b_c$ is for the intermediary cell
state $c_t$. The real cell state is $C_t$.
In Fig. 3, the weights of the inputs and biases are represented
by the conductances of the memristors in the crossbar. Four
structures delimited by the dashed blue lines compute the
outputs of the gates, $i_t$, $f_t$ and $o_t$, and the
intermediary cell state $c_t$. The transistors in these
structures serve as switches that allow the weighted summation
to be read from a single crossbar column at a particular time
step.
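The weighted summation on a selected column can be illustrated numerically. The sketch below assumes ideal memristors and wires: each device contributes a current $V_k G_k$ (Ohm's law) and the column sums these currents (Kirchhoff's current law). The function name, the crossbar size and all voltage and conductance values are hypothetical, and the handling of negative weights, which the text does not detail, is not modeled.

```python
import numpy as np

def column_current(voltages, conductances, column):
    """Current summed on one crossbar column: I = sum_k V_k * G[k, column]."""
    voltages = np.asarray(voltages, dtype=float)
    conductances = np.asarray(conductances, dtype=float)
    return float(voltages @ conductances[:, column])

# Hypothetical 3-row, 2-column crossbar: rows carry x, h and a bias line.
V = [0.1, 0.2, 0.3]          # row voltages in volts
G = [[1e-6, 2e-6],
     [2e-6, 1e-6],
     [5e-6, 4e-6]]           # memristor conductances in siemens

# Only the selected column is read, as if its switch transistor were on.
i_col0 = column_current(V, G, column=0)
i_col1 = column_current(V, G, column=1)
```

For these values the selected columns carry about 2.0 μA and 1.6 μA, which are the dot products of the row voltages with the respective conductance columns.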
The resulting currents from the read transistors, which
represent the dot product of the inputs and the crossbar
weights, are fed into current mirror circuits. The mirrored
currents are then fed to the corresponding activation function
circuits: sigmoid and hyperbolic tangent circuits. The
activation function circuits produce the output voltage values
for $i_t$, $f_t$, $c_t$ and $o_t$ required to obtain the cell
state $C_t$. The cell state $C_{t-1}$ from the previous time
step is stored in a memory unit. Finally, the output $h_t$ of
the current LSTM hidden unit is obtained as a voltage after a
few steps of conversion and calculation, performed in multiplier
circuits, an analog adder and a voltage-to-current converter.
The final predicted output is calculated by combining all hidden
layer outputs $V_{hM}$ from the memory unit with the output
layer weights to obtain a dot product, which is represented as a
current through the resistor $R$ that is proportional to
$V_{out}$.
 
Results and discussion 
 
The input dataset includes 144 observations over 12 years; it
was normalized to the range from 0 to 1 and divided into
training and testing sets. After training the network for 100
epochs, we extracted the network weight matrices, with values
constrained to the range between -1 and 1. The LSTM unit is
composed of three matrices: one of size [1,16] for the input
signal $x_i$, one of size [4,16] for the recurrent connections
$h_*$, and one of size [1,16] for the biases $b_*$. The outputs
of the matrix columns form the input gate, forget gate,
intermediary cell state and output gate vectors. The output
layer has two matrices: a weight matrix of size [4,1] and a
single bias value (size [1]). The error obtained for the
training set is smaller than that for the testing set (24.84%
root-mean-square error (RMSE) against 55.26% RMSE), so the
network requires further optimization.
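The matrix sizes above can be tied together in a short forward-pass sketch. It assumes the Keras LSTM weight layout, in which the 16 columns are grouped as input gate, forget gate, intermediary cell state and output gate (4 columns each), and it assumes sigmoid gates with tanh elsewhere. The function names are hypothetical and the weights in the usage lines are placeholders, not the trained values.

```python
import numpy as np

UNITS = 4  # hidden LSTM units, so each matrix has 4 * UNITS = 16 columns

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def split_gates(mat, units=UNITS):
    """Split the last axis into input, forget, cell and output blocks."""
    return [mat[..., k * units:(k + 1) * units] for k in range(4)]

def predict_next(xs, kernel, recurrent, bias, w_out, b_out):
    """Run a sequence of scalar inputs through the LSTM and the output layer.

    kernel: [1,16], recurrent: [4,16], bias: [1,16], w_out: [4,1], b_out: [1].
    """
    bias = np.ravel(bias)
    h = np.zeros(UNITS)  # previous cell output, initially absent
    C = np.zeros(UNITS)  # previous cell state, initially absent
    for x in xs:
        z = np.atleast_1d(float(x)) @ kernel + h @ recurrent + bias
        z_i, z_f, z_c, z_o = split_gates(z)
        i_t, f_t, o_t = sigmoid(z_i), sigmoid(z_f), sigmoid(z_o)
        c_t = np.tanh(z_c)           # intermediary cell state
        C = f_t * C + i_t * c_t      # cell state
        h = o_t * np.tanh(C)         # cell output
    return (h @ w_out + b_out).item()  # linear output layer, no activation

# Placeholder weights in the range [-1, 1] (not the extracted values).
rng = np.random.default_rng(1)
kernel = rng.uniform(-1, 1, (1, 16))
recurrent = rng.uniform(-1, 1, (4, 16))
bias = rng.uniform(-1, 1, (1, 16))
w_out = rng.uniform(-1, 1, (4, 1))
b_out = np.zeros(1)
y_hat = predict_next([0.2], kernel, recurrent, bias, w_out, b_out)
```

Passing a single-element sequence corresponds to the one-step look-back case described in the text, where the previous cell state and output are absent.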
To build the circuit, TSMC 0.18 μm CMOS technology and a
memristor model for large-scale simulations [6] were used. We
use GST memristors that can be adjusted to 16 different
resistance levels from 200 kΩ to 2000 kΩ [7]; the extracted
weights were therefore interpolated to these levels. Adjusting
the weights to the available memristor conductance values
affects the network performance and may either degrade or
improve the prediction. Simulation with the modified weights led
to a slight improvement in performance (24.61% RMSE for the
training set and 47.33% RMSE for the testing set). The total
area and power consumption of a single LSTM unit are 77.326 μm²
and 105.9 mW, respectively.
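One possible way to map weights onto a limited set of memristor states is sketched below. The exact interpolation procedure is not given in the text, so every modeling choice here is an assumption: the 16 levels are taken as evenly spaced in resistance between 200 kΩ and 2000 kΩ, weights in [-1, 1] are mapped linearly onto the conductance range, and each weight snaps to the nearest available level. All names are hypothetical.

```python
import numpy as np

R_MIN, R_MAX, N_LEVELS = 200e3, 2000e3, 16  # ohms; range and count from the text

# Assumption: levels evenly spaced in resistance; stored as sorted conductances.
LEVELS_G = np.sort(1.0 / np.linspace(R_MIN, R_MAX, N_LEVELS))

def weight_to_conductance(w, levels=LEVELS_G):
    """Map weights in [-1, 1] linearly to [G_min, G_max], snap to nearest level."""
    w = np.clip(np.asarray(w, dtype=float), -1.0, 1.0)
    g_min, g_max = levels[0], levels[-1]
    g_ideal = g_min + (w + 1.0) / 2.0 * (g_max - g_min)
    idx = np.abs(g_ideal[..., None] - levels).argmin(axis=-1)
    return levels[idx]

def conductance_to_weight(g, levels=LEVELS_G):
    """Inverse of the linear map: the weight the crossbar effectively realizes."""
    g_min, g_max = levels[0], levels[-1]
    return 2.0 * (np.asarray(g, dtype=float) - g_min) / (g_max - g_min) - 1.0

# Placeholder weights (not the extracted values).
w = np.array([-1.0, -0.3, 0.0, 0.4, 1.0])
g = weight_to_conductance(w)
w_quantized = conductance_to_weight(g)
```

Comparing `w_quantized` with `w` gives the per-weight quantization error, the kind of perturbation that, as noted above, can shift the prediction error in either direction.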
 
Fig. 1 Mathematical representation of conventional LSTM cell [2]. 
 
Fig. 2 Implementation of LSTM network for the particular example          
of prediction making problem [3]. 
 
 
 
 
 
Conclusion 
 
In this work, a hardware architecture of an LSTM Keras model for
forecasting was presented. The memristor crossbar array is
highly efficient, allowing large, complex vector-matrix
multiplications to be computed in a single constant time step.
However, further network optimization and the development of a
learning circuit are required.
 
References 
 
[1] K. Smagulova, O. Krestinskaya, and A. P. James, “A
memristor-based long short term memory circuit,” Analog
Integrated Circuits and Signal Processing, 2018, pp. 1-6.
[2] F. A. Gers, J. Schmidhuber, and F. Cummins, “Learning to
forget: Continual prediction with LSTM,” 1999.
[3] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning.
MIT Press, Cambridge, 2016, vol. 1.
[4] J. Brownlee, “Time series prediction with LSTM recurrent
neural networks in Python with Keras,” Available at:
machinelearningmastery.com, 2016.
[5] M. Hu, J. P. Strachan, Z. Li, E. M. Grafals, N. Davila, C.
Graves, S. Lam, N. Ge, J. J. Yang, and R. S. Williams,
“Dot-product engine for neuromorphic computing: programming 1T1M
crossbar to accelerate matrix-vector multiplication,” in
Proceedings of the 53rd Annual Design Automation Conference.
ACM, 2016, p. 19.
[6] D. Biolek, Z. Kolka, V. Biolkova, and Z. Biolek, “Memristor
models for SPICE simulation of extremely large memristive
networks,” in 2016 IEEE International Symposium on Circuits and
Systems (ISCAS), May 2016, pp. 389–392.
[7] S. Xiao, X. Xie, S. Wen, Z. Zeng, T. Huang, and J. Jiang,
“GST-memristor-based online learning neural networks,”
Neurocomputing, 2017.
 
 
Fig. 3 CMOS-memristive hardware implementation of LSTM cell. 
