The Long Short-Term Memory (LSTM) architecture is a well-known approach to building recurrent neural networks (RNNs) for sequential data processing, with applications in natural language processing. Near-sensor hardware implementation of LSTM is challenging due to its large parallelism and complexity. We propose a 0.18 μm CMOS, GST memristor LSTM hardware architecture for near-sensor processing. The proposed system is validated on a forecasting problem based on a Keras model.
Introduction
There are numerous applications that apply sequentially ordered data for prediction and classification using recurrent neural networks (RNNs). Feedback connections in an RNN enable its units to retain previous information when computing the current state, making the RNN an efficient tool for processing sequential data where maintaining the order of information is important. RNN training is performed using backpropagation through time (BPTT), which leads to either vanishing or exploding gradients. Long short-term memory (LSTM) (see Fig. 1) is an extension of the RNN that overcomes the vanishing gradient problem [1, 2].
LSTM architecture
In Fig. 1, the elements form vectors [3]. The parameter m is an input iterator index (m = 1, ..., M), x_t is the current input, and h_t is the hidden unit output.
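For reference, a standard formulation of the LSTM cell of Fig. 1 is given below (following [2]; W_*, U_*, and b_* denote the input weights, recurrent weights, and biases of each gate, σ the sigmoid, and ⊙ the element-wise product):

$$
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i), \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f), \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o), \\
c_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c), \\
C_t &= f_t \odot C_{t-1} + i_t \odot c_t, \\
h_t &= o_t \odot \tanh(C_t).
\end{aligned}
$$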
A. Implementation of LSTM network for prediction problem
In [4], an LSTM network is used to predict the number of international airline passengers. The network model for this regression problem is shown in Fig. 2.
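A minimal sketch of such a Keras model is shown below (an assumption based on the standard workflow for this dataset; the 4 hidden units and 100 training epochs match the figures reported in the Results section):

```python
# Sketch of the airline-passenger forecasting model (assumption: a standard
# Keras LSTM regression setup; 4 hidden units match the [4, 16] recurrent
# weight matrix reported later in this paper).
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense

def make_windows(series, look_back=1):
    """Turn a 1-D series into (X, y) pairs: predict x[t+1] from x[t]."""
    X = np.array([series[i:i + look_back] for i in range(len(series) - look_back)])
    y = series[look_back:]
    return X.reshape(-1, look_back, 1), y

model = Sequential([
    LSTM(4, input_shape=(1, 1)),  # 4 hidden units -> 16 gate columns (4 gates x 4 units)
    Dense(1),                     # output layer: weight matrix [4, 1] plus a bias
])
model.compile(loss="mean_squared_error", optimizer="adam")
# model.fit(X_train, y_train, epochs=100, batch_size=1)  # 100 epochs, as in Results
```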
B. LSTM CMOS-memristive hardware implementation
The LSTM cell architecture is designed using a memristive crossbar and CMOS circuits. In Fig. 3, the weights of the inputs and the biases are represented by the conductances of the memristors in the crossbar. Four structures delimited by the dashed blue lines compute the outputs of the gates i_t, f_t, and o_t, and the intermediary cell state c_t. The transistors in these structures serve as switches that allow the weighted summation to be performed from a single crossbar column at a particular time step.
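A behavioral sketch of this crossbar operation follows (hypothetical conductance and voltage values; in the analog circuit the summation happens through Kirchhoff's current law, I = Σ_i G_i · V_i):

```python
import numpy as np

# Behavioral model of one crossbar column: input voltages encode x_t, h_{t-1},
# and the bias; memristor conductances G (siemens) encode the weights.
def crossbar_column_current(voltages, conductances):
    """Column output current I = sum_i G_i * V_i (Kirchhoff's current law)."""
    return np.dot(conductances, voltages)

# Hypothetical values: three rows driven at read voltages, weights as conductances.
V = np.array([0.2, -0.1, 0.2])           # volts (bias row held at a fixed voltage)
G = 1.0 / np.array([400e3, 800e3, 2e6])  # siemens, within the 200 kOhm - 2 MOhm range
I = crossbar_column_current(V, G)        # amps; this current feeds the current mirrors
```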
The resulting currents from the read transistors, which represent the dot product of the inputs and the crossbar weights, are fed into current mirror circuits. The mirrored currents are then fed to the corresponding activation function circuits: sigmoid and hyperbolic tangent circuits. The activation function circuits produce the output voltage values for i_t, f_t, c_t, and o_t required to obtain the cell state C_t. The cell state C_{t-1} from the previous time step is stored in a memory unit. Finally, the output h_t of the current LSTM hidden unit is obtained as a voltage after a few steps of conversion and calculation, performed in multiplier circuits, an analog adder, and a voltage-to-current converter. The final predicted output is computed from the memory unit by combining all hidden-layer outputs V_h with the output-layer weights M to obtain their dot product, which is represented as a current through the resistor R proportional to V_out.
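Putting the pieces together, the per-step datapath of Fig. 3 corresponds to the following computation (a behavioral sketch; sigma and tanh stand in for the sigmoid and hyperbolic-tangent circuits):

```python
import numpy as np

sigma = lambda x: 1.0 / (1.0 + np.exp(-x))  # sigmoid circuit
tanh = np.tanh                              # hyperbolic-tangent circuit

def lstm_unit_step(i_in, f_in, c_in, o_in, C_prev):
    """One hidden-unit update: crossbar column currents -> gate voltages -> h_t.
    i_in..o_in are the mirrored column currents; C_prev comes from the memory unit."""
    i_t, f_t, o_t = sigma(i_in), sigma(f_in), sigma(o_in)
    c_t = tanh(c_in)
    C_t = f_t * C_prev + i_t * c_t   # multiplier circuits + analog adder
    h_t = o_t * tanh(C_t)            # hidden-unit output, obtained as a voltage
    return h_t, C_t                  # C_t is written back to the memory unit
```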

Results and discussion
The input dataset, comprising 144 observations over 12 years, was normalized to the range from 0 to 1 and divided into training and testing sets. After training the network for 100 epochs, we extracted the values of the network weight matrices, constrained to the range between -1 and 1. The LSTM unit is composed of three matrices: one of size [1, 16] for the input signal x, one of size [4, 16] for the recurrent connections with h, and one of size [1, 16] for the biases b. The outputs of the columns in these matrices form the input gate, forget gate, intermediary cell state, and output gate vectors. The output layer has two matrices: a weight matrix of size [4, 1] and a single bias value. The obtained score for the training set is smaller than for the testing set (24.84% root-mean-square error (RMSE) versus 55.26% RMSE), which calls for further optimization.
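The reported matrix sizes match the shapes Keras exposes for such a model; a minimal sketch of the weight extraction (assuming the model sketched earlier):

```python
# Extract the trained weights (shapes follow Keras conventions for an LSTM layer).
kernel, recurrent_kernel, bias = model.layers[0].get_weights()
print(kernel.shape)              # (1, 16): input weights, 4 gates x 4 hidden units
print(recurrent_kernel.shape)    # (4, 16): recurrent weights for h
print(bias.shape)                # (16,):   gate biases
W_out, b_out = model.layers[1].get_weights()
print(W_out.shape, b_out.shape)  # (4, 1) and (1,): output layer weights and bias
```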
To build the circuit, TSMC 0.18 μm CMOS technology and the memristor model for large-scale simulations [6] were utilized. We use GST memristors that can be adjusted to 16 different resistance levels from 200 kΩ to 2000 kΩ [7]; thus, interpolation of the extracted weights was performed. Adjusting the weights according to the memristor conductance values also affects the network performance either way, degrading or improving the prediction. Simulation with the modified weights led to a slight improvement in performance (24.61% RMSE for the training and 47.33% RMSE for the testing set). The total area and power consumption of a single LSTM unit are 77.326 μm² and 105.9 mW, respectively.
Fig. 1 Mathematical representation of the conventional LSTM cell [2].
Fig. 2 Implementation of the LSTM network for the particular example of a prediction-making problem [3].
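A sketch of the weight-to-conductance mapping (assuming a linear mapping of |w| onto the 16 available levels; the exact interpolation scheme used in the paper may differ):

```python
import numpy as np

R_MIN, R_MAX, N_LEVELS = 200e3, 2000e3, 16  # GST memristor range, 16 levels [7]

# 16 programmable resistance levels, assumed evenly spaced between 200 kOhm and 2 MOhm.
levels_r = np.linspace(R_MIN, R_MAX, N_LEVELS)
levels_g = 1.0 / levels_r                   # corresponding conductances

def quantize_weight(w, w_max=1.0):
    """Map a weight in [-1, 1] to the nearest realizable conductance.
    The sign is kept here for clarity; a real crossbar would need, e.g.,
    differential columns to encode it (an assumption, not from the paper)."""
    g_target = abs(w) / w_max * levels_g.max()
    g_quant = levels_g[np.argmin(np.abs(levels_g - g_target))]
    return np.sign(w) * g_quant
```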
Conclusion
In this work, the hardware architecture of an LSTM Keras model for forecasting was presented. The memristor crossbar array is highly efficient, allowing large and complex vector-matrix multiplications to be computed in a single constant-time step. However, further network optimization and the development of a learning circuit are required.
