Abstract: This paper proposes Full-Parallel Convolutional Neural Networks (FP-CNN) for specific target recognition, which use analog memristive array circuits to carry out vector-matrix multiplication and generate multiple output feature maps in a single processing cycle. Instead of the ReLU or Tanh function, we adopt the absolute-value activation function, which dramatically reduces the network scale and achieves a 99% recognition accuracy rate with only three layers. Furthermore, we propose a performance metrics function to resize the FP-CNN for different classification tasks. Following these design guidelines, the FP-CNN still achieves over 96% recognition accuracy with a 95% yield of the memristor crossbar array and 0.5% Single-Pole-Double-Throw (SPDT) switch noise.
Introduction
In recent years, embedded neuromorphic processing systems have shown significant advantages, such as the ability to solve image classification problems while consuming very little area and power [1]. Convolutional neural networks (CNNs) are widely popular in industry for their superior accuracy, and when implemented on GPU clusters, they have become the state of the art in image classification. With the development of deep learning algorithms, the dimension and complexity of CNN layers have grown significantly. Because of the limited hardware area in mobile devices, implementations for specific target recognition often fail to achieve the desired results, or they consume excessive system resources or a large amount of energy [2, 3]. Embedded neuromorphic processing systems, which can solve classification problems efficiently, therefore offer significant advantages in this setting.
In recent years, the memristor [4] has received significant attention as a synapse for neuromorphic systems. The memristor is a non-volatile device that can store the synaptic weights of neural networks. A memristive array can naturally carry out vector-matrix multiplication, which is a computationally expensive operation for neural networks. It has recently been demonstrated that analog vector-matrix multiplication can be orders of magnitude more efficient than ASIC-, GPU-, or FPGA-based implementations [5].
We propose a Full-Parallel Convolutional Neural Network (FP-CNN) for embedded systems, robotics, and real-time or mobile scenarios, which achieves high recognition accuracy with three layers. Our work aims to take full advantage of the parallelism of the memristor crossbar array for maximum throughput. We use many Kernel-Memristor-Crossbar modules to generate the output feature maps in a single processing cycle, and we adopt the absolute-value activation function, which achieves a high recognition accuracy rate in the proposed network architecture.
The rest of this paper is organized as follows. Section 2 depicts the detailed architecture of the FP-CNN (including its parallel structure in hardware), then analyzes the performance of the absolute activation function, and demonstrates the analysis of the FP-CNN's system error. Section 3 summarizes the paper at the end.
Full-Parallel convolutional neural networks
The FP-CNN is driven by spike signals of varying amplitude, and the convolution computation is performed by the memristor crossbar array. The FP-CNN has only three layers, and it can reach the desired recognition accuracy when appropriate parameters are selected.
FP-CNN architecture
The FP-CNN has four major components: the convolution layer, the activation function module, the pooling layer, and the fully connected layer. Fig. 1 shows the architecture of the implemented FP-CNN.
Convolutional kernels are used to extract the features from the input images and produce the feature maps. The FP-CNN uses twenty convolution kernels, each of size 9 × 9, to extract the features from the input image. Each of these kernels slides across the corresponding 9 × 9 feature field with unit stride to perform the vector-matrix multiplication. These operations produce twenty output feature maps, each of size $(I_W - 8) \times (I_H - 8)$, where $I_W \times I_H$ is the size of the input image. The FP-CNN adds no bias at the end of the convolution operations. Therefore, the output feature maps can be represented as
From Eq. (1) and Eq. (2), it can be seen that the operations in the FP-CNN system are all vector-matrix multiplications, which map directly onto the memristive array.
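To make this mapping concrete, the following is a minimal NumPy sketch of how a bias-free, unit-stride convolution with twenty 9 × 9 kernels reduces to vector-matrix products. It assumes a 28 × 28 input (the MNIST image size); the function name `conv_as_matmul` and the random data are illustrative and are not taken from the FP-CNN implementation.

```python
import numpy as np

def conv_as_matmul(image, kernels):
    """Valid, unit-stride, bias-free convolution as vector-matrix products.

    As in most CNN frameworks, the kernel is not flipped (cross-correlation).
    """
    k, ks, _ = kernels.shape
    ih, iw = image.shape
    oh, ow = ih - ks + 1, iw - ks + 1
    # Each row is one flattened ks x ks receptive field, i.e. one input vector.
    patches = np.stack([
        image[r:r + ks, c:c + ks].ravel()
        for r in range(oh) for c in range(ow)
    ])                                          # shape (oh*ow, ks*ks)
    weights = kernels.reshape(k, ks * ks).T     # shape (ks*ks, k), one column per kernel
    out = patches @ weights                     # shape (oh*ow, k)
    return out.T.reshape(k, oh, ow)

rng = np.random.default_rng(0)
image = rng.random((28, 28))                    # assumed MNIST-sized input
kernels = rng.standard_normal((20, 9, 9))       # twenty 9 x 9 kernels
maps = conv_as_matmul(image, kernels)
print(maps.shape)                               # (20, 20, 20)
```

In this sketch the positions are evaluated in one matrix product in software; in a fully parallel hardware arrangement each row of `patches` would instead be presented to its own crossbar module at the same time.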
Absolute activation function analysis
Rectified Linear Unit (ReLU), sigmoid, and other activation functions are widely applied in popular deep neural networks [6, 7]. However, in the FP-CNN architecture, which has only three major layers, these activation functions do not work well. For example, ReLU units can be fragile during training and can "die". Mathematically, let the input vector to the network be $X = (x_1, x_2, \ldots, x_n)$ and the weight vector be $W = (w_1, w_2, \ldots, w_n)$. The weighted sum can then be described as
We assume a very simple error measure
where $f(\cdot)$ is the activation function, and the actual output value is represented by $y$. The gradient value for the deltas of the backpropagation algorithm is
where $f'(\cdot)$ is the derivative of $f(\cdot)$. For a given weight $w_i$, the gradient can then be calculated as
We know that the ReLU function is
Combined with Eq. (6), the output has only two possible gradient values. That is
As the equations above show, the weights of a ReLU neuron stop updating when its input (the weighted sum) is negative. If this happens, the gradient flowing through the unit is zero from that point on. Once a ReLU ends up in this state, it is unlikely to recover, because gradient descent learning will no longer alter the weights.
For this reason, and with circuit realization in mind, we raise the negative half-axis of ReLU by 45 degrees to obtain the absolute value function (Abs). The Abs function is nonlinear over its domain. It can be described as $\mathrm{Abs}(x) = \begin{cases} -x, & x < 0 \\ x, & x \ge 0 \end{cases}$
We specifically considered the value of the derivative function at zero. After applying the Abs, Eq. (8) can be rewritten as
Our work will use the "active neuron rate" to verify the excellent performance of the Abs activation function in the FP-CNN architecture.
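The contrast between the ReLU and Abs gradients discussed above can be sketched numerically. The example assumes a squared-error measure $E = \frac{1}{2}(f(z) - y)^2$ with $z = W \cdot X$, and it sets the Abs derivative at zero to 0; both choices are assumptions made for illustration, since the exact error measure and the value chosen at zero are not reproduced here. All function names are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_grad(z):
    return (z > 0).astype(float)        # 0 on the flat side, 1 on the steep side

def abs_act(z):
    return np.abs(z)

def abs_grad(z):
    return np.sign(z)                   # -1 or +1; 0 at z == 0 (assumed convention)

def weight_grad(x, w, y, f, f_grad):
    """dE/dw for the assumed error E = 0.5 * (f(z) - y)**2 with z = w . x."""
    z = w @ x
    return (f(z) - y) * f_grad(z) * x

x = np.array([1.0, 2.0])
w = np.array([-1.0, -0.5])              # gives a negative weighted sum z = -2
y = 1.0

g_relu = weight_grad(x, w, y, relu, relu_grad)
g_abs = weight_grad(x, w, y, abs_act, abs_grad)
print(g_relu)                           # zero gradient: the ReLU unit cannot learn here
print(g_abs)                            # non-zero gradient: the Abs unit still updates
```

With a negative weighted sum, the ReLU gradient vanishes for every weight, whereas the Abs gradient keeps a magnitude proportional to the input.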
In the training procedure, we use the back-propagation (BP) algorithm with mini-batch stochastic gradient descent (SGD) to update the synaptic weights. Fig. 2(a) and (b) show the accuracy and the loss value, respectively, for different activation functions. With the Abs function, the FP-CNN achieves a 99% accuracy rate on MNIST in 100 epochs.
Because mini-batch SGD considers many inputs rather than a single input $x_i$, the hope is that not all inputs will put the ReLU on the flat side, so that the gradient is non-zero for some inputs. If at least one input $x_i$ has the ReLU on the steep side, then the ReLU is still alive because some learning is still going on, although the weight updates for this neuron become smaller. For this reason, our experiments set an activation threshold (EPS) that indicates whether a neuron is still alive; we set EPS to 1e-3 and 1e-4. If the difference between the updated weights in two iterations is greater than EPS, we consider the neuron to be active. Fig. 3 illustrates the activation rate when applying Abs and ReLU. The experimental results show that Abs yields a higher activation rate than ReLU.
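The activation-rate measurement can be sketched as follows. This is a minimal illustration, assuming that each row of the weight matrix belongs to one neuron and that a neuron counts as active when its largest absolute weight change exceeds EPS; the aggregation over a neuron's weights and the name `active_neuron_rate` are assumptions, not details taken from the experiments.

```python
import numpy as np

def active_neuron_rate(w_prev, w_next, eps=1e-3):
    """Fraction of neurons whose weights moved by more than eps between two iterations.

    Rows index neurons. Using the largest absolute change per neuron is an assumption.
    """
    change = np.max(np.abs(w_next - w_prev), axis=1)
    return float(np.mean(change > eps))

w_prev = np.zeros((4, 3))               # four neurons, three weights each
w_next = w_prev.copy()
w_next[0, 1] = 1e-2                     # clearly updated neuron
w_next[2, 0] = 5e-4                     # small update: active only for the looser threshold

rate_strict = active_neuron_rate(w_prev, w_next, eps=1e-3)
rate_loose = active_neuron_rate(w_prev, w_next, eps=1e-4)
print(rate_strict, rate_loose)          # 0.25 0.5
```

A smaller EPS counts more neurons as active, which is why reporting the rate at both 1e-3 and 1e-4 gives a fuller picture.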
Adaption to the hardware
As mentioned above, the FP-CNN can produce all of the output feature maps in one processing cycle. The system no longer needs to wait for an output feature map to be completed before the next operation can be executed, which eliminates the need for a data storage device between network layers.
The FP-CNN has no additive bias at the end of its layers, and all the multiplication operations are implemented on a memristive array according to Eq. (1) and Eq. (2). After training, the weight matrix contains both positive and negative weights, so a converted mapping method is applied (Fig. 4(a)). Assume that $W$ is a 2 × 2 weight matrix that includes positive and negative values. $W$ is converted to a 2 × 4 matrix so that the memristor crossbar can easily calculate the weighted sum with the amplifiers. Each original value is split into two parts, $W^+$ and $W^-$. A positive element is stored in $W^+$, and the corresponding $W^-$ entry is zero. Conversely, a negative element is stored in $W^-$, and the corresponding $W^+$ entry is zero. When the memristors in the array are arranged according to the converted matrix, $R_{off}$ takes the place of each zero element. Fig. 4(b) is a simple demonstration of a convolution computation in a memristor crossbar. A two-dimensional input image is converted into a one-dimensional electrical signal as the input, and the weighted-sum computation is completed by the memristive array.
The Abs module mainly consists of two op-amps and two diodes. The $V_{in}$ terminal receives the spike signals from the convolution layer. The $V_{out}$ terminal generates an activation spike and sends it to the pooling layer. The pooling module requires three op-amps and three Single-Pole-Double-Throw (SPDT) switches. The $V^{i}_{in}$ terminals receive the activated spike signals, and the maximum voltage is sent to the fully connected layer through $V_{out}$.
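The conversion of a signed weight matrix into a non-negative one can be sketched as follows. This is a minimal illustration, assuming that the $W^+$ and $W^-$ entries of each weight sit in adjacent columns and that the $R_{off}$ conductance is idealized as zero; the column ordering, the example values, and the name `split_weights` are assumptions.

```python
import numpy as np

def split_weights(w):
    """Map a signed (rows, cols) matrix to a non-negative (rows, 2*cols) matrix.

    Columns are interleaved as W+, W-, W+, W-, ... (assumed ordering).
    Zero entries stand for devices left at R_off, idealized here as zero conductance.
    """
    out = np.empty((w.shape[0], 2 * w.shape[1]))
    out[:, 0::2] = np.maximum(w, 0.0)    # positive parts
    out[:, 1::2] = np.maximum(-w, 0.0)   # magnitudes of negative parts
    return out

w = np.array([[0.5, -0.2],
              [-0.7, 0.3]])              # 2 x 2 signed weights
g = split_weights(w)                     # 2 x 4 non-negative matrix

x = np.array([1.0, 0.5])                 # input voltages
currents = x @ g                         # one summed current per crossbar column
y = currents[0::2] - currents[1::2]      # differential readout of each column pair
print(g)
print(y)
```

Subtracting the two columns of each pair recovers the signed weighted sum `x @ w`, which is the role the amplifiers play in this sketch.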
The fully connected layer requires 100 × 20 × 2 × 10 memristors to complete the final decision task, and the $Out_i$ voltage represents the classification result.
In total, the FP-CNN needs 1,336,000 memristors: 1,296,000 are located in the convolution layer, and 40,000 are located in the fully connected layer. However, half of the memristors in the array can be set to the maximum resistance value (representing a zero value in the matrix multiplication).
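These totals can be checked with a short calculation. The sketch below assumes a 28 × 28 MNIST input, a 2 × 2 pooling window (inferred from the 100 inputs per feature map in the fully connected layer), one crossbar module per kernel position, and a factor of two for the $W^+$/$W^-$ pair of each weight; the function and parameter names are hypothetical.

```python
def memristor_counts(image_w=28, image_h=28, kernel_size=9,
                     num_kernels=20, pool_size=2, num_classes=10):
    """Count memristors under the assumptions stated in the text above."""
    out_w = image_w - kernel_size + 1            # 20 for a 28-pixel side
    out_h = image_h - kernel_size + 1
    # One module per kernel and output position, two devices per weight (W+ and W-).
    conv = num_kernels * out_w * out_h * kernel_size ** 2 * 2
    pooled = (out_w // pool_size) * (out_h // pool_size)   # 100 values per feature map
    fc = pooled * num_kernels * 2 * num_classes
    return conv, fc, conv + fc

conv, fc, total = memristor_counts()
print(conv, fc, total)                           # 1296000 40000 1336000
```

Under these assumptions the calculation reproduces the stated figures for the convolution layer, the fully connected layer, and the whole network.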
Performance metrics function
The parameters of the FP-CNN, such as the size of each kernel, the number of kernels, and the size of the pooling layer, all determine the recognition accuracy and the power consumption, which are two of the most important concerns in hardware applications. In other words, the scale of the network can be moderately reduced for different applications to meet lower power consumption requirements. From the system flowchart in Fig. 5, we can see that the number of parallel convolutional modules $K_{convX_i}$ can be expressed as
The number of pooling modules is
where $k$ is the total number of convolution kernels, $I_W$ and $I_H$ denote the size of the input image, and $K_s$ and $P_s$ indicate the kernel size and the pooling size, respectively. From Fig. 6, we can see that there are three op-amps in each $K_{convX_i}$ module and two op-amps in each Abs module, and the number of analog switches (SPDT) and op-amps used in each pooling module is $P_s^2 - 1$, which can be approximated as $P_s^2 - 1 \approx P_s^2$ for ease of calculation. Therefore, the numbers of op-amps and SPDTs in the Abs and pooling modules are calculated, respectively, as
The memristors are distributed in the $K_{convX_i}$ modules and the fully connected modules. Therefore, the total number of memristors in the FP-CNN can be described as
where $N$ is the number of categories. In summary, the system's power consumption can be approximated as
where $P_{op}$, $P_{SPDT}$, and $P_{mem}$ are the power consumption of a single op-amp, SPDT switch, and memristor, respectively. The memristor crossbar array energy and area models are based on [5], the op-amp power dissipation data come from [8], and the SPDT power consumption model is adapted from [9].
Combining the power consumption, the recognition accuracy rate, and the number of memristors, we define a performance metrics function to resize the FP-CNN scale, that is, to choose $K_s$, $P_s$, and $k$. The metrics are as follows.
where Acc is the recognition accuracy rate, and the baseline is the accuracy we fix in advance. The value of β is determined by the specific parameters of the network, and the FP-CNN uses β = 0.1. The value of $V$ reflects the efficacy of the selected parameters in the hardware implementation. That is, when the baseline accuracy is given, the set of parameters with the largest $V$ value achieves the lower power consumption in the hardware implementation. We generated 100 sets of data on a PC platform with different $k$, $K_s$, and $P_s$ parameters, choosing $K_s$, $P_s$, and $k$ randomly in the range of 1 to 20. After the parameters were fixed, we trained for 100 epochs to obtain the corresponding recognition accuracy rate. After all the experimental data were obtained, we calculated $N_M$, $P_{cost}$, and $V$ using Eq. (16), Eq. (17), and Eq. (18), respectively. Fig. 7(c) shows the recognition accuracy for the corresponding number of convolution kernels. Fig. 7(d) illustrates the relationship between the size of the convolution kernel and the recognition accuracy. From Fig. 7(d), we can see that the recognition rate reaches its highest value when the convolution kernel size is 9 × 9, which is the size used in the FP-CNN.
To verify the metrics function, we set the baseline accuracy in the range of 90% to 99%, incrementing it by 0.01% in each cycle (i.e., baseline = 90%, 90.01%, 90.02%, …, 99%). In each cycle, we calculated the corresponding metrics $V$ and $P_{cost}$ and found that the data set with the largest $V$ value has the smallest $P_{cost}$ of all data. For instance, Fig. 7(b) illustrates that when the baseline is 90%, $V$ reaches its maximum value at the 81st data set, and the corresponding $P_{cost}$ reaches its minimum value.
System error analysis
This work is implemented with the Theano Python open-source library, and we also use LTspice for some circuit simulations. In the experiments, we chose the MNIST dataset for verification and normalized the gray value of the image as the input voltage; that is, the input voltage range is 0-1 V. In this paper, we use data from a fabricated TaOx memristor device. These devices can be repeatedly programmed to different target resistance states from 2 kΩ to 3 MΩ and show the needed linearity at sufficiently low voltages (≤ 0.3 V) [5]. The recognition accuracy in the figure is the result of averaging multiple experiments. The memristor conductance values are predetermined by training the FP-CNN on a software platform. After the training process, we use a closed-loop tuning scheme [10] to program the memristors to the trained values. The memristor device can be tuned from one resistance state to any other state with 1% error tolerance [5, 10].
In the simulation process, we use device resistance at different levels and randomly generate 1%-20% errors in memristor programming. These errors include the error tolerance of the memristor and the errors generated by the write circuit (the write circuit is referenced from [11]). We also consider the issue of device yield. We randomly select a fraction (1%-20%) of the devices and set their conductance value to zero, meaning that these components lose their function in the vector-matrix multiplication.
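The two device-level error sources described above can be sketched as a perturbation of a conductance matrix. This is a minimal illustration, assuming a uniformly distributed relative programming error and independent device failures with probability `fault_rate` (one minus the yield); the error distribution, the independence of failures, and the name `perturb_conductance` are assumptions rather than the simulation actually used.

```python
import numpy as np

def perturb_conductance(g, prog_error=0.05, fault_rate=0.05, rng=None):
    """Apply programming error and yield faults to a conductance matrix.

    prog_error: maximum relative programming error (0.01 to 0.20 in the text).
    fault_rate: probability that a device fails (1 - yield).
    """
    rng = np.random.default_rng() if rng is None else rng
    # Relative error drawn uniformly from [-prog_error, +prog_error] (assumed distribution).
    noisy = g * (1.0 + rng.uniform(-prog_error, prog_error, size=g.shape))
    # Failed devices contribute nothing to the vector-matrix product.
    failed = rng.random(g.shape) < fault_rate
    noisy[failed] = 0.0
    return noisy

rng = np.random.default_rng(0)
g_ideal = np.ones((200, 200))
g_real = perturb_conductance(g_ideal, prog_error=0.10, fault_rate=0.20, rng=rng)
print(np.mean(g_real == 0.0))            # observed fraction of failed devices
```

Repeating the perturbation with different random seeds and averaging the resulting accuracy mirrors the practice of reporting results averaged over multiple experiments.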
There are many op-amps in the activation function and pooling modules, and the amplifiers in each module may also affect the recognition accuracy. To account for this effect, we assume that each op-amp in the circuit has an input offset voltage error, an associated gain error, and voltage noise. The op-amp error model is constructed as follows:
where $E_{gain}$, $E_{offset}$, and $E_{noise}$ denote the op-amp voltage gain error, the offset voltage, and the voltage noise, respectively. As Fig. 8 shows, the op-amp errors are mainly generated at the output of the convolution module, the activation function output, and the pooling output. From the circuit diagram in Fig. 8, it can be seen that the impact of the SPDT noise in the pooling module also needs to be considered. So far we have discussed the impact of single factors on the recognition accuracy rate; the combined error factors should also be analyzed in the FP-CNN system. In the memristive array, we mainly consider the programming error, the yield of the memristor devices, and the op-amp error at the output. In the activation function module, we mainly analyze the error of the op-amps (each module has two). In the pooling module, the op-amps are used as comparators; therefore, we neglect their errors and only consider the influence of the SPDT noise (THD+N). Fig. 9 shows the recognition accuracy rate of the FP-CNN system after integrating multiple errors.
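One way to experiment with op-amp non-idealities in software is a first-order model in which the ideal output is scaled by the gain error and shifted by the offset and noise terms. The sketch below is an assumed form chosen for illustration and is not the error model equation of this paper; the function name `opamp_output`, the Gaussian noise, and the parameter values are hypothetical.

```python
import numpy as np

def opamp_output(v_ideal, e_gain=0.0, e_offset=0.0, noise_rms=0.0, rng=None):
    """Assumed first-order model: v = v_ideal * (1 + e_gain) + e_offset + noise."""
    rng = np.random.default_rng() if rng is None else rng
    v_ideal = np.asarray(v_ideal, dtype=float)
    # Zero-mean Gaussian voltage noise (assumed distribution).
    noise = rng.normal(0.0, noise_rms, size=v_ideal.shape)
    return v_ideal * (1.0 + e_gain) + e_offset + noise

v = np.array([0.1, 0.2])                          # ideal output voltages in volts
v_real = opamp_output(v, e_gain=0.01, e_offset=100e-6)
print(v_real)                                     # gain and offset only, no noise
```

Applying such a function at the output of the convolution, activation, and pooling stages of a software model would allow the error sources to be combined in one simulation.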
From Fig. 9(a), we can see that the memristive array programming error and the device yield have the greatest impact on the recognition accuracy. Even with the worst-case memristor programming error, the recognition accuracy can still reach over 96% when the op-amp offset voltage is 100 µV, the yield is 95%, and the SPDT noise is 0.5%. When the offset voltage is less than 300 µV, the recognition accuracy rate is hardly affected by the programming error, and a nearly 99% accuracy rate can still be achieved in the best case. Fig. 9(b) illustrates the relationship between the recognition accuracy and the yield. As seen from the diagram, the recognition accuracy rate shows a significant downward trend as the yield decreases. If the device yield is kept above 90%, the accuracy can be kept at 98%. The op-amp gain error and the SPDT noise have little effect on the recognition rate: the curves in Fig. 9(c) and Fig. 9(d) fluctuate around the mean value with no clear downward trend.
Based on the above analysis, when the errors are controlled within a reasonable range, the loss of recognition rate of the FP-CNN system stays within a limited range. In the best case, a 98.9% recognition accuracy rate can still be achieved. The performance achieved in this work was compared with other state-of-the-art designs for convolutional neural networks with memristor crossbars. As listed in Table I, our design offers comparable performance.
Conclusion
In this work, we propose the FP-CNN architecture, a fully parallel convolutional neural network with a memristive array that achieves high recognition accuracy. All the convolution computations can be finished in one processing cycle. The absolute-value activation function is applied in the FP-CNN, and we analyzed it mathematically and verified its performance experimentally. To adapt to recognition tasks of different complexity, we proposed the performance metrics function for resizing the scale of the FP-CNN. We also analyzed the impact of systematic errors on the recognition accuracy. Under the influence of normal errors, the FP-CNN can achieve a recognition rate of 98% or more.
