use of limited memristor resources and make the system work at high speed in real-time processing applications, we present a memristor-based cascaded framework built from several neuromorphic processing chips. That is, several neural network processing chips can be cascaded to improve the processing capability on the dataset. The basic computation unit in this work builds on our prior work developing a memristor-based CNN architecture [20], which validated that a three-layer CNN with the Abs activation function can reach the desired recognition accuracy.
The rest of this paper is organized as follows. Section II presents the cascaded method based on the basic computation unit and a split method, including the circuits implemented with a memristor crossbar array. Section III presents the experimental results. Section IV concludes the paper.
Proposed Cascaded Method

To give a better understanding of the cascaded method, we first introduce the basic units that make up the cascaded network and then present a detailed description of our cascaded CNN framework.
Basic Computation Unit
To take full advantage of the limited memristor array, a three-layer simplified CNN is used as the basic computation unit (BCU). The structure of this network is shown in Figure 1.

Figure 1. The basic computation unit architecture. It consists of three layers: the input layer is followed by a convolution layer with $k$ kernels, then an average-pooling layer and a fully connected layer.
The simplified CNN includes three layers. The convolution layer includes $k$ kernels of size $K_s \times K_s$, followed by the absolute-value nonlinearity (Abs); it extracts features from the input images and produces the feature maps. The average-pooling obtains spatial invariance while
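The three layers described above can be sketched in a few lines of NumPy. This is a minimal software illustration, not the memristor circuit: the function name `bcu_forward`, the stride-1 "valid" convolution, the non-overlapping pooling, and the absence of bias terms are all assumptions made for the example.

```python
import numpy as np

def bcu_forward(image, kernels, fc_weights, pool=2):
    """Sketch of a three-layer BCU: convolution -> Abs -> average pooling -> FC.

    image:      (H, W) input
    kernels:    (k, Ks, Ks) convolution kernels
    fc_weights: (n_out, k * Hp * Wp) fully connected weights
    """
    k, ks, _ = kernels.shape
    h, w = image.shape
    oh, ow = h - ks + 1, w - ks + 1  # stride-1 "valid" output size (assumption)

    # All Ks x Ks patches of the input: shape (oh, ow, Ks, Ks)
    patches = np.lib.stride_tricks.sliding_window_view(image, (ks, ks))
    # Multiply-add of every patch with every kernel, then Abs: shape (k, oh, ow)
    feature_maps = np.abs(np.einsum("hwij,kij->khw", patches, kernels))

    # Non-overlapping average pooling: shape (k, oh // pool, ow // pool)
    ph, pw = oh // pool, ow // pool
    pooled = feature_maps[:, :ph * pool, :pw * pool]
    pooled = pooled.reshape(k, ph, pool, pw, pool).mean(axis=(2, 4))

    # Fully connected layer on the flattened pooled features
    return fc_weights @ pooled.reshape(-1)

# Example with the sizes reported later in the paper: 14 kernels of 9x9, 2x2 pooling
rng = np.random.default_rng(0)
scores = bcu_forward(rng.random((28, 28)),
                     rng.standard_normal((14, 9, 9)),
                     rng.standard_normal((10, 14 * 10 * 10)))
```

With a 28×28 input, 9×9 kernels give 20×20 feature maps, and 2×2 pooling reduces them to 10×10, so the fully connected layer sees 14 × 10 × 10 = 1400 features.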
The Proposed Cascaded Framework
Aiming to combine several monolithic networks (BCUs) to obtain better performance, we propose a cascaded CNN network, whose specific design is shown in Figure 2a.
The cascaded framework includes three parts. Given the output $G \in \mathbb{R}^{C \times W \times H}$ generated from a part, a reconstruction transformation $f: \mathbb{R}^{C \times W \times H} \to \mathbb{R}^{C \times W \times H}$ is applied to aggregate the outputs over all BCUs of this part, where $C$ is the number of channels of the input image and $W$ and $H$ are the spatial dimensions. The output of the $k$-th part is described as
from the BCU outputs. The next part, Part #2, includes $N$ BCUs, which take the outputs of the previous part as inputs to produce $F_2 \in \mathbb{R}^{C_2 \times W_2 \times H_2}$. The final part, Part #3, includes $P$ BCUs: the first BCU takes $F_2$ as input to generate $G_{3\_1} \in \mathbb{R}^{C_3 \times W_3 \times H_3}$, the second BCU uses $G_{3\_1}$ to produce $G_{3\_2}$, and so on until $G_{3\_P} \in \mathbb{R}^{C_3 \times W_3 \times H_3}$ is produced. This cascaded mode is called the "M-N-P" type.
The typical "3-1-1" cascaded framework is shown in Figure 2b. It includes five BCUs, three in parallel and two in series, to produce the classification output. Based on the BCU architecture, the computation unit can be treated as an image transformer.
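The "M-N-P" data flow can be illustrated with a short sketch in which each BCU is treated as an opaque image-to-image callable. The names `run_cascade` and `mean_aggregate` are hypothetical, and the element-wise mean is only a placeholder for the reconstruction transformation $f$, whose exact form is not reproduced in this section. The sketch also assumes that all BCUs of a parallel part receive the same input.

```python
import numpy as np

def mean_aggregate(outputs):
    """Placeholder for the reconstruction transformation f (assumption)."""
    return np.mean(outputs, axis=0)

def run_cascade(image, part1, part2, part3, aggregate=mean_aggregate):
    """Run an "M-N-P" cascade of BCUs given as callables."""
    # Part #1: M BCUs in parallel on the input image, outputs aggregated
    f1 = aggregate([bcu(image) for bcu in part1])
    # Part #2: N BCUs in parallel on the Part #1 output, outputs aggregated
    f2 = aggregate([bcu(f1) for bcu in part2])
    # Part #3: P BCUs in series, each feeding the next
    g = f2
    for bcu in part3:
        g = bcu(g)
    return g

# Toy "3-1-1" cascade with stand-in BCUs (simple arithmetic, not trained networks)
toy_part1 = [lambda x: x + 1.0, lambda x: x + 3.0, lambda x: x + 5.0]
toy_part2 = [lambda x: 2.0 * x]
toy_part3 = [lambda x: x - 1.0]
result = run_cascade(np.zeros((28, 28)), toy_part1, toy_part2, toy_part3)
```

In a real cascade the last BCU would emit class scores rather than an image; the toy callables here only make the wiring visible.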
According to Equation (1), the output $G$ can be described as a multiply-add calculation, so it can be performed by several memristors.
As mentioned above, the BCU is a simplified CNN architecture. After the network training is finished, the weight matrix also has some negative weights.

arrangement of the memristors in the array corresponds to the converted matrix, and $R_{off}$ takes the place of the zero elements.
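Because a conductance cannot be negative, a signed weight matrix has to be converted before it is placed on the crossbar. The sketch below shows one common conversion, a split into a positive and a negative non-negative matrix whose outputs are subtracted; it is an assumption for illustration and not necessarily the conversion used in this work. The zero entries of each matrix are the positions where a device would sit in its off state.

```python
import numpy as np

def split_signed_weights(w):
    """Split a signed matrix into two non-negative matrices with w = w_pos - w_neg."""
    w_pos = np.maximum(w, 0.0)   # keeps positive weights, zeros elsewhere
    w_neg = np.maximum(-w, 0.0)  # keeps magnitudes of negative weights, zeros elsewhere
    return w_pos, w_neg

w = np.array([[0.5, -0.2],
              [0.0,  0.3]])
x = np.array([1.0, 2.0])

w_pos, w_neg = split_signed_weights(w)
# Differential read-out: subtract the two non-negative multiply-add results
y = w_pos @ x - w_neg @ x
```

The differential result `y` equals the signed product `w @ x`, so the conversion does not change the computed output.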
The BCU can produce entire output feature maps in one processing cycle. It no longer needs to wait for an output feature map to be completed before the next operation can be executed. This means that no data storage device is required between network layers.

function in Equation (4). It should be noted that we use $Out_{11}$ to cascade the next network, while $Out_{12}$ is treated as the output when a single BCU is used for classification.
To enhance the driving capability of crossbar arrays, driver circuits are needed so that low current does not need to be reconfigured.
To reduce the pressure on the input terminals, the BCU split method needs to be considered. According to Equation (1), the input image $I^*$ is sent to the BCU to complete the convolutional calculation. The input $I^*$ can be described as $I^* = I^*_1 + I^*_2 + \dots + I^*_N$, so Equation (1) can be rewritten as
where $H: \mathbb{R}^{C \times W \times H} \to \mathbb{R}^{C^* \times W^* \times H^*}$ is the transformation in the BCU, $\oplus$ represents the reconstruction operator, which combines the outputs of the different split chips using an element-wise addition, and $N$ is the number of subimages. For example, this chip will perform the convolution calculation K convX i1 i11 in the portion of $I^*$.
As mentioned above, the fully connected layer calculates $W \cdot G^*$ (here $G^*$ can be regarded as an element of $\mathbb{R}^{1 \times mN}$), and chip #i will map the output to the

outputs are generated by these chips and accumulated by the $\oplus$ operation.
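The split method relies on the multiply-add being linear: if the input is written as a sum of subimages, the convolution outputs of the subimages add up element-wise to the output for the full image. The sketch below checks this numerically. The helper names, the row-band split, and the zero padding of each subimage are assumptions for illustration, and the accumulation is applied to the linear convolution output, before any Abs nonlinearity.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Stride-1 "valid" multiply-add of a kernel over an image."""
    patches = np.lib.stride_tricks.sliding_window_view(image, kernel.shape)
    return np.einsum("hwij,ij->hw", patches, kernel)

def split_image(image, n):
    """Split an image into n zero-padded row bands that sum back to the image."""
    bounds = np.linspace(0, image.shape[0], n + 1).astype(int)
    parts = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        part = np.zeros_like(image)
        part[lo:hi] = image[lo:hi]  # only these rows need to be driven on this chip
        parts.append(part)
    return parts

rng = np.random.default_rng(0)
image = rng.random((28, 28))
kernel = rng.standard_normal((9, 9))

parts = split_image(image, 4)
full = conv2d_valid(image, kernel)
# Element-wise accumulation of the per-chip outputs (the role of the ⊕ operator)
accumulated = sum(conv2d_valid(p, kernel) for p in parts)
```

Each subimage is non-zero only in its own band, which is why a split chip needs fewer active input terminals than a chip that receives the whole image.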
have the same format as the MNIST dataset and are 28×28 grayscale images with corresponding labels $l \in [0, 9]$. The Fashion-MNIST dataset is a step up from the MNIST dataset. It remains simple enough that complex architectures, learning algorithms, and models are not needed to see progress and soundly measure performance.
The weights of the neural network are converted to conductance values of the memristor devices by Equation (6). $C_{max}$ and $C_{min}$ indicate the maximum and minimum conductances, respectively, $W$ is the original weight set, $W_{max}$ represents the maximum absolute value in the weight set, and $C_i$ is the conductance of the memristor crossbar array.
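A weight-to-conductance conversion of this kind can be sketched as follows. Equation (6) itself is not reproduced in this section, so the linear form used here, $C_i = C_{min} + (C_{max} - C_{min}) \, |W_i| / W_{max}$, is an assumption that is merely consistent with the symbols defined above. The function name and the conductance values in the example are hypothetical.

```python
import numpy as np

def weights_to_conductance(w, c_min, c_max):
    """Linearly map weight magnitudes onto [c_min, c_max] (assumed form)."""
    w_max = np.max(np.abs(w))  # maximum absolute value in the weight set
    return c_min + (c_max - c_min) * np.abs(w) / w_max

weights = np.array([-0.8, 0.0, 0.4])
# Hypothetical conductance window in siemens, chosen only for illustration
conductances = weights_to_conductance(weights, c_min=1e-6, c_max=1e-4)
```

Under this assumed mapping the largest-magnitude weight lands on $C_{max}$ and a zero weight lands on $C_{min}$; the sign of each weight has to be handled separately, since only magnitudes are mapped.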
The 1T1R memristor crossbar array, which stores the network weights, is used for simulation. The segment has a resistance of 2 Ω, and 5% programming errors were generated in the memristor programming simulation process. The circuit structure in HSPICE is shown in Figure 4. The simulation process is as follows.

2. Execute the software simulation (using Python) and the circuit simulation (using HSPICE) based on step 1.

Table 1. Cascaded architecture for Fashion-MNIST under 4.15 M parameters. The number before "Cascaded" is the number of BCUs. "×14" indicates that the cascaded network uses 14 convolution kernels.

Stage | Output Size | 7-Cas.×14 (4-2-1) | 8-Cas.×14 (4-2-2) | 9-Cas.×14 (4-3-2)

Based on the basic computation unit, we test cascaded networks with different architectures on the MNIST and Fashion-MNIST datasets; the configuration details and performance are shown in Table 1.
The BCU with 14 kernels of size 9×9, 2×2 pooling, and the Abs activation function achieves 89.68% accuracy on Fashion-MNIST, while Cascaded×14 achieves improvements of

8-Cascaded×14 (4-2-2) slightly outperforms "4-2-1" (∼0.13%) with 10,000 more parameters. It can be seen that cascaded models can achieve 93.54% accuracy under 4.15 M parameters.

The plots in Figure 7 show the potential output of the BCU obtained by the HSPICE simulation and the software simulation output (using Python) on 10,000 test samples.

and fully connected layers. As shown in Figure 7c, the differences are concentrated in the range of $-5 \times 10^{-6}$ to $10 \times 10^{-6}$. In Figure 7f, the differences are concentrated in the range of $-5 \times 10^{-3}$ to $5 \times 10^{-3}$. This is a difference of more than two orders of magnitude between the kernel outputs and the fully connected layer outputs. The simulation experiments demonstrate that the output of the circuit is close to the software simulation, indicating that the circuit implementation is feasible.
