ABSTRACT There is great interest in developing hardware accelerators for convolutional neural networks (CNNs) that offer better energy efficiency and throughput than GPUs. Existing solutions have relatively limited parallelism and large power consumption (including leakage power). In this paper, we present a resistive random access memory (ReRAM)-accelerated CNN that achieves significantly higher throughput and energy efficiency when the CNN is trained with binary constraints on both weights and activations and is then mapped onto a digital ReRAM-crossbar. We propose an optimized accelerator architecture tailored for bitwise convolution that features massive parallelism with high energy efficiency. Numerical experiment results show that the binary CNN accelerator on a digital ReRAM-crossbar achieves a peak throughput of 792 GOPS at a power consumption of 4.5 mW, which is 1.61 times faster and 296 times more energy-efficient than a high-end GPU.
I. INTRODUCTION
CONVOLUTIONAL neural networks (CNNs) have become a promising machine learning engine for image-oriented data analytics [1]. GPU-based CNN accelerators currently dominate. They achieve high throughput in convolution but at high power consumption [1]. On the other hand, field-programmable gate array (FPGA)-based CNN accelerators have also been investigated because of their energy efficiency benefits [2], but they offer quite limited parallelism and need reduced numeric precision. Moreover, for image-data-oriented computing, a large amount of data needs to be held in memory, which incurs significant leakage power consumption. As such, a re-examination of both the CNN algorithm and the underlying hardware platform is required to achieve high energy efficiency as well as high throughput in convolution.
The recent advancement in binary-constrained deep learning [3] has introduced new insight for more efficient hardware acceleration of the CNN. The work in [3] demonstrates the successful use of binarized weights in a CNN trained under binary constraints. As such, highly parallel bitwise operations can be realized in hardware with great potential to outperform GPUs. However, there is little work exploring hardware accelerator architectures for binary CNNs. Note that the main computation here involves intensive bitwise operations such as convolution, batch normalization, pooling, and activation functions. Moreover, traditional CPU/GPU-based acceleration computes outside the memory, with arithmetic cores that have poor memory bandwidth and energy efficiency, not to mention the standby power.
Emerging nonvolatile memory (NVM) technologies have shown significantly reduced standby power and increased integration density, as well as access speed close to that of DRAM/SRAM. In addition, studies in [4]-[6] have shown logic implementations based on NVM devices. The resistive random access memory (ReRAM) devices [7] have shown great potential for energy-efficient acceleration, especially because a multiplier maps naturally onto a crossbar. ReRAM can be employed as both a storage and a computation element with minimized leakage power due to its nonvolatility [8]. In contrast, traditional CPU/GPU-based acceleration performs computation outside the memory, with arithmetic cores that have poor bandwidth to access memory and low efficiency in processing data, not to mention the huge standby power of the memory. A digital ReRAM-crossbar [9] can further support binary matrix-vector multiplication even under strong nonuniformity from process variation, in contrast to analog ReRAM-crossbar computing, which requires additional ADCs [10]-[12]. The digital ReRAM-crossbar implementation is also shown in [13] for a single-layer feedforward network, which is still limited on large data sets such as CIFAR-10 [14]. Compared with [15], this paper evaluates the system on the larger-scale benchmark CIFAR-10 instead of MNIST [16] and performs more energy-efficiency analysis. Moreover, previous works [9], [10], and [12] only show the realization of the DOT (matrix-vector multiplication) operation used in convolution and explore only the MNIST benchmark. There is no study on how to realize bitwise matrix-vector multiplication, batch normalization, the pooling function, and the activation function all on ReRAM-crossbar devices.
In this paper, based on the digital ReRAM-crossbar, we have developed bitwise CNN (BCNN)-based image-data processing using the CIFAR-10 benchmark. We show that the digital ReRAM-crossbar can be used for bitwise convolution, batch normalization, the pooling function, and the activation function, all in ReRAM-crossbar devices. Moreover, the intermediate results between steps do not need to be written back to memory, and little standby power is consumed. The inference accuracy is observed to remain high under ReRAM device variation as well as under the binary constraints.
The rest of this paper is organized as follows. The BCNN operations are compared with the conventional CNN in Section II. The in-memory ReRAM accelerator architecture and digital ReRAM-crossbar background are introduced in Section III. The mapping between the BNN and the digital ReRAM-crossbar is discussed in Section IV. Numerical experiment results are presented in Section V, with the conclusion drawn in Section VI.
II. CNN WITH BITWISE PARALLELISM
The recent work in [3] suggests a CNN using binary constraints during training. In this section, we will discuss how to generate a CNN model with bitwise parallelism for convolution, batch normalization, pooling function, and activation function.
A. BITWISE CONVOLUTION
The convolution is the most time-consuming and computation-intensive operation in a CNN. The binary convolution in [3] uses {−1, +1} for both input features and weights, so that floating-point matrix-vector multiplication is not required. However, negative binary weights cannot be directly realized in hardware. To avoid negative weights, recent work [17] uses bitwise XNOR and bit-count operations in {0, 1} for hardware implementation. The BCNN can be represented as follows:
where
is the binary input feature map and also the output of the binary convolution, and ⊗ is defined as the bitwise XNOR operation. Compared with a real-valued CNN in the single-precision data format, since the elements of weights and feature maps can each be stored in 1 bit, both the logic and memory resources required for a binary convolutional (BinConv) layer can be greatly reduced. Meanwhile, this leads to higher parallelism as well as greater energy efficiency.
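The equivalence between the {−1, +1} inner product and the XNOR plus bit-count form in {0, 1} can be checked with a small sketch. It is only an illustration under stated assumptions: the helper names (`to_bits`, `xnor_popcount`, `signed_dot`) and the example vectors are hypothetical, and the mapping −1 → 0, +1 → 1 is assumed. With that mapping, a signed inner product over N elements equals 2s − N, where s is the XNOR bit-count.

```python
def to_bits(signed):
    """Map a {-1, +1} vector to {0, 1} (assumed mapping: -1 -> 0, +1 -> 1)."""
    return [(v + 1) // 2 for v in signed]


def xnor_popcount(a_bits, w_bits):
    """Bitwise XNOR followed by bit-count: number of positions that agree."""
    return sum(1 for a, w in zip(a_bits, w_bits) if a == w)


def signed_dot(a, w):
    """Reference inner product in the {-1, +1} domain."""
    return sum(x * y for x, y in zip(a, w))


# Hypothetical flattened 3x3 input patch and binary filter.
a = [+1, -1, -1, +1, +1, -1, +1, -1, +1]
w = [+1, +1, -1, -1, +1, -1, -1, -1, +1]

n = len(a)
s = xnor_popcount(to_bits(a), to_bits(w))

# The signed result is recovered from the bit-count alone.
recovered = 2 * s - n
```

Because the signed value is an affine function of the bit-count, later stages only need s, which is why no negative weight has to be stored.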
B. BITWISE BATCH NORMALIZATION
Next, batch normalization is required to stabilize and accelerate the training process. In the inference stage, the normalization is retained to match training process. The output of the normalization can be represented by
where $\mu \in \mathbb{R}^{W_k \times H_k \times D_k}$ and $\sigma^2 \in \mathbb{R}^{W_k \times H_k \times D_k}$ are the expectation and variance over the mini-batch, while $\gamma \in \mathbb{R}$ and $\beta \in \mathbb{R}^{W_k \times H_k \times D_k}$ are learnable parameters [18] that scale and shift the normalized value. In the inference stage, $\mu$, $\sigma^2$, $\gamma$, and $\beta$ are all fixed to normalize the convolution output.
C. BITWISE POOLING AND ACTIVATION FUNCTIONS
The pooling layer performs downsampling across an M × M contiguous region of the feature map output by the normalization layer. Pooling is used to select the most significant information from the features. It also provides translation invariance and reduces the computation intensity. Two kinds of pooling schemes are commonly used in CNNs. One is max-pooling (MP), which takes the maximum value of the pooling region. The other is average-pooling, which takes the mean value of the pooling region. In this paper, the binary MP is applied with the following equation:
where a k and a k are the features before and after pooling, respectively. The activation function to process the output of MP is called binarization (Binrz), which can be represented as
VOLUME 3, 2017
To summarize pooling and binarization, we can observe that these two steps find the sign of the maximum number in the pooling region. As such, we can do the binarization for all the numbers in the pooling region first, and then find the maximum among them. Since the results from binarization become only 0 or 1, the pooling process only needs to detect if there is any 1 in the region. An example with the exchange of pooling and binarization for a 2 × 2 region is shown as
Here, the 2×2 real-value matrix denotes the output of batch normalization. The first path in (5) is doing binarization first, then MP, while the second path is doing it conversely. It is clear that exchanging the MP and binarization will not affect the final output in the inference stage.
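The exchange of MP and binarization can be verified numerically with a minimal sketch. The function names and the sample 2 × 2 regions below are hypothetical, and the binarization threshold at zero (non-negative maps to 1) is an assumption. The argument only relies on binarization being monotonic, so the maximum is binarized to 1 exactly when at least one element is.

```python
def binarize(x):
    """Binrz: 1 for non-negative input, 0 otherwise (threshold at zero is assumed)."""
    return 1 if x >= 0 else 0


def pool_then_binarize(region):
    """Max-pooling on real values, followed by binarization."""
    return binarize(max(region))


def binarize_then_pool(region):
    """Binarization first; max-pooling on {0, 1} values is then a logical OR."""
    return int(any(binarize(x) for x in region))


# Hypothetical flattened 2x2 regions of batch-normalization outputs.
regions = [
    [-0.7, 0.3, -1.2, -0.1],   # mixed signs
    [-0.7, -0.3, -1.2, -0.1],  # all negative
    [0.5, 0.3, 1.2, 0.1],      # all positive
]

results = [(pool_then_binarize(r), binarize_then_pool(r)) for r in regions]
```

Both orders agree on every region, and the second order only ever handles single bits.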
D. BITWISE CNN MODEL OVERVIEW
The overall working flow of a bitwise-parallelized CNN model is shown in Fig. 1(b) . Each BinConv layer takes the binary feature map generated from the previous layer as input and conducts a bitwise convolution between binary feature maps and binary filter weights. The convolution output is further processed by the normalization layer before downsampling by the MP layer. The downsampled feature maps are subsequently fed into Binrz layer that produces binary nonlinear activations according to the input sign.
The key difference between the developed BCNN and the direct-truncated CNN using fewer precision bits [19] is as follows. The direct-truncated CNN is obtained by reducing the numerical precision in the posttraining phase, while the BCNN developed here is obtained by training with binary constraints [3]. As such, the direct-truncated CNN suffers from accuracy loss in general, but the BCNN retains most of the accuracy at the lowest precision. In the inference stage, only the binarized weights $w^b_k$ are retained, for much smaller storage, faster inference, higher parallelism, and higher energy efficiency. The BNN algorithm is detailed in the supplementary file.
III. DIGITAL ReRAM FOR IN-MEMORY COMPUTING
In this section, the basics of the ReRAM device and the digital ReRAM-crossbar are reviewed to support an in-memory computing architecture, which will be used to map the BCNN discussed in Section II.
A. ReRAM DEVICE
The emerging ReRAM [20], [21] is a two-terminal device for memory storage with two nonvolatile states: high resistance state (HRS) $R_{\mathrm{OFF}}$ and low resistance state (LRS) $R_{\mathrm{ON}}$. Besides working as memory storage, a ReRAM-crossbar can be applied to perform logic operations. In one ReRAM-crossbar, given the input probing voltage $V_{WL}$ on each word-line (WL), the current $I_{BL}$ on each bit-line (BL) becomes the natural multiplication-accumulation of current through each ReRAM device. Therefore, the ReRAM-crossbar array can intrinsically perform the analog matrix-vector multiplication [12], which for each bit-line $j$ reads
$$I_{BL,j} = \sum_{i} c_{i,j}\, V_{WL,i}$$
where $c_{i,j}$ is the configurable conductance of the ReRAM resistance $R_{i,j}$, which can represent real-valued weights. Compared to the traditional CMOS implementation, the ReRAM-crossbar achieves a higher level of parallelism and consumes less power, including standby power.
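The multiply-accumulate behaviour of the bit-line currents can be sketched numerically. This is an idealized illustration, not a device model: it ignores wire resistance and sneak paths, the function name `bitline_currents` and all numeric values are hypothetical, and the indexing convention (`conductance[i][j]` connects word-line i to bit-line j) is an assumption.

```python
def bitline_currents(conductance, v_wl):
    """Ideal crossbar: I_BL[j] = sum over i of conductance[i][j] * v_wl[i].

    conductance[i][j] is the conductance (in siemens) of the device that
    connects word-line i to bit-line j (assumed indexing convention).
    """
    n_wl = len(v_wl)
    n_bl = len(conductance[0])
    return [
        sum(conductance[i][j] * v_wl[i] for i in range(n_wl))
        for j in range(n_bl)
    ]


# Hypothetical crossbar with 3 word-lines and 2 bit-lines.
c = [
    [2e-6, 1e-6],
    [0.0, 2e-6],
    [1e-6, 1e-6],
]
v = [1.0, 0.5, 0.0]  # word-line voltages in volts

i_bl = bitline_currents(c, v)  # one accumulated current per bit-line, in amperes
```

Every bit-line accumulates its current at the same time, which is where the parallelism of the crossbar comes from.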
B. DIGITAL ReRAM-CROSSBAR
Previous mapping of CNNs is mainly based on the traditional analog ReRAM-crossbar [10], [12]. The main limitation is the large nonuniformity of the analog resistance for undetermined states in the analog computation, which can result in convolution error [22]. In addition, a large overhead is required for ADC conversions. As shown in Fig. 2, the recent work in [9] introduces a digital ReRAM-crossbar for robust digital matrix-vector multiplication. In the digital ReRAM-crossbar, only 0 or $V_r$ is applied on the input word-line (WL), and only HRS or LRS is configured for each ReRAM device. To overcome the sneak-path problem, the half-voltage operating scheme [23] is applied. For the output BL, a sense amplifier (SA) is applied to identify whether the output is 0 or 1. Note that the key difference here is that the threshold $V_{th}$ of the SA in each BL can be configured as a ladder of voltages. As such, the sensed analog output can be encoded in binary to produce the multiplied result [9]. Such a binary matrix-vector multiplication can still be used in applications such as [24].
Compared to the traditional analog ReRAM-crossbar [10], [12], the digital ReRAM-crossbar has advantages in the following respects.
1) It has better programming accuracy of the ReRAM device under process variation and requires no additional ADC conversion.
2) Since only LRS and HRS are configured in the digital ReRAM-crossbar, it does not require a high HRS/LRS ratio, so a low-power (high LRS resistance) device [21] can be applied. In addition, the wire resistance has little effect on the IR-drop when using the low-power ReRAM device.
3) The binary input voltage is more robust to the IR-drop in a large-size crossbar.
Moreover, there is no work exploring how to map all the CNN operations, such as normalization, pooling, and activation functions, onto the ReRAM devices. This paper will show in Section IV how to map all BCNN operations onto the digital ReRAM devices.
C. IN-MEMORY COMPUTING ARCHITECTURE
Based on the digital ReRAM-crossbar, one can develop an in-memory computing architecture with both memory and logic implemented by the ReRAM-crossbar, as shown in Fig. 3. In this architecture, data-logic pairs are located in a distributed fashion, where data transmission among the data block, logic block, and external scheduler is managed by a control bus. As such, a logic block can read data locally and write back to the same data block after the logic computation is done [9]. As a result, the huge I/O communication load between memory and a general processor is relieved because most of the data transmission is done inside data-logic pairs. Based on this in-memory computing architecture using the digital ReRAM-crossbar, we introduce the main contribution of this paper in the next section: how to map all of the BCNN operations.
IV. BITWISE CNN ON DIGITAL ReRAM-CROSSBAR
In this section, we will focus on the mapping of all the BCNN operations on the digital ReRAM-crossbar such as convolution, batch normalization, pooling, and activation functions.
A. MAPPING BITWISE CONVOLUTION
According to (1), the bitwise convolution can be split into several XNOR and bit-count results of two vectors. To implement (1), we can use two AND operations for $a^b_{k-1}$ and $w^b_k$ as well as their complements. Therefore, the bitwise convolution on the ReRAM-crossbar can be shown as follows:
The mapping of the bitwise convolution is shown in Fig. 4(a). It requires a 2N × N ReRAM-crossbar, where N is the number of elements in the vector. All columns are configured with the same elements, which correspond to one column in the binary weight $w^b_k$ of the neural network, and the WL voltages are determined by the binary input $a^b_{k-1}$. Due to the large ratio between $R_{\mathrm{OFF}}$ and $R_{\mathrm{ON}}$, the current through the BL is approximately equal to
, where $s_k$ is the inner-product result in (1). Since the current of all BLs is identical, the ladder-like threshold voltages $V_{th,j}$ are set as follows:
where V th,j is the threshold voltage for the j th column. If we use s to denote the output array, and s (j) to denote the output of column j in ReRAM-crossbar, we can have
In this case, the inner-product result $s_k$ can be read from the output array: the first $(N - s_k)$ output bits are 0 and the remaining $s_k$ bits are 1. The relation between s k and s k can be expressed as s k = g(s k ).
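The two ideas described in this subsection, the XNOR bit-count obtained from two AND terms and the ladder-like sensing that yields a thermometer-style output, can be sketched behaviourally. This is not a circuit model: the function names are hypothetical, and the thresholds are modelled as integer counts rather than voltages, which is an assumption made for illustration.

```python
def crossbar_and_count(a_bits, w_bits):
    """Inner product from two AND terms: sum(a AND w) + sum(NOT a AND NOT w).

    This equals the XNOR bit-count of the two {0, 1} vectors.
    """
    direct = sum(a & w for a, w in zip(a_bits, w_bits))
    complement = sum((1 - a) & (1 - w) for a, w in zip(a_bits, w_bits))
    return direct + complement


def thermometer_output(s, n):
    """Sense-amplifier outputs of n columns with ladder-like thresholds.

    Thresholds are integer counts here (assumption): column j (0-based)
    outputs 1 when s exceeds n - 1 - j, so the first n - s bits are 0
    and the remaining s bits are 1.
    """
    return [1 if s > n - 1 - j else 0 for j in range(n)]


# Hypothetical 4-element binary input and weight vectors.
a_bits = [1, 0, 1, 1]
w_bits = [1, 1, 1, 0]

s_k = crossbar_and_count(a_bits, w_bits)
out_bits = thermometer_output(s_k, len(a_bits))
decoded = sum(out_bits)  # counting the 1s recovers the inner product
```

In this sketch the output array carries the same information as the count itself, only in a form that a row of sense amplifiers can produce in one step.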
As described in (1), each binary weight vector w k performs bitwise convolution with several input features. As a result, each logic block in Fig. 3 stores a binary vector w k , while the control bus transmits the input feature sequentially. In this case, bitwise convolution can be performed in parallel in separated logic blocks.
B. MAPPING BITWISE BATCH NORMALIZATION
Bitwise batch normalization requires two digital ReRAM-crossbars in the implementation. The first ReRAM-crossbar performs the XOR operation on adjacent bits of the output of the bitwise convolution. It can be expressed as
After that, the second ReRAM-crossbar builds a lookup table (LUT). Since $\mu$, $\sigma^2$, $\gamma$, and $\beta$ are all fixed in the inference stage, (2) can be rewritten as
where $f(\cdot)$ represents the LUT. As a result, the LUT is stored in the second ReRAM-crossbar according to the parameters $\mu$, $\sigma^2$, $\gamma$, and $\beta$. As described in (10), only the $s_k$th row of the LUT is selected, so the batch normalization result can be read out directly. The threshold voltages of both ReRAM-crossbars are
For a better illustration, Fig. 4(b) shows the detailed mapping, and Table 1 shows the values to store in the second ReRAM-crossbar when $\mu = 2.5$, $\sigma^2 = 5$, $\gamma = 1$, and $\beta = 0$, encoded according to the IEEE-754 standard.
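How such a LUT could be precomputed offline is sketched below. Several points are assumptions rather than details taken from the paper: the standard batch-normalization form γ(s − µ)/√(σ² + ε) + β, the stabilizer ε (set to 0 here), single-precision encoding, and a vector length of N = 5. The helper names are hypothetical. Only the parameter values µ = 2.5, σ² = 5, γ = 1, and β = 0 come from the text.

```python
import math
import struct


def batch_norm(s, mu, var, gamma, beta, eps=0.0):
    """Standard batch-normalization form (assumed); eps is an assumed stabilizer."""
    return gamma * (s - mu) / math.sqrt(var + eps) + beta


def float32_bits(x):
    """IEEE-754 single-precision pattern as a 32-character string, sign bit first."""
    (raw,) = struct.unpack(">I", struct.pack(">f", x))
    return format(raw, "032b")


def build_lut(n, mu, var, gamma, beta):
    """One row per possible inner-product value s = 0..n."""
    return [float32_bits(batch_norm(s, mu, var, gamma, beta)) for s in range(n + 1)]


# N = 5 is a hypothetical vector length chosen to keep the table small.
lut = build_lut(5, mu=2.5, var=5.0, gamma=1.0, beta=0.0)

# The leading character of each row is the sign bit (1 = negative).
sign_bits = [row[0] for row in lut]
```

Since the parameters are fixed at inference, the table is computed once, and reading row s replaces the floating-point arithmetic at run time.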
C. MAPPING BITWISE POOLING AND BINARIZATION
According to (5), we do the binarization first and then perform MP. The bitwise activation in (4) can be achieved by selecting the sign bit of the binary-format output of Fig. 4(b). In MP, the output is 0 only when all the numbers in the pooling region are negative. As a result, we can add the complements of all the sign bits in the pooling region. If the result is not 0, at least one positive number is in the pooling region, so the pooling result is 1. In summary, the MP in (3) can be rewritten as
As a result, both of the bitwise MP and binarization can be implemented by the addition operation performed on the digital ReRAM-crossbar with one column, as shown in Fig. 4(c) .
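The addition-based pooling described above can be sketched as follows. The function name is hypothetical, and the sign-bit convention (1 means negative, as in IEEE-754) is assumed; the sketch is behavioural and does not model the single-column crossbar itself.

```python
def max_pool_binarized(sign_bits):
    """Binary max-pooling over one region, given sign bits (1 = negative).

    The complements of the sign bits are added; a non-zero sum means at
    least one non-negative value is present, so the pooled output is 1.
    """
    non_negative_count = sum(1 - b for b in sign_bits)
    return 1 if non_negative_count != 0 else 0


# Hypothetical sign bits for three flattened 2x2 pooling regions.
outputs = [
    max_pool_binarized([1, 1, 1, 1]),  # all negative
    max_pool_binarized([1, 0, 1, 1]),  # one non-negative value
    max_pool_binarized([0, 0, 0, 0]),  # all non-negative
]
```

The only comparison needed is whether the sum is zero, which matches a single sensing threshold.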
D. SUMMARY OF MAPPING BITWISE CNN
As a summary, the four operations of the BCNN in Fig. 1(b) can be fully mapped onto the digital ReRAM-crossbar. All the threshold voltages are fixed even if the parameters of BCNN are changed. Although these operations are implemented in different ReRAM-crossbar arrays, the input/output formats of them are compatible so that one can directly connect them as shown in the logic block in Fig. 3 with pipeline used. The pipeline design is based on Table 2, so that each stage implements a layer. CONV-2, CONV-4, and CONV-6 are the stages which require the most ReRAM cells and computation time. As a result, more logic blocks are assigned to these steps to relieve the critical path. In our simulation, half of the digital ReRAM crossbars are used to perform these three layers. Moreover, since the output feature from Fig. 4(c) is binary, the area overhead and energy consumption of the data storage can be significantly reduced. In addition, because layers in Table 2 are implemented in different data-logic pairs, the volume of data transmitted is also decreased. 
V. NUMERICAL RESULT
A. SIMULATION SETTINGS
1) BASELINES
In the simulation, we have implemented different baselines for comparison using both MNIST and CIFAR-10 benchmarks. The detail of each baseline is listed as follows.
1) CPU: The BNN simulation is run in MatConvNet [25] on a computer server with a 3.46-GHz core and 64.0-GB RAM. The BNN network follows Table 2.
2) Our Design of Digital-ReRAM: The ReRAM device model for the BNN is based on [21], [26], and [27], with the ReRAM resistance set to 0.5 MΩ and 5 MΩ for the on-state and off-state, respectively, and a working frequency of 200 MHz. The SA is based on the design of [28]. The BNN network also follows Table 2.
3) Others: GPU-based [3] BCNN implementations and FPGA-based [2], CMOS-ASIC-based [29], and analog-ReRAM-based [12] conventional CNN implementations are selected for performance comparisons as well.
2) NETWORK
The overall network architecture of the BCNN (called BNN) is shown in Table 2. It has six BinConv layers, three MP layers, and three fully connected (FC) layers. It takes a 32 × 32 RGB image as the input of the first layer. We use 128 sets of binary filters, and each set contains three binary filters to process the data from the R, G, and B channels, respectively. The filter width and height in each BinConv layer are fixed to 3 × 3 with a stride of 1 and zero padding of 1. The BinConv layer performs the bitwise convolution between input feature maps and weights, followed by the bitwise batch normalization and the binary activation. Note that the computation for FC layers can be treated as a convolution with stride of 0 and zero padding of 0. Thus, we refer to the matrix multiplication in FC layers as convolution as well. As shown in Table 2, two cascaded convolutional layers form a convolutional block with an equivalent 5 × 5 convolution window. This configuration provides more powerful representation capacity with fewer weights than the direct 5 × 5 implementation. Meanwhile, binary batch normalization is applied after the convolution to accelerate and stabilize the training. We apply the binary activation to each BinConv layer and FC layer, with the exception of the last FC layer. The output of the last FC layer is fed into the softmax layer [30] without binarization to generate a probability distribution over ten classes.
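The claims about the equivalent 5 × 5 window and the reduced weight count follow from simple arithmetic, sketched below. The helper names are hypothetical, and the weight comparison is made per input/output channel pair, assuming the same channel counts in both configurations.

```python
def conv_output_size(size, kernel, stride, padding):
    """Spatial output size of a convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def stacked_receptive_field(kernel_sizes, strides):
    """Receptive field of a stack of convolutions along one dimension."""
    field, jump = 1, 1
    for kernel, stride in zip(kernel_sizes, strides):
        field += (kernel - 1) * jump
        jump *= stride
    return field


# 3x3 filters with stride 1 and zero padding 1 keep a 32x32 map at 32x32.
out_size = conv_output_size(32, kernel=3, stride=1, padding=1)

# Two cascaded 3x3 layers see a 5x5 window of the input.
window = stacked_receptive_field([3, 3], [1, 1])

# Weights per input/output channel pair (equal channel counts assumed).
weights_stacked = 2 * 3 * 3
weights_direct = 5 * 5
```

With these numbers the cascaded block uses 18 weights per channel pair instead of 25 for a direct 5 × 5 filter.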
B. ACCURACY COMPARISON
We first show the accuracy comparison between the analog-ReRAM and the digital-ReRAM under device variation. We then show the accuracy comparison between the conventional CNN with direct-truncated precision and the proposed BCNN. The CIFAR-10 [14] and MNIST [16] benchmarks are used here.
1) ERROR UNDER DEVICE VARIATION
In the previous discussion, the digital ReRAM-crossbar was argued to have better programming accuracy than the analog ReRAM-crossbar. Fig. 5(a) shows a 200-run Monte Carlo simulation of the programming process of a single ReRAM device with different write voltages $V_w$, where the voltage amplitude follows a Gaussian distribution ($3\sigma = 3\% V_w$), and each column denotes a region of 5 kΩ. It is clear that the digital ReRAM-crossbar (only 500 kΩ and 5 MΩ) achieves better uniformity than the analog ReRAM-crossbar. The accuracy comparison against device variation on ReRAM is shown in Fig. 5(b). Monte Carlo simulation is applied to generate the device variation. In CIFAR-10, one can observe that when the device variation (ReRAM resistance value) is more than 8%, a large output current error is reported. For example, when the device variation reaches 29%, the digital ReRAM retains an accuracy of 87.4%, only 4% lower than with no variation, which is better than the analog one at 84.8%, a 7.6% decrease. In MNIST, the digital ReRAM is always better, even when the device variation is larger than 27%.
2) ERROR UNDER APPROXIMATION
For a precision-bit direct-truncated CNN, we use the conventional approach of training the full-precision (32 b) CNN first and then decreasing the precision of all the weights in the network. The numerical experiment results for weights with different bit-widths are shown in Fig. 5(c). Here, the weights in the BCNN are 1 b, so its accuracy does not change. In CIFAR-10, although the accuracy at full precision (32 b) can reach 92.4% in the direct-truncated CNN, the bit-width strongly influences the accuracy, especially when the bit-width is smaller than 10. For example, when the precision decreases to 6 b, the accuracy drops to only about 11.8%. In MNIST, the accuracy of the direct-truncated CNN drops significantly when the bit-width is smaller than 6 b. The results show that the proposed BCNN can perform much better than the direct-truncated CNN.
C. SCALABILITY STUDY
To achieve better energy efficiency for the BNN, we perform a scalability study to find BNN parameters that give both good testing accuracy and good energy efficiency. We use a four-layer BNN on the MNIST benchmark as a baseline (100% energy efficiency, as shown in Table 3), and change the number of output maps of layer 2 (CONV-2, 50) and hidden nodes of layer 3 (FC-1, 500), as shown in Fig. 6(a) and (b). For each energy-efficiency configuration, we run 20 epochs of training to make a fair comparison. When the number of hidden nodes or output maps decreases, the energy efficiency improves, but the testing error rate increases. To summarize the scalability study, Fig. 6(c) shows that the testing accuracy is more sensitive to the number of output maps of layer 2. As a result, increasing the output maps of layer 2 is better for higher accuracy, while decreasing the hidden nodes of layer 3 is better for energy efficiency.
D. PERFORMANCE COMPARISON
In this section, 1000 images with 32 × 32 resolution in CIFAR-10 are selected to evaluate the performance of all implementations. Parameters including the binary weights and the LUT for batch normalization have been configured by the training process. The detailed comparison is shown in Table 4, with numerical results including area, system throughput, computation time, and energy consumption. In the numerical experiment, every batch comprises ten images, and the results are calculated in parallel. The detailed result figure is included in the supplementary file. The black square represents the class to which each image belongs.
1) POWER
The overall power comparison among all the implementations is shown in Table 4 under a similar accuracy of 91.4% in CIFAR-10. Compared to the CPU-based and GPU-based BCNN, the proposed digital-ReRAM-based implementation can achieve up to four orders of magnitude lower power. Moreover, compared to the FPGA-based and CMOS-ASIC-based conventional CNN, the digital-ReRAM-based implementation consumes 4155 times and 62 times less power, respectively. In addition, we analyze the detailed power characteristics of the proposed digital-ReRAM design in Fig. 7. First, we analyze the power distribution over convolution, batch normalization, pooling, and activation. Fig. 7(a) shows that 89.05% of the power is consumed by convolution, 9.17% by batch normalization, and only 1.78% by the rest. Second, we analyze the power distribution over the different BCNN layers in Table 2. The results in Fig. 7(b) show that CONV-6 consumes the most power, 25.92%, while CONV-4 and CONV-2 each also consume more than 23% of the total power. In general, over 98% of the power is consumed by the CONV-2 to CONV-6 layers.
2) THROUGHPUT AND EFFICIENCY
For the throughput performance, we use giga operations per second (GOPS) to evaluate all the implementations. The proposed digital-ReRAM achieves 792 GOPS, which is 535 times and 1.61 times better than the CPU-based and GPU-based implementations, respectively. It is also 12.78 times and 18.86 times better than the FPGA-based and CMOS-ASIC-based implementations, respectively. For energy efficiency, the digital-ReRAM achieves 176 TOPS/W, three orders of magnitude better than the CMOS-ASIC-based CNN. In the area-efficiency comparison, the digital-ReRAM is 296 times better than the CMOS-ASIC. The digital-ReRAM is the best among all the implementations in both throughput and efficiency.
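The reported energy efficiency is consistent with the throughput and power figures, as a short unit conversion shows. The variable names are illustrative; the 792 GOPS figure is from this section and the 4.5 mW power consumption is the value stated in the abstract and conclusion.

```python
throughput_gops = 792.0  # peak throughput, giga operations per second
power_mw = 4.5           # power consumption, milliwatts

ops_per_second = throughput_gops * 1e9
power_watts = power_mw * 1e-3

# Energy efficiency in tera operations per second per watt.
tops_per_watt = ops_per_second / power_watts / 1e12
```

The result is 176 TOPS/W up to floating-point rounding, matching the figure quoted above.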
The design exploration of CIFAR-10 is shown in Fig. 8 . We change the number of CONV-4 output maps to find out the optimized parameter. The normalized energy consumption is referred to the configuration in Table 4 . The result shows that the accuracy will not increase when the number is larger than 256. When it is lower than 256, the accuracy drops down even though it has better energy consumption and throughput. As a result, the parameters in Table 2 are optimal with specifications in Table 4 .
VI. CONCLUSION
In this paper, we propose a digital ReRAM-crossbar-based in-memory accelerator for the BCNN. The BCNN is obtained by training with binary constraints so that operations including convolution, batch normalization, pooling, and activation can be implemented with bitwise parallelism. The BCNN can then be effectively mapped onto the digital ReRAM-crossbar with significant speed-up and energy-efficiency improvement. Numerical results using the CIFAR-10 benchmark show that the developed BCNN accelerator on the digital ReRAM-crossbar achieves a peak throughput of 792 GOPS at a power consumption of 4.5 mW, which is 1.61 times faster and 296 times more energy-efficient than the existing state of the art.
