Deep learning has achieved immense success in pushing today's artificial intelligence forward. To address the challenge of limited labeled data in supervised learning, unsupervised learning was proposed years ago, but its low accuracy hinders realistic applications. The generative adversarial network (GAN) has emerged as an unsupervised learning approach with promising accuracy and is under extensive study. However, the execution of GAN is extremely memory and computation intensive, which results in ultra-low speed and high power consumption. In this work, we proposed a holistic solution for fast and energy-efficient GAN computation through a memristor-based neuromorphic system. First, we exploited a hardware and software co-design approach to map the computation blocks in GAN efficiently. We also proposed an efficient data flow for optimal parallelism in training and testing, depending on the computation correlations between different computing blocks. To compute the unique and complex loss of GAN, we developed a Diff block with optimized accuracy and performance. The experimental results on big data show that our design achieves 2.8× speedup and 6.1× energy saving compared with the traditional GPU accelerator, as well as 5.5× speedup and 1.4× energy saving compared with the previous FPGA-based accelerator.
INTRODUCTION
Deep learning has achieved great success in various artificial intelligence applications such as object detection [1, 2], tracking [3], and natural language processing [4]. Normally, supervised learning is employed in state-of-the-art applications: a deep neural network is trained on labeled training data with a learning method such as backpropagation [5, 6], and the desired outputs are then obtained through inference. Supervised learning has proved powerful; however, challenges have emerged with the fast growth of application complexity: labeled data is needed on a large scale, and the learning capability is constrained.
Unsupervised learning, which can learn directly from unlabeled data, appears as a possible solution. Nonetheless, the accuracy of conventional unsupervised learning is usually low, which hinders its feasibility in realistic applications [7]. Recently, the generative adversarial network (GAN) was proposed as a promising solution to the above challenges [8]. GAN estimates generative models through an adversarial process by training two models, a generative model and a discriminative model, simultaneously [8]. Feature representations can be learned from unlabeled data, and improved accuracy of unsupervised learning is achieved with GAN [9]. However, the high demand for computing resources that exists in deep neural networks becomes more severe in GAN computation. Both training and testing are particularly slow and energy hungry [8, 9].
Extensive research efforts have been devoted to accelerating deep learning and improving its computing efficiency, such as GPUs [6], FPGAs [10], and emerging non-von Neumann accelerators [11, 12]. Very recently, GPU-based [9] and FPGA-based [13] GAN computations have been proposed with significantly improved speed and energy efficiency compared with CPUs. However, the computing efficiency is still constrained, because the data bandwidth in these architectures remains limited and the performance improvement mainly relies on brute-force resource accumulation. Recent studies on novel non-von Neumann accelerators such as in-memory accelerators [14, 15] and neuromorphic computing systems [16, 12, 17], especially designs with novel nano-devices like the memristor, have paved a way towards computationally efficient GAN. In [11, 12], convolutional neural networks (CNNs) were deployed on memristor crossbars, and an on-line training process with backpropagation was implemented via a hardware and software co-design methodology. However, these previous approaches cannot be used directly in GAN computation for the following reasons. First, unlike traditional CNN training, two learning models are executed simultaneously in GAN's training phase, which requires an adaptive data flow for optimal computing efficiency. Moreover, backpropagation was simplified in a hardware-friendly way in which the initial error in training is taken as the difference between the true label and the predicted label, which holds only under the assumption that the cost function of the CNN is cross entropy. This assumption is invalid in GAN; therefore, the memristor-based hardware implementations in previous works cannot be utilized.
In this work, we developed a memristor-based unsupervised neuromorphic system for a fast and energy-efficient GAN computation to solve the above challenges. Our contributions can be summarized as follows:
• We exploited a hardware and software co-design approach to map the computation blocks in GAN to the memristor-based crossbars efficiently. The computing system is composed of three major blocks: the Generator, Discriminator, and Diff blocks. The Diff block is designed to compute the cost function of GAN accurately with low hardware cost;
• We proposed an adaptive data flow for GAN with optimal computation parallelism and efficiency. In the forward phase, the Generator and Discriminator blocks work in parallel to generate artificial data and extract features; in the backward phase, the Generator and Discriminator blocks are trained effectively with the initial errors computed by the Diff block;
• We evaluated the system accuracy at different data precisions and the system performance in speed and energy. The performance of the proposed system on ImageNet and Lsun/bedroom was also compared with the GPU-based and FPGA-based GAN computations in previous works.
The experimental results show that our proposed design can achieve 2.8× (2.7×) and 5.5× (4.8×) speedup, as well as 6.1× (6.1×) and 1.4× (1.5×) energy-saving compared with GPU-based [9] and FPGA-based [13] GAN accelerators respectively on Lsun (ImageNet) dataset.
BACKGROUND

Generative Adversarial Networks
The generative adversarial network was developed as an unsupervised model that can learn effective feature representations from unlabeled data with improved accuracy compared with traditional unsupervised learning [8, 9]. Two learning models form the GAN: a generator and a discriminator. Normally, the generator is a deconvolutional neural network (Deconv-NN) for artificial data generation, and the discriminator is a convolutional neural network (CNN) for distinguishing the artificial data from real data. The training phase of GAN involves two major learning procedures: generator learning and discriminator learning, both based on backpropagation. The target of the training process is to obtain a generator whose outputs closely resemble the true training data and a discriminator that can extract features effectively. The training process can be summarized as four major procedures, which are named D_forward, D_back, G_forward, and G_back in this work.
• D_forward computes the cost function to obtain the error that should be transmitted to the discriminator for its backward weight updating. More specifically, with a batch of m noise samples z_1, ..., z_m given as inputs, the generator generates m artificial samples; this process is defined as G(z_i). The CNN-based discriminator processes the m artificial samples (i.e. G(z_i)) and m real samples (i.e. x_1, ..., x_m) through forward computations, and then the cost function is computed following Equation 1.
• D_back updates the weights of the discriminator by ascending the stochastic gradient obtained from Equation 1, i.e. ErrorD.
• G_forward computes the cost function to obtain the error that should be given to the generator for its weight updating. The cost function is computed as Equation 2.
• G_back updates the weights of the generator by ascending its stochastic gradient obtained from Equation 2, i.e. ErrorG.
Here, G(·) and D(·) represent the generator and discriminator respectively, x is the real data, and z is the noise given to the generator.
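Equations 1 and 2 are not reproduced in this text. Assuming they follow the standard minibatch formulation of [8], which matches the log D(x) and log(1 − D(G(z))) terms handled by the Diff block, they can be written as below. The parameter names θ_d and θ_g (discriminator and generator weights) are assumptions introduced here for illustration.

```latex
\begin{align}
\mathrm{ErrorD} &= \nabla_{\theta_d}\,\frac{1}{m}\sum_{i=1}^{m}
  \Big[\log D(x_i) + \log\big(1 - D(G(z_i))\big)\Big] \tag{1}\\
\mathrm{ErrorG} &= \nabla_{\theta_g}\,\frac{1}{m}\sum_{i=1}^{m}
  \log\big(1 - D(G(z_i))\big) \tag{2}
\end{align}
```

In the formulation of [8], the discriminator ascends the gradient in (1), while the generator descends the gradient in (2).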
Memristor Crossbar for (De)Convolutional Computation
The limited data bandwidth, as well as the performance gap between the processing units and the memory of conventional computing platforms, has become a major obstacle in deep learning based applications. Novel computing platforms, such as in-memory accelerators [14] and neuromorphic computing [18], have therefore been extensively investigated as a promising solution. The emergence of novel nano-devices such as spin, phase change, and memristor devices has also accelerated this development, and corresponding accelerators have been developed accordingly. Among them, the memristor-based computing platform attracts attention owing to its high density, high speed, and multiple-level states [19, 11, 12]. In this work, a memristor-based computation platform for GAN is developed. The generator, which is based on a deconvolutional network, and the discriminator, which is based on a convolutional network, are deployed on the memristor crossbar structure.
In previous work, convolutional networks have been deployed on the memristor-based crossbar structure, as depicted in Figure 1 (a) [12]. For example, to deploy a convolutional layer with 32 kernels, each kernel is reshaped into a vector that can be programmed to a memristor crossbar. The input data is then applied to the memristor crossbar to execute the dot-product computations. Because of the size limitation of the memristor crossbar, multiple crossbars are connected in parallel to form a large-scale convolutional layer [12]. The ReLU activation function of the CNN can be implemented by integrate-and-fire circuits (IFCs) and other digital logic [16, 12]. The deployment of a deconvolutional layer is similar to that of a convolutional layer. One major difference is that the data should be zero-padded before being applied to the crossbar, as shown in Figure 1 (b). To reduce the executions spent on the zeros, we transform and group the input vectors that have zeros in the same locations, so that only the non-zero inputs in the rows are applied to the crossbar and computed.
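The mapping described above can be illustrated with a small numerical sketch. It is not the circuit-level implementation: it assumes a single input channel, stride 1, no padding, and ideal crossbar behavior, and the helper names (`conv_as_crossbar`, `sparse_dot`, `crossbars_needed`) are hypothetical. Each kernel is flattened into one crossbar column, each input patch plays the role of the row voltages, and the column currents are the dot products.

```python
import math
import numpy as np

CROSSBAR_SIZE = 32  # rows and columns per crossbar tile (assumed square)

def conv_as_crossbar(image, kernels):
    """Stride-1, unpadded convolution (cross-correlation) as matrix-vector products.

    image:   (H, W) array
    kernels: (K, kh, kw) array; each kernel becomes one crossbar column
    """
    K, kh, kw = kernels.shape
    H, W = image.shape
    weights = kernels.reshape(K, kh * kw).T          # (kh*kw, K), one column per kernel
    out = np.zeros((K, H - kh + 1, W - kw + 1))
    for r in range(H - kh + 1):
        for c in range(W - kw + 1):
            patch = image[r:r + kh, c:c + kw].reshape(-1)   # row inputs
            out[:, r, c] = patch @ weights                  # column outputs
    return out

def sparse_dot(patch, weights):
    """Drive only the rows whose input is non-zero; the result is unchanged."""
    nz = np.flatnonzero(patch)
    return patch[nz] @ weights[nz, :]

def crossbars_needed(rows, cols, size=CROSSBAR_SIZE):
    """Number of size x size tiles needed to hold a rows x cols weight matrix."""
    return math.ceil(rows / size) * math.ceil(cols / size)
```

Under these assumptions, a layer of 32 kernels of size 5 × 5 forms a 25 × 32 weight matrix and fits in one 32 × 32 tile, whereas larger layers must be split across several tiles. `sparse_dot` shows why skipping zero-valued rows of a zero-padded deconvolution input does not change the result.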
DESIGN METHODOLOGY
In this section, we build a memristor-based neuromorphic design for accelerating both the training and testing phases of GAN. The basic computing architecture and data flow are described in subsection 3.1. To achieve optimal system performance, we propose a cross-parallel pipeline execution flow, which is introduced in subsection 3.2.
Memristor-based GAN Architecture
Based on the GAN computation described in subsection 2.1, we developed the memristor-based computing architecture for GAN. As shown in Figure 2, the proposed architecture is composed of four integrated components: the Generator block, Discriminator block, Diff block, and Control unit. The functionality of these components is summarized below.
(1) The Discriminator block is designed to compute D(x_i) and D(G(z_i)) for the stochastic gradient calculation. It is composed of a sea of connected memristor-based CNN units;
(2) The Generator block is built to calculate G(z_i), generating the artificial samples from the noise inputs for computing the stochastic gradient in generator weight updating. Similarly, it is composed of a sea of connected memristor-based DeCNN units;
(3) Following Equation 1 and Equation 2, the Diff block computes the gradients of the Discriminator and Generator blocks respectively. The Diff block is constructed from memristor-based circuit blocks including LUTs (lookup tables), memory, adders, etc.;
(4) The Control unit is designed to control the data flow and is built with combinational and sequential digital logic.
In Figure 2, the basic data flow is shown as steps a ∼ f. Each step is explained below, where the definitions of D(·), G(·), x, z, and m are the same as those in subsection 2.1.
• a: The real data with m samples in a batch, i.e. x_i, is given to the Discriminator block to compute D(x_i), where i ∈ {1, ..., m}.
• b: The noise data with m samples in a batch, i.e. z_i, is given to the Generator block to obtain the artificial samples G(z_i), where i ∈ {1, ..., m}.
• c: G(z_i) is generated by the Generator block and then transmitted to the Discriminator block to compute D(G(z_i)), where i ∈ {1, ..., m}.
• d: This step transmits D(x_i) and D(G(z_i)) from the Discriminator block to the Diff block.
The above steps a ∼ d fulfill the forward computations of GAN.
• e: Based on the gradient calculated by the Diff block, i.e. ErrorD in Equation 1, this step updates the weights of the Discriminator.
• f: The weights of the generator are updated in this step according to the gradient obtained from the Diff block.
These two steps e and f implement the backward computations in GAN.
Correspondingly, D_forward is composed of a, c and b, c, d; D_back is fulfilled by e; G_forward is implemented by b, c, d; and G_back is the data flow of f.
The Cross-Parallel Pipeline
In this section, we propose a cross-parallel pipeline, built on the basic data flow, for optimized computing efficiency. We first analyze the time cost of each of the steps a ∼ f described in Section 3.1. The analysis is based on simulations with the NVSim simulator [20], and the results are shown in Figure 3. In the analysis, steps e and f are each divided into two categories: data transmission and data computation. For example, e1 refers to the transmission time of the gradients from the Diff block to the Discriminator block, while e2 represents the time consumed by updating the weights of the discriminator. Similarly, f1 is the gradient transmission and f2 is the cost of updating the weights of the generator. It is observed that the majority of the time cost occurs in updating the generator and discriminator.
Figure 4 shows the pipeline of each computing block, i.e. the Discriminator block, Generator block, and Diff block, under the basic data flow; different colors represent the execution states of each block. We then analyze the independence of the executions in these blocks to build the cross-parallel pipeline, in which independent executions are computed in parallel. For example, as observed from Figure 4, the Discriminator block and the Generator block are independent of each other when the Discriminator block is executing a or e2, or when the Generator block is executing b or f2. Hence, these computations can be arranged into a parallel data flow. There are only two conditions under which the Generator and Discriminator blocks cannot work in parallel: first, during the execution of b and c; second, while the weight updating has not finished. Otherwise, the generator and discriminator can work highly in parallel, and the optimized cross-parallel pipeline is depicted in Figure 5.
In the developed cross-parallel pipeline, D_forward is divided into two parts, denoted D^{t,1}_forward and D^{t,2}_forward for the t-th iteration. It is observed from Figure 3 that b is more time-consuming than a; hence the D(x_i) computed in a is transmitted to and stored in the Diff block first, which is represented by d1 in Figure 5. A memory unit based on a memristor crossbar is designed in the Diff block to store the computed D(x_i). The discriminator is then idle, meaning that no execution occurs, until the artificial sample generation (i.e. b) is finished. D(G(z)) is transmitted to the Diff block after c finishes, which is represented by d2 in Figure 5. Immediately afterwards, d3, which includes the gradient computation and its transmission to the Discriminator block, is executed. Consequently, the weight updating in the Discriminator block (e2) and in the Generator block (f2) run simultaneously. In addition, although e2 finishes faster than f2, the next training iteration of the discriminator starts asynchronously without introducing additional memory usage in the Discriminator block, as indicated by D^{t+1,1}_forward. The training processes of the generator and discriminator become synchronous again before c starts in the (t+1)-th iteration.
As discussed above, the parallelism of the computation can be largely improved. To quantify this improvement, the time cost of the GAN computation following the basic pipeline and the proposed cross-parallel pipeline is evaluated on the CIFAR-10 dataset [21]. The simulation is again executed on the NVSim simulator [20]. The results are shown in Table 1: a 1.6× speedup is achieved by utilizing the cross-parallel pipeline. In addition, the usage rates of the Discriminator block and Generator block are improved by 3.8× and 2.2× respectively.
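The benefit of overlapping independent steps can be illustrated with a simplified timing model. The sketch below is an assumption-laden approximation rather than the schedule of Figure 5: it treats d as a single step, ignores the asynchronous start of the next iteration, and uses arbitrary step durations that are not taken from Figure 3. The names `STEPS`, `basic_time`, and `cross_parallel_time` are hypothetical.

```python
# Arbitrary step durations in arbitrary time units (NOT measured values).
STEPS = {
    "a": 1.0,   # discriminator forward on real data
    "b": 2.0,   # generator forward on noise
    "c": 1.0,   # discriminator forward on artificial data
    "d": 1.0,   # transfer to the Diff block and gradient computation
    "e1": 1.0,  # gradient transfer to the discriminator
    "e2": 5.0,  # discriminator weight update
    "f1": 1.0,  # gradient transfer to the generator
    "f2": 6.0,  # generator weight update
}

def basic_time(t):
    """Basic data flow: every step runs strictly one after another."""
    return sum(t.values())

def cross_parallel_time(t):
    """Simplified overlap model: a runs alongside b, and e runs alongside f."""
    forward = max(t["a"], t["b"]) + t["c"] + t["d"]
    backward = max(t["e1"] + t["e2"], t["f1"] + t["f2"])
    return forward + backward

speedup = basic_time(STEPS) / cross_parallel_time(STEPS)
```

Because the weight updates dominate the iteration in this model, overlapping e2 with f2 accounts for most of the gain, which is consistent with the observation that updating the generator and discriminator takes the majority of the time.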
DESIGN DETAIL
In this section, the implementation of the Discriminator block, Generator block, and Diff block is explained in detail. All of these blocks are implemented with memristor crossbars.
Parallel and Memory-Free Structure for Discriminator and Generator Blocks
In previous research [12], memristor-based computations for CNN training and testing have been implemented. However, these designs cannot be used in GAN training for two major reasons. First, the gradient for the output layer of the CNN was previously computed as the output minus the true label, which does not apply to GAN. Moreover, the previous framework involves a large number of memory units; such a design results in heavy area cost and long execution time in GAN computations.
To solve the above challenges, we proposed a parallel and memory-free structure, as shown in Figure 6. The squares in different colors represent the processing units for error computation, weight updating, and (de)convolutional operations respectively. These units are built from memristor-based crossbars, IFCs, and digital logic [16, 12], and a simplified mapping scheme is shown in Figure 7. The proposed structure is composed of a parallel forward flow and a memory-free backward flow. It can work as either the Discriminator or the Generator block, depending on the initial weights programmed on the (de)convolutional operation units.
The Parallel Forward Flow
The forward processing of GAN includes the CNN and DeCNN computations. For a GAN structure with a 5-layer CNN and a 5-layer DeCNN, the parallel flow is depicted in the upper part of Figure 6. The initial or updated weights, i.e. w_l in Figure 7 (a), are programmed to the memristor crossbar of the (de)convolutional layer, and the results of the forward flow are obtained as o_l, which works as the input to the next layer. The mapping and programming method follows Section 2.2 and previous research [12, 16]. Multiple samples can be processed on the same type of processing units with different inputs, and thus parallel computation can be achieved. The final output of the parallel forward flow is transmitted to the Diff block.
The Memory-Free Backward Flow
The memory-free backward flow aims to update the weights of the forward flow based on the gradient from the Diff block. The backpropagation method used to update the weights of the (De)CNN can be summarized as Equation 3, where l represents the index of the (de)convolutional layer, e_l the gradient, w_l the weights, w*_l the updated weights, w_l^T the transpose of the weight matrix, o'_l the derivative of the l-th layer output signals, and α the learning rate. Because the derivative of the ReLU activation function is determined by the output signal itself, o'_l is taken to be equal to o_l.
The memory-free backward flow is proposed to implement the backpropagation method and is depicted in the lower part of Figure 6. The transposed weights, i.e. w_l^T in Figure 7 (c), are read from the operation units in the forward flow and programmed to the memristor crossbar of the error computation units. The outputs of each layer in the parallel forward flow, i.e. o_l in Figure 7 (b), are programmed to the memristor crossbar of the weight updating units. The input to an error computation unit is the error computed by the error computation unit of the next layer, i.e. e_{l+1} in Figure 7 (c), from which the current layer's error, i.e. e_l in Figure 7 (c), is computed. These errors are also input to the weight updating units to compute the updated weights, i.e. w*_l in Figure 7 (b).
Previous designs [12, 16] use memory to store the updated weights or the inter-layer signals of the CNN. The proposed design does not need dedicated memory because the weights and inter-layer signals are stored in the same memristor crossbars that serve as computation units. The inter-layer signals are programmed to the weight updating units and used directly to compute the updated weights, as shown by the green arrows in Figure 6. The updated weights are programmed to the (De)CNN computation units, as shown by the blue arrows in Figure 6.
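The interplay between error computation units and weight updating units can be sketched numerically. Equation 3 is not reproduced in this text, so the sketch below assumes the textbook backpropagation rule for fully connected ReLU layers with a gradient-descent update; the layer indexing, the sign of the update, and the function names are assumptions. It uses the standard ReLU derivative (1 where the output is positive, 0 elsewhere).

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def forward(weights, x):
    """Return the layer outputs [o_0, o_1, ..., o_L] with o_0 = x."""
    outputs = [x]
    for w in weights:
        outputs.append(relu(w @ outputs[-1]))
    return outputs

def backward(weights, outputs, top_error, lr):
    """Propagate top_error backwards and return the updated weights.

    top_error plays the role of the gradient supplied by the Diff block.
    """
    new_weights = [None] * len(weights)
    e = top_error * (outputs[-1] > 0)                 # error at the last layer
    for l in reversed(range(len(weights))):
        # weight updating unit: incoming error times the stored layer input
        new_weights[l] = weights[l] - lr * np.outer(e, outputs[l])
        if l > 0:
            # error computation unit: transposed (pre-update) weights times the error
            e = (weights[l].T @ e) * (outputs[l] > 0)
    return new_weights
```

Note that each layer needs only its own stored input, its transposed weights, and the error arriving from the layer above, which is the locality that allows the signals to stay in the crossbars instead of a separate memory.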
Timing Sequence of the Parallel and Memory-Free Structure
The timing sequence of one iteration of the parallel and memory-free structure is detailed in Figure 8. First, the (De)CNN operation units process the input data, and the inter-layer signal is programmed to the weight updating units immediately. After the (De)CNN operation units have processed the input data, the weights from these units are programmed to the error computation units, which then begin computing the backpropagation errors. The weight updating units then begin computing the new weights, which are programmed to the (De)CNN operation units directly.
Figure 8: The timing sequence of the parallel and memory-free structure.
The Implementation of Diff-Block
The implementation of the Diff block is detailed in Figure 9. The Diff block is composed of two memristor-based lookup tables (LUTs) [22], a memristor-based memory unit [20, 23], and two memristor-based adders [16]. LUT1 stores the values of ∇_θ log D(x) and LUT2 stores the values of ∇_θ log(1 − D(G(z))). The linear transformation, i.e. scaling by 1/m, can be done in the adder [16].
When the Discriminator block transmits D(x) to the Diff block, the values of ∇_θ log D(x) are read from LUT1 and stored in the memristor-based memory. The memory should be able to store m values, where m represents the batch size; generally, the batch size is 64. When the Discriminator block transmits D(G(z)) to the Diff block, the values of ∇_θ log(1 − D(G(z))) are read from LUT2. Those values are input to the adder, and ErrorG is computed. Meanwhile, the data in the memory are input to the adder together with the above result, and ErrorD is computed.
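The LUT, memory, and adder sequence can be mimicked in software. The exact contents of the LUTs are not specified in this text, so the sketch below assumes that each table holds the derivative of its log term with respect to the discriminator output, indexed by an 8-bit quantized output value; the table layout and the names `quantize` and `diff_block` are hypothetical.

```python
import numpy as np

BITS = 8
LEVELS = 2 ** BITS
# Representative discriminator output for each LUT address (bin centers
# avoid division by zero at D = 0 and D = 1).
CENTERS = (np.arange(LEVELS) + 0.5) / LEVELS

LUT1 = 1.0 / CENTERS             # assumed entry: d/dD log(D)
LUT2 = -1.0 / (1.0 - CENTERS)    # assumed entry: d/dD log(1 - D)

def quantize(d):
    """Map discriminator outputs in [0, 1] to LUT addresses."""
    return np.clip((np.asarray(d) * LEVELS).astype(int), 0, LEVELS - 1)

def diff_block(d_real, d_fake):
    """Return (ErrorD, ErrorG) for one batch of discriminator outputs."""
    m = len(d_real)
    memory = LUT1[quantize(d_real)]        # stored when D(x) arrives
    fake_terms = LUT2[quantize(d_fake)]    # read when D(G(z)) arrives
    error_g = fake_terms.sum() / m         # adder with 1/m scaling
    error_d = (memory.sum() + fake_terms.sum()) / m
    return error_d, error_g
```

In this model, ErrorD reuses the sum already formed for ErrorG and only adds the stored real-sample terms, which mirrors the use of a memory for the m real-sample values and two adders.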
EVALUATION
In this section, we evaluate the performance of the proposed memristor-based GAN accelerator in terms of accuracy, speed, energy, and area cost. The performance is compared with the previous GPU-based platform [9] and the FPGA-based microarchitectural design [13]. The NVIDIA GeForce GTX 1080 is used as the GPU platform. The proposed design is evaluated in the NVSim [20] simulator environment. The memristor crossbar size is set to 32 × 32, the resistance range of the memristor device is set to [50 kΩ, 1 MΩ], and the required number of crossbars is calculated following the implementation in Section 2.2 and [11]. The circuit designs for the neurons and control units follow [12].
Accuracy in Different Data Precision
In general, the memristor supports limited precision in data transmission, data storage, and computation [24, 16]. Although analog states of the memristor have been reported by the HP Research Lab [25], high precision comes at the sacrifice of speed and design cost. In this section, the data bitwidth for optimized accuracy and design cost is explored, and the GAN computing accuracy at different data bitwidths is shown in Figure 10. As described in [9], the performance of GAN can be measured by using the discriminator as a feature extractor for a classifier. In this experiment, an SVM works as the classifier and the discriminator of GAN works as the feature extractor for image classification. MNIST and CIFAR-10, whose characteristics are listed in Table 2, are used as the benchmarks. The accuracy based on the feature extractor trained on the GPU is regarded as the baseline, and the accuracy based on the fixed-point feature extractors is compared with this baseline. Figure 10 shows the normalized accuracy when the discriminators are trained with data in different formats and bitwidths. The results show that the system with 8-bit data precision has a slight accuracy loss, while the normalized accuracy is still higher than 90% on MNIST and CIFAR-10. However, significant accuracy loss is introduced with 4-bit data precision. Therefore, 8-bit data precision, i.e. a memristor device with 8-bit states, is utilized in this work and in the following evaluation.
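The effect of bitwidth on representation error can be illustrated with a uniform fixed-point quantizer. The quantization scheme used in the experiments is not specified in this text, so the uniform levels, the value range [-1, 1], and the function name `to_fixed_point` below are assumptions made only for illustration.

```python
import numpy as np

def to_fixed_point(x, bits, lo=-1.0, hi=1.0):
    """Round x to one of 2**bits uniformly spaced levels in [lo, hi]."""
    steps = 2 ** bits - 1
    x = np.clip(x, lo, hi)
    codes = np.round((x - lo) / (hi - lo) * steps)
    return lo + codes * (hi - lo) / steps

rng = np.random.default_rng(0)
w = rng.uniform(-1.0, 1.0, size=10_000)   # stand-in for weights or signals

err8 = np.abs(to_fixed_point(w, 8) - w).max()   # bounded by 1/255
err4 = np.abs(to_fixed_point(w, 4) - w).max()   # bounded by 1/15
```

Under this scheme, the worst-case rounding error at 4 bits is 17 times larger than at 8 bits (1/15 versus 1/255), which gives an intuition for why the accuracy loss grows sharply when moving from 8-bit to 4-bit precision.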
Computing Parallelism
In the training process of GAN, the generator processes a batch of noise samples and the discriminator processes a batch of real images. The computing parallelism refers to the number of generator (or discriminator) blocks that compute in parallel in each training iteration, i.e. s in Figure 6. As indicated in Section 3.2, the computing efficiency can be improved by higher computing parallelism. Note that higher parallelism results in an increase of the area and design cost of the proposed GAN computing system. In this section, the speed and area cost at different parallelism sizes are explored. The bedroom dataset from Lsun [26] is used in this evaluation. The maximum parallelism size is selected to be 64. Figure 11 shows the time and area cost at different parallelism sizes. The computation time decreases as the parallelism size increases, while the area cost increases heavily at high parallelism. The main reason is that parallel computing accelerates the image generation in the Generator block, as well as the processing of the real and artificial images in the Discriminator block. However, the speed of other training procedures, such as error propagation and weight updating in the backward computations, does not benefit from higher parallelism. Hence, the speed improvement becomes extremely slow in large-parallelism designs while the area cost increases fast. Based on the results, the computing parallelism size is set to 32 in the developed memristor-based GAN accelerator.
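The diminishing return from parallelism can be reproduced with a simple model in which only the forward passes scale with the parallelism size. The sketch below is illustrative only: the per-sample forward time, the fixed backward time, and the function name `iteration_time` are hypothetical and are not taken from Figure 11.

```python
import math

BATCH = 64  # batch size

def iteration_time(s, t_forward=1.0, t_backward=20.0):
    """Time of one iteration with s parallel forward units (arbitrary units).

    Forward passes are processed s samples at a time; the backward
    computation is assumed not to benefit from parallelism.
    """
    return math.ceil(BATCH / s) * t_forward + t_backward

times = {s: iteration_time(s) for s in (1, 2, 4, 8, 16, 32, 64)}
```

With these hypothetical numbers, going from 32 to 64 parallel units shortens the iteration by only one time unit while roughly doubling the forward-flow hardware, which illustrates the kind of trade-off behind choosing a parallelism size below the maximum.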
Memristor-based GAN Computing Speed
In this section, the speed of the memristor-based GAN accelerator is evaluated and compared with conventional platforms such as the GPU and FPGA in previous works. Two big datasets, ImageNet and Lsun/bedroom, are chosen for better demonstration, and the dataset details are listed in Table 3. As discussed above, the data bitwidth in the proposed memristor-based design is 8 and the parallelism size is 32.
The batch size is set to 64 in these three scenarios. The speed of the GAN training process is evaluated, since the testing process also works as an intermediate step in GAN training.
The experimental results are listed in Table 4. It is observed that our proposed design achieves 2.8× (2.7×) speedup compared with the GPU and 5.5× (4.8×) speedup compared with the FPGA-based accelerator on Lsun (ImageNet). In addition, our design has higher speedup for the larger dataset, because the parallel pipeline and the memory-free structure largely decrease the time cost, as discussed in Section 4.
We also analyze the time cost of each computing procedure of GAN training described in Section 2.1, and the results are shown in Figure 12. As shown in Figure 12 (b), the major time cost in the GPU-based GAN computation occurs in the D_forward procedure. Our design, however, reduces this share from 83.68% to 11.16%, as demonstrated in Figure 12 (a). Owing to the developed cross-parallel pipeline, our proposed design achieves higher resource usage than the GPU, and hence the GAN computing speed is improved.
Energy and Area Cost Analysis
In this section, we evaluate and compare the energy and area cost of the memristor-based GAN accelerator. The energy cost of each procedure in GAN training is also analyzed in detail. Table 5 shows the energy cost comparison of the memristor-, GPU-, and FPGA-based accelerators. Our proposed design achieves 6.1× and 1.4× energy saving compared with the GPU-based and FPGA-based GAN computing respectively. The energy cost of each training procedure is analyzed in Figure 12. It is observed that D_forward accounts for a lower percentage of the energy cost of the whole training process in the proposed design than in the computations on the GPU. This low energy cost is owing to the proposed memory-free data flow in the memristor-based accelerator, in which the energy cost of data communication between memory and the CNN (or DeCNN) is saved.
The area of the proposed design is 1644 mm^2 when the parallelism size is set to 32. The parallel forward flows in the Discriminator block and Generator block account for 44.8% and 49.7% respectively. The area of the parallel for-
CONCLUSION
The generative adversarial network (GAN) is an effective unsupervised model that is extremely computationally expensive. To address this issue, we proposed a memristor-based accelerator. The proposed design has two major aspects: a cross-parallel pipeline and a memory-free flow. The proposed accelerator was tested on two large datasets, ImageNet and Lsun. With an area of 1644 mm^2, the proposed accelerator achieves 2.8× (2.7×) and 5.5× (4.8×) speedup, as well as 6.1× (6.1×) and 1.4× (1.5×) energy saving, compared with GPU-based and FPGA-based GAN accelerators respectively on the Lsun (ImageNet) dataset.
