has achieved tremendous success in many machine learning applications. Nevertheless, the deep structure has brought significant increases in computation complexity. Largescale deep learning systems mainly operate in high-performance server clusters, thus restricting the application extensions to personal or mobile devices. Previous works on GPU and/or FPGA acceleration for DCNNs show increasing speedup, but ignore other constraints, such as area, power, and energy. Stochastic Computing (SC), as a unique data representation and processing technique, has the potential to enable the design of fully parallel and scalable hardware implementations of large-scale deep learning systems. This paper proposed an automatic design allocation algorithm driven by budget requirement considering overall accuracy performance. This systematic method enables the automatic design of a DCNN where all design parameters are jointly optimized. Experimental results demonstrate that proposed algorithm can achieve a joint optimization of all design parameters given the comprehensive budget of a DCNN.
I. INTRODUCTION
As the important branches of deep learning, Deep Neural Networks (DNNs) and Recurrent Neural Networks (RNNs), outperforms traditional machine learning techniques in several real-world problems [1] [2] [3] . Recently, Deep Convolutional Neural Network (DCNN) , which is one of most widely used types of DNNs, has achieved tremendous success in many machine learning applications, such as speech recognition [4] , image classification [5] , [6] , and video classification [7] . DCNN is now the dominant approach for almost all recognition and detection tasks, and approaches human performance on some tasks [8] . Nevertheless, the introduction of deep structure in deep learning brought significant increases in computation complexity. Large-scale deep learning systems mainly operate in high-performance server clusters, thus restricting the application extensions to personal or mobile devices.
General-Purpose Graphics Processing Units (GPGPUs) are widely used for current deep learning research to accelerate DCNNs [9] . GPGPU's major competitor is Field-Programmable Gate Arrays (FPGAs) . Considering energyefficiency, FPGAs are more suitable for portable and embedded DCNN applications. Previous works on FPGA acceleration for DCNNs [10] [11] [12] show increasing speedup. These implementations, however, focus on improving the throughput of an embedded network, which ignores other constraints to run a network, such as area, power, and energy. A notable trend is that machine learning is running locally on mobile/wearable devices and Internet-of-Things (IoT) entities instead of relying on a remote server. In order to bring the success of DCNNs to these resource constrained systems, designers must overcome the challenges of implementing resource-hungry DCNNs in embedded systems with limited area and power budget.
Stochastic Computing (SC) [13] , as a unique data representation and processing technique, has the potential to enable the design of fully parallel and scalable hardware implementations of large-scale deep learning systems. Many complex arithmetic operations can be implemented with very simple hardware logic in stochastic computing framework [13] [14] [15] [16] [17] , which offers an immense design space for (i) neuron integrations due to the significantly reduced area per neuron, and (ii) performance optimizations with respect to the budget of area, error resiliency, power/energy, or speed. We use the term budget to describe the design constraint when implementing a hardware based DCNN using SC. For example, given the area budget of an embedded design of DCNN, SC enables a comprehensive optimization including other design parameters mentioned above.
A node, referred as to neuron, in a DCNN can be implemented by different stochastic computing based designs. Given these designs, how to arrange them to structure a complete DCNN achieving preferred design parameter(s) such as constrained energy, promised accuracy, and restricted area leaves blanks to researchers. This paper deals with the problem of deriving the optimal structure of hardware deep learning systems given a design budget and proposes an automatic algorithm driven by budget requirement with the comprehensive design parameters of a network taken into consideration. More specifically, the proposed automatic design allocation algorithm greedily decides implementation for each layer of a DCNN, and then optimizes the complete DCNN jointly to achieve the better objective design parameter(s) by reallocating different implementations. This systematic method enables the automatic design of a DCNN where design parameters are jointly optimized given the budget.
The contributions of this paper are summarized as follows, 1) SC paradigm is finely applied to DCNN, i.e., major computing tasks are performed in SC domain. With SC components, the hardware footprints can be significantly reduced for wearable/mobile devices. 2) We explored different implementations for neurons of different layers in a DCNN. Accumulative/Approximate Parallel Counter (APC) [18] , [19] and Multiplexer based neurons with distinct implementation details are investigated. 3) An automatic design allocation algorithm is proposed to optimize the complete DCNN hardware design using SC. This algorithm can also evaluate a general budget-driven hardware optimization for DCNNs.
Experimental results have been demonstrated on the problem of classifying handwritten digits in MNIST database using LeNet-5 [20] with the SC based DCNN designs. It reveals that proposed automatic design allocation algorithm can achieve a joint optimization of all design parameters given the comprehensive budget of a DCNN.
II. RELATED WORKS Experiments in [9] [21] showed a high speedup of DCNN implemented on GPGPU. However, the widespread deployment of DCNNs has been hindered by their high complexities and power consumptions of server-based GPGPU-like devices, particularly in resource-constrained offline wearable devices and embedded systems. Farabet et al. in [10] proposed an FPGA-based processor specific for convolutional networks. The proposed processor was well structured. Nevertheless, the paper lacked evaluation of power and/or energy consumption. Similarly, Cadambi et al. of [12] demonstrated a general accelerator working for five typical tasks including DCNN. The paper, however, explored architectural design space instead of considering optimization of on-chip design parameters of a hardware-based DCNN. In [11] , authors contributed to developing a co-processor to improve the speed of DCNN, but they didn't mention how to optimize the network performance given hardware budget constraints.
Predecessors have been investigating SC as a candidate for implementing hardware-based DCNN. Inspired by [18] , authors in [22] introduced several basic hardware implementations for matrix operations including inner-product calculation which is crucial in DCNN inference process. Based on lots of similar works exploring hardware implementation for stochastic computing, a recent work [19] presented a neuron cell design using SC components, where the progressive precision characteristics of SC was exploited. The aforementioned papers focused on the analysis of performance in SC implementations, but optimization was not accomplished from a higher network-wise view. Even though the synthesis results were listed, but still, there lacked re-design of the implementation for DCNN with constrained hardware resources. The aforementioned researches proved SC is a promising technique for embedded DCNN system. However, there is no existing work designing DCNN using SC with a structural optimization given hardware budget.
III. STOCHASTIC COMPUTING BASED DCNN A. Deep Convolutional Neural Network Architecture
In this paper, we consider a general DCNN architecture, which consists of a stack of convolutional layers, pooling layers, and fully connected layers. By arranging the topology of the above layers, powerful architectures, such as LeNet [23] and AlexNet [24] , can be built for specific applications. Without the loss of generality, we conduct the investigation on the fifth generation of LeNet architecture using SC, which is comprised of two pairs of convolutional and pooling layers, and one fully connected hidden layer with the output layer. Note that the proposed methodology can accommodate other DCNN architectures as well.
A convolutional layer is associated with a set of learnable filters (or kernels), which are activated when specific types of features are found at some spatial positions in the inputs. After obtaining features using convolution, a subsampling step can be applied to aggregate statistics of these features to reduce the dimensions of data and mitigate over-fitting issues. This subsampling operation is realized by a pooling layer in hardware-based DCNNs, where different non-linear functions can be applied, such as max pooling, average pooling, and L2-norm pooling. The activation functions in neurons are non-linear transformation functions, such as Rectified Linear
We adopt the tanh activation function since it can be implemented efficiently as a finite state machine (FSM) in SC using a stochastic approximation method. The fully connected layer is a normal neural network layer with its inputs fully connected with its previous layer. The loss function of DCNN specifies how the network training penalizes the deviation between the predicted and true labels, and typical loss functions are softmax loss, sigmoid cross-entropy loss or Euclidean loss. In general, we can define three kinds of basic neurons (nodes) in hardware-based DCNN based on their corresponding operations as Fig.1 shows. Neurons in convolutional layers and fully connection layers calculate the inner product shown in Fig.1 (a) of inputs and weights based on its incoming connection with the previous layer. And the products are subsampled through a pooling neuron shown in Fig.1 
We use average pooling here as a case study due to its simple hardware implementation. The subsampled outputs are transformed by an activation function shown in Fig.1 (c) to ensure the inputs of next layer are within [−1, 1].
B. Stochastic Computing Based Neuron
In stochastic computing, the value of a number which lies in [0, 1] is represented by the occurrence probability of 1s in a random bit stream. A 4-bit sequence X = 0010, for example,
Only a small subset of the real numbers in the interval [0, 1] can be expressed exactly in SC. Clearly, the precision and accuracy of SC depend on the length of the stream.
The two most popular representations for stochastic numbers are unipolar and bipolar formats, which interpret values in the intervals [0, 1] and [−1, 1], respectively. Unipolar coding is commonly used in unsigned arithmetic operations, whereas bipolar format is used in signed arithmetic calculations. More specifically, in unipolar coding, the number x carried in a stochastic stream of bits X is x = P (X = 1) = P (X), whereas in the bipolar format, x = 2P (X = 1) − 1 = 2P (X) − 1.
In this section, we first conduct a detailed investigation of the energy-accuracy trade-off among two hardware neuron designs using SC, i.e., APC-based neuron and MUX-based neuron, as shown in Fig.2 and Fig.3 , respectively. Hardwarebased pooling is provided afterward, and finally, we present the structure optimization method for the overall DCNN architecture.
1) APC-Based Neuron: Fig.2 illustrates the APC-based hardware neuron design, where the inner product is calculated using XNOR gates (for multiplication in bipolar coding) [25] and an APC (for addition). To be more specific, we denote the number of bipolar inputs and stochastic stream length by n and m, respectively. Accordingly, n XNOR gates are used to generate n products of inputs (x i s) and weights (w i s), and then the APC accumulates the sum of 1s in each column of the products shown in Fig.2 (a) . Instead of an FSM, a saturated up/down counter is used to perform the scaled hyperbolic tangent activation function Btanh(·) for binary inputs as [19] .
2) MUX-Based Neuron: As shown in Fig.3 , the MUXbased neuron is composed of XNOR gates, a MUX, and a K-state FSM. XNOR gates compute the products of bipolar inputs (x i s) and weights (w i s); the n−to−1 MUX sums up all stochastic products; and the hyperbolic tangent activation function is achieved by a K-state FSM, respectively. As the inner product calculated by a MUX is a stochastic number, the K-state FSM design mentioned in [25] can be used here to implement the activation function denoted as Stanh(K, x) = tanh( K·x 2 ) where x is the input. Nevertheless, two problems challenge the implementation: (i) the inner product calculated by an n input MUX is scaled to z n , where the correct inner-product result is z, and (ii) with the input z n , the K-state FSM calculates tanh( K·z 2·n ) instead of the desired value tanh(z). Thus, in order to recover the correct activation, we need to re-scale up the results of MUX by n times and multiply the stream by 2 K (or multiply by 2·n K directly). As opposed to the relatively simple and efficient data conversions on a software platform, such conversions in a hardware-based neuron incur significant hardware overhead, because the linear gain transformation needs one more FSM [25] , and the scaling multiplication requires one XNOR gate as well as another bipolar stochastic stream generated. In this paper, considering an n inputs neuron with inner product denoted by z, we select the state number K such that 2·n K = 1, and the final output of the FSM is calculated as
In this way, we achieve the desired activation result with no additional bit stream conversion (i.e., no hardware overhead).
x n w n ... } } n stochastic bit-streams with m length m n n to 1 Mux one column of n products 3) Pooling Neuron: In a DCNN, down-sampling steps are performed by the neurons in pooling layers, which reduce the dimensions of neurons for following convolutional layers or fully connection layers. Pooling operation achieves the invariance to input data (i.e., image, video, etc.) transformations and better robustness to noise and clutter. Moreover, the interlayer connections can be significantly reduced for a hardware DCNN by using pooling layers. In this paper, we adopt the average pooling which simply outputs a mean of k inputs. In SC-based hardware, we implement it using a MUX and each input x i has the same probability to be selected as output. For example, the stochastic arithmetic mean over 4 inputs (2 × 2 region) is provided by using three 2-to-1 MUXs connected hierarchically, where selection signal for each MUX is a stochastic bit-stream for 0.5.
IV. BUDGET-DRIVEN DESIGN ALLOCATION
Obviously, since neurons in each DCNN layer are homogeneous, we consider neurons in the same layer are implemented using the same design. Design allocation which represents how to arrange different neuron designs for each DCNN's layer is a challenging task. Accuracy, power, area, and energy constraints have to be satisfied. Inspired by the stable marriage problem [26] , we present a design allocation algorithm which optimizes the overall network design given a preferred parameter. The first step is to find a minimum feasible solution (MFS) so that each layer in the DCNN is implemented by a valid neuron design. The second step is to optimize the MFS by re-allocating implementation for neurons in each layer to achieve the desired network-wise objective. The objective of the overall DCNN optimization is evaluated in terms of a comprehensive design score defined as follows
where ω j is the integer weight of each design parameter, C j is the overall network cost in terms of one metric (e.g., area or power), and Err is the overall network error rate. As long as any one of the design parameter is the optimization target, corresponding ω j increases. A small score suggests an optimized design considering both design parameters and network performance.
A. Minimum Feasible Solution
The stable marriage algorithm cannot be applied directly to find a MFS. In the stable marriage problem, an element can be matched to another element freely. However, in our problem, a valid hardware design can implement multiple DCNN layers; several constraints result in limitation to apply a specific hardware design for neurons in a certain DCNN layer. For example, in the case, where using APC-based neuron in a convolutional layer leads to the budget violation, the convolutional layer should not be implemented with the APCbased neuron.
In the minimum feasible design allocation algorithm shown in Algorithm.1, the cost, denoted as c lid , of an implementation i on neurons in a layer l with a specific constraint d is defined as
where ψ is the number of neurons in the current layer and u lid is the unit cost of constraint d for a neuron in layer l by implementation i. Similarly, the budget for layer l with respect to constraint d is denoted by b ld . Layers in a DCNN are firstly initialized to be free. The MFS algorithm tries to find a valid design for each layer at first. An implementation is valid for a specific layer when the cost in every aspect of constraint metric satisfies the requirement. The first valid implementation is used for current layer. However, if all implementations are not valid, the solution is void. And an impossible solution is reported. This greedy process is repeated for each layer of a DCNN to obtain a MFS.
B. Joint Design Allocation Optimization
The MFS guarantees that each DCNN layer has a valid implementation design, but the complete DCNN is not optimized, i.e. the accuracy of DCNN is not optimized for the given constraints. A joint optimization algorithm is proposed in Algorithm.2. The cost in the algorithm is defined in Eqn.(3) , where we only calculate the cost for the constraint t. Taking the MFS and the given design budget as the inputs, the algorithm firstly seeks an implementation with lower cost for each DCNN layer. The cost difference between the old implementation and the new one has to be as big as θ to output the new one. After a lower cost design is ensured, the algorithm evaluates the refined implementation. If the accuracy increases, which means the error rate of the network with new implementation is smaller than the previous one, the algorithm will output the new implementation scheme for the DCNN. When the process reaches an iteration limit τ , the current solution is recognized as the optimized one and algorithm terminates.
V. EXPERIMENTAL RESULTS

A. Comparison between APC-based and MUX-based neuron
We use Synopsys Design Compiler to synthesize the neurons with the 45nm Nangate Open Cell Library [27] . For an APC-based neuron, the area, path delay, energy, power, and absolute error with respect to the input size are shown in Fig.4 (a), (b), (c), (d), and (e), respectively. Energy and absolute error differ due to bit-stream length change, while other measures remain constant. To be more specific, as illustrated in Fig.4 (a) , (c), and (d), the APC-based neuron shows an exponential increase in area, energy, and power including dynamic and leakage power as the input size of a neuron increases exponentially. This means that the area, energy, and power are linearly proportional to input size. However, path
Algorithm 2: Design Allocation Optimization
Data: minimum feasible design allocation solution S, list of feasible hardware implementations I, optimization objective constraint t Result: optimized design allocation solution S lastScore ← ScoreS currScore ← ScoreS Fig.4 (b) reflects a saturated pattern when the input size of a neuron increases to a certain point, which means that a large input size will not lead to extreme long path delay.
The reason results from: With the efficient implementation of Btanh(·) function, the hardware of Btanh(·) increases logarithmically as the input increases, since the input width of Btanh(·) is log 2 n. On the other hand, the number of XNOR gates and the size of the APC grow linearly as the input size increases. Hence, the inner product calculation part, i.e., XNOR array and APC, is dominant in an APC-based neuron, and the area, power, and energy of the entire APCbased neuron cell also increase at the same rate as the inner product part when the input size increases.
We observe that the error is normally distributed with a mean value of 0. In this paper, we denote absolute error as the absolute standard deviation of error's distribution. For different bit-stream lengths, the result, as Fig.4 (e) shows, agrees on the intuitive observation that a longer bit stream can reduce the absolute error of calculation. Moreover, more inputs lead to larger absolute error. The absolute error increases logarithmically with respect to input size. The improvement due to the increase of bit-stream length is non-linear but independent of the input size. Longer bit-stream length helps to reduce the error, but this improvement decreases and converges when bitstream length gets longer than 1024. Designers should consider the latency and energy overhead caused by long bit streams; the convergence of improvement trend helps designers achieve the desired trade-off between accuracy and overhead.
Similarly, we investigate the performance of the MUXbased neuron with respect to its input size. Fig.5 (a) , (b), (c), and (d) show the results of the number of inputs versus the area, path delay, energy, power, and absolute error with respect to the input size. Based on observation on the absolute error the MUX-based neuron can achieve, as shown in Fig.5 (e), when the input size exceeds 64, the absolute errors with different bit-stream length approach 1, which is about 100% error compared to correct value range. Then we plotted the chart for input size from 8 to 64 only. With input size increase linearly, the absolute error increase in a linear pattern. The reason is that MUX addition selects only one bit at a time and ignores the rest of the bits, leading high absolute errors when input size is large. Furthermore, as APC-based neuron performs, longer bit streams can reduce absolute error. The improvements are independent of input size; the improvements with respect to bit-stream length are logarithmic. When the bitstream length is increased large enough, the absolute error will not be reduced significantly. In general, MUX-based neuron gains larger absolute error compared to APC-based neuron, which suggests MUX-based should be applied to more errortolerant arithmetic operations. In addition, one can observe from Fig.5 (a) , (c), and (d) that as the number of inputs increases, area, power, and energy of the MUX-based neuron all tend to increase. The synthesis result also shows the path delay for a neuron increases approximate linearly when we enlarge the input size with the same stride. These are because the MUX-based neuron with more inputs requires more XNOR gates and MUXes for inner product calculation, and more states in the FSM (K = 2·n) to compute the activation. Hence, the increased hardware components result in more area, power, path delay, and energy in the neuron cell. we compare the performance between APC-based neuron and MUX-based neuron using a fixed bit stream length equal to 1024 under different input sizes, as shown in Table. I. Clearly, APC-based neuron is more accurate but occupies more area than MUX-based neuron. Besides, as APC is slower than MUX, the latency of APC-based neuron is larger than MUXbased neuron, which causes APC-based neuron to consume more energy than MUX-based neuron for one calculation. As for the power performance, an APC-based neuron has less switching (due to the long latency) and larger area than the MUX-based neuron, resulting in less dynamic power, more leakage power, and less overall power.
B. Evaluation of design allocation algorithm
We use LeNet-5 DCNN as a case study in this experiment to evaluate our budget-driven algorithm to optimize a stochastic computing based DCNN. Neurons in LeNet-5 DCNN layers are configured as 784(28
The MNIST handwritten digit image dataset [28] consisting of 60,000 training data and 10,000 testing data with 28x28 grayscale image and 10 classes is used in the experiments. The synthesis results are gathered as mentioned in Section III-B1 using Synopsys Design Complier with the 45nm Nangate Open Cell Library [27] .
We first listed several different configurations to implement each DCNN layer as Table. II shows. We fed these configurations into our algorithm given the budget constraint and evaluated the design score, defined as Eqn. (2), for the optimized configuration of the DCNN. To validate our algorithm, we optimized the DCNN with three different optimization targets as an example, i.e., area, power, and energy, given corresponding network-wise constraints. The energy is approximately positively proportional to power; we set the weight for energy as 0 and use power to represent energy approximately. In the experiment, three example cases were studied, where case 1 emphasized area and the rest emphasized power. Table. III, the proposed joint optimization algorithm picked configuration 4, 7, and 14 for case 1, 2, and 3, respectively. Under certain given constraints, some configurations are not valid, which is filtered out by the algorithm. For comparison, we calculated the scores for those configurations which are not optimized when they are valid (configuration 1, 2, 9, 12) . We have conducted a separate exhaustive search, and verified the selected configurations by the proposed algorithm are the best. One can observe from the Table. III that the picked configurations have the lowest score whereas other configurations are invalid or have larger scores. Also being observed from Table. III, our algorithm gives the best trade-off between network accuracy and design parameters (with the highest scores in all the cases).
VI. CONCLUSION
This paper introduced hardware implementation for Deep Convolutional Neural Network using stochastic computing. Each distinct stochastic computing based neuron in DCNN is analyzed. A two-step joint optimization algorithm is proposed that given design budgets, re-structuring the SC-based DCNN can achieve optimized hardware footprint with a relative high network accuracy performance. Experimental results showed with restricted design requirements, the optimized SC-based implementation for DCNN achieved the lower error rate with the least design resources requested.
