Abstract-Large-scale convolutional neural network is a fundamental algorithmic building block in many computer vision and artificial intelligence applications that follow the deep learning principle. However, a typically-sized CNN is well known to be computationally intensive. This work presents a novel stochastic-based and scalable hardware architecture and circuit design that computes a large-scale CNN with FPGA. The key idea is to implement all key components of a deep learning CNN, including multi-dimensional convolution, activation, and pooling layers, completely in the probabilistic computing domain in order to achieve high computing robustness, high performance, and low hardware usage. Our approach has three advantages. First, it can achieve significantly lower algorithmic complexity for any given accuracy requirement. For a N dimensional image feature map, we have theoretically proven that a random sample size of k Ã log ðNÞ is sufficient to achieve no more than 0.05 error at 95 percent confidence level, where k Ã is a constant of 510. This computing complexity, when compared with that of conventional multiplier-based architecture, represents on average 8.97Â and 6.98Â performance improvement for SCNN and Deep SCNN, respectively. Second, this proposed stochastic-based architecture is highly fault-tolerant because the information to be processed is encoded with a large ensemble of random samples. As such, the local perturbations of its computing accuracy will be dissipated globally, thus becoming inconsequential to the final overall results. More interestingly, our measured results have shown that 0.1 percent degradation in computing accuracy of CNN can actually mitigate the well-known overfitting problem. Overall, being highly scalable and energy efficient, our stochastic-based convolutional neural network architecture is well-suited for a modular vision engine with the goal of performing real-time detection, recognition, and segmentation of mega-pixel images, especially those perception-based computing tasks that are inherently fault-tolerant, while still requiring high energy efficiency.
INTRODUCTION
C ONVOLUTIONAL Neural Network (CNN or ConvNet), originally inspired by biological processes, is a special case of feed-forward artificial neural network, where individual neurons are tiled to facilitate computing overlapped visual regions in a high-dimensional signal field. Recently, various CNNs, especially these multilayer perceptron-based variation networks that minimize preprocessing, are widely used in image and video recognition tasks. For example, CNNs have been successfully applied to computer vision and machine learning applications such as face recognition [1] , handwritten character and digit recognition [2] , [3] , speech recognition [4] , etc. To illustrate the CNN's computing power, in Fig. 1 , we present a typical ConvNet architecture designed to recognize a given object from an input image after training. More specifically, this network, consisting of 10 feature extraction layers, reads a 224 Â 224 input and the output of its last fully-connected layer is fed to a 1,000-way softmax which produces a distribution over 1,000 class labels. Each volume of activations along its processing path are the direct computing results with our stochasticbased CNN implementation discussed in Section 4.
Conventionally, a typical CNN consists of four types of main processing layers: convolutional layer, activation layer, pooling layer, and a fully-connected classification layer as in regular neural networks. Each of these layers transforms one volume of activations to another through a differentiable function. Typically, to achieve any acceptable computing capability, millions of convolution and sub-sampling operations need to be performed with significant amounts of intermediate data results, which account for more than 90 percent of the total computing efforts in a CNN-based deep learning system [5] . Therefore, despite of its computing power, a deep CNN is unfortunately very resource-demanding in terms of energy consumption and computation effort. As such, the conventional implementation of CNN has become more and more complex and challenging due to its massive requirements of computation resources, which severely limits its applicability [6] .
Specifically, in practical Recognition, Mining and Synthesis (RMS) applications, a high dimensional database is often required. For example, MNIST handwriting database [7] , used commonly for classification and recognition algorithms, has a feature vector in a 784-dimensional space for each data point in the database. Not surprisingly, fully processing such a big dataset requires a huge power consumption and demands a good energy efficiency. In particular, one main restriction of a CNN's performance is its convolution step [8] . As such, a large number of research works have been presented to improve the hardware implementation of CNN for reducing the computational cost and memory communication. Among various hardware choices including ASIC and GPU, FPGA has recently been identified as a very attractive platform to implement deep learning CNN, which combines many benefits of both VLSI and general purpose soft-processor for accelerating CNN computing. In fact, since early 2,000, many accelerators have been proposed for CNN computing based on FPGA because general purpose processors cannot implement CNN efficiently. Most of the researches have focused on either improving the performance of computing circuit or the global communication with the off-chip memory issue, although a few of them have tried to improve both, such as [9] . So far, almost all existing CNN implementations are based on deterministic computing, therefore their performance are severely limited by hardware budget and memory bandwidth provided by existing FPGA platforms [9] .
All the above challenges call for an alternative computing paradigm for CNN. The key idea of this paper is to propose a Stochastic-based Convolutional Neural Network (SCNN) architecture by applying the stochastic-based computing principles on all computing components. Specifically, the main contributions of this paper include:
We present a stochastic-based multi-dimensional convolution, stochastic pooling, and stochastic activation layers to replace their deterministic counterparts throughout the network. To the best of our knowledge, this paper presents the first CNN architecture and hardware implementation completely based on stochastic computing principle. In contrast to stochastic computing that uses stochastic bit stream to encode the signal value [10] , our proposed methodology uses the probability density function (P.D.F.) of the signal as information carrier which we believe has much higher information content than simply using a large number of identically independent binary random bits. Besides the typical advantages that can be achieved by using stochastic-based computation, such as high energy efficiency, low area cost, simplicity and robustness, our proposed SCNN can also solve several additional CNN-specific challenges, such as using simple integer-based adders instead of the multipliers and adders that deal with fixed/floating point format numbers, achieving the three levels of parallelism simultaneously, avoiding the loop tailing required to fit a small portion of data on-chip, and eliminating the need for storing massive intermediate data results. Throughout this paper, we attempt to show that, by performing CNN computing tasks in a probabilistic domain, our SCNN architecture can readily leverage three types of parallelism in a typical CNN [8] : the parallelism in convolution operations, the parallelism in combining many convolved input feature maps and producing one output feature map, and the parallelism in computing many outputs independently. To our knowledge, there is no efficient hardware architecture that can achieve all these levels of parallelism simultaneously because of the limitations in hardware resources and memory bandwidth if computing in a deterministic domain.
The rest of this paper is outlined as follows. After briefly introducing the concept and architecture of a deep convolutional neural network in Section 2, we present in Section 3 our general ideas of how to stochastically compute a given deep CNN. Specifically, we present detailed descriptions and simple illustrative examples for each of three standard CNN processing layers: convolution, activation, and pooling. We then proceed with discussing the hardware architecture and implementation of our stochastic-based deep CNN in Section 4. In Section 5, we present the measured performance metrics of our stochastic-based CNN for various configurations. Subsequently, we apply the standard error analysis as well as robustness assessment to our implemented stochasticbased CNN in Section 6. Finally, we conclude our paper with Section 7 by offering our preliminary assessment about the applicability of our stochastic CNN methodology.
CONVOLUTIONAL NEURAL NETWORK
Deep learning is a recently developed machine learning methodology that attempts to model high-level abstractions embedded within large-scale data through multiple processing layers, each of which is network structured. Among its many variations, CNN is one of the most popular computing architectures for deep learning. A typical CNN is composed of multiple feature extraction layers and a single or multiple classification layers. The feature extractor filters input images into so-called "feature maps" that summarize and emphasize salient features of the image. Some of these extracted features are visually apparent, such as corners, lines, and circular arcs, etc., while the rest of them are more subtle. To make these features invariant to position shifting or distortions, each input image needs to go through multiple extraction stages, which ultimately result in a low-dimensional vector that feeds into a classifier. Many studies have shown that feature extraction is the most computationally intensive part of a deep learning network [11] . In this paper, we focus on accelerating the complete feature extraction layer with FPGA fabric, while relaying the classification task to an embedded processor core.
A practical convolutional neural network typically consists of several feature extraction layers, each of which normally contains convolutional layers, nonlinear activation layers, and one optional pooling or sub-sampling layer. Fig. 2 illustrates the composition of one feature extraction stage. Specifically, convolution layer extracts features from all local regions of input image by convolving a 2-D filter kernel over the input image. The filter kernels are obtained from a training phase, which are assumed to be known in this paper. We only focus on the feed-forward computation in a CNN. More specifically, we construct each convolution layer with parallel convolutional neurons, each of which can be represented as 3D filtering with different kernel coefficients on the input feature maps. Subsequently, this step produces different feature maps for the next layer, the first dimension is the number of input feature maps and the other two are the height and width of the feature maps. Convolutions consume more than 90 percent of the CNN computation time [12] , thus efficient convolver has a significant effect on the CNN performance. Right next to a convolution layer, a nonlinear activation layer transforms a resulting feature map through a predefined nonlinear function. This is to mirror the biological mechanism of subduing and enhancing neural signals within a biological brain. Finally, an optional pooling layer usually follows in order to reduce the resolution of a convolved image. The pooling layer takes small rectangular blocks from the convolutional layer and subsamples it to produce a single output from that block. There are several ways to do this pooling, such as taking the average or the maximum. While the maxpooling takes the maximum of the block it is pooling, the average pooling takes the average value.
To fully appreciate the enormous size and complexity of a realistic deep learning CNN, we present a recent winning CNN design of 2014 ImageNet competition in Fig. 3 , which contains eight feature extraction layers and three fully connected classification layers [13] . This network's input is 150,528-dimensional, and the number of neurons in the network's remaining layers is given by 253, 440-186, 6,248,110-64, 896-64, 89,63, 264-4,096-4,096-1,000. The output of the last fully-connected layer is fed to a 1,000-way softmax which produces a distribution over the 1,000 class labels.
STOCHASTICALLY COMPUTING CNN
In this section, we present the methodology to stochastically compute all three layers of a feature extraction stage. In the convolution layer, we exploit the well-known convolution theorem in probability theory to significantly reduce its computing complexity. Furthermore, in both nonlinear activation layer and pooling layer, we again only perform simple probabilistic operations to either pass or eliminate random samples. Note that, in all these three layers, we only process random samples. Therefore, the whole computing procedure exhibits a streaming mode. In other words, each layer doesn't have to wait for the completion of its preceding layer in order to proceed, therefore completely being pipelined. This contrasts sharply with the deterministic version of these layers.
We believe that our stochastic-based CNN methodology has several advantages. First, because all processing information are encoded probabilistically, many operations are further decoupled and its overall performance can be significantly improved. Second, the decoupling due to stochastic processing can improve the overall scalability of our CNNbased deep learning, which potentially can greatly reduce its total memory footprint. Finally, throughout our CNNbased deep learning system, no complicated operations such as multiplication or division are required. Instead, All operations are integer-based. The real value of the input data is encoded by probability values determined by the proportion of a particular random samples. Fig. 4 depicts the basic diagram of using convolution to compute feature maps in a deep learning network. Our proposed multi-dimensional stochastic convolution (SC) leverages the probabilistic principle that the probability density function of the sum of two or more independent random variables is the convolution of their individual P.D.F.s [14] , [15] . We define an n-dimensional random variable by mapping the outcomes of the space S to an n-dimensional space, thus obtaining the n-dimensional random variable X ¼ ½X 1 ; X 2 ; . . . ; X n , where X 1 , X 2 , and X n are 1D random variables.
Convolution Layer
Here is the Basic Idea. Suppose X n and Y n are two independent discrete n-dimensional random variables with distribution functions m X ðx 1 ; x 2 ; . . . ; x n Þ and m Y ðy 1 ; y 2 ; . . . ; y n Þ. Let Z n ¼ X n þ Y n . We now determine the distribution function m Z ðz 1 ; z 2 ; . . . ; z n Þ of Z n . In other words, we need to determine the probability that Z n takes on the vector Fig. 2 . Architecture diagram of convolutional neural network with three standard processing layers: convolution, activation, and pooling. Fig. 3 . Architectural illustration of the Deep CNN in [13] . ½z 1 ; z 2 ; . . . ; z n . Suppose that X n ¼ ½k 1 ; k 2 ; . . . ; k n , where k 1 ; k 2 ; . . . ; k n are some integers. Then Z n ¼ ½z 1 ; z 2 ; . . . ; z n if and only if Y n ¼ ½z 1 À k 1 ; z 2 À k 2 ; . . . ; z n À k n . So the event Z n ¼ ½z 1 ; z 2 ; . . . ; z n is the union of the pairwise disjoint events X n ¼ ½k 1 ; k 2 ; . . . ; k n and Y n ¼ ½z 1 À k 1 ; z 2 À k 2 ; . . . ; z n À k n , where k 1 ; k 2 ; . . . ; k n runs over all possible integer values ½À1; þ1. Clearly, we have m Z ðz 1 ; z 2 ; . . . ; z n Þ ¼ P 
(1) where ZðZ1; Z2Þ ¼ XðX1; X2Þ þ YðY 1; Y 2Þ.
The above derivation proves that, through interpreting input waveforms as probability density functions, a conventional multi-dimensional convolution in spatial-temporal domain can be readily translated into a number of independent parallel additions in probabilistic domain. To further discuss this method, we depict its key steps and algorithmatic blocks in Fig. 5 . First, the two n-dimensional random vectors X and Y are treated as two n-dimensional probability density functions m X and m Y , respectively. Second, for each of these n-dimension vectors, we generate large ensembles of random samples accordingly, we will discuss this process in details in Section 4. In Fig. 5 , we denote the set of these random samples as T X 1 ; T X 2 ; . . . ; T Xn and T Y 1 ; T Y 2 ; . . . ; T Yn . Third, corresponding to each dimension 1; 2; . . . ; n, we add random samples T X k and T Y k to generate a new set of random samples T Z k , where k ¼ 1; 2; . . . ; n. Subsequently, we extract the n-dimensional P.D.F of T Z . Finally, we post-process the results of m Z to obtain the final results of X Ã Y.
We now use a simple example of convolving two 2D input matrices to illustrate our probabilistic scheme (see Fig. 6 ). The key step is to interpret each matrix as a 2D P.D.F., i.e., Xði; jÞ ¼ x means P ðI ¼ i; J ¼ jÞ ¼ x P jXj , and consequently generate random samples according to this probability density function. In Fig. 6 , we have produced two streams of random samples, where the left and middle streams follow X and Y respectively. Each random sample is a three-tuple (i, j, and t), where i and j denote the x-and y-index of the input matrices and t denotes the sign of the corresponding matrix entry. When these sample sequences are long enough, the frequency of different random samples should approximate their probability value. For example, the first matrix can produce four possible entry locations (0,0), (0,1), (1,0), and (1,1) to make up a random sample. The extra sign bit t is used to handle the negative value entry in the input matrices. After two streams of random samples are generated, we add them pairwise and produce a new stream of three-tuple random samples. Two counters are assigned for each entry on the output side to determine the probability density function. One of them counts the positive random samples fall within the entry, while the other counts the negative samples. Subtracting the negative counters from their corresponding positive counter produces the ultimate output probability density function. The convolution of two vectors results from repeating the multiply-andaccumulate process. If one of the entries on the input side is negative, it will have a negative impact on a specific entry on the output side. Similarly with the probabilistic-based method, the negative sign has a negative impact on the corresponding counter on the output side. According to the well-known convolution theorem, the probability density function of the resulting random sample stream is the 2D convolution results. Note that the number of random samples can be as large as necessary. Intuitively, the more our random samples are, the more accurate our final convolution results will become.
To show the validity of our proposed algorithm, we have simulated it with Matlab software and compared the 2D conventional and probabilistic convolution with different number of random samples S, where S is the total number of random samples for the whole 2D vector. Given two 2D input matrices X and Y with the size N Â N ¼ 50 Â 50. One unique property of our proposed probabilistic convolution is that its solution can only achieve close-to-100 percent accuracy to the corresponding accurate solution. Given the fact that our proposed probabilistic convolver is mainly used in FPGA-based image and video processors, such accuracy poses no negative impact on the final results of our target applications. In the following comparisons, to be fair, we always choose s 0:001 to be the termination criterion for our probabilistic convolution. We define the relative error r as r ¼
, where Z Ã and Z are the probabilistic and conventional convolution results, respectively. In addition, kZ Ã À Zk denotes the euclidean norm of Z Ã À Z, which is defined by the formula kZ
q . Computationally, our stochastic-based n-dimensional convolution proves to be algorithmically quite efficient. In Section 6, we have analytically proven that, in order to approximate a given P.D.F. with no more error d at more than 1 À a confidence level, we only need to generate no more than S ¼ k Ã log ðNÞ random samples, where N denotes the size of a convolved feature map. Because our stochastic-based convolution algorithm relies solely on pointwise addition of all random sample pairs, it is clear that the computational complexity of our stochastic-based convolution algorithm can be quantified as Oðlog NÞ in big-O notation. In addition, we completely removed all numerical operations except additions and comparisons. This result is far more superior than the performance of the conventional direct convolution methods and the more sophisticated FFTbased method, which require OðN 2 Þ and OðNlog NÞ complexity, although with a slight accuracy loss. Later on in Section 6, we will show that, even with a 10 percent precision loss, our large-scale stochastic-based CNN system suffers no noticeable performance loss for our ImageNet recognition task, yet achieving about 5-10 times performance improvement at various precision settings (See Table 4 ).
Nonlinear Activation Layer
The nonlinear activation layer within a CNN is designed to mimic the biological mechanism of subduing and enhancing neural signals within a brain. Since its introduction, various activation transfer functions have been proposed and investigated, although which function is the best remains unsolved and most likely be application-specific. In this paper, we refrain from discussing which transfer function is the optimal choice because it is highly dependent on the specific input data set and the particular CNN topology. Instead, we present a detailed multi-phase pumping circuit to implement a widely-adopted linear rectifier activation function and demonstrate its effectiveness and hardware efficiency.
The standard way to model a neuron's output f as a function of its input x is with fðxÞ ¼ tanhðxÞ or fðxÞ ¼ 1 1þe Àx . In terms of training time with gradient descent, these saturating nonlinearities are much slower than the non-saturating nonlinearity fðxÞ ¼ maxð0; xÞ. Nair and Hinton [16] refer to neurons with this nonlinearity as Rectified Linear Units (ReLUs). Various research studies have shown that, with ReLUs, a four-layer convolutional neural network can significantly reduce its overall training time, or the number of iterations required to reach 25 percent training error on the CIFAR-10 dataset, by about 4 times when compared with its equivalent network that uses tanh functions. In this study, the fact that all neural signals are encoded probabilistically allows us to perform such a linear rectification completely within the stochastic domain by directly manipulating the random samples passing through the CNN. Specifically, we maintain an accumulative counter CT i in each neuron i. For each incoming random sample s, depending on its sign signðsÞ, we decide its passing or blocking accordingly by taking into consideration the current state of CT i . Algorithm 9 lists all essential steps necessary for this stochastic algorithm. This algorithm has been thoroughly verified with both simulation and hardware implementation.
We now consider how to handle the general case of a softlimiting activation function stochastically. As shown in Fig. 8c , any given form of a typical soft-limiting function can be piecewisely approximated with a number of linear segments with different slopes, each corresponding to a different nonoverlapping signal subrange. Fortunately, because we process all neural signals stochastically, such operations can be readily transformed into modifying their corresponding probability density functions. Specifically, after the signal subrange of an incoming random sample is decided, different linear scopes, within probabilistic domain, can be achieved through upsampling or downsampling these random samples, which in effect increases or decreases the probability or occurring frequency of these random samples.
Pooling Layer
Pooling layers in CNNs are designed to summarize the neuron outputs that belong to neighboring groups in the same kernel map. More specifically, one can think that a pooling layer consists of a grid of pooling units that are spaced t x pixels apart. Therefore, each neuron result after pooling summarizes a block of size k Â k pixels centered at the location of the pooling unit. There are several ways to perform pooling methods. Fig. 10 depicts two most frequently used pooling methods. In max pooling method, its output is given by the maximum activation over non-overlapping rectangular regions of size ðK x ; K y Þ. Max-pooling creates position invariance over larger local regions and down-samples the input image by a factor of K x and K y along each direction [17] . Max-pooling leads to faster convergence rate by selecting superior invariant features which improves generalization performance. In average pooling method, the output is computed by averaging the values in the region of ðK x ; K y Þ.
In this work, because all neural signals are encoded as random sample streams, fortunately, various pooling functions can be performed stochastically, i.e., manipulating random sample in simple ways. For the average pooling depicted in Fig. 10 , its stochastic computing turns out to be quite straightforward. All we need to do is to aggregate all streams of random samples located within each pooling subzone. Note that there is no change needed for all random samples and only their indices need to be changed by modular operations. We now provide a detailed description of our stochastic max-pooling algorithm listed in Algorithm 11. Specifically, at each location ðx; yÞ, we maintain an accumulative counter CT½x½y. For each subzone where we perform the max operation, we have an aggregated counter D max , which records the largest index value for all random samples seen so far in this subzone. Finally, three assisting functions signðsÞ, idx x ðsÞ, and idx y ðsÞ reads out the sign, xindex, and y-index of the random sample s, respectively. The key idea of Algorithm 11 is to output the correct number of random samples corresponding to the location ðx; yÞ with the largest probability value.
HARDWARE ARCHITECTURE AND IMPLEMENTATION
As previously depicted in Fig. 2 , a typical deep learning CNN consists of multiple feature extraction stages, each of which can be divided into multiple layers, such as convolution, activation, and pooling layers. In Fig. 12 , we have shown a stage with P input images or feature maps, each of which convolves with Q kernels, k Â k sized each. All results from a group of P convolutions are subsequently combined into one output image, which will be subsampled or pooled after passing a non-linear activation layer. Our stochastic-based CNN architecture attempts to achieve both high scalability, therefore capable of handling extremely big networks, and high efficiency in hardware usage while consuming low power. In this paper, we pres- ent two stochastic-based architectures. The first one is used to realize conventional CNNs when the whole network can be constructed with a single FPGA device, which is applicable for simple task CNNs, such as the CNN configurations illustrated in Table 1 . The second architecture realizes the state-of-the-art very deep CNN, such as the VGG-16 module [18] used in this paper (see Table 2 ). To facilitate SCNN implementations with multiple FPGA devices while considering hardware resources and memory bandwidth limitations, we need to adopt loop tiling technique to reuse the hardware resources for computation and store all intermediate data off-chip in external memory chips.
The overall hardware architectures for our stochasticbased CNN and deep CNN are depicted in Figs. 16 and 17 respectively. The input image data is read from the external memory block by block, each is 2D N Â N vector, and cached in on-chip buffers. In order to maintain the relatively high speed of our proposed computation compared to the data transfer time, we use double data buffers. While one date buffer is operated upon, the other data buffer will undergo the data loading process. There are two important components in the whole system. The first component is the random number generator used to transform the 2D vector values, representing the input or the convolution kernels, into random samples, whose two-variate probability density function mirrors these stream values. The second component extracts the two-variate probability density function from the resultant random samples. In the following, we provide more circuit design and implementation details.
Random Sample Generation
When a 2D input signal is treated as a two-variate P.D.F. of its indices, each value represents the probability that the input variable falls in a specific x-y coordinate, which is the index of that magnitude. The input buffer address is divided into two parts. The least significant half (LSH) indicates the column index, while the most significant half (MSH) indicates the row index. The block diagram of our random sample generator used for input signals is depicted in Fig. 13 . Parallel randomizers, constructed from LinearFeedback Shift Register (LFSR) and Binary Search (BS) circuit, are used to generate the random numbers that follow the P.D.F. of each row, where N is the number of rows for the input buffer. This random sample represents the column index of that row. One more randomizer is used to find out the row index and determine which column index will be selected by the multiplexer to get a pair of random rowcolumn indices.
Given a row vector X j ¼ fx j ð0Þ; x j ð1Þ; . . . ; x j ðN À 1Þg, where j is the row index and N is the number of columns, we would like to generate an ensemble of random samples of the random variable X j . These random samples will be exclusively drawn from the set f0; 1; . . . ; N À 1g and satisfy
for any i 2 f0; 1; . . . ; N À 1g. Obviously, to achieve high efficiency, we will avoid calculating all probability values . Additionally, we require that the total number of random samples will be solely determined by the user. In other words, at any point of random number generation, all samples that have been generated should faithfully follow the given P.D.F., which is called the ergodic property in statistics.
Our chosen method consists of two steps. First, as the input vector X j ¼ fx j ð0Þ; x j ð1Þ; . . . ; x j ðN À 1Þg comes in, representing a row from a 2D image block, we generate a new running sum vector for that row X j Ã ¼ fx j ð0Þ;
x j ð0Þ þ x j ð1Þ; x j ð0Þ þ x j ð1Þ þ x j ð2Þ; . . . ; P NÀ1 i¼0 x j ðiÞg. Second, we uniformly draw a random sample r x from the closed range ½1; x Ã ðN À 1Þ. Finally, we locate the particular segmented range ½x Ã ðiÞ; x Ã ði þ 1Þ that contains r x , i.e., x Ã ðiÞ r x x Ã ði þ 1Þ, where i 2 f0; 1; . . . ; N À 1g. Based on elementary probability theory, the P.D.F. of all random samples should follow PðX j ¼ iÞ ¼ Fig. 14 depicts a simple example of the random number generation scheme for each row. First, the input vector X j ¼ f3; 8; 11; 9; 4; 1g is converted into a running sum vector X j Ã ¼ f3; 11; 22; 31; 35; 36g. We then draw a random sample 32. In order to find its range index, we need to perform three comparisons with pivot values 31, 36, and 35. For 32, its comparison results are 1 (true), 0 (false), and 0 (false). Incidentally, this 3-bit binary number is equal to the index of the located segmented range ½31; 35.
In general, using the scheme depicted in Fig. 14 , we need to perform log N comparisons in order to find the correct range index. Fortunately, using the hardware structure in Fig. 14c , we can perform all comparisons in a pipelined fashion, therefore significantly improving its throughput.
The last randomizer is used to determine the row index and select one of the column indices generated by the other randomizers. It has N inputs, which are the sum of the last entry of each row vector X j Ã , R ¼ fx 
Probability Density Function Extraction
Given a stream of random samples at high speed, the P.D.F. extraction unit computes their probabilistic distribution, which roughly resembles histogram. When parallel independent adders are present, there will be parallel streams of random sums produced. Although the hardware usage of these adders is relatively cheap by itself, updating the counter values according to the index sums can be computationally demanding, especially when the number of adders is large. This is because the input memory typically has limited bandwidth and a limited number of IO ports. For this reason, we developed a parallel P.D.F. extraction hardware technique, in which, each of the random streams will update its partial P.D.F. in its corresponding block RAM. How to efficiently aggregate these P partial P.D.F.s poses a major performance bottleneck, where P is the number of parallel random streams. To overcome this challenge, we present a novel method that combines both periodic aggregation and adder tree structure. As depicted in Fig. 15 , this parallel P.D.F. extraction unit consists of three parts: 1) P copies of partial P.D.F. extraction units as discussed in Section 4, 2) An adder binary tree structure, and 3) a single block RAM to store the final aggregated P.D.F.
All the P units of partial P.D.F. extraction, depicted as gray blocks in Fig. 15 , share a global index line, whose value rotates from 0 to L À 1 continuously, where L is the BRAM size. The P.D.F. extraction unit has dual ports, one is connected to random stream coming from the adder and used to update the corresponding memory location value. While the other port is connected to the adder tree, and used to read the memory data one-by-one from all the partial P.D.F. extraction units simultaneously to be aggregated for the final P.D.F.. After reading a specific memory location, it will be reset to zero. The memory data is common for both ports, which can access any memory location at any time, therefore data conflict needs to be considered. When one port writes on a specific memory location, the other port must 14. An illustrative example of random number generation. a) Uniform random samples within the range [1, 36] . b) Given any random sample, its specific range can be located by an approach similar to binary search. c) Simple hardware structure that performs such a binary search procedure.
not read from or write on the same location. Each time when a new random index r i arrives, it will first be used as an address to read out its current counter value C i . Simultaneously, this incoming index will also be compared with the current global rotating index value j to avoid conflicts. If r i 6 ¼ j, the value C j will be pushed to the adder tree, C Ã j will be 0, and the counter value C Ã i ¼ C i þ 1. Otherwise when (r i ¼¼ j), the value 0 will be pushed to the adder tree and the counter value will be C i þ 1. There are at least two advantages of this partial P.D.F. extraction method. First, the periodic aggregation will ensure that all counter values will not grow too big since they are reset to zero after each read operation, therefore reducing the size requirement of all block RAM. Second, combining all partial P.D.F. values is completely parallel. In other words, as soon as all P units finish computing partial P.D.F.s, the final aggregated P.D.F. is ready only with a short latency of log 2 P clock cycles. Fig. 16 depicts the whole structure of our proposed SCNN. The core element of the SCNN is the Stochastic Convolution unit. It constructs from the stochastic convolution process and a RNG, generates random row-column indices follow the two-variate P.D.F. of the convolution kernel, the 2D block of weights. As presented above, each convolution is performed stochastically by using adders instead of multipliers. Double parallel adders are required for each SC unit. The first adder is used to add the random row indices of the input signal and a convolution kernel pairwise, and the second adder is used to add the random column indices.
Stochastic Convolution Neural Network (SCNN)
We discuss four important characteristics in the following about the simplicity of our proposed SCNN when compared with the conventional multiplier-based CNN architecture. 1) All the operations in our proposed SCNN are integer-based since they are performed on the indices of pixels, not their magnitudes. Therefore, the circuit is relatively simpler than the conventional implementation which requires either fixed point or floating point operations.
2) The image size reduces after each sub-sampling operation, thus the row-column index range will also be shrunk. As a result, adder size will be smaller for deeper convolution layers. If the input feature map of layer i is 128 Â 128, then 7-bit adders are required. After the sub-sampling, the feature map size will be 64 Â 64 and the adders at layer i þ 1 are 6-bit adders. 3) There is no specific number of adders needed to the success of the convolution computation. To the extreme, we could use only one adder to perform a whole convolution computing task regardless of its dimensionality and input size. In fact, using a large number of adders can only improve its computing performance. Since the addition operations of the SCNN are completely independent from each other, therefore they can be easily parallelized with more adders according to the available FPGA device area. In contrast, the deterministic multiplier-based CNN architecture requires (k 2 ) multipliers, where k is the kernel dimension, to perform efficient convolution. 4) The resultant random samples from the current stochastic convolution layer are not extracted as a P.D.F. because they will subsequently be used for the following sub-sampling and convolution layers. The P.D.F. extraction is needed only at the end of feature extraction layers because all computing within these layers are stochastic-based. This can save a huge area and run time from generating random samples and extracting P.D.F.s between stochastic-based layers, and keep the feature extraction layers as a hierarchy of simple adders.
The convolution kernel row-column random indices can be generated by the circuit presented in the previous section as shown in Fig. 13 . Since the convolution kernel coefficients are fixed, another random numbers generation method can be used. Random samples can be pre-computed and saved in either BRAMs or distributed RAMs, available in the SLI-CEM of FPGA devices. For instance, if the 2D kernel coefficients are [5,1;8,6] , then 25 percent of the row-column indices, that need to be stored in a RAM, are "00", where the MSB is the row index and the LSB is the column index. LFSR circuit is used to randomly select one of the stored indices. This cannot be done in multiplier-based convolution since the convolution kernel coefficients have to be available simultaneously to perform systolic multiplication process. They have to be saved in registers which are very expensive in terms of slice register usage.
Large-Scale and Scalable Architecture for Deep CNN
Deep CNNs used for large-scale classification and recognition tasks are commonly much larger than these CNNs used for simple applications. In fact, a realistic CNN tends to be very large in size and complexity as well as be trained with very large datasets in order to determine its kernel coefficients. For example, the Oxford network VGG-16 shown in Table 2 reads 224 Â 224 input image and contains more than (1.63 M) convolution nodes [18] . Therefore, the stochastic CNN that realizes the whole network and stores all weights on a single chip is hard to support very large CNN models. In this section we propose a uniform stochastic CNN architecture that can be used for large-scale models and handle this so-called scalability problem. Fig. 17 depicts the architecture of our proposed stochastic deep CNN. Parallel Processing Elements (PE), each corresponding to one output feature map, are placed to receive random samples from the Random Number Generators (RNGs). The PE is constructed from multiple stochastic convolution units (CONV) . The number of CONVs represents the number of input ports or feature maps that can be read simultaneously. Each unit performs the convolution between a specific input feature map, represented by random samples, with a k Â k sized kernel stochastically as discussed in Section 3 using multiple parallel adders. Parallel stochastic processing units containing a number of stochastic convolvers can produce a huge number of random samples simultaneously, thus achieving very high overall processing throughput. Clearly, we need to refrain from storing all the intermediate data off-chip in order to mitigate the memory bandwidth problem. Therefore, we need to extract the P.D.F. from the random samples corresponding to each output separately after each processing iteration. The P.D.F. extraction unit can be placed before or after the non-linear activation (NL) and pooling (MP) units. To use the stochastic NL and MP presented in this paper, the P.D.F. extraction can be performed after them. If so, a stochastic NL with MP is required for each random stream, which is a lot. Therefore, in our architecture, we placed the P.D.F. extraction unit before the NL and MP as shown in Fig. 12 . In this case, these two units will be performed conventionally not stochastically. The P. D.F. extraction of a specific output is constructed from parallel BRAMs with switches as shown in Fig. 17 (CONV circuit), where the outputs of them are combined by using the Adder Tree (AT). All switches, connected to the BRAM, are used for double buffering technique, when one buffer is updated by the parallel adders, the other buffer is connected to the AT to produce one stream for each output.
RESULTS ANALYSIS AND COMPARISON
In this section, we present the experimental results and their comparisons with other FPGA-based CNNs for our proposed two architectures, SCNN and Deep SCNN. We used a Xilinx Virtex-6 FPGA device to implement our proposed SCNN. Table 3 shows the resources utilization of our proposed SCNN and Deep SCNN. The first two rows represent the required resources to realize Configuration-1 and Configuration-2 (See Table 1 ), that are usually used for simple tasks, based on the architecture shown in Fig. 16 . In addition, the third row shows the resources utilization required to realize Deep SCNN (See Table 2 ), which has higher recognition accuracy, based on the architecture shown in Fig. 17 . To make our performance comparisons fair, we measure and compare the total execution time for a given computing task between our stochastic-based architectures and the conventional deterministic-based ones. Specifically, we calculate the required stochastic computations to perform convolutional network and find the total execution cycles for different accuracy factors. Subsequently, we convert the results to GOPS (Giga Operations Per Second), by dividing the CNN size over the execution cycles and multiply the result by the cycle period, and use it as the performance metric. Because different FPGAs are used in our study, we also use another performance metric, the performance density [9] , to understand and compare hardware-efficiency of different CNN implementations. In this paper, we define the hardware density as the average GOPS per logic slice (GOPS/Slice). Note that the logic capability of each logic slice across different FPGA devices mostly remains the same, therefore this performance metric of hardware density conceptually indicates how computationally effective a given unit of hardware is being used. Intuitively, this computing density number is also closely related to the energy efficiency for computing. SCNN-Our proposed SCNN uses simple adders to perform convolution stochastically. Therefore, for a given amount of hardware, we can implement a significantly larger number of convolvers working in parallel. Furthermore, because, in our SCNN architecture, all logic operators within feature classification layers deal with stochastic samples, there is no need to store intermediate data. We only need to read the input image at the first layer and store the output of the last layer. This can significantly reduce the overall power consumption of the CNN, which are typically dominated by data transfer and memory access operations. Specifically, in our SCNN implementation, we replace k 2 multipliers and k 2 adders, in each convolver of deterministic architecture, with a pair of adders with a RNG. Note that the execution time of a stochastic-based convolution unit is mainly proportional to the size of the input image and the accuracy factor (s), i.e.,
, where r Â c is the tile size, and S ¼ k Ã log ðr Â cÞ is the total number of random samples needed for each portion of the input image in order to achieve the required accuracy. Table 4 compares our proposed SCNN, using different accuracy factors , s ¼ 0:001; 0:05, and 0.1, with the conventional CNN for two different configurations [8] , [20] shown in Table 1 . To make our performance comparisons fair, we keep the CNN sizes fixed for both deterministic and stochastic architectures at 0.24 and 0.26 GMAC for configuration-1 [20] and configuration-2 [8] respectively. Our results have shown that significant hardware savings and performance improvement when using stochastic-based computing architecture. For example, we have observed that the speed-up factors of our SCNNs in GOPS range from 3.3Â to 9.9Â, and 5.1Â to 15.4Â for different configurations based on the accuracy factor. In terms of performance density, our stochastic CNN approach achieves 5.7Â to 17.14Â and 6.8Â to 20.48Â more GOPS/Slice for different accuracy factors over the deterministic realization of configuration-1 and configuration-2, respectively. Finally, we compared our proposed SCNNs with the conventional CNNs presented in [9] in term of energy consumption. Our results have shown that our SCNN, on average, consumes 20Â less energy than the deterministic CNN. This significant energy saving is due to two main reasons. First, in the conventional deterministic-based CNN architecture, between processing stages, a large amount of intermediate data need to be stored and retrieved, which translate into huge numbers of read/write operations from/to off-chip memory and consume a significant amount of energy consumption even larger than their associated algorithmic operations. Our stochastic-based SCNN architecture, as discussed in Section 4, can significantly reduce its overall memory communications, hence greatly improving its energy efficiency. Second, the conventional CNN architecture requires a great number of multiplications, which are typically achieved through using DSP slices for MAC operations. Unfortunately, it is wellknown that DSP blocks, such as DSP48E1S in Virtex-6 FPGA, consumes 23Â and 26Â more power than distributed RAM 16 arrays of 32 words Â 16 bits with and without M-register respectively, when the clock frequency is 200 MHz. In contrast, in our SCNN architecture, no DSP block is utilized because only simple additions are needed. In summary, all the above results have clearly indicted that our stochasticbased computing approach represents a much more efficient way to utilize a given unit of hardware and energy.
Deep SCNN-To benchmark the performance of our proposed stochastic-based CNN architecture in large scale, we choose the deep learning CNN configuration that wins the first place in image classification task of the 2015 ImageNet Large-Scale Visual Recognition Challenge (ILSVRC). It consists of five convolution layer groups and three fully connected layers. There are different versions of this model, e.g., VGG-11, VGG13, VGG-16, and VGG-19. The number of convolution layers in each group varies in each version. In this paper, we use VGG-16 as the standard model for all the presented results. The configuration of this model is presented in Table 2 , where 13 convolution layers are followed by three fully connected layers. The size of all the convolution kernels is 3 Â 3. Each convolution layer is followed by rectification non-linearity. In all the following results, we adopted the Image-Net dataset to evaluate the performance of our proposed stochastic CNN, which includes 1.3 M images for training dataset and 50 K images for validation dataset divided into 1,000 classes.
For a specific convolution layer, we have limited resources for computation, therefore, the required number of phases will be where, N and M are the number of inputs and outputs images at layer i, R Â C is the input image size where R ¼ C, and ðn; m; r; cÞ are the tile size combination. The time needed for computation for each phase depends on the required number of random samples S which can be determined by
where S ¼ k Ã log ðr Â cÞ, k Ã is relative to the computation error described in Section 6, for instance k Ã ¼ 510 when the error tolerance d ¼ 0:05 and s ¼ 0:05, and PA is the number of parallel adders in each convolver
Under the above constraints, the estimated computation time for convolution layer is given in Equation (4), considering n ¼ CONV and m ¼ PE. Table 5 shows the convolution time required for the VGG-16 CNN model when f ¼ 200 MHz and d ¼ 0:05. In this configuration of our proposed architecture, CONV , PE, and PA are set to 5, 5, and 40, respectively. In each CNN convolution layer, (CONV) parallel convolvers, process n input feature maps, need to be aggregated to produce one output feature map. If we have m output feature maps at this layer, then (PE) parallel processing elements, each has n convolvers, are required. Table 6 compares our proposed Deep SCNN with existing FPGA-based architectures. It shows that ours outperforms other architectures in terms of performance, hardware resources utilization and power efficiency. The performance of SCNN mainly depends on the number of random samples, which controls the level of computing accuracy. The results show that Deep SCNN achieves on average 6.98Â more GOPs compared with the most recently published papers. In terms of area efficiency, our proposed method achieves 4.58Â more GOPS/Slice than the best architecture among the others. This result is not surprising because a digital multiplier is far more expensive than an adder in hardware resources. Finally, the last row in Table 6 presents the power efficiency of our implementation. It shows that SCNN achieves 6.86 Â more GOPS/W compared with the CNN in [9] .
ERROR ANALYSIS AND ROBUSTNESS RESULTS
The success of our stochastic convolver hinges on the fact that we can accurately and efficiently generate an ensemble of random samples that represent any given probability value. One critical question to ask is how many random samples are enough to achieve any required computing efficiency. In this paper, the most important building block of discrete convolution is adding two independent random variables and extracting the probability density of its resulting sum. In the following, we derive the probabilistic error bound in two steps. First, we investigate the required number of random samples in order to precisely represent the desired probability density function. Second, we obtain the relationship between the total number of random samples and the overall accuracy of our probabilistic convolution. Our major conclusion is that, to achieve no more than 0.05 error at 95 percent confidence level in representing an N-sized image feature map, only k Ã log ðNÞ random samples suffice, where N and k Ã denote the input size and a constant of 510, respectively. Even better, as shown in Table 7 , when the confidence level rises to 99 percent, there will only be a 50 percent in its required random sample size to 788.
Before proceeding further with our analytical proof, we introduce the classic ball-and-bin experiment widely used in deriving probabilistic counting arguments. The experiment consists of selecting S balls at random from a bin with W ¼ P N i¼1 w i balls, w i of which are indexed by number i and i ¼ 1; 2; . . . ; N. Random variable I i gives the number of the balls indexed by i in the sample for each i. Intuitively, if with replacement during random sampling, S Â w i =W can be the standard estimate of the random sample I i , i.e., the number of sampled balls with the index i. A simple example of such a ball-and-bin experiment is shown in Fig. 18a depicts a bin of 15 balls with 2, 4, 4, and 5 balls are indexed by 1, 2, 3, and 4, respectively. If randomly select S ¼ 15 balls from this bin, most likely, the index distribution of the sampled balls depicted in Fig. 18c should match with the given ball index distribution depicted in Fig. 18b with a high probability. Intuitively, the more balls we randomly sample from the bin, such a matching probability should increase. Formally, let X ¼ fx 1 ; x 2 ; . . . ; x N g denote a given N-sized input vector. We define a discrete random variable I whose range R I is countable finite. By definition, there is a discrete probability density function that specifies I's probabilistic distribution by giving the probability for I to take a value i, i.e., p : R ! R; i 7 ! pðiÞ ¼ P ðI ¼ iÞ, where i ¼ f1; 2; . . . ; . . . ; Ng. Furthermore, let pðiÞ ¼ Given a well-defined P.D.F., how many random samples are sufficient to approximate this P.D.F. within a pre-determined error bound and with high probability? Before considering the complete target P.D.F. f X , we first solve the random sample size problem for any given single probability p i , where i 2 f1; 2; . . . ; Ng. More specifically, we assume k independent random samples y j , where j 2 1; 2; . . . ; k, will be generated according to the following condition:
Therefore, if there are k independent random samples y j ,
can serve as an unbiased estimator to its true value p i . Our objective is to find the necessary number of random samples k that ensures
where r is the relative error and 1 À a is the confidence level. According to the basic statistics theory (See Page 53 of [23] ), ifp i is unbiased, thenp
ffiffiffiffiffiffiffiffiffiffiffiffi ffi varðp i Þ p has a standard normal distribution. Letting z denote the upper a=2 point of the standard normal distribution yields
To make the above equation satisfied, we need to choose k large enough to make
We now derive the equation of varðp i Þ. From Equation (5), assuming K samples are large enough to yield accurate p i value, we have p i ¼ 1 K P K j¼1 y j . Furthermore, its finite-population variance is
Letp i denotep i ¼ 
To approximate p i withp i , which has probability at least 1 À a of being no further than d ¼ rp i from the accurate value, we now combine Equations (8) and (10) . Therefore, the necessary random sample size requires no more than
Because asymptotically, to obtain the 100 percent p i , the required random sample size approaches 0. Therefore, Equation (12) can be simplified as
where the absolute error d equals to rp i and z is the upper a=2 point of the normal distribution. We now proceed with analyzing the required random samples size to represent a given (P.D.F.). As depicted in Fig. 14 , the random number generation circuit in our stochastic CNN resembles a binary search procedure. For each random sample, its value idx is sequentially generated in a bitwise fashion. Conceptually, the overall random samples generation process can be depicted as a binary refining process of defining a given (P.D.F.). As shown in Fig. 19a , after the first comparison, all random samples are essentially divided into two groups with a given ratio that can be interpreted as a given probability. According to our previous analysis, to approximate p i withp i , which has probability at least 1 À a of being no further than d ¼ rp i from the accurate value, the necessary random samples size requires no
. Similarly in the next comparison, as depicted in Fig. 19b , more samples are further needed to accomplish the second division. Once again, to achieve the same accuracy and the same confidence level, another set of k Ã random samples are needed. For a P.D.F. containing N values, it is clear that such subsequent binary divisions only need to last log ðNÞ times, with each iteration require exactly k Ã random samples. Therefore, in order to approximate a given P.D.F. with no more error than d at more than 1 À a confidence level, we only need to generate no more than S ¼ k Ã log ðNÞ random samples. To illustrate the impact of random samples size towards the CNN computing accuracy, we have chosen six representative image recognition cases from our 50,000 validation dataset. For each case, we use different levels of accuracy (1-Error) by changing the number of random samples for the convolution nodes used in our stochastic-based CNN and then measure all resulting ranking scores, note that (Error = 0 percent) means the deterministic CNN is used. In all six cases presented in Fig. 20 , even when the error percentage exceeds 20 percent, our stochastic-based CNN can still produce the correct image recognition results, although with quite different ranking scores. One interesting case is the case of Fig. 20b , even with 0 error rate, the CNN still produces the incorrect result, which we contribute it to the inherent limitations of CNN-based deep learning methodology.
Top1 and Top5 are used to determine the accuracy of our proposed stochastic CNN. They represent the percentage of matching the input image to the highest rank label or to the first five highest rank labels within the predicted list for the whole validation dataset, respectively. From Fig. 21 , we can see that the Top1 and Top5 for conventional CNN, when (E = 0 percent), are 65.99 and 86.87 percent respectively. The accuracy in SCNN can be controlled by the number of random samples, which changes the Error ( percent) shown in Fig. 21 . The figure shows the effect of decreasing the random samples size on the results accuracy. Specifically, the stochastic CNN with number of random samples that degrade the performance of convolution by Error ¼ 5 percent can achieve 65.68 and 86.77 percent for Top1 and Top5 respectively. 
CONCLUSION
Compared with the existing deterministic architectures for CNN, our stochastic-based architecture can achieve much higher robustness, energy-efficiency, and lower hardware cost, although with a slight degradation in computing accuracy. Fortunately, such a slight inaccuracy has been proven to be quite useful in deep learning area because it can effectively avoid harmful data overfitting. Another interesting advantage of our approach is that the computing accuracy of our architecture can be readily controlled by injecting different number of random samples. This allows easy adjustments, even at run time, to tailor for specific applications. All these benefits can be very attractive to many applications of computer vision and artificial intelligence, where computing cost and robustness are paramount requirements.
