This paper highlights new opportunities for designing large-scale machine learning systems as a consequence of blurring traditional boundaries that have allowed algorithm designers and application-level practitioners to stay, for the most part, oblivious to the details of the underlying hardware-level implementations. The hardware/software co-design methodology advocated here hinges on the deployment of compute-intensive machine learning kernels onto compute platforms that trade off determinism in the computation for improvement in speed and/or energy efficiency. To achieve this, we revisit digital stochastic circuits for approximating matrix computations that are ubiquitous in machine learning algorithms. Theoretical and empirical evaluation is undertaken to assess the impact of the hardware-induced computational noise on algorithm performance. As a proof-of-concept, a stochastic hardware simulator is employed for training deep neural networks for image recognition problems.
Introduction
Applications that automate the process of extracting meaningful insights from an ever-increasing trove of user and sensor-generated data have emerged as one of the dominant consumers of computing resources. The natural error-resilience of a large suite of learning algorithms enabling such applications is well-documented, setting them apart from more traditional workloads that typically require high-precision computation and number representations with high dynamic range. The strategy of embracing errors during computation is in fact a binding theme across several disciplines that impact large-scale machine learning. It is well appreciated that in the presence of statistical approximation and estimation errors, high-precision computation in the context of learning is rather unnecessary [1]. Consequently, stochastic optimization techniques [2] and randomized numerical linear algebra [3] are becoming critical components of the lower layers of an optimized machine learning software stack, blurring the computation-statistics interface [4]. Yet, machine learning applications continue to be deployed on general-purpose computing platforms that have been designed to cater to the needs of the traditional workloads, incurring a high, and often unnecessary, penalty in overall system performance.
The motivation for this paper stems from the idea that the learning algorithm's intrinsic robustness to noise may be leveraged to relax certain constraints on the underlying hardware. In the proposed model, the compute engines executing the algorithm perform approximate computations, introducing non-deterministic errors in the process. It is reasonable to expect that this loss in accuracy is accompanied by a corresponding increase in speed and/or energy efficiency per computation. For instance, reduced-precision fixed-point units are typically faster and often consume far fewer hardware resources and less power than floating-point engines. If exposing the hardware-generated noise to the algorithm does not result in degradation in terms of a pre-defined quality metric, this hardware-software co-design scheme can prove to be a viable approach for optimizing the system performance.
Provoking a discussion along these lines is especially timely, given the increasing likelihood of the demise of Moore's law/Dennard scaling and the resulting collapse of the conventional model of processor design that owes much of its success to the sustainability of transistor area and performance scaling. Our approach involves identifying noise-tolerant kernels that dominate the algorithm run time and offloading the execution of these kernels onto a dedicated hardware accelerator that performs approximate computations while interacting closely with the host processor. As will be elucidated in the sections to follow, invoking approximations at the compute level adds another dimension (and complexity) to the hardware design space, and truly benefiting from such an approach entails careful engineering of several closely coupled aspects of system design. These include defining the accelerator microarchitecture and optimizing the host-accelerator interface, which involves non-trivial optimization in the design subspace defined by the requirements on performance, accuracy, energy consumption and implementation costs. It is equally important to preserve the programming model so that these hardware benefits can be readily absorbed at the application level without incurring additional software development costs. Since hardware design is typically beleaguered with substantial engineering costs and longer development time than software, it is not only prudent but also necessary to firmly establish the feasibility of the approximate computing techniques for a given set of target applications. Care should also be taken to avoid a common pitfall in application-specific integrated circuit design, in which the usefulness of the hardware solution is limited to only a niche set of applications. Honoring this constraint is imperative for adequate amortization of the hardware development costs.
This work addresses these research problems and makes the following contributions:
1. Based on the observation that computations involving large matrices are pervasive in data analytics and machine learning workloads, we propose the use of digital stochastic circuits for approximate matrix multiplication and develop a high-level abstraction to model the error introduced by this stochastic hardware.
2. Using this abstraction, we analyze the impact of approximate computation on the gradient descent algorithm and present new techniques by which the stochastic hardware can augment algorithm design and improve execution time.
3. We train deep neural networks for the MNIST handwritten digit classification problem. We observe that networks trained in the presence of hardware noise yield error rates no worse than those trained using precise computations.
Matrix Multiplication using Stochastic Hardware
The foundations of stochastic computing circuits can be traced to the work by Poppelbaum [5] and Gaines [6] in the late 1960s, coinciding with the early days of the computing revolution. After a fallow period that spanned several decades, there has been a discernible renewal of interest [7, 8, 9, 10] in this rather unconventional method of information processing. In this section, we present the key concepts of stochastic computation and extend them for implementing approximate matrix multiplication. We focus our attention on the multiplication operation for two main reasons: 1) from an application perspective, general matrix multiplication (GEMM) represents the most computationally expensive function within any basic linear algebra subprogram (BLAS Level 3) library implementation, and 2) from a hardware implementation perspective, multipliers consume significantly more resources in terms of area and energy than adders and subtractors. The hardware circuit complexity is $O(n^2)$ for an n-bit binary tree multiplier and $O(n)$ for an n-bit full adder. As discussed next, stochastic circuits significantly reduce the complexity of hardware implementation of certain arithmetic functions, providing an opportunity to achieve a high degree of parallelism and a corresponding improvement in computational performance.
Stochastic representation. Within the stochastic computation framework, a number x ∈ [0, 1] is represented as an N-bit long Bernoulli sequence, $X = \{X_1, X_2, \ldots, X_N\}$, such that the binary random variable $X_i$ takes the value 1 with a probability equal to x, i.e. $P(X_i = 1) = x$. Each of these N stochastic bits can be generated by comparing x against a random sample drawn independently from U[0, 1], assigning $X_i$ to 1 if x is greater than the random sample and to 0 otherwise. The number encoded in a given stochastic bit-sequence can be estimated by counting the average occurrence of 1s in the sequence.
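The encoding and decoding steps described above can be sketched in software as follows. This is a minimal NumPy illustration, not a model of any particular circuit; the function names `encode` and `decode` and the use of a pseudo-random generator in place of a hardware random source are assumptions made for the example.

```python
import numpy as np

def encode(x, n_bits, rng):
    """Return an n_bits-long Bernoulli bit-sequence with P(bit = 1) = x."""
    if not 0.0 <= x <= 1.0:
        raise ValueError("x must lie in [0, 1]")
    # Compare x against independent uniform samples from [0, 1):
    # a bit is 1 when x is greater than its sample, 0 otherwise.
    return (x > rng.random(n_bits)).astype(np.uint8)

def decode(bits):
    """Estimate the encoded number as the fraction of 1s in the sequence."""
    return float(np.mean(bits))

rng = np.random.default_rng(0)
for n_bits in (64, 1024, 16384):
    # Longer sequences give estimates that cluster more tightly around 0.3.
    print(n_bits, decode(encode(0.3, n_bits, rng)))
```

The estimate fluctuates from run to run, and the spread shrinks as the sequence gets longer.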
Scalar multiplication. Encoding numbers as probabilities allows arithmetic operations such as addition and multiplication to be implemented using simple digital logic gates, at the cost of introducing non-deterministic errors in the computation. Consider two scalar quantities a and b scaled appropriately to lie in [0, 1]. Let A and B be N-bit long stochastic sequences representing a and b, respectively. By definition,
$$P(A_i = 1) = a, \qquad P(B_i = 1) = b. \tag{1}$$
Let C be a stochastic bit-sequence obtained by performing a bit-wise logical AND operation on sequences A and B. It is assumed that the digital hardware implements exact AND gates, i.e. AND(0, 0) = 0, AND(0, 1) = 0, AND(1, 0) = 0, AND(1, 1) = 1 with probability 1. Therefore, for independently generated sequences A and B,
$$P(C_i = 1) = P(A_i = 1)\,P(B_i = 1) = ab. \tag{2}$$
The expected value and the variance of the binary random variable $C_i$ can be expressed as:
$$E[C_i] = ab, \qquad \mathrm{Var}(C_i) = ab\,(1 - ab). \tag{3}$$
The number represented by the Bernoulli sequence $C = \{C_1, C_2, \ldots, C_N\}$ may be viewed as a random variable c obtained by averaging the N independent binary random variables $C_i$, i.e.
$$c = \frac{1}{N}\sum_{i=1}^{N} C_i, \qquad E[c] = ab, \qquad \mathrm{Var}(c) = \frac{ab\,(1 - ab)}{N}. \tag{4}$$
For large N, invoking the central limit theorem, multiplication using stochastic number representations produces an unbiased estimate of the product, corrupted by zero-mean Gaussian noise whose variance is inversely proportional to N, the number of bits used in the stochastic sequence. It is possible to extend this analysis to a, b ∈ [0, r] by sampling the random number used for generating the stochastic bit from U[0, r] and rescaling the averaged output by $r^2$. In such a case, the error variance in Eq. (4) needs to be modified as:
$$\sigma^2 = \frac{ab\,(r^2 - ab)}{N}. \tag{5}$$
Note that the error variance depends on the values of the numbers being multiplied, and tends to zero as the inputs approach the limits 0 and/or r.
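The AND-gate multiplication and its error behaviour can be checked numerically with a short simulation. The sketch below is illustrative only: the function name `stochastic_multiply` is hypothetical, the estimate is assumed to be rescaled by r² so that it targets the product ab, and the comparison value ab(r² − ab)/N is the variance that follows from the Bernoulli model under that assumption.

```python
import numpy as np

def stochastic_multiply(a, b, n_bits, rng, r=1.0):
    """Estimate a*b for a, b in [0, r] with a bit-wise AND of two bit-sequences."""
    # Independent uniform samples from [0, r) for each operand.
    bits_a = a > rng.uniform(0.0, r, n_bits)
    bits_b = b > rng.uniform(0.0, r, n_bits)
    # AND gate: each output bit is 1 with probability (a / r) * (b / r).
    bits_c = bits_a & bits_b
    # Assumed rescaling by r**2 so the estimate targets a*b.
    return r * r * float(bits_c.mean())

rng = np.random.default_rng(0)
a, b, r, n_bits = 0.6, 1.5, 2.0, 256
samples = np.array([stochastic_multiply(a, b, n_bits, rng, r) for _ in range(5000)])
print("exact product      :", a * b)
print("mean of estimates  :", samples.mean())
print("empirical variance :", samples.var())
print("a*b*(r^2 - a*b)/N  :", a * b * (r * r - a * b) / n_bits)
```

Doubling `n_bits` should roughly halve the empirical variance, which is the 1/N dependence described in the text.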
Vector inner product. The stochastic computation methodology described above can also be applied to vector dot product and matrix multiplication. Consider two vectors $a, b \in \mathbb{R}^d$ and define $c_0 = \langle a, b \rangle$, the inner product of vectors a and b. We assume that each component of a and b lies in [0, r]. $c_0$ can be estimated by generating N stochastic bits for each of the d components of a and b, and counting the occurrences of 1 in the bit-wise AND of the Nd bits representing vectors a and b.
For uncorrelated stochastic bit-sequences, the resulting estimate c satisfies
$$E[c] = \langle a, b \rangle = c_0, \tag{6}$$
and
$$\mathrm{Var}(c) = \frac{1}{N}\sum_{j=1}^{d} a_j b_j\,(r^2 - a_j b_j). \tag{7}$$
In a parallel implementation with d units (Figure 1), in every clock cycle the j-th unit produces a stochastic bit for $a_j$ and $b_j$ by comparing them against random numbers drawn independently from a uniform distribution. The stochastic bits are fed into an AND gate, generating a bit that is set to 1 with probability $a_j b_j$ (for inputs scaled to [0, 1]). A d-bit parallel counter sums up the 1-bit outputs of these d units, producing an estimate of the inner product. The output of the counter can be averaged over N clock cycles in order to refine the result of the inner product computation. It is preferable to force N to be a power of 2 so that the normalization of the accumulator result by N can be implemented using inexpensive bit-shift operations. In this particular design of a multiply-and-accumulate unit, the accuracy of the stochastic computation can be tuned by suitably adjusting N, which allows accuracy to be traded off for improvement in computation time. Interestingly, this control over the accuracy can be achieved without modifying the underlying hardware, differentiating stochastic computing circuits from other approximate computing techniques such as low-precision digital circuits and analog circuits. Compared with the latter, stochastic circuits offer the advantage of seamless compatibility with the state of the art in digital design, as they use standard CMOS logic gates, providing an opportunity for rapid design prototyping and verification using low-cost, commodity FPGAs (e.g. [11, 12]).
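The multiply-and-accumulate scheme described above (d comparator/AND units, a parallel counter, and accumulation over N clock cycles) can be mimicked in software as follows. This is a behavioural sketch, not a description of the actual circuit: the function name `stochastic_dot`, the `log2_n` parameter, and the rescaling by r² for inputs in [0, r] are assumptions made for the example.

```python
import numpy as np

def stochastic_dot(a, b, log2_n, rng, r=1.0):
    """Estimate <a, b> for vectors with entries in [0, r] using N = 2**log2_n cycles."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    d = a.size
    n_cycles = 1 << log2_n
    accumulator = 0  # integer accumulator fed by the parallel counter
    for _ in range(n_cycles):
        # Each of the d units draws two independent uniform samples per cycle.
        bits_a = a > rng.uniform(0.0, r, d)
        bits_b = b > rng.uniform(0.0, r, d)
        # Parallel counter: number of units whose AND gate outputs 1 this cycle.
        accumulator += int(np.count_nonzero(bits_a & bits_b))
    # With N a power of 2, dividing by N is a right shift; the shifted-out
    # low bits are kept here as the fractional part of the result.
    whole = accumulator >> log2_n
    frac = (accumulator & (n_cycles - 1)) / n_cycles
    return r * r * (whole + frac)

rng = np.random.default_rng(0)
a = rng.uniform(0.0, 1.0, 64)
b = rng.uniform(0.0, 1.0, 64)
for log2_n in (4, 8, 12):
    print(1 << log2_n, stochastic_dot(a, b, log2_n, rng), float(a @ b))
```

Changing `log2_n` is the only adjustment needed to trade accuracy for run time, which mirrors the point that the accuracy can be tuned without modifying the hardware.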
As compared with low-precision digital circuits, stochastic circuits provide an extremely area-efficient implementation of basic arithmetic functions, but incur significant overheads in terms of generating the stochastic bit-sequences. These overheads include the additional circuitry required for comparators and random number generators, and may potentially limit the degree of parallelism that may be achieved. Addressing these limitations of stochastic computing circuits is an emerging research topic and a diverse set of solutions has recently come forth. For example, [9] adopts a device-level approach and proposes the use of memristive devices [13] for stochastic bit-sequence generation. In [8], the authors present a parallel stochastic computing architecture that improves the computation speed and accuracy at the cost of increasing the area footprint of the overall system. In an orthogonal approach, [10] proposes the use of low-discrepancy quasi-random sequences for generating stochastic bits and improving the speed of stochastic circuits. To augment these efforts, circuit and architecture-level solutions are needed that optimize the place in the system for generating stochastic bit-sequences (near memory, near caches or near the core), the hardware interfaces that feed data into the stochastic-bit generator, as well as all the components that might be needed to build a complete "Stochastic ALU". Given these open research questions regarding the specifics of the stochastic hardware, we defer further discussion to a future report. Nonetheless, the noise model presented above can be used to assess the impact of stochastic computations on the behavior of machine learning algorithms. The findings of this investigation will not only determine the compatibility of stochastic computing with such applications, but also provide valuable insights that can influence hardware design.
Learning in Presence of Hardware-induced Noise
Machine learning tasks are routinely formulated as an optimization problem with the aim of finding a set of model parameters that minimizes a well-defined cost function. This optimization problem is typically solved using gradient-based first-order techniques. The calculation of the gradient is computationally expensive, and the algorithm may be sped up by offloading the gradient computation onto a stochastic hardware accelerator. The stochastic hardware returns a noise-corrupted version of the gradient. Given this setting, consider the k-th iteration of the noisy batch gradient descent for obtaining $x^*$ that minimizes $f(x): \mathbb{R}^n \to \mathbb{R}$,
$$x_{k+1} = x_k - \alpha_k \tilde{\nabla} f(x_k), \qquad \tilde{\nabla} f(x_k) = \nabla f(x_k) + G_k, \tag{8}$$
where $\tilde{\nabla} f(x_k)$ is an unbiased estimator of the true gradient $\nabla f(x_k)$ and $G_k$ is a vector representing the error introduced by the stochastic hardware. As shown previously, entries of $G_k$ are i.i.d., satisfying
$$E[G_k] = 0, \qquad E\left[\|G_k\|^2\right] \le \frac{\sigma_0^2}{N_k}, \tag{9}$$
where $N_k$ is the length of the stochastic bit-sequence used in the k-th iteration. Note that this formulation of gradient descent does not differ appreciably from that of the classical stochastic gradient descent algorithm [14, 2] or gradient descent with noise-corrupted gradient (Proposition 3 in [15]). These proof techniques are directly applicable to the problem in Eq. (8), and theoretical guarantees for convergence can be achieved by enforcing strict constraints on the permissible learning rate schedules. The learning rate $\alpha_k$ is required to decrease monotonically and satisfy $\sum_k \alpha_k = \infty$ and $\sum_k \alpha_k^2 < \infty$. However, in a practical machine learning setting, approximate optimization is often sufficient [1] and sometimes preferred in order to avoid over-fitting. Given these relaxed constraints, we would like to understand how the hardware-induced error propagates through the successive iterations of gradient descent and, more importantly, discover new methods by which the stochastic hardware can improve computational performance. For further analysis of the algorithm in Eq. (8), we assume that f is l-strongly convex,
$$f(y) \ge f(x) + \nabla f(x)^\top (y - x) + \frac{l}{2}\,\|y - x\|^2 \quad \forall\, x, y, \tag{10}$$
with Lipschitz continuous gradients
$$\|\nabla f(x) - \nabla f(y)\| \le L\,\|x - y\| \quad \forall\, x, y. \tag{11}$$
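The expected distance of the (k+1)-th iterate from the solution $x^*$, conditioned on the previous iterate, can be expressed as
$$E\left[\|x_{k+1} - x^*\|^2 \mid x_k\right] = \|x_k - x^*\|^2 - 2\alpha_k \nabla f(x_k)^\top (x_k - x^*) + \alpha_k^2 \|\nabla f(x_k)\|^2 + \alpha_k^2 E\left[\|G_k\|^2\right]. \tag{12}$$
At this point, we can borrow some well-known results from convex optimization of f(x) [16]:
$$\nabla f(x_k)^\top (x_k - x^*) \ge \frac{lL}{l+L}\,\|x_k - x^*\|^2 + \frac{1}{l+L}\,\|\nabla f(x_k)\|^2, \tag{13}$$
which, together with the noise bound $E[\|G_k\|^2] \le \sigma_0^2 / N_k$, gives for $\alpha_k \le 2/(l+L)$
$$E\left[\|x_{k+1} - x^*\|^2 \mid x_k\right] \le \left(1 - \frac{2\alpha_k lL}{l+L}\right)\|x_k - x^*\|^2 + \frac{\alpha_k^2 \sigma_0^2}{N_k}. \tag{14}$$
Figure 2 (caption fragment): The error shown is obtained after averaging over 100 repetitions. Gradient descent is run for k = 50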
If $\alpha_k$ is fixed at the optimal learning rate for batch gradient descent [16], $\alpha = \frac{2}{l+L}$, and the same bit-sequence length $N_0$ is used in each iteration, Eq. (14) can be simplified to
$$E\left[\|x_{k+1} - x^*\|^2 \mid x_k\right] \le \left(\frac{L-l}{L+l}\right)^2 \|x_k - x^*\|^2 + \frac{4\sigma_0^2}{(l+L)^2 N_0}. \tag{15}$$
Since $\frac{L-l}{L+l} < 1$, the algorithm converges to a random variable with expectation $x^*$ and variance $\sigma_*^2 = \frac{\sigma_0^2}{lLN_0}$. A more intriguing result emerges when we allow $\alpha_k$ to decay exponentially with k, i.e. $\alpha_k = \alpha_0 \eta^k$, where 0 < η < 1, and at the same time let $N_k$ decrease with k as $N_k = 1 + N_0 \gamma^k$, 0 < γ ≤ 1. As a consequence of reducing $N_k$, the stochastic hardware is expected to compute the k-th gradient faster while producing a less accurate estimate. However, the effect of this additional variance is partially mitigated by the exponentially decaying learning rate, as determined by Eq. (14). For the parallel implementation of stochastic computing circuits as shown in Figure 1, it is reasonable to assume that the computation time for the k-th iteration, $t_k$, scales in proportion to $N_k$. To understand the implications of this choice of $\alpha_k$ and $N_k$, we minimize a convex f(x) under different settings of hyperparameters η and γ. The results are shown in Figure 2. Clearly, there exists a γ < 1 that yields, statistically, a similar optimization error as in the case when $N_k$ is kept constant (γ = 1). Furthermore, for any γ < 1 there is an improvement in the total algorithm run time, arising primarily from the reduction in the computation time needed for the iterations that are executed using a smaller $N_k$. Note that this improvement occurs in addition to any acceleration by virtue of offloading the computation onto the stochastic hardware.
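The interplay between the learning-rate schedule and the bit-sequence-length schedule can be explored with a small simulation. The sketch below is not the experiment behind Figure 2; the quadratic test function, the per-entry Gaussian noise with variance σ₀²/N_k, the time model (total time taken as the sum of N_k), and all numerical settings are assumptions chosen for illustration.

```python
import numpy as np

def noisy_gradient_descent(A, b, alpha0, eta, n0, gamma, iters, sigma0, rng):
    """Minimise 0.5*x^T A x - b^T x using gradients corrupted by Gaussian noise."""
    x = np.zeros(b.size)
    x_star = np.linalg.solve(A, b)           # exact minimiser, used only for reporting
    total_time = 0.0
    for k in range(iters):
        alpha_k = alpha0 * eta ** k          # exponentially decaying learning rate
        n_k = 1.0 + n0 * gamma ** k          # schedule N_k = 1 + N_0 * gamma^k
        # Assumed noise model: i.i.d. zero-mean entries with variance sigma0^2 / N_k.
        noise = rng.normal(0.0, sigma0 / np.sqrt(n_k), size=x.size)
        x = x - alpha_k * (A @ x - b + noise)
        total_time += n_k                    # assumed: iteration time proportional to N_k
    return float(np.linalg.norm(x - x_star)), total_time

rng = np.random.default_rng(0)
A = np.diag(np.linspace(1.0, 4.0, 10))       # strong convexity l = 1, Lipschitz constant L = 4
b = np.ones(10)
alpha0 = 2.0 / (1.0 + 4.0)                   # 2 / (l + L)
for gamma in (1.0, 0.95, 0.9):
    runs = [noisy_gradient_descent(A, b, alpha0, 0.98, 1024, gamma, 50, 1.0, rng)
            for _ in range(100)]
    errors, times = zip(*runs)
    print(f"gamma={gamma}: mean error={np.mean(errors):.4f}, "
          f"relative time={times[0] / (50 * 1025):.2f}")
```

Sweeping `gamma` and `eta` in this way shows how much run time can be saved before the final error starts to grow for a given problem.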
Training Neural Networks using Stochastic Hardware
As a demonstration of the stochastic computing techniques developed in Section 2, we consider the problem of training deep neural networks using the back-propagation method. This choice is motivated by the fact that training deep neural networks is computationally demanding, creating the need for efficient hardware acceleration techniques that allow the learning algorithm to scale to large, complex neural network architectures and big training data sets. In addition, the computational complexity and the execution time of the mini-batch stochastic gradient descent (SGD) algorithm typically used for neural network training are dominated by a series of dense GEMM operations in the feed-forward, error back-propagation and weight update calculation steps. Furthermore, mini-batch SGD is inherently a sequential algorithm (only limited benefits can be achieved by model-level and data-level parallelism [17]), and accelerating the dense GEMM operations can immensely improve its computational performance. As a result, mini-batch SGD is particularly well-suited for implementation on stochastic hardware that performs fast, but approximate, GEMM.
We investigate the impact of approximate matrix computations on the classification performance of a deep neural network. We consider the digit classification task on the MNIST dataset. This dataset comprises 60,000 training images and 10,000 test images; each image is 28x28 pixels containing a digit from 0 to 9, and the pixel values are scaled to [0, 1]. In our experiments, the effect of GEMM computation on a stochastic hardware accelerator is modeled by adding a random matrix to the result of a precise computation. Each element of this random matrix is sampled from a Gaussian distribution $\mathcal{N}(0, \sigma^2)$, where $\sigma^2$ is inversely proportional to the length of the stochastic bit-sequence used to represent the numbers. We also modulate the noise variance $\sigma^2$ in accordance with Eqs. (5) and (7). The following functions are assumed to be computed on the stochastic hardware:
1. Forward propagation of the input vector across each layer:
2. Backward propagation of the error vector across each layer:
a. GEMM operation to calculate: $\zeta_l = W_l \delta_{l+1}$
b. Hadamard product for evaluating:
3. Calculation of the update to the weight matrix:
In the notation used above, l indexes the different layers with l = 1 corresponding to the input layer, g is the sigmoid activation function, and $X_{l+1} := g(Y_{l+1})$ is the input to the (l+1)-th layer. $W_l$ is the weight matrix and $B_l$ is the bias vector associated with layer l. We train the neural network using SGD with a mini-batch size of 100 and a cross-entropy objective function. Momentum (p = 0.5) is used to speed up the convergence of gradient descent. We adopt an exponentially decreasing learning rate, scaling it by a factor of 0.99 after every epoch of training.
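The noise-injection model for GEMM and the three offloaded steps listed above can be sketched as follows. This is an illustrative reading, not the simulator used for the experiments. The names `noisy_gemm`, `forward` and `backward` are hypothetical. The per-entry variance is an assumed form that sums |a||b|(r² − |a||b|)/N over the inner dimension and clips the total at zero. The matrix shapes (W_l of size n_l × n_{l+1}, activations stored column-wise per example) are inferred from ζ_l = W_l δ_{l+1}. The Hadamard product is left noise-free here for simplicity.

```python
import numpy as np

def noisy_gemm(A, B, n_bits, rng, r=1.0):
    """Exact product A @ B plus zero-mean Gaussian noise with variance ~ 1/n_bits."""
    exact = A @ B
    # sum_j m_j * (r^2 - m_j) with m_j = |a_ij| * |b_jk|, written as two GEMMs.
    var = (r * r * (np.abs(A) @ np.abs(B)) - (A * A) @ (B * B)) / n_bits
    var = np.clip(var, 0.0, None)            # guard against magnitudes above r
    return exact + rng.normal(size=exact.shape) * np.sqrt(var)

def sigmoid(y):
    return 1.0 / (1.0 + np.exp(-y))

def forward(W, B, X, n_bits, rng):
    """Step 1: X_{l+1} = g(Y_{l+1}), with Y_{l+1} computed by a noisy GEMM."""
    return sigmoid(noisy_gemm(W.T, X, n_bits, rng) + B)

def backward(W, X, delta_next, n_bits, rng):
    """Steps 2 and 3: back-propagated error and weight gradient for one layer."""
    zeta = noisy_gemm(W, delta_next, n_bits, rng)        # 2a: zeta_l = W_l delta_{l+1}
    delta = zeta * X * (1.0 - X)                         # 2b: Hadamard product with g'(Y_l), X = g(Y_l)
    grad_W = noisy_gemm(X, delta_next.T, n_bits, rng)    # 3: weight-update GEMM
    return delta, grad_W

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.1, (400, 400))         # hidden-to-hidden weights
B = np.zeros((400, 1))
X = rng.uniform(0.0, 1.0, (400, 100))        # activations for a mini-batch of 100
X_next = forward(W, B, X, 256, rng)
delta, grad_W = backward(W, X, rng.normal(0.0, 0.01, (400, 100)), 256, rng)
print(X_next.shape, delta.shape, grad_W.shape)
```

Writing the variance as two extra matrix products avoids materialising a three-dimensional tensor of element-wise products, which keeps the simulation cost within a small constant factor of the exact GEMM.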
In the first set of experiments, we construct a neural network with 2 hidden layers, each containing 400 units. The weight matrices for each layer are initialized to random values sampled from $\mathcal{N}(0, 0.1)$. The bias vectors are initialized to 0. The learning rate for the first epoch is set to 2. Weight decay parameter λ = 0.001 is used for all the layers. We train this neural network for 100 epochs under different noise conditions determined by N, the length of the stochastic bit-sequence used for number representation. It is important to note that the parameters described above are kept unchanged while training the network in the presence of noise-corrupted computations. As a benchmark for comparison, we train a 'control' network using precise computations. Figure 3a shows the evolution of the cross-entropy error as the network is trained. We observe a monotonic increase in the cross-entropy training error as N is reduced. However, as shown in Figure 3b, networks trained on stochastic hardware do not suffer any degradation in classification performance as compared with the control network. On the contrary, there seems to be a slight, but noticeable, improvement in the classification accuracy for training in the presence of hardware-induced noise. The control network incorrectly classifies 175 test images (error = 1.75%), whereas the network 'SC-256' yields a test error of 1.62%. This result should not come as a surprise, especially when considered in light of prior work (e.g. [18, 19]) that provides insights into the benefits of noise in neural network training. In [18], Bishop equates training in the presence of input noise to Tikhonov regularization, and similarly in [19], the addition of weight noise during training is shown to improve the neural network's generalization ability and error tolerance.
It is reasonable to expect that training on stochastic hardware also serves to weakly regularize the deep network by virtue of adding noise at several stages of the back-propagation algorithm.
A variety of different techniques have been proposed to further improve the classification accuracy of deep neural networks. These include generative pre-training [20], stochastic regularization through dropout [21], dropconnect [22], stochastic pooling [23], and other architectures for deep networks such as convolutional neural networks. These techniques can be used in conjunction with our approach of training on stochastic hardware. To support this conjecture, we train another set of deep networks in which we initialize the layer weights by performing unsupervised feature learning using stacked sparse autoencoders [24]. This weight pre-training is also performed on the stochastic hardware. The network is then fine-tuned for 50 epochs using mini-batch SGD. The results, shown in Figure 4, are qualitatively similar to the previous case where the weights were initialized randomly. With weight pre-training, the test error for the 'control' network drops to about 1.35%. Again, we notice that networks trained on stochastic hardware achieve slightly lower test error than the 'control' network. From the results shown in Figures 3b and 4b, no clear trend emerges that would dictate the preference for a particular value of N, the length of the stochastic bit-sequence. Rather, we find that there exists a wide range of N which may be used for neural network training without degrading the classification accuracy. We speculate that this range of N depends on the problem at hand as well as the particular choice of network hyperparameters for a given problem. Extending this study to include different datasets and different neural network architectures, and performing a more detailed sensitivity analysis, will perhaps give further insights.
Conclusions
In this paper, we have proposed a framework for acceleration of machine learning algorithms via purpose-built non-deterministic hardware. We have sketched the design of stochastic circuits where numbers are encoded in terms of hardware-instantiated random variables. When standard batch gradient descent procedures for convex optimization of machine learning objective functions run on this stochastic hardware, the noise introduced by computational errors turns these procedures into variations of stochastic gradient descent that are somewhat different from those commonly considered in the literature. In particular, apart from step sizes, the stochastic bit-sequence length offers a complementary knob with which learning algorithms can be orchestrated towards faster convergence. As a proof-of-concept, we have empirically demonstrated that deep learning techniques can be accelerated, with no loss of accuracy, by offloading key back-propagation computations onto stochastic hardware. Extrapolating, we envision the emergence of "big data" frameworks for machine learning based on relaxed, inexact models of computing running on error-embracing components all across the stack, right down to low-level hardware circuitry.
