Training neural networks is a time- and compute-intensive operation, mainly because of the large number of floating point tensor operations required during training. These costs limit the scope of design space exploration (in terms of hyperparameter search) for data scientists and researchers. Recent work has explored reducing the numerical precision used to represent parameters, activations, and gradients during neural network training as a way to reduce the computational cost of training, and thus the training time [1] [2]. In this paper we develop a novel dynamic precision scaling scheme and evaluate its performance against previous work. Using stochastic fixed-point rounding, a quantization-error based scaling scheme, and dynamic bit-widths during training, we achieve 98.8% test accuracy on the MNIST dataset using an average bit-width of just 16 bits for weights and 14 bits for activations. This beats the previous state-of-the-art dynamic bit-width precision scaling algorithm.
Introduction and Motivation
It is well established that neural networks, though ordinarily trained using 32-bit or 64-bit floating point representation, can achieve desirable accuracy during inference with reduced precision weights and activations [3] [4] . In extreme cases these results can be attained with as low as 1-bit precision for weights and activations [5] [6] . These reduced precision networks are amenable to acceleration on custom hardware platforms which can take advantage of lower bit-widths in order to speed up computation [1] [7] . Reduced precision strategies are not typically applied during back-propagation whilst training, as this can lead to heavily reduced accuracy or even non-convergence.
Recent work has shown that dynamic precision scaling, a technique in which the numerical precision used during training is varied on-the-fly as training progresses, can achieve computational speedups (on custom hardware) without hampering accuracy [1] [2] . Dynamic precision scaling uses feedback from the training process to decide on an appropriate number representation. How to appropriately scale the precision used during training is an open-ended question. Na and Mukhopadhyay suggest starting with reduced precision, and increasing precision whenever training becomes numerically unstable, or when training loss stagnates [1] . Other dynamic precision scaling methodologies are easily conceivable (e.g. an epoch based approach), but are yet to be rigorously investigated.
In this paper, we present a novel dynamic precision scaling algorithm that builds on the findings of prior work. We propose to use the stochastic fixed-point rounding method suggested by Gupta et al. [7], the dynamic bit-width representation used by Na and Mukhopadhyay [1], and a novel precision scaling algorithm that leverages information on the quantization error encountered during rounding as a heuristic for scaling the number of fractional bits utilized.
arXiv:1801.08621v1 [cs.LG] 25 Jan 2018
Methodology
This work targets the custom fixed point arithmetic units described by Na and Mukhopadhyay [1] and aims to investigate a novel dynamic precision scaling scheme and its effect on neural network training. In order to perform evaluations, we emulate a dynamic fixed point representation by using custom Caffe layers that quantize/round the native floating point values to values that are legal in our fixed point format. In our study, we consider training a neural network using stochastic gradient descent with dynamically scaled precision for weights, activations, and gradients during both the forward (inference) and backward pass. Figure 3 gives a visual overview of a generalized dynamic precision scaling scheme implemented during training. As per Na and Mukhopadhyay, we quantize weights, biases, activations, and gradients at the appropriate pass through the network, and update the precision on-the-fly during training. 
Fixed Point Representation and Quantization/Rounding
Fixed point numbers are represented by a fractional portion appended to an integer portion, with an implied radix point in between. We allow our fixed point representation to use arbitrary bit-width for both the integer and fractional parts, and represent the bit-width of the integer part as IL and the bit-width of the fractional part as FL. We denote a given fixed point representation, then, as ⟨IL, FL⟩. Dynamic precision scaling modifies IL and FL during training.
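To make the representation concrete, the following minimal Python sketch computes the representable range of a fixed point format and quantizes a value onto its grid. It assumes a signed two's complement convention in which the sign bit is counted in IL, as in Gupta et al. [7]; the function names and the saturating behaviour on overflow are illustrative assumptions and are not taken from our Caffe layers.

```python
import math


def fixed_point_range(il, fl):
    """Return (lowest, highest, step) for a signed <IL, FL> format.

    Assumption: two's complement, with the sign bit counted in IL.
    """
    step = 2.0 ** -fl                 # spacing of the grid, 2^-FL
    lowest = -(2.0 ** (il - 1))
    highest = 2.0 ** (il - 1) - step
    return lowest, highest, step


def quantize_nearest(x, il, fl):
    """Round x to the nearest grid point, saturating on overflow."""
    lowest, highest, step = fixed_point_range(il, fl)
    q = math.floor(x / step + 0.5) * step   # round half up onto the grid
    return min(max(q, lowest), highest)     # clamp to the legal range


if __name__ == "__main__":
    for il, fl in [(5, 5), (2, 8), (2, 2)]:
        print((il, fl), fixed_point_range(il, fl), quantize_nearest(1.5, il, fl))
```

Under this convention the value 1.5 is exactly representable in ⟨5, 5⟩, ⟨2, 8⟩, and ⟨2, 2⟩, while a value such as 3.0 saturates to 1.75 in ⟨2, 2⟩.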
[Figure: Round-to-nearest equation; stochastic rounding equation.]
Inspired by Gupta et al., we use stochastic rounding during quantization of floating point values to ⟨IL, FL⟩ [7]. The naive round-to-nearest algorithm is a biased rounding scheme. To overcome this, we use the stochastic rounding scheme, because it is unbiased, i.e. E(Round(x, ⟨IL, FL⟩)) = x.
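The unbiasedness property can be illustrated with a short Python sketch of stochastic rounding as formulated by Gupta et al. [7]: with ε = 2^(−FL) and ⌊x⌋ the largest multiple of ε not exceeding x, the value is rounded up to ⌊x⌋ + ε with probability (x − ⌊x⌋)/ε and down to ⌊x⌋ otherwise. The function name and the `rng` parameter are illustrative assumptions, and saturation at the range limits is omitted to keep the example focused on rounding.

```python
import math
import random


def stochastic_round(x, fl, rng=random):
    """Stochastically round x to a multiple of 2^-FL (no saturation)."""
    step = 2.0 ** -fl
    lower = math.floor(x / step) * step   # largest grid point <= x
    p_up = (x - lower) / step             # probability of rounding up
    return lower + step if rng.random() < p_up else lower


if __name__ == "__main__":
    rng = random.Random(0)                # seeded for repeatability
    samples = [stochastic_round(0.3, 2, rng) for _ in range(20000)]
    # Each sample is 0.25 or 0.5; their mean approaches 0.3.
    print(sorted(set(samples)), sum(samples) / len(samples))
```

Round-to-nearest would map 0.3 to 0.25 every time, a systematic error of 0.05, whereas the stochastic scheme recovers 0.3 in expectation.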
Figure 2: The same number (1.5 decimal) represented in ⟨5, 5⟩, ⟨2, 8⟩, and ⟨2, 2⟩ fixed point format. Scaling the ⟨5, 5⟩ representation to ⟨2, 8⟩ is consistent with fixed bit-width, dynamic radix precision scaling. Dynamic bit-width, dynamic radix is a more flexible scheme that would allow scaling to the ⟨2, 2⟩ representation. Dynamic precision scaling tries to find the smallest bit-width at which training progresses.
Prior works employ either a fixed bit-width, dynamic radix scheme [2] [8] [9] in which IL + FL = N, where N is some fixed integer (usually 16), or a dynamic bit-width, dynamic radix scheme in which IL and FL are free to vary independently [1]. Note that in a fixed bit-width scheme, IL and FL are inter-dependent, as increasing one necessitates a decrease in the other.
Dynamic Precision Scaling
Our work proposes a dynamic precision scaling algorithm that builds on prior work in an attempt to minimize bit-width during training whilst still achieving adequate accuracy. Algorithm 1 shows our training algorithm with a call to a generalized dynamic precision scaling method. Algorithm 2 specifies our dynamic precision scaling algorithm, in which our contribution is to use the average quantization error percentage of the fixed point rounding as the metric by which to scale the number of fractional bits.
The dynamic precision scaling algorithm presented is aggressive compared to some prior work [9] , as it attempts to reduce the bit-width whenever the number of overflows or the quantization error is lower than their respective thresholds.
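The scaling rule described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the exact Algorithm 2: the increment of IL when the overflow rate R exceeds R_max is taken from the text, while the decrement branches, the analogous rule for FL driven by the quantization error E, the lower bounds on IL and FL, and all function names are assumptions inferred from the description. Rates are expressed as fractions, so 0.01% is written as 1e-4.

```python
def overflow_rate(values, il, fl):
    """Fraction of values outside the signed <IL, FL> range (assumed convention)."""
    lowest = -(2.0 ** (il - 1))
    highest = 2.0 ** (il - 1) - 2.0 ** -fl
    return sum(1 for v in values if v < lowest or v > highest) / len(values)


def mean_quantization_error(values, quantized):
    """Average relative error |x - q| / |x| over the nonzero inputs."""
    errors = [abs(x - q) / abs(x) for x, q in zip(values, quantized) if x != 0.0]
    return sum(errors) / len(errors) if errors else 0.0


def scale_precision(il, fl, r, e, r_max=1e-4, e_max=1e-4):
    """Return the updated (IL, FL) given overflow rate r and error e."""
    # Integer bits follow the overflow rate.
    il = il + 1 if r > r_max else max(1, il - 1)
    # Fractional bits follow the average quantization error.
    fl = fl + 1 if e > e_max else max(0, fl - 1)
    return il, fl


if __name__ == "__main__":
    print(scale_precision(4, 8, r=0.01, e=0.0))   # overflow high, error low
    print(scale_precision(4, 8, r=0.0, e=0.01))   # overflow low, error high
```

Because both branches shrink the format whenever the corresponding metric is below its threshold, a rule of this shape is aggressive in the sense described above: the bit-width only grows in response to observed overflow or quantization error.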
Related Work
To relate our work to prior research in this area, we compare our methodology to the work of Na and Mukhopadhyay, who published results for dynamic precision scaling during training, specifically targeting their custom flexible multiply-accumulate unit that can speed up reduced precision arithmetic [1]. Further comparison can be made to the work of Courbariaux et al., who investigate an overflow-based dynamic fixed point precision scaling technique with fixed bit-width during neural network training [2]. Finally, inspiration can be drawn from Gupta et al., who study the effect of limited precision data representation and computation on neural network training [7]. By using stochastic rounding instead of round-to-nearest for fixed point rounding, they achieve far superior accuracy and convergence rate. Since our work builds on the work of Na and Mukhopadhyay, Courbariaux et al., and Gupta et al., we discuss the contributions of each, as well as how they differ from our implementation.
The work of Na and Mukhopadhyay investigates training based on a dynamic precision scaling algorithm that uses a maximum bit-width, ml, target bit width, tl, and a unit bit step s. They use these parameters to scale the precision as training becomes stagnant, by adding the unit bit step to the target bit-width at the detection of stagnant training. The goal of their work was not to find the optimal precision, but to find a precision that is good enough to offer speedup on their novel multiply-accumulate unit without hindering the convergence of training.

Algorithm 1 (excerpt): Training with dynamic precision scaling
    …
    for i ∈ max_iter
        forward_pass();
        calculate_loss();
        backward_pass(loss);
        scale_precision(IL, FL);    ▷ See Algorithm 2
    (forward_pass):
        for layer ∈ layers
        …

Algorithm 2 (excerpt): Dynamic precision scaling
    if R > R_max:
        IL ← IL + 1
    …
    End
The work of Courbariaux et al. explores dynamic fixed point precision scaling by implementing a fixed bit-width, dynamic radix based method that uses a greedy algorithm which favors the precision of the fractional part of the fixed point number. The algorithm uses the following components: matrix overflow rate, maximum overflow rate and a scaling factor. It is implemented as follows: if the matrix overflow rate is greater than the maximum overflow rate, shift the radix point right by one. Otherwise, if twice the matrix overflow rate is less than or equal to the maximum overflow rate (i.e., there is 'headroom' in the integer part), shift the radix point left by one. If neither criterion is met, leave the scaling factor alone.
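The update rule just described can be written as a short Python sketch. It assumes that shifting the radix point right means trading one fractional bit for one integer bit (and left the reverse), so that IL + FL stays constant; the function name and the guards that keep IL ≥ 1 and FL ≥ 0 are illustrative assumptions, not details taken from Courbariaux et al. [2].

```python
def update_radix(il, fl, overflow_rate, max_overflow_rate):
    """Fixed bit-width, dynamic radix update; IL + FL is preserved."""
    if overflow_rate > max_overflow_rate and fl > 0:
        # Too many overflows: give one bit to the integer part.
        return il + 1, fl - 1
    if 2 * overflow_rate <= max_overflow_rate and il > 1:
        # Headroom in the integer part: give one bit to the fraction.
        return il - 1, fl + 1
    # Neither criterion met: leave the format alone.
    return il, fl


if __name__ == "__main__":
    print(update_radix(8, 8, 0.02, 0.01))    # overflow rate above the maximum
    print(update_radix(8, 8, 0.004, 0.01))   # plenty of headroom
    print(update_radix(8, 8, 0.008, 0.01))   # in between, unchanged
```

The factor of two in the second test provides hysteresis, so the radix does not oscillate when the overflow rate sits just below the maximum.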
Finally, the work of Gupta et al. implements a fixed precision method to train their model, while comparing the effects of using the round-to-nearest method versus the stochastic rounding method. They allocate a 16-bit word length, and train the model with various (fixed) IL and FL components of the word length. Combinations such as ⟨8, 8⟩, ⟨10, 6⟩ and ⟨14, 2⟩ are used. Gupta et al. also demonstrate negligible overhead of incorporating stochastic rounding within matrix multiplication, and show better accuracy compared to the round-to-nearest method.
Our implementation draws key contributions from these works and pieces them together to explore a novel scheme for reduced precision training. Table 1 gives a comprehensive comparison of each related work with the work done in this paper. Firstly, our methodology differs from that of Courbariaux et al. in that we explore a dynamic bit-width, dynamic radix methodology, while they use a fixed bit-width, dynamic radix approach. The work done by Gupta et al. uses a fixed bit-width, fixed radix approach with no dynamic precision scaling, while the work done in this paper uses an overflow and quantization error based approach to scaling. Finally, the work done by Na and Mukhopadhyay uses a convergence based scaling approach together with round-to-nearest rounding, while this paper explores stochastic rounding.
Table 1: Authors | Fixed point format (bit width, radix) | Scaling | Rounding | Precision Granularity
Na et al. [1] (
Evaluation
Due to the short timescale of this project and the lengthy process of tuning hyperparameters and training networks, we focus our evaluation of dynamic precision scaling on the simple LeNet network [10] trained on the MNIST dataset.
We train using Caffe with our custom rounding layers and dynamic precision scaling algorithm. We use a batch size of 64, and train for 10,000 iterations. We use an initial learning rate of 0.01, momentum of 0.9, a weight decay factor of 0.0005, and scale the learning rate using lr = lr_init * (1 + γ * iter)^(−pow), where γ = 0.0001 and pow = 0.75. We update IL and FL once each iteration, using E_max = R_max = 0.01%.
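As a worked illustration of the learning rate schedule, the Python sketch below evaluates the formula above at a few iterations. The formula has the same form as the "inv" learning rate policy in Caffe; the function name and default arguments are illustrative and simply mirror the hyperparameters listed in this section.

```python
def inv_learning_rate(iteration, lr_init=0.01, gamma=0.0001, power=0.75):
    """lr = lr_init * (1 + gamma * iter)^(-power)."""
    return lr_init * (1.0 + gamma * iteration) ** (-power)


if __name__ == "__main__":
    for iteration in (0, 1000, 5000, 10000):
        print(iteration, inv_learning_rate(iteration))
```

With these settings the learning rate decays smoothly from 0.01 at the first iteration to roughly 0.0059 after 10,000 iterations.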
We compare our results to a baseline network trained on the same dataset with the same hyperparameters, but using full-precision floating point for all attributes.
Our results reveal that we can achieve accuracy on par with the baseline, whilst drastically reducing the bit-width used for both weights and activations. Our dynamic precision scaling algorithm in general, however, does not reduce the gradient bit-width very much, as gradients require the most precision in order for training to converge.
Figure 3: The bit-width used for weights and activations is greatly reduced from the baseline of 32 bits.
Note that naively reducing the bit-width of weights and activations to a fixed 13 bits with no dynamic precision scaling results in the training process failing to converge. With dynamic precision scaling, however, 13-bit weights and activations are sufficient early in the training process.
Limitations
Due to the limitations of the specialized MAC unit that Na and Mukhopadhyay have implemented, the smallest number we can represent is 2^(−FL), whereas other schemes such as FlexPoint [9] can accommodate a much wider range of precisions using the same bit-width by leveraging a shared external exponent/scaling factor. Using this type of representation, we believe we would be able to achieve even lower bit-width, as we expect that the majority of fractional bits are used just to eliminate rounding-to-zero situations.
Unfortunately, we also cannot verify the accuracy of our algorithm with large data sets due to lack of GPU resources and the short timescale of the project.
Further, in our experimentation, we found that E_max and R_max essentially act as new hyperparameters that require tuning in order to achieve good results. These parameters control how aggressive the algorithm is in scaling back the bit-width used during training. If the algorithm is too aggressive in trying to reduce the bit-width, training can fail to converge.
We believe that there is further work to be done in finding a better dynamic precision scaling algorithm, as there is more potential speedup to be obtained. 
Conclusion
The goal of our experiments was to investigate the effects of a new dynamic precision scaling scheme on training convergence compared with a floating point baseline. We were able to show that, by using our novel dynamic precision scaling algorithm, the network achieves the same training and test accuracy as the baseline, within a small margin. Although this may not have a direct impact when implemented in software, the takeaway is that with specialized hardware, such as the multiply-accumulate unit implemented by Na and Mukhopadhyay, we can expect a direct speedup in training time, as well as a cost and energy reduction, since training can converge with the same accuracy as the baseline while using fewer bits.
Using stochastic rounding and a novel overflow and quantization error based scaling scheme, our algorithm also achieves convergence using lower bit-width for parameters and activations than Na and Mukhopadhyay's implementation, meaning we expect a larger speedup in hardware.
