DNNs are ubiquitous datacenter workloads, requiring orders of magnitude more computing power from servers than traditional workloads. As such, datacenter operators are forced to adopt domain-specific accelerators that employ half-precision floating-point (FP) numeric representations to improve arithmetic density. Unfortunately, even these representations are not dense enough, and are, therefore, sub-optimal for DNNs. We propose a hybrid approach that employs dense block floating-point (BFP) arithmetic on dot product computations and FP arithmetic elsewhere. Using BFP improves the performance of dot product operations, which compose most of DNN computations, while allowing values to float freely between dot product operations leads to a better choice of tensor exponents when converting values back to BFP. We show that models trained with hybrid BFP-FP arithmetic either match or outperform their FP32 counterparts, leading to more compact models and denser arithmetic in computing platforms.
Introduction
Today's ubiquitous online services are often driven by DNNs to provide custom-tailored user content. Delivering faster inference and more accurate training, however, is often limited by the arithmetic density of the underlying hardware platform. Most users resort to graphics processing units (GPUs) as the platform of choice for training neural networks because GPUs offer high arithmetic density per silicon area through full precision floating-point (FP32) units. However, even traditional GPUs have proved not to have dense enough arithmetic for DNNs, and accelerators are moving to narrow arithmetic to improve logic density. For instance, NVIDIA's Volta (nvi, 2018) and Google's TPU2 architectures employ half-precision floating point (FP16) arithmetic.
Unfortunately, optimizing floating-point logic, even narrow FP16 logic, has been a daunting task for device designers. Sequential implementation of floating-point logic is quite slow, and parallelizing the logic is prohibitively resource intensive, compromising density. A promising solution to this problem is to utilize fixed-point arithmetic, which promises great gains in both speed and density; unfortunately, training with fixed-point networks has been unsuccessful to this point due to the lack of dynamic range inherent in the fixed-point representation.
Signal processing platforms have historically resorted to block floating-point (BFP), whose representation is shown in Figure 1, as a way to optimize for both performance and density. The use of BFP has allowed signal processors to map common algorithms (e.g., FFT) onto dense and parallel integer arithmetic hardware. We observe that BFP is also likely to be effective in neural networks, increasing the arithmetic density of accelerators and improving the dynamic range of fixed-point-like arithmetic, taking the first step towards effective training in dense arithmetic. Naive application of BFP to DNN training, however, is not straightforward. Tensor values often drift during training, requiring a new choice of exponents (or quantization points).
In this paper, we make the observation that in DNNs, the majority of the arithmetic operations executed are performed as part of dot product calculations, and therefore, limiting dense fixed-point-like arithmetic to only replacing the dot products still allows us to accelerate the majority of the network. As such, the rest of the operations can be implemented in traditional floating-point logic with little performance degradation. We propose a hybrid BFP-FP framework that computes dot products in BFP and performs the rest of the training in traditional floating-point arithmetic. Values float freely between dot product computations, resulting in a better choice of exponents. Hybrid BFP-FP training also makes the underlying hardware more friendly to users, who can use complex arithmetic, undisturbed by limitations imposed by BFP implementations.
The separation between dot products and other operations already exists in commodity hardware in NVIDIA Volta's FP16 Tensor Cores (nvi, 2018) and in Google's inference-only, fixed-point based accelerator, the Tensor Processing Unit (Jouppi et al., 2017).
Training with half-precision floating-point. Half-precision floating-point (FP16) (Dally, 2015) is quickly becoming the state-of-the-art for neural network training, with both Google's TPU2 (goo, 2017) and NVIDIA's Volta (nvi, 2018) GPUs adopting half-precision floating-point as their arithmetic representation. However, FP16 suffers from limited range, and it often requires weights and gradients to be scaled in order to converge. Also, FP16 incurs larger area and power requirements in hardware. BFP solves these problems by sharing exponents across matrices, enabling the usage of exponents with large bit-widths with little communication overhead, preserving large dynamic range and obviating gradient scaling techniques.
Specialized Arithmetic for DNNs
Due to the massive computational requirements of DNNs when employed in datacenter-scale online services, operators such as Google started adopting specialized numeric representations for DNNs. So far, accelerators have employed fixed-point for inference (Jouppi et al., 2017), and narrow floating-point representations, such as FP16 (goo, 2017; nvi, 2018), for training. From a hardware design perspective, the use of reduced precision arithmetic allows silicon designers to improve logic density and energy-efficiency, while minimizing the number of bits used to represent models relaxes demands on both memory capacity and bandwidth. From the user's perspective, arithmetic representations must be usable, neither resulting in accuracy deterioration for models nor requiring novel algorithmic techniques to recover model performance.
FP32 representations are usable but inefficient. They represent numbers with a 24-bit wide mantissa and an 8-bit wide exponent. In terms of precision, the 24-bit mantissa used in FP32 is overkill for DNNs. Figure 2a shows the training loss and Table 1 shows the test error of a ResNet-20 model trained on CIFAR-10 with truncated mantissas. The model converges even with a 4-bit mantissa, achieving best performance with 16-bit mantissas, and failing to converge only with 1-bit mantissas. In contrast, while the 8-bit exponent provides an appropriate dynamic range as shown in Figure 2b, training already suffers with a 6-bit exponent, and completely fails to converge with a 2-bit exponent. Silicon implementations of the mantissa-exponent encoding normalize the output mantissa of every operation. Normalization is implemented by a shifter in silicon, an expensive hardware structure in terms of area and power.
Using FP16 mitigates the area issues of FP32, employing narrow 11-bit mantissas and 5-bit exponents. However, FP16 is still expensive compared to fixed-point logic. For instance, although the area of an FP16 multiplier is 4.7× smaller than that of an FP32 multiplier in a 45nm manufacturing process node (Dally, 2015), it is 13× larger than its 8-bit fixed-point counterpart. FP16 is also notoriously difficult to use, as the 5-bit exponent results in a narrow dynamic range that is not sufficient to represent gradients throughout the training process. As such, from a usability perspective, the numeric representation must have wide dynamic range. Dynamic range is important during the training process because, as the loss value decreases, the gradient values also decrease.
Given these requirements, we identify block floating-point (BFP) as the ideal numeric representation for DNNs. BFP represents numbers with a mantissa and exponent, like floating-point, but exponents are shared across entire tensors, as shown in Figure 1, resulting in dot products that can be computed entirely in fixed-point logic. Since over 99% of the arithmetic operations executed by DNN training and inference are dot product computations, we are able to fold almost all of the DNNs' computations into fixed-point logic.
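The mantissa-truncation experiment can be mimicked in software by masking the low-order bits of an FP32 encoding. The following is a minimal sketch under stated assumptions: the helper name `truncate_mantissa` is hypothetical, the width counts the implicit leading one (so FP32 itself is 24 bits), and bits are dropped by truncation rather than rounding. None of these choices is taken from the experiments above.

```python
import struct


def truncate_mantissa(x: float, mantissa_bits: int) -> float:
    """Keep only the top `mantissa_bits` significand bits of x as an FP32 value.

    Assumption: `mantissa_bits` includes the implicit leading one, and the
    dropped bits are simply cleared (round toward zero). Subnormal inputs
    have no implicit one, so they keep one bit fewer than requested.
    """
    if not 1 <= mantissa_bits <= 24:
        raise ValueError("mantissa_bits must be in [1, 24]")
    # Reinterpret the FP32 encoding of x as a 32-bit unsigned integer.
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    # FP32 stores 23 explicit fraction bits; clear the low ones we drop.
    dropped = 24 - mantissa_bits
    mask = (0xFFFFFFFF << dropped) & 0xFFFFFFFF
    (y,) = struct.unpack("<f", struct.pack("<I", bits & mask))
    return y
```

For example, 1.75 is 1.11 in binary, so keeping two significand bits leaves 1.1 in binary, that is, 1.5.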
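The dynamic range problem can be illustrated with Python's standard `struct` module, whose `e` format rounds a value to IEEE half precision. This is only an illustrative sketch: the gradient magnitude and the scaling factor are hypothetical values chosen to show underflow, not numbers from our experiments.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


small_gradient = 1e-8   # hypothetical gradient magnitude late in training
loss_scale = 1024.0     # hypothetical scaling factor

# Below the smallest FP16 subnormal (2**-24), the value is flushed to zero.
flushed = to_fp16(small_gradient)
# Scaling first moves the value into the representable FP16 range.
scaled = to_fp16(small_gradient * loss_scale)
# Unscaling in a wider format recovers an approximation of the gradient.
recovered = scaled / loss_scale
```

The unscaled gradient is lost entirely, while the scaled one survives with a small relative error. This is the kind of manual scaling that a wider, shared exponent is meant to make unnecessary.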
DNN Training using BFP Arithmetic
Using BFP in DNNs computation
Equation (1) computes the real value $a_i$ of an element $i$ of a BFP tensor $a$ from its mantissa and the shared exponent $e_a$.
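The two relations referred to in this subsection are not reproduced in the text, so the block below is a hedged reconstruction from the surrounding definitions rather than a verbatim copy. The mantissa symbol $m^a_i$ is an assumed notation; $e_a$, $e_b$ and $N$ follow the text.

```latex
% Real value of element i of a BFP tensor a with shared exponent e_a
% (m^a_i is an assumed symbol for the fixed-point mantissa).
a_i = m^a_i \cdot 2^{e_a}

% Dot product of two N-element BFP tensors: the shared exponents factor
% out of the sum, leaving only fixed-point multiply-accumulates.
a \cdot b = \sum_{i=1}^{N} \left(m^a_i \, 2^{e_a}\right)\left(m^b_i \, 2^{e_b}\right)
          = 2^{e_a + e_b} \sum_{i=1}^{N} m^a_i \, m^b_i
```

Under this reading, the only floating-point-like work in a dot product is a single exponent addition, which is why the bulk of the computation can run on integer hardware.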
Equation (2) calculates the dot product between BFP tensors $a$ and $b$, each with $N$ elements.
[...] the average value of the tensor can be represented by a minimum number of bits, and is a compromise between the two aforementioned policies. Figure 4 illustrates the ranges of values lost by each policy.
We evaluated two rounding policies: round-to-nearest and stochastic rounding (Gupta et al., 2015). Round-to-nearest (determ) deterministically rounds numbers to the nearest representable value, while stochastic rounding (stoc) rounds numbers up or down at random, with a probability that depends on the remainder of the number. We will show that rounding policies play a larger role when operating with narrow mantissas.
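The two rounding policies can be sketched as follows for a value already scaled to the mantissa grid. The function names and the tie-breaking rule of the deterministic variant are assumptions of this sketch. The stochastic variant follows the common formulation in which the probability of rounding up equals the fractional remainder.

```python
import math
import random


def round_nearest(x: float) -> int:
    """Deterministic rounding to the nearest integer (ties round up, by assumption)."""
    return math.floor(x + 0.5)


def round_stochastic(x: float, rng: random.Random) -> int:
    """Round x up with probability equal to its fractional remainder."""
    low = math.floor(x)
    remainder = x - low  # lies in [0, 1)
    return low + (1 if rng.random() < remainder else 0)


# Usage: averaged over many draws, stochastic rounding is unbiased,
# whereas round-to-nearest always maps 0.25 to 0.
rng = random.Random(0)
samples = [round_stochastic(0.25, rng) for _ in range(10_000)]
mean_of_samples = sum(samples) / len(samples)
```

The average of the stochastic samples stays close to 0.25, which suggests why the policy matters more as mantissas get narrower and individual rounding errors get larger.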
FPGA BFP prototype
To illustrate the area trade-offs of hybrid BFP-FP accelerators, we synthesized a proof-of-concept accelerator, shown in Figure 5. We implemented the basic operations needed for neural network training (i.e., matrix multiplication, transpose, convolutions and data movement operations) using a dataflow similar to that of Chen et al. (2016).
The matrix multiplication unit employs a 75×75 systolic array of multiply-accumulate (MAC) units that feed a 75-wide activation/loss unit. The matrix multiplication unit operates on BFP values and the other units operate on a custom floating-point representation that features a 10-bit exponent and an 8-bit mantissa. In steady state, the matrix multiplication unit computes 75 dot products per cycle, each taking 75-wide tensors as inputs.
The FP-to-BFP units convert tensors by detecting the maximum exponent of the input FP tensors and normalizing the mantissas accordingly, while the BFP-to-FP unit normalizes the mantissas according to the single given exponent. The activation/loss and the conversion units are capable of processing a single 75-wide tensor per cycle. Weights are kept in BFP throughout the entire training process and during inference.
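The conversion units can be modelled in a few lines of software. The sketch below is an assumption-laden illustration, not the hardware design: the names `fp_to_bfp`, `bfp_to_fp` and `bfp_dot` are hypothetical, mantissas are signed integers of `mantissa_bits` bits, and rounding uses Python's built-in `round` (nearest, ties to even).

```python
import math


def fp_to_bfp(values, mantissa_bits=8):
    """Convert floats to (integer mantissas, shared exponent).

    The shared exponent is derived from the largest magnitude in the block,
    mirroring the maximum-exponent detection described in the text.
    """
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return [0] * len(values), 0
    # frexp returns max_abs = f * 2**max_exp with 0.5 <= f < 1.
    _, max_exp = math.frexp(max_abs)
    # One bit is reserved for the sign, the rest hold the magnitude.
    shared_exp = max_exp - (mantissa_bits - 1)
    limit = 2 ** (mantissa_bits - 1) - 1
    mantissas = []
    for v in values:
        m = round(math.ldexp(v, -shared_exp))         # align to the shared exponent
        mantissas.append(max(-limit, min(limit, m)))  # saturate on overflow
    return mantissas, shared_exp


def bfp_to_fp(mantissas, shared_exp):
    """Rebuild floats from integer mantissas and the single shared exponent."""
    return [math.ldexp(float(m), shared_exp) for m in mantissas]


def bfp_dot(a, b):
    """Dot product of two BFP blocks using integer multiply-accumulates."""
    (ma, ea), (mb, eb) = a, b
    acc = sum(x * y for x, y in zip(ma, mb))  # integer arithmetic only
    return math.ldexp(float(acc), ea + eb)    # apply both exponents once


a = fp_to_bfp([0.5, -0.25, 0.125])
b = fp_to_bfp([1.0, 2.0, 4.0])
result = bfp_dot(a, b)
```

The inputs in this usage example are powers of two, so the conversion is exact and `result` equals the floating-point dot product. Values far smaller than the block maximum would instead lose low-order bits or round to zero.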
We synthesized the accelerator in a Stratix V 5SGSD5 FPGA at a clock rate of 200MHz. We achieve a maximum throughput of 1 TOp/s when using 8-bit wide MACs in the matrix-multiplier with FP activations, the FP-to-BFP and the BFP-to-FP conversion units occupying less than 10% of the FPGA resources. This is an 8.5× throughput improvement over a variant of the accelerator that employs FP16 MAC units synthesized on the same FPGA.
Figure 5. Hybrid BFP-FP accelerator with BFP
Methodology

Implementation
We train DNNs with the hybrid approach, using BFP in the compute-intensive operations (matrix multiplications, convolutions) and FP32 in the other operations. We modified TensorFlow's (Abadi et al., 2016) matrix multiplications and convolution operations to reproduce the behaviour of BFP matrix multipliers in both the forward and backward passes.
We used TensorFlow's defun function to create a new op that processes the inputs and outputs of both the forward and backward passes of another TensorFlow op, to simulate the usage of BFP. In the forward pass, shown in Figure 6a, we convert both inputs (x and w) to BFP, giving the x tensor one exponent per training input and the w tensor one exponent per matrix. Then we execute the target operation with native floating-point arithmetic, and saturate the outputs of the original op, to simulate the saturation that occurs in fixed-point matrix multipliers. In the backward pass, we perform the same pre-/post-processing of the inputs/outputs of the x derivative (Figure 6b), but handle the w derivative differently (Figure 6c) since it performs a reduction across entire batches. Thus, to emulate the behavior of an accelerator with native BFP, we convert inputs to BFP tensors that share exponents across the entire batch. Finally, we re-align weights and their gradients during updates to simulate the update of weights stored in BFP.
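The forward-pass emulation can be sketched in plain Python, independent of any framework. This is a simplified illustration under stated assumptions, not our TensorFlow op: `quantize_block` and `emulated_bfp_matmul` are hypothetical names, rounding is round-to-nearest, and the saturation of the outputs is omitted for brevity.

```python
import math


def quantize_block(values, mantissa_bits):
    """Snap floats to the BFP grid set by the block's largest magnitude."""
    max_abs = max((abs(v) for v in values), default=0.0)
    if max_abs == 0.0:
        return list(values)
    _, max_exp = math.frexp(max_abs)
    # Spacing between representable values for this shared exponent.
    step = math.ldexp(1.0, max_exp - (mantissa_bits - 1))
    limit = 2 ** (mantissa_bits - 1) - 1
    return [max(-limit, min(limit, round(v / step))) * step for v in values]


def emulated_bfp_matmul(x, w, mantissa_bits=8):
    """Emulate x @ w with BFP inputs and a floating-point product.

    x is a batch of rows and gets one exponent per row (per training input);
    w gets a single exponent for the whole matrix.
    """
    xq = [quantize_block(row, mantissa_bits) for row in x]
    cols = len(w[0])
    flat = quantize_block([v for row in w for v in row], mantissa_bits)
    wq = [flat[r * cols:(r + 1) * cols] for r in range(len(w))]
    # The product itself runs in ordinary floating point, as in the emulation.
    return [[sum(xq[i][k] * wq[k][j] for k in range(len(wq)))
             for j in range(cols)]
            for i in range(len(xq))]


x = [[0.5, -0.25], [8.0, 4.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
y = emulated_bfp_matmul(x, identity)
```

Because each row of `x` carries its own exponent, the small first row is not crushed by the large second row, and `y` reproduces `x` exactly. With a single exponent for the whole batch, the first row would lose precision instead.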
Using defun enables us to evaluate the impact of the hybrid approach on training quality without building and integrating a full-blown distributed BFP accelerator into a machine learning framework. It also enables us to take advantage of the highly optimized GPU kernels already available for all the different varieties of convolution and fully-connected layers.
Evaluation Setup
Datasets. We experiment with a set of popular image classification tasks.
• CIFAR-10 and CIFAR-100 (Krizhevsky, 2009). Each consists of a training set of size 50K and a test set of size 10K. Instances are 32 × 32 color images representing 10 or 100 classes. We adopt a standard data augmentation scheme (Huang et al., 2016), randomly cropping and flipping the images. For preprocessing, we normalize the data using the channel means and standard deviations.
Note that we use a model trained on CIFAR-10 to explore the design space of block floating-point implementations, and report the overall performance of BFP on the more challenging CIFAR-100.
• The SVHN (Netzer et al., 2011)
Evaluation Metric. To evaluate the impact of BFP, we tune the models using only FP32, and then train the same models from scratch with the same hyper-parameters in BFP. We report training loss and best top-1 error.
Training. We train CIFAR-10/CIFAR-100 with ResNet and WideResNet (Zagoruyko & Komodakis, 2016), and SVHN (Netzer et al., 2011) with ResNet, using various configurations of BFP.
Our models are trained by momentum SGD with a minibatch size of 128. We use a weight decay of 1e-4 and momentum of 0.9 for our datasets. We trained models on CIFAR-10 and CIFAR-100 for 250 epochs starting with a learning rate of 0.1, and dividing it by 10 at 32K, 48K and 64K iterations. We trained the SVHN models for 160 epochs, starting from an initial learning rate of 0.01, and dividing it by 10 at epochs 80 and 120.
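The CIFAR schedule and optimizer settings above can be written down as a short sketch. The function names are hypothetical, and two details are assumptions rather than facts from our setup: the learning rate drops at exactly the boundary iteration, and weight decay is applied as an L2 term added to the gradient.

```python
def step_learning_rate(iteration, base_lr=0.1,
                       boundaries=(32_000, 48_000, 64_000)):
    """Divide the learning rate by 10 at each boundary iteration passed."""
    drops = sum(1 for b in boundaries if iteration >= b)
    return base_lr / (10 ** drops)


def sgd_momentum_step(w, grad, velocity, lr, momentum=0.9, weight_decay=1e-4):
    """One scalar momentum-SGD update with L2 weight decay (assumed form)."""
    g = grad + weight_decay * w          # add the weight decay term
    velocity = momentum * velocity + g   # accumulate momentum
    return w - lr * velocity, velocity


# Usage: one update at the start of training.
lr0 = step_learning_rate(0)
w_new, v_new = sgd_momentum_step(w=1.0, grad=0.5, velocity=0.0, lr=lr0)
```

The SVHN schedule follows the same pattern with a base rate of 0.01 and boundaries expressed in epochs rather than iterations.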
Evaluation
We now evaluate DNN training with the hybrid approach, which we refer to as BFP for simplicity, comparing it to FP32-based training. We start with a BFP design space exploration, where we train a ResNet-20 model on CIFAR-10 to explore the different choices of exponent range and rounding policy, as well as various mantissa bit-widths. Then we compare BFP- with FP32-based training for more challenging models on CIFAR-100 and SVHN. Our evaluation intends to show that BFP can be used as a drop-in replace-
