Deep neural networks (DNNs) have proven to be effective predictive models across various domains, e.g. natural language processing, computer vision, and genomics. However, modern DNNs demand substantial compute and memory resources for executing any reasonably complex task. To reduce the inference time and power consumption of these networks, DNN accelerators with low-precision representations of data and DNN parameters are being actively studied. An interesting research question is how low-precision networks can be ported to edge devices with performance similar to that of high-precision networks. In this work, we employ the fixed-point, floating point, and posit numerical formats at ≤8-bit precision within a DNN accelerator, Deep Positron, with exact multiply-and-accumulate (EMAC) units for inference. A unified analysis quantifies the trade-offs between overall network efficiency and performance across five classification tasks.
INTRODUCTION
The deep neural network (DNN) is a popular learning paradigm that can generalize to tasks from disparate domains while achieving state-of-the-art performance. However, these networks are heavyweight with regard to both compute and memory resources. For example, an outrageously large neural network with 32-bit floating point, such as an LSTM with a mixture of experts [29], requires approximately 137 billion parameters. To manage the training and batch inference of these networks, hardware accelerators are employed, such as Google's Tensor Processing Unit to decrease latency and increase throughput, embedded and/or reconfigurable devices to mitigate power bottlenecks, or targeted ASICs to optimize the overall performance. A predominant factor contributing to the computational cost is the large footprint of primitives, known as multiply-and-accumulate (MAC) operations, which perform weighted summations of the neuronal inputs. Techniques such as sparsity and low-precision representation [5, 7, 13, 32] have been extensively studied to reduce the cost associated with MACs. For example, substituting 8-bit fixed-point for 32-bit fixed-point when performing inference on CIFAR-10 with AlexNet reduces the energy consumption by 6× [14]. These techniques become a necessity when deploying DNNs on end devices, such as edge or IoT devices.
Of the methods used to mitigate these constraints, low-precision techniques have shown the most promise. For example, linear and nonlinear quantization have been able to match 32-bit floating point performance with 8-bit fixed-point and 8-bit floating point accelerators [5, 19, 26]. However, quantizing to an ultra-low bit precision, i.e. ≤8 bits, can necessitate an increase in computational complexity. For example, a DNN may have to be retrained or the number of hyperparameters significantly increased [24] to maintain performance. A more lightweight solution is to perform DNN training and inference directly in a low-precision numerical format (fixed-point, floating point, or posit [11]) instead of quantizing a network trained at high precision (e.g. with 32-bit floating point). Previous studies have compared DNN inference with low-precision (e.g. 8-bit) to high-precision floating point (e.g. 32-bit) [14]. However, these works compare numerical formats of disparate bit-widths and thereby do not provide a fair, comprehensive study of network efficiency.
The recently proposed posit numerical format offers a wider dynamic range, better accuracy, and improved closure compared to IEEE-754 floating point [10]. Fig. 1 suggests intuitively that a natural posit distribution (e.g. 8-bit posit, es = 0) may be an optimal fit for representing DNN parameters (e.g. those of ConvNet). In this work, we investigate the effectiveness of ultra-low precision posits for DNN inference. The designs of several multiply-and-accumulate units for the posit, fixed-point, and floating point formats at low precision are analyzed for resource utilization, latency, power consumption, and energy-delay-product. We carry out various classification tasks and compare the trade-offs between accuracy degradation and hardware efficiency. Our results indicate that posits outperform the other formats at ultra-low precision and can be realized at a cost similar to floating point in DNN accelerators.
RELATED WORK
Since the late 1980s, low-precision fixed-point and floating point computation have been studied [12, 15]. In recent years, research attention has shifted towards deep learning applications. Multiple groups have demonstrated that 16-bit fixed-point DNNs can perform inference with trivial degradation in performance [1, 8]. However, most of these works study DNN inference at varying bit-precisions. There is a need for a fairer comparison between different number formats of the same bit-width paired with FPGA soft cores. For instance, Hashemi et al. analyze 32-bit fixed-point and 32-bit floating point DNN inference on three DNN architectures (LeNet, ConvNet, and AlexNet) and show that fixed-point reduces the energy consumption by ∼12% while suffering a mere 0-1% accuracy drop [14]. Recently, Chung et al. proposed a DNN accelerator (Brainwave) that increases inference throughput within a Stratix-10 FPGA by 3× by substituting 8-bit ms-fp8, a novel floating point format, in place of 8-bit fixed-point [5].
Several groups have previously studied the usage of the posit format in DNNs. Langroudi et al. study the efficacy of posit representations of DNN parameters and activations [21]. The work demonstrates that DNN inference using 7-bit posits endures <1% accuracy degradation on ImageNet classification with AlexNet, and that posits have a ~30% smaller memory footprint than fixed-point for multiple DNNs while maintaining a <1% drop in accuracy. Cococcioni et al. review the effectiveness of posits for autonomous driving functions [6]. A discussion of a posit processing unit as an alternative to a floating point processing unit develops into an argument for posits, as they exhibit a better trade-off between accuracy and implementation complexity. Most recently, J. Johnson proposed a log float format which couples posits with a logarithmic EMAC operation referred to as exact log-linear multiply-add (ELMA) [18]. Using this format within ResNet-50 achieves <1% accuracy deterioration on ImageNet classification, and the ELMA shows much lower power consumption than IEEE-754 floating point.
In this work, we demonstrate that posit arithmetic at ultra-low bit-width is a natural fit for DNN inference. The EMAC-equipped, parameterized Deep Positron architecture is implemented on an FPGA soft processor and enables a rigorous comparison of the fixed-point, floating point, and posit formats at the same bit-width.
BACKGROUND
Deep Neural Networks
The DNN is a connectionist, predictive model used commonly for classification and regression. These networks learn a nonlinear input-to-output mapping in a supervised, unsupervised, or semi-supervised manner. Before performing inference, a DNN is trained to minimize a cost function, updating its parameters, called weights and biases, via backpropagation. Customarily, either 16-bit or 32-bit floating point arithmetic is used for DNN inference. However, the 32-bit IEEE-754 floating point representation maintains a massive dynamic range of over 80 decades, which is far beyond what DNNs require. Thus, this numerical distribution yields low information-per-bit in terms of Shannon maximum entropy [28]. 16-bit floating point, often used in NVIDIA accelerators, exposes the format's limitations: nontrivial exception cases, underflow and overflow to ±infinity or zero, and redundant NaN and zero representations. Posit arithmetic offers an elegant solution to these limitations at generic bit-widths.
Posit Numerical Format
The posit numerical format, a Type III unum, was proposed to improve upon the deficiencies of the IEEE-754 floating point format and to address complaints about Type I and II unums [10, 31]. The posit format offers better dynamic range, accuracy, and program reproducibility than IEEE floating point. A posit number comprises n bits and es exponent bits, where es controls the dynamic range. The primary divergence of posit from floating point is the introduction of a signed, run-length encoded regime bit-field. The longer this field, the larger the magnitude but the lower the precision of the number; the converse holds for shorter run-lengths. Two posit bit-strings are reserved: 00...0 for zero and 10...0 for "Not a Real," which can denote infinity, division by zero, etc. A binary posit bit-string is interpreted as a sign bit, followed by the run-length encoded regime bits, up to es exponent bits, and the remaining fraction bits.
The numerical value a posit represents is then given by

$x = (-1)^{s} \times \mathit{useed}^{\,k} \times 2^{e} \times (1 + f), \qquad \mathit{useed} = 2^{2^{es}}$   (1)
where k is the regime, e is the unsigned exponent (es > 0), and f is the value of the fraction bits. If a posit number is negative, the 2's complement is taken before decoding. We recommend reviewing [10] for a more thorough introduction and intuition to the posit format.
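For intuition, the decoding step can be sketched in software. The following is our own simplified Python model of (1), not the Deep Positron hardware; the function name and interface are illustrative only.

```python
def decode_posit(bits, n, es):
    """Decode an n-bit posit (given as an unsigned integer) per Eq. (1).

    Illustrative sketch: returns a Python float, which is exact for the
    small bit-widths considered here.
    """
    mask = (1 << n) - 1
    bits &= mask
    if bits == 0:
        return 0.0                      # 00...0 encodes zero
    if bits == 1 << (n - 1):
        return float("nan")             # 10...0 encodes "Not a Real"
    sign = (bits >> (n - 1)) & 1
    if sign:                            # negative: 2's complement before decoding
        bits = (-bits) & mask
    body = bits & ((1 << (n - 1)) - 1)  # drop the sign bit
    first = (body >> (n - 2)) & 1       # leading regime bit
    run = 0
    for i in range(n - 2, -1, -1):      # count the run of identical regime bits
        if (body >> i) & 1 == first:
            run += 1
        else:
            break
    k = run - 1 if first else -run      # regime value
    rest_bits = max(n - 2 - run, 0)     # bits left after regime and its terminator
    rest = body & ((1 << rest_bits) - 1) if rest_bits else 0
    e_bits = min(es, rest_bits)
    e = (rest >> (rest_bits - e_bits)) << (es - e_bits)  # pad truncated exponent
    f_bits = rest_bits - e_bits
    f = (rest & ((1 << f_bits) - 1)) / (1 << f_bits) if f_bits else 0.0
    useed = 2 ** (2 ** es)
    value = useed ** k * 2 ** e * (1 + f)
    return -value if sign else value

# Example: decode_posit(0b01010000, n=8, es=0) -> 1.5
```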
METHODOLOGY
We build off of [2] , using the proposed Deep Positron architecture. The framework is parameterized by bit-width, numerical type, and DNN hyperparameters, so networks of arbitrary width and depth can be constructed for the fixed-point, floating point, and posit formats. The following sections further describe the EMAC operation and detail the EMAC algorithms for each numerical format.
Exact Multiply-and-Accumulate (EMAC)
The multiply-and-accumulate (MAC) operation is ubiquitous within DNNs: each neuron computes a weighted sum of its inputs. In most implementations, this operation is inexact, meaning rounding or truncation results in an accumulation of error. The EMAC mitigates this issue by implementing a variant of the Kulisch accumulator [20] and deferring rounding until every product of a layer has been accumulated. This minimization of local error becomes substantial at low precision. In each EMAC module, a wide register accumulates fixed-point values and rounds in a deferred stage. For k multiplications, the accumulator width is computed using

$w_{a} = \lceil \log_2(k) \rceil + 2\left\lceil \log_2\!\left(\tfrac{\mathit{max}}{\mathit{min}}\right)\right\rceil + 2$   (2)

where max and min are the maximum and minimum value magnitudes for a given numerical system, respectively. Each EMAC is pipelined into three stages: multiplication, accumulation, and rounding. A fourth stage, implementing the trivial activation function ReLU(x) = max(x, 0), is present for hidden-layer neurons. For a further introduction to EMACs and the exact dot product, we recommend reviewing [2, 20].
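As a concrete illustration of (2), the sketch below (our own; the example values assume an 8-bit posit with es = 0, whose largest and smallest magnitudes are 2^6 and 2^-6, and k = 64 products) computes the required register width:

```python
from math import ceil, log2

# Sketch of Eq. (2): width of the wide accumulation register needed to hold
# k exact products of values from a format with the given magnitude range.
def accumulator_width(k, max_mag, min_mag):
    return ceil(log2(k)) + 2 * ceil(log2(max_mag / min_mag)) + 2

# Example: 8-bit posit, es = 0 (max = 2**6, min = 2**-6), 64 products
print(accumulator_width(64, 2 ** 6, 2 ** -6))   # -> 32
```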
Fixed-Point EMAC
We parameterize the fixed-point EMAC by n, the bit-width, and Q, the number of fractional bits, where n > Q. Fig. 2 shows the block diagram of the EMAC with signal bit-widths indicated. The functionality of the unit is described by Algorithm 1. For an n-bit signed fixed-point number with Q fractional bits, the general characteristics are

$\mathit{min} = 2^{-Q}, \qquad \mathit{max} = 2^{\,n-1-Q} - 2^{-Q}.$
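Algorithm 1 is realized in hardware; as a rough software analogue (our own sketch, not the RTL), the arithmetic can be modeled as follows, assuming weights and activations are signed integers at scale 2^-Q:

```python
def fixed_emac(pairs, n=8, Q=6):
    """Software model of the fixed-point EMAC arithmetic (illustrative only).

    `pairs` holds (weight, activation) tuples as signed integers at scale
    2**-Q. Products at scale 2**-(2Q) are accumulated exactly; rounding and
    saturation happen once, after all products are summed.
    """
    acc = 0                               # Python ints stand in for the wide register
    for w, x in pairs:
        acc += w * x                      # exact product and accumulation
    r = (acc + (1 << (Q - 1))) >> Q       # round half up, back to scale 2**-Q
    lo, hi = -(1 << (n - 1)), (1 << (n - 1)) - 1
    return max(lo, min(hi, r))            # saturate to the n-bit output range
```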
Posit EMAC
The posit EMAC, shown in Fig. 4, is parameterized by n, the bit-width, and es, the number of exponent bits. In this implementation, we do not consider "Not a Real," as all DNN parameters and data are real-valued and posits do not overflow to infinity. Algorithm 3 describes the data extraction process for each EMAC input, which is more involved owing to the dynamic-length regime. The EMAC employs this extraction as outlined by Algorithm 4. The relevant attributes of a given posit format are

$\mathit{useed} = 2^{2^{es}}, \qquad \mathit{max} = \mathit{useed}^{\,n-2}, \qquad \mathit{min} = \mathit{useed}^{-(n-2)},$

where useed can be thought of as the scale factor base, as shown in (1).
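Algorithms 3 and 4 are likewise hardware descriptions; a compact software analogue (our own, reusing the decode_posit sketch from the posit format section and a rational accumulator in place of the wide fixed-point register) is:

```python
from fractions import Fraction

def posit_emac(pairs, n=8, es=0):
    """Software model of the posit EMAC (illustrative only).

    `pairs` holds (weight, activation) tuples as raw n-bit posit encodings.
    A Fraction stands in for the wide Kulisch-style register: products are
    accumulated exactly and rounding is deferred to the final conversion.
    """
    acc = Fraction(0)
    for w_bits, x_bits in pairs:
        w = Fraction(decode_posit(w_bits, n, es))   # exact for these bit-widths
        x = Fraction(decode_posit(x_bits, n, es))
        acc += w * x                                # exact product and accumulation
    return float(acc)                               # single deferred rounding step
```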
EXPERIMENTAL RESULTS
In all experiments, we synthesize the EMACs onto a Virtex-7 FPGA (xc7vx485t-2ffg1761c) using Vivado 2017.2 and expand upon the results from [2]. With regard to energy and latency, the posit EMAC is competitive with the floating point EMAC. While it uses more resources at the same bit-precision, the posit format offers a wider dynamic range at fewer bits and maintains a higher maximum operating frequency. Moreover, the energy-delay-products of the floating point and posit EMACs are comparable. The fixed-point EMAC, as expected, has the lowest resource utilization and latency; its lack of an exponent field results in a far narrower accumulation register. However, fixed-point offers poor dynamic range compared with the other formats at the same bit-precision.
The quantization error of a tensor X is computed as the mean squared error

$\mathrm{MSE}(X, \hat{X}) = \frac{1}{N}\sum_{i=1}^{N}\left(X_i - \hat{X}_i\right)^{2}$   (3)

where $\hat{X}$ is the quantized tensor and N is the number of elements. Fig. 5 shows a layer-wise heatmap of the quantization error for each format on the MNIST and Fashion MNIST classification tasks. It is clear that posits suffer the least from quantization, which is especially noticeable at ≤5-bit precision.
We evaluate the inference accuracy of several feedforward three- or four-layer neural networks, instantiated on the Deep Positron accelerator, on five datasets. The baseline results are taken from networks trained and evaluated using standard IEEE-754 floating point at 32-bit precision. The inputs and weights of the trained networks are directly quantized to low-precision posit, fixed-point, and floating point, as shown in Table 1 (where "dense" denotes a fully-connected feedforward layer). In some cases, an 8-bit posit matches the performance of the 32-bit floating point baseline. An interesting result is that both posit and floating point at 8-bit precision improve upon the baseline performance for the Fashion MNIST task. We compare energy, delay, and the energy-delay-product against the average Deep Positron performance for all formats at [5, 8]-bit precision. Figs. 6 and 7 depict the average accuracy degradation across the five classification tasks against these metrics for each bit-width. Posit consistently outperforms the other formats at a slight cost in power. Fixed-point maintains the lowest delay across all bit-widths, as expected, but offers the worst accuracy. While the floating point EMAC generally uses less power than the posit EMAC, the posit EMAC enjoys lower latencies across all bit-widths whilst maintaining lower accuracy degradation.
Exploiting the Posit es Parameter
The experimental results in this paper evaluate the posit numerical format with es ∈ {0, 1, 2} across five datasets. As shown in Fig. 6, the energy-delay-product of the posit EMAC depends on the es parameter. For instance, the energy-delay-product of the posit EMAC with es = 0 is, on average, 3× and 1.4× lower than that of the posit EMAC with es = 2 and es = 1, respectively. On the other hand, the average DNN inference performance with the es = 1 posit EMAC across the five datasets and [5, 7]-bit precision is 2% and 4% better than with es = 2 and es = 0, respectively. Thus, Deep Positron equipped with the posit (es = 1) EMAC offers a better trade-off between energy-delay-product and accuracy for [5, 7] bits. At 8-bit, the results suggest that es = 1 is a better fit for energy-efficient applications and es = 2 for accuracy-critical applications.
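To make this trade-off concrete, the following quick calculation (our own, following (1)) shows how es shifts an 8-bit posit between dynamic range and fraction precision:

```python
# Dynamic range vs. fraction bits for an 8-bit posit as es varies.
for es in (0, 1, 2):
    useed = 2 ** (2 ** es)
    maxpos = useed ** (8 - 2)       # largest representable magnitude
    frac_bits = 8 - 1 - 2 - es      # fraction bits when the regime is shortest
    print(f"es={es}: useed={useed}, maxpos={maxpos}, fraction bits={frac_bits}")
# es=0: maxpos=64 with 5 fraction bits; es=2: maxpos=16,777,216 with only 3
```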
Comparison with Other Posit Hardware Implementations
A summary of previous studies that design posit arithmetic hardware is shown in Table 2. Several groups implement basic posit arithmetic operations, such as addition, subtraction, multiplication, and the exact dot product (quire), on FPGAs for various applications [3, 4, 16-18, 23, 25]. Kumar et al. provide a hardware generator for posit addition, subtraction, and multiplication and show reduced latency and area consumption for 32-bit posit addition with es = 3 over IEEE-754 floating point addition [16, 17]. However, the comparison is between two different FPGA platforms, which diminishes its merit. They also ignore several characteristic requirements of posit arithmetic, such as round-to-nearest with ties to even and unbiased rounding. To better realize the advantages of posit arithmetic over IEEE-754 floating point with complete posit arithmetic features, Chaurasiya et al. proposed a parameterized posit arithmetic hardware generator [3]. They emphasize that the resource utilization and energy of the posit arithmetic unit are comparable with IEEE-754 float when the same number of bits is considered for both formats. Moreover, the area consumption of the posit hardware is less than that of IEEE-754 float at similar precision and dynamic range. To simplify and expedite hardware design, as well as improve the usability of posits on heterogeneous platforms, the researchers in [23] and [25] use high-level languages, such as C# and OpenCL, to generate posit arithmetic hardware for FPGAs. Most of these previous works do not support the exact-dot-product operation and do not design specialized posit arithmetic for deep learning applications, as we present in this paper. In [2], a parameterized FPGA-based DNN accelerator is constructed which employs exact-dot-product algorithms for the posit, fixed-point, and floating point formats. That paper shows strong preliminary results that posits are a natural fit for low-precision inference. Following this work, J. Johnson proposed an exact log-linear multiply-add arithmetic algorithm for deep learning applications using a posit multiplier in the log domain and a Kulisch adder [18]. The results indicate better performance of 8-bit posit multiply-add over 8-bit fixed-point multiply-add with similar accuracy for the ResNet-50 neural network on the ImageNet dataset. However, that paper targets an ASIC platform and a convolutional neural network at 8-bit precision, whereas we study an FPGA implementation and fully-connected neural networks at [5, 8]-bit precision.
CONCLUSIONS
We demonstrate that the recent posit numerical format has a high affinity for deep neural network inference at ≤8-bit precision. The proposed posit hardware is shown to be competitive with its floating point counterpart in terms of resource utilization and energy-delay-product. Moreover, the posit EMAC offers a higher maximum operating frequency than that of floating point. With regard to performance degradation, direct quantization to ultra-low precision heavily favors posits, which vastly surpass fixed-point. Moreover, the performance of floating point is consistently matched or surpassed by posits across multiple datasets. The success of prospective new classes of learning algorithms will likewise be contingent on the underlying hardware.
