Abstract-Low-precision DNNs have been extensively explored in order to reduce the size of DNN models for edge devices. Recently, the posit numerical format has shown promise for DNN data representation and compute with ultra-low precision ∈ [5..8] bits. However, previous studies were limited to studying posit for DNN inference only. In this paper, we propose the Cheetah framework, which supports both DNN training and inference using posits, as well as other commonly used formats. Additionally, the framework is amenable to different quantization approaches and supports mixed-precision floating point and fixed-point numerical formats. Cheetah is evaluated on three datasets: MNIST, Fashion MNIST, and CIFAR-10. Results indicate that 16-bit posits outperform 16-bit floating point in DNN training. Furthermore, performing inference with [5..8]-bit posits improves the trade-off between performance and energy-delay-product over both [5..8]-bit float and fixed-point.
requirements to match putative edge resources. Several groups have proposed compressed DNN models with new compute- and memory-efficient neural networks [8] [9] [10] and parameter-efficient neural networks, such as DNN pruning [11], distillation [12], and low-precision arithmetic [13], [14].
Among these approaches to compress DNN models, low-precision arithmetic is noted for its ability to reduce the memory capacity, bandwidth, latency, and energy consumption associated with MAC units in DNNs, and to increase the level of data parallelism [13], [15], [16]. For instance, DNN inference with compressed models, such as MobileNet with 8-bit fixed-point parameters, utilizes only ∼4.2 M parameters and ∼1.1 megaFLOPS [8]. While this alleviates some of the design constraints for the edge, DNN models must still run quickly with high accuracy for complex visual or video recognition tasks on-device. Therefore, a conflicting design constraint here is that the network's precision cannot compromise a DNN's overall performance. For instance, there is a ∼10% gap between the performance of low-precision DNN models (e.g., MobileNet with 8-bit fixed-point DNN parameters) and high-precision DNN models (e.g., MobileNet with 32-bit floating point DNN parameters) for real-time (30 FPS) classification on ImageNet data with a Snapdragon 835 LITTLE core [13].
The ultimate goal of designing a low-precision DNN is to reduce the hardware complexity of the high-precision DNN model such that it can be ported onto edge devices with performance similar to the high-precision DNN. The hardware complexity and performance in low-precision DNNs rely heavily on the quantization approach and the numerical format. Prevailing techniques, such as complex vector quantization or hardware-friendly numerical formats, lead to undesirable hardware complexity or performance penalties [17], [18].
To understand the correlation between hardware complexity and performance of low-precision neural networks for the edge, a hardware and software co-design framework is required. Previous studies have addressed this by proposing low-precision frameworks [13] [14] [15] [16] , [19] [20] [21] [22] . However, the scope of these studies is limited, as highlighted below:
1) None of the previous works explores the suitability of the posit numerical format for both DNN training and inference by comprehensive comparison with fixed and float formats [19] [20] [21] [22].
2) There is a lack of comparison between the efficacy of quantization approaches, numerical formats, and the associated hardware complexity.
3) In most of the previous works, comparisons across numerical formats are conducted for varying bit-widths (e.g. 32-bit floating point compared to 8-bit fixed-point [15]). Such comparisons do not offer insight into the viability of utilizing the same bit-precision across numerical formats for a particular task.
To address the gaps in previous studies, we propose Cheetah, a comprehensive hardware and software co-design framework to explore the advantage of low precision for both DNN training and inference. The current version of Cheetah supports three numerical formats (fixed-point, floating point, and posit), two quantization approaches (rounding and linear), and two DNN models (feedforward neural networks and convolutional neural networks).
II. BACKGROUND

A. Deep Neural Network
Deep neural networks (DNNs) [23] are artificial neural networks that are used for various tasks, such as classification, regression and prediction, by learning the correlation between examples from a corpus of data called training sets [24]. These networks are capable of learning a non-linear input-to-output mapping in either a supervised, unsupervised, or semi-supervised manner. The DNN models contain a sequence of layers, each comprising a set of nodes. The connectivity between layers depends on the DNN architecture (e.g. globally connected in a feedforward neural network or locally connected in a convolutional neural network).
A major computation in a DNN node is the MAC operation. Specifically, a node in a feedforward or convolutional neural network computes

$$Y_j = B_j + \sum_{i=1}^{N} A_i \times W_{ij} \qquad (1)$$

where B indicates the bias vector, W is the weights tensor with numerical values that are associated with each connection, A represents the activation vector as input values to each node, Y is the feature vector at the output of each node, and N equals either the number of nodes for a feedforward neural network or the product of the (C, R, S) filter parameters: the number of filter channels, the filter height, and the filter width, respectively, for a convolutional neural network.
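As a minimal illustration of the MAC computation described above, the following Python sketch accumulates the bias plus the products of activations and weights for a single node. The function names `node_output` and `conv_mac_count` are hypothetical and chosen only for this example, and the input values are arbitrary.

```python
def node_output(bias, activations, weights):
    """Return bias + sum_i activations[i] * weights[i] for one node."""
    if len(activations) != len(weights):
        raise ValueError("activations and weights must have the same length")
    acc = bias
    for a, w in zip(activations, weights):
        acc += a * w  # one multiply-and-accumulate (MAC) step
    return acc


def conv_mac_count(channels, height, width):
    """Number of MACs per output value for a C x R x S convolution filter."""
    return channels * height * width


if __name__ == "__main__":
    print(node_output(0.5, [1.0, 2.0, 3.0], [0.25, -0.5, 1.0]))
    print(conv_mac_count(3, 3, 3))
```

For a feedforward layer, N is simply the length of the activation vector; for a convolutional layer it is the value returned by `conv_mac_count`.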
In a supervised learning scenario for all of these networks, the correctness of classifications is given by the distance between Y and the desired output as calculated by $E_i$, a cost function with respect to the weights. Then, during training, the weights are learned through stochastic gradient descent (SGD) to minimize $E_i$ as given by (2).
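The weight update can be sketched for a single linear node as follows. This is an illustration only: it assumes a squared-error cost, which the text does not specify, and the names `sgd_step`, `cost`, and `learning_rate` are hypothetical.

```python
def predict(weights, activations):
    """Weighted sum of the inputs for one linear node (bias omitted)."""
    return sum(a * w for a, w in zip(activations, weights))


def cost(weights, activations, target):
    """Assumed cost for this sketch: E = 0.5 * (y - target)**2."""
    return 0.5 * (predict(weights, activations) - target) ** 2


def sgd_step(weights, activations, target, learning_rate):
    """One gradient-descent update: w_i <- w_i - learning_rate * dE/dw_i."""
    error = predict(weights, activations) - target
    # For the assumed squared-error cost, dE/dw_i = (y - target) * a_i.
    return [w - learning_rate * error * a for a, w in zip(activations, weights)]


if __name__ == "__main__":
    w = [0.0, 0.0]
    x = [1.0, 2.0]
    print(cost(w, x, 1.0))
    w = sgd_step(w, x, 1.0, 0.1)
    print(cost(w, x, 1.0))
```

The product of the learning rate and the gradient is the quantity that becomes vulnerable to underflow when it is held in a low-precision format.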
B. Posit Numerical Format
The posit, a Type III unum, is a new numerical format with a tapered-precision characteristic and was proposed as an alternative to the IEEE-754 floating point format to represent real numbers [25]. Posits revamp the IEEE-754 floating point format and address complaints about Type I and Type II unums [26]. Posits provide better accuracy, dynamic range, and program reproducibility than IEEE floating point. The essential advantage of posits is their capability to represent non-linearly distributed numbers in a specific dynamic range around 1 with maximum accuracy. The value of a posit number is represented by (3),

$$x = (-1)^{s} \times \left(2^{2^{es}}\right)^{k} \times 2^{e} \times \left(1 + \frac{f}{2^{fs}}\right) \qquad (3)$$

where s represents the sign, es and fs represent the maximum number of bits allocated for the exponent and fraction, respectively, e and f indicate the exponent and fraction values, respectively, and k, as computed by (4), represents the regime value.

$$k = \begin{cases} -m, & \text{if } r = 0 \\ m - 1, & \text{if } r = 1 \end{cases} \qquad (4)$$

The regime bit-field is encoded based on the runlength m of identical bits (r...r) terminated by either a regime terminating bit $\bar{r}$ or the end of the n-bit value. Note that there is no requirement to distinguish between negative and positive zero since only a single bit pattern (00...0) represents zero. Furthermore, instead of defining a NaN for exceptional values and infinity by various bit patterns, a single bit pattern (10...0), "Not-a-Real" (NaR), represents exception values and infinity.
More details about the posit number format can be found in [25] .
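To make the encoding concrete, the sketch below decodes an n-bit posit bit pattern into an exact rational value, following the standard posit definition: a sign bit, a regime run of m identical bits, up to es exponent bits, and the remaining bits as the fraction. The function name `decode_posit` is hypothetical, and this is an illustrative software decoder rather than the decoder used in the hardware described in this paper.

```python
from fractions import Fraction


def decode_posit(bits, n, es):
    """Decode an n-bit posit (unsigned int) with es exponent bits.

    Returns an exact Fraction, or None for NaR (pattern 10...0).
    """
    mask = (1 << n) - 1
    bits &= mask
    if bits == 0:
        return Fraction(0)
    if bits == 1 << (n - 1):
        return None  # Not-a-Real
    sign = bits >> (n - 1)
    if sign:
        bits = (-bits) & mask  # negative posits are stored in two's complement
    body = bits & ((1 << (n - 1)) - 1)  # the n-1 bits after the sign

    # Regime: run of m identical bits starting just below the sign bit.
    r = (body >> (n - 2)) & 1
    pos = n - 2
    m = 0
    while pos >= 0 and ((body >> pos) & 1) == r:
        m += 1
        pos -= 1
    k = m - 1 if r == 1 else -m
    pos -= 1  # skip the terminating bit, if there is one

    remaining = max(pos + 1, 0)  # bits left for exponent and fraction
    rest = body & ((1 << remaining) - 1)
    e_bits = min(es, remaining)
    # Truncated exponent bits are treated as zeros on the right.
    e = ((rest >> (remaining - e_bits)) << (es - e_bits)) if e_bits else 0
    f_bits = remaining - e_bits
    f = rest & ((1 << f_bits) - 1)

    scale = k * (1 << es) + e
    value = (1 + Fraction(f, 1 << f_bits)) * Fraction(2) ** scale
    return -value if sign else value


if __name__ == "__main__":
    for pattern in (0b01000000, 0b01001000, 0b01111111, 0b00000001):
        print(format(pattern, "08b"), decode_posit(pattern, 8, 1))
```

Enumerating all 2^n patterns with this function shows the tapered precision directly: representable values are densest near ±1 and become sparse toward the largest and smallest magnitudes.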
III. RELATED WORK
As early as the 1980s, low-precision arithmetic was studied for shallow neural networks to reduce compute and memory complexity for training and inference without sacrificing performance [27] [28] [29] [30]. In some scenarios, it also improves the performance of training and inference, since the quantization noise generated by low-precision parameters in a shallow neural network acts as a regularization method [30], [31]. The outcomes of these studies indicate that 16- and 8-bit precision DNN parameters are sufficient for training and inference on shallow networks [28] [29] [30]. The capability of low-precision arithmetic has been reevaluated in the deep learning era to reduce memory footprint and energy consumption during training and inference [14] [15] [16], [19] [20] [21] [22], [32] [33] [34] [35] [36] [37] [38].
A. Low-Precision DNN Training
Several previous studies have shown that, to perform DNN training, either variants of low-precision block floating point (BFP), where a block of floating point DNN parameters uses a shared exponent [39], such as Flexpoint [35] (16-bit fraction with 5-bit shared exponent for DNN parameters), or mixed-precision floating point (16-bit weights, activations, and gradients and 32-bit accumulators in the SGD weight update process) are sufficient to maintain performance similar to 32-bit high-precision floating point. For instance, Courbariaux et al. trained a low-precision DNN on the MNIST, CIFAR-10, and SVHN datasets with the floating point, fixed-point, and BFP numerical formats [32]. They demonstrate that BFP is the most suitable choice for low-precision training due to variability between the dynamic range and precision of DNN parameters [32]. Following this work, Koster et al. proposed the Flexpoint numerical format and a new algorithm called Autoflex to automatically predict the optimal shared exponents for DNN parameters in each iteration of SGD by statistically analyzing the values of DNN parameters in previous iterations [35].
Aside from managing the shared exponent in the BFP numerical format, Narang et al. used mixed-precision floating point [34]. They used 16-bit floating point to represent weights, activations, and gradients in the forward and backward passes. To prevent accuracy loss caused by underflow in the product of learning rate and gradients in (2) when computed in 16-bit floating point, the weights are updated in 32-bit floating point. Additionally, to prevent gradients with very small magnitude from becoming zero when represented by 16-bit floats, a new loss scaling approach is proposed [34].
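The underflow problem and the effect of loss scaling can be reproduced with Python's standard `struct` module, whose `'e'` format code packs IEEE 754 half-precision values. This is a sketch only: the scale factor of 1024 and the gradient value are arbitrary choices for illustration and are not taken from [34].

```python
import struct


def to_fp16(x):
    """Round a Python float to IEEE 754 half precision and back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


if __name__ == "__main__":
    grad = 1e-8          # smaller than the smallest fp16 subnormal (2**-24)
    loss_scale = 1024.0  # illustrative scale factor (assumption)

    unscaled = to_fp16(grad)                # the value underflows to zero
    scaled = to_fp16(grad * loss_scale)     # scaled value survives in fp16
    recovered = scaled / loss_scale         # unscale in higher precision

    print(unscaled, recovered)
```

Scaling the loss scales every gradient by the same factor, so dividing by that factor before the higher-precision weight update recovers the original magnitude.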
Recently, Wang et al. and Mellempudi et al. reduced the bit-precision required to represent weights, activations, and gradients to 8 bits by exhaustively analyzing DNN training parameters [14], [36]. In [36], a new chunk-based addition is also presented to solve the truncation issue caused by the addition of large- and small-magnitude numbers, and thus the number of bits demanded for the accumulator and weight updates is reduced to 16 bits. To avoid the need for loss scaling in mixed-precision floating point, Kalamkar et al. [37] proposed the brain floating point (BFLOAT-16) half-precision format, which has the same dynamic range as 32-bit floating point (8-bit exponent) but less precision (7-bit fraction). The same dynamic range between BFLOAT-16 and 32-bit floating point reduces the conversion complexity between these two formats in DNN training. In training a ResNet model on the ImageNet dataset, BFLOAT-16 achieves the same performance as 32-bit floating point.
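The low conversion cost follows from the bit layout: a bfloat16 value (1 sign bit, 8 exponent bits, 7 fraction bits) is the upper half of the corresponding 32-bit float. The sketch below converts by truncation, which is the simplest possible scheme and an assumption of this example; real implementations typically round to nearest. The function names are hypothetical.

```python
import struct


def float32_bits(x):
    """Return the 32-bit IEEE 754 single-precision pattern of x."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def to_bfloat16_bits(x):
    """Keep the sign, the 8 exponent bits, and the top 7 fraction bits."""
    return float32_bits(x) >> 16


def from_bfloat16_bits(b):
    """Expand a 16-bit bfloat16 pattern back to a Python float."""
    return struct.unpack(">f", struct.pack(">I", (b & 0xFFFF) << 16))[0]


if __name__ == "__main__":
    for value in (1.0, 3.14159, 1e38, 1e-38):
        approx = from_bfloat16_bits(to_bfloat16_bits(value))
        print(value, hex(to_bfloat16_bits(value)), approx)
```

Values such as 1e38, which would overflow in IEEE half precision, remain finite here because the exponent field is unchanged.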
B. Low-Precision DNN Inference
Because the DNN parameters are static during inference, the performance of DNN inference without retraining is more robust to the noise generated by low-precision DNN parameters; several groups have demonstrated that either 8-bit BFP or 8-bit fixed-point, coupled with linear quantization, is adequate to represent weights and activations without significantly degrading the performance obtained with 32-bit floating point. Note that the accumulation bit-width is selected to be 32 bits to preserve accuracy in performing, in general, thousands of additions in the MAC operations. For instance, Gysel et al. demonstrate that 8-bit block floating point for representing weights and activations, 8-bit multipliers, and 32-bit accumulation result in <1% accuracy loss on AlexNet with the ImageNet corpus [16]. Following this work, Hashemi et al. introduce low-precision DNN inference networks to better understand the impact of numerical formats on the energy consumption and performance of DNNs [15], [16].
For instance, performing inference on AlexNet with the 8-bit fixed-point format yields a 6× improvement in energy consumption over 32-bit fixed-point for the CIFAR-10 dataset [15]. Chung et al. proposed the Brainwave accelerator, which uses 8-bit block floating point with a 5-bit exponent to classify the ImageNet dataset with ResNet-50 at <2% accuracy loss [38]. However, the scaling factor parameter in the block floating point numerical format needs to be updated according to the DNN parameter statistics, thus increasing the computational complexity of inference.
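The dependence of block floating point on parameter statistics can be seen in a small sketch: the shared exponent must be derived from the largest magnitude in each block before any value can be quantized. The layout below (one shared power-of-two exponent per block with signed integer mantissas) is a simplified assumption for illustration and is not the exact format of Brainwave [38] or of [16]; the function names are hypothetical.

```python
import math


def bfp_quantize(block, mantissa_bits=8):
    """Quantize a block to signed integer mantissas with one shared exponent."""
    max_abs = max(abs(v) for v in block)
    if max_abs == 0.0:
        return [0] * len(block), 0
    # frexp gives max_abs = f * 2**e with 0.5 <= f < 1, so e bounds the block.
    shared_exp = math.frexp(max_abs)[1]
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    limit = 2 ** (mantissa_bits - 1) - 1
    mantissas = [max(-limit - 1, min(limit, round(v / scale))) for v in block]
    return mantissas, shared_exp


def bfp_dequantize(mantissas, shared_exp, mantissa_bits=8):
    """Map the integer mantissas back to real values."""
    scale = 2.0 ** (shared_exp - (mantissa_bits - 1))
    return [m * scale for m in mantissas]


if __name__ == "__main__":
    block = [0.3, -1.7, 0.02]
    mantissas, shared_exp = bfp_quantize(block)
    print(mantissas, shared_exp, bfp_dequantize(mantissas, shared_exp))
```

If the distribution of a block changes, `shared_exp` has to be recomputed, which is the extra work referred to above.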
To alleviate this problem, researchers have used posits in DNNs [19] [20] [21] [22]. Posits represent numbers more accurately around ±1 and less accurately for very small and large numbers, unlike the uniform precision of the floating point numerical format [40]. This characteristic of posits arises from their tapered precision and suits the distribution of DNN parameters well [19], [25]. For instance, Langroudi et al. explored the efficacy of posits for representing DNN weights and have shown that it is possible to keep the loss in accuracy below 1% for AlexNet on the ImageNet corpus with 7-bit weight representation [19]. They also demonstrate that posits have a 30% smaller memory footprint than fixed-point for multiple DNNs while maintaining a <1% drop in accuracy. However, in that work, the 7-bit posit quantized weights are converted to 32-bit floats, limiting the posit numerical format to memory storage only.
To take full advantage of the posit numerical format, Carmichael et al. proposed the Deep Positron DNN accelerator, which employs the posit numerical format to represent weights and activations combined with an FPGA soft core for ≤8-bit precision exact-MAC operations [20], [21]. They demonstrate that 8-bit posits outperform 8-bit fixed-point and floating point on low-dimensional datasets, such as Iris [41]. Most recently, Johnson proposed a log float format as a combination of the posit numerical format and exact log-linear multiply-add (ELMA), which is the logarithmic version of the exact MAC operation. This work shows that it is possible to classify ImageNet with the ResNet DNN architecture with <1% accuracy degradation [22].
This research builds on these earlier studies [19] [20] [21] [22] and extends low-precision arithmetic to both DNN training and DNN inference with different quantization approaches for both feedforward and convolutional neural networks on various datasets.
IV. PROPOSED FRAMEWORK
The Cheetah framework, shown in Fig. 1, comprises a two-level software component and a single-level hardware component. The software framework is used to evaluate the performance of various numerical formats and quantization approaches by emulating low-precision DNN training and inference. The hardware framework is a soft core implemented on an FPGA and is used for evaluating hardware characteristics of the MAC (multiply-and-accumulate) operation, a fundamental computation in DNN models, coupled with various quantization techniques. For each level, two optimization stages are considered to convert the baseline DNN model with 32-bit high-precision floating point and soft-core MACs to a low-precision DNN model with either posit, floating point, or fixed-point arithmetic soft-core exact-MACs (EMACs). This optimization is performed iteratively, reducing the bit-precision by one at each step; the performance degradation and hardware complexity reduction achieved by a numerical format in both DNN training and inference are computed and compared with the specified design constraints (e.g. 3× EDP reduction with similar performance). This iterative process is repeated for the next numerical format after one of the design constraints is violated. Essentially, Cheetah approximates the optimal bit-width for each numerical format based on the performance and hardware complexity constraints. Note that there is a priority between optimization approaches; the numerical format parameter has a higher precedence in the optimization process. This design decision is made to limit the search space and the hardware complexity overhead of the quantization approaches. In performing DNN inference, the current version of Cheetah supports three low-precision numerical formats (fixed-point, floating point and posit), two quantization approaches (rounding and linear), and two DNN models (feedforward and convolutional neural networks).
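One way to read the iterative procedure is as a per-format loop that lowers the bit-width by one until a constraint fails. The sketch below checks only an accuracy constraint; a hardware constraint such as an EDP target could be tested in the same place. The function name, the `evaluate_drop` callback, and the numbers in `TOY_DROP` are all hypothetical placeholders, not part of Cheetah and not measured results.

```python
def lowest_feasible_bits(evaluate_drop, fmt, start_bits, min_bits, max_drop):
    """Lower the bit-width one step at a time until the constraint fails.

    evaluate_drop(fmt, bits) is assumed to return the accuracy degradation
    (in percentage points) relative to the high-precision baseline.
    Returns the smallest bit-width that still met the constraint, or None.
    """
    best = None
    for bits in range(start_bits, min_bits - 1, -1):
        if evaluate_drop(fmt, bits) > max_drop:
            break  # constraint violated: stop and move on to the next format
        best = bits
    return best


# Made-up placeholder numbers, used only to exercise the loop.
TOY_DROP = {
    "posit": {8: 0.1, 7: 0.3, 6: 0.8, 5: 4.0},
    "fixed-point": {8: 0.5, 7: 2.5, 6: 9.0, 5: 30.0},
}


def toy_evaluate(fmt, bits):
    return TOY_DROP[fmt][bits]


if __name__ == "__main__":
    for fmt in TOY_DROP:
        print(fmt, lowest_feasible_bits(toy_evaluate, fmt, 8, 5, 1.0))
```

Because the loop stops at the first violation, the number of model evaluations per format is at most the number of candidate bit-widths.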
To perform DNN training on feedforward neural networks, Cheetah supports two numerical formats (floating point and posit) with 32-bit and 16-bit precision. For brevity, the architecture explained here is based on single hidden layer feedforward neural network training and inference with the posit numerical format for both rounding and linear quantization approaches, as shown in Fig. 2 .
A. Software Design and Exploration
In emulating feedforward and convolutional DNNs, the output of each layer Y is calculated as in (5)
where $\alpha_1$ and $\alpha_2$ are scale factors, $B_i$ is the bias term, $A_i$ is the activation vector, $W_{ij}$ is the weight matrix, N indicates the number of MAC operations, and Q(·) is the quantization function. First, the feedforward or convolutional neural network is trained with either 32- or 16-bit floating point or posit numbers, as shown by Fig. 5. To perform DNN inference, the 32-bit floating point high-precision learned weights and 32-bit floating point high-precision activations are quantized to either n-bit low-precision fixed-point, floating point, or posit numbers (n ≤ 8).
In the quantization procedure, the values of $\alpha_1$ and $\alpha_2$ depend on the quantization approach. To perform rounding quantization, $\alpha_1$ and $\alpha_2$ are both set to 1, and the 32-bit high-precision floating point values that lie outside the dynamic range of the chosen low-precision posit format (e.g. 8-bit posit) are clipped to either the format's maximum or minimum. During quantization by rounding, a value that lies between two representable numbers is rounded to the nearest one. In the next step, the MAC operation is employed to calculate $Y_i$. To minimize arithmetic error, the MAC operation in this paper is calculated using the EMAC algorithm [20]. In the EMAC, to preserve precision in computing the products, the posit weights and activations are multiplied in a posit format without truncation or rounding at the end of the multiplications. To avoid rounding during accumulation, the products are stored in a wide register, called the quire in the posit literature, with a width given by (6). The products are then converted to the fixed-point format $FX_{(m_k, n_k)}$, where $m_k = 2^{es+1} \times (n-2) + 2 + \log_2(N_{op})$ is the exponent bit-width and $n_k = 2^{es+1} \times (n-2)$ is the fraction bit-width. Finally, the $N_{op}$ fixed-point products are accumulated and the result is descaled in linear quantization, again using $\alpha_1$ and $\alpha_2$, and converted back to posit.
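The two quantization approaches can be contrasted with a short sketch. Rounding quantization clips to the representable range and picks the nearest representable value; linear quantization scales the input first and descales afterwards. The tiny `GRID` of representable values is a hypothetical stand-in for a real low-precision format, the scale factor 4 is arbitrary, and ties are resolved toward the earlier grid entry purely for simplicity.

```python
def quantize_round(x, representable):
    """Clip x to the representable range, then round to the nearest value."""
    lo, hi = representable[0], representable[-1]
    clipped = min(max(x, lo), hi)
    return min(representable, key=lambda v: abs(v - clipped))


def quantize_linear(x, alpha, representable):
    """Scale by alpha, quantize by rounding, then descale by alpha."""
    return quantize_round(alpha * x, representable) / alpha


# Hypothetical sorted set of representable values (denser near +/-1).
GRID = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]

if __name__ == "__main__":
    x = 0.1
    print(quantize_round(x, GRID))        # rounding only
    print(quantize_linear(x, 4.0, GRID))  # scaled into a denser region first
    print(quantize_round(5.0, GRID))      # out-of-range value is clipped
```

With `alpha = 1` the linear variant reduces to plain rounding, matching the setting described above for rounding quantization.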
Algorithm 1 Posit DOT operation for n-bit inputs each with es exponent bits [20]
1: procedure POSITDOT(weight, activation)
2:   sign_w, reg_w, exp_w, frac_w ← DECODE(weight)
3:   sign_a, reg_a, exp_a, frac_a ← DECODE(activation)
4:   sf_w ← {reg_w, exp_w}   ▷ Gather scale factors
5:   sf_a ← {reg_a, exp_a}
6:   sign_mult ← sign_w ⊕ sign_a
7:   frac_mult ← frac_w × frac_a
8:   …   ▷ Adjust for overflow
9:   normfrac_mult ← frac_mult ≫ ovf_mult
10:  sf_mult ← sf_w + sf_a + ovf_mult
11:  fracs_mult ← sign_mult ? −frac_mult : frac_mult
12:  sf_biased ← sf_mult + bias   ▷ Bias the scale factor
13:  fracs_fixed ← fracs_mult ≪ sf_biased   ▷ Shift to fixed
14:  sum_quire ← fracs_fixed + sum_quire   ▷ Accumulate
15-24: …   ▷ Fraction & SF Extraction
25:  …   ▷ Check for overflow
26:  reg_f ← ovf_reg ? {{log2(n)−2{1}}, 0} : reg
27-29: …
30:  ovf_regf ← &reg_f
31:  if ovf_regf then
32:    shift_neg ← reg_f − 2
33:    shift_pos ← reg_f − 1
34:  else
35:    shift_neg ← reg_f − 1
36:    shift_pos ← reg_f
37:  end if
38:  tmp ← sign_sf ? tmp2 ≫ shift_neg : tmp1 ≫ shift_pos
39-41: …
42:  result ← sign_quire ? −result_tmp : result_tmp
43:  return result
44: end procedure
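The size of the fixed-point register that the algorithm accumulates into can be computed from the two bit-widths given in the text, $m_k$ and $n_k$. The sketch below assumes that the logarithm is rounded up for operation counts that are not powers of two and that the total register width is the sum of the two fields; both are assumptions of this example, and the function name is hypothetical.

```python
import math


def quire_layout(n, es, n_op):
    """Return (m_k, n_k, total) bit-widths for n-bit posits with es exponent bits.

    n_k = 2**(es + 1) * (n - 2)
    m_k = n_k + 2 + ceil(log2(n_op))   (ceil is an assumption of this sketch)
    """
    if n < 3 or es < 0 or n_op < 1:
        raise ValueError("expected n >= 3, es >= 0, n_op >= 1")
    fraction_bits = 2 ** (es + 1) * (n - 2)
    upper_bits = fraction_bits + 2 + math.ceil(math.log2(n_op))
    return upper_bits, fraction_bits, upper_bits + fraction_bits


if __name__ == "__main__":
    for n, es in ((5, 0), (8, 0), (8, 1)):
        print(n, es, quire_layout(n, es, 1024))
```

The width grows exponentially with es, which is why a smaller es directly reduces the cost of the accumulator.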
B. Hardware Framework
The MAC operation, introduced earlier as the fundamental DNN operation, calculates the weighted sum of a set of inputs. In many implementations, this operation is inexact, i.e. arithmetic error grows due to iterative rounding and truncation. The EMAC mitigates this concern by adapting the concept of the Kulisch accumulator [42]. The error due to rounding is deferred until after the accumulation of all products, which is of particular benefit to low-precision arithmetic. In the EMAC, as mentioned beforehand, the fixed-point values of $N_{op}$ products are accumulated in a wide register sized as given by (6). The posit EMAC, illustrated by Fig. 3, is parameterized by n, the bit-width, and es, the number of exponent bits. "NaR" is not considered, as posits do not overflow or underflow and all DNN parameters and data are real numbers. Algorithm 1 describes the bitwise operation of the EMAC dot product. Each EMAC is pipelined into three stages: multiplication, accumulation, and rounding. For further details on EMACs and the exact dot product, we suggest reviewing [20], [21], [42].

Figure 3: A parameterized (n total bits, es exponent bits) FPGA soft core design of the posit exact multiply-and-accumulate (EMAC) operation [20].
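The benefit of deferring rounding can be demonstrated in software with exact rational arithmetic standing in for the wide accumulator. In this sketch the output format is modeled as multiples of a fixed step, which is a simplification chosen for illustration; the function names and the example values are hypothetical.

```python
from fractions import Fraction


def round_to_step(x, step):
    """Round a Fraction to the nearest multiple of step (ties to even)."""
    return round(x / step) * step


def mac_round_each_step(products, step):
    """Inexact MAC: the running sum is rounded after every addition."""
    acc = Fraction(0)
    for p in products:
        acc = round_to_step(acc + p, step)
    return acc


def mac_exact(products, step):
    """Exact MAC: accumulate without error, round once at the end."""
    return round_to_step(sum(products, Fraction(0)), step)


if __name__ == "__main__":
    products = [Fraction(1, 10)] * 10  # each product is below half a step
    step = Fraction(1, 4)
    print(mac_round_each_step(products, step), mac_exact(products, step))
```

In the rounded variant every small product is lost as soon as it is added, whereas the exact variant recovers their combined contribution.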
V. SIMULATION RESULTS & ANALYSIS
The Cheetah software is implemented in the Keras [43] and TensorFlow [44] frameworks. Rounding quantization, linear quantization, and the EMAC operations with [5..32]-bit precision fixed-point, floating point, and posit numbers for DNN inference, and with {16, 32}-bit floating point and posit numbers for DNN training, are added to these frameworks via software emulation. To reduce the search space of the $\alpha_1$ and $\alpha_2$ parameters, β is selected from {1, 2, 4, 8}, which still provides, on average, wide coverage (∼82%) of the dynamic range of each numerical format, as shown in Table I.
A. Exploiting Numerical Formats for DNN Inference
To evaluate Cheetah performance on DNN inference, a feedforward neural network and different convolutional neural networks are trained on three benchmarks with 32-bit floating point. The specification of these tasks and inference performance are summarized in Table II . The accuracies of performing DNN inference on these tasks are presented in Table III . On the CIFAR-10 dataset, these performance gains are further noticeable with 5-bit posits having 28.5% and 31.62% improvements over floating point and fixed-point, respectively. The benefits of the posit numerical format are intuitively explained by the nonlinear distribution of its values, similar to that of DNN inference parameters. This hypothesis is explored empirically by calculating the distortion rate of DNN inference parameters with respect to each numerical format. The distortion rate is described by (7) where P indicates the high-precision parameters and Quant(P ) represents the quantized parameters. The results, as shown in Fig. 4 , validate the hypothesis, especially at 5-bit precision where the distortion rate of posit is significantly less than that of the other numerical formats.
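A distortion measure of this kind can be sketched in a few lines. The example below assumes, purely for illustration, that distortion is measured as the mean squared error between the high-precision parameters P and their quantized counterparts Quant(P); the exact form used in this paper may differ. The parameter values and the two toy grids are made up for the example.

```python
def distortion_rate(params, quantized):
    """Mean squared error between parameters and their quantized values."""
    if len(params) != len(quantized) or not params:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - q) ** 2 for p, q in zip(params, quantized)) / len(params)


def nearest(x, grid):
    """Round x to the closest entry of grid."""
    return min(grid, key=lambda v: abs(v - x))


if __name__ == "__main__":
    params = [0.05, -0.1, 0.2, -0.4, 0.9]              # made-up values
    uniform_grid = [-1.0, -0.5, 0.0, 0.5, 1.0]         # evenly spaced
    tapered_grid = [-1.0, -0.25, -0.0625, 0.0, 0.0625, 0.25, 1.0]  # non-uniform
    for name, grid in (("uniform", uniform_grid), ("tapered", tapered_grid)):
        q = [nearest(p, grid) for p in params]
        print(name, distortion_rate(params, q))
```

Comparing the measure across formats at the same bit-width isolates the effect of how each format distributes its representable values.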
B. Exploiting Numerical Formats with Quantization Approaches for DNN Inference
As mentioned before, quantization with rounding has less overhead than the other quantization approaches, but with it DNN inference with 5-bit posits cannot match the performance of DNN inference with 32-bit floating point. To improve the performance of DNN inference, the [5..8]-bit posit numerical format is combined with the linear quantization approach and evaluated for a 4-layer feedforward neural network on the MNIST and Fashion-MNIST datasets. The $\alpha_1 \times A_i$ and $\alpha_2 \times W_{ij}$ terms in (5) can be implemented either by constant multiplication or by a shift operation. The results, shown in Table IV, indicate that 5-bit low-precision DNN inference achieves performance similar to 32-bit floating point DNN inference on the MNIST data set. Essentially, by deploying this approach, the quantization error produced by the values that lie outside of posit's dynamic range is zeroed out. The linear quantization approach also plays a key role in reducing the hardware complexity of posit EMACs used for DNN inference. Notably, the accuracy of DNN inference with posits is significantly enhanced by using the linear quantization approach in comparison to quantization with rounding. Therefore, the overhead of adding linear quantization is offset by reducing the hardware complexity, i.e. carrying out the posit EMAC operation with es = 0 instead of es = 1, which is explained in depth in the next section.
C. Exploiting Posit and Floating Point for DNN Training
To explore the efficacy of the posit numerical format over the floating point numerical format, a 4-layer feedforward neural network is trained with each number system on the MNIST and Fashion-MNIST datasets. The results, shown in Table V, indicate that the posit numerical format has slightly better accuracy than the floating point number system; in particular, 16-bit posits outperform 16-bit floats. Although Cheetah is evaluated on small datasets, there are two advantages compared to [14], [36]. Mellempudi et al. [36] use 32-bit numbers for accumulation to reduce the hardware cost of stochastic rounding. Wang et al. [14] reduce the accumulation bit-precision to 16 by using stochastic rounding. In this paper, however, we show the potential of using 16-bit posits for all DNN parameters with a simple and hardware-friendly round-to-nearest algorithm, and show less than 1% accuracy degradation without exhaustively analyzing DNN training parameters.

To show the effectiveness of the posit numerical format over floating point and fixed-point, we evaluate the trade-off between the energy-delay-product and latency of the EMAC operation vs. the average accuracy degradation from 32-bit floating point per bit-width across the three datasets (two for the linear-quantization experiment) with the Cheetah framework, as shown in Figs. 5, 6, 7, 8, and 9. The energy-delay-product, a combined measure of latency and resource cost, is measured for the EMAC operation coupled with rounding quantization [20] and for the EMAC operation coupled with linear quantization, for all numerical formats, on a Virtex-7 FPGA (xc7vx485t-2ffg1761c) with synthesis through Vivado 2017.2. Note that the average accuracy degradation per bit-width is computed using the accuracy results in Table IV. The results, as shown by Fig. 5, indicate that posit coupled with rounding quantization achieves up to 23% average accuracy improvement over fixed-point.
However, this accuracy enhancement is gained at the cost of a $0.41 \times 10^{-10}$ increase in energy-delay-product to implement the EMAC unit. Posit also consistently shows better performance than the floating point number system, especially at 5-bit precision, at a comparable energy-delay-product. The posit EMAC operation achieves lower latencies, as shown in Fig. 6, due to a lack of subnormal detection and other exception cases, but exhibits resource-hungry encoding and decoding due to the variable-length regime of the posit numerical format, as shown in Fig. 7. Overall, the 6-bit posit shows the best trade-off between energy-delay-product and average accuracy degradation from 32-bit floating point on the two benchmarks (when analyzed across the [5..8]-bit range). Looking at the posit numerical format in terms of classification performance and EMAC energy-delay-product, posits with es = 1 provide a better trade-off compared to posits with es ∈ {0, 2}. At [5..7]-bit precision, the average performance of DNN inference with es = 1 among the three datasets is 2% and 4% better than with es = 2 and es = 0, respectively. These accuracy benefits are coupled with 2.1× less energy-delay-product and 1.4× more energy-delay-product in comparison to es = 2 and es = 0, respectively. These results are measured when rounding quantization is used. Linear quantization with the shift operation requires similar hardware overhead across all of the numerical formats, as shown in Figs. 8 and 9. However, the accuracy of performing DNN inference with linear quantization with posits (es = 0) is similar to the accuracy when es = 1. Therefore, it is possible to use EMACs with es = 0 instead of es = 1 and thereby achieve 18% energy-delay-product savings.
A summary of previous studies that propose low-precision frameworks is shown in Table VI. Several research groups have explored the efficacy of floats and fixed-point on the performance and hardware complexity of DNNs with multiple image classification tasks [14] [15] [16], [32], [34], [35]. However, none of these works analyzes the appropriateness of the posit numerical format for both DNN training and inference. Additionally, existing work does not offer insight into the impact of the quantization approach vs. numerical format on both accuracy and hardware complexity, as investigated in this paper.
VI. CONCLUSIONS
A low-precision DNN framework, Cheetah, for edge devices is proposed in this work. We explored the capacity of various numerical formats, including floating point, fixed-point and posit, for both DNN training and inference. We show that the recent posit numerical format has high efficacy for DNN training at {16, 32}-bit precision and inference at ≤8-bit precision. Moreover, we show that it is possible to achieve better performance and reduce energy consumption by using linear quantization with the posit numerical format. The success of low-precision posits in reducing DNN hardware complexity with negligible accuracy degradation motivates us to evaluate ultra-low precision training in future work. 
