Cheetah: Mixed Low-Precision Hardware & Software Co-Design Framework for
  DNNs on the Edge by Langroudi, Hamed F. et al.
Hamed F. Langroudi, Zachariah Carmichael, David Pastuch, Dhireesha Kudithipudi
(Preprint)
Abstract—Low-precision DNNs have been extensively explored
in order to reduce the size of DNN models for edge devices.
Recently, the posit numerical format has shown promise for
DNN data representation and compute with ultra-low precision
∈ [5..8] bits. However, previous studies were limited to study-
ing posit for DNN inference only. In this paper, we propose
the Cheetah framework, which supports both DNN training
and inference using posits, as well as other commonly used
formats. Additionally, the framework is amenable to different
quantization approaches and supports mixed-precision floating
point and fixed-point numerical formats. Cheetah is evaluated on
three datasets: MNIST, Fashion MNIST, and CIFAR-10. Results
indicate that 16-bit posits outperform 16-bit floating point in
DNN training. Furthermore, performing inference with [5..8]-bit
posits improves the trade-off between performance and energy-
delay-product over both [5..8]-bit float and fixed-point.
Index Terms—Deep neural networks, low-precision arithmetic,
posit numerical format
I. INTRODUCTION
Edge computing is an emerging design paradigm that offers
intelligence-at-the-edge of mobile networks, while addressing
some of the shortcomings of cloud datacenters [1]. The nodes
of the edges host the computing, storage, and communi-
cation capabilities, which provide on-demand learning for
several applications, such as intelligent transportation, smart
cities, and industrial robotics. Inherent characteristics of edge
devices include low latency, reduced data movement cost,
low communication bandwidth, and decentralized real-time
processing [2], [3]. However, deploying intelligence-at-the-
edge is a formidable challenge for several of the deep neural
network (DNN) models. For instance, DNN inference with
AlexNet requires ∼61 M parameters and ∼1.4 gigaFLOPS
[4]. Moreover, the cost of the multiply-and-accumulate (MAC)
units, a fundamental DNN operation, is non-trivial. In a 45 nm
CMOS process, energy consumption doubles from 16-bit floats
to 32-bit floats for addition and it increases by ∼4x for
multiplication [5]. Memory access cost increases by ∼10x
from 8 k to 1 M memory size with 64-bit cache [5]. In general,
there is a gap between memory storage, bandwidth, compute
requirements, and energy consumption of today’s DNN models
and hardware resources available on edge devices [6], [7].
An apparent solution to address this gap is to compress
the networks and reduce the computation
requirements to match putative edge resources. Several groups
have proposed compressed DNN models with new compute-
and memory-efficient neural networks [8]–[10] and
parameter-efficient techniques, such as DNN pruning [11],
distillation [12], and low-precision arithmetic [13], [14].
Hamed F. Langroudi, Zachariah Carmichael, David Pastuch, and Dhireesha
Kudithipudi are with the Department of Computer Engineering, Rochester
Institute of Technology, Rochester, NY, USA.
Among these approaches to compress DNN models, low-
precision arithmetic is noted for its ability to reduce memory
capacity, bandwidth, latency, and energy consumption associated
with MAC units in DNNs, and to increase the level of
data parallelism [13], [15], [16]. For instance, DNN inference
with compressed models, such as MobileNet with 8-bit fixed-
point parameters, utilizes only ∼4.2 M parameters and ∼1.1
megaFLOPS [8]. While this alleviates some of the design
constraints for the edge, DNN models must still run quickly
with high accuracy for complex visual or video recognition
tasks on-device. Therefore, a conflicting design constraint
here is that the network’s precision cannot compromise a
DNN’s overall performance. For instance, there is a ∼10% gap
between the performance of low-precision DNN models (e.g.,
MobileNet with 8-bit fixed-point DNN parameters) and
high-precision DNN models (e.g., MobileNet with 32-bit floating
point DNN parameters) for real-time (30 FPS) classification
on ImageNet data with a Snapdragon 835 LITTLE core [13].
The ultimate goal of designing the low-precision DNN is
reducing the hardware complexity of the high-precision DNN
model such that it can be ported on to edge devices with
performance similar to the high-precision DNN. The hardware
complexity and performance in low-precision DNNs rely heav-
ily on the quantization approach and the numerical format.
Prevailing techniques, such as complex vector quantization
or hardware-friendly numerical formats, lead to undesirable
hardware complexity or performance penalties [17], [18].
To understand the correlation between hardware complex-
ity and performance of low-precision neural networks for
the edge, a hardware and software co-design framework is
required. Previous studies have addressed this by proposing
low-precision frameworks [13]–[16], [19]–[22]. However, the
scope of these studies is limited, as highlighted below:
1) None of the previous works explore the suitability of
the posit numerical format for both DNN training and
inference through a comprehensive comparison with fixed-point
and floating point formats [19]–[22].
2) There is a lack of comparison between the efficacy
of quantization approaches, numerical formats, and the
associated hardware complexity.
3) In most of the previous works, comparisons across
numerical formats are conducted at different bit-widths
(e.g. 32-bit floating point compared to 8-bit fixed-point
[15]). Such comparisons offer no insight into the viability
of using the same bit-precision across numerical
formats for a particular task.
arXiv:1908.02386v1 [cs.LG] 6 Aug 2019
To address the gaps in previous studies, we are motivated to
propose Cheetah as a comprehensive hardware and software
co-design framework to explore the advantage of low-precision
for both DNN training and inference. The current version of
Cheetah supports three numerical formats (fixed-point, floating
point, and posit), two quantization approaches (rounding and
linear), and two DNN models (feedforward neural networks
and convolutional neural networks).
II. BACKGROUND
A. Deep Neural Network
Deep neural networks (DNNs) [23] are artificial neural
networks that are used for various tasks, such as classification,
regression and prediction, by learning the correlation between
examples from a corpus of data called training sets [24].
These networks are capable of learning a non-linear input-
to-output mapping in either a supervised, unsupervised, or
semi-supervised manner. The DNN models contain a sequence
of layers, each comprising a set of nodes. The connectivity
between layers depends on the DNN architecture (e.g. globally
connected in feedforward neural network or locally connected
in convolutional neural network).
A major computation in a DNN node is the MAC operation.
Specifically, a node in a feedforward or convolutional
neural network computes (1), where B indicates the bias
vector, W is the weight tensor whose numerical values are
associated with each connection, A represents the activation
vector of input values to each node, Y is the feature vector
at the output of each node, and N equals either the number
of nodes for a feedforward neural network or the product of
the (C, R, S) filter parameters (the number of filter channels,
the filter height, and the filter width, respectively) for a
convolutional neural network.
\[ Y_j = B_j + \sum_{i=0}^{N} A_i \times W_{ij} \tag{1} \]
In a supervised learning scenario for all of these networks,
the correctness of classifications is given by the distance
between Y and the desired output as calculated by Ei, a cost
function with respect to the weights. Then, during training, the
weights are learned through stochastic gradient descent (SGD)
to minimize Ei as given by (2), where α is the learning rate.
\[ \Delta W_{ij} = -\alpha \frac{\partial E_i}{\partial W_{ij}} \tag{2} \]
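As a minimal illustration of (1) and (2), the following Python sketch computes the output of one layer and applies a single SGD update. The function names, the nested-list layout `weights[i][j]`, and the numeric values are assumptions made for this example only; the gradients are supplied by the caller rather than derived from a particular cost function.

```python
def layer_output(bias, activations, weights):
    """Equation (1): Y_j = B_j + sum_i A_i * W_ij, with weights indexed as weights[i][j]."""
    return [
        b_j + sum(a_i * weights[i][j] for i, a_i in enumerate(activations))
        for j, b_j in enumerate(bias)
    ]


def sgd_step(weights, grads, learning_rate):
    """Equation (2): W_ij <- W_ij - alpha * dE/dW_ij for every weight."""
    return [
        [w - learning_rate * g for w, g in zip(w_row, g_row)]
        for w_row, g_row in zip(weights, grads)
    ]


# Two inputs feeding one output node (illustrative values).
y = layer_output(bias=[0.5], activations=[1.0, 2.0], weights=[[0.1], [0.2]])
new_w = sgd_step(weights=[[1.0]], grads=[[0.5]], learning_rate=0.1)
```

Each term of the inner sum is one MAC operation, which is why the cost of the MAC unit dominates the arithmetic cost of a layer.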
B. Posit Numerical Format
The posit, a Type III unum, is a new numerical format
with a tapered-precision characteristic, proposed as
an alternative to the IEEE-754 floating point format for representing
real numbers [25]. Posits revamp the IEEE-754 floating point format
and address complaints about Type I and Type II unums
[26]. Posits provide better accuracy, dynamic range, and
program reproducibility than IEEE floating point. The essential
advantage of posits is their capability to represent non-linearly
distributed numbers in a specific dynamic range around 1 with
maximum accuracy. The value of a posit number is represented
by (3), where s represents the sign, es and fs represent the
maximum number of bits allocated for the exponent and frac-
tion, respectively, e and f indicate the exponent and fraction
values, respectively, and k, as computed by (4), represents the
regime value.
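\[
x =
\begin{cases}
0, & \text{if } (00\ldots0) \\
\text{NaR}, & \text{if } (10\ldots0) \\
(-1)^{s} \times 2^{2^{es} \times k} \times 2^{e} \times \left(1 + \frac{f}{2^{fs}}\right), & \text{otherwise}
\end{cases}
\tag{3}
\]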
The regime bit-field is encoded based on the run length m of
identical bits (r...r) terminated by either a regime terminating
bit r̄ (the complement of r) or the end of the n-bit value. Note that there is
no requirement to distinguish between negative and positive
zero since only a single bit pattern (00...0) represents zero.
Furthermore, instead of defining a NaN for exceptional values
and infinity by various bit patterns, a single bit pattern (10...0),
“Not-a-Real” (NaR), represents exception values and infinity.
More details about the posit number format can be found in
[25].
\[
k =
\begin{cases}
-m, & \text{if } r = 0 \\
m - 1, & \text{if } r = 1
\end{cases}
\tag{4}
\]
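To make (3) and (4) concrete, the following Python sketch decodes an n-bit posit bit pattern into a real value. The function name `decode_posit` and its argument layout are illustrative assumptions and are not part of Cheetah. Two further assumptions follow the posit definition in [25] rather than the text above: negative patterns are two's complemented before the fields are read, and exponent bits cut off by the end of the word are treated as zeros. NaR is returned as a floating point NaN purely for convenience.

```python
def decode_posit(bits: int, n: int, es: int) -> float:
    """Decode an n-bit posit with es exponent bits, following (3) and (4)."""
    mask = (1 << n) - 1
    bits &= mask
    if bits == 0:
        return 0.0                      # single zero pattern (00...0)
    if bits == 1 << (n - 1):
        return float("nan")             # NaR pattern (10...0)

    sign = bits >> (n - 1)
    if sign:
        bits = (-bits) & mask           # assumption from [25]: two's complement
    body = bits & ((1 << (n - 1)) - 1)  # the n-1 bits after the sign

    # Regime: run of m identical bits r, ended by the opposite bit or the word end.
    r = (body >> (n - 2)) & 1
    m = 0
    i = n - 2
    while i >= 0 and ((body >> i) & 1) == r:
        m += 1
        i -= 1
    k = m - 1 if r else -m              # equation (4)
    i -= 1                              # skip the terminating bit, if present

    remaining = max(i + 1, 0)           # bits left for exponent and fraction
    e_bits = min(es, remaining)
    e = (body >> (remaining - e_bits)) & ((1 << e_bits) - 1) if e_bits else 0
    e <<= es - e_bits                   # pad truncated exponent bits with zeros
    fs = remaining - e_bits
    f = body & ((1 << fs) - 1) if fs else 0

    value = 2.0 ** ((1 << es) * k + e) * (1 + f / (1 << fs))  # equation (3)
    return -value if sign else value
```

For example, `decode_posit(0b01010000, 8, 0)` yields 1.5, while the same bit pattern with es = 1 yields 2.0, because one of the former fraction bits is reinterpreted as an exponent bit.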
III. RELATED WORK
As early as the 1980s, low-precision arithmetic was
studied for shallow neural networks to reduce compute and
memory complexity for training and inference without sacrificing
performance [27]–[30]. In some scenarios, it also
improves the performance of training and inference since the
quantization noise generated by low-precision
parameters in shallow neural networks acts as a regularization
method [30], [31]. The outcomes of these studies indicate
that 16- and 8-bit precision DNN parameters are sufficient
for training and inference on shallow networks [28]–[30].
The capability of low-precision arithmetic is reevaluated in
the deep learning era to reduce memory footprint and energy
consumption during training and inference [14]–[16], [19]–
[22], [32]–[38].
A. Low-Precision DNN Training
Several previous studies have shown that, to perform
DNN training, either variants of low-precision block floating
point (BFP), where a block of floating point DNN parameters
shares an exponent [39], such as Flexpoint [35] (16-bit
fraction with 5-bit shared exponent for DNN parameters), or
mixed-precision floating point (16-bit weights, activations, and
gradients and 32-bit accumulators in the SGD weight update
process) are sufficient to maintain performance similar to 32-bit
high-precision floating point. For instance, Courbariaux
et al. trained a low-precision DNN on the MNIST, CIFAR-10,
and SVHN datasets with the floating point, fixed-point,
and BFP numerical formats [32]. They demonstrate that BFP
is the most suitable choice for low-precision training due to
variability between the dynamic range and precision of DNN
parameters [32]. Following this work, Koster et al. proposed
the Flexpoint numerical format and a new algorithm called
Autoflex to automatically predict the optimal shared exponents
for DNN parameters in each iteration of SGD by statistically
analyzing the values of DNN parameters in previous iterations
[35].
Aside from managing the shared exponent in the BFP
numerical format, Narang et al. used mixed-precision floating
point [34]. They used a 16-bit floating point to represent
weights, activations, and gradients to perform forward and
backward passes. To prevent accuracy loss caused by under-
flow in the product of learning rate and gradients with (2)
in 16-bit floating point, the weights are updated in 32-bit
floating point. Additionally, to prevent gradients with very
small magnitude from becoming zero when represented by
16-bit float, a new loss scaling approach is proposed [34].
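The underflow that loss scaling addresses can be reproduced with Python's standard `struct` module, whose `"e"` format code packs IEEE 754 binary16 values. This is only a sketch of the idea: the gradient value and the scale factor of 1024 are arbitrary illustrative choices, not values taken from [34], and the helper name `to_fp16` is hypothetical.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to IEEE 754 binary16 and convert it back."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


grad = 1e-8                            # smaller than half the smallest fp16 subnormal
flushed = to_fp16(grad)                # becomes exactly zero in 16-bit float

scale = 1024.0                         # illustrative loss-scale factor
scaled = to_fp16(grad * scale)         # representable in 16-bit float
recovered = scaled / scale             # unscaling done in higher precision
```

Without scaling the gradient is lost entirely, whereas the scaled value survives the 16-bit representation and is recovered to within about 1% after unscaling.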
Recently, Wang et al. and Mellempudi et al. reduced the
bit-precision required to represent weights, activations, and
gradients to 8 bits by exhaustively analyzing DNN training
parameters [14], [36]. In [36], a new chunk-based
addition is also presented to solve the truncation issue caused by
adding large- and small-magnitude numbers, and thus the
number of bits demanded for accumulator and weight updates
is reduced to 16 bits. To avoid the need for loss
scaling in mixed-precision floating point, Kalamkar et al. [37]
proposed the brain floating point (BFLOAT-16) half-precision
format with the same dynamic range (8-bit exponent) and less
precision (7-bit fraction) compared to 32-bit floating point. The
shared dynamic range of BFLOAT-16 and 32-bit floating
point reduces the conversion complexity between these two
formats in DNN training. In training a ResNet model on the
ImageNet dataset, BFLOAT-16 achieves the same performance
as 32-bit floating point.
B. Low-Precision DNN Inference
Because DNN parameters are static during inference,
the performance of DNN inference without retraining is
comparatively robust to the noise generated by low-precision
DNN parameters. Several groups have demonstrated that either 8-bit
BFP or 8-bit fixed-point, coupled with linear quantization,
is adequate to represent weights and activations without significantly
degrading the performance obtained with 32-bit floating
point. Note that the accumulation bit-width is selected to
be 32 bits to preserve accuracy across the, in general,
thousands of additions in the MAC operations. For instance,
Gysel et al. demonstrate that an 8-bit block floating point for
representing weights and activations, 8-bit multipliers, and 32-
bit accumulation results in <1% accuracy loss on AlexNet
with the ImageNet corpus [16]. Following this work, Hashemi
et al. introduce low-precision DNN inference networks to
better understand the impact of numerical formats on the
energy consumption and performance of DNNs [15], [16].
For instance, performing inference on AlexNet with the 8-
bit fixed-point format yields a 6× improvement in energy
consumption over 32-bit fixed-point for the CIFAR-10 dataset
[15]. Chung et al. proposed the Brainwave accelerator using
8-bit block floating point with a 5-bit exponent to classify
ImageNet dataset on ResNet-50 with <2% accuracy loss [38].
However, the scaling factor parameter in the block floating
point numerical format needs to be updated according to the
DNN parameter statistics, thus increasing the computational
complexity of inference.
To alleviate this problem, researchers have used posits in
DNNs [19]–[22]. Posits represent numbers more accurately
around ±1 and less accurately for very small and large
numbers, unlike the uniform precision of the floating point
numerical format [40]. This characteristic of posits arises from
their tapered precision and suits the distribution of DNN parameters
well [19], [25]. For instance, Langroudi et al. explored
the efficacy of posits for representing DNN weights and
showed that it is possible to keep the loss in accuracy below
1% on AlexNet with the ImageNet corpus using a 7-bit weight
representation [19]. They also demonstrate that posits
have a 30% smaller memory footprint than fixed-point
for multiple DNNs while maintaining a <1% drop in accuracy.
However, in that work, the 7-bit posit quantized weights are
converted to 32-bit floats, limiting the posit numerical format
to memory storage only.
To take full advantage of the posit numerical format,
Carmichael et al. proposed the Deep Positron DNN accelerator
which employs the posit numerical format to represent weights
and activations combined with an FPGA soft core for ≤8-bit
precision exact-MAC operations [20], [21]. They demonstrate
that 8-bit posits outperform 8-bit fixed-point and floating point
on low-dimensional datasets, such as Iris [41]. Following these
works, most recently, Jeff Johnson proposed a log float format
as a combination of the posit numerical format and exact
log-linear multiply-add (ELMA), which is the logarithmic
version of the exact MAC operation. This work shows that
it is possible to classify ImageNet with the ResNet DNN
architecture with <1% accuracy degradation [22].
This research builds on these earlier studies [19]–[22] and
extends low-precision arithmetic to both DNN training and
DNN inference with different quantization approaches for
both feedforward and convolutional neural networks on various
datasets.
IV. PROPOSED FRAMEWORK
The Cheetah framework, shown in Fig. 1, comprises a
two-level software component and a single-level hardware
component. The software framework is used to evaluate the
performance of various numerical formats and quantization
approaches by emulating low-precision DNN training and
inference. The hardware framework is a soft core implemented
on an FPGA and used to evaluate the hardware characteristics
of the multiply-and-accumulate (MAC) operations, the fundamental
computation in DNN models, coupled with various
quantization techniques. For each level, two optimization
stages are considered to convert the baseline DNN model, which uses
32-bit high-precision floating point with soft-core MACs, to a
low-precision DNN model with either posit, floating point,
or fixed-point arithmetic soft-core exact-MACs (EMACs).
[Figure 1 diagram. Labels: Software Framework (High-level), Software Framework (Low-level), Hardware Framework; DNN Models (Keras & TensorFlow); Arithmetic Library (C & C++); EMAC Softcore (VHDL); Numerical formats; Quantization approach; Accuracy Analysis; EDP Analysis. Example customer request: 3× EDP reduction compared to a 32-bit floating point DNN model with similar performance. Example Cheetah answer: use this configuration (8-bit posit, linear quantization).]
Figure 1: The Cheetah High-level Hardware & Software Co-design framework for DNNs on the edge. EDP: Energy-Delay
Product.
This optimization is performed iteratively, reducing the bit-precision
by one at each step; the performance degradation
and hardware complexity reduction achieved by a numerical
format in both DNN training and inference are computed and
compared with the specified design constraints (e.g. 3× EDP
reduction with similar performance). This iterative process is
repeated for the next numerical format after one of the de-
sign constraints is violated. Essentially, Cheetah approximates
the optimal bit-width for each numerical format based on
the performance and hardware complexity constraints. Note
that there is a priority between optimization approaches; the
numerical format parameter has a higher precedence in the
optimization process. This design decision is made to limit
the search space and the hardware complexity overhead of
the quantization approaches. In performing DNN inference,
the current version of Cheetah supports three low-precision
numerical formats (fixed-point, floating point and posit), two
quantization approaches (rounding and linear), and two DNN
models (feedforward and convolutional neural networks). To
perform DNN training on feedforward neural networks, Chee-
tah supports two numerical formats (floating point and posit)
with 32-bit and 16-bit precision. For brevity, the architecture
explained here is based on single hidden layer feedforward
neural network training and inference with the posit numerical
format for both rounding and linear quantization approaches,
as shown in Fig. 2.
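The iterative search described above can be sketched as a simple loop. This is a hedged illustration rather than Cheetah's actual implementation: `search_bit_widths` and the `meets_constraints` predicate are hypothetical names, and the predicate stands in for the accuracy and EDP checks, which in the framework require emulating the DNN and evaluating the EMAC soft core.

```python
def search_bit_widths(formats, meets_constraints, start_bits=8, min_bits=5):
    """For each format, lower the bit-width by one until a constraint is violated.

    Returns the smallest bit-width reached before the first violation,
    or None if even the starting bit-width violates the constraints.
    """
    chosen = {}
    for fmt in formats:                      # numerical format is the outer loop
        chosen[fmt] = None
        for bits in range(start_bits, min_bits - 1, -1):
            if not meets_constraints(fmt, bits):
                break                        # move on to the next format
            chosen[fmt] = bits
    return chosen


# Toy predicate (made-up limits, not measured results).
toy_limits = {"posit": 5, "float": 6, "fixed": 7}
result = search_bit_widths(
    ["posit", "float", "fixed"],
    lambda fmt, bits: bits >= toy_limits[fmt],
)
```

Placing the numerical format in the outer loop mirrors the stated precedence of the format parameter over the quantization approach, which keeps the search space small.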
A. Software Design and Exploration
In emulating feedforward and convolutional DNNs, the
output of each layer Y is calculated as in (5)
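\[
Y_j = B_j + \frac{1}{\alpha_1 \times \alpha_2} \times \left( \sum_{i}^{N} \left[ Q(\alpha_1 \times A_i) \right] \times \left[ Q(\alpha_2 \times W_{ij}) \right] \right) \tag{5}
\]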
where α1 and α2 are scale factors, Bj is the bias term, Ai is
the activation vector, Wij is the weight matrix, N indicates
the number of MAC operations, and Q(·) is the quantization
function. First, the feedforward or convolutional neural network
is trained with either 32- or 16-bit floating point or posit
numbers as shown by Fig. 5. To perform DNN inference, the
32-bit floating point high-precision learned weights and
activations are quantized to
either n-bit low-precision fixed-point, floating point, or posit
numbers (n ≤ 8).
In the quantization procedure, the values of α1 and α2
depend on the quantization approach. To perform rounding
quantization, α1 and α2 are both set to 1 and the 32-bit high-precision
floating point values that lie outside the dynamic range
of the chosen low-precision posit numerical format (e.g. 8-bit
posit) are clipped to either the format’s maximum
or minimum. During quantization by rounding, a value
that lies between two representable numbers is rounded
to the nearest one. To perform linear quantization, the
activations and weights are quantized to the range [−β, β] by
calculating α1 = β / Max(Ai) and setting α2 = 2β / (Max(Wi) − Min(Wi)).
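The two quantization approaches can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: a simple fixed-point quantizer (`quantize_fixed`, a hypothetical helper) stands in for Q(·), whereas Cheetah also supports posit and floating point targets; α1 is taken as β / Max(Ai), following the reading of the text above; and the inputs are assumed to have a positive maximum activation and distinct weight extremes so that no division by zero occurs.

```python
def quantize_fixed(x, n=8, frac_bits=4):
    """Clip x to the n-bit fixed-point range, then round to the nearest step."""
    step = 2.0 ** -frac_bits
    lo = -(2 ** (n - 1)) * step
    hi = (2 ** (n - 1) - 1) * step
    clipped = min(max(x, lo), hi)
    return round(clipped / step) * step


def linear_scales(activations, weights, beta):
    """Scale factors for linear quantization into the range [-beta, beta]."""
    alpha1 = beta / max(activations)
    alpha2 = 2 * beta / (max(weights) - min(weights))
    return alpha1, alpha2


def quantized_node(bias, activations, weights, alpha1=1.0, alpha2=1.0, q=quantize_fixed):
    """Equation (5) for one output node; alpha1 = alpha2 = 1 gives rounding quantization."""
    acc = sum(q(alpha1 * a) * q(alpha2 * w) for a, w in zip(activations, weights))
    return bias + acc / (alpha1 * alpha2)


acts, wts = [0.5, 1.0], [0.25, -0.5]
y_round = quantized_node(0.1, acts, wts)          # rounding quantization
a1, a2 = linear_scales(acts, wts, beta=4)
y_linear = quantized_node(0.1, acts, wts, a1, a2)  # linear quantization
```

With rounding quantization, out-of-range values saturate at the format's limits, whereas linear quantization first rescales the values so that they occupy the chosen range before Q(·) is applied.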
In the next step, the MAC operation is employed to calculate
Yj. To minimize arithmetic error, the MAC operation in this
paper is calculated using the EMAC algorithm [20]. In the
EMAC, to preserve precision in computing the products, the
posit weights and activations are multiplied in a posit format
without truncation or rounding at the end of multiplications. To
avoid rounding during accumulation, the products are stored
in a wide register, or quire in the posit literature, with a width
given by (6). The products are then converted to the fixed-point
format FX(mk, nk), where mk = 2^(es+1) × (n − 2) +
2 + ⌈log2(Nop)⌉ is the exponent bit-width and nk = 2^(es+1) ×
(n − 2) is the fraction bit-width. Finally, the Nop fixed-point
products are accumulated and the result is descaled in linear
quantization, again using α1 and α2, and converted back to
posit.
\[ w_q = \lceil \log_2(N_{op}) \rceil + 2^{es+2} \times (n - 2) + 2 \tag{6} \]
[Figure 2 diagram. Data flow: float32 activations A and weights W are scaled by α1 and α2 and quantized (Quantize) to Aq and Wq in posit, fixed, or float; Layer 1 is a fully-connected layer with exact MAC; the result is dequantized by 1 / (α1 × α2) (Dequantize) and passed to classification.]
Figure 2: The Cheetah software framework for feedforward neural networks with one hidden layer. The framework scales to
any DNN architecture.
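The benefit of deferring rounding until after accumulation can be shown with a small Python sketch. It is not the EMAC hardware algorithm: exact rational arithmetic from the standard `fractions` module stands in for the wide quire register, and the rounding function (multiples of 1/16) is an arbitrary stand-in for conversion back to a low-precision format. All function names are hypothetical; only `quire_width` follows a formula from the text, namely (6).

```python
import math
from fractions import Fraction


def quire_width(n, es, n_ops):
    """Quire width in bits for n-bit posits with es exponent bits, per (6)."""
    return math.ceil(math.log2(n_ops)) + 2 ** (es + 2) * (n - 2) + 2


def exact_dot(weights, activations, round_fn):
    """Accumulate exact products and round once at the end (EMAC-style)."""
    acc = Fraction(0)
    for w, a in zip(weights, activations):
        acc += Fraction(w) * Fraction(a)   # no rounding inside the loop
    return round_fn(float(acc))


def rounded_dot(weights, activations, round_fn):
    """Round after every multiplication and every addition (inexact MAC)."""
    acc = 0.0
    for w, a in zip(weights, activations):
        acc = round_fn(acc + round_fn(w * a))
    return acc


def round_to_sixteenths(x):
    """Illustrative low-precision rounding: nearest multiple of 1/16."""
    return round(x * 16) / 16


weights = [0.25] * 4
activations = [0.125] * 4
exact = exact_dot(weights, activations, round_to_sixteenths)
inexact = rounded_dot(weights, activations, round_to_sixteenths)
```

Here every individual product (1/32) rounds to zero, so the inexact MAC returns 0.0, while the exact accumulation returns 0.125. As an example of (6), `quire_width(8, 1, 784)` evaluates to 60 bits.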
Algorithm 1 Posit DOT operation for n-bit inputs each with
es exponent bits [20]
procedure POSITDOT(weight, activation)
    sign_w, reg_w, exp_w, frac_w ← DECODE(weight)
    sign_a, reg_a, exp_a, frac_a ← DECODE(activation)
    sf_w ← {reg_w, exp_w}                          ▷ Gather scale factors
    sf_a ← {reg_a, exp_a}
  Multiplication
    sign_mult ← sign_w ⊕ sign_a
    frac_mult ← frac_w × frac_a
    ovf_mult ← frac_mult[MSB]                      ▷ Adjust for overflow
    normfrac_mult ← frac_mult ≫ ovf_mult
    sf_mult ← sf_w + sf_a + ovf_mult
  Accumulation
    fracs_mult ← sign_mult ? −frac_mult : frac_mult
    sf_biased ← sf_mult + bias                     ▷ Bias the scale factor
    fracs_fixed ← fracs_mult ≪ sf_biased           ▷ Shift to fixed
    sum_quire ← fracs_fixed + sum_quire            ▷ Accumulate
  Fraction & SF Extraction
    sign_quire ← sum_quire[MSB]
    mag_quire ← sign_quire ? −sum_quire : sum_quire
    zc ← LEADINGZEROSDETECTOR(mag_quire)
    frac_quire ← mag_quire[2×(n−2−es)−1+zc : zc]
    sf_quire ← zc − bias
  Convergent Rounding & Encoding
    nzero ← |frac_quire
    sign_sf ← sf_quire[MSB]
    exp ← sf_quire[es−1 : 0]                       ▷ Unpack scale factor
    reg_tmp ← sf_quire[MSB−1 : es]
    reg ← sign_sf ? −reg_tmp : reg_tmp
    ovf_reg ← reg[MSB]                             ▷ Check for overflow
    reg_f ← ovf_reg ? {{⌈log2(n)⌉−2{1}}, 0} : reg
    exp_f ← (ovf_reg | ∼nzero | (&reg_f)) ? {es{0}} : exp
    tmp1 ← {nzero, 0, exp_f, frac_quire[MSB−1 : 0], {n−1{0}}}
    tmp2 ← {0, nzero, exp_f, frac_quire[MSB−1 : 0], {n−1{0}}}
    ovf_regf ← &reg_f
    if ovf_regf then
        shift_neg ← reg_f − 2
        shift_pos ← reg_f − 1
    else
        shift_neg ← reg_f − 1
        shift_pos ← reg_f
    end if
    tmp ← sign_sf ? tmp2 ≫ shift_neg : tmp1 ≫ shift_pos
    lsb, guard ← tmp[MSB−(n−2) : MSB−(n−1)]
    round ← ∼(ovf_reg | ovf_regf) ? ( guard & (lsb | (|tmp[MSB−n : 0])) ) : 0
    result_tmp ← tmp[MSB : MSB−n+1] + round
    result ← sign_quire ? −result_tmp : result_tmp
    return result
end procedure
B. Hardware Framework
The MAC operation, introduced earlier as the fundamental DNN
operation, calculates the weighted sum of a set of inputs. In
many implementations, this operation is inexact, i.e. arithmetic
error grows due to iterative rounding and truncation. The
EMAC mitigates this concern by adapting the concept of the
Kulisch accumulator [42]. The error due to rounding is deferred
until after the accumulation of all products, which is
especially beneficial for low-precision arithmetic. In the EMAC, as
mentioned beforehand, the fixed-point values of Nop products
are accumulated in a wide register sized as given by (6). The
posit EMAC, illustrated by Fig. 3, is parameterized by n, the
bit-width, and es, the number of exponent bits. “NaR” is not
considered as posits do not overflow or underflow and all DNN
parameters and data are real numbers. Algorithm 1 describes
the bitwise operation of the EMAC dot product. Each EMAC is
pipelined into three stages: multiplication, accumulation, and
rounding. For further details on EMACs and the exact dot
product, we suggest reviewing [20], [21], [42].
[Figure 3 diagram. Datapath labels: Weight, Activation, and Bias inputs (n bits each); Decode; multiplier; 2's Comp; Decode & Shift; BIAS; accumulator registers; Normalize; Round; Clip; Encode; Scale; Output.]
Figure 3: A parameterized (n total bits, es exponent bits) FPGA soft core design of the posit exact multiply-and-accumulate
(EMAC) operation [20].
V. SIMULATION RESULTS & ANALYSIS
The Cheetah software is implemented in the Keras [43]
and TensorFlow [44] frameworks. Rounding quantization,
linear quantization, and the EMAC operations with [5..32]-bit
precision fixed-point, floating point, and posit numbers
for DNN inference and {16, 32}-bit floating point and posit
numbers for DNN training are added to these frameworks
via software emulation. To reduce the search space of the α1
and α2 parameters, β is selected from {1, 2, 4, 8}, which still
provides, on average, wide coverage (∼82%) of the dynamic
range of each numerical format, as shown in Table I.
Table I: The dynamic range coverage of ≤8-bit posit, floating
point, and fixed-point numerical formats. The percentages are
calculated without considering “Not-a-Real” (NaR), infinity, and
“Not-a-Number” (NaN) values.
Format                 Dynamic Range (≤8-bit)
Posit (es=0)           94.12%
Posit (es=1)           81.57%
Posit (es=2)           69.02%
Float (we=4)           66.66%
Float (we=3)           85.71%
Fixed-point (nk=4)     100.0%
A. Exploiting Numerical Formats for DNN Inference
To evaluate Cheetah performance on DNN inference, a
feedforward neural network and several convolutional neural
networks are trained on three benchmarks with 32-bit
floating point. The specifications of these tasks and the baseline
performance are summarized in Table II. The accuracies of
performing DNN inference on these tasks with [5..8]-bit precision
in Cheetah are presented in Table III. The
results show that posit with [5..8]-bit precision (mostly es = 1)
outperforms the fixed-point and floating point formats (mostly
we = 4 exponent bits). For instance, the accuracy of
performing DNN inference on Fashion-MNIST is improved
by 5.14 and 4.17 percentage points with 5-bit posits in comparison to 5-bit
floating point and fixed-point, respectively. On the CIFAR-10
dataset, these gains are even more pronounced, with
5-bit posits showing improvements of 28.5 and 31.62 percentage points over
floating point and fixed-point, respectively. The benefits of
the posit numerical format are intuitively explained by the
nonlinear distribution of its values, similar to that of DNN
inference parameters. This hypothesis is explored empirically
by calculating the distortion rate of DNN inference parameters
with respect to each numerical format. The distortion rate
is described by (7), where P indicates the high-precision
parameters and Quant(P) represents the quantized parameters.
The results, as shown in Fig. 4, support the hypothesis,
especially at 5-bit precision where the distortion rate of posit
is significantly lower than that of the other numerical formats.
\[
d(R) = d(P, \mathrm{Quant}(P)) = \frac{1}{n} \sum_{i}^{n} \lVert P_i, \mathrm{Quant}(P_i) \rVert^{2} \tag{7}
\]
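A small Python sketch of the distortion rate follows. It assumes that the distance in (7) is the squared difference between each parameter and its quantized value, i.e. a mean squared error; this reading, the function names, the toy parameter values, and the two uniform quantizers (standing in for two numerical formats) are all assumptions made for illustration.

```python
def distortion_rate(params, quantize):
    """Mean squared distance between parameters and their quantized values."""
    return sum((p - quantize(p)) ** 2 for p in params) / len(params)


def make_uniform_quantizer(steps_per_unit):
    """Hypothetical quantizer: round to the nearest multiple of 1/steps_per_unit."""
    return lambda x: round(x * steps_per_unit) / steps_per_unit


params = [0.1, 0.25, 0.3]                       # toy "high-precision" parameters
d_coarse = distortion_rate(params, make_uniform_quantizer(8))
d_fine = distortion_rate(params, make_uniform_quantizer(16))
delta = d_fine - d_coarse                       # analogous to the deltas in Fig. 4
```

A negative difference means that the first format represents the parameters with less distortion than the second, which is how the layer-wise heatmaps in Fig. 4 compare posit against fixed-point and floating point.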
B. Exploiting Numerical Formats with Quantization Approaches
for DNN Inference
As mentioned before, quantization with rounding has less
overhead than the other quantization approaches,
but with 5-bit posits it does not allow DNN inference to reach
performance similar to that of 32-bit
floating point. To improve the performance of DNN inference, the
[5..8]-bit posit numerical format is combined with linear quantization
approaches and evaluated for a 4-layer feedforward
neural network on the MNIST and Fashion-MNIST datasets.
The α1 ×Ai and α2 ×Wij in (5) can be either implemented
by constant multiplication or by a shift operation where the
7Table II: Specifications of the benchmark tasks and performance on a baseline 32-bit floating point network
Dataset Layers1 # Parameters # EMAC Ops2 Memory Accuracy
MNIST
4 FC 0.34 M 0.78 k 1.34 MB 98.46%
2 Conv, 2 FC, 1 PL 1.40 M 58.7 k 5.84 MB 99.32%
Fashion-MNIST
4 FC 0.34 M 0.78 k 1.34 MB 89.51%
2 Conv, 3 FC, 2 PL, 1 BN 1.88 M 69.8 k 7.77 MB 92.54%
CIFAR-10 7 Conv, 1 FC, 3 PL 0.95 M 312.6 k 6.23 MB 81.37%
1 Conv: 2D convolutional layer; FC: fully-connected layer; PL: max/avg. pooling layer; BN: batch
normalization layer.
2 The number of EMAC operations for a single sample.
Table III: Cheetah accuracy on three datasets with [5..8]-bit precision compared to fixed and float (respective best results are
when posit has es ∈ {0, 1, 2} and floating point with exponent bit-width we ∈ {3, 4}).
Dataset DNN
Posit Float Fixed
8-bit 7-bit 6-bit 5-bit 8-bit 7-bit 6-bit 5-bit 8-bit 7-bit 6-bit 5-bit
MNIST
FC 98.45% 98.39% 98.37% 98.30% 98.42% 98.39% 98.33% 93.91% 98.31% 97.95% 97.87% 97.88%
Conv 99.35% 99.33% 99.20% 98.94% 99.34% 99.25% 99.12% 92.27% 99.18% 97.14% 97.08% 96.96%
Fashion FC 89.59% 89.44% 89.24% 88.14% 89.56% 89.36% 88.92% 83.00% 89.16% 87.27% 85.20% 83.97%
MNIST Conv 92.70% 92.60% 91.64% 88.92% 92.63% 92.22% 89.58% 68.21% 89.59% 88.63% 85.31% 83.46%
CIFAR-10 Conv 80.40% 76.90% 68.51% 41.33% 79.75% 76.09% 53.68% 12.83% 24.27% 17.43% 12.54% 9.71%
(a)
dense_1 dense_2 dense_3 dense_4 overall avg.
5
6
7
8
Bi
t-p
re
cis
io
n
-5e-03
-2e-03
0e+00
3e-03
5e-03
(b)
dense_1 dense_2 dense_3 dense_4 overall avg.
5
6
7
8
Bi
t-p
re
cis
io
n
-6e-03
-3e-03
0e+00
3e-03
6e-03
(c)
dense_1 dense_2 dense_3 dense_4 overall avg.
5
6
7
8
Bi
t-p
re
cis
io
n
-6e-03
-3e-03
0e+00
3e-03
6e-03
(d)
dense_1 dense_2 dense_3 dense_4 overall avg.
5
6
7
8
Bi
t-p
re
cis
io
n
-6e-03
-3e-03
0e+00
3e-03
6e-03
(e)
co
nv
2d
_1
co
nv
2d
_2
co
nv
2d
_3
co
nv
2d
_4
co
nv
2d
_5
co
nv
2d
_6
co
nv
2d
_7
de
ns
e_
1
ov
er
al
l a
vg
.
5
6
7
8
Bi
t-p
re
cis
io
n
-8e-04
-4e-04
0e+00
4e-04
8e-04
(f)
co
nv
2d
_1
co
nv
2d
_2
co
nv
2d
_3
co
nv
2d
_4
co
nv
2d
_5
co
nv
2d
_6
co
nv
2d
_7
de
ns
e_
1
ov
er
al
l a
vg
.
5
6
7
8
Bi
t-p
re
cis
io
n
-8e-03
-4e-03
0e+00
4e-03
8e-03
Figure 4: Layer-wise delta distortion rate ∆(d(R)) heatmaps compare the precision (rates) of [5..8]-bit numerical formats for
representing 32-bit floating point DNN parameters. The average ∆(d(R)) among all weights in a DNN are shown in the final
column of each heatmap. (a) d(R)posit−d(R)fixed for the MNIST task; (b) d(R)posit−d(R)fixed for the Fashion MNIST task;
(c) d(R)posit − d(R)fixed for the CIFAR-10 task; (d) d(R)posit − d(R)float for the MNIST task; (e) d(R)posit − d(R)float for
the Fashion MNIST task; (f) d(R)posit − d(R)float for the CIFAR-10 task.
Table IV: Comparison of different quantization approaches. Accuracy on MNIST (top) and Fashion-MNIST (bottom) with
[5..8]-bit precision for posit with es ∈ {0, 1, 2}, fixed-point, and floating point with exponent bit-width we ∈ {3, 4}.
MNIST
Numerical Format | Rounding Quantization | Linear Quantization with Multiplication | Linear Quantization with Shift
 | 8-bit 7-bit 6-bit 5-bit | 8-bit 7-bit 6-bit 5-bit | 8-bit 7-bit 6-bit 5-bit
Posit (es = 0) | 98.42% 98.37% 98.30% 91.05% | 98.46% 98.48% 98.46% 98.19% | 98.48% 98.46% 98.39% 98.28%
Posit (es = 1) | 98.45% 98.39% 98.34% 98.30% | 98.49% 98.47% 98.42% 98.34% | 98.48% 98.42% 98.38% 98.42%
Posit (es = 2) | 98.44% 98.39% 98.37% 98.16% | 98.45% 98.49% 98.38% 97.96% | 98.46% 98.41% 98.41% 98.13%
Fixed-point | 98.31% 97.95% 97.87% 97.88% | 98.47% 98.32% 98.11% 96.41% | 98.42% 98.29% 98.16% 97.17%
Floating point | 98.42% 98.39% 98.33% 93.91% | 98.46% 98.42% 98.36% 98.02% | 98.46% 98.45% 98.38% 98.06%
32-bit Floating point | 98.46% | 98.46% | 98.46%
Fashion-MNIST
Numerical Format | Rounding Quantization | Linear Quantization with Multiplication | Linear Quantization with Shift
 | 8-bit 7-bit 6-bit 5-bit | 8-bit 7-bit 6-bit 5-bit | 8-bit 7-bit 6-bit 5-bit
Posit (es = 0) | 89.57% 89.21% 88.46% 76.87% | 89.64% 89.58% 89.36% 88.17% | 89.59% 89.61% 88.31% 88.10%
Posit (es = 1) | 89.59% 89.44% 89.22% 88.14% | 89.58% 89.52% 89.35% 88.98% | 89.58% 89.45% 89.48% 89.07%
Posit (es = 2) | 89.56% 89.33% 89.24% 87.07% | 89.53% 89.55% 88.98% 87.06% | 89.49% 89.52% 89.18% 87.06%
Fixed-point | 89.16% 87.27% 85.20% 83.97% | 89.52% 88.83% 87.46% 76.58% | 89.40% 88.93% 87.10% 82.10%
Floating point | 89.56% 89.36% 88.92% 83.00% | 89.59% 89.45% 89.00% 87.25% | 89.73% 89.32% 88.86% 87.37%
32-bit Floating point | 89.51% | 89.51% | 89.51%
α1 and α2 values are approximated by a power of two. The
results in Table IV show that, with linear quantization, 5-bit
DNN inference achieves accuracy close to that of 32-bit
floating point DNN inference on the MNIST dataset (e.g.,
98.42% for 5-bit posits with es = 1 and shift-based scaling
vs. 98.46%). Essentially, this approach zeroes out the
quantization error produced by values that lie outside of
posit's dynamic range. Linear quantization also plays a key
role in reducing the hardware complexity of the posit EMACs
used for DNN inference. Notably, linear quantization
significantly improves the accuracy of DNN inference with
posits over quantization with rounding, most visibly at 5 bits
with es = 0 (98.28% vs. 91.05% on MNIST with shift-based
scaling). Therefore, the overhead of adding linear quantization
is offset by the reduced hardware complexity, i.e., carrying
out the posit EMAC operation with es = 0 instead of es = 1,
which is explained in depth in the next section.
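The shift-based variant can be illustrated with a minimal sketch. It does not reproduce the paper's α1 and α2 definitions. It only assumes that a tensor is scaled by a single power-of-two factor before quantization and scaled back afterwards, so that both scalings reduce to bit shifts in hardware. All names (`power_of_two_scale`, `fixed_point_round`, `linear_quantize_shift`) and the choice of a fixed-point target format are hypothetical.

```python
import math


def power_of_two_scale(values, target_max):
    """Return the power of two closest (in log2) to target_max / max|v|."""
    peak = max(abs(v) for v in values)
    if peak == 0:
        return 1.0
    return 2.0 ** round(math.log2(target_max / peak))


def fixed_point_round(x, frac_bits, total_bits):
    """Illustrative signed fixed-point quantizer: round to nearest, saturate."""
    step = 2.0 ** -frac_bits
    lo = -(2 ** (total_bits - 1)) * step
    hi = (2 ** (total_bits - 1) - 1) * step
    return min(max(round(x / step) * step, lo), hi)


def linear_quantize_shift(values, quantizer, target_max):
    """Scale by a power of two, quantize, then undo the scaling."""
    scale = power_of_two_scale(values, target_max)
    return [quantizer(v * scale) / scale for v in values]


def quantizer_5bit(x):
    # Assumed format: 5-bit fixed-point with 3 fractional bits.
    return fixed_point_round(x, frac_bits=3, total_bits=5)


weights = [0.011, -0.02, 0.005]
direct = [quantizer_5bit(w) for w in weights]                 # all collapse to 0
scaled = linear_quantize_shift(weights, quantizer_5bit, 1.0)  # structure is kept
```

In this toy case the small weights all round to zero when quantized directly, while the scaled version preserves their relative magnitudes, which is the effect the paragraph above attributes to linear quantization.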
C. Exploiting Posit and Floating Point for DNN Training
To compare the efficacy of the posit numerical format with
that of the floating point numerical format for training, a
4-layer feedforward neural network is trained with each number
system on the MNIST and Fashion-MNIST datasets. As shown in
Table V, posits are slightly more accurate than floats at
32-bit precision (e.g., 98.131% vs. 98.087% on MNIST), and
16-bit posits clearly outperform 16-bit floats (96.535% vs.
90.646% on MNIST). Although Cheetah is evaluated on small
datasets, there are two advantages compared to [14], [36].
Mellempudi et al. [36] use 32-bit numbers for accumulation to
reduce the hardware cost of stochastic rounding. Wang et al.
[14] reduce the accumulation bit-precision to 16 by using
stochastic rounding. In this paper, by contrast, we show the
potential of using 16-bit posits for all DNN parameters with a
simple and hardware-friendly round-to-nearest algorithm, with
less than 2% accuracy degradation relative to 32-bit training
(Table V) and without exhaustively analyzing DNN training
parameters.
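To make round-to-nearest posit quantization concrete, the following sketch decodes every bit pattern of a small posit format and maps a real value to the closest representable one. It follows the standard posit layout (sign, regime, up to es exponent bits, fraction) but is an illustration, not the Cheetah implementation. The function names are hypothetical, ties are broken by table order rather than by the round-to-nearest-even rule on bit patterns, and the posit convention of never rounding a nonzero value to zero is not modeled.

```python
def decode_posit(bits, n, es):
    """Decode an n-bit posit bit pattern (given as an int) into a float."""
    mask = (1 << n) - 1
    bits &= mask
    if bits == 0:
        return 0.0
    if bits == 1 << (n - 1):
        return float("nan")  # NaR (not a real)
    negative = (bits >> (n - 1)) == 1
    if negative:
        bits = (-bits) & mask  # two's complement gives the magnitude pattern
    # The n-1 bits after the sign bit, most significant first.
    body = [(bits >> i) & 1 for i in range(n - 2, -1, -1)]
    run = 1
    while run < len(body) and body[run] == body[0]:
        run += 1
    k = run - 1 if body[0] == 1 else -run  # regime value
    rest = body[run + 1:]  # skip the bit that terminates the regime
    exp_bits = rest[:es]
    exp_bits = exp_bits + [0] * (es - len(exp_bits))  # truncated bits read as 0
    frac_bits = rest[es:]
    exponent = int("".join(map(str, exp_bits)) or "0", 2)
    fraction = int("".join(map(str, frac_bits)) or "0", 2) / (1 << len(frac_bits))
    value = 2.0 ** (k * (1 << es) + exponent) * (1.0 + fraction)
    return -value if negative else value


def posit_values(n, es):
    """All real values representable by an n-bit posit (NaR excluded), sorted."""
    nar = 1 << (n - 1)
    return sorted(decode_posit(b, n, es) for b in range(1 << n) if b != nar)


def round_to_nearest_posit(x, table):
    """Return the table entry closest to x; values beyond maxpos saturate."""
    return min(table, key=lambda v: abs(v - x))


table_5_1 = posit_values(5, 1)
nearest = round_to_nearest_posit(1.3, table_5_1)
```

Enumerating the value table is only practical at low bit-widths such as the [5..8]-bit range used for inference; a 16-bit training format would need a direct encoder instead.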
Table V: Average accuracy over 10 independent runs on the
test set of the respective dataset. Networks are trained using
only the specified numerical format.
Task | Format | Accuracy
MNIST | Posit-32 | 98.131%
MNIST | Float-32 | 98.087%
MNIST | Posit-16 | 96.535%
MNIST | Float-16 | 90.646%
Fashion MNIST | Posit-32 | 89.263%
Fashion MNIST | Float-32 | 89.105%
Fashion MNIST | Posit-16 | 87.400%
Fashion MNIST | Float-16 | 81.725%
D. EMAC Soft-Core FPGA Implementation
To show the effectiveness of the posit numerical format
over floating point and fixed-point, we evaluate the trade-off
between the energy-delay-product and latency of the EMAC
operation vs. the average accuracy degradation from 32-bit
floating point per bit-width across the three datasets (two for
the linear-quantization experiment) with the Cheetah framework,
as shown in Figs. 5, 6, 7, 8, and 9. The energy-delay-product,
a combined measure of the latency and resource cost of the
EMAC operation, is measured for all numerical formats, both
for the EMAC coupled with rounding quantization [20] and for
the EMAC coupled with linear quantization, on a Virtex-7 FPGA
(xc7vx485t-2ffg1761c) with synthesis through Vivado 2017.2.
Note that the average accuracy degradation per bit-width is
computed using the accuracy results in Table IV.
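As a worked example of this metric, the sketch below computes the average accuracy degradation for the 5-bit rounding-quantization entries of Table IV. It assumes the degradation is the plain arithmetic mean, over tasks, of the 32-bit floating point accuracy minus the low-precision accuracy, in percentage points. The names are hypothetical, and because Table IV does not include CIFAR-10, these numbers need not match the values plotted in the figures.

```python
# 32-bit floating point baselines from Table IV (percent).
BASELINE = {"MNIST": 98.46, "Fashion-MNIST": 89.51}

# 5-bit accuracies with rounding quantization from Table IV (percent).
ROUNDING_5BIT = {
    "Posit (es = 1)": {"MNIST": 98.30, "Fashion-MNIST": 88.14},
    "Fixed-point": {"MNIST": 97.88, "Fashion-MNIST": 83.97},
    "Floating point": {"MNIST": 93.91, "Fashion-MNIST": 83.00},
}


def average_degradation(accuracy, baseline):
    """Mean over tasks of (baseline accuracy - low-precision accuracy)."""
    drops = [baseline[task] - accuracy[task] for task in baseline]
    return sum(drops) / len(drops)


degradation = {
    fmt: average_degradation(acc, BASELINE) for fmt, acc in ROUNDING_5BIT.items()
}
for fmt, value in degradation.items():
    print(f"{fmt}: {value:.3f} percentage points")
```

Under these assumptions the 5-bit posit with es = 1 loses well under one percentage point on average, while 5-bit fixed-point and floating point lose several.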
Figure 5: The average accuracy degradation from 32-bit floating point across the two classification tasks vs. the energy-delay-
product of the respective EMAC with rounding quantization. Each <x, y> pair indicates the number of bits and corresponding
parameter bit-width, as indicated in the legend. A star (★) denotes the lowest accuracy degradation for a numerical format and
bit-width.
Figure 6: The average accuracy degradation from 32-bit floating point across the two classification tasks vs. the latency of the
respective EMAC with rounding quantization. Each <x, y> pair indicates the number of bits and corresponding parameter
bit-width, as indicated in the legend. A star (★) denotes the lowest accuracy degradation for a numerical format and bit-width.
Figure 7: The average accuracy degradation from 32-bit floating point across the two classification tasks vs. the cost of the
respective EMAC with rounding quantization. Each <x, y> pair indicates the number of bits and corresponding parameter
bit-width (fractional bits or exponent), as labeled along the x-axis.
The results, as shown by Fig. 5, indicate that posit coupled
with rounding quantization achieves up to 23% average
accuracy improvement over fixed-point. However, this accuracy
enhancement is gained at the cost of a 0.41 × 10^−10 increase
in energy-delay-product to implement the EMAC unit. Posit
also consistently shows better performance, especially at 5-bit
precision, compared to the floating point number system at a
comparable energy-delay-product. The posit EMAC operation
achieves lower latencies, as shown in Fig. 6, due to a lack of
subnormal detection and other exception cases, but exhibits
resource-hungry encoding and decoding due to the
variable-length regime of the posit numerical format, as shown
in Fig. 7.
Overall, the 6-bit posit shows the best trade-off between
energy-delay-product and average accuracy degradation from
32-bit floating point on the two benchmarks (when analyzed
across the [5..8]-bit range). Looking at the posit numerical
format in terms of classification performance and EMAC
energy-delay-product, posits with es = 1 provide a better
trade-off compared to posits with es ∈ {0, 2}. At [5..7]-bit
precision, the average performance of DNN inference with
es = 1 among the three datasets is 2% and 4% better than
with es = 2 and es = 0, respectively. These accuracy
benefits come with 2.1× less energy-delay-product than
es = 2 and 1.4× more energy-delay-product than es = 0.
These results are obtained with rounding quantization. Linear
quantization with the shift operation requires similar hardware
overhead across all of the numerical formats, as shown in
Figs. 8 and 9. However, with linear quantization, the accuracy
of DNN inference with posits at es = 0 is similar to the
accuracy at es = 1. Therefore, it is possible to use EMACs
with es = 0 instead of es = 1 and thereby achieve 18%
energy-delay-product savings.
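For readers unfamiliar with the EMAC units compared above, the sketch below illustrates the idea of exact multiply-and-accumulate in software: products are summed without intermediate rounding and the result is rounded once at the end. It is a behavioral illustration only, under the assumption that this deferred rounding is what distinguishes an EMAC from a conventional MAC. A uniform-step quantizer stands in for the actual number format, and all names are hypothetical.

```python
from fractions import Fraction


def quantize(x, frac_bits):
    """Round x to the nearest multiple of 2**-frac_bits (stand-in format)."""
    step = Fraction(1, 2 ** frac_bits)
    return round(Fraction(x) / step) * step


def emac_dot(weights, activations, frac_bits):
    """Accumulate exact products, then round a single time at the end."""
    acc = Fraction(0)
    for w, a in zip(weights, activations):
        acc += Fraction(w) * Fraction(a)  # no rounding inside the loop
    return quantize(acc, frac_bits)


def rounded_mac_dot(weights, activations, frac_bits):
    """Conventional MAC model: round every product and every partial sum."""
    acc = Fraction(0)
    for w, a in zip(weights, activations):
        product = quantize(Fraction(w) * Fraction(a), frac_bits)
        acc = quantize(acc + product, frac_bits)
    return acc


weights = [0.25, 0.25, 0.25, 0.25]
activations = [0.125, 0.125, 0.125, 0.125]
exact = emac_dot(weights, activations, frac_bits=3)           # Fraction(1, 8)
stepwise = rounded_mac_dot(weights, activations, frac_bits=3)  # Fraction(0, 1)
```

Here each product (1/32) is below half a quantization step, so per-operation rounding discards all of them, while the exact accumulator recovers the true sum of 1/8. The wide accumulator is also part of the hardware cost that the energy-delay-product figures account for.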
A summary of previous studies that propose low-precision
frameworks is shown in Table VI. Several research groups
have explored the efficacy of floats and fixed-point on the
performance and hardware complexity of DNNs with multiple
image classification tasks [14]–[16], [32], [34], [35]. However,
none of these works analyze the appropriateness of the posit
numerical format for both DNN training and inference.
Additionally, existing work does not offer insight into the
impact of the quantization approach vs. the numerical format
on both accuracy and hardware complexity, as investigated in
this paper.
VI. CONCLUSIONS
A low-precision DNN framework, Cheetah, for edge devices
is proposed in this work. We explored the capacity of various
numerical formats, including floating point, fixed-point and
posit, for both DNN training and inference. We show that
the recent posit numerical format has high efficacy for DNN
training at {16, 32}-bit precision and inference at ≤8-bit pre-
cision. Moreover, we show that it is possible to achieve better
performance and reduce energy consumption by using linear
quantization with the posit numerical format. The success of
low-precision posits in reducing DNN hardware complexity
with negligible accuracy degradation motivates us to evaluate
ultra-low precision training in future work.
[Figure 8 plot: x-axis: Avg. Degradation (%), 0 to 50; y-axis: Energy-Delay-Product (log scale); legend (Numerical Format): Fixed <N,Q>, Float <N,we>, Posit <N,es>.]
Figure 8: The average accuracy degradation from 32-bit floating point across the two classification tasks vs. the energy-delay-
product of the respective EMAC with linear quantization. Each <x, y> pair indicates the number of bits and corresponding
parameter bit-width, as indicated in the legend. A star (★) denotes the lowest accuracy degradation for a numerical format and
bit-width.
[Figure 9 plot: x-axis: Avg. Degradation (%), 0 to 50; y-axis: Latency (s); legend (Numerical Format): Fixed <N,Q>, Float <N,we>, Posit <N,es>.]
Figure 9: The average accuracy degradation from 32-bit floating point across the two classification tasks vs. the latency of
the respective EMAC with linear quantization. Each <x, y> pair indicates the number of bits and corresponding parameter
bit-width, as indicated in the legend. A star (★) denotes the lowest accuracy degradation for a numerical format and bit-width.
Table VI: High-level summary of Cheetah and other low-precision frameworks. All datasets are image classification tasks. WI
BC: Wisconsin Breast Cancer; FMNIST: Fashion MNIST; FP: floating point; FX: fixed-point; PS: posit; SW: software; HW:
hardware.
 | Courbariaux et al. [45] | Gysel et al. [16] | Hashemi et al. [15] | Carmichael et al. [20] | Wang et al. [14] | Johnson et al. [22] | This Work
Dataset | MNIST, CIFAR-10, SVHN | ImageNet | MNIST, CIFAR-10, SVHN | WI BC, Iris, Mushroom, MNIST, FMNIST | ImageNet | ImageNet | MNIST, FMNIST, CIFAR-10
Numerical Format | FP, FX, BFP | FP, FX, BFP | FP, FX, Binary | FP, FX, PS | FP | FX, FP, PS | FX, FP, PS
Bit-precision | 12 | 8 | All | [5..8] | All | 8 | [5..8]
Utility | Training | Inference | Inference | Inference | Training | Inference | Inference & Training
Inference Quantization | - | Rounding | Rounding | Rounding | - | Log | Rounding & Linear
Implementation | SW | SW & HW | SW & HW | SW & HW | SW & HW | SW & HW | SW & HW
DNN library | Theano | Caffe | Caffe | Keras/TensorFlow | Home Suite | PyTorch | Keras/TensorFlow
Device | - | ASIC | ASIC | Virtex-7 FPGA | ASIC | ASIC | Virtex-7 FPGA
Technology Node | - | 65 nm | 65 nm | 28 nm | 14 nm | 28 nm | 28 nm
REFERENCES
[1] E. Li, Z. Zhou, and X. Chen, “Edge intelligence: On-demand deep
learning model co-inference with device-edge synergy,” in Proceedings
of the 2018 Workshop on Mobile Edge Communications. ACM, 2018,
pp. 31–36.
[2] W. Shi, J. Cao, Q. Zhang, Y. Li, and L. Xu, “Edge computing: Vision
and challenges,” IEEE Internet of Things Journal, vol. 3, no. 5, pp.
637–646, 2016.
[3] M. Satyanarayanan, “The emergence of edge computing,” Computer,
vol. 50, no. 1, pp. 30–39, 2017.
[4] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet
classification with deep convolutional neural networks,” in Advances
in Neural Information Processing Systems 25: 26th Annual
Conference on Neural Information Processing Systems, NeurIPS,
P. L. Bartlett, F. C. N. Pereira, C. J. C. Burges, L. Bottou,
and K. Q. Weinberger, Eds., Lake Tahoe, Nevada, USA, Dec.
2012, pp. 1106–1114. [Online]. Available: http://papers.nips.cc/paper/
4824-imagenet-classification-with-deep-convolutional-neural-networks
[5] M. Horowitz, “1.1 computing’s energy problem (and what we can do
about it),” in 2014 IEEE international solid-state circuits conference
digest of technical papers (ISSCC). IEEE, 2014, pp. 10–14.
[6] X. Xu, Y. Ding, S. X. Hu, M. Niemier, J. Cong et al., “Scaling for edge
inference of deep neural networks,” Nature Electronics, vol. 1, no. 4, p.
216, 2018.
[7] C.-J. Wu, D. Brooks, K. Chen, D. Chen, S. Choudhury et al., “Ma-
chine learning at facebook: Understanding inference at the edge,” in
2019 IEEE International Symposium on High Performance Computer
Architecture (HPCA). IEEE, 2019, pp. 331–344.
[8] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang et al.,
“Mobilenets: Efficient convolutional neural networks for mobile vision
applications,” arXiv preprint arXiv:1704.04861, 2017.
[9] Y. Chen, H. Fang, B. Xu, Z. Yan, Y. Kalantidis et al., “Drop an
octave: Reducing spatial redundancy in convolutional neural networks
with octave convolution,” arXiv preprint arXiv:1904.05049, 2019.
[10] M. Cho and D. Brand, “MEC: Memory-efficient convolution for deep
neural network,” in Proceedings of the 34th International Conference
on Machine Learning, ICML, ser. Proceedings of Machine Learning
Research, D. Precup and Y. W. Teh, Eds., vol. 70. Sydney, NSW,
Australia: PMLR, Aug. 2017, pp. 815–824. [Online]. Available:
http://proceedings.mlr.press/v70/cho17a.html
[11] M. Ren, A. Pokrovsky, B. Yang, and R. Urtasun, “Sbnet: Sparse blocks
network for fast inference,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, 2018, pp. 8711–8720.
[12] B. Zhou, Y. Sun, D. Bau, and A. Torralba, “Revisiting the importance of
individual units in cnns via ablation,” arXiv preprint arXiv:1806.02891,
2018.
[13] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang et al., “Quantization
and training of neural networks for efficient integer-arithmetic-only
inference,” in The IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), June 2018.
[14] N. Wang, J. Choi, D. Brand, C.-Y. Chen, and K. Gopalakrishnan,
“Training deep neural networks with 8-bit floating point numbers,” in
Advances in neural information processing systems, 2018, pp. 7686–
7695.
[15] S. Hashemi, N. Anthony, H. Tann, R. I. Bahar, and S. Reda,
“Understanding the impact of precision quantization on the accuracy
and energy of neural networks,” in Design, Automation & Test in
Europe Conference & Exhibition, DATE, D. Atienza and G. D.
Natale, Eds. Lausanne, Switzerland: IEEE, Mar. 2017, pp. 1474–1479.
[Online]. Available: https://doi.org/10.23919/DATE.2017.7927224
[16] P. Gysel, J. Pimentel, M. Motamedi, and S. Ghiasi, “Ristretto: A frame-
work for empirical study of resource-efficient inference in convolutional
neural networks,” IEEE Transactions on Neural Networks and Learning
Systems, 2018.
[17] Y. Guo, “A survey on methods and theories of quantized neural net-
works,” arXiv preprint arXiv:1808.04752, 2018.
[18] R. Krishnamoorthi, “Quantizing deep convolutional networks for effi-
cient inference: A whitepaper,” arXiv preprint arXiv:1806.08342, 2018.
[19] S. H. F. Langroudi, T. Pandit, and D. Kudithipudi, “Deep learning infer-
ence on embedded devices: Fixed-point vs posit,” in 2018 1st Workshop
on Energy Efficient Machine Learning and Cognitive Computing for
Embedded Applications (EMC2), March 2018, pp. 19–23.
[20] Z. Carmichael, H. F. Langroudi, C. Khazanov, J. Lillie, J. L.
Gustafson, and D. Kudithipudi, “Deep positron: A deep neural
network using the posit number system,” in Design, Automation
& Test in Europe Conference & Exhibition, DATE. Florence,
Italy: IEEE, Mar. 2019, pp. 1421–1426. [Online]. Available: https:
//doi.org/10.23919/DATE.2019.8715262
[21] Z. Carmichael, H. F. Langroudi, C. Khazanov, J. Lillie, J. L. Gustafson,
and D. Kudithipudi, “Performance-efficiency trade-off of low-precision
numerical formats in deep neural networks,” in Proceedings of the
Conference for Next Generation Arithmetic, ser. CoNGA’19. Singapore,
Singapore: ACM, 2019, pp. 3:1–3:9.
[22] J. Johnson, “Rethinking floating point for deep learning,” arXiv preprint
arXiv:1811.01721, 2018.
[23] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning
applied to document recognition,” Proceedings of the IEEE, vol. 86,
no. 11, pp. 2278–2324, 1998.
[24] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press,
2016, http://www.deeplearningbook.org.
[25] J. L. Gustafson and I. T. Yonemoto, “Beating floating point at its
own game: Posit arithmetic,” Supercomputing Frontiers and Innovations,
vol. 4, no. 2, pp. 71–86, 2017.
[26] W. Tichy, “Unums 2.0: An interview with John L. Gustafson,” Ubiquity,
vol. 2016, no. September, p. 1, 2016.
[27] H. P. Graf, L. D. Jackel, and W. E. Hubbard, “VLSI implementation of
a neural network model,” IEEE Computer, vol. 21, no. 3, pp. 41–49,
1988. [Online]. Available: https://doi.org/10.1109/2.30
[28] A. Iwata, Y. Yoshida, S. Matsuda, Y. Sato, and N. Suzumura, “An arti-
ficial neural network accelerator using general purpose 24 bits floating
point digital signal processors,” in International Joint Conference on
Neural Networks, IJCNN, vol. 2, 1989, pp. 171–175.
[29] D. W. Hammerstrom, “A VLSI architecture for high-performance, low-
cost, on-chip learning,” in IJCNN 1990, International Joint Conference
on Neural Networks. San Diego, CA, USA: IEEE, Jun. 1990, pp. 537–
544. [Online]. Available: https://doi.org/10.1109/IJCNN.1990.137621
[30] K. Asanovic and N. Morgan, “Experimental determination of precision
requirements for back-propagation training of artificial neural networks,”
in Proceedings of the 2nd International Conference on Microelectronics
for Neural Networks, 1991, pp. 9–15.
[31] C. M. Bishop, “Training with noise is equivalent to tikhonov regular-
ization,” Neural computation, vol. 7, no. 1, pp. 108–116, 1995.
[32] M. Courbariaux, Y. Bengio, and J. David, “Low precision arithmetic for
deep learning,” in Workshop Track Proceedings of the 3rd International
Conference on Learning Representations, ICLR, Y. Bengio and
Y. LeCun, Eds., San Diego, CA, USA, May 2015. [Online]. Available:
http://arxiv.org/abs/1412.7024
[33] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, “Deep
learning with limited numerical precision,” in Proceedings of the 32nd
International Conference on Machine Learning, ICML, ser. JMLR
Workshop and Conference Proceedings, F. R. Bach and D. M. Blei,
Eds., vol. 37. Lille, France: JMLR.org, Jul. 2015, pp. 1737–1746.
[Online]. Available: http://proceedings.mlr.press/v37/gupta15.html
[34] P. Micikevicius, S. Narang, J. Alben, G. F. Diamos, E. Elsen et al.,
“Mixed precision training,” in Conference Track Proceedings of the
6th International Conference on Learning Representations, ICLR.
Vancouver, BC, Canada: OpenReview.net, 2018. [Online]. Available:
https://openreview.net/forum?id=r1gs9JgRZ
[35] U. Köster, T. Webb, X. Wang, M. Nassar, A. K. Bansal et al., “Flexpoint:
An adaptive numerical format for efficient training of deep neural
networks,” in Advances in Neural Information Processing Systems, 2017,
pp. 1742–1752.
[36] N. Mellempudi, S. Srinivasan, D. Das, and B. Kaul, “Mixed precision
training with 8-bit floating point,” arXiv preprint arXiv:1905.12334,
2019.
[37] D. Kalamkar, D. Mudigere, N. Mellempudi, D. Das, K. Banerjee et al.,
“A study of BFLOAT16 for deep learning training,” arXiv preprint
arXiv:1905.12322, 2019.
[38] E. Chung, J. Fowers, K. Ovtcharov, M. Papamichael, A. Caulfield et al.,
“Serving DNNs in real time at datacenter scale with Project Brainwave,”
IEEE Micro, vol. 38, no. 2, pp. 8–20, 2018.
[39] J. H. Wilkinson, “Rounding errors in algebraic processes,” in IFIP
Congress, 1959, pp. 44–53.
[40] F. de Dinechin, L. Forget, J.-M. Muller, and Y. Uguen, “Posits: the
good, the bad and the ugly,” Dec. 2018, working paper or preprint.
[Online]. Available: https://hal.inria.fr/hal-01959581
[41] R. A. Fisher, “The use of multiple measurements in taxonomic prob-
lems,” Annals of eugenics, vol. 7, no. 2, pp. 179–188, 1936.
[42] U. Kulisch, Computer arithmetic and validity: theory, implementation,
and applications, 1st ed., ser. de Gruyter Studies in Mathematics.
Berlin, New York, USA: Walter de Gruyter, 2008, vol. 33.
[43] F. Chollet et al., “Keras,” https://github.com/keras-team/keras, 2015.
[44] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen et al.,
“TensorFlow: Large-scale machine learning on heterogeneous systems,”
2015. [Online]. Available: https://www.tensorflow.org/
[45] M. Courbariaux, Y. Bengio, and J.-P. David, “Training deep neu-
ral networks with low precision multiplications,” arXiv preprint
arXiv:1412.7024, 2014.
