Deep Learning Training on the Edge with Low-Precision Posits by Langroudi, Hamed F. et al.
Deep Learning Training on the Edge with
Low-Precision Posits
Hamed F. Langroudi, Zachariah Carmichael, Dhireesha Kudithipudi
(Preprint)
Abstract—Recently, the posit numerical format has shown
promise for DNN data representation and compute with
ultra-low precision ([5..8]-bit). However, the majority of studies
focus only on DNN inference. In this work, we propose DNN training
using posits and compare it with floating point training. We
evaluate on both the MNIST and Fashion MNIST corpora, where
16-bit posits outperform 16-bit floating point for end-to-end DNN
training.
Index Terms—Deep neural networks, low-precision arithmetic,
posit numerical format
I. INTRODUCTION
Edge computing offers a decentralized alternative to
cloud-based datacenters [1] and brings intelligence to the edge of
mobile networks. However, training on the edge is a challenge
for many deep neural networks (DNNs). This arises due to
the significant cost of multiply-and-accumulate (MAC) operations,
which are ubiquitous in all DNNs. In a 45 nm CMOS
process, energy consumption doubles from 16-bit floats to
32-bit floats for addition and increases by ∼4× for multiplication [2].
Memory access cost increases by ∼10× from 8 kB to 1 MB
memory with a 64-bit cache [2]. In general, there is a gap
between memory storage, bandwidth, compute requirements,
and energy consumption of modern DNNs and hardware
resources available on edge devices [3].
An apparent solution to address this gap is to compress
such networks, thus reducing the compute requirements to
match putative edge resources. Several groups have proposed
new compute- and memory-efficient DNN architectures
[4]–[6] and parameter-efficient neural networks,
using methods such as DNN pruning [7], distillation [8], and
low-precision arithmetic [9], [10]. Among these approaches,
low-precision arithmetic is noted for its ability to reduce
the memory capacity, bandwidth, latency, and energy consumption
associated with MAC units in DNNs and to increase the
level of data parallelism [9], [11], [12].
The ultimate goal of low-precision DNN design is to reduce
the original hardware complexity of the high-precision DNN
model to a level suitable for edge devices without significantly
degrading performance.
To address the gaps in previous studies, we are motivated
to study low-precision posits for DNN training on the edge.
Hamed F. Langroudi, Zachariah Carmichael, and Dhireesha Kudithipudi
are with the Neuromorphic AI Lab, Department of Computer Engineering,
Rochester Institute of Technology, Rochester, NY, USA.
II. POSIT NUMERICAL FORMAT
An alternative to IEEE-754 floating point numbers, posits
were recently introduced and exhibit a tapered-precision
characteristic [13]. Posits, a Type III unum, offer an elegant
resolution to many of the shortcomings of the IEEE-754 floating
point format and address limitations of both Type I and Type II
unums [14]. Moreover, posits provide better accuracy, dynamic
range, and program reproducibility than IEEE floating point.
The essential advantage of posits is their capability to represent
nonlinearly distributed numbers in a specific dynamic range
around unity (±1.0) with high accuracy. The value of a posit
bit-string is given by (1), where s represents the sign, es
the maximum number of exponent bits, fs the maximum number of
fraction bits, e the exponent value, f the fraction value, and k
the regime value (as given by (2)).
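$$
x = \begin{cases}
0, & \text{if } (00\ldots0) \\
\text{NaR}, & \text{if } (10\ldots0) \\
(-1)^{s} \times 2^{2^{es} \times k} \times 2^{e} \times \left(1 + \dfrac{f}{2^{fs}}\right), & \text{otherwise}
\end{cases}
\tag{1}
$$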
The regime is encoded based on the run length m of identical
bits (r...r) terminated by either a regime terminating bit r̄
(the complement of r) or the end of the bit-string of size n.
Note that there is no requirement to distinguish between negative
and positive zero since only a single bit pattern (00...0)
represents zero. Furthermore, instead of defining a “Not-a-Number”
(NaN) for exceptional values and infinity by various bit patterns,
a single bit pattern (10...0), “Not-a-Real” (NaR), represents all
such values. Moreover, NaR never arises due to overflow
or underflow. More details about the posit numerical format
can be found in [13].
$$
k = \begin{cases}
-m, & \text{if } r = 0 \\
m - 1, & \text{if } r = 1
\end{cases}
\tag{2}
$$
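As a minimal sketch of how (1) and (2) combine, the following Python function decodes an n-bit posit bit pattern into a floating point value. It is not taken from the authors' implementation: the name `decode_posit` and its arguments are hypothetical, and the step of taking the two's complement of a negative pattern before extracting the fields is an assumption based on the posit definition in [13]. The fraction is scaled by the number of fraction bits actually present in the pattern, which is equivalent to left-aligning it in a field of maximal width fs.

```python
def decode_posit(bits: int, n: int = 16, es: int = 1) -> float:
    """Decode an n-bit posit, given as an unsigned integer, per Eqs. (1)-(2)."""
    mask = (1 << n) - 1
    bits &= mask
    if bits == 0:
        return 0.0                      # (00...0) encodes zero
    if bits == 1 << (n - 1):
        return float("nan")             # (10...0) encodes NaR
    s = bits >> (n - 1)                 # sign bit
    if s:
        bits = (-bits) & mask           # assumption: two's complement for negatives
    body = bits & ((1 << (n - 1)) - 1)  # the n-1 bits after the sign

    # Regime: run of m identical bits r, ended by the opposite bit or the end.
    r = (body >> (n - 2)) & 1
    m = 0
    i = n - 2
    while i >= 0 and ((body >> i) & 1) == r:
        m += 1
        i -= 1
    k = m - 1 if r == 1 else -m         # Eq. (2)
    i -= 1                              # skip the regime terminating bit
    remaining = max(i + 1, 0)           # bits left for exponent and fraction

    # Exponent: up to es bits; bits that fell off the end count as zeros.
    e_bits = min(es, remaining)
    e = (body >> (remaining - e_bits)) & ((1 << e_bits) - 1)
    e <<= es - e_bits

    # Fraction: whatever bits are left.
    f_bits = remaining - e_bits
    f = body & ((1 << f_bits) - 1)
    sign = -1.0 if s else 1.0
    return sign * 2.0 ** ((1 << es) * k + e) * (1 + f / (1 << f_bits))


print(decode_posit(0x4000))  # 1.0
print(decode_posit(0x4800))  # 1.5
print(decode_posit(0xC000))  # -1.0
```

With n = 16 and es = 1, the smallest and largest positive patterns (0x0001 and 0x7FFF) decode to 2^-28 and 2^28, which illustrates the wide dynamic range obtained when the regime consumes the whole bit-string.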
III. RELATED WORK
As early as the 1980s, low-precision arithmetic was
explored in shallow neural networks to decrease both compute
and memory complexity for training and inference without
deteriorating performance [15]–[18]. In some scenarios, this
bit-precision constraint also improves DNN performance due
to the quantization noise acting as a regularization method
[18], [19]. The outcomes of these studies indicate that 16- and
8-bit precision DNN parameters are capable of satisfactorily
maintaining performance for both training and inference in
shallow networks [16]–[18]. The capability of low-precision
arithmetic has been reevaluated in the deep learning era to reduce
memory footprint and energy consumption during training and
inference [10]–[12], [20]–[30].
arXiv:1907.13216v1 [cs.LG] 30 Jul 2019
In performing DNN training, several previous studies utilize
either variants of low-precision block floating point (BFP),
in which blocks of floating point DNN parameters share an
exponent [31], or mixed-precision floating point. These methods
are sufficient to maintain performance similar to that of 32-bit
high-precision floating point. For instance, Courbariaux et al.
trained low-precision DNNs on the MNIST, CIFAR-10, and
SVHN datasets with the floating point, fixed-point, and BFP
numerical formats and showed that BFP achieves the best
performance due to the variability between the dynamic range and
precision of DNN parameters [20]. Subsequently, Köster et al.
proposed the Flexpoint numerical format and a new algorithm,
Autoflex, which analyzes the statistics of the history of DNN
parameters to optimally select the shared exponents for DNN
parameters at each iteration of gradient descent [23].
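The shared-exponent idea behind BFP can be sketched in a few lines of Python. This is an illustrative toy and not the scheme of [20] or Flexpoint [23]: the function names, the 8-bit signed mantissa width, and the rule of deriving the exponent from the largest magnitude in the block are all assumptions made here.

```python
import math
from typing import List, Tuple


def bfp_quantize(block: List[float], mantissa_bits: int = 8) -> Tuple[List[int], int]:
    """Quantize a block to signed integer mantissas that share one exponent."""
    max_abs = max(abs(v) for v in block)
    if max_abs == 0.0:
        return [0] * len(block), 0
    # frexp gives max_abs = frac * 2**e with 0.5 <= frac < 1.
    _, e = math.frexp(max_abs)
    # Pick the exponent so the largest value fills the signed mantissa range.
    shared_exp = e - (mantissa_bits - 1)
    limit = 2 ** (mantissa_bits - 1) - 1
    mantissas = [
        max(-limit, min(limit, round(v / 2.0 ** shared_exp))) for v in block
    ]
    return mantissas, shared_exp


def bfp_dequantize(mantissas: List[int], shared_exp: int) -> List[float]:
    """Recover approximate values from mantissas and the shared exponent."""
    return [m * 2.0 ** shared_exp for m in mantissas]


mantissas, shared_exp = bfp_quantize([0.5, -0.25, 0.001])
print(mantissas, shared_exp)                  # [64, -32, 0] -7
print(bfp_dequantize(mantissas, shared_exp))  # [0.5, -0.25, 0.0]
```

The example shows the trade-off noted above: because the exponent follows the largest value in the block, the small entry 0.001 is flushed to zero, which is why the choice of shared exponent matters when the dynamic range of the parameters varies.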
Aside from the BFP numerical format, Narang et al. ex-
plored mixed-precision floating point [22] using 16-bit floating
point weights, activations, and gradients during both the for-
ward and backward passes. To prevent accuracy loss caused by
underflow in 16-bit floating point, the weights are updated with
32-bit floating point. Additionally, to prevent gradients with
very small magnitude from becoming zero when represented
by 16-bit floating point, a new loss scaling approach is
proposed.
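The two failure modes described above can be reproduced with Python's standard `struct` module, whose `e` format code is IEEE-754 half precision. This is a sketch under stated assumptions rather than the procedure of [22]: the scale factor of 1024, the example values, and the helper name `to_fp16` are hypothetical, and a Python float stands in for the 32-bit master weight.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE-754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


LOSS_SCALE = 1024.0  # hypothetical scale factor, chosen for illustration

# 1) A very small gradient underflows to zero in 16-bit floating point.
tiny_grad = 1e-8
unscaled_fp16 = to_fp16(tiny_grad)

# Scaling before the conversion keeps it representable; the scale is
# divided out again in higher precision before the weight update.
scaled_fp16 = to_fp16(tiny_grad * LOSS_SCALE)
recovered = scaled_fp16 / LOSS_SCALE

# 2) A small update is lost when the weight itself is stored in 16 bits,
# but survives in a higher-precision master copy.
weight, update = 1.0, 1e-4
fp16_weight = to_fp16(to_fp16(weight) - update)
master_weight = weight - update

print(unscaled_fp16)               # 0.0
print(scaled_fp16 > 0.0)           # True
print(fp16_weight, master_weight)  # 1.0 0.9999
```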
Recently, Wang et al. and Mellempudi et al. proposed
methods to reduce the bit-precision of weights, activations, and
gradients to 8 bits by exhaustively analyzing DNN parameters
during training [10], [24]. In [10], a new chunk-based addition
is presented to solve the truncation issue caused by the addition
of large- and small-magnitude numbers, thus successfully
reducing the number of bits for the accumulator and weight
updates to 16 bits. To avoid the need for loss scaling in mixed-
precision floating point training, Kalamkar et al. [25] proposed
the brain floating point (BFLOAT-16) half-precision format
with a reduced 7-bit fraction and the same dynamic
range (8-bit exponent) as 32-bit floating point. A side effect of
this representation is that the conversion complexity between
BFLOAT-16 and IEEE floating point is reduced during
training. In training a ResNet model on the ImageNet dataset,
BFLOAT-16 achieves the same performance as 32-bit floating
point.
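The low conversion cost follows from BFLOAT-16 occupying the upper 16 bits of a 32-bit float. The sketch below converts by truncation using only the standard `struct` module. The function names are hypothetical, and truncation is an assumption chosen for brevity: an implementation may round to nearest instead.

```python
import struct


def float32_to_bfloat16_bits(x: float) -> int:
    """Keep the sign, 8 exponent bits, and top 7 fraction bits of a float32."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    return bits32 >> 16  # truncation; the low 16 fraction bits are dropped


def bfloat16_bits_to_float32(bits16: int) -> float:
    """Widen a BFLOAT-16 bit pattern by appending 16 zero fraction bits."""
    (x,) = struct.unpack("<f", struct.pack("<I", (bits16 & 0xFFFF) << 16))
    return x


bits = float32_to_bfloat16_bits(3.14159)
print(hex(bits))                       # 0x4049
print(bfloat16_bits_to_float32(bits))  # 3.140625
```

Because the exponent field is unchanged, a value such as 1e38 stays finite after conversion, whereas the largest finite IEEE half-precision value is 65504.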
This research builds on earlier studies [27]–[30], [32] and,
for the first time, studies feedforward neural network training
with posits on the MNIST and Fashion MNIST datasets.
IV. PROPOSED WORK, RESULTS & ANALYSIS
To perform DNN training in feedforward neural networks,
we study two numerical formats (floating point and posit) with
32-bit and 16-bit precision. For simplicity, the architecture
explained here is a feedforward neural network with three
hidden layers trained with the posit numerical format, as
shown in Fig. 1.
The networks are implemented in the Keras [33] and Ten-
sorFlow [34] frameworks. {16, 32}-bit floating point and posit
numbers for DNN training are extended to these frameworks
via software emulation.
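Software emulation of a low-precision format typically amounts to rounding every intermediate result to the target format while the arithmetic itself runs in the host's native precision. The sketch below illustrates that idea for one fully-connected layer in plain Python. It is not the authors' Keras/TensorFlow code: `quantize_fp16`, `dense_forward`, and `sgd_step` are hypothetical names, IEEE half precision (via the standard `struct` module) stands in for the 16-bit format, and plain SGD replaces the Adam optimizer of Fig. 1. A posit rounding function could be passed as `quantize` instead.

```python
import struct
from typing import Callable, List

Quantizer = Callable[[float], float]


def quantize_fp16(x: float) -> float:
    """Stand-in quantizer: round to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def dense_forward(weights: List[List[float]], inputs: List[float],
                  quantize: Quantizer = quantize_fp16) -> List[float]:
    """Fully-connected layer with ReLU; every result is re-quantized."""
    outputs = []
    for row in weights:
        acc = 0.0
        for w, a in zip(row, inputs):
            product = quantize(quantize(w) * quantize(a))
            acc = quantize(acc + product)        # low-precision accumulation
        outputs.append(quantize(max(acc, 0.0)))  # ReLU activation
    return outputs


def sgd_step(weights: List[List[float]], grads: List[List[float]], lr: float,
             quantize: Quantizer = quantize_fp16) -> List[List[float]]:
    """Plain SGD update with the step and the new weight both quantized."""
    return [
        [quantize(w - quantize(lr * g)) for w, g in zip(w_row, g_row)]
        for w_row, g_row in zip(weights, grads)
    ]


print(dense_forward([[0.5, 0.25], [-1.0, 0.0]], [2.0, 4.0]))  # [2.0, 0.0]
print(sgd_step([[1.0]], [[0.5]], lr=0.5))                      # [[0.75]]
```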
To compare the posit and floating point numerical formats,
a four-layer feedforward neural network is trained with each
number system on the MNIST and Fashion MNIST
datasets. The results indicate that posits have improved
accuracy in comparison to floating point at both 16- and
32-bit precision, as shown in Table I. Although the evaluation
is demonstrated on small datasets, there are two advantages
compared to [10], [24]. Mellempudi et al. [24] use 32-bit
numbers for accumulation to reduce the hardware cost of
stochastic rounding. Wang et al. [10] reduce the accumulation
bit-precision to 16 by using stochastic rounding. In contrast,
in this paper we show the potential of using 16-bit posits
for all DNN parameters with a simple and hardware-friendly
round-to-nearest algorithm, with less than 2% accuracy
degradation relative to the 32-bit formats and without exhaustively
analyzing the network training parameters.
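The difference between the two rounding modes mentioned above can be sketched on a uniform grid. This is a simplification made here for illustration: neither floating point nor posit values are uniformly spaced, and the function names and the seeded generator are assumptions rather than part of [10] or [24].

```python
import math
import random


def round_to_nearest(x: float, step: float) -> float:
    """Deterministic rounding to the closest multiple of step (ties to even)."""
    return round(x / step) * step


def stochastic_round(x: float, step: float, rng: random.Random) -> float:
    """Round up with probability equal to the distance past the lower grid point."""
    scaled = x / step
    lower = math.floor(scaled)
    prob_up = scaled - lower
    return (lower + (1 if rng.random() < prob_up else 0)) * step


rng = random.Random(0)  # seeded so the sketch is repeatable
samples = [stochastic_round(0.3, 1.0, rng) for _ in range(10000)]
mean_stochastic = sum(samples) / len(samples)

print(round_to_nearest(0.3, 1.0))  # 0.0 on every call
print(mean_stochastic)             # close to 0.3 on average
```

Stochastic rounding preserves small values in expectation at the cost of a random number source per rounding operation, whereas round-to-nearest needs no extra hardware but always discards a value below half the grid spacing.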
Table I: Average accuracy over 10 independent runs on the
test set of the respective dataset. Networks are trained using
only the specified numerical format.
Task            Format     Accuracy
MNIST           Posit-32   98.131%
MNIST           Float-32   98.087%
MNIST           Posit-16   96.535%
MNIST           Float-16   90.646%
Fashion MNIST   Posit-32   89.263%
Fashion MNIST   Float-32   89.105%
Fashion MNIST   Posit-16   87.400%
Fashion MNIST   Float-16   81.725%
A summary of recent studies that propose low-precision
training frameworks is shown in Table II. Several research
groups have explored the efficacy of floats and BFP on the
performance of DNNs with multiple image classification tasks
[10], [22]–[25], [32]. However, the majority of these works
do not analyze the appropriateness of the posit numerical
format for DNN training.
V. CONCLUSIONS
This work presented low-precision posit designs for both
training and inference on edge devices. We show that the
novel posit numerical format has high efficacy for DNN
training at {16, 32}-bit precision, surpassing the
equal-bandwidth fixed-point and floating point counterparts.
The success of posits in these experiments motivates further
exploration of ultra-low precision posit training on richer
datasets.
REFERENCES
[1] E. Li, Z. Zhou, and X. Chen, “Edge intelligence: On-demand deep
learning model co-inference with device-edge synergy,” in Proceedings
of the 2018 Workshop on Mobile Edge Communications. ACM, 2018,
pp. 31–36.
[2] M. Horowitz, “1.1 computing’s energy problem (and what we can do
about it),” in 2014 IEEE international solid-state circuits conference
digest of technical papers (ISSCC). IEEE, 2014, pp. 10–14.
[3] C.-J. Wu, D. Brooks, K. Chen, D. Chen, S. Choudhury et al., “Ma-
chine learning at facebook: Understanding inference at the edge,” in
2019 IEEE International Symposium on High Performance Computer
Architecture (HPCA). IEEE, 2019, pp. 331–344.
[4] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang et al.,
“Mobilenets: Efficient convolutional neural networks for mobile vision
applications,” arXiv preprint arXiv:1704.04861, 2017.
[Figure 1 diagram: forward path (data, fully-connected layer, classification) and backward path through layer l. Activations, weights, error gradients, and weight gradients are labeled P16. An Adam optimizer takes the weight gradients, master weights, learning rate, and β1 & β2, and produces the updated weights.]
Figure 1: The software framework for training the feedforward neural networks with three hidden layers using the posit numerical
format. The framework scales to any DNN architecture. P16: Posit16; β1 & β2: exponential decay rates.
Table II: High-level summary of proposed work and other low-precision training frameworks. All datasets are image
classification tasks. FMNIST: Fashion MNIST; FP: floating point; FX: fixed-point; PS: posit.
Work                     Dataset         Numerical Format   Bit-precision   DNN library
Montero et al. [32]      Synthetic       PS                 [8..16]         PySigmoid
Narang et al. [22]       ImageNet        FP                 16/32           Caffe/PyTorch
Köster et al. [23]       CIFAR-10        BFP                16+5            Neon
Mellempudi et al. [24]   ImageNet        FP                 8/32            TensorFlow
Wang et al. [10]         ImageNet        FP                 8/16            Home Suite
Kalamkar et al. [25]     ImageNet        BFLOAT             16              IntelCaffe/Caffe2, Neon/TensorFlow
This Work                MNIST, FMNIST   PS, FP             16              Keras/TensorFlow
[5] Y. Chen, H. Fang, B. Xu, Z. Yan, Y. Kalantidis et al., “Drop an
octave: Reducing spatial redundancy in convolutional neural networks
with octave convolution,” arXiv preprint arXiv:1904.05049, 2019.
[6] M. Cho and D. Brand, “MEC: Memory-efficient convolution for deep
neural network,” in Proceedings of the 34th International Conference
on Machine Learning, ICML, ser. Proceedings of Machine Learning
Research, D. Precup and Y. W. Teh, Eds., vol. 70. Sydney, NSW,
Australia: PMLR, Aug. 2017, pp. 815–824. [Online]. Available:
http://proceedings.mlr.press/v70/cho17a.html
[7] M. Ren, A. Pokrovsky, B. Yang, and R. Urtasun, “Sbnet: Sparse blocks
network for fast inference,” in Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, 2018, pp. 8711–8720.
[8] B. Zhou, Y. Sun, D. Bau, and A. Torralba, “Revisiting the importance of
individual units in cnns via ablation,” arXiv preprint arXiv:1806.02891,
2018.
[9] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang et al., “Quantization
and training of neural networks for efficient integer-arithmetic-only
inference,” in The IEEE Conference on Computer Vision and Pattern
Recognition (CVPR), June 2018.
[10] N. Wang, J. Choi, D. Brand, C.-Y. Chen, and K. Gopalakrishnan,
“Training deep neural networks with 8-bit floating point numbers,” in
Advances in neural information processing systems, 2018, pp. 7686–
7695.
[11] S. Hashemi, N. Anthony, H. Tann, R. I. Bahar, and S. Reda,
“Understanding the impact of precision quantization on the accuracy
and energy of neural networks,” in Design, Automation & Test in
Europe Conference & Exhibition, DATE, D. Atienza and G. D.
Natale, Eds. Lausanne, Switzerland: IEEE, Mar. 2017, pp. 1474–1479.
[Online]. Available: https://doi.org/10.23919/DATE.2017.7927224
[12] P. Gysel, J. Pimentel, M. Motamedi, and S. Ghiasi, “Ristretto: A frame-
work for empirical study of resource-efficient inference in convolutional
neural networks,” IEEE Transactions on Neural Networks and Learning
Systems, 2018.
[13] J. L. Gustafson and I. T. Yonemoto, “Beating floating point at its
own game: Posit arithmetic,” Supercomputing Frontiers and Innovations,
vol. 4, no. 2, pp. 71–86, 2017.
[14] W. Tichy, “Unums 2.0: An interview with John L. Gustafson,” Ubiquity,
vol. 2016, no. September, p. 1, 2016.
[15] H. P. Graf, L. D. Jackel, and W. E. Hubbard, “VLSI implementation of
a neural network model,” IEEE Computer, vol. 21, no. 3, pp. 41–49,
1988. [Online]. Available: https://doi.org/10.1109/2.30
[16] A. Iwata, Y. Yoshida, S. Matsuda, Y. Sato, and N. Suzumura, “An arti-
ficial neural network accelerator using general purpose 24 bits floating
point digital signal processors,” in International Joint Conference on
Neural Networks, IJCNN, vol. 2, 1989, pp. 171–175.
[17] D. W. Hammerstrom, “A VLSI architecture for high-performance, low-
cost, on-chip learning,” in IJCNN 1990, International Joint Conference
on Neural Networks. San Diego, CA, USA: IEEE, Jun. 1990, pp. 537–
544. [Online]. Available: https://doi.org/10.1109/IJCNN.1990.137621
[18] K. Asanovic and N. Morgan, “Experimental determination of precision
requirements for back-propagation training of artificial neural networks,”
in Proceedings of the 2nd International Conference on Microelectronics
for Neural Networks, 1991, pp. 9–15.
[19] C. M. Bishop, “Training with noise is equivalent to tikhonov regular-
ization,” Neural computation, vol. 7, no. 1, pp. 108–116, 1995.
[20] M. Courbariaux, Y. Bengio, and J. David, “Low precision arithmetic for
deep learning,” in Workshop Track Proceedings of the 3rd International
Conference on Learning Representations, ICLR, Y. Bengio and
Y. LeCun, Eds., San Diego, CA, USA, May 2015. [Online]. Available:
http://arxiv.org/abs/1412.7024
[21] S. Gupta, A. Agrawal, K. Gopalakrishnan, and P. Narayanan, “Deep
learning with limited numerical precision,” in Proceedings of the 32nd
International Conference on Machine Learning, ICML, ser. JMLR
Workshop and Conference Proceedings, F. R. Bach and D. M. Blei,
Eds., vol. 37. Lille, France: JMLR.org, Jul. 2015, pp. 1737–1746.
[Online]. Available: http://proceedings.mlr.press/v37/gupta15.html
[22] P. Micikevicius, S. Narang, J. Alben, G. F. Diamos, E. Elsen et al.,
“Mixed precision training,” in Conference Track Proceedings of the
6th International Conference on Learning Representations, ICLR.
Vancouver, BC, Canada: OpenReview.net, 2018. [Online]. Available:
https://openreview.net/forum?id=r1gs9JgRZ
[23] U. Köster, T. Webb, X. Wang, M. Nassar, A. K. Bansal et al., “Flexpoint:
An adaptive numerical format for efficient training of deep neural
networks,” in Advances in Neural Information Processing Systems, 2017,
pp. 1742–1752.
[24] N. Mellempudi, S. Srinivasan, D. Das, and B. Kaul, “Mixed precision
training with 8-bit floating point,” arXiv preprint arXiv:1905.12334,
2019.
[25] D. Kalamkar, D. Mudigere, N. Mellempudi, D. Das, K. Banerjee et al.,
“A study of BFLOAT16 for deep learning training,” arXiv preprint
arXiv:1905.12322, 2019.
[26] E. Chung, J. Fowers, K. Ovtcharov, M. Papamichael, A. Caulfield et al.,
“Serving DNNs in real time at datacenter scale with Project Brainwave,”
IEEE Micro, vol. 38, no. 2, pp. 8–20, 2018.
[27] Z. Carmichael, H. F. Langroudi, C. Khazanov, J. Lillie, J. L.
Gustafson, and D. Kudithipudi, “Deep positron: A deep neural
network using the posit number system,” in Design, Automation
& Test in Europe Conference & Exhibition, DATE. Florence,
Italy: IEEE, Mar. 2019, pp. 1421–1426. [Online]. Available:
https://doi.org/10.23919/DATE.2019.8715262
[28] Z. Carmichael, H. F. Langroudi, C. Khazanov, J. Lillie, J. L. Gustafson,
and D. Kudithipudi, “Performance-efficiency trade-off of low-precision
numerical formats in deep neural networks,” in Proceedings of the
Conference for Next Generation Arithmetic, ser. CoNGA’19. Singapore,
Singapore: ACM, 2019, pp. 3:1–3:9.
[29] S. H. F. Langroudi, T. Pandit, and D. Kudithipudi, “Deep learning infer-
ence on embedded devices: Fixed-point vs posit,” in 2018 1st Workshop
on Energy Efficient Machine Learning and Cognitive Computing for
Embedded Applications (EMC2), March 2018, pp. 19–23.
[30] J. Johnson, “Rethinking floating point for deep learning,” arXiv preprint
arXiv:1811.01721, 2018.
[31] J. H. Wilkinson, “Rounding errors in algebraic processes,” in IFIP
Congress, 1959, pp. 44–53.
[32] R. M. Montero, A. A. Del Barrio, and G. Botella, “Template-based
posit multiplication for training and inferring in neural networks,” arXiv
preprint arXiv:1907.04091, 2019.
[33] F. Chollet et al., “Keras,” https://github.com/keras-team/keras, 2015.
[34] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen et al.,
“TensorFlow: Large-scale machine learning on heterogeneous systems,”
2015. [Online]. Available: https://www.tensorflow.org/
