Training Deep Neural Networks Using Posit Number System by Lu, Jinming et al.
Jinming Lu, Siyuan Lu, Zhisheng Wang, Chao Fang, Jun Lin, Zhongfeng Wang, and Li Du
School of Electronic Science and Engineering, Nanjing University, Nanjing, China
Email: {jmlu, sylu, zswang}@smail.nju.edu.cn, fantasysee@foxmail.com, {jlin, zfwang}@nju.edu.cn, dl1989113@ucla.edu
Abstract—With the increasing size of Deep Neural Network (DNN) models, high memory requirements and computational complexity have become an obstacle to efficient DNN implementations. To ease this problem, reduced-precision representations for DNN training and inference have attracted much interest from researchers. This paper first proposes a methodology for training DNNs with posit arithmetic, a type-3 universal number (Unum) format that is similar to floating point (FP) but has reduced precision. A warm-up training strategy and layer-wise scaling factors are adopted to stabilize training and fit the dynamic range of DNN parameters. With the proposed training methodology, we demonstrate the first successful training of DNN models on the ImageNet image classification task in 16-bit posit with no accuracy loss. Then, an efficient hardware architecture for the posit multiply-and-accumulate operation is proposed, which achieves a significant improvement in energy efficiency over traditional floating-point implementations. The proposed design is helpful for future low-power DNN training accelerators.
Index Terms—posit number system, quantization, deep neural
network training
I. INTRODUCTION
Recently, deep neural networks (DNNs) have achieved great success in many real-world applications, such as image classification [1], speech recognition [2], and natural language processing [3]. As DNNs grow in size, they achieve state-of-the-art performance. However, their high memory requirements and computational complexity have become a serious problem for efficient implementations, especially on mobile devices.
To alleviate the extremely high demand for computational resources, many compression methods have been proposed that aim to generate compact DNN models. At present, the reduced-precision representation of numbers, also known as quantization, is one of the most attractive topics [4]. However, these methods mainly focus on the inference phase of DNNs. Training with limited-precision numbers remains largely unexplored.
Because training involves more information flows, including gradient backpropagation and parameter updates, it requires a number format with higher representation ability. In other words, a suitable number format for DNN training should have enough dynamic range for large numbers and high precision for numbers near the center of the data distribution.
Posit, a type-3 universal number, was introduced by Gustafson et al. [5]. An n-bit posit number is defined as (n, es), where es (the number of exponent bits) controls the dynamic range. Compared with the standard floating-point (FP) format, posit offers a better trade-off between dynamic range and precision, which meets the needs of low-bit numbers for DNN training. Some researchers have pointed out the prospect of posit in DNNs, but practical implementations and verifications are absent [5] [6]. In this paper, we first propose an effective strategy for DNN training using the posit number system. Once posit has been shown to be useful in DNN training, a processing element supporting posit arithmetic is required to make full use of its efficiency in DNN accelerators. Our contributions are summarized as follows:
• With an operation that transforms a real number to posit format, we illustrate how to apply posit in the DNN training process.
• We analyze the advantages and disadvantages of applying posit to DNN training and propose corresponding solutions to the problems. Firstly, to deal with the high sensitivity of models in the early training stage and ensure convergence, a warm-up training phase with FP32 is carried out. Secondly, to take advantage of posit, we design layer-wise scaling factors based on the center of the data distribution in the log domain, so that the data distribution of the model matches the way the precision of posit numbers varies with magnitude. Thirdly, to meet the different data ranges of different layers, we come up with a qualitative criterion for selecting a proper es to achieve a better trade-off between the dynamic range and precision of posit numbers.
• To verify the effectiveness of our methods, ResNet-18 models are trained on the ImageNet and Cifar-10 datasets, with 8-bit or 16-bit posit numbers applied in the forward and backward computation. The experiments show no accuracy loss relative to the baseline model.
• We propose a hardware architecture for a posit multiply-and-accumulate (MAC) unit, which is coded in Verilog HDL and synthesized by Design Compiler under TSMC 28nm technology. Compared with a standard floating-point MAC unit, the posit MAC can reduce the power by up to 83% and the area by up to 76%. This demonstrates that our design will benefit future low-power DNN training accelerators.
arXiv:1909.03831v1 [cs.LG] 6 Sep 2019
[Figure 1: bit fields of an n-bit posit, from most to least significant: sign bit s; regime bits r r … r; exponent bits e1 e2 … e_es, if any; mantissa bits f1 f2 …, if any.]
Fig. 1. The basic structure of an (n, es) posit number
TABLE I
THE DETAIL STRUCTURES OF POSITIVE VALUES OF (5, 1) POSIT NUMBER
Binary Code   Regime   Exponent   Mantissa   Real Value
00000         x        x          x          0
00001         -3       0          0          1/64
00010         -2       0          0          1/16
00011         -2       1          0          1/8
00100         -1       0          0          1/4
00101         -1       0          1/2        3/8
00110         -1       1          0          1/2
00111         -1       1          1/2        3/4
01000         0        0          0          1
01001         0        0          1/2        3/2
01010         0        1          0          2
01011         0        1          1/2        3
01100         1        0          0          4
01101         1        1          0          8
01110         2        0          0          16
01111         3        0          0          64
II. BACKGROUND
A. Reduced-Precision for DNN Training
Training DNNs with reduced precision is an appealing topic. Gupta et al. trained DNNs with fixed-point numbers and introduced a stochastic rounding procedure to prevent accuracy degradation [7]. In [8], a binary logarithmic data representation for both inference and training is explored, so that multiplications can be replaced by simpler shift operations. However, these works usually cannot provide the expected model accuracy on complex tasks, because the aggressive approximation loses too much information.
To deal with this problem, some recent works use reduced-precision floating-point formats such as FP8 or FP16 in training. Micikevicius et al. [9] used FP16 for forward and backward computation and kept FP32 for weight update and accumulation. They also proposed a loss-scaling method to keep gradients propagating effectively. Furthermore, by applying a chunk-based accumulation technique, Wang et al. [10] reduced the precision of the computation to FP8 and the precision of the weight update and accumulation to FP16.
B. Posit number system
TABLE II
NOTATIONS FOR POSIT TRANSFORMATION
Name   Description
n      posit word size
es     posit exponent field size
s      sign of the number x
exp    the effective exponent value of x
k      the regime value of px
e      the exponent value of px before rounding
f      the mantissa value of px before rounding
rb     the regime width of px
eb     the exponent width of px
fb     the mantissa width of px
pe     the exponent value of px after rounding
pf     the mantissa value of px after rounding
An (n, es) posit number, whose detailed structure is shown in Fig. 1, consists of four parts: a sign bit, regime bits, es exponent bits, and a mantissa part. The boundaries between the last three parts are not fixed, because the regime part is run-length encoded. As for the numerical meaning of the regime bits, k consecutive zeros ended by a one means −k, and k+1 consecutive ones ended by a zero means k. As an example, the construction of a (5, 1) posit is described in Table I. The value of a posit number p (binary code) is given by Eq. (1).
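\[
x =
\begin{cases}
0, & p = 000\ldots0, \\
\pm\infty, & p = 100\ldots0, \\
(-1)^{s} \times \mathit{useed}^{k} \times 2^{e} \times (1 + f), & \text{otherwise},
\end{cases}
\tag{1}
\]
where $\mathit{useed} = 2^{2^{es}}$ determines the dynamic range. The maximum and the minimum positive values that p can represent are $\mathit{useed}^{n-2}$ and $\mathit{useed}^{2-n}$, respectively.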
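As an illustration of Eq. (1), the following minimal Python sketch decodes an n-bit posit bit pattern into its exact value. The function name `decode_posit` is hypothetical and not part of the paper. Two details go beyond Eq. (1) and are assumptions taken from the usual posit convention: negative patterns are decoded through their two's complement, and exponent bits that are cut off by a long regime are read as zero.

```python
from fractions import Fraction


def decode_posit(bits: int, n: int, es: int):
    """Return the exact value of an n-bit posit bit pattern per Eq. (1).

    Returns None for the +/- infinity pattern 100...0.
    """
    mask = (1 << n) - 1
    bits &= mask
    if bits == 0:
        return Fraction(0)
    if bits == 1 << (n - 1):
        return None
    sign = bits >> (n - 1)
    if sign:
        # Assumption: negative posits are stored in two's complement.
        bits = (-bits) & mask
    body = format(bits, f"0{n}b")[1:]          # the n-1 bits after the sign
    first = body[0]
    run = len(body) - len(body.lstrip(first))  # length of the regime run
    # k zeros ended by a one -> -k; k+1 ones ended by a zero -> k.
    k = run - 1 if first == "1" else -run
    rest = body[run + 1:]                      # skip the terminating bit
    exp_bits = rest[:es].ljust(es, "0")        # missing exponent bits read as 0
    frac_bits = rest[es:]
    e = int(exp_bits, 2) if es else 0
    f = Fraction(int(frac_bits, 2), 1 << len(frac_bits)) if frac_bits else Fraction(0)
    useed = Fraction(2) ** (2 ** es)
    value = useed ** k * Fraction(2) ** e * (1 + f)
    return -value if sign else value


# Enumerate the non-negative (5, 1) codes for comparison with Table I.
for code in range(16):
    print(format(code, "05b"), decode_posit(code, 5, 1))
```

Exact rational arithmetic (`fractions.Fraction`) is used here only so that the printed values can be compared digit for digit with Table I.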
Some groups have worked on hardware architecture generators for posit arithmetic. Jaiswal et al. [11] proposed a parameterized posit arithmetic architecture generator that supports basic operations such as FP-posit conversion, addition/subtraction, and multiplication. Recently, an efficient posit MAC unit generator that can be combined with a reasonable pipeline strategy was put forward by Zhang et al. [6]. Besides, the application of low-bit posit in deep learning has also attracted some attention. Deep Positron [12], a DNN architecture that employs exact multiply-and-accumulate (EMAC) units for 8-bit posit, shows better accuracy than 8-bit fixed-point and FP on some small datasets. Johnson [13] proposed a log-float format inspired by posit and used it for DNN inference, with an accuracy loss of less than 1% on the ImageNet dataset with the ResNet-50 model.
III. POSIT TRAINING STRATEGY AND EXPERIMENTAL RESULTS
A. Posit Transformation
In this work, all data and computations in the training process are represented in posit format. Therefore, we have to transform a real number, which is represented in FP32 format on current computers, to posit format. We define an operator P_{n,es}(x) for this task. The detailed process is shown in Algorithm 1, and the notations involved are listed in Table II.
Given the total word size n and the exponent field size es, we can determine the dynamic range of a posit number. To convert a non-zero number x to the corresponding posit number px, we first limit its magnitude based on the dynamic range and then extract the sign, regime, exponent, and mantissa parts.
Algorithm 1: Transform a Number to Posit Format
Input: real number x, posit word size n, and exponent field size es
Output: posit number px
1  useed = 2^(2^es);
2  maxpos = useed^(n−2), minpos = useed^(2−n);
3  if abs(x) < minpos then
4      px = 0;
5  else
6      s = sign(x);
7      x′ = clip{abs(x), minpos, maxpos};
8      exp = ⌊log2(x′)⌋;
9      k = ⌊exp ÷ 2^es⌋;
10     e = exp − k × 2^es;
11     f = x′ / 2^exp − 1;
12     if k ≥ 0 then
13         rb = k + 2;
14     else
15         rb = −k + 1;
16     eb = min{n − 1 − rb, es};
17     fb = max{n − 1 − rb − eb, 0};
18     pe = ⌊e × 2^(eb−es)⌋ × 2^(es−eb);
19     pf = ⌊f × 2^fb⌋ × 2^(−fb);
20     px = s × useed^k × 2^pe × (1 + pf);
21 return px;
[Figure 2 panels: (a) histogram of conv1.weight; (b) distribution of conv1.weight; (c) histogram of bn4.0.1.weight; (d) distribution of bn4.0.1.weight.]
Fig. 2. The histograms and distributions of a CONV layer and a BN layer in the training process
Next, because of the restriction on word size, the width of each part is adjusted, and a rounding operation is applied to the value of each part to fit the adjusted width. Here we choose the rounding-to-zero method, i.e., the ⌊·⌋ operator in Algorithm 1, Lines 18 and 19. Compared with the rounding-to-nearest and stochastic rounding methods, rounding-to-zero is more friendly to hardware implementation. Finally, the posit result px is obtained by combining these parts based on Eq. (1).
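The transformation steps of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `to_posit` is hypothetical, the result is returned as an ordinary Python float holding the posit-representable value, the mantissa width is taken as the non-negative number of bits left after the regime and exponent, and `math.frexp` is used to obtain ⌊log2(x′)⌋ without floating-point error in the logarithm.

```python
import math


def to_posit(x: float, n: int, es: int) -> float:
    """Round x toward zero onto the (n, es) posit grid, following Algorithm 1."""
    useed = 2.0 ** (2 ** es)
    maxpos, minpos = useed ** (n - 2), useed ** (2 - n)
    if abs(x) < minpos:
        return 0.0
    s = 1.0 if x > 0 else -1.0
    xc = min(max(abs(x), minpos), maxpos)   # clip the magnitude
    exp = math.frexp(xc)[1] - 1             # exact floor(log2(xc))
    k = exp // (2 ** es)                    # regime value
    e = exp - k * 2 ** es                   # exponent field value
    f = xc / 2.0 ** exp - 1.0               # fraction in [0, 1)
    rb = k + 2 if k >= 0 else -k + 1        # regime width incl. terminating bit
    eb = min(n - 1 - rb, es)                # exponent bits that still fit
    fb = max(n - 1 - rb - eb, 0)            # mantissa bits that still fit
    pe = math.floor(e * 2.0 ** (eb - es)) * 2.0 ** (es - eb)
    pf = math.floor(f * 2.0 ** fb) * 2.0 ** (-fb)
    return s * useed ** k * 2.0 ** pe * (1.0 + pf)


# Quantize a few values to the (5, 1) format listed in Table I.
for value in (0.01, 0.3, 0.4, -3.5, 12.0, 100.0):
    print(value, "->", to_posit(value, 5, 1))
```

Because the rounding is toward zero, every non-zero result has a magnitude no larger than the input, and inputs beyond maxpos saturate at maxpos.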
With the transformation algorithm in place, we insert it into the DNN training computation flow as depicted in Fig. 3, which includes the forward process, the backward process, and the weight update process.
B. Training a DNN Model with Posit
Although posit has many benefits for DNN training, it cannot deliver the expected performance if we directly replace FP32 with reduced-precision posit. There are several key reasons:
• In the early training stage, the model is more sensitive to the precision of data, and the distributions of some layers are unstable, so reduced-precision representations cause a bad initialization and make the model hard to converge.
• The precision of the posit number system is basically symmetrical about 1, but the data distributions in DNN models are concentrated in a limited range. To some extent, this causes a mismatch between the data distributions and the number representation format, leading to larger approximation errors.
• For different layers, the data have different ranges: some data distributions are more concentrated and others are more spread out. Therefore, it is sub-optimal to use the same format parameters (e.g., the es of posit) to represent all of them.
In this section, we propose corresponding methods for dealing with the above problems.
Warm-up Training: By observing the distributions of data in the training process, we find that most of them are approximately normal. As shown in Fig. 2, the distributions of the weights in Convolution (CONV) layers are basically stable throughout training. However, because of the initialization method, the distributions of the weights in Batch Normalization (BN) layers change steeply in the first several epochs, which may be an important reason for the high model sensitivity early in training. Therefore, a higher numerical precision is required in this phase. For this reason, warm-up training using FP32 is carried out for several epochs (1-5 epochs). It helps to determine the data distribution effectively and ensures the convergence of the networks.
[Figure 3 panels: (a) Forward propagation with posit transformation (symbols: A_p^{l−1}, W_p^l, Conv, A^l, P(·), A_p^l); (b) Backward propagation with posit transformation (symbols: W_p^l, Conv, E_p^l, E^{l−1}, E_p^{l−1}, ΔW^l, ΔW_p^l, two P(·) operations); (c) Weight update with posit transformation (symbols: W_p^l, ΔW_p^l, Weight Update, W^l, P(·), W_p^l).]
Fig. 3. DNN training computation flow graph with posit transformation. In the graph, P(·) denotes the transformation operation, whose subscript (n, es) is omitted for simplicity. W, A, ∆W, and E stand for the weight, activation, weight gradient, and error in a layer, respectively. The symbols with subscript p are in posit format.
Distribution-based Shifting: When transforming a real number x to a reduced-precision format, the most common approach is to approximate it by the nearest reduced-precision value and clip it to the dynamic range of that format. As a result, numerical errors are inevitable. To overcome the second issue, a scaling factor is introduced to shift the data distribution to a more appropriate range, whose upper bound is usually the maximum value that the reduced-precision number can represent [14]. The dynamic range of the posit number system is large enough to meet this demand. However, to make full use of the code space of posit, and inspired by the shift-based mapping method [14], we also propose a layer-wise scaling factor $S_f$. The scaling factor is calculated as in Eq. (2).
\[
\begin{aligned}
\mathit{center} &= \mathrm{round}(\mathrm{mean}(\log_2(x))), \\
S_f &= 2^{(\mathit{center} + \sigma)}.
\end{aligned}
\tag{2}
\]
Here x is the tensor to be converted, and center is the approximate center of the distribution of the input tensor in the log domain, meaning that the majority of values are close to this magnitude. σ is a predefined positive integer constant, which is set to 2 in our experiments. As mentioned in previous work [15], large values are more important than small values, so we add σ to center to shift the values a little further toward small magnitudes. Based on the warm-up trained model, the scaling factor of each layer can be calculated. Finally, by applying the scaling factor before and after the transformation operation P(x) as in Eq. (3), the more important values are shifted to the order of magnitude that has higher precision.
\[
p_x = P\left(\frac{x}{S_f}\right) S_f. \tag{3}
\]
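Eq. (2) and Eq. (3) can be sketched together in a few lines of Python. The sketch below is illustrative only: the names `to_posit`, `scaling_factor`, and `scaled_posit` are hypothetical, tensors are modeled as plain Python lists, and the logarithm is applied to the magnitudes of the non-zero entries, which is an assumption because the text does not say how zeros and negative values are handled. The quantizer is a compact round-to-zero version of Algorithm 1 with a non-negative mantissa width.

```python
import math


def to_posit(x, n, es):
    """Round-to-zero posit quantizer following Algorithm 1 (float in, float out)."""
    useed = 2.0 ** (2 ** es)
    maxpos, minpos = useed ** (n - 2), useed ** (2 - n)
    if abs(x) < minpos:
        return 0.0
    xc = min(abs(x), maxpos)
    exp = math.frexp(xc)[1] - 1                      # floor(log2(xc))
    k = exp // (2 ** es)
    rb = k + 2 if k >= 0 else -k + 1
    eb = min(n - 1 - rb, es)
    fb = max(n - 1 - rb - eb, 0)
    pe = math.floor((exp - k * 2 ** es) * 2.0 ** (eb - es)) * 2.0 ** (es - eb)
    pf = math.floor((xc / 2.0 ** exp - 1.0) * 2.0 ** fb) * 2.0 ** (-fb)
    return math.copysign(useed ** k * 2.0 ** pe * (1.0 + pf), x)


def scaling_factor(values, sigma=2):
    """Eq. (2): power-of-two scale from the log-domain center of a tensor."""
    # Assumption: use |x| and skip zeros; at least one entry must be non-zero.
    logs = [math.log2(abs(v)) for v in values if v != 0.0]
    center = round(sum(logs) / len(logs))
    return 2.0 ** (center + sigma)


def scaled_posit(values, n, es, sigma=2):
    """Eq. (3): quantize x / Sf, then multiply the result back by Sf."""
    sf = scaling_factor(values, sigma)
    return [to_posit(v / sf, n, es) * sf for v in values]


weights = [0.011, -0.02, 0.035, 0.05]   # made-up example values
plain = [to_posit(w, 8, 1) for w in weights]
scaled = scaled_posit(weights, 8, 1)
print("Sf =", scaling_factor(weights))
print("error without scaling:", sum(abs(w - q) for w, q in zip(weights, plain)))
print("error with scaling:   ", sum(abs(w - q) for w, q in zip(weights, scaled)))
```

Because the scale is a power of two, dividing and multiplying by it only changes the exponent, so the only effect of the scaling is to move the values to magnitudes where the posit format keeps more mantissa bits.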
Adjust Dynamic Range: During DNN training, different layers have different distribution ranges, which are measured approximately by the difference between the maximum and minimum values in the log domain. For example, in the first few layers, the ranges of the gradients are larger than the ranges of other values. In this case, the posit number should have a larger dynamic range, which means a larger es value. In this work, for simplicity, we set es to 1 for all weights and activations and to 2 for all gradients and errors.
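To make the effect of es concrete, the short Python sketch below evaluates the dynamic range implied by maxpos = useed^(n−2) and minpos = useed^(2−n) in the log domain, where log2(maxpos) = 2^es × (n − 2). The helper name `posit_log2_range` is hypothetical; the formats listed are the ones used in this work.

```python
def posit_log2_range(n: int, es: int):
    """Return (log2(minpos), log2(maxpos)) for an (n, es) posit."""
    span = (2 ** es) * (n - 2)   # log2(useed^(n-2)) with useed = 2^(2^es)
    return -span, span


for n, es in [(8, 1), (8, 2), (16, 1), (16, 2)]:
    lo, hi = posit_log2_range(n, es)
    print(f"posit({n},{es}): 2^{lo} .. 2^{hi}")
```

Increasing es by one doubles the range in the log domain at the cost of one mantissa bit for values near 1, which is the trade-off behind using es = 2 for gradients and errors.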
C. Experiment Results
TABLE III
TRAINING CONFIGURATIONS AND VALIDATION ACCURACY RESULTS
Dataset         Cifar-10            ImageNet
model           Cifar-ResNet-18     ResNet-18
batch size      512                 512
epochs          300                 120
optimizer       SGD with momentum   SGD with momentum
FP32 baseline   93.40               71.02
posit           92.87 (1)           71.09 (2)
(1) posit (8,1) for CONV layers forward pass and weight update, posit (8,2) for CONV layers backward pass; posit (16,1) for BN layers forward pass and weight update, posit (16,2) for BN layers backward pass.
(2) posit (16,1) for forward pass and weight update, posit (16,2) for backward pass.
To validate our posit training strategy, we perform experiments with ResNet-18 [1] on the ImageNet and Cifar-10 datasets using the PyTorch framework on NVIDIA P100 GPUs. The top-1 validation accuracy and the related configurations are summarized in Table III, which demonstrates that training with reduced-precision posit numbers can achieve FP32 baseline accuracy without tuning hyperparameters. The training details are as follows:
Cifar-10: The model uses stochastic gradient descent with momentum 0.9 as the optimizer. The initial learning rate is set to 0.1 and divided by 10 at epochs 60, 150, and 250. The network is trained for 300 epochs with a mini-batch size of 512. The warm-up training runs for 1 epoch.
ImageNet: The model uses stochastic gradient descent with momentum 0.9 as the optimizer. The initial learning rate is set to 0.1 and divided by 10 every 30 epochs. The model is trained for 90 epochs with a mini-batch size of 512. The warm-up training runs for 5 epochs.
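The schedules described above can be written down as a small Python sketch. It is a framework-independent illustration, not the authors' training script: the names `step_lr`, `uses_posit`, and the constants are hypothetical, epochs are assumed to be counted from 0, and the ImageNet milestones assume the 90-epoch run described in the text.

```python
CIFAR10_MILESTONES = [60, 150, 250]
IMAGENET_MILESTONES = [30, 60]          # "every 30 epochs" over 90 epochs
WARMUP_EPOCHS = {"cifar10": 1, "imagenet": 5}


def step_lr(epoch: int, milestones, base_lr: float = 0.1, gamma: float = 0.1) -> float:
    """Learning rate after dividing by 10 at every milestone already reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


def uses_posit(epoch: int, dataset: str) -> bool:
    """FP32 during the warm-up epochs, posit afterwards."""
    return epoch >= WARMUP_EPOCHS[dataset]


for epoch in (0, 1, 59, 60, 150, 250):
    number_format = "posit" if uses_posit(epoch, "cifar10") else "FP32"
    print(epoch, number_format, step_lr(epoch, CIFAR10_MILESTONES))
```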
IV. ENERGY-EFFICIENT POSIT MAC ARCHITECTURE
By using 8-bit or 16-bit posit numbers for training, the model size can be reduced to 25% or 50% of its FP32 size, and the energy consumption can be reduced significantly because the memory requirements and the communication bandwidth are reduced. As for the computation, the energy consumption mainly comes from the large number of MAC operations. Since posit arithmetic operations differ from traditional floating-point arithmetic operations, a dedicated MAC unit is required to take full advantage of reduced-precision posit.
[Figure 4: three decoders convert the posit inputs a, b, and c into sign, effective exponent, and mantissa (s, exp, f); an FP MAC produces (s_z, exp_z, f_z); an encoder converts the result into the posit output z.]
Fig. 4. The overall architecture for the posit MAC
As shown in Fig. 4, the posit MAC unit proposed in [6] is mainly composed of three units: a decoder converting posit to FP, an FP MAC unit, and an encoder converting FP to posit. In this design, the sum of the encoder delay and the decoder delay accounts for about 40% of the total posit MAC delay. Based on this result, improved architectures with lower latency are proposed for the decoder and the encoder, which are shown in Fig. 5 and Fig. 6, respectively.
A. The Optimized Decoder and Encoder Architectures
The decoder extracts the different parts of a posit and exports the effective exponent value and the mantissa value. Firstly, the absolute regime value of the input posit number is calculated by a leading-one detector (LOD) if the real regime value is negative, or by a leading-zero detector (LZD) if the real regime value is positive. Secondly, the input is left shifted by the width of the regime bits, which is equal to r or r + 1, where r is the absolute regime value. The output of the Left Shifter consists of the posit exponent value and the mantissa value. Finally, the regime value and the posit exponent value are packed into the effective exponent value. The critical path of the original decoder is determined by the add-one operation. As shown in Fig. 5, we remove the adder and split the left shift path by duplicating the Left Shifter. To preserve the function of the adder, a left-shift-by-one ("<< 1") operation is inserted after Left Shifter2.
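The function of the decoder can be illustrated with a bit-level behavioral model in Python. This is only a sketch of what the decoder computes, not the RTL of Fig. 5 and not the optimized datapath: the name `decode_fields` is hypothetical, the sign bit is assumed to have been handled separately so that only the n−1 remaining bits are passed in, and the model shifts out the regime run together with its terminating bit in one step.

```python
def decode_fields(body: int, n: int, es: int):
    """Split the n-1 bits after the sign into (effective_exp, mantissa, mant_width).

    The mantissa is left-aligned: its fraction value is mantissa / 2**mant_width.
    """
    width = n - 1
    if not 0 < body < (1 << width):
        raise ValueError("body must be a non-zero (n-1)-bit value")
    lead = (body >> (width - 1)) & 1
    # Run length of the regime: a run of zeros is ended by the leading one
    # (LOD case), a run of ones is ended by the leading zero (LZD case).
    run = 0
    while run < width and ((body >> (width - 1 - run)) & 1) == lead:
        run += 1
    regime = run - 1 if lead else -run
    shift = min(run + 1, width)                   # regime run plus terminating bit
    rest = (body << shift) & ((1 << width) - 1)   # left shift drops the regime
    mant_width = width - es
    posit_exp = rest >> mant_width                # top es bits after the shift
    mantissa = rest & ((1 << mant_width) - 1)
    effective_exp = regime * (1 << es) + posit_exp
    return effective_exp, mantissa, mant_width


# (5, 1) code 00101 from Table I: value 2^-2 * (1 + 4/8) = 3/8.
print(decode_fields(0b0101, 5, 1))
```

The returned triple corresponds to the FP-style fields that the FP MAC in Fig. 4 consumes: the value is 2^effective_exp × (1 + mantissa / 2^mant_width).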
[Figure 5 panels: (a) The original decoder; (b) The optimized decoder. Blocks: LOD, LZD, MUX, and Left Shifter in (a); LOD, LZD, MUX, Left Shifter1, Left Shifter2, and a <<1 stage in (b). Signals: in[n-1], in[n-2:0], in[n-3:0], pos_regime, neg_regime, abs(regime), regime, posit_exp, mantissa, effective_exp.]
Fig. 5. The decoder architectures before and after optimization
TABLE IV
DELAY COMPARISON OF ENCODER AND DECODER WITH [6]
                              posit(8,0)   posit(16,1)   posit(32,3)
[6]    delay (ns)   encoder   0.2          0.29          0.35
                    decoder   0.2          0.28          0.34
Ours   delay (ns)   encoder   0.13         0.18          0.23
                    decoder   0.14         0.21          0.29
       power (mW)   encoder   0.21         0.44          0.59
                    decoder   0.27         0.45          0.66
       area (µm²)   encoder   137          295           540
                    decoder   201          504           960
The encoder converts the FP format to the posit format. Firstly, a 2n-bit variable REM is constructed from the mantissa and the es least significant bits (LSBs) of the exponent, and the remaining bits are filled with the regime sequence. Then REM is right shifted by the width of the regime bits, which is equal to r or r + 1, where r is the absolute regime value. Because this shift amount has the same form as in the decoder, an optimization similar to that used in the optimized decoder is applied to the encoder architecture.
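The function of the encoder can likewise be illustrated with a behavioral Python model. It is a sketch under stated assumptions rather than the datapath of Fig. 6: the name `encode_fields` is hypothetical, the sign is handled separately, bits that do not fit are truncated (rounding toward zero), and instead of building the 2n-bit REM word and shifting it right, the model concatenates the regime sequence, the es exponent LSBs, and the mantissa, then keeps the n−1 most significant bits, which produces the same packed fields.

```python
def encode_fields(effective_exp: int, mantissa: int, mant_width: int,
                  n: int, es: int) -> int:
    """Pack (effective_exp, mantissa) into the n-1 posit bits after the sign."""
    width = n - 1
    regime = effective_exp >> es                  # floor(effective_exp / 2^es)
    posit_exp = effective_exp & ((1 << es) - 1)   # the es LSBs of the exponent
    if regime >= 0:
        head = ((1 << (regime + 1)) - 1) << 1     # regime+1 ones ended by a zero
        head_width = regime + 2
    else:
        head = 1                                  # -regime zeros ended by a one
        head_width = -regime + 1
    tail_width = es + mant_width
    tail = (posit_exp << mant_width) | mantissa
    word = (head << tail_width) | tail
    total_width = head_width + tail_width
    if total_width >= width:
        return word >> (total_width - width)      # truncate the low bits
    return word << (width - total_width)          # pad with zeros on the right


# Effective exponent -2 with mantissa 0.5 (binary 100) gives code 0101, i.e. 3/8.
print(format(encode_fields(-2, 0b100, 3, 5, 1), "04b"))
```

In this model, exponents above the representable range saturate to the all-ones pattern (maxpos), and exponents below it truncate to zero, which matches the flush-to-zero behavior of Algorithm 1.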
[Figure 6 panels: (a) The original encoder; (b) The optimized encoder. Blocks: Absolute Value, MUX, and Right Shifter, plus a >>1 stage in (b). Signals: mantissa, effective_exp[E:0], effective_exp[E], effective_exp[es-1:0], abs(effective_exp)[es-1:0], abs(regime), regime_size (in (a)), out[n-2:0].]
Fig. 6. The encoder architectures before and after optimization
B. Hardware Implementation Results
The architectures are coded in Verilog HDL and synthesized by Design Compiler under TSMC 28nm technology. To demonstrate the efficiency of the proposed encoder and decoder, the same parameterized configurations as in [6] are evaluated. The comparison results in Table IV show that our encoder is 25%-35% faster and our decoder is 15%-30% faster, thereby reducing the impact of these two units on the total delay.
After combining the proposed encoder and decoder with the FP MAC unit, an energy-efficient posit MAC architecture is obtained. To meet the requirements of DNN training with posit, posit MAC units supporting all of the posit formats listed in Table III are implemented. The implementation results are summarized in Table V. For a fair comparison of energy consumption between the posit MAC and the FP32 MAC, all of these units are synthesized with a timing constraint of 750MHz. Compared with the FP32 MAC, the posit MAC can reduce the power by 22%-83% and the area by 6%-76%.
TABLE V
COMPARISON OF POSIT MAC WITH FP32
              Power (mW)   Area (µm²)
FP32          2.52         4322
posit(8,1)    0.45         1208
posit(8,2)    0.35         1032
posit(16,1)   1.77         4079
posit(16,2)   1.60         3897
V. CONCLUSION AND FUTURE WORK
In this paper, with several useful methods proposed, the posit number system is successfully applied to DNN training. The experimental results show that reduced-precision posit can achieve accuracy similar to FP32 on different datasets. If posit is applied in DNN accelerators, the overhead caused by data communication can be reduced by 2-4×. To take full advantage of posit, an energy-efficient posit MAC unit is designed. Compared with the FP32 MAC, the posit MAC can reduce the power by 22%-83% and the area by 6%-76%.
In future work, we will implement a hardware accelerator for DNN training with posit. In addition, building posit arithmetic around an encoder and a decoder may not be the optimal approach. We will carefully design a new architecture for the posit MAC to further improve its performance.
REFERENCES
[1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep
residual learning for image recognition. In Proceedings of the IEEE
conference on computer vision and pattern recognition, pages 770–778,
2016.
[2] Dario Amodei, Sundaram Ananthanarayanan, Rishita Anubhai, Jingliang
Bai, Eric Battenberg, Carl Case, Jared Casper, Bryan Catanzaro, Qiang
Cheng, Guoliang Chen, et al. Deep speech 2: End-to-end speech
recognition in English and Mandarin. In International conference on
machine learning, pages 173–182, 2016.
[3] Yequan Wang, Minlie Huang, Li Zhao, et al. Attention-based LSTM
for aspect-level sentiment classification. In Proceedings of the 2016
conference on empirical methods in natural language processing, pages
606–615, 2016.
[4] Raghuraman Krishnamoorthi. Quantizing deep convolutional networks
for efficient inference: A whitepaper. arXiv preprint arXiv:1806.08342,
2018.
[5] John L Gustafson and Isaac T Yonemoto. Beating floating point at its
own game: Posit arithmetic. Supercomputing Frontiers and Innovations,
4(2):71–86, 2017.
[6] Hao Zhang, Jiongrui He, and Seok-Bum Ko. Efficient posit multiply-
accumulate unit generator for deep learning applications. In 2019 IEEE
International Symposium on Circuits and Systems (ISCAS), pages 1–5.
IEEE, 2019.
[7] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish
Narayanan. Deep learning with limited numerical precision. In
International Conference on Machine Learning, pages 1737–1746, 2015.
[8] Daisuke Miyashita, Edward H Lee, and Boris Murmann. Convolutional
neural networks using logarithmic data representation. arXiv preprint
arXiv:1603.01025, 2016.
[9] Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos,
Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii
Kuchaiev, Ganesh Venkatesh, et al. Mixed precision training. arXiv
preprint arXiv:1710.03740, 2017.
[10] Naigang Wang, Jungwook Choi, Daniel Brand, Chia-Yu Chen, and
Kailash Gopalakrishnan. Training deep neural networks with 8-bit
floating point numbers. In Advances in neural information processing
systems, pages 7675–7684, 2018.
[11] Manish Kumar Jaiswal and Hayden K-H So. Universal number posit
arithmetic generator on FPGA. In 2018 Design, Automation & Test
in Europe Conference & Exhibition (DATE), pages 1159–1162. IEEE,
2018.
[12] Zachariah Carmichael, Hamed F Langroudi, Char Khazanov, Jeffrey
Lillie, John L Gustafson, and Dhireesha Kudithipudi. Deep positron:
A deep neural network using the posit number system. arXiv preprint
arXiv:1812.01762, 2018.
[13] Jeff Johnson. Rethinking floating point for deep learning. arXiv preprint
arXiv:1811.01721, 2018.
[14] Shuang Wu, Guoqi Li, Feng Chen, and Luping Shi. Training and
inference with integers in deep neural networks. arXiv preprint
arXiv:1802.04680, 2018.
[15] Song Han, Huizi Mao, and William J Dally. Deep compression:
Compressing deep neural networks with pruning, trained quantization
and Huffman coding. arXiv preprint arXiv:1510.00149, 2015.
