Template-Based Posit Multiplication for Training and Inferring in Neural
  Networks by Montero, Raúl Murillo et al.
TEMPLATE-BASED POSIT MULTIPLICATION FOR TRAINING AND
INFERRING IN NEURAL NETWORKS
A PREPRINT
Raúl Murillo Montero
Department of Computer Architecture and Automation
Complutense University of Madrid
Madrid, 28040
ramuri01@ucm.es
Alberto A. Del Barrio
Department of Computer Architecture and Automation
Complutense University of Madrid
Madrid, 28040
abarriog@ucm.es
Guillermo Botella
Department of Computer Architecture and Automation
Complutense University of Madrid
Madrid, 28040
gbotella@ucm.es
July 10, 2019
ABSTRACT
The posit number system is arguably the most promising and discussed topic in Arithmetic nowadays.
The recent breakthroughs claimed by the format proposed by John L. Gustafson have put posits in the
spotlight. In this work, we first describe an algorithm for multiplying two posit numbers, even when
the number of exponent bits is zero. This configuration, scarcely tackled in literature, is particularly
interesting because it allows the deployment of a fast sigmoid function. The proposed multiplication
algorithm is then integrated as a template into the well-known FloPoCo framework. Synthesis results
are shown to compare with the floating point multiplication offered by FloPoCo as well. Second, the
performance of posits is studied in the scenario of Neural Networks in both training and inference
stages. To the best of our knowledge, this is the first time that training is done with posit format,
achieving promising results for a binary classification problem even with reduced posit configurations.
In the inference stage, 8-bit posits are as good as floating point when dealing with the MNIST dataset,
but lose some accuracy with CIFAR-10.
Keywords Posits · Multiplier · Neural Networks · Train · Inference
1 Introduction
Multiple floating-point representations have been used in computers over the years, although the IEEE Standard for
Floating-Point Arithmetic (IEEE 754) [1] is the most common implementation that modern computing systems have
adopted. Since it was established in 1985, the standard has only been revised once, in 2008 (IEEE 754-2008) [2]. The
main characteristics of the original were retained to keep compatibility with existing implementations, and the revision
is not adopted by all computer systems. Moreover, multiple shortcomings have been identified in the IEEE 754
standard, which are listed below [3]:
• Different computers using the same IEEE floating-point format are not required to produce the same results.
When the result of a computation does not fit into the chosen number representation, it is rounded. Even though
the last revision of the standard introduces the round-to-nearest, ties-away-from-zero rounding scheme and
provides recommendations for the reproducibility of computations, hardware designers are not obliged to
implement them. Therefore, identical computations can lead to different results across computing platforms [4].
arXiv:1907.04091v1 [cs.CV] 9 Jul 2019
[Fields, from MSB to LSB: sign bit (s); regime bits (r r r ... r̄); exponent bits, if any (e1 e2 e3 ... e_es); fraction bits, if any (f1 f2 f3 f4 f5 f6 ...)]
Figure 1: Layout of an 〈n, es〉 posit number.
• Multiple bit patterns are used for handling exceptions such as the Not a Number (NaN) value, which indicates
that a value is not representable or is undefined (for example, dividing zero by zero results in a NaN). The problem
is that the number of bit patterns that represent NaN may be more than necessary, making hardware design more
complex and decreasing the number of exactly representable values.
• IEEE 754 makes use of overflow (accepting ∞ or −∞ as a substitute for large-magnitude finite numbers)
and underflow (accepting 0 as a substitute for small-magnitude nonzero numbers). These substitutions can lead
to major problems such as the ones mentioned above.
• Rounding is performed on the result of every individual operation, so the associativity and distributivity
properties do not always hold in floating-point arithmetic. The last revision of the standard tries to address this
issue by including the fused multiply-add (FMA) operation. However, again, this may not be supported by all
computer systems.
The shortcomings listed above led to the idea of developing a new number system that can serve as a replacement for
the now ubiquitous IEEE 754 arithmetic. At the beginning of 2017, John L. Gustafson introduced the posit number
representation system, a Type III unum format [3, 5] that has no underflow, overflow or wasted NaN values. Gustafson
claims that posits are not only a suitable replacement for the current IEEE Standard for Floating-Point Arithmetic, but
also provide more accurate answers with an equal or smaller number of bits and simpler hardware [6]. As illustrated in
[6, 7, 8], posits offer important benefits over conventional floating point, such as better dynamic range, accuracy,
closure and consistency between machines. However, posits are still in development and there is still some
controversy about their improvement [9, 10].
The numerical value of a posit number X, whose bits are distributed as shown in Figure 1, is given by Equation 1:
\[ X = (-1)^{s} \times \left(2^{2^{es}}\right)^{k} \times 2^{e} \times 1.f , \tag{1} \]
where k is the regime value, e is the unsigned exponent (if es > 0), and f is the mantissa of the number without the
implicit one. In terms of format layout, the main differences with floating point are the existence of the regime field
and the unsigned and unbiased exponent, if such an exponent field exists. The regime is a sequence of bits
with the same value (r) terminated by the negation of that value (r̄). Provided that X = x_{n−1}x_{n−2}...x_1x_0, this
regime can be expressed as Equation 2 shows.
\[ k = -x_{n-2} + \sum_{i=n-2}^{x_i \neq x_{n-2}} (-1)^{1-x_i} . \tag{2} \]
In other words, the regime basically counts the number of occurrences of the bit labelled as r in Figure 1. If r = 1 then
the regime is the number of 1’s minus 1, while if r = 0, the regime is the negative value of the number of 0’s. For
instance, if the regime is 4 bits wide, the value 1110 would be interpreted as k = 2, while the value 0001 would be
k = −3. Hence, detecting the leading 1’s and 0’s is critical for performing this step [11, 12, 13]. However, the main
difficulty in detecting the regime, and consequently in unpacking the posit, is that its length varies dynamically. For
magnitudes close to 1 there are few regime bits and many fraction bits, whereas for very large or very small magnitudes
there are many regime bits and fewer fraction bits; no field has a fixed number of bits. Finally, it must be noted that the
base of the scaling factor applied by the regime, 2^(2^es), which depends only on the number of exponent bits (es), is
typically named useed.
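As an illustration of Equations 1 and 2, the following minimal Python sketch evaluates a posit bit pattern with exact rational arithmetic. It is a software reference model written for this explanation, not the hardware decoder of Section 3; the function names posit_fields and posit_value are hypothetical, and truncated exponent bits are assumed to count as zeros.

```python
from fractions import Fraction

def posit_fields(bits: int, n: int, es: int):
    """Return (s, k, e, f) of Equation 1 for a nonzero, non-infinite <n, es> pattern."""
    sign = (bits >> (n - 1)) & 1
    if sign:                                   # negative posits are stored in 2's complement
        bits = (-bits) & ((1 << n) - 1)
    body = format(bits & ((1 << (n - 1)) - 1), f"0{n - 1}b")   # bits after the sign
    r = body[0]
    run = len(body) - len(body.lstrip(r))      # length of the run of identical regime bits
    k = run - 1 if r == "1" else -run          # Equation 2
    rest = body[run + 1:]                      # skip the terminating (negated) regime bit
    exp_bits = rest[:es].ljust(es, "0")        # assumption: missing exponent bits are zeros
    frac_bits = rest[es:]
    e = int(exp_bits, 2) if es else 0
    f = Fraction(int(frac_bits, 2), 1 << len(frac_bits)) if frac_bits else Fraction(0)
    return sign, k, e, f

def posit_value(bits: int, n: int, es: int) -> Fraction:
    """Exact value of an <n, es> posit bit pattern (Equation 1)."""
    bits &= (1 << n) - 1
    if bits == 0:
        return Fraction(0)
    if bits == 1 << (n - 1):
        raise ValueError("pattern 10...0 encodes infinity, not a real value")
    sign, k, e, f = posit_fields(bits, n, es)
    useed = Fraction(2 ** (2 ** es))
    return (-1) ** sign * useed ** k * 2 ** e * (1 + f)

print(posit_fields(0b01110000, 8, 0)[1])   # regime 1110
print(posit_fields(0b00001000, 8, 0)[1])   # regime 0001
print(posit_value(0x50, 8, 1))
```

The two regime patterns used in the text (1110 and 0001) can be checked directly with posit_fields, and posit_value makes it easy to tabulate every value of a small configuration.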
The variety of configurations that posit numbers provide is an excellent opportunity for research on the design of
efficient units that implement posit operations. The recommended formats for obtaining accuracy similar
to floating point are: 〈8, 0〉 (w.r.t. minifloat [14]), 〈16, 1〉 (w.r.t. half precision) and 〈32, 2〉 (w.r.t.
single precision). Nevertheless, the aforementioned variability of posit configurations opens a gate to look for functional
units providing the best trade-offs [15, 16]. It is therefore desirable to provide a generic architecture and a generic flow
to implement posit functional units. For this purpose, in this paper we leverage the FloPoCo framework [17] capabilities.
FloPoCo (Floating-Point Cores, but not only) is an open-source C++ framework for the generation of arithmetic
datapaths which provides a command-line interface that inputs operator specifications and outputs synthesizable
VHDL specially suited for Field-Programmable Gate Arrays (FPGAs). In this paper we have defined a generic posit
multiplication algorithm and integrated it as a template within the FloPoCo framework. This way, it is possible to
generate synthesizable multipliers with any posit configuration, including es = 0. The code is publicly available
on GitHub¹. Moreover, we have evaluated the performance of this algorithm and other posit operators in one of the
scenarios where they can have a greater impact because of their accuracy with reduced formats: the Neural Networks
(NNs). To the best of our knowledge, this is the first time that posits are evaluated in the training stage, providing
promising results. In the inference phase, short posit formats match floating point for the MNIST dataset and lose some
accuracy with CIFAR-10.
The rest of the paper is organized as follows: Section 2 describes the state of the art regarding posit multipliers and
other functional units; Section 3 presents our algorithm for multiplying in the posit format as well as its integration with
the FloPoCo framework; Section 4 presents a study on posits and NNs; Section 5 provides synthesis and simulation
results to validate the approach and finally Section 6 draws our conclusions and future lines of work.
2 Related Work
Since the posit number system was introduced, interest in hardware implementations of this format has increased
rapidly. Despite the short lifetime of posits, several hardware implementations have been proposed since 2017.
The posit arithmetic unit proposed in [18, 5] includes floating-point to posit conversion, posit to floating-point
conversion, addition/subtraction and multiplication. The work in [19] improves the decimal accuracy of the latter and
defines a posit vectorized unit. Although these works seem fully parametrized, there is no support for the case of zero
exponent bits. This is important because in the Deep Learning scenario, the configuration 〈8, 0〉 provides an extremely
fast approximation of the sigmoid function [7].
Another general design for posit arithmetic unit that includes adders and multipliers is presented in [20]. In contrast to
the implementations shown in [18, 19], the posit decoder proposed in this architecture uses only a leading zero detector
for decoding the regime, while [18, 5] were using a leading one detector too. The work in [21] improves the prior
ones by employing just one leading one detector and provides a tool for generating different units with different posit
formats. Authors in [22] present a C++ template compliant with Intel OpenCL SDK. Nonetheless, the configuration
with es = 0 is not included either in any of the aforementioned articles. J. Johnson [23] employs posit addition to
complement the logarithmic multiplication when performing inference in Convolutional Neural Networks (CNNs),
studying the 〈8, 0〉 configuration too, but just in simulation. Other studies tackling the inference stage of CNNs are
performed in [24, 25, 26]. In all these works, the training stage is performed in floating point and the weights converted
to posit format, while the inference stage is performed in posit format.
Finally, for the sake of completeness it must be noted that since the posit standard [27] includes fused operations
such as the fused dot product, and due to the importance of this operation for matrix calculus, some research and
development for this kind of implementations has been done. Different matrix-multiply units for posits are presented in
[28, 19, 24, 25]. They make use of the quire register [27] to accumulate the partial additions that are involved in the dot
product, so the result is rounded only after the whole computation. Nevertheless, the fused operations are out of the
scope of this work, as the functional unit that will be described is a posit multiplier.
In this paper we propose an algorithm to perform the multiplication of two posit numbers and integrate it into a
well-known framework, FloPoCo [17], providing support for generating synthesizable multipliers for any posit
configuration, including the case es = 0. Furthermore, we present a study of both the training and inference stages of
NNs. As mentioned above, there have been several studies of the inference stage, but none of the training
phase.
3 The Posit Multiplier
At hardware level, posits were designed to be easy to compute, i.e., to have circuitry similar to that of existing
floating point. The main encoding difference between the float and posit formats is that the latter includes a run-time
varying scaling component. This leads to a format that has no fixed fields at run-time, which is a hardware design
challenge. Below we present a fully functional posit multiplier operator.
¹ https://github.com/RaulMurillo/Posit-Multiplier_FloPoCo
Analogously to performing computations with IEEE floats, it is necessary to unpack/decode the operand fields before
carrying out any computation. Therefore, we first present the posit decoding process in Algorithm 1. It must be noted
that the decoder can also be used in other arithmetic modules, e.g. a posit adder. The algorithm works
as follows:
• The sign and the special cases are detected by checking the Most Significant Bit (MSB) and ORing the remaining
bits, respectively (lines 2–5).
• Since posit arithmetic uses 2’s complement for representing negative numbers, dealing with the absolute value
simplifies the data extraction process. Therefore, the 2’s complement of the input is obtained, only if it is
necessary, by XORing the input with the replicated sign bit and adding the sign to the Least Significant Bit
(LSB) (line 6).
• The twos[N − 2] bit helps to determine the regime value. In order to use only a leading zero detector [20], we
invert the bits of twos if the regime consists of a sequence of ones (line 8). Then we count the sequence of 0
bits terminating in a 1 bit using a leading zero detector module (line 9).
• For extracting the exponent and the fraction bits, the regime is shifted out from twos, so the exponent is
aligned to the left (line 10).
• The first es bits of the shifted string correspond to the exponent bits (line 11; if es = 0 this instruction is
omitted), while the remaining bits correspond to the fraction (line 12). It must be noted that here the hidden bit is
appended as the MSB.
• The regime depends on the sequence of identical bits that constitute this field. The regime value is zc − 1
when the bits are 1 (positive regime) or −zc when it consists of a sequence of 0 bits (negative regime). Note
that an extra 0 is prepended to keep the sign bit of the operation (line 13), as zc is a positive value.
Algorithm 1 Posit data extraction
 1: procedure DECODE(in)
 2:   nzero ← ∨ in[N−2 : 0]                          ▷ Reduction OR
 3:   sign ← in[N−1]                                  ▷ Extract sign
 4:   z ← ¬(sign ∨ nzero)
 5:   inf ← sign ∧ ¬(nzero)
 6:   twos ← ({N−1{sign}} ⊕ in[N−2 : 0]) + sign       ▷ Input 2’s complement
 7:   rc ← twos[N−2]                                  ▷ Regime check
 8:   inv ← {N−1{rc}} ⊕ twos
 9:   zc ← LZD(inv)                                   ▷ Count leading zeros of regime
10:   tmp ← twos[N−4 : 0] ≪ (zc − 1)                  ▷ Shift out the regime
11:   exp ← tmp[N−4 : N−es−3]                         ▷ Extract exponent
12:   frac ← nzero & tmp[N−es−4 : 0]                  ▷ Extract fraction
13:   reg ← rc ? ‘0’ & zc − 1 : −(‘0’ & zc)           ▷ Select regime
14:   return sign, reg, exp, frac, z, inf
15: end procedure
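The decoding steps of Algorithm 1 can be mirrored in software with plain integer operations. The sketch below is a behavioural model written for illustration, not the generated VHDL: the helper lzd and the Python function decode are hypothetical names, the shift in line 10 is assumed to be a left shift truncated to N − 3 bits, and reg is returned as a signed Python integer instead of a 2’s complement bit string.

```python
def lzd(x: int, width: int) -> int:
    """Leading-zero count of a `width`-bit value (behavioural leading zero detector)."""
    return width - x.bit_length()

def decode(inp: int, N: int, es: int):
    """Behavioural model of Algorithm 1; comments give the pseudocode line numbers."""
    low_mask = (1 << (N - 1)) - 1                      # selects in[N-2 : 0]
    nzero = int((inp & low_mask) != 0)                 # line 2: reduction OR
    sign = (inp >> (N - 1)) & 1                        # line 3
    z = int(not (sign or nzero))                       # line 4: input is zero
    inf = int(sign and not nzero)                      # line 5: input is infinity
    rep_sign = low_mask if sign else 0
    twos = ((rep_sign ^ (inp & low_mask)) + sign) & low_mask   # line 6
    rc = (twos >> (N - 2)) & 1                         # line 7
    inv = (low_mask if rc else 0) ^ twos               # line 8
    zc = lzd(inv, N - 1)                               # line 9: run length of the regime
    width = N - 3                                      # width of tmp
    tmp = ((twos & ((1 << width) - 1)) << (zc - 1)) & ((1 << width) - 1)   # line 10
    fbits = width - es                                 # fraction bits without hidden bit
    exp = tmp >> fbits                                 # line 11 (0 when es = 0)
    frac = (nzero << fbits) | (tmp & ((1 << fbits) - 1))   # line 12: hidden bit as MSB
    reg = zc - 1 if rc else -zc                        # line 13
    return sign, reg, exp, frac, z, inf

def decoded_value(sign, reg, exp, frac, es, fbits):
    """Value represented by the decoded fields, following Equation 1."""
    return (-1) ** sign * 2.0 ** (reg * 2 ** es + exp) * frac / (1 << fbits)

fields = decode(0x30, 8, 0)
print(fields, decoded_value(*fields[:4], es=0, fbits=5))
```

For an 〈8, 0〉 posit the fraction returned by decode has 5 bits plus the hidden bit, so 0x30 decodes to a regime of −1 and a fraction of 48/32, i.e. the value 0.75.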
The process of posit multiplication is almost the same as for floating-point multiplication, i.e. the scaling factors are
added and the fractions are multiplied and rounded. There are a few differences when multiplying posits due to the
regime field. The pseudocode for posit multiplication is shown in Algorithm 2 and the explanation of the flow is as
follows:
• Once the two operands are decoded (lines 2–3), the sign and special cases are handled easily (lines 4–6).
• The Scaling Factor (SF) of each operand consists of the regime and the exponent values, one after the other
(lines 7–8). This follows from how the value of a posit is computed from the regime and the exponent (Equation 1).
• The resulting fraction field is the outcome of multiplying the two operand fractions as if they were integer
values (line 9). Recall that multiplying two n-bit integers results in an integer of 2n bits. In addition, the
decoder module returns fractions with the hidden bit as MSB, so the first two bits of the fraction product
do not strictly belong to the fraction field of the result, since they correspond to the multiplication of the hidden
bits plus the possible carry bit due to fraction overflow. Therefore, the MSB of the result helps to detect any
overflow when multiplying the fractions (line 10).
Algorithm 2 Posit Multiplier Algorithm
 1: procedure POSITMULT(in_A, in_B)
 2:   sign_A, reg_A, exp_A, frac_A, z_A, inf_A ← DECODE(in_A)
 3:   sign_B, reg_B, exp_B, frac_B, z_B, inf_B ← DECODE(in_B)
 4:   sign ← sign_A ⊕ sign_B                                      ▷ Sign computation
 5:   z ← z_A ∨ z_B                                               ▷ Special cases computation
 6:   inf ← inf_A ∨ inf_B
 7:   sf_A ← reg_A & exp_A                                        ▷ Gather scale factors
 8:   sf_B ← reg_B & exp_B
 9:   frac_mult ← frac_A × frac_B                                 ▷ Fractions multiplication
10:   ovf_m ← frac_mult[MSB]                                      ▷ Adjust for overflow
11:   norm_frac ← ovf_m ? ‘0’ & frac_mult : frac_mult & ‘0’       ▷ Normalize fraction
12:   sf_mult ← (sf_A[MSB] & sf_A) + (sf_B[MSB] & sf_B) + ovf_m   ▷ Add scaling factors
13:   sf_sign ← sf_mult[MSB]                                      ▷ Get regime’s sign
14:   nzero ← ∨ frac_mult
15:   exp ← sf_mult[es−1 : 0]                                     ▷ Unpack scaling factors
16:   reg_tmp ← sf_mult[MSB−2 : es]
17:   reg ← sf_sign ? −reg_tmp : reg_tmp                          ▷ Get regime’s absolute value
18:   ovf_reg ← reg[MSB]                                          ▷ Check for regime overflow
19:   reg_f ← ovf_reg ? ‘0’ & {⌈log2(N)⌉{‘1’}} : reg
20:   ovf_regf ← ∧ reg_f[MSB−2 : 0]
21:   exp_f ← (ovf_reg ∨ ovf_regf ∨ ¬nzero) ? {es{‘0’}} : exp
22:   tmp1 ← nzero & ‘0’ & exp_f & norm_frac[MSB−3 : 0] & {N−1{‘0’}}   ▷ Packing
23:   tmp2 ← ‘0’ & nzero & exp_f & norm_frac[MSB−3 : 0] & {N−1{‘0’}}
24:   shift_neg ← ovf_regf ? reg_f − 2 : reg_f − 1
25:   shift_pos ← ovf_regf ? reg_f − 1 : reg_f
26:   tmp ← sf_sign ? tmp2 ≫ shift_neg : tmp1 ≫ shift_pos         ▷ Final answer with extra bits
27:   LSB, G, R ← tmp[MSB−(N−1) : MSB−(N+1)]                      ▷ Unbiased rounding
28:   S ← ∨ tmp[MSB−(N+2) : 0]
29:   round ← (ovf_reg ∨ ovf_regf) ? ‘0’ : G ∧ (LSB ∨ R ∨ S)
30:   result_tmp ← ‘0’ & (tmp[MSB : MSB−(N−1)] + round)
31:   result ← inf ? infinity : z ? zero : sign ? −result_tmp : result_tmp
32:   return result
33: end procedure
• If a fraction overflow occurs, the resulting fraction has to be normalized by shifting it one bit to the right. In
order to avoid losing any bit for rounding, instead of shifting, we just append a 0 bit as MSB, or as LSB if there is
no overflow (line 11).
• The resulting scaling factor is obtained by adding both operand scales, plus the possible fraction overflow. The
result of adding two bit strings of the same size may overflow, and in this case the carry bit indicates the sign
of the resulting regime, so it is necessary to replicate the MSB of both scaling factors before adding them (lines
12–13).
• The exponent and the regime are extracted from the sum of the scaling factors. The obtained regime may be
negative, but it is more convenient to handle absolute values (lines 15–17). A similar action is performed in the
decoding stage to simplify the following steps.
• Adding two high-magnitude regimes may result in overflow, so in that case the regime is truncated to the
maximum possible value and the exponent is set to 0 (lines 18–21).
• Once the resulting fields have been computed and adjusted, they have to be packed in the correct order. To
construct the regime correctly, the packed fields have to be right-shifted as a signed integer according to the
sign and value of the regime. It is important to avoid losing any fraction bit in order to round correctly, so a number
of 0 bits has to be appended on the right (lines 22–26).
• Posits, like IEEE 754 floats, follow a round-to-nearest-even scheme. To perform a correct unbiased
rounding, the LSB, Guard (G), Round (R) and Sticky (S) bits are needed [29] (lines 27–29). The rounded
result is finally adjusted according to the sign and exceptions.
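Two parts of the flow described above can be sketched in a few lines of Python: the field-level product (lines 4–12 of Algorithm 2) and the round-to-nearest-even decision (lines 27–29). This is an illustrative model under simplifying assumptions, not the full multiplier: packing and regime saturation (lines 15–26) are omitted, the scaling factor is kept as a signed Python integer, and the names multiply_fields, fields_value and round_nearest_even are hypothetical.

```python
from fractions import Fraction

def multiply_fields(a, b, es, fbits):
    """Multiply two decoded posits given as (sign, reg, exp, frac).

    frac is an (fbits + 1)-bit integer whose MSB is the hidden bit.
    Returns (sign, sf, frac_mult, ovf), i.e. the state after line 12 of Algorithm 2.
    """
    sign = a[0] ^ b[0]                          # line 4
    sf_a = (a[1] << es) + a[2]                  # line 7: regime and exponent, one after the other
    sf_b = (b[1] << es) + b[2]                  # line 8
    frac_mult = a[3] * b[3]                     # line 9: 2 * (fbits + 1)-bit product
    ovf = frac_mult >> (2 * fbits + 1)          # line 10: MSB flags fraction overflow
    sf = sf_a + sf_b + ovf                      # line 12
    return sign, sf, frac_mult, ovf

def fields_value(sign, sf, frac_mult, ovf, fbits):
    """Exact value of the unrounded product."""
    return (-1) ** sign * Fraction(2) ** sf * Fraction(frac_mult, 1 << (2 * fbits + ovf))

def round_nearest_even(x: int, drop: int) -> int:
    """Drop the `drop` (>= 2) low bits of x using round = G and (LSB or R or S)."""
    kept = x >> drop
    lsb = kept & 1
    g = (x >> (drop - 1)) & 1                   # guard bit
    r = (x >> (drop - 2)) & 1                   # round bit
    s = int((x & ((1 << (drop - 2)) - 1)) != 0) # sticky bit: OR of the remaining bits
    return kept + (g & (lsb | r | s))

# <8, 0> operands 1.5 and 1.5 in decoded form (fbits = 5): the fraction product overflows
one_and_half = (0, 0, 0, 48)
print(fields_value(*multiply_fields(one_and_half, one_and_half, es=0, fbits=5), fbits=5))
```

In the example, 48 × 48 sets the MSB of the 12-bit product, so the overflow bit is added to the scaling factor and the unrounded result equals 9/4, as expected for 1.5 × 1.5.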
3.1 Integration with FloPoCo
In this subsection we briefly describe how to integrate the aforementioned algorithms within the FloPoCo
framework [17]. FloPoCo follows an object-oriented class hierarchy, where all operators inherit from a baseline
Operator virtual class. Thus, by extending this class and incorporating Algorithms 1 and 2 it is possible to create a
posit multiplier with n and es as input parameters.
Then, using the command flopoco <options> <operator specification list>, FloPoCo will generate a single
synthesizable VHDL file [17]. Figure 2 illustrates this process for the command flopoco PositMult N=8 es=1,
which produces the VHDL code for an 〈8, 1〉 multiplier; changing the values of N and es yields a
multiplier for any other posit configuration.
addFullComment("Special Cases");
vhdl << tab << declare("nzero") << " <= '0' when Input" << range(N-2, 0)
 << " = 0 else '1';" << endl;
addComment("1 if Input is zero");
vhdl << tab << "z <= Input" << of(N-1) << " NOR nzero;" << endl;
addComment("1 if Input is infinity");
vhdl << tab << "inf <= Input" << of(N-1) << " AND (NOT nzero);" << endl;
addFullComment("Extract Sign bit");
vhdl << tab << declare("my_sign") << " <= Input" << of(N-1) << ";" << endl;
vhdl << tab << "Sign <= my_sign;" << endl;
addFullComment("2's Complement of Input");
vhdl << tab << declare("rep_sign", N-1) << " <= (others => my_sign);" << endl;
vhdl << tab << declare("twos", N-1) << " <= (rep_sign XOR Input"
 << range(N-2,0) << ") + my_sign;" << endl;
vhdl << tab << declare("rc") << " <= twos" << of(N-2) << ";" << endl;
(a) Source code in PositMult.cpp file.
signal nzero :  std_logic;
signal my_sign :  std_logic;
signal rep_sign :  std_logic_vector(6 downto 0);
signal twos :  std_logic_vector(6 downto 0);
signal rc :  std_logic;
signal rep_rc :  std_logic_vector(6 downto 0);
signal inv :  std_logic_vector(6 downto 0);
signal zero_var :  std_logic;
signal zc :  std_logic_vector(2 downto 0);
signal zc_sub :  std_logic_vector(2 downto 0);
signal shifted_twos :  std_logic_vector(13 downto 0);
signal tmp :  std_logic_vector(4 downto 0);
begin
-------------------------------- Special Cases --------------------------------
   nzero <= '0' when Input(6 downto 0) = 0 else '1';
   -- 1 if Input is zero
   z <= Input(7) NOR nzero;
   -- 1 if Input is infinity
   inf <= Input(7) AND (NOT nzero);
----------------------------- Sign bit Extraction -----------------------------
   my_sign <= Input(7);
   Sign <= my_sign;
------------------------- 2's Complement of Input -------------------------
   rep_sign <= (others => my_sign);
   twos <= (rep_sign XOR Input(6 downto 0)) + my_sign;
   rc <= twos(6);
(b) Generated VHDL code.
Figure 2: Generation of synthesizable VHDL from C++ code with FloPoCo.
It is important to mention that, in contrast with the works presented in [18, 5, 20], which only provide implementations
with a non-zero value for es, we designed a generic template that can be used to automatically generate multipliers
for any posit configuration, not only those with es > 0. What is more, it is possible to generate combinational,
sequential and even FPGA-customized versions of the multiplier by just changing the options when invoking FloPoCo.
This will be shown in Section 5.
4 Case Study: Posits and Neural Networks
The recent surge of interest in Artificial Intelligence, and in particular in Deep Learning (DL), together with the
limitations this sector currently has in terms of power consumption and memory resources, makes us wonder whether
posits can be helpful in this field. As described in this paper, the posit number system has many interesting properties,
such as the lack of underflow and overflow or the fast approximation of the sigmoid function that some posit
configurations allow (es = 0). These, along with the so-called tapered precision [26], suggest that posits may be suitable
for performing DL tasks.
In a format with tapered precision the representable values are concentrated around 0 and become sparser as the
magnitude grows, so small values are represented more accurately than in other formats. When we use a number
system with tapered precision, such as posits, the representable values follow a distribution resembling a normal
distribution centered at 0. DNN weight parameters usually follow a similar distribution, although even more
concentrated around 0. Figure 3 illustrates this concept, which suggests
that using posits for DNN may provide more accurate results.
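The concentration of posit values around 0 can be checked by enumerating a small configuration. The following sketch lists every real value of the 〈8, 0〉 format and counts how many fall in [−1, 1]; posit8_value is a hypothetical helper written for this example (it assumes es = 0 and does not handle the infinity pattern).

```python
def posit8_value(bits: int, n: int = 8) -> float:
    """Value of an <n, 0> posit bit pattern (no exponent field)."""
    bits &= (1 << n) - 1
    if bits == 0:
        return 0.0
    neg = bits >> (n - 1)
    if neg:                                    # 2's complement gives the magnitude
        bits = (-bits) & ((1 << n) - 1)
    body = format(bits, f"0{n}b")[1:]          # bits after the sign
    run = len(body) - len(body.lstrip(body[0]))
    k = run - 1 if body[0] == "1" else -run    # regime value
    frac = body[run + 1:]                      # bits after the regime terminator
    f = int(frac, 2) / (1 << len(frac)) if frac else 0.0
    return (-1.0 if neg else 1.0) * 2.0 ** k * (1.0 + f)

values = [posit8_value(b) for b in range(256) if b != 0x80]   # skip the infinity pattern
inside = sum(1 for v in values if -1.0 <= v <= 1.0)
print(f"{inside} of {len(values)} real values lie in [-1, 1]")
print("largest magnitude:", max(values))
```

More than half of the representable values lie in [−1, 1], while the remaining ones spread out up to the largest magnitude of the format, which matches the shape sketched in Figure 3(a).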
[Figure 3, panel (a): Distribution of 〈8, 0〉 values (density versus value). Panel (b): LeNet-5 weight distribution for CIFAR-10 (density, logarithmic scale, versus value).]
Figure 3: Distributions of posit values and NN weights.
[Figure 4: scatter plot of the two-feature samples of both classes.]
Figure 4: Classification problem for posit training.
4.1 Training
The NN training with the posit format has been done using the posit-arithmetic library PySigmoid [30]. This
package was chosen because it allows working with arbitrary posit configurations, not only the
“common” Posit〈8, 0〉, Posit〈16, 1〉 and Posit〈32, 2〉, and because it has a function that simulates the hardware operation
for the fast sigmoid, which approximates the original function when posits have es = 0, in particular when using the
Posit〈8, 0〉 configuration. According to [7], this fast sigmoid can be computed by flipping the first bit of the posit and
shifting it right two places, so given an n-bit posit X, this behavior can be modeled by Equation 3.
\[ \sigma_{fast}(X, n) = \left(X \oplus 2^{n-1}\right) \gg 2 . \tag{3} \]
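The bit manipulation of Equation 3 is easy to try on 〈8, 0〉 patterns. The sketch below is a plain-Python illustration that does not use PySigmoid; fast_sigmoid_bits and posit8_value are hypothetical helpers, and the exact sigmoid is computed with math.exp only for comparison.

```python
import math

def fast_sigmoid_bits(x: int, n: int = 8) -> int:
    """Equation 3: flip the first (sign) bit of the pattern and shift right two places."""
    return ((x ^ (1 << (n - 1))) & ((1 << n) - 1)) >> 2

def posit8_value(bits: int, n: int = 8) -> float:
    """Value of an <n, 0> posit bit pattern (no exponent field, infinity not handled)."""
    bits &= (1 << n) - 1
    if bits == 0:
        return 0.0
    neg = bits >> (n - 1)
    if neg:
        bits = (-bits) & ((1 << n) - 1)
    body = format(bits, f"0{n}b")[1:]
    run = len(body) - len(body.lstrip(body[0]))
    k = run - 1 if body[0] == "1" else -run
    frac = body[run + 1:]
    f = int(frac, 2) / (1 << len(frac)) if frac else 0.0
    return (-1.0 if neg else 1.0) * 2.0 ** k * (1.0 + f)

for pattern in (0xC0, 0x00, 0x40, 0x60):        # the posits -1, 0, 1 and 2
    x = posit8_value(pattern)
    approx = posit8_value(fast_sigmoid_bits(pattern))
    exact = 1.0 / (1.0 + math.exp(-x))
    print(f"x = {x:5.2f}  fast = {approx:.4f}  exact = {exact:.4f}")
```

For these inputs the approximation stays within a few hundredths of the exact sigmoid, at the cost of one XOR and one shift.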
To measure how well posits perform at deep learning tasks, the binary classification problem depicted in Figure 4 has
been studied in detail. The samples consist of only two features and the classes are clearly separated by a non-linear
boundary.
The case study consists of training a NN architecture with two hidden layers of 4 and 8 neurons, respectively, to solve
the aforementioned binary classification problem and check whether posits are suitable for training. Although there are
multiple libraries and frameworks for Machine Learning (ML) that accelerate and simplify this kind of task, our test
requires that all the internal computations involving parameters of the network are done in the posit format. Hence,
the only option is to implement the NN from scratch, casting the input into posit type and replacing all the internal
operands by the ones from the PySigmoid library [30]. In this way the fused dot product with the quire accumulator can
be employed too. In order to perform operations with the quire, the original library was modified; the modified version
can be found on GitHub². The original version of this library allowed underflow for certain values, which was fixed in
the modified version we present in this paper.
4.2 Inference
As some research papers show [31, 32, 33], it is difficult to apply lower numerical precision to the training of NNs,
especially when using fewer than 16 bits. However, many research papers have shown that it is possible to apply
low-precision computing to the inference stage of NNs after training with exact arithmetic [34, 35, 36]. These results
lead to the idea of using a reduced posit format such as Posit〈8, 0〉 for carrying out DL inference. Performing
low-precision inference can be extremely helpful in embedded systems and in applications that make use of DL
techniques, such as Autonomous Driving [37].
There are some characteristics of the posit format that can be an advantage when using this format in NNs:
• The comparison of posits uses the same hardware as the comparison of integers, which is much faster than
comparing floats. Thus, the typical pooling layers of CNNs could be easily implemented.
• If using a format with es = 0, e.g. Posit〈8, 0〉, the sigmoid function can be approximately calculated as
described by Equation 3.
• Input values for NNs are usually normalized to the range [-1, 1]. Because of the aforementioned tapered precision,
the addition of two posit numbers is particularly accurate near 0.
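The first property in the list above can be verified exhaustively for a small format: ordering 〈8, 0〉 bit patterns as signed 2’s complement integers gives the same order as sorting by value. The sketch below is illustrative only; as_signed and posit8_value are hypothetical helpers, and the infinity pattern is excluded from the comparison.

```python
def as_signed(bits: int, n: int = 8) -> int:
    """Interpret an n-bit pattern as a 2's complement integer."""
    return bits - (1 << n) if bits >> (n - 1) else bits

def posit8_value(bits: int, n: int = 8) -> float:
    """Value of an <n, 0> posit bit pattern (no exponent field, infinity not handled)."""
    bits &= (1 << n) - 1
    if bits == 0:
        return 0.0
    neg = bits >> (n - 1)
    if neg:
        bits = (-bits) & ((1 << n) - 1)
    body = format(bits, f"0{n}b")[1:]
    run = len(body) - len(body.lstrip(body[0]))
    k = run - 1 if body[0] == "1" else -run
    frac = body[run + 1:]
    f = int(frac, 2) / (1 << len(frac)) if frac else 0.0
    return (-1.0 if neg else 1.0) * 2.0 ** k * (1.0 + f)

patterns = [b for b in range(256) if b != 0x80]
same_order = sorted(patterns, key=as_signed) == sorted(patterns, key=posit8_value)
print("integer order equals value order:", same_order)

# Max pooling over a window of posit patterns using only integer comparison
window = [0x30, 0x50, 0xC0]                    # the posits 0.75, 1.5 and -1
pooled = max(window, key=as_signed)
print(hex(pooled), posit8_value(pooled))
```

A max-pooling layer can therefore select the largest posit in a window with an ordinary signed integer comparator, without decoding any field.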
5 Experiments
In this section the experimental results are presented. First, the synthesis results of the template-based multiplier are
discussed and second, the performance of posit operators is evaluated on NNs in both training and inference stages.
5.1 Synthesis Results
Prior to synthesizing the posit multiplier described in Section 3, it has been verified as follows: first, the
golden solution has been obtained with the help of the Mathematica environment [38]. Afterwards, the VHDL test
bench is run using Xilinx Vivado Design Suite [39] and the outputs are compared with those from Mathematica, producing
no mismatch. Different posit configurations have been successfully tested. In particular, complete tests have been done
for 〈8, 0〉, 〈8, 1〉 and 〈8, 2〉, and less exhaustive ones for 〈16, 1〉 and 〈32, 2〉.
Several posit multipliers have been synthesized using Synopsys Design Compiler with a 65 nm target library and
without placing any timing constraint. The delay, area, power and energy of different posit multipliers have been
measured. Besides synthesizing multipliers for different posit formats, the capabilities of FloPoCo have been leveraged
to generate these units in different styles, namely: pipelined, combinational, and combinational without employing
hard multipliers or DSP blocks. Several floating-point multipliers have been generated using FloPoCo as well. In
this case, the notation 〈exp, mant〉 indicates the number of bits for the exponent and the mantissa, respectively. Given
that there is also a sign bit, the three explored FloPoCo floating-point configurations correspond to minifloat
(〈4, 3〉), IEEE Half Precision (〈5, 10〉) and IEEE Single Precision (〈8, 23〉). Nevertheless, it must be noted that FloPoCo
floating point does not handle exceptional cases (infinity, NaN) or subnormal numbers. Table 1 presents the synthesis
results. In the case of the pipelined designs, the number of stages is indicated in parentheses next to the delay value.
A first conclusion that can be drawn is that the posit pipelined designs are not optimized in terms of pipeline
depth, as they have many stages in comparison with the equivalent floating-point units. For example, 8-bit multipliers
require at least 7 stages, which is a lot for this kind of component. Second, the FloPoCo floating-point units are more
efficient but, as mentioned before, they are not complete while our multiplier is.
In order to compare results with a state-of-the-art multiplier [20], we have used Xilinx Vivado for synthesizing our
proposed multiplier on a ZedBoard Zynq-7000 SoC, the same target as in [20]. This comparison in terms of LUTs
and DSPs is shown in Table 2. In the case of the pipelined design, it must also be considered that extra resources are
necessary. For example, for the 32-bit case, 65 LUTRAMs, 910 FFs and one BUFG are required.
² https://github.com/RaulMurillo/PySigmoid/tree/master/PySigmoid
                                       Posit〈n, es〉 configurations                              FP 〈exp, mant〉 configurations
Metric        Design                   〈8, 0〉    〈8, 1〉    〈8, 2〉    〈16, 1〉    〈32, 2〉     〈4, 3〉    〈5, 10〉   〈8, 23〉
Delay (ns)    Pipelined (stages)       0.8 (8)   0.79 (8)  0.78 (7)  1.06 (10)  2.3 (15)    0.63 (2)  1.07 (3)  1.58 (2)
              Combinational            3.59      3.52      3.17      6.2        10.34       1.18      2.62      4.97
              Combinational, No hm     3.36      3.23      3.18      6.2        9.6         1.18      2.62      4.39
Area (µm2)    Pipelined                2799      2745      2481      6898       24299       950       3800      7757
              Combinational            1488      1483      1415      3865       15459       700       2883      6684
              Combinational, No hm     1271      1152      1048      3865       21894       700       2883      11640
Power (µW)    Pipelined                397       384.3     313.7     862.1      2269.2      158.6     815.3     4362.8
              Combinational            631.3     562.1     428.4     2609.6     12693.6     121.4     987.9     4983.4
              Combinational, No hm     612.4     503.9     424       2609.6     13053.3     121.4     987.9     4956.5
Energy (pJ)   Pipelined                2.54      2.43      1.71      9.14       78.29       0.20      2.62      13.79
              Combinational            2.27      1.98      1.36      16.18      131.25      0.14      2.59      24.77
              Combinational, No hm     2.06      1.63      1.35      16.18      125.31      0.14      2.59      21.76
Table 1: Posit multipliers synthesis results.
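The text does not state how the energy figures of Table 1 were derived, but they appear consistent with the product of power and delay, multiplied by the number of stages for the pipelined designs. The short check below reproduces a few entries under that assumption; the formula is inferred from the reported numbers and is not taken from the text.

```python
# (power in µW, delay in ns, pipeline stages, reported energy in pJ) taken from Table 1
rows = {
    "Posit<8,0> pipelined":      (397.0, 0.80, 8, 2.54),
    "Posit<8,0> combinational":  (631.3, 3.59, 1, 2.27),
    "Posit<32,2> pipelined":     (2269.2, 2.30, 15, 78.29),
    "FP<8,23> combinational":    (4983.4, 4.97, 1, 24.77),
}

def energy_pj(power_uw: float, delay_ns: float, stages: int) -> float:
    """Assumed model: energy = power x delay x stages (µW x ns = fJ, hence the / 1000)."""
    return power_uw * delay_ns * stages / 1000.0

for name, (power, delay, stages, reported) in rows.items():
    print(f"{name:26s} computed = {energy_pj(power, delay, stages):7.2f} pJ   reported = {reported} pJ")
```

Under this reading, the pipelined posit designs pay for their deep pipelines in energy even though their power is lower than that of the combinational versions.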
                          Posit〈16, 1〉          Posit〈32, 2〉
Datapath                  Slice LUT    DSP      Slice LUT    DSP
[20]                      218          1        572          4
Pipelined                 321          1        891          2
Combinational             266          1        927          2
Combinational, No hm      266          1        1640         0
Table 2: Comparison of posit multipliers synthesis area results.
As can be observed in Table 2, our results are not as good as those presented in [20] in terms of LUTs. Nonetheless,
for 32 bits our implementations employ fewer DSPs (2 or 0 instead of 4). Furthermore, it should be recalled that our
flow also supports the case of es = 0.
5.2 Neural Networks Training
As described in Section 4.1, different Posit〈n, es〉 configurations have been studied for the binary classification
problem. The posit configurations with es = 0 have used the fast sigmoid function, while those with es > 0 and the
floating-point formats have used the regular sigmoid. The weights and biases are randomly initialized, and the Mean
Squared Error (MSE) has been used as the loss function to compare the outcomes of the network. Training runs for
2500 epochs in order to compare the losses of the different formats throughout the whole process. In this manner it
is possible to check whether the network converges and how fast. Table 3 and Figure 5 depict the obtained results.
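To make the fast sigmoid used with es = 0 concrete, the following is a minimal sketch for Posit〈8, 0〉 based on the bit-level trick described by Gustafson and Yonemoto [6]: flip the sign bit and shift the pattern right by two positions. The function names `decode_posit8` and `fast_sigmoid_bits` are hypothetical helpers written for this illustration; they are not the PySigmoid API and not the code used for the experiments.

```python
import math

def decode_posit8(bits: int) -> float:
    """Decode a Posit<8,0> bit pattern (0..255) into a Python float."""
    bits &= 0xFF
    if bits == 0x00:
        return 0.0
    if bits == 0x80:
        return math.nan  # NaR (Not a Real)
    sign = -1.0 if bits & 0x80 else 1.0
    if bits & 0x80:
        bits = (-bits) & 0xFF  # two's complement yields the magnitude
    body = [(bits >> i) & 1 for i in range(6, -1, -1)]  # 7 bits after the sign
    first = body[0]
    run = 1
    while run < 7 and body[run] == first:
        run += 1
    k = run - 1 if first == 1 else -run  # regime value
    frac = 0.0
    for i, b in enumerate(body[run + 1:], start=1):  # skip the terminator bit
        frac += b * 2.0 ** -i
    return sign * 2.0 ** k * (1.0 + frac)  # es = 0: no exponent field

def fast_sigmoid_bits(bits: int) -> int:
    """Approximate sigmoid on the bit pattern: flip the sign bit, shift right by 2."""
    return ((bits ^ 0x80) & 0xFF) >> 2

# Compare against the exact sigmoid over every real-valued pattern.
worst = 0.0
for bits in range(256):
    if bits == 0x80:  # NaR has no real value
        continue
    x = decode_posit8(bits)
    approx = decode_posit8(fast_sigmoid_bits(bits))
    exact = 1.0 / (1.0 + math.exp(-x))
    worst = max(worst, abs(approx - exact))

print("sigmoid(0) ~", decode_posit8(fast_sigmoid_bits(0x00)))  # pattern 0x00 encodes 0
print("sigmoid(1) ~", decode_posit8(fast_sigmoid_bits(0x40)))  # pattern 0x40 encodes 1
print("largest absolute error over all patterns:", worst)
```

The approximation needs only an XOR and a shift, which is why it is attractive for hardware; the price is a piecewise approximation whose error the sweep above reports.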
                                        Epochs
Configuration    0        250      500      750      1000     1250
32-bit Float     0.3701   0.2346   0.1726   0.0839   0.0023   0.0010
64-bit Float     0.3701   0.2346   0.1727   0.1124   0.0023   0.0010
Posit〈8, 0〉      0.3681   0.1882   0.1491   0.1530   0.1530   0.1530
Posit〈10, 0〉     0.3653   0.2129   0.1359   0.0938   0.1478   0.1264
Posit〈12, 0〉     0.3650   0.2467   0.1758   0.1684   0.0140   0.0081
Posit〈16, 0〉     0.3648   0.2817   0.1716   0.1622   0.0645   0.0035
Posit〈16, 1〉     0.3337   0.1772   0.1453   0.0440   0.0019   0.0011
Posit〈32, 2〉     0.3337   0.1758   0.1658   0.0328   0.0017   0.0009
Table 3: Loss function during the NN training.
As can be seen, there is almost no difference between single- and double-precision floats. Both Posit〈32, 2〉 and
Posit〈16, 1〉 present the same behavior as floats, even with a lower MSE during the first epochs. The es = 0 posit
configurations with 16, 14 and 12 bits take some extra epochs to converge, and only the configurations with 10 and
8 bits present an irregular convergence. In these cases, the lack of underflow is undermining the convergence of
posits. In fact, this type of behavior has already appeared in a Newton-Raphson study presented in [40].
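As background for the remark on underflow, the sketch below tabulates the smallest and largest positive magnitudes of the configurations in Table 3, using the standard posit definitions useed = 2^(2^es), maxpos = useed^(n-2) and minpos = 1/maxpos. The function name `posit_range` is a hypothetical helper, and the sketch only lists these limits; how they interact with the training dynamics is not analyzed here.

```python
from fractions import Fraction

def posit_range(n: int, es: int):
    """Return (minpos, maxpos) of Posit<n, es> as exact fractions."""
    useed = 2 ** (2 ** es)
    maxpos = Fraction(useed) ** (n - 2)
    minpos = 1 / maxpos
    return minpos, maxpos

configs = [(8, 0), (10, 0), (12, 0), (16, 0), (16, 1), (32, 2)]
for n, es in configs:
    minpos, maxpos = posit_range(n, es)
    # Both limits are powers of two, so report them by their exponent.
    exponent = (n - 2) * 2 ** es
    assert maxpos == 2 ** exponent and minpos == Fraction(1, 2 ** exponent)
    print(f"Posit<{n},{es}>: minpos = 2^-{exponent}, maxpos = 2^{exponent}")
```

Since a posit never rounds a nonzero value to zero, any nonzero quantity smaller than minpos is represented as minpos. For Posit〈8, 0〉 this floor is 2^-6, which is large compared with the smallest loss values reached by the wider formats in Table 3.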
[Figure 5 plot: Mean Squared Error (y-axis, 0.00 to 0.35) versus Epochs (x-axis, 0 to 2500). Series: 32-bit Float,
64-bit Float, Posit〈8, 0〉, Posit〈10, 0〉, Posit〈12, 0〉, Posit〈14, 0〉, Posit〈16, 0〉, Posit〈16, 1〉, Posit〈32, 2〉.]
Figure 5: Loss function along the NN training.
Therefore, it can be concluded that posits converge as well as floats, even for some short formats that employ the
fast sigmoid approach. Although the proposed NN is a small example, these results point in a good direction for
training more complex CNNs.
5.3 Neural Networks Inference
The performance of the Posit〈8, 0〉 format is evaluated on two datasets, MNIST and CIFAR-10, run on the LeNet-5
architecture [41]. In this case, the networks are first trained with floating-point arithmetic and then the weights
are converted to posit format prior to the inference stage. The networks have been trained using the Keras [42] and
TensorFlow [43] frameworks. The posit computations during inference have been simulated with the help of a version
of the NumPy library that includes a posit data type [44]. With this library, computations are much faster than with
the PySigmoid library, but it does not provide a fast sigmoid implementation, so the simpler ReLU is used as the
activation function instead. The obtained results are shown in Table 4.
             32-bit Float         Posit〈8, 0〉          Posit〈8, 0〉 (only addition)
Dataset      Top-1     Top-5      Top-1     Top-5      Top-1     Top-5
MNIST        99.22%    100.00%    99.32%    99.94%     99.40%    100%
CIFAR-10     68.04%    96.47%     56.11%    92.42%     58.92%    95.62%
Table 4: Performance on CNNs inference.
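The conversion of trained floating-point weights to Posit〈8, 0〉 can be illustrated with the minimal sketch below. It maps each weight to the nearest representable posit value by exhaustive search over the 255 real-valued bit patterns, and it never maps a nonzero weight to zero. The names `decode_posit8` and `quantize` are hypothetical, and tie-breaking here is arbitrary rather than the round-to-nearest-even rule of the posit format, so this is not the conversion routine of the library in [44].

```python
import math

def decode_posit8(bits: int) -> float:
    """Decode a Posit<8,0> bit pattern (0..255) into a Python float."""
    bits &= 0xFF
    if bits == 0x00:
        return 0.0
    if bits == 0x80:
        return math.nan  # NaR (Not a Real)
    sign = -1.0 if bits & 0x80 else 1.0
    if bits & 0x80:
        bits = (-bits) & 0xFF  # two's complement yields the magnitude
    body = [(bits >> i) & 1 for i in range(6, -1, -1)]  # 7 bits after the sign
    first = body[0]
    run = 1
    while run < 7 and body[run] == first:
        run += 1
    k = run - 1 if first == 1 else -run  # regime value
    frac = 0.0
    for i, b in enumerate(body[run + 1:], start=1):  # skip the terminator bit
        frac += b * 2.0 ** -i
    return sign * 2.0 ** k * (1.0 + frac)

# Every real value representable in Posit<8,0>, NaR excluded.
VALUES = sorted(decode_posit8(b) for b in range(256) if b != 0x80)
NONZERO = [v for v in VALUES if v != 0.0]

def quantize(x: float) -> float:
    """Map x to the nearest Posit<8,0> value; nonzero inputs stay nonzero."""
    if x == 0.0:
        return 0.0
    # Out-of-range inputs saturate at +/-maxpos because it is the nearest value.
    return min(NONZERO, key=lambda v: abs(v - x))

weights = [0.3, -0.07, 1.0, 100.0, 0.001]  # made-up example weights
for w in weights:
    print(f"{w:>8} -> {quantize(w)}")
```

Each converted weight occupies one byte instead of the four bytes of a single-precision float.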
As can be observed, on MNIST posits obtain a slightly higher Top-1 accuracy than single-precision floats (99.32% vs.
99.22%) and an almost identical Top-5 accuracy (99.94% vs. 100.00%). On the other hand, with a more complex dataset
such as CIFAR-10, there is a loss of about 12 percentage points in Top-1 and 4 in Top-5. In order to mitigate this, a
hybrid posit-float architecture has been considered. Following the idea described in [23], only the additions are
performed in posit format. Under this scenario the accuracies are higher, but floating point is still superior on
CIFAR-10. Finally, it must be emphasized that the posits employed in these tests are Posit〈8, 0〉, which, in exchange
for the accuracy loss, reduce the memory footprint to a quarter of that of single precision, as well as the
complexity of the functional units.
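For reference, the Top-1 and Top-5 figures in Table 4 follow the usual top-k definition: a sample counts as correct when its true label is among the k classes with the highest scores. A minimal sketch with made-up scores (the function name `top_k_accuracy` and the toy data are assumptions for illustration, not the evaluation code used here):

```python
def top_k_accuracy(scores, labels, k):
    """Fraction of samples whose true label is among the k highest-scoring classes."""
    hits = 0
    for row, label in zip(scores, labels):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        if label in ranked[:k]:
            hits += 1
    return hits / len(labels)

# Toy example: 3 samples, 4 classes (values are invented).
scores = [
    [0.1, 0.7, 0.2, 0.0],
    [0.5, 0.1, 0.3, 0.1],
    [0.2, 0.3, 0.4, 0.1],
]
labels = [1, 2, 0]

for k in (1, 2, 3):
    print(f"top-{k} accuracy: {top_k_accuracy(scores, labels, k):.3f}")
```

Because top-k only depends on the ranking of the scores, a low-precision format can lose Top-1 accuracy when close scores swap order while still keeping the true label within the top five.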
6 Conclusions
In this work, an algorithm for multiplying two posit numbers has been presented. The algorithm is generic for any
Posit〈n, es〉 configuration and it has been integrated into the FloPoCo framework. Furthermore,
this multiplication algorithm, together with other posit operations, has been employed in the neural networks scenario
for performing training and inference, obtaining promising trade-off results.
In the future, further studies are needed in order to use posits to train larger NNs, such as CNNs, and to perform
inference with higher accuracies. A possible direction is to combine different formats, as proposed in [23],
embracing the transprecision concept.
References
[1] IEEE Computer Society Standards Committee and American National Standards Institute. IEEE standard for binary
floating-point arithmetic. ANSI/IEEE Std 754-1985, 1985.
[2] IEEE standard for floating-point arithmetic. IEEE Std 754-2008, pages 1–70, 2008.
[3] John L. Gustafson. The End of Error: Unum Computing, volume 24. CRC Press.
[4] William Kahan and Joseph D. Darcy. How Java’s floating-point hurts everyone everywhere. In ACM 1998 workshop
on Java for High–Performance Network Computing, pages 1–81. Stanford University, 1998.
[5] Manish Kumar Jaiswal and Hayden K.-H So. Architecture generator for type-3 unum posit adder/subtractor. In
2018 IEEE International Symposium on Circuits and Systems (ISCAS). IEEE, 05 2018.
[6] John L. Gustafson and Isaac T. Yonemoto. Beating floating point at its own game: Posit arithmetic. Supercomputing
Frontiers and Innovations, 4(2):71–86, 06 2017.
[7] John L. Gustafson. Posit arithmetic.
[8] John L. Gustafson. A radical approach to computation with real numbers. Supercomputing Frontiers and
Innovations, 3(2):38–53, 09 2016.
[9] Florent de Dinechin, Luc Forget, Jean-Michel Muller, and Yohann Uguen. Posits: The good, the bad and the ugly.
In Proceedings of the Conference for Next Generation Arithmetic 2019, CoNGA’19, pages 6:1–6:10, 2019.
[10] Yohann Uguen, Luc Forget, and Florent de Dinechin. Evaluating the hardware cost of the posit number system.
working paper or preprint, May 2019.
[11] Min Soo Kim, Alberto A. Del Barrio, Roman Hermida, and Nader Bagherzadeh. Low-power implementation of
Mitchell’s approximate logarithmic multiplication for convolutional neural networks. In 2018 23rd Asia and South
Pacific Design Automation Conference (ASP-DAC), pages 617–622. IEEE, 1 2018.
[12] Min Soo Kim, Alberto A. Del Barrio, Leonardo Tavares Oliveira, Roman Hermida, and Nader Bagherzadeh.
Efficient Mitchell’s approximate log multipliers for convolutional neural networks. IEEE Transactions on
Computers, 68(5):660–675, 05 2019.
[13] Leonardo Tavares Oliveira, Min Soo Kim, Alberto Antonio Del Barrio, Nader Bagherzadeh, and Ricardo Menotti.
Design of power-efficient FPGA convolutional cores with approximate log multiplier. In European Symposium on
Artificial Neural Networks, Computational Intelligence and Machine Learning, pages 203–208, 2019.
[14] Robert Munafo. Survey of floating-point formats, 2018.
[15] Alberto A. Del Barrio, Roman Hermida, and Seda Ogrenci-Memik. A combined arithmetic-high-level synthesis
solution to deploy partial carry-save radix-8 booth multipliers in datapaths. IEEE Transactions on Circuits and
Systems I: Regular Papers, 66(2):742–755, 02 2019.
[16] Alberto A. Del Barrio and Román Hermida. A slack-based approach to efficiently deploy radix 8 booth multipliers.
In Proceedings of the Conference on Design, Automation & Test in Europe, pages 1153–1158, 2017.
[17] Florent de Dinechin and Bogdan Pasca. Designing custom arithmetic data paths with FloPoCo. IEEE Design &
Test of Computers, 28(4):18–27, 07 2011.
[18] Manish Kumar Jaiswal and Hayden K.-H So. Universal number posit arithmetic generator on FPGA. In 2018
Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 03 2018.
[19] Laurens van Dam. Enabling high performance posit arithmetic applications using hardware acceleration. Master’s
thesis, Delft University of Technology, the Netherlands.
[20] Rohit Chaurasiya, John Gustafson, Rahul Shrestha, Jonathan Neudorfer, Sangeeth Nambiar, Kaustav Niyogi,
Farhad Merchant, and Rainer Leupers. Parameterized posit arithmetic hardware generator. In 2018 IEEE 36th
International Conference on Computer Design (ICCD). IEEE, 10 2018.
[21] M. K. Jaiswal and H. K.-H. So. PACoGen: A hardware posit arithmetic core generator. IEEE Access, 7:74586–74601,
2019.
[22] A. Podobas and S. Matsuoka. Hardware implementation of posits and their application in FPGAs. In 2018 IEEE
International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 138–145, 2018.
[23] Jeff Johnson. Rethinking floating point for deep learning.
[24] Z. Carmichael, H. F. Langroudi, C. Khazanov, J. Lillie, J. L. Gustafson, and D. Kudithipudi. Deep positron: A
deep neural network using the posit number system. In 2019 Design, Automation Test in Europe Conference
Exhibition (DATE), pages 1421–1426, 2019.
[25] Zachariah Carmichael, Hamed F. Langroudi, Char Khazanov, Jeffrey Lillie, John L. Gustafson, and Dhireesha
Kudithipudi. Performance-efficiency trade-off of low-precision numerical formats in deep neural networks. In
Proceedings of the Conference for Next Generation Arithmetic 2019, CoNGA’19, pages 3:1–3:9, 2019.
[26] Hamed F. Langroudi, Zachariah Carmichael, John L. Gustafson, and Dhireesha Kudithipudi. PositNN: Tapered
precision deep learning inference for the edge. 2018.
[27] Posit Working Group. Posit standard documentation.
[28] Jianyu Chen, Zaid Al-Ars, and H. Peter Hofstee. A matrix-multiply unit for posits in reconfigurable logic
leveraging (Open)CAPI. In Proceedings of the Conference for Next Generation Arithmetic, CoNGA ’18, pages
1:1–1:5, New York, NY, USA, 2018. ACM.
[29] Israel Koren. Computer Arithmetic Algorithms. Prentice-Hall, Inc., Englewood Cliffs, NJ, USA, 1993.
[30] Ken Mercado. PySigmoid, 2017.
[31] Yunji Chen, Tao Luo, Shaoli Liu, Shijin Zhang, Liqiang He, Jia Wang, Ling Li, Tianshi Chen, Zhiwei Xu, Ninghui
Sun, and Olivier Temam. DaDianNao: A machine-learning supercomputer. In 2014 47th Annual IEEE/ACM
International Symposium on Microarchitecture, pages 609–622. IEEE, 12 2014.
[32] Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited
numerical precision. CoRR, abs/1502.02551.
[33] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Low precision arithmetic for deep learning. CoRR,
abs/1412.7024.
[34] Andres Rodriguez, Eden Segal, Etay Meiri, Evarist Fomenko, Y Jim Kim, Haihao Shen, and Barukh Ziv. Lower
numerical precision deep learning inference and training. Intel White Paper, 2018.
[35] Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and
Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference.
In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2704–2713.
[36] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks:
Training neural networks with low precision weights and activations. The Journal of Machine Learning Research,
18(1):6869–6898.
[37] Marco Cococcioni, Emanuele Ruffaldi, and Sergio Saponara. Exploiting posit arithmetic for deep neural networks
in autonomous driving applications. In 2018 International Conference of Electrical and Electronic Technologies
for Automotive, pages 1–6. IEEE, 07 2018.
[38] Wolfram Research. Wolfram Mathematica. http://www.wolfram.com/mathematica/, 2019. [Online;
accessed 01-July-2019].
[39] Tom Feist. Vivado design suite. White Paper, 2012.
[40] Stan van der Linde. Posits als vervanging van floating-points: Een vergelijking van unum type iii posits met ieee
754 floating points met mathematica en python.
[41] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition.
Proceedings of the IEEE, 86(11):2278–2324, 1998.
[42] François Chollet et al. Keras, 2015.
[43] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy
Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael
Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat
Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever,
Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden,
Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on
heterogeneous systems, 2015. Software available from tensorflow.org.
[44] SpeedGo Computing. NumPy (on top of SoftPosit), 2018.
