Scalar Arithmetic Multiple Data: Customizable Precision for Deep Neural
  Networks by Anderson, Andrew & Gregg, David
Andrew Anderson
School of Computer Science and Statistics
Trinity College Dublin
Dublin, Ireland
andersan@cs.tcd.ie
David Gregg
School of Computer Science and Statistics
Trinity College Dublin
Dublin, Ireland
dgregg@cs.tcd.ie
Abstract
Quantization of weights and activations in Deep Neural Networks
(DNNs) is a powerful technique for network compression, and has
enjoyed significant attention and success. However, much of the
inference-time benefit of quantization is accessible only through
the use of customized hardware accelerators or by providing an FPGA
implementation of quantized arithmetic.
Building on prior work, we show how to construct arbitrary
bit-precise signed and unsigned integer operations using a software
technique which logically embeds a vector architecture with custom
bit-width lanes in universally available fixed-width scalar
arithmetic.
We evaluate our approach on a high-end Intel Haswell processor, and
an embedded ARM processor. Our approach yields very fast
implementations of bit-precise custom DNN operations, which often
match or exceed the performance of operations quantized to the sizes
supported in native arithmetic. At the strongest level of
quantization, our approach yields a maximum speedup of ∼6× on the
Intel platform, and ∼10× on the ARM platform versus quantization to
native 8-bit integers.
1 Motivation
Quantizing weights and activations in DNNs can reduce (1) the size
of data, which reduces the required memory footprint, and (2) memory
traffic, which consumes execution time and energy [11]. Performing
inference in reduced bit precision also offers the possibility of
decreased execution time. Quantization has been shown to be
extraordinarily effective for DNNs, and many networks can function
correctly with highly aggressive quantization [20].
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are not
made or distributed for profit or commercial advantage and that copies bear
this notice and the full citation on the first page. Copyrights for components
of this work owned by others than ACM must be honored. Abstracting
with credit is permitted. To copy otherwise, or republish, to post on servers
or to redistribute to lists, requires prior specific permission and/or a fee.
Request permissions from permissions@acm.org.
Conference’17, Washington, DC, USA
© 2016 ACM. 978-x-xxxx-xxxx-x/YY/MM. . .$15.00
DOI: 10.1145/nnnnnnn.nnnnnnn
Custom hardware accelerators [12, 13, 15] and FPGAs [8] can exploit
this reduction in data precision to reduce the area and power
required for the implementation of the corresponding arithmetic, and
also to increase throughput.
In contrast, conventional CPU/GPU microarchitectures typically
provide native support for just a few fixed levels of precision,
such as 8-, 16-, 32- and 64-bit integer values and 32- and 64-bit
floating point. The result is that software is seldom able to take
full advantage of relaxed precision requirements. For example,
Ristretto [10] can quantize DNN activations to signed 6-bit
precision, but no machine we are aware of has native integers with
this bitwidth.
In this paper we show that bit-level custom precision on commodity
hardware is not only possible, but highly efficient for deep
convolutional neural networks (CNNs). We extend prior research to
embed a SIMD vector architecture with custom bit-precise integer
vector lanes in native fixed-width scalar arithmetic. Using this
embedded architecture, we can implement a wide range of operations
over arrays of bit-precise signed and unsigned integer types.
However, merely emulating the lane-wise operations of existing
vector architectures neglects the true potential of custom precision
in software. Instead, we argue that the greatest opportunities for
efficient custom precision in software arise from defining entirely
new operations that do not necessarily correspond to existing SIMD
vector instructions. These new operations are not easy to find
because they depend on novel insights into the sub-structure of
existing machine instructions. Nonetheless, we provide an example of
one such novel operation: a custom bit-level precise operation which
computes the 1D convolution of two input vectors, based on the wide
integer multiplier found in general-purpose processors.
While the performance benefits are attractive, writing this kind of
code by hand is difficult and error prone. We implement the
technique in a domain-specific code generator, which synthesizes a
library of efficient C code implementations for bit-precise DNN
operations.
In summary, we make the following contributions:
• We demonstrate the feasibility of supporting bit-level
precise arithmetic for CNNs in software.
• We extend prior work to provide implementations of
the operations necessary for CNN inference.
arXiv:1809.10572v1 [cs.PF] 27 Sep 2018
Conference’17, July 2017, Washington, DC, USA A. Anderson and D. Gregg
• We propose a novel custom bit-width convolution
operation that maps to native scalar multiplication.
• We show that signed values are a particular imple-
mentation challenge, and present a novel solution.
• We evaluate our proposed operations using realis-
tic, published quantization scenarios and on multiple
hardware platforms, including embedded/IoT class
devices.
2 Background
Vector computer architectures are among the most successful parallel
computers for a wide variety of applications. In the 1970s
supercomputers such as the Cray-1 and Cray X-MP used deeply-pipelined
vector floating point units for performance on scientific
applications. From the mid-1990s vector units started to appear in
general-purpose processors, such as Intel MMX (1996) and SSE (1999)
and PowerPC AltiVec (1997). Vector processors are single-instruction
multiple data (SIMD) parallel computers, where a single instruction
operates on multiple data points in parallel.
Modern vector processors have vector registers that contain a fixed
number of lanes, each of which contains a scalar value. Vector
instructions operate on all lanes in parallel, using either pipelined
arithmetic units (a vector pipeline architecture) or using multiple
parallel arithmetic units (a vector SIMD architecture [16]). It is
worth noting that both vector SIMD and vector pipeline architectures
fit within the SIMD category of Flynn’s taxonomy [5].
In 1997 Fisher and Dietz proposed a slightly different
classification of vector architectures [4]. They coined the term
SIMD within a register (SWAR) to encompass both the then-emerging
vector SIMD architectures and another approach to vector parallelism
that is less well known. This latter approach is a software-only
emulation of vector computing that is implemented with scalar
instructions. Scalar registers are divided into notional sub-words,
and scalar instructions are used to operate on multiple sub-words in
parallel.
Fisher and Dietz went on to create a SWAR programming model and
compiler that could target either conventional vector instructions
or a software emulation of vectors using scalar instructions [3].
Unfortunately, Fisher and Dietz overloaded the term SWAR to include
three quite separate ideas: (1) all hardware vector SIMD
architectures, (2) their approach to software emulation of vector
architectures, and (3) their high-level programming model and
compiler. As a result, despite doing much pioneering work in the
field, they left no separate term for software emulation of vector
architectures. In the absence of an existing term, we refer to this
software vector approach as Scalar Arithmetic with Multiple Data
(SAMD).
2.1 Emulating Vector Architectures
On a 64-bit 8-way vector architecture, 64-bit vector registers can
be divided into eight lanes, each containing a single unsigned byte.
Vector arithmetic instructions can be used to perform operations
such as lane-wise addition.
It is equally possible to pack the same eight bytes into a 64-bit
scalar register, but applying scalar addition will not give a
correct result (Figure 1). The problem is that scalar arithmetic
assumes that all bytes are parts of a single integer value.
  0010 1110 1001 0101    uint16_t, read as 4 x uint4_t
+ 0101 1101 0010 1011
     1    0    1         carry out of each lane into its left neighbour
Figure 1. Correct (0) and incorrect (1) carries in scalar addition on 4-bit data
when packed in a 16-bit data word. Incorrect carries corrupt neighbouring
packed values.
Scalar addition computes carries across the entire 64-bit word. In
contrast, vector operations break the carry-chain at the boundary of
each lane. To emulate vector arithmetic using scalar arithmetic,
some software mechanism is needed to sever the carries at vector
lane boundaries. Fisher and Dietz proposed spacer bits to solve this
problem.
uint64_t unsigned_add_spacer(uint64_t a, uint64_t b, int bits)
{
    // create mask with zeros in spacer bits
    uint64_t mask = ~MSB_LANE_MASK(bits+1);
    // clear the spacer bits in each input
    uint64_t a_clear = a & mask;
    uint64_t b_clear = b & mask;
    // add the numbers
    return a_clear + b_clear;
}
Figure 2. Addition of unsigned SAMD vectors using one permanent spacer
bit between each lane
Spacer bits consist of one or more bits placed between vector lanes
to catch carries from the neighbouring lane. In the example in
Figure 2 a single spacer bit is added between each lane, and as a
result only seven one-byte values fit within the 64-bit word
alongside the spacer bits. To perform unsigned vector addition using
unsigned scalar addition, the spacer bits must first be set to zero.
We define a 64-bit mask that contains the value zero in each spacer
bit, and one in all other bits. We set the spacer bits in the inputs
a and b to zero, and then add the two resulting values. Any
overflows at lane boundaries are caught in these spacer bits. We
refer to these as permanent spacer bits because they persist
throughout the lifetime of the value.
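The packing described above can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names (pack7, value_mask7, add_spacer7, lane9) and the fixed layout of seven 8-bit values at a 9-bit stride are assumptions made for this example.

```c
#include <stdint.h>

/* Hypothetical layout for illustration: 8 value bits + 1 spacer bit per lane. */
enum { VALUE_BITS = 8, STRIDE = VALUE_BITS + 1, LANES = 7 };

/* Pack seven bytes into 9-bit lanes; the top bit of every lane is a spacer. */
static uint64_t pack7(const uint8_t v[LANES])
{
    uint64_t word = 0;
    for (int i = 0; i < LANES; i++)
        word |= (uint64_t)v[i] << (i * STRIDE);
    return word;
}

/* Mask with ones in the value bits and zeros in the spacer bits. */
static uint64_t value_mask7(void)
{
    uint64_t mask = 0;
    for (int i = 0; i < LANES; i++)
        mask |= (uint64_t)0xFF << (i * STRIDE);
    return mask;
}

/* Lane-wise add: clear the spacers, then use a single scalar addition. */
static uint64_t add_spacer7(uint64_t a, uint64_t b)
{
    uint64_t mask = value_mask7();
    return (a & mask) + (b & mask);
}

/* Read back lane i, including the carry caught in its spacer bit. */
static unsigned lane9(uint64_t word, int i)
{
    return (unsigned)((word >> (i * STRIDE)) & 0x1FF);
}
```

With this layout, adding lanes holding 200 and 100 leaves 300 in the 9-bit lane: the low eight bits hold 44 and the spacer bit holds the carry, so neighbouring lanes are untouched.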
Note that throughout this paper we use many bitwise masks to isolate
particular lanes or bits within a word. The particular value of the
mask typically depends on the lane-width and the presence or absence
of spacer bits. To describe these masks clearly, we introduce a
small function as shown in Figure 3.
SAMD: Custom Precision for DNNs Conference’17, July 2017, Washington, DC, USA
uint64_t build_mask(int start, int width, int stride)
{
    // create a mask of width 1's (64-bit constant, so the shift
    // is not performed in int)
    uint64_t sub_mask = ((uint64_t)1 << width) - 1;
    uint64_t mask = 0;
    // lay down the sub_mask at intervals of stride
    for (int i = start; i < (int)(sizeof(uint64_t) * 8); i += stride) {
        mask = mask | (sub_mask << i);
    }
    return mask;
}
#define MSB_LANE_MASK(w) build_mask((w) - 1, 1, (w))
#define LSB_LANE_MASK(w) build_mask(0, 1, (w))
#define ODD_LANE_MASK(w) build_mask((w), (w), 2 * (w))
#define EVEN_LANE_MASK(w) build_mask(0, (w), 2 * (w))
Figure 3. Construct a mask for clearing bits and lanes in SAMD computation,
and examples of masks used throughout the paper.
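To make the mask definitions concrete, the sketch below instantiates them for 4-bit lanes and for one non-native width (6 bits). The function lane_mask follows the construction of Figure 3; the wrapper names are hypothetical and exist only for this example.

```c
#include <stdint.h>

/* Same construction as build_mask in Figure 3, using a 64-bit constant
   so that the shift is performed in 64-bit arithmetic. */
static uint64_t lane_mask(int start, int width, int stride)
{
    uint64_t sub_mask = (UINT64_C(1) << width) - 1;
    uint64_t mask = 0;
    for (int i = start; i < 64; i += stride)
        mask |= sub_mask << i;
    return mask;
}

/* The four masks of Figure 3 for 4-bit lanes. */
static uint64_t msb_mask4(void)  { return lane_mask(3, 1, 4); } /* 0x8888888888888888 */
static uint64_t lsb_mask4(void)  { return lane_mask(0, 1, 4); } /* 0x1111111111111111 */
static uint64_t odd_mask4(void)  { return lane_mask(4, 4, 8); } /* 0xF0F0F0F0F0F0F0F0 */
static uint64_t even_mask4(void) { return lane_mask(0, 4, 8); } /* 0x0F0F0F0F0F0F0F0F */

/* MSB mask for 6-bit lanes: bits 5, 11, ..., 59 (0x0820820820820820). */
static uint64_t msb_mask6(void)  { return lane_mask(5, 1, 6); }
```

The 6-bit case shows that nothing in the construction depends on the lane width being a power of two; a 64-bit word simply holds ten complete 6-bit lanes with four bits left over.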
2.2 Temporary Spacer Bits
A downside of permanent spacer bits is that they occupy space within
the vector word. Fisher and Dietz proposed virtual spacer bits to
prevent overflow between lanes without permanent spacer bits. A
virtual spacer bit is a short-lived spacer bit that is introduced
within a routine for operating on SAMD vectors and is eliminated
before the completion of the routine. We refer to these as temporary
spacer bits.
Figure 4 shows unsigned addition using this approach, computing the
correct answer for the addition from Figure 1. The most significant
bit of each lane is masked to zero and acts as a temporary spacer
bit. The scalar addition is performed on 3-bit values, and any
overflow is caught in the temporary spacer bits.
To get a full 4-bit result in each lane, it is necessary to replace
the temporary spacer bit with the correct value of the most
significant bit of the addition. Fortunately, one-bit addition can
be computed with bitwise xor of the most significant bit of each of
the two inputs and of the carry bit that is stored in the temporary
spacer bit.
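X010X110X001X101    first input, MSB of each lane masked
X101X101X010X011    second input, MSB of each lane masked
0111101100111000    masked add
0XXX1XXX1XXX0XXX    MSBs of the first input
0XXX1XXX0XXX1XXX    MSBs of the second input
0XXX0XXX1XXX1XXX    xor of the MSBs
0111101110110000    xor of the masked add with the xor of the MSBs
Figure 4. Dataflow in SAMD addition with temporary spacer bits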
Temporary spacer bits remove the need for persistent spacer bits at
the cost of some additional computation. This greatly simplifies the
task of emulating operations found in hardware vector instruction
sets such as Intel MMX or SSE. Thus, we can define a common set of
vector operations that might be implemented in hardware on some
machines or emulated in software on others, and guarantee that the
same functionality is available on all machines. This idea of a
common set of vector operations across all target machines was an
important part of the wider SWAR vision of simplifying vector
programming.
uint64_t unsigned_add(uint64_t a, uint64_t b, int bits)
{
    // create mask containing 1 in MSB of each lane
    uint64_t mask = MSB_LANE_MASK(bits);
    // compute the bitwise sum of MSBs of each lane
    uint64_t msb = (a ^ b) & mask;
    // do addition on bits-1 bits per lane
    uint64_t sum = (a & ~mask) + (b & ~mask);
    // add MSB sum without overflow into next lane
    return msb ^ sum;
}
Figure 5. C code that performs addition of signed or unsigned SAMD
vectors using temporary spacer bits
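The following sketch applies the routine of Figure 5 to the operands of Figure 1 (0x2E95 and 0x5D2B, read as four 4-bit lanes) and compares it with a lane-by-lane reference. The names lanes_add, lanes_add_ref and msb_lane_mask are hypothetical, and the reference assumes the lane width divides 64 and is smaller than 64.

```c
#include <stdint.h>

/* Mask with a 1 in the most significant bit of every `bits`-wide lane. */
static uint64_t msb_lane_mask(int bits)
{
    uint64_t mask = 0;
    for (int i = bits - 1; i < 64; i += bits)
        mask |= UINT64_C(1) << i;
    return mask;
}

/* Lane-wise addition with temporary spacer bits (as in Figure 5). */
static uint64_t lanes_add(uint64_t a, uint64_t b, int bits)
{
    uint64_t mask = msb_lane_mask(bits);
    uint64_t msb  = (a ^ b) & mask;             /* carry-less sum of the MSBs */
    uint64_t sum  = (a & ~mask) + (b & ~mask);  /* carries stop at cleared MSBs */
    return msb ^ sum;
}

/* Reference: add each lane separately and wrap to `bits` bits.
   Assumes bits divides 64 and bits < 64. */
static uint64_t lanes_add_ref(uint64_t a, uint64_t b, int bits)
{
    uint64_t lane = (UINT64_C(1) << bits) - 1;
    uint64_t out = 0;
    for (int i = 0; i < 64; i += bits)
        out |= (((a >> i) + (b >> i)) & lane) << i;
    return out;
}
```

For the Figure 1 operands, plain scalar addition gives 0x8BC0, whereas lanes_add(0x2E95, 0x5D2B, 4) gives 0x7BB0, which is the lane-wise sum (7, 11, 11, 0) with each lane wrapped to four bits.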
2.3 Our Thesis
In this paper we argue that emulating existing vector instruction
sets is exactly the wrong use of software techniques to perform
vector arithmetic with scalar instructions. Vector instruction sets
are now ubiquitous, and it is difficult for SAMD techniques to
compete when performing the same operations.
On the contrary, we argue that SAMD is most effective when it is
used for operations that are not already supported by vector
instructions. In particular we argue that SAMD is ideal for
customized integer data precision in approximate computing
applications. Furthermore, as we show in Section 5, SAMD can
sometimes exploit the structure of scalar arithmetic to perform
multiple vector operations using a single scalar instruction. To
demonstrate our point, we first review examples of SAMD operations
that emulate existing vector instructions.
3 Conventional SAMD Arithmetic
Fisher and Dietz published only a handful of their SAMD operations,
but the source code of Fisher’s compiler for the SWARC language¹
contains many more. Figure 6 shows simplified versions of SAMD
operations for integer add and subtract. These operations are
correct for both signed and unsigned values, and can be used for
lane-widths with any fixed number of bits.
uint64_t samd_add(uint64_t a, uint64_t b, int bits)
{
    uint64_t mask = MSB_LANE_MASK(bits);
    uint64_t msb = (a ^ b) & mask;
    uint64_t sum = (a & ~mask) + (b & ~mask);
    return msb ^ sum;
}
uint64_t samd_sub(uint64_t a, uint64_t b, int bits)
{
    uint64_t mask = MSB_LANE_MASK(bits);
    uint64_t msb = (a ^ b) & mask;
    uint64_t diff = (a | mask) - (b & ~mask);
    return msb ^ diff ^ mask;
}
Figure 6. Addition and subtraction of SAMD vectors
Unfortunately there is no efficient way to compute lane-wise
multiplication using scalar multiplication instructions. Unlike
addition with its single linear carry-chain which can
1hp://aggregate.org/rsher/Research/Scc/Scc.html
be broken using bitwise operations, multiplication has many carry
chains. The code generator in Fisher’s SWARC compiler can generate a
sequence of bitwise shift, mask and add instructions that are
capable of performing lane-wise multiplication. Their instruction
sequence is complex and requires O(b) iterations of a shift and add
algorithm, where b is the lane-width. Further, each iteration uses
O(log2(b)) operations to build a mask, for a total of O(b log2(b))
operations to complete lane-wise SAMD multiplication.
In Figure 7 we present an improved lane-wise SAMD multiplication
sequence that requires just O(b) operations for SAMD multiplication.
Our approach builds the write mask for the multiplication in a
constant number of steps. We rely on the observation that we can
efficiently compute an integer mask with all bits from position stop
up to position start − 1 set to 1, where start > stop; we simply
compute mask = (1 << start) − (1 << stop).
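uint64_t samd_mul(uint64_t a, uint64_t b, int bits)
{
    uint64_t read_mask = LSB_LANE_MASK(bits);
    uint64_t sum = 0;
    for (int i = 0; i < bits; i++) {
        // select bit i of every lane of b
        uint64_t bit = b & read_mask;
        // ones from bit i up to the top of each selected lane:
        // (1 << start) - (1 << stop), for all lanes at once
        uint64_t write_mask = ((bit >> i) << bits) - bit;
        // partial product, truncated to its own lane
        uint64_t to_add = a & write_mask;
        // accumulate the partial product lane-wise
        sum = samd_add(sum, to_add, bits);
        a = a << 1;
        read_mask = read_mask << 1;
    }
    return sum;
}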
Figure 7. Improved lane-wise SAMD multiplication
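As a self-contained check of the shift-and-add scheme, the sketch below implements lane-wise multiplication with a constant-time write mask and compares it against a lane-by-lane reference. The names (mk_mask, lanes_add2, lanes_mul, lanes_mul_ref) are hypothetical, and the sketch assumes the lane width divides 64. The write mask is clipped at the top of each lane so that shifted bits of the multiplicand cannot leak into the neighbouring lane.

```c
#include <stdint.h>

/* Mask builder in the style of Figure 3 (hypothetical name). */
static uint64_t mk_mask(int start, int width, int stride)
{
    uint64_t sub = (UINT64_C(1) << width) - 1, mask = 0;
    for (int i = start; i < 64; i += stride)
        mask |= sub << i;
    return mask;
}

/* Lane-wise addition with temporary spacer bits (Figure 6). */
static uint64_t lanes_add2(uint64_t a, uint64_t b, int bits)
{
    uint64_t mask = mk_mask(bits - 1, 1, bits);
    return ((a ^ b) & mask) ^ ((a & ~mask) + (b & ~mask));
}

/* Lane-wise multiply, keeping the low `bits` bits of each product. */
static uint64_t lanes_mul(uint64_t a, uint64_t b, int bits)
{
    uint64_t read_mask = mk_mask(0, 1, bits);
    uint64_t sum = 0;
    for (int i = 0; i < bits; i++) {
        uint64_t bit = b & read_mask;                      /* bit i of each lane */
        uint64_t write_mask = ((bit >> i) << bits) - bit;  /* bits i..top of lane */
        sum = lanes_add2(sum, a & write_mask, bits);
        a <<= 1;
        read_mask <<= 1;
    }
    return sum;
}

/* Reference: multiply each lane separately. Assumes bits divides 64. */
static uint64_t lanes_mul_ref(uint64_t a, uint64_t b, int bits)
{
    uint64_t lane = (UINT64_C(1) << bits) - 1, out = 0;
    for (int i = 0; i < 64; i += bits)
        out |= ((((a >> i) & lane) * ((b >> i) & lane)) & lane) << i;
    return out;
}
```

For example, with 4-bit lanes, lanes_mul(0x0F37, 0x0226, 4) yields 0x0E6A: 7 × 6 = 42 wraps to 10, 3 × 2 = 6, and 15 × 2 = 30 wraps to 14.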
4 Vector Scale
Although lane-wise vector multiplication of SAMD vectors cannot be
efficiently implemented with a scalar multiply instruction, another
useful operation can be. Key operations of many algorithms such as
matrix multiplication and discrete convolution can be expressed as
vector scale operations, where a vector of values is multiplied by a
single scalar.
The very simplest case of vector scaling operates on unsigned lanes
of b bits, where each lane is separated by b permanent spacer bits.
In this case, vector scaling can be implemented by a scalar unsigned
multiplication as shown in Figure 8. In this example the scalar
value occupies the lowest b bits of the scalar parameter. The vector
contains several b bit values, each separated by b permanent spacer
bits. The product of two b bit numbers contains 2b bits, with the b
bits of overflow stored in the permanent spacer bits.
uint64_t unsigned_scale_b_spacers(uint64_t vec, uint64_t scalar,
                                  int bits)
{
    // clear the spacer bits in the input
    uint64_t mask = ODD_LANE_MASK(bits);
    uint64_t vec_clear = vec & mask;
    // multiply the cleared vector, not the raw input
    return vec_clear * scalar;
}
Figure 8. Unsigned vector scale with b permanent spacer bits
uint64_t vector_scale_wrap(uint64_t vec, uint64_t scalar,
const int bits)
{
// separate odd and even numbered lanes
uint64_t even_mask = EVEN_LANE_MASK(bits);
uint64_t odd_mask = ODD_LANE_MASK(bits);
uint64_t even = vec & even_mask;
uint64_t odd = vec & odd_mask;
// compute product and mask out higher half
even = (even * scalar) & even_mask;
odd = (odd * scalar) & odd_mask;
return even | odd;
}
Figure 9. Vector scale with b temporary spacer bits.
As we have seen for other operations, a weakness of using permanent
spacer bits is that they occupy space in the input and result.
Assuming that we want a b bit result in each lane, we can simply
ignore or mask out the contents of the spacer bits. However, it is
worth noting that the b bit vector scale actually returns a product
of 2b bits: the lower half in the result bits, and the upper half in
the neighbouring spacer bits.
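A small sketch of this full-width behaviour, assuming 3-bit unsigned values stored at a 6-bit stride (3 value bits followed by 3 zero spacer bits). The layout constants and helper names are assumptions made for this example.

```c
#include <stdint.h>

/* Hypothetical layout: 3-bit values, 3 spacer bits, ten lanes per word. */
enum { SCALE_B = 3, SCALE_STRIDE = 2 * SCALE_B, SCALE_LANES = 64 / SCALE_STRIDE };

/* Pack 3-bit values into the low half of each 6-bit lane. */
static uint64_t pack_spaced(const unsigned v[SCALE_LANES])
{
    uint64_t word = 0;
    for (int i = 0; i < SCALE_LANES; i++)
        word |= (uint64_t)(v[i] & ((1u << SCALE_B) - 1)) << (i * SCALE_STRIDE);
    return word;
}

/* Scale every lane by a 3-bit scalar using one scalar multiplication. */
static uint64_t scale_spaced(uint64_t vec, unsigned scalar)
{
    return vec * (uint64_t)(scalar & ((1u << SCALE_B) - 1));
}

/* Read the full 2b-bit product from lane i (value bits plus spacer bits). */
static unsigned product_lane(uint64_t word, int i)
{
    return (unsigned)((word >> (i * SCALE_STRIDE)) & ((1u << SCALE_STRIDE) - 1));
}
```

Because the largest product, 7 × 7 = 49, fits in six bits, each lane of the result holds the complete product of its input value and the scalar.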
4.1 Vector Scale with Temporary Spacer Bits
The downside of permanent spacer bits can be avoided by using
temporary spacer bits for vector scale as shown in Figure 9. We
first extract odd and even numbered lanes of the vector into
separate registers using bitmasks, which creates b zero-valued
spacer bits between each lane. We multiply each of the odd and even
vectors by the b-bit scalar. The result is two sets of vectors where
each lane contains a 2b-bit value. Finally, we mask off the upper
half of each lane, and merge the odd and even vectors into a single
vector of b-bit values.
Note that when multiplying two b-bit integers, the result in the
lower b bits is the same regardless of whether we use unsigned or
two’s-complement signed multiplication [9]. Therefore, the routine
in Figure 9 gives the correct b-bit result in each lane for either
signed or unsigned SAMD arithmetic. Each signed lane of the vector
gets the correct value, despite the fact that the routine uses
unsigned arithmetic on the underlying integer type that is used to
store the SAMD vector.
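The signed case can be illustrated with 4-bit lanes. The sketch below is specialised to that width with hard-coded masks; the names scale_wrap4 and lane_s4 are hypothetical. The scalar is passed as its 4-bit two's-complement encoding, so −3 is written 0xD.

```c
#include <stdint.h>

/* Vector scale with temporary spacer bits, specialised to 4-bit lanes.
   `scalar` must be a 4-bit value (0..15). */
static uint64_t scale_wrap4(uint64_t vec, uint64_t scalar)
{
    const uint64_t even_mask = UINT64_C(0x0F0F0F0F0F0F0F0F);
    const uint64_t odd_mask  = UINT64_C(0xF0F0F0F0F0F0F0F0);
    /* each half has 4 zero bits above every lane to absorb the overflow */
    uint64_t even = ((vec & even_mask) * scalar) & even_mask;
    uint64_t odd  = ((vec & odd_mask)  * scalar) & odd_mask;
    return even | odd;
}

/* Interpret lane i as a signed 4-bit two's-complement value. */
static int lane_s4(uint64_t word, int i)
{
    int v = (int)((word >> (4 * i)) & 0xF);
    return v >= 8 ? v - 16 : v;
}
```

Scaling the lanes (3, −2, 7, −8), packed as 0x87E3, by −3 gives the lanes (7, 6, −5, −8): each is the true product (−9, 6, −21, 24) wrapped to four bits, exactly as a native 4-bit signed multiply would produce.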
5 Convolution
The most computationally-intensive operation in CNNs is convolution.
Simple 1D convolution can be used to express other operations, such
as polynomial multiplication. For example, the two polynomials
$2x^2 + 3x + 7$ and $x^2 - 5$ can be expressed as vectors
$a = [2, 3, 7]$, $b = [1, 0, -5]$. The product of the polynomials
can be expressed as the convolution conv(a, b). Likewise,
convolution can be expressed as products of polynomials.
Interestingly, our modern positional number system also represents
numbers as polynomials. For example, the number 276 can be expressed
as $2 \times 10^2 + 7 \times 10^1 + 6 \times 10^0$. Given that we
already have computer hardware to perform multiplication of numbers
in a positional system, an intriguing question arises: can we
use the same hardware multipliers to perform convolution?
In this section we propose a novel SAMD convolution oper-
ation that uses scalar multiplication to perform convolution
on SAMD vectors.
Polynomial multiplication can be used to multiply numbers, but it is
not a perfect substitute. Consider the case of multiplying the
following numbers expressed as polynomials.
\[
\begin{aligned}
276 &= 2 \times 10^2 + 7 \times 10^1 + 6 \times 10^0 = 2x^2 + 7x + 6 \\
15 &= 1 \times 10^1 + 5 \times 10^0 = x + 5
\end{aligned}
\]
For brevity, let us define x to be the base of the positional number
system; in this case x = 10. Polynomial multiplication of these two
numbers gives us:
\[
\begin{aligned}
x(2x^2 + 7x + 6) + 5(2x^2 + 7x + 6) &= 2x^3 + 17x^2 + 41x + 30 \\
&= 2 \times 10^3 + 17 \times 10^2 + 41 \times 10^1 + 30 \times 10^0
\end{aligned}
\]
In contrast, the arithmetic result of multiplying the two numbers is
$276 \times 15 = 4140$. The results of arithmetic and polynomial
multiplication look quite different, but they are equivalent. The
coefficients of the terms in the polynomial result are the raw
result of multiplication of polynomials. They do not take account of
the fact that, for example,
$17 \times 10^2 = 1 \times 10^3 + 7 \times 10^2$. If we fully
collapse all such equivalences in the result of the polynomial
product, the result is:
\[
\begin{aligned}
4x^3 + x^2 + 4x &= 4 \times 10^3 + 1 \times 10^2 + 4 \times 10^1 + 0 \times 10^0 \\
&= 4140
\end{aligned}
\]
The polynomial product is similar to numeric multiplication, except
that it does not allow overflow from one coefficient to the next.
Therefore, if we can prevent overflow between coefficients packed
into a scalar data word, we can use scalar multiplication to perform
the polynomial product.
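The worked example above can be reproduced with a few lines of C. This is an illustrative sketch with hypothetical helper names (poly_mul, poly_eval); coefficients are stored lowest power first, so 276 is {6, 7, 2} and 15 is {5, 1}.

```c
#include <stddef.h>

/* Full 1D convolution (polynomial product): out has na + nb - 1 entries.
   Coefficients are stored lowest power first. */
static void poly_mul(const long *a, size_t na, const long *b, size_t nb, long *out)
{
    for (size_t i = 0; i < na + nb - 1; i++)
        out[i] = 0;
    for (size_t i = 0; i < na; i++)
        for (size_t j = 0; j < nb; j++)
            out[i + j] += a[i] * b[j];
}

/* Evaluate the polynomial at x with Horner's rule; for x equal to the
   number base this collapses the un-carried coefficients into a number. */
static long poly_eval(const long *c, size_t n, long x)
{
    long acc = 0;
    for (size_t i = n; i-- > 0; )
        acc = acc * x + c[i];
    return acc;
}
```

Convolving {6, 7, 2} with {5, 1} gives the raw coefficients {30, 41, 17, 2}, and evaluating them at x = 10 gives 4140, the same value as the integer product 276 × 15.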
5.1 Convolution as Long Multiplication
A simple way to prevent overflow between coefficients is to ensure
that there are enough bits in each lane of the result that overflow
cannot happen. In the first instance we consider only unsigned
numbers, and in a later section we will extend our approach to
signed numbers. As we saw in Section 4, we can scale an unsigned
vector with b-bit lanes by a b-bit unsigned scalar using native
full-word multiplication by ensuring that there are always b spacer
bits between each lane of the SAMD vector. We can use a similar
technique to perform convolution of SAMD vectors, by ensuring that
there are always enough bits in each result lane to contain the
maximum value computable by the convolution.
Consider the case where we convolve a vector of b-bit values by a 1D
convolution kernel of three b-bit values. Each output point in the
result consists of the sum of three values. Each of these values is
the product of a b-bit input value and a b-bit kernel value, which
gives a 2b-bit result. The sum
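[Figure 10 diagram. Type labels: uint32_t, 4xuint4_t, uint64_t, 8xuint8_t.
 Operands of the unsigned integer multiply:
    i0   i1   i2   i3
     0   k1   k2   k3
 Partial products, each row offset by one lane, summed column-wise:
              k3i0 k3i1 k3i2 k3i3
         k2i0 k2i1 k2i2 k2i3
    k1i0 k1i1 k1i2 k1i3 ]
Figure 10. Convolutional substructure in fixed-width unsigned multiply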
of three such values requires a maximum of 2b + 2 bits to represent.
Therefore, if we guarantee that the output lane into which we
compute such a value has 2b + 2 bits, then there will never be
overflow between output lanes. One way to ensure that each output
lane has 2b + 2 bits is to modify the inputs so that each input and
kernel lane has b + 1 bits and is separated by b + 1 spacer bits.
The effect of putting each b-bit value into an input lane with
2b + 2 bits (including spacer bits) is that we can consider
ourselves to be performing arithmetic on values with base
$2^{2b+2}$. By design each input has only b bits, so we are
guaranteed that our base-$2^{2b+2}$ convolution never overflows
between lanes. Thus, we can perform SAMD convolution using
multiplication.
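A minimal sketch of this idea for b = 4, so that every lane is 2b + 2 = 10 bits wide. Four inputs and three kernel taps then produce six output lanes (60 bits), which fit in the low 64 bits of a single multiplication. The lane ordering (element j in lane j) and all helper names are assumptions made for this example, not the authors' layout.

```c
#include <stdint.h>

enum { CONV_B = 4, CONV_LANE = 2 * CONV_B + 2 };   /* 4-bit values, 10-bit lanes */

/* Place element i of v in lane i; the upper bits of each lane are spacers. */
static uint64_t pack_lanes(const unsigned *v, int n)
{
    uint64_t word = 0;
    for (int i = 0; i < n; i++)
        word |= (uint64_t)(v[i] & 0xFu) << (i * CONV_LANE);
    return word;
}

/* Full convolution of 4 inputs with 3 kernel taps using one multiply. */
static uint64_t conv_mul(const unsigned in[4], const unsigned ker[3])
{
    return pack_lanes(in, 4) * pack_lanes(ker, 3);
}

/* Extract output lane n (0..5) from the product. */
static unsigned conv_lane(uint64_t product, int n)
{
    return (unsigned)((product >> (n * CONV_LANE)) & ((1u << CONV_LANE) - 1));
}

/* Reference: out[n] = sum over j of in[j] * ker[n - j]. */
static unsigned conv_ref(const unsigned in[4], const unsigned ker[3], int n)
{
    unsigned acc = 0;
    for (int j = 0; j < 4; j++)
        if (n - j >= 0 && n - j < 3)
            acc += in[j] * ker[n - j];
    return acc;
}
```

Even in the worst case, with every input and tap equal to 15, the largest lane value is 3 × 225 = 675, which is below 2^10 = 1024, so no lane overflows into its neighbour.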
Consider the example in Figure 10. We have four input values
$i_0, i_1, i_2, i_3$ and a 1D convolution kernel of three values
$k_1, k_2, k_3$. The input and the kernel are packed into SAMD
vectors with zero-valued spacer lanes between each lane containing a
value. Figure 10 shows each intermediate value that is computed as
part of the long multiplication. Each intermediate value is the
product of two inputs and therefore occupies a double-width lane in
the result. When performing the multiplication we use machine
instructions that compute the full double-width result. For example,
the x86-64 architecture has a 64-bit multiply instruction that
returns a 128-bit result split across two registers. The result
contains the convolution of the input and kernel.
Note that the example in Figure 10 has an input of length four and
kernel of length three. Each fits within a single SAMD register, and
it is therefore possible to compute the convolution with just a
single multiply. In convolutional neural networks the kernel
typically has a short, fixed length such as three or five. However,
the input is normally much longer, and we must therefore typically
compute the convolution of a kernel that fits within a single SAMD
register, and an input that does not.
If we again consider the example in Figure 10, but assume that the
input is just a short subsection of a much larger input, a pattern
emerges. In a long convolution with a kernel of length three, we
expect each output to consist of the sum of three products. In the
example we see that two lanes of the output (those containing the
values $k_3 i_0 + k_2 i_1 + k_1 i_2$ and
$k_3 i_1 + k_2 i_2 + k_1 i_3$) are the sum of three non-zero
products. These two output values are complete result values of the
convolution. In contrast, the other four outputs are partial
results. Each is the sum of just one or two values. To compute the
final results for these output points, we need to combine partial
results from different multiplications.
If we look at the pattern of non-zero intermediate values of the
multiplication in Figure 10 we see that they form a
parallelogram-shaped region. When we perform a longer convolution
the result is a sequence of such parallelograms. To get the result
of a long convolution we align the adjacent parallelograms by bit
shifting the output word and add the overlapping points.
For convolutional neural networks we must consider two additional
practical issues. First, many CNNs use 2D convolution rather than
the 1D convolution that we compute. However, we can compute a 3 × 3
2D convolution as the sum of three length-3 1D convolutions. CNN
convolution also typically involves multiple channels of input and
kernel. The output is computed by summing the corresponding
convolution results across different channels. This can be computed
using normal SAMD addition. However, rather than resolving the
overlapping partial sums (or parallelograms) at each channel before
summing the channels, it is often more efficient to perform the
addition across channels first, and resolve the overlapping partial
sums later.
6 Handling of Signed Values
Our discussion of SAMD convolution and vector scale has so far
assumed that inputs are unsigned. However, CNNs normally use signed
values, at least for the convolution kernel. Most processors provide
separate signed and unsigned multiply instructions. The lower b bits
of the result of multiplying two b-bit values is the same for both
signed and unsigned multiplication. However, the higher b bits of
the result are different depending on whether the multiplication is
signed or unsigned. Grys [9] provides an overview of software
methods for computing signed multiplication by adjusting the result
of unsigned multiplication.
If we are operating on b-bit inputs and computing b-bit outputs,
multiplication of signed values is straightforward. We simply
compute the 2b-bit result using whatever multiplication is available
and discard the upper b bits. However, multiplying b-bit signed
inputs to create 2b-bit results is much more complicated.
Unsigned scalar multiplication can be used to implement unsigned
SAMD vector scaling and convolution. All of the bits of an unsigned
number have essentially the same meaning as positive values. Signed
scalar multiplication treats the leading 1’s of a two’s-complement
signed number as the sign extension. But signed multiplication does
not deal correctly with signed SAMD lanes within a scalar value,
because signed scalar multiplication deals only with the leading
bits of the scalar value as indications of a negative value. A
signed scalar multiplier is unable to identify signed SAMD lanes
within a machine word. Instead we use unsigned scalar
multiplication, and add fixup code to adjust the values to account
for negative values.
For simplicity, we assume the input vector and convolution kernel
consist of signed b-bit integer lanes, each separated by b
zero-valued spacer bits. Note that scalar times vector
multiplication is simply a special case of convolution, where the
kernel contains just one value.
uint64_t samd_sign_extend4mul(uint64_t vec, int bits)
{
// get mask with 1 in MSB of each lane
uint64_t msb_mask = MSB_LANE_MASK(bits);
// extract the sign bits of each lane
uint64_t sign_mask = vec & msb_mask;
// subtract sign bits from adjacent spacer bits
return vec - (sign_mask << 1);
}
Figure 11. Special-purpose sign extension for vector scale or convolution
To sign-extend a negative lane, we need to extend its leading sign
bit into the adjacent b zero-valued spacer bits to its left. Figure
11 shows a simple method for sign extension. We shift the leading
sign bit of each vector lane left by one position, and subtract it
from the adjacent spacer bit to the left. Where the leading sign bit
has value zero, that lane contains a positive (or zero) number, and
zero is subtracted from the adjacent spacer bit leaving it zero.
Where the leading sign bit is one, the number is negative, and all b
spacer bits must be set to one to fully sign extend the number. We
achieve this by subtracting the leading sign bit from the adjacent
spacer bit which causes it to switch from zero to one, with a
negative carry to the next bit left. The negative carry cascades
through all the spacer bits, and finally carries into the next
populated lane of the SAMD vector. As a result the next lane to the
left of a negative number is reduced by one.
On first view one might expect this change of value to
catastrophically change the result of the overall computation, but
remarkably it does not. When we multiply the resulting vector value
by the scalar, we get an answer that is very close to the correct
one. The rightmost lane of the resulting vector is always correct.
Other lanes have the correct value if the lane to the right is
non-negative. Where the lane immediately to the right is negative,
the current lane has a value one less than the true value. We
correct this by adding 1 to each of the lanes with a negative value
to the right.
Figure 12 shows code to perform the convolution and correct the
underflow that propagates from adjacent negative lanes. The shown
code is efficient, but non-obvious. The obvious way to correct the
underflow from an adjacent negative lane is to extract the most
significant bit (which is 1 in the case of a negative number), shift
the bit one position to the left so it aligns with the least
significant bit of the lane to the left and add. However, the
obvious approach contains a subtle bug. Where the correct value in
an output lane is
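// 128-bit unsigned type; unsigned __int128 is a GCC/Clang extension
typedef unsigned __int128 uint128_t;

uint128_t signed_convolution(uint64_t vec, uint64_t kernel,
                             int bits)
{
    // compute 128 bit product of 64-bit inputs
    uint128_t product = (uint128_t)vec * (uint128_t)kernel;
    // fix underflow from adjacent lane: select the MSB of each
    // 2*bits-wide output lane (assumes 2*bits divides 64, so the
    // 64-bit mask can be repeated in the upper half)
    uint64_t mask64 = MSB_LANE_MASK(bits * 2);
    uint128_t mask = ((uint128_t)mask64 << 64) | mask64;
    uint128_t sign_bits = product & mask;
    // add the sign bit to itself to increment
    // lane to the left of a negative number
    product = product + sign_bits;
    // correct the MSB for wrongly removed sign
    return product ^ sign_bits;
}
Figure 12. Signed convolution algorithm with underflow correction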
zero, the negative underow from an adjacent lane pushes
the value to -1. In two’s complement the representation of -1
is all 1’s. If we correct this value by adding one, the lane will
get the correct value, but will also overow into the next lane
to the le. is would be correct except that the next lane
to the le will have already been incremented because its
direct neighbour is negative before the increment. erefore,
the value is incremented twice.
We instead use a non-obvious sequence that extracts the
sign bits, but rather than shifting them left adds them to
the same location. Where the sign bit is 1, this causes an
overflow into the adjacent lane. Where the lane contains the
value -1, the overflow switches it to zero, but when we add
another 1 to the MSB there is no overflow into the neighbour.
Finally, we correct the MSB of each output lane.
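The difference between the two correction strategies can be checked with a small sketch. It is an illustration under stated assumptions, not the paper's generated code: it handles only the vector-by-scalar case, narrows the product to 64 bits, and uses eight 8-bit output lanes for 4-bit signed inputs. The names samd_pack, samd_lane, samd_scale_naive and samd_scale are hypothetical.

```c
#include <stdint.h>

/* Assumed layout: eight 8-bit output lanes in a 64-bit word, so a 4-bit
   signed input times a 4-bit signed scalar always fits in one lane. */
enum { LANES = 8, LANE_BITS = 8 };
#define LANE_MSB 0x8080808080808080ULL  /* most significant bit of every lane */

/* Pack signed values. Adding sign-extended, shifted values has the same
   effect as the shift-and-subtract sign extension: a negative lane borrows
   one from the lane to its left. */
static uint64_t samd_pack(const int8_t x[LANES])
{
    uint64_t v = 0;
    for (int i = 0; i < LANES; i++)
        v += (uint64_t)(int64_t)x[i] << (i * LANE_BITS);
    return v;
}

/* Read lane i back as a signed 8-bit value. */
static int samd_lane(uint64_t v, int i)
{
    int u = (int)((v >> (i * LANE_BITS)) & 0xFF);
    return u >= 128 ? u - 256 : u;
}

/* The "obvious" fix: move each sign bit to the LSB of the lane on its left. */
static uint64_t samd_scale_naive(uint64_t vec, int scalar)
{
    uint64_t product = vec * (uint64_t)(int64_t)scalar;
    return product + ((product & LANE_MSB) << 1);
}

/* The sequence of Figure 12, narrowed to a 64-bit product for this sketch. */
static uint64_t samd_scale(uint64_t vec, int scalar)
{
    uint64_t product = vec * (uint64_t)(int64_t)scalar;
    uint64_t sign_bits = product & LANE_MSB;
    product += sign_bits;        /* carry out of each set MSB into the left lane */
    return product ^ sign_bits;  /* restore the MSBs disturbed by the addition   */
}
```

With inputs {-1, 0, 0, 2, 3, -2, 0, 5} and scalar 3, samd_scale yields 6 in lane 3, while samd_scale_naive yields 7: the zero lanes sit at -1 after the multiplication, so their correction overflows into a lane that has already been incremented.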
6.1 Optimizations for Permanent Spacer Bits
The method of signed convolution presented so far has assumed
that we need to compute the correct value for the
MSB of each vector lane. However, it is worth noting that
computing the correct MSB can be surprisingly expensive for
SAMD operations. For example, in the SAMD addition operation
in Figure 5, half of the bitwise operations are devoted
to computing the correct value of the MSB. If we are willing
to sacrifice one bit of precision and not maintain the MSB,
SAMD addition is much more efficient. Where we maintain
a permanent spacer bit in the MSB of each SAMD lane, we
can use cheaper operations, such as the addition in Figure 2,
which uses far fewer bitwise operations.
In the case where we maintain a permanent spacer bit in
the MSB of each lane, we can also perform convolution more
efficiently. First, the final xor operation in the convolution
code in Figure 12 is unnecessary if we do not need to compute
the MSB. Given that the xor operates on a double-width value
(in this case 128 bits), this likely corresponds to two scalar
machine instructions.
A second optimization of convolution arises because of
the way that we correct for negative adjacent lanes. Recall
that to correct for negative lanes we extract the sign bit
from the result of the multiplication and add it back into
the result. Where the sign bit has value 1, this propagates
a positive carry into the neighbouring lane. Unfortunately
this operation also corrupts the sign bit, but where the MSB
is a spacer bit, this is not a problem. However, the result of
the addition can leave the lane of a SAMD vector in one of
two possible states. If before the addition the SAMD lane
contained the value -1, then after the addition the SAMD
lane contains the value zero in all bits except the MSB, which
has value 1. If before the addition the SAMD lane contained
any other value, then after the addition it contains a zero in
the MSB and arbitrary values in other bits.
In CNN convolution, the output of our convolution operation
is normally added to an accumulation of the values
across multiple channels. Where a permanent spacer bit is
used in SAMD lanes, we normally clear the MSB of each
operand before performing the addition (as shown in Figure
2). However, given that the MSB of each lane in the output
of our convolution operation is already zero in all cases except
the case where it is one and all other bits are zero, addition
with a SAMD lane where the MSB is zero cannot overflow
into the neighbouring lane. Therefore, where the output of
our convolution feeds directly into addition with a permanent
spacer bit, there is no need to clear the MSB of the lanes
of our convolution result.
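A minimal sketch of addition with a permanent spacer bit, and of the cheaper accumulation described above, might look as follows. Figure 2 is not reproduced in this section, so the masked add shown here is the standard formulation and only assumed to match it; the 8-bit lane layout and the names SPACER_MASK, samd_add and samd_accumulate are likewise assumptions.

```c
#include <stdint.h>

/* Assumed layout: eight 8-bit lanes whose MSB is a permanent spacer bit,
   leaving 7 data bits per lane. */
#define SPACER_MASK 0x8080808080808080ULL

/* Lane-wise add: clear the spacer bit of both operands so that any carry
   out of the 7 data bits lands in the spacer instead of the next lane. */
static uint64_t samd_add(uint64_t a, uint64_t b)
{
    return (a & ~SPACER_MASK) + (b & ~SPACER_MASK);
}

/* Accumulate a convolution result without clearing its spacer bits first.
   Assumes every lane of conv either has a zero MSB or is exactly 0x80. */
static uint64_t samd_accumulate(uint64_t acc, uint64_t conv)
{
    return (acc & ~SPACER_MASK) + conv;
}
```

In the worst case a lane of acc holds 0x7F after masking; adding either 0x7F or 0x80 from conv gives at most 0xFF, so nothing carries into the neighbouring lane and one masking operation is saved per accumulation.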
7 Constant Kernels and Overflow
Deep neural networks perform large numbers of convolutions
at training time and later when the network is used
for inference. During training the convolution kernels are
updated repeatedly as they are slowly tuned to classify the
training data. As the kernel values are slowly modified during
training they typically need a lot of precision so that
they can be repeatedly fine-tuned [14]. However, once training
is complete, it is often possible to quantize the kernel values
to reduce their precision with minimal loss of inference
accuracy [20].
When preparing to deploy a trained DNN we can exploit
the availability of the kernel values. To guarantee that the
product of two unknown b-bit integer values will not overflow,
we need to reserve 2b bits for the result. However, if
we know that one of the two numbers has the value 5 (or
binary 101) then we know that the size of the result cannot
exceed b + 3 bits. We can extend this principle to multiple
products. Consider the case where we want to compute the
dot product of a vector of unknown b-bit values [x_0, x_1, x_2, x_3]
and a vector of known values [4, 3, 9, 6]:
4x_0 + 3x_1 + 9x_2 + 6x_3
In the worst case all of x_0..x_3 might have the largest representable
b-bit value b_max. In that case the result
would be (4 + 3 + 9 + 6)b_max = 22b_max. The value 22 can be
represented in five bits in an unsigned representation (or six
bits if signed). Therefore, if the result of the dot product is
allocated b + 5 bits, we are guaranteed that even the worst
case input cannot cause overflow. Where the kernel contains
negative values, the corresponding worst case input value is
the negative number with largest absolute value.
int conv_output_bits(int kernel[CHANNELS][K][K])
{
  int positive = 0, negative = 0;
  // sum the positive and the negative kernel values separately
  for (int c = 0; c < CHANNELS; c++) {
    for (int kh = 0; kh < K; kh++) {
      for (int kw = 0; kw < K; kw++) {
        if (kernel[c][kh][kw] < 0)
          negative += kernel[c][kh][kw];
        else
          positive += kernel[c][kh][kw];
      }
    }
  }
  int negbits = bits_required(negative);
  int posbits = bits_required(positive);
  return max(negbits, posbits);
}
Figure 13. Find the maximum possible number of additional bits needed
by a signed convolution kernel with unsigned input.
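The unsigned example with known values [4, 3, 9, 6] can be checked with a short sketch. The helper names unsigned_bits and extra_bits are hypothetical and the code covers only the non-negative kernel case.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of bits needed to represent v as an unsigned integer (0 needs none). */
static int unsigned_bits(uint64_t v)
{
    int bits = 0;
    while (v != 0) {
        bits++;
        v >>= 1;
    }
    return bits;
}

/* Extra bits, beyond the b input bits, that a dot product with a known
   non-negative kernel can need: the bit width of the sum of kernel values. */
static int extra_bits(const uint32_t *kernel, size_t n)
{
    uint64_t sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += kernel[i];
    return unsigned_bits(sum);
}
```

For the kernel [4, 3, 9, 6] the sum is 22 and extra_bits returns 5. With b = 4 the largest input is 15, and 22 × 15 = 330 fits in b + 5 = 9 bits because 330 < 512.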
A common scenario in DNNs is that the kernels contain
signed values but the input values are unsigned. This arises
in neural networks that use a rectified linear unit (ReLU)
activation function. ReLU operates on each scalar value of
the input tensor in isolation and replaces the existing value,
v_e, with a new value v_n, such that v_n = max(v_e, 0). ReLU
clamps all negative values to zero, which guarantees that the
resulting scalar values can all be represented as unsigned
numbers. Where the input values are unsigned and the
kernel values are signed, we can compute the worst case
positive and negative values of the result, as shown in Figure
13.
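As a rough illustration of the signed-kernel, unsigned-input case, the sketch below computes the extreme values a dot product can take. The type range_t, the function names and the parameter in_max (the largest unsigned input value) are assumptions introduced for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* ReLU: clamp negative activations to zero. */
static int32_t relu(int32_t v)
{
    return v > 0 ? v : 0;
}

typedef struct {
    int64_t lo;  /* most negative possible result */
    int64_t hi;  /* most positive possible result */
} range_t;

/* Worst-case output range of a dot product between a known signed kernel
   and unsigned inputs in [0, in_max]. */
static range_t worst_case_range(const int32_t *kernel, size_t n, int64_t in_max)
{
    range_t r = {0, 0};
    for (size_t i = 0; i < n; i++) {
        if (kernel[i] < 0)
            r.lo += kernel[i] * in_max;  /* negative weights meet in_max */
        else
            r.hi += kernel[i] * in_max;  /* positive weights meet in_max */
    }
    return r;
}
```

For the kernel [4, -3, 9, -6] and 4-bit unsigned inputs (in_max = 15), the result lies between -135 and 195, so the output lane must be wide enough for both extremes.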
8 Experimental Evaluation
We built a small code generator which synthesizes a SAMD
instruction sequence to implement a given number of iterations
of the inner loop of DNN convolution. The overall
DNN convolution loop is shown in Figure 14.
for (unsigned m = 0; m < kernels; m++)
  for (unsigned h = k/2; h < (img_h-(k/2)); h++)
    for (unsigned w = k/2; w < (img_w-(k/2)); w++) {
      T result = output[m][h][w];
      for (unsigned d = 0; d < channels; d++) {
        for (unsigned y = 0; y < k; y++)
          for (unsigned x = 0; x < k; x++)
            result +=
              (T)image[d][(h + y)-(k/2)][(w + x)-(k/2)]
              *
              (T)kernel[m][d][y][x];
      }
      output[m][h][w] = result;
    }
Figure 14. Loop performing DNN convolution
In our evaluation, we limit our experiments to the class of
convolutions producing at most 16 bits of output.
Our baseline “native” quantized convolution consumes a
stream of signed 8-bit samples, which are convolved using
the algorithm in Figure 14.
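For concreteness, one way to instantiate the loop of Figure 14 for the native baseline is sketched below, with T taken to be int16_t and signed 8-bit inputs. The tiny fixed dimensions and the function name native_conv are assumptions chosen only to keep the example self-contained; the benchmarked layers are far larger.

```c
#include <stdint.h>

/* Hypothetical, tiny dimensions chosen only for illustration. */
enum { CHANNELS = 2, IMG_H = 4, IMG_W = 4, K = 3, KERNELS = 1 };

/* Direct convolution in the loop order of Figure 14, with T = int16_t.
   Border pixels are skipped, so their outputs are left untouched. */
static void native_conv(int8_t image[CHANNELS][IMG_H][IMG_W],
                        int8_t kernel[KERNELS][CHANNELS][K][K],
                        int16_t output[KERNELS][IMG_H][IMG_W])
{
    for (int m = 0; m < KERNELS; m++)
        for (int h = K / 2; h < IMG_H - K / 2; h++)
            for (int w = K / 2; w < IMG_W - K / 2; w++) {
                int16_t result = output[m][h][w];
                for (int d = 0; d < CHANNELS; d++)
                    for (int y = 0; y < K; y++)
                        for (int x = 0; x < K; x++)
                            result += (int16_t)(image[d][h + y - K / 2][w + x - K / 2]
                                                * kernel[m][d][y][x]);
                output[m][h][w] = result;
            }
}
```

With every image value set to 1 and every kernel value set to 2, each interior output accumulates 2 channels × 9 taps × 2 = 36.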
Our SAMD convolutions use precisely the same implementation,
but with our synthesized implementation of the
arithmetic. After we generate the C code implementing the
SAMD arithmetic operations, we compile to machine code
using the native compiler (in our case, GCC release 7.2).
All benchmark executables were compiled with -std=c++14
-O3. The NVIDIA Tegra X1 board we used for the Cortex-A57
benchmarking was running NVIDIA JetPack 3.1, while the
Intel machine was running Linux 4.12. We use the same compiler
options for all experiments, both for the native 8-bit
integer code and our generated code.
8.1 Experimental Setup
We evaluated our approach on an Intel Haswell CPU, specifically
the Intel Core i5-4570, and on an ARM Cortex-A57
CPU. The Cortex-A57 is used in NVIDIA’s Tegra X1 platform
for embedded and automotive development.
For our experiments we use the convolutional layers
of the VGG-B network of Simonyan and Zisserman [17]. Our graphs
present the mean execution time (with error bars representing
one standard error about the mean) of each convolutional
layer, implemented by the loop in Figure 14. From left to
right on the graphs, the depth of the layer in the network
increases, increasing the number of channels and the number
of convolutional kernels. The precise dimensions of each
convolution are tabulated in [17].
On our gures, the native direct convolution algorithm
direct-mhwcyx shows the performance of the convolution
layer when quantized to native 8-bit signed integer weights
and activations. e SAMD(N) bars show the performance
of our generated SAMD convolution layer when quantized
to N -bit signed integer weights and activations. We step N
down from 8 (emulating native 8-bit integers) to 2 (a single
data bit with a sign bit).
8.2 Experimental Results: Intel Haswell
Figures 15 and 16 show the results of benchmarking on Intel
Haswell. On Haswell, GCC generates a very heavily opti-
mized piece of code for the native DNN convolution, making
heavy use of loop unrolling.
Using temporary spacer bits, a more complex instruction
sequence is required (shown in Figure 4) than where the
format contains permanent spacer bits, leading to a higher
execution time overall. Although using temporary spacer
bits minimizes the memory footprint, the overhead of the
more complex instruction sequence is very clearly visible in
the execution time.
In convolution layers late in the VGG network, with very
large numbers of channels, native 8-bit code is faster than
some of the wider SAMD instances with temporary spacer
bits, but in general, as the quantization factor increases
(meaning we can fit more values in a SAMD vector) the
performance of SAMD improves rapidly. The best available
speedup for an individual convolution layer, using 2-bit
quantization, is approximately 6× on this Haswell processor.
[Plot: "SAMD Convolution with Temporary Spacer Bits (VGG on Intel Haswell)". x-axis: convolution layers conv1_1 through conv5_2; y-axis: Execution Time (ms), 0 to 1400; series: direct-mhwcyx and SAMD (8) through SAMD (2).]
Figure 15. Intel Core i5-4570 performance of quantized DNN convolution with temporary spacer bits
[Plot: "SAMD Convolution with Permanent Spacer Bits (VGG on Intel Haswell)". x-axis: convolution layers conv1_1 through conv5_2; y-axis: Execution Time (ms), 0 to 1400; series: direct-mhwcyx and SAMD (8) through SAMD (2).]
Figure 16. Intel Core i5-4570 performance of quantized DNN convolution with permanent spacer bits
[Plot: "SAMD Convolution with Temporary Spacer Bits (VGG on ARM Cortex A-57)". x-axis: convolution layers conv1_1 through conv5_2; y-axis: Execution Time (ms), 0 to 5000; series: direct-mhwcyx and SAMD (8) through SAMD (2).]
Figure 17. ARM Cortex-A57 performance of quantized DNN convolution with temporary spacer bits
[Plot: "SAMD Convolution with Permanent Spacer Bits (VGG on ARM Cortex A-57)". x-axis: convolution layers conv1_1 through conv5_2; y-axis: Execution Time (ms), 0 to 5000; series: direct-mhwcyx and SAMD (8) through SAMD (2).]
Figure 18. ARM Cortex-A57 performance of quantized DNN convolution with permanent spacer bits
8.3 Experimental Results: ARM Cortex-A57
Figures 17 and 18 show the results of benchmarking on the
ARM Cortex-A57. Here, the SAMD code for temporary
spacer bits is already faster than native code in all cases
(Figure 17). Using permanent spacer bits (Figure 18), the gap
widens even further. Where the incoming data are in reduced
precision, SAMD is a very effective mechanism to achieve
parallelized execution on constrained processors such as the
Cortex-A57.
Using the low-complexity operations with permanent
spacer bits (Figure 18), the best speedup available for an
individual convolution layer, using 2-bit quantization, is approximately
10× versus native 8-bit integer implementation.
9 Related Work and Conclusion
Algorithms to operate on multiple subwords in parallel have
been known for some time. A 1957 programming textbook
for the 35-bit EDSAC contains a recursive divide and conquer
algorithm to compute the number of set bits (or popcount),
where the bit-width of each partial sum doubles on each
iteration [2, 19]. e famous 1972 HAKMEM technical report
[1] from MIT contains an example (no. 115) that operates on
base-4 subwords using bitwise operations.
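The divide-and-conquer population count is usually written today in the following form. This is the widely known 64-bit formulation, given only to illustrate how the partial sums double in width at each step; it is not the original EDSAC routine, and the name popcount64 is an assumption.

```c
#include <stdint.h>

/* Divide-and-conquer population count on a 64-bit word: each step adds
   neighbouring partial sums, doubling their width (1, 2, 4, ..., 32 bits). */
static unsigned popcount64(uint64_t x)
{
    x = (x & 0x5555555555555555ULL) + ((x >> 1)  & 0x5555555555555555ULL);
    x = (x & 0x3333333333333333ULL) + ((x >> 2)  & 0x3333333333333333ULL);
    x = (x & 0x0F0F0F0F0F0F0F0FULL) + ((x >> 4)  & 0x0F0F0F0F0F0F0F0FULL);
    x = (x & 0x00FF00FF00FF00FFULL) + ((x >> 8)  & 0x00FF00FF00FF00FFULL);
    x = (x & 0x0000FFFF0000FFFFULL) + ((x >> 16) & 0x0000FFFF0000FFFFULL);
    x = (x & 0x00000000FFFFFFFFULL) + (x >> 32);
    return (unsigned)x;
}
```

Each line treats the word as a vector of small unsigned lanes and adds adjacent lanes in parallel, which is the same subword-parallel idea that SAMD builds on.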
In 1975 Lamport outlined a vision for processing multiple
n-bit “bytes” within full word instructions (where n is not
necessarily eight). Lamport outlines an algorithm for SAMD
addition with and without spacer bits, and mentions the
possibility of vector times scalar multiplication albeit without
details. e main focus of the work is lane-wise comparisons
and masks, particularly for processing character data.
Fredriksson and Grabowski [6, 7] use fast FFT convolu-
tions for approximate string matching. To increase paral-
lelism, they pack multiple symbols into a machine word,
before computing the FFT and Fourier domain convolution
on the packed words, which they describe as word-level par-
allelism. In contrast our work deals with both signed and
unsigned values, and computes the convolution in the time
rather than Fourier domain.
Fu et al. [8] present a paper on multiplying two 8-bit in-
puts by an 8-bit scalar value in a single step using the DSP
accelerator units on Xilinx FPGAs. According to their de-
scription — which is tightly bound to implementation on
Xilinx FPGAs — their approach can deal with both signed
and unsigned numbers. eir method is similar to our vec-
tor scale algorithm, but it is missing the crucial post-pass
adjustment that allows correct handling of negative inputs.
It is dicult to know whether they rely on some hardware
mechanism of Xilinx DSP units to nd the correct result for
negative inputs, or they use a post-pass adjustment similar
to ours and neglect to mention it.
Umuroglu et al. [18] propose FINN, a framework for fast
binarized neural network inference. In their formulation,
points in a convolution can take on the values from the set
{−1, 0,+1}.
Conclusion
The quantization scheme of Umuroglu et al. [18] can be
represented without loss of accuracy in our SAMD 2 format,
which permits the values {−2,−1, 0,+1}. Umuroglu et al.
propose to use a customized hardware unit, implemented on
FPGA to actually perform inference.
Our approach allows the quantization scheme of Umuroglu
et al. to be run on a general-purpose CPU in a highly efficient
manner. In fact, Umuroglu et al. also use the VGG-B network
in their experiments. As we have demonstrated, the layers
of this network can be run on a Cortex-A57 with a 2-bit
quantized inference scheme yielding a speedup approaching
10× for several layers.
As demonstrated by our experimental evaluation, quantization
can enable not only memory savings, but also large
inference performance gains without requiring custom hardware
support. Software support for bit-precise quantized
DNN operations using our approach can often match, and
sometimes far exceed, the performance of inference quantized
to the sizes supported in native arithmetic.
References
[1] Michael Beeler, R.W. Gosper, and Rich Schroeppel. Hakmem. Technical
Report AIM-239, Cambridge, MA, USA, February 1972.
[2] Y Edel and A Klein. Population count in arrays. Submitted, available
online http://cage.ugent.be/klein/popc.html, page 95, 2009.
[3] Randall Fisher. General-Purpose SIMD Within A Register: Parallel
Processing on Consumer Microprocessors. PhD thesis, Purdue University,
2003.
[4] Randall J. Fisher and Henry G. Dietz. Compiling for SIMD within
a register. In Proceedings of the 11th International Workshop on Languages
and Compilers for Parallel Computing, LCPC ’98, pages 290–304,
London, UK, 1999. Springer-Verlag.
[5] M. J. Flynn. Some computer organizations and their effectiveness.
IEEE Transactions on Computers, C-21(9):948–960, Sept 1972.
[6] Kimmo Fredriksson and Szymon Grabowski. Fast convolutions and
their applications in approximate string matching. In Jiří Fiala, Jan
Kratochvíl, and Mirka Miller, editors, Combinatorial Algorithms, pages
254–265. Springer-Verlag, Berlin, Heidelberg, 2009.
[7] Kimmo Fredriksson and Szymon Grabowski. Exploiting word-level
parallelism for fast convolutions and their applications in approximate
string matching. European Journal of Combinatorics, 34(1):38 – 51,
2013. Combinatorics and Stringology.
[8] Yao Fu, Ephrem Wu, Ashish Sirasao, Sedny Attia, Kamran Khan, and
Ralph Wittig. Deep learning with INT8 optimization on Xilinx devices.
White Paper WP486 (v1.0.1), Xilinx, April 2017.
[9] S. Grys. Signed multiplication technique by means of unsigned mul-
tiply instruction. Comput. Electr. Eng., 37(6):1212–1221, November
2011.
[10] Philipp Gysel. Ristretto: Hardware-oriented approximation of convolutional
neural networks. CoRR, abs/1605.06402, 2016.
[11] Song Han and William J. Dally. Bandwidth-efficient deep learning. In
Proceedings of the 55th Annual Design Automation Conference, DAC
2018, San Francisco, CA, USA, June 24-29, 2018, pages 147:1–147:6.
ACM, 2018.
[12] Song Han, Junlong Kang, Huizi Mao, Yiming Hu, Xin Li, Yubin Li,
Dongliang Xie, Hong Luo, Song Yao, Yu Wang, Huazhong Yang, and
William (Bill) J. Dally. ESE: efficient speech recognition engine with
sparse LSTM on FPGA. In Jonathan W. Greene and Jason Helge
Anderson, editors, Proceedings of the 2017 ACM/SIGDA International
Symposium on Field-Programmable Gate Arrays, FPGA 2017, Monterey,
CA, USA, February 22-24, 2017, pages 75–84. ACM, 2017.
[13] Song Han, Xingyu Liu, Huizi Mao, Jing Pu, Ardavan Pedram, Mark A.
Horowitz, and William J. Dally. EIE: efficient inference engine on compressed
deep neural network. In 43rd ACM/IEEE Annual International
Symposium on Computer Architecture, ISCA 2016, Seoul, South Korea,
June 18-22, 2016, pages 243–254. IEEE Computer Society, 2016.
[14] Dominik Marek Loroch, Norbert Wehn, Franz-Josef Pfreundt, and
Janis Keuper. Tensorquant - A simulation toolbox for deep neural
network quantization. CoRR, abs/1710.05758, 2017.
[15] Angshuman Parashar, Minsoo Rhu, Anurag Mukkara, Antonio
Puglielli, Rangharajan Venkatesan, Brucek Khailany, Joel S. Emer,
Stephen W. Keckler, and William J. Dally. SCNN: an accelerator for
compressed-sparse convolutional neural networks. In Proceedings of
the 44th Annual International Symposium on Computer Architecture,
ISCA 2017, Toronto, ON, Canada, June 24-28, 2017, pages 27–40. ACM,
2017.
[16] David A. Paerson and John L. Hennessy. Computer Architecture:
A antitative Approach. Morgan Kaufmann Publishers Inc., San
Francisco, CA, USA, 1990.
[17] Karen Simonyan and Andrew Zisserman. Very deep convolutional
networks for large-scale image recognition. CoRR, abs/1409.1556, 2014.
[18] Yaman Umuroglu, Nicholas J. Fraser, Giulio Gambardella, Michaela
Blott, Philip Leong, Magnus Jahre, and Kees Vissers. FINN: A framework
for fast, scalable binarized neural network inference. In Proceedings
of the 2017 ACM/SIGDA International Symposium on Field-Programmable
Gate Arrays, FPGA ’17, pages 65–74, New York, NY,
USA, 2017. ACM.
[19] Maurice V Wilkes, David J Wheeler, Stanley Gill, and FJ Corbató. The
preparation of programs for an electronic digital computer. Physics
Today, 11:28, 1958.
[20] Chenzhuo Zhu, Song Han, Huizi Mao, and William J. Dally. Trained
ternary quantization. CoRR, abs/1612.01064, 2016.
