Accelerating Binarized Neural Networks via Bit-Tensor-Cores in Turing
  GPUs by Li, Ang & Su, Simon
1Accelerating Binarized Neural Networks via
Bit-Tensor-Cores in Turing GPUs
Ang Li and Simon Su
Abstract—Despite foreseeing tremendous speedups over conventional deep neural networks, the performance advantage of binarized
neural networks (BNNs) has merely been showcased on general-purpose processors such as CPUs and GPUs. In fact, due to being
unable to leverage bit-level-parallelism with a word-based architecture, GPUs have been criticized for extremely low utilization (1%)
when executing BNNs. Consequently, the latest tensorcores in NVIDIA Turing GPUs start to experimentally support bit computation. In
this work, we look into this brand new bit computation capability and characterize its unique features. We show that the stride of
memory access can significantly affect performance delivery and a data-format co-design is highly desired to support the tensorcores
for achieving superior performance than existing software solutions without tensorcores. We realize the tensorcore-accelerated BNN
design, particularly the major functions for fully-connect and convolution layers — bit matrix multiplication and bit convolution.
Evaluations on two NVIDIA Turing GPUs show that, with ResNet-18, our BTC-BNN design can process ImageNet at a rate of 5.6K
images per second, 77% faster than state-of-the-art. Our BNN approach is released on https://github.com/pnnl/TCBNN.
F
1 INTRODUCTION
B INARIZED-neural-network (BNN) [1], [2], [3] is an al-ternative type of deep-neural-networks (DNNs). Com-
pared to general DNNs, such as multi-layer-perceptrons
(MLPs) and convolution-neural-networks (CNNs), the ma-
jor difference of BNN is that it uses a single bit to represent
each entry of the input and weight matrices. BNN evolved
from DNN through binarized-weight-network (BWN) [4]. It
was firstly observed that if the weight matrix can be bina-
rized to +1 and −1, the floating-point (FP) multiplications
can be degraded to addition (i.e., mul +1) and subtraction
(i.e., mul −1). Later, it was further observed that if the input
matrix can be binarized as well, then even the floating-
point additions and subtractions in BWN can be degraded
to logical operations (i.e., xnor for bit dot-product and popc
for bit accumulation) [1], [2], [3].
BNNs bring several advantages over full-precision
DNNs: (a) Reduced and simplified computation. Through bina-
rization, each segment of 32 FP fused-multiply-add (FMA)
operations can be aggregated into an xnor operation and a
popc operation, leading to theoretically 16× speedups; (b)
Reduced data movement and storage. Through binarization, the
whole memory hierarchy and network, including registers,
caches, scratchpad, DRAM, NoC, etc. can accommodate 32×
in both bandwidth and capacity; (c) Reduced cost which com-
prises energy reduction from simplified hardware design
and smaller chip area; (d) Resilience. It has been reported
that compared with differentiable DNNs, the discrete BNNs
exhibit superior stability and robustness against adversarial
attacks [5], [6].
On the flip side of the coin, binarization reduces the
model’s capacity and discretizes the parameter space, lead-
• A. Li is a computer scientist from the High-performance Computing group
of Pacific Northwest National Laboratory (PNNL), Richland, WA, 99354.
E-mail: ang.li@pnnl.gov, see http://www.angliphd.com
• S. Su is a computer scientist from the DoD Supercomputing Resource
Center of U.S. Army Research Laboratory (ARL), Aberdeen Proving
Ground, MD, 21005.
ing to certain accuracy loss. With the tremendous effort
from the machine learning community [2], [7], [8], [9],
accuracy of BNNs have been dramatically enhanced. The
top-1 training accuracy of BNN-based AlexNet and ResNet-
18 on ImageNet dataset has achieved 46.1% [8] and 56.4%
[10] (54.3% and 61% with boosting [11]), with respect to
56.6% and 69.3% for full-precision DNN [12]. A latest BNN
work even reported a top-1 accuracy of 70.7% [13].
Although BNN is not likely to substitute DNNs because
of reduced model capacity, for many HPC [14], [15], [16],
[17] and cloud applications [18], [19], when certain accuracy
levels can be achieved, alternative factors such as latency,
energy, hardware cost, resilience, etc. become more promi-
nent. This is especially the case for practical deployment.
Despite featuring various advantages, the expected per-
formance gain of BNN has rarely been demonstrated on
general purpose processors such as GPUs. This is mainly
because: (i) the fundamental design mismatch between bit-
based algorithms and word-based architecture; (ii) BNN
designs at this stage are mainly driven by the algorithm
community on how to improve training accuracy; little
system and architectural support have been provided on
high performance delivery. Due to (i), most existing BNN
implementations are realized as hardware accelerators (e.g.,
through FPGA [20], [21], [22], [23], [24], [25]) where the
operand bit-width can be flexibly adjusted. Due to (ii), BNN
developers are still relying on full-precision software frame-
works such as TensorFlow and PyTorch over CPUs and
GPUs to emulate the BNN execution. As a result, the lack
of architectural & system support hinders the performance
delivery and the general adoption of BNNs.
This situation has been lately changed for GPUs. On
the software side, a recent work [26] proposed the so-
called binarized software tensor core or BSTC, relying on
GPU’s low-level hardware intrinsics for efficient 2D bit-
block processing, such as bit matrix multiplication (BMM)
and bit convolution (BConv). On the hardware side, the latest
NVIDIA Turing GPUs started to support BMM experimen-
ar
X
iv
:2
00
6.
16
57
8v
1 
 [c
s.D
C]
  3
0 J
un
 20
20
2TABLE 1: Bit-Software-Tensor-Core (BSTC) [26] vs. Bit-Tensorcore (BTC).
uint32 refers to unsigned int. uint64 refers to unsigned long long int. INTU
refer to integer units. SFU refers to special function units.
BSTC BTC
Datatype Bit (uint32, uint64) Bit (uint32)
Functionality Bit Matrix Multiplication Bit Matrix Multiplication
Tile-A size 32×32 or 64×64 8×128
Tile-B size 32×32 or 64×64 128×8
Tile-C size 32×32 or 64×64 8×8
Hardware units INTUs and SFUs TensorCore Units (TCUs)
Processing level per warp per warp
GPU Platforms Kepler or later (≥CC-3.0) Turing GPUs (≥CC-7.5)
tally in their Tensor Core Units (TCUs) [27]. We label this new
bit-capability as Bit-Tensor-Core, or BTC. Table 1 compares
the major features of BSTC and BTC.
In this work, we focus on these bit-tensorcores in Turing
GPUs and investigate how they can be fully leveraged for
advanced performance delivery for BNNs. This paper thus
makes the following major contributions: (i) To the best
of our knowledge, this is the first work to investigate this
brand new bit-computation capability of GPU tensorcores1.
We designed orchestrated microbenchmarks to investigate
the low-level features of BTC. In particular, we observed that
the value of stride exhibits considerable performance impact
on memory fetch from the global memory via Warp-Matrix-
Multiplication-API (WMMA). (ii) Based on our observa-
tions, we proposed a new bit data format specially for bit-
computation on GPU tensorcores. We showed that without
this new bit format, BTC might not exhibit performance
advantage over existing software solutions; (iii) BTC cur-
rently only supports XOR-based bit-matrix-multiplication.
In terms of convolution, traditional approaches [28], [29]
that transform convolution to matrix-multiplication fail to
work properly for BNNs due to the challenge in padding
[30]. In this work, we propose a new approach that can
fully leverage the bit-computation capability of BTC while
effectively resolving this padding issue. We evaluated our
design on two NVIDIA Turing GPU platforms, the results
showed that our BTC-based BMM design could bring up
to 4.4× speedup over the vendor’s Cutlass [31] implemen-
tation. Regarding BNN end-to-end inference performance,
compared with state-of-the-art solution [26], our BTC-based
approach achieved on average 2.20× and 2.25× in latency,
and 1.99× and 1.62× in throughput for VGG-16 and ResNet-
18 on ImageNet dataset. Since bit-computation is increas-
ingly common in many HPC and data-analytics scenarios
[32], [33], [34], [35], [36], [37], [38], [39], our techniques can
be extended to other bit applications.
2 RELATED WORK
We focus on the performance issues of BNN implementation
and the GPU tensorcores in this section. Regarding the
algorithm design for BNNs, please refer to this survey [30].
BNN Implementation The major purpose of BNN im-
plementation is to leverage the system and architectural
features of the platforms to satisfy the stringent latency
and throughput constraints when deploying BNNs in HPC,
cloud and embedded applications, while reducing the area
1. To the best of our knowledge, this feature has not appeared in
vendor’s library like cuBLAS, cuDNN, TensorRT or other library up
to now except Cutlass in which it is supported as an experimental,
unverified function.
Warp Scheduler
TCU LSU
L0$ Constant $BRU
Register Files
Math Dispatch Unit
INTU FPU DPU SFU
Sub-Core
Fig. 1: A subcore of a Turing GPU SM. BRU is branch unit. $ refers to cache.
LSU is the load-store-unit. INTU is the integer-unit. FPU is the floating-point-
unit. DPU is the double-precision-unit. SFU is the special-function-unit. TCU
is the tensorcore-unit, which has its independent data path.
and energy cost [20], [21], [22], [23], [24], [25], [40], [41]. Most
of these implementations focus on FPGA [20], [21], [22], [23],
[24], [25] due to FPGA’s design flexibility at the bit level. Re-
garding general-purpose platforms, an existing CPU work
[41] relies on the AVX/SSE vector instructions to derive
good bit computation performance. It focuses on BMM and
transforms bit-convolution to BMM through im2col() with
costly pre- and post-processing. Another evaluation work
[42] compares CPU, GPU, FPGA and ASIC based BNN
designs, clarifying that the major performance restriction
of CPUs and GPUs is the extremely low utilization due to the
challenge in extracting fine-grained parallelism. Noticeably, the
reported GPU utilization is 1% only [42]. To improve GPU
utilization and extract bit-level-parallelism, a recent work
[26] proposed the binarized-soft-tensor-core (BSTC) on top
of GPU’s SMs and leverages low-level hardware intrinsics
for harvesting the bit-processing capability of GPUs. For
BSTC, the performance gains from better utilization of the
conventional integer/logic units (i.e., INTUs and SFUs, see
Figure 1). This work is different because we focus on the
brand new bit computation capability of the latest Turing
TCUs, and showcase how to harvest the most performance
from this new functional units.
GPU Tensorcore Driven by the demand of training large-
scale DNNs, designing specialized low-precision dense
matrix-matrix multiplication accelerators has become a pop-
ular trend. Particularly, Google presented Tensor-Processing-
Units (TPUs) [43]; Intel announced the Nervana Neural-
Network-Processors (NNPs) for tensor operations; NVIDIA
integrated the Tensorcore Units (TCUs) into their Volta
and Turing GPUs; Qualcomm included the Hexagon-Tensor-
Accelerator (HTA) into their Hexagon 855 system-on-chip.
This work focuses on the tensorcores of GPUs (see Fig-
ure 1). Since being firstly introduced in the Volta architecture
[44], the tensorcore becomes one of the spot-light for GPGPU
research. The relevant works can be summarized in two
categories: (a) Characterization. In [45] and [46], Jia et
al. dissected the Volta (Tesla V100) and the Turing (Tesla
T4) GPUs through microbenchmarking. They depicted the
detailed mapping mechanism from elements of a matrix tile
to registers of a warp-lane in the HMMA instructions for
FP16 matrix multiplication. They found that the 32 threads
of a warp are essentially divided into 8 thread groups, where
the 4 threads per group cooperatively work on the same
regions of matrix C by fetching elements from different
parts of matrix A and matrix B. Markidis et al. [47] stud-
ied the programmability, performance and precision of the
3Volta tensorcores and proposed a technique to compensate
the accuracy loss due to precision degradation from FP32
to FP16. Raihan et al. [48] investigated the design details
of the tensorcores in Volta and Turing GPUs and built
an architecture model for the tensorcores in GPGPU-Sim.
They characterized the WMMA APIs and clarified how the
operand sub-matrix elements were mapped for FP16 GEMM
in Volta tensorcores, and FP16/Int8/Int4 GEMM in Turing
tensorcores. However, they did not investigate the 1-bit
computation mode. Hickmann and Bradford [49] proposed
a testing method for assessing the compliance of IEEE stan-
dard, hardware microarchitecture, and internal precision of
the Volta tensorcores. (b) Application. Haidar et al. [50]
proposed a mixed-precision iterative refinement method to
approach FP64 precision using FP16-based GPU tensorcores
in LU factorization, acting as the first effort to apply GPU
tensorcores for non-machine-learning applications. Sorna et
al. [51] applied FP16 tensorcores for FFT acceleration. Blan-
chard et al. [52] thoroughly analyzed the rounding error of
matrix multiplication and LU factorization when using the
tensorcores. Dakkak et al. [53] showed that the tensorcores,
which were originally designed for 2D FP16 GEMM, can be
adopted for 1D array reduction and scan.
Most of these works, however, focused on FP16 mixed-
precision matrix-multiply in Volta tensorcores. They either
evaluated their performance, programmability, accuracy,
hardware design, or looked into alternative applications
other than GEMM, aiming to preserve higher precision.
None of them have investigated the latest bit computation
capability of the GPU tensorcores. In addition, no existing
works have ever reported the potential performance im-
pact from the stride of segmented memory load, and how
to circumvent the challenges in accelerating convolutions
through the tensorcores. Furthermore, until writing the pa-
per, we have not seen any works leverage GPU tensorcores
for the acceleration of BNNs.
3 GPU BIT TENSORCORES
3.1 GPU Tensorcores
Since the Volta architecture (CC-7.0), NVIDIA GPUs have in-
troduced a novel type of function units known as Tensorcores
into the streaming multiprocessors (SMs) for accelerating
low-precision general matrix multiplication (GEMM). In
Volta, each tensorcore processes 64 FP16 FMA operations
per cycle [54]. The only supported datatype for Volta ten-
sorcores is FP16. For Turing (CC-7.5), more datatypes are
supported, including FP16, signed/unsigned int-8, int-4,
and recently a bit as well. Please refer to [45], [48], [54]
for more details about the hardware features of Volta and
Turing GPU tensorcores.
3.2 CUDA WMMA
Since CUDA Runtime-9.0, the Warp Matrix Multiplication
API (WMMA) has been introduced for operating the ten-
sorcores in Volta and Turing GPUs. The idea is to par-
tition the three input and one output matrices into tiles,
where each warp processes the multiplication of one tile
(TD = TA × TB + TC ). WMMA provides the necessary
primitives to operate on the bit-tiles (e.g., loading input tiles,
1 namespace experimental {
2 namespace precision { struct b1; } // define 1−bit datatype
3 enum bmmaBitOp { bmmaBitOpXOR = 1 }; //only XOR is supported as the mul−op for BTC
4 enum bmmaAccumulateOp { bmmaAccumulateOpPOPC = 1 }; } //popc acts as the acc−op for BTC
Listing 1: WMMA 1-bit definition and operations in crt/mma.h
1 wmma.load.a.sync.aligned.row.m8n8k128.global.b1 {%r21}, [%rd8], %r20; //load matrix sync()
2 wmma.load.b.sync.aligned.col.m8n8k128.global.b1 {%r22}, [%rd5], %r20; //load matrix sync()
3 wmma.load.c.sync.aligned.row.m8n8k128.global.s32 [%rd30],{%r55,%r64},%r60;//load matrix sync()
4 wmma.mma.xor.popc.sync.aligned.row.col.m8n8k128.s32.b1.b1.s32 {%r23,%r24}, {%r21}, {%r22},
{%r19,%r19}; //bmma sync()
5 wmma.store.d.sync.aligned.row.m8n8k128.global.s32 [%rd34], {%r63,%r64},%r60;//store matrix sync()
Listing 2: PTX code for Bit-Matrix-Multiply-API (BMMA)
tiled multiplication, storing output tile): load matrix sync,
mma sync, store matrix sync. These primitives are executed
by the 32 threads of a warp cooperatively. For FP16, the tile
is further partitioned into 32 fragments while each thread
fetches a fragment of data into its register files. Although
the vendor’s official documents have not revealed the exact
mapping schemes, existing works have figured them out
through microbenchmarking [45], [48].
3.3 Cutlass Library
Currently the vendor’s high-performance linear-algebra li-
brary cuBLAS has not supported BMM on Turing tensor-
cores. However, their open-source GEMM library – Cutlass
[31] has integrated it as an experimental and non-verified
feature. BMM is realized using the WMMA API. The input
matrix A is in row-major bit format (compacted as 32-
bit unsigned int), B is in column-major bit format (also
compacted as 32-bit unsigned int). The accumulated input
matrix C and the result matrix D are in row-major 32-bit
signed int format. C and D are usually the same matrix.
BMM in Cutlass conducts 0/1 dot-product while BNN
demands +1/-1 dot-product, we discuss this later.
4 BTC CHARACTERIZATION
To operate on the bit datatype for Turing GPUs, CUDA
WMMA defines the 1-bit precision and the bit operations in
an independent ”experimental” namespace, as listed in List-
ing 1. XOR and POPC for bits (+1/-1) correspond to multiply
and accumulate for floating-point/integer datatypes.
Five APIs are provided for loading the bit-tile A, the
bit-tile B, the int-tile C, and storing the int-tile D, as well
as the multiplication: D = C + A × B. For the bit-matrix-
multiply-API (BMMA), only a single computation paradigm
is defined: the bit-tile A is in row-major of size (8, 128); the
bit-tile B is in column-major of size (128, 8); the int-tile C and
D are square matrices in row-/column-major of size (8, 8).
The bit-tile A and B are compacted as 32 unsigned ints, each
with 32 bits. Therefore, the bit-tile A and B each occupies
128 bytes. The int-tile C and D each occupies 8×8×4=256
bytes. Listing 2 shows the Parallel Thread Execution (PTX)
– the low-level GPU virtual machine ISA code for the five
BMMA APIs. The shape qualifier ”m8n8k128” in Line 2-6
indicate that the bit-tile-multiplication processed per warp is
in size (8,128)×(128,8)=(8,8). ”sync” means the instruction
wait for all warp lanes to synchronize before proceeding.
The ”layout” qualifier specifies if the tile is stored with
a row-major or column-major order in memory. ”type”
indicates the precision of the tile. Using int32 for tile-C and
D is to avoid potential overflow during the accumulation.
4For matrix-multiplication, tile-C and tile-D are usually in the
same size. Thus, the five APIs can be categorized into three
groups: load, store, and computation. We investigate each of
them in the following subsection to figure out potential
design guidelines. Regarding the hardware platform, see
Section VII and Table 2.
4.1 BMMA Load
We first concentrate on bmma load, as memory load is the
most crucial factor for GEMM on GPU [55]. The load API is
1 void load matrix sync (fragment<...> &tileA, const T* mptr, unsigned ldm, layout t layout);
It waits for all the threads of a warp to arrive, and then loads
a bit tile (i.e, a matrix fragment) from the device memory.
It has three parameters. ”mptr” is a 256-bit aligned pointer
pointing to the first element of the matrix in memory. The
memory here can be global or shared memory. ”layout”
can be row- or column-major, but for BMMA, there is
only a single choice — mem row major for matrix-A and
mem col major for matrix-B. ”ldm” is the stride in element
between consecutive rows (in row major) or columns (in col-
umn major) and must be a multiple of 16 bytes), according to
[56]. We find that for shared memory, this is the case; but for
global memory, a multiple of 32 is also feasible (despite with
unpredicted results). To see the impact of mptr (i.e., memory
type) and ldm on the performance of the load primitive, we
measure its average per-thread latency using the clock()
instruction. We add a memory fence operation before the
measurement to ensure that the data fetching has finished.
Figure 2, 3, 4, 5 show the average latency with re-
spect to different values of ldm for load matrix sync() on
global and shared memory of RTX-2080 and RTX-2080Ti
GPUs. As ldm is the stride between consecutive rows of
the matrix, it should be application dependent (e.g., for a
1024 × 1024 × 1024 BMM, ldm should be 1024) and the
raw latency should be irrelevant to ldm. However, coun-
terintuitively it has a strong impact on the performance of
fetching a bit-tile from the global memory. As can be seen
in Figure 2 and Figure 4, ldm=128 and ldm=384 exhibit the
shortest latency. Regarding shared memory, (1) accessing
shared memory exhibits more than 5× less latency than
accessing global memory; (2) the latency for RTX2080Ti is
less than RTX2080, and is unchanged with ldm.
We then consider ”tileA”, and see how the bit-tile (128×8
bits) is distributed among the lanes of a warp. Similar to [48],
we let each lane print out the value of data it fetches. Based
on the value, we can identify the mapping mechanism. We
find that, similar to FP16 and Int8, lanes in BMMA also
establish 8 thread groups — 4 consecutive lanes per group.
Each thread group corresponds to a 128 bit row. Within a 128
bit row, each lane accounts for a 32-bit portion (4 bytes). This
partially explains why ldm=128 delivers the shortest latency:
the 32 lanes of the warp constitute a coalesced memory
access, where the 32 4-byte access are merged as a single
memory request. Regarding why ldm=384 also exhibits good
performance, we suspect this might be because the Turing
L1 data cache is essentially partitioned into two sectors with
independent ports, similar to the L1/Tex cache in Maxwell
and Pascal GPUs. It conserves the data in an interleaving
way at a step of 32B. Consequently, ldm=256 (32B) may trig-
ger a sector-port conflict for simultaneous memory fetches
from the same warp but ldm=384 may not. This is confirmed
by the observation that ldm=128+256X (e.g., 384, 640, 896) all
demonstrate relatively low latency in Figure 2 and 5.
4.2 BMMA Store
The store operation is different from load in that every
element is a 32-bit signed integer. The store API is:
1 void store matrix sync (T* mptr, const fragment<...> &tileC, unsigned ldm, layout t layout);
Again, it waits until all warp lanes arrived before storing
tileC into memory. ”mptr” must be a 256-bit aligned pointer
referring to the first element. ”ldm” describes the stride
in elements between consequent rows in C, and must be
a multiple of 16 bytes (with integer, it corresponds to 4
elements). ”layout” can be row-major or column-major.
We measure the average latency with respect to the stride
ldm on global and shared memory of RTX-2080 and RTX-
2080Ti GPUs, as shown in Figure 6, 7, 8, 9, respectively.
Unlike load, the latency histograms for store do not exhibit
obvious patterns. We also attempt to figure out how the
resulting int-tile tileC is distributed among the lanes. Our
findings show that: (i) If it is row-major, then within the
8 × 8 int tile, each two consecutive elements (from a row)
are stored in two adjacent registers of a lane. For example,
suppose the 8 × 8 elements are E0 to E63 and each lane
uses R4 and R5 to store the integer tile, then (E0, E1) are
stored in R4 and R5 of lane-0, (E2, E3) are stored in R4
and R5 of lane-1, and so on. (ii) If it is column-major, then
each two consecutive elements (from a column) are stored
in two adjacent registers of a lane (i.e., transposed from the
row-major layout). When storing, the two adjacent registers
are encoded as one STG.E.64 memory store for the entire
warp, as if storing an FP64 data.
4.3 BMMA Computation
Finally, we discuss the bit-matrix-multiply API:
1 void bmma sync(fragment<...> &tileD, const fragment<...> &tileA,
2 const fragment<...> &tileB, const fragment<...> &tileC,
3 experimental::bmmaBitOp=experimental::bmmaBitOpXOR,
4 experimental::bmmaAccumulateOp=experimental::bmmaAccumulateOpPOPC);
It waits until all lanes are available for conducting the
BMMA operation: tileD = POPC(tileA XOR tileB) + tileC.
Unlike the condition for FP16 and Int8 where a group of
SASS assembly operations are generated [45], [46], [48],
bmma sync is only translated into a single SASS code:
1 BMMA.88128.XOR.POPC R2, R8.ROW, R9.COL, R2 ;
Our idea here is to measure its raw latency and estimate
how much parallelism, including warp-level-parallelism
(WLP) and instruction-level-parallelism (ILP), are required
to saturate the tensorcore pipeline and hide the latency.
Figure 10, 11, 12, 13 illustrate the total latency of increas-
ing the number of repeated bmma sync operations for the
same tileC/tileD, and different tileC/tileD on the two GPUs.
The raw latency of bmma sync is ∼201 cycles on RTX2080
and ∼190 cycles on RTX2080Ti. As shown in the figure,
the incremental latency with each one more bmma sync
operation is 10 cycles when tileC & tileD are identical for all
operations, and is 4 cycles when tileC & tileD are different on
both platforms. This implies that the pipeline stage delay is
around 4 cycles. When using the same accumulator, 6 extra
53
2
6
4
9
6
1
2
8
1
6
0
1
9
2
2
2
4
2
5
6
2
8
8
3
2
0
3
5
2
3
8
4
4
1
6
4
4
8
4
8
0
5
1
2
5
4
4
5
7
6
6
0
8
6
4
0
6
7
2
7
0
4
7
3
6
7
6
8
8
0
0
8
3
2
8
6
4
8
9
6
9
2
8
9
6
0
9
9
2
1
0
2
4
ldm (stride)
0
20
40
60
80
100
120
140
160
180
200
L
a
te
n
c
y
 (
c
y
c
le
s
) Min Min
Fig. 2: RTX2080 global mem load
1
2
8
2
5
6
3
8
4
5
1
2
6
4
0
7
6
8
8
9
6
1
0
2
4
ldm (stride)
0
10
20
30
40
50
60
L
a
te
n
c
y
 (
c
y
c
le
s
)
Fig. 3: RTX2080 shared mem load
3
2
6
4
9
6
1
2
8
1
6
0
1
9
2
2
2
4
2
5
6
2
8
8
3
2
0
3
5
2
3
8
4
4
1
6
4
4
8
4
8
0
5
1
2
5
4
4
5
7
6
6
0
8
6
4
0
6
7
2
7
0
4
7
3
6
7
6
8
8
0
0
8
3
2
8
6
4
8
9
6
9
2
8
9
6
0
9
9
2
1
0
2
4
ldm (stride)
0
20
40
60
80
100
120
140
160
180
200
L
a
te
n
c
y
 (
c
y
c
le
s
) Min Min
Fig. 4: RTX2080Ti global mem load
1
2
8
2
5
6
3
8
4
5
1
2
6
4
0
7
6
8
8
9
6
1
0
2
4
ldm (stride)
0
10
20
30
40
L
a
te
n
c
y
 (
c
y
c
le
s
)
Fig. 5: RTX2080Ti shared mem load
3
2
6
4
9
6
1
2
8
1
6
0
1
9
2
2
2
4
2
5
6
2
8
8
3
2
0
3
5
2
3
8
4
4
1
6
4
4
8
4
8
0
5
1
2
5
4
4
5
7
6
6
0
8
6
4
0
6
7
2
7
0
4
7
3
6
7
6
8
8
0
0
8
3
2
8
6
4
8
9
6
9
2
8
9
6
0
9
9
2
1
0
2
4
ldm (stride)
0
20
40
60
80
100
120
140
160
180
200
L
a
te
n
c
y
 (
c
y
c
le
s
)
Fig. 6: RTX2080 global mem store
3
2
6
4
9
6
1
2
8
1
6
0
1
9
2
2
2
4
2
5
6
2
8
8
3
2
0
3
5
2
3
8
4
4
1
6
4
4
8
4
8
0
5
1
2
5
4
4
5
7
6
6
0
8
6
4
0
6
7
2
7
0
4
7
3
6
7
6
8
8
0
0
8
3
2
8
6
4
8
9
6
9
2
8
9
6
0
9
9
2
1
0
2
4
ldm (stride)
0
10
20
30
40
50
60
70
L
a
te
n
c
y
 (
c
y
c
le
s
)
Fig. 7: RTX2080 shared mem store
3
2
6
4
9
6
1
2
8
1
6
0
1
9
2
2
2
4
2
5
6
2
8
8
3
2
0
3
5
2
3
8
4
4
1
6
4
4
8
4
8
0
5
1
2
5
4
4
5
7
6
6
0
8
6
4
0
6
7
2
7
0
4
7
3
6
7
6
8
8
0
0
8
3
2
8
6
4
8
9
6
9
2
8
9
6
0
9
9
2
1
0
2
4
ldm (stride)
0
20
40
60
80
100
120
140
160
180
200
L
a
te
n
c
y
 (
c
y
c
le
s
)
Fig. 8: RTX2080Ti global mem store
3
2
6
4
9
6
1
2
8
1
6
0
1
9
2
2
2
4
2
5
6
2
8
8
3
2
0
3
5
2
3
8
4
4
1
6
4
4
8
4
8
0
5
1
2
5
4
4
5
7
6
6
0
8
6
4
0
6
7
2
7
0
4
7
3
6
7
6
8
8
0
0
8
3
2
8
6
4
8
9
6
9
2
8
9
6
0
9
9
2
1
0
2
4
ldm (stride)
0
10
20
30
40
L
a
te
n
c
y
 (
c
y
c
le
s
)
Fig. 9: RTX2080Ti shared mem store
1 2 3 4 5 6 7 8 9
1
0
1
1
1
2
1
3
1
4
1
5
1
6
1
7
1
8
1
9
2
0
2
1
2
2
2
3
2
4
2
5
2
6
2
7
2
8
2
9
3
0
3
1
3
2
bmma_sync ops
0
50
100
150
200
250
300
350
400
450
500
550
L
a
te
n
c
y
 (
c
y
c
le
s
)
Fig. 10: RTX2080 bmma w/ same C
1 2 3 4 5 6 7 8 9
1
0
1
1
1
2
1
3
1
4
1
5
1
6
1
7
1
8
1
9
2
0
2
1
2
2
2
3
2
4
2
5
2
6
2
7
2
8
2
9
3
0
3
1
3
2
bmma_sync ops
0
50
100
150
200
250
300
350
400
450
500
550
L
a
te
n
c
y
 (
c
y
c
le
s
)
Fig. 11: RTX2080 bmma w/ diff C
1 2 3 4 5 6 7 8 9
1
0
1
1
1
2
1
3
1
4
1
5
1
6
1
7
1
8
1
9
2
0
2
1
2
2
2
3
2
4
2
5
2
6
2
7
2
8
2
9
3
0
3
1
3
2
bmma_sync ops
0
50
100
150
200
250
300
350
400
450
500
550
L
a
te
n
c
y
 (
c
y
c
le
s
)
Fig. 12: RTX2080Ti bmma w/ same C
1 2 3 4 5 6 7 8 9
1
0
1
1
1
2
1
3
1
4
1
5
1
6
1
7
1
8
1
9
2
0
2
1
2
2
2
3
2
4
2
5
2
6
2
7
2
8
2
9
3
0
3
1
3
2
bmma_sync ops
0
50
100
150
200
250
300
350
400
450
500
550
L
a
te
n
c
y
 (
c
y
c
le
s
)
Fig. 13: RTX2080Ti bmma w/ diff C
cycles are needed. Given the raw latency of ∼200 cycles,
and the fact that Turing GPU SM comprises four sub-cores
(each subcore can issue one instruction per cycle), with at
maximum 32 warps per SM for Turing (so WLP=32), we
roughly require ILP=200/4×4/32≈7 to saturate the entire
tensorcore pipeline. In other words, with 8 independent
bmma sync operations, we should approach the theoretical
computation bandwidth of the tensorcores.
5 BMM AND BCONV WITH BTC
We present our designs for BTC-based BMM and BConv,
which are the core functions for the fully-connected layer
and convolution layer of BNNs.
5.1 FSB Data Format for BTC
In Section IV-A, we have observed that the value of ldm
can strongly affect the performance of load_matrix_sync
from global memory, where ldm=128 and 384 exhibit the best
performance. Our idea thus is whether we can essentially
fix the value of ldm firmly to 128 or 384. As a result, rather
than storing the bits completely sequential and using the
matrix width for ldm, as practiced by the Cutlass library
and suggested by CUDA programming guide, we propose
a new 2D bit data format where bits are stored in a unit of
128×8 bit-tile. An analogous example is shown in Figure 14.
From the 1D general format to the 2D new format, an array
of 8×4 bits (H=4, W=8) is converted with a tile size of
4×2 (BH=2, BW=4). For BTC, since 384 is not a power of
2, dividing 384 may incur troublesome reminder handling,
we thus use 128 as BW and 8 as BH for the new format.
If the original bits are organized in row-major, both the in-
tile and tile-wise order of the new format are in row-major
(as the case in Figure 14); otherwise, both are organized in
column-major. Since the new format only changes the way
how bits are stored and fetched, no extra space is needed.
However, if the width of the original matrix (i.e., W) can
0 1 2 3 4 5 6 7
8 9 11 12 13 14 15
16 17 18 19 20 21 22 23
24 25 26 27 28 29 30 31
10
w
H
(A) general format
0 1 2 3 8 9 10 11
4 5 7 12 13 14 15
16 17 18 19 24 25 26 27
20 21 22 23 28 29 30 31
6
BW
BH
(B) proposed format
Fig. 14: Fixed-Stride-Bit (FSB) format based on a tile of BH×BW.
not be divided by 128 (i.e., BW), for the convenience of
index calculation, we pad the row to be a factor of 128,
which may occupy some extra space. Note, in order to load
via load matrix sync(), such a kind of padding is required
anyway. Similar requirement has been imposed by pitch
in cudaMemcpy2D(). The temporal overhead only occurs at
array index calculation, which is almost negligible.
5.2 BMM for FC Layer
BMM in BNN is different from GEMM because: (a) Input.
The elements of matrix-A and B are binary values: +1 and
-1. A normal floating-point or int number is binarized via:
xb = sign(x) =
{
1 if x ≥ 0
−1 otherwise (1)
In an FC layer, both A and B have to be binarized ahead
of BMM. However, the binarization of B (i.e., weights) can
be performed offline after the training; only the binarization
of A is in the critical path of inference. Existing work has
shown that such a binarization can be achieved efficiently
through the __ballot() function of GPUs [26]; (b) Com-
putation. The dot-product of GEMM is yi,j =
∑n−1
k=0 ai,kbk,j
where n is the vector length. In terms of BMM, as a and b
become bit-vectors, if using bit-1 to denote +1, and bit-0 to
denote -1, it can be shown that the±1 dot-product becomes:
v = ~a·~b = n−2×popc(~a xor~b) = 2×popc(~a xnor~b)−n (2)
where n is the bit-vector length. xor and xnor are logical
exclusive-or and exclusive-nor. The xnor expression has
widely been used for BNN algorithm research [1], [3] and
61 global void BMM(unsigned *A, unsigned *B, int *C, int A height, int A width, int B width)
2{
3 using namespace nvcuda::wmma::experimental;
4 int bx = blockIdx.x*blockDim.y+threadIdx.y; int by = blockIdx.y;
5 wmma::fragment<wmma::matrix a,8,8,128, precision::b1,wmma::row major>a frag;//tile A
6 wmma::fragment<wmma::matrix b,8,8,128, precision::b1,wmma::col major>b frag;//tile B
7 wmma::fragment<wmma:accumulator,8,8,128, int> c frag; wmma::fill fragment(c frag,0);//tile C
8 for ( int i=0; i<(A width/128);i++) {
9 load matrix sync(a frag, A+bx*8*A width/32+i*128/32, A width);//fetch tile A
10 load matrix sync(b frag, B+by*8*A width/32+i*128/32, A width);//fetch tile B
11 bmma sync(c frag, a frag, b frag, c frag); }//BMM
12 for ( int i=0; i<c frag.num elements;i++) c frag.x[i]=A width−2*c frag.x[i];//fix for +1/−1 logic
13 store matrix sync(C+(bx*8*B width+by*8), c frag, B width, wmma::mem row major); //store tile C
14}
15 BMM<<<dim3(A height/16, B width/8), dim3(32, 2)>>>(...); //invoke BMM kernel
Listing 3: BMM baseline implementation
1 global void BMM(unsigned *A, unsigned *B, int *C, int A height, int A width, int B width)
2{
3 using namespace nvcuda::wmma::experimental;
4 shared uint4 As[32], Bs[32];//buffering (8*128)*8 bit block in shared memory
5 const int laneid=threadIdx.x; const int wx=threadIdx.y; const int wy=threadIdx.z; //tile index
6 const int bx=blockIdx.x; const int by=blockIdx.y; //block index
7 wmma::fragment<wmma::matrix a,8,8,128, precision::b1,wmma::row major>a frag;//tile A
8 wmma::fragment<wmma::matrix b,8,8,128, precision::b1,wmma::col major>b frag;//tile B
9 wmma::fragment<wmma:accumulator,8,8,128, int> c frag; wmma::fill fragment(c frag,0);//tile C
10 for( int k=0;k<A width;k++){
11 if (wx==0&&wy==0){//one warp fetches data into shared memory for 16 warps of a thread block
12 As[laneid]=((uint4*)A)[(bx*32+laneid)*A width+k]; Bs[laneid]=((uint4*)B)[(by*32+laneid)*A width+k];}
13 syncthreads();//for respecting RAW dependency
14 load matrix sync(a frag, &As[wx*8], 128); load matrix sync(b frag, &Bs[wy*8], 128);
15 bmma sync(c frag, a frag, b frag, c frag);
16 syncthreads();}//for respecting WAR dependency
17 for( int i=0; i<c frag.num elements; i++) c frag.x[i] = (A width*128)−(2*c frag.x[i]);//+1/−1 BMM
18 store matrix sync(&C[(bx*4+wx)*8*B width+(by*4+wy)*8], c frag, B width, wmma::mem row major);
19}
20 BMM<<<dim3(A height/32,B width/32),dim3(32,4,4)>>>(...);
Listing 4: Bit-Matrix-Multiplication
FPGA/ASIC implementation [20], [21] while GPU tensor-
cores currently only support xor for BMM computation.
popc stands for population count, which counts the number
of bit-1s in the bit vector. (c) Output. The elements of the
output matrix-C are full-precision integer values. However,
in an FC layer, it can be binarized after a threshold operation
(discussed later), reducing memory access. Therefore, the
third difference with GEMM is that the output-C can be
binarized before the store.
Design-1 Now we present our three BMM designs based
on WMMA, which is the only API for operating the tensor-
cores. The baseline design is shown in Listing 3. Each thread
block comprises two warps and each warp processes BMM
for a 128×8 bit-tile in Line 10. Having two warps per thread
block is for achieving the full occupancy of Turing SMs.
Design-2 As memory load is the most important factor
for matrix multiplication on GPU [55], Design-2 aims at
improving the efficiency of memory load. On one hand,
using a whole warp to fetch only 128 bits is too lightweight.
On the other hand, if coalescing memory access is enforced
with each lane fetches 32bits, the total bit length becomes
32×4×8=1024 bits, which is probably too coarse-grained
for a BNN FC layer given the matrix size is usually less
than 2048. Therefore, motivated by [45], we increase the
load granularity per warp-lane to its max value of 128
bits, leveraging the effective LDG.E.128 SASS instruction.
With each lane fetching 128 bits, a warp of 32 lanes would
fetch a bit segment of 4096 bits, which is sufficient for 4
warps to perform WMMA simultaneously. As a result, we
use a representative warp to fetch 4096 bits of A and 4096
bits of B from global memory to shared memory, which
are then dispatched to 16 warps for WMMA execution, as
listed in Listing 4. Essentially, each thread block processes
a BMM of (128,32)×(32,128) while each warp processes
(128,8)×(8,128). Line 11-12 show how to invoke 128-bit
global memory load through vectorization [45]. Note that
1 #define FLIPBITS(n,b) ((n)ˆ((1u<<(b))−1)) //flip the last b bits of n
2 global void BMMb(unsigned *A, unsigned *B, unsigned *C, int A height, int A width, int B width)
3{
4 using namespace nvcuda::wmma::experimental;
5 int bx = blockIdx.x*blockDim.y+threadIdx.y; int by = blockIdx.y;
6 shared int Cs[64];
7 unsigned laneid; asm(”mov.u32 %0, %%laneid;”:”=r”(laneid)); //get laneid
8 wmma::fragment<wmma::matrix a,8,8,128, precision::b1,wmma::row major>a frag;//tile A
9 wmma::fragment<wmma::matrix b,8,8,128, precision::b1,wmma::col major>b frag;//tile B
10 wmma::fragment<wmma:accumulator,8,8,128, int> c frag; wmma::fill fragment(c frag,0);//tile C
11 for ( int i=0; i<(A width/128);i++) {
12 load matrix sync(a frag, A+bx*8*A width/32+i*128/32, A width);//fetch tile A
13 load matrix sync(b frag, B+by*8*A width/32+i*128/32, A width);//fetch tile B
14 bmma sync(c frag, a frag, b frag, c frag); }//BMM
15 for ( int i=0; i<c frag.num elements;i++) c frag.x[i]=A width−2*c frag.x[i];//fix for +1/−1 logic
16 store matrix sync(Cs, c frag, B width, wmma::mem row major); } //store tile C
17 union{unsigned data; uchar elements[4];} p0,p1;
18 p0.data= ballot(A width−2*Cs[laneid]>=0);
19 p1.data= ballot(A width−2*Cs[laneid+32]>=0);
20 if (laneid<4) { uchar* Cb = (uchar*)&C[0];
21 Cb[(bx*8+laneid)*(B width/8)+FLIPBITS(by,2)]=p0.elements[3−laneid];
22 Cb[(bx*8+4+laneid)*(B width/8)+FLIPBITS(by,2)]=p1.elements[3−laneid]; }
23}
24 BMMb<<<dim3(A height/16, B width/8), dim3(32, 2)>>>(...); //invoke BMMb kernel
Listing 5: BMM in new format with binarized output
1 global void BTC Conv2d(const unsigned* restrict input,
2 const unsigned* restrict filter, int* output, const int in channels, const int out channels,
3 const int in width, const int in height, const int filter width, const int filter height ,
4 const int batch, const int stride vertical, const int stride horizontal, const int out width,
5 const int out height, const int pad h, const int pad w)
6{
7 using namespace nvcuda::wmma::experimental;
8 wmma::fragment<wmma::matrix a, 8, 8, 128, precision::b1, wmma::row major> a frag;
9 wmma::fragment<wmma::matrix b, 8, 8, 128, precision::b1, wmma::col major> b frag;
10 wmma::fragment<wmma::accumulator, 8, 8, 128, int> c frag;
11 unsigned laneid; asm(”mov.u32 %0, %%laneid;”:”=r”(laneid)); //get laneid
12 const int bx = blockIdx.x; //over Q: out width
13 const int by = blockIdx.y; //over P: out height
14 const int bz = blockIdx.z; // over O: out channel/8 * batch/8
15 const int ins = (in channels>>7);//number of steps in C: in channels
16 const int bn = bz/(out channels>>3); //batch
17 const int bo = bz%(out channels>>3); //out channel
18 // coord (ax,ay) in Input from bx,by in Output
19 const int ax0 = bx*stride horizontal−pad w; const int ay0 = by*stride vertical−pad h;
20 // track the number of filter entries that are masked off
21 int exclude = 0; wmma::fill fragment(c frag, 0);
22 // load a window of data from Input
23 for ( int r=0; r<filter height; r++) {
24 for ( int s=0; s<filter width; s++) {
25 const int ay = ay0 + r; // y−coord in Input
26 const int ax = ax0 + s; // x−coord in Input
27 if ((ay>=0) && (ay<in height) && (ax>=0) && (ax<in width)) { //within Input frame
28 for ( int c=0; c<ins; c++) { //per 128−bit in C:in channels
29 // input:[H,W,N,C], filter :[K,K,O,C]
30 load matrix sync(a frag,&input[(ay*in width+ax)*batch*ins*4+bn*ins*32+c*4],in channels);
31 load matrix sync(b frag,&filter[(r*filter width+s)*out channels*ins*4+bo*ins*32+c*4],in channels);
32 bmma sync(c frag, a frag, b frag, c frag);}}
33 else exclude++; //accumulate for points not in frame
34 } }
35 for ( int i=0; i<c frag.num elements; i++)
36 c frag.x[i] = in channels*filter height*filter width− exclude*in channels− (2*c frag.x[i]);
37 // output: [P,Q,N,O] => [by,bx,bn,bo]
38 store matrix sync(&output[(by*out width+bx)*batch*out channels + bn*8*out channels + bo*8],
39 c frag, out channels, wmma::mem row major);
40}
Listing 6: BTC-based BConv design
load_matrix_sync is 5× faster on shared memory than
global memory (Section IV-A).
Design-3 We adopt our new FSB format for BMM. Listing 5
shows the design with the output matrix binarized. As can
be seen, after BMM, each warp holds a tile of C in size of
8×8. We use __ballot() to binarize the 64 elements using
the entire 32 lanes in Line 17-18. In order to write an 8-bit
uchar to a 32-bit unsigned, we define a union for packing
and unpacking. As NVIDIA GPUs adopt little endian, we
define FLIPBIT() for efficient byte index translation.
5.3 BConv for Convolution Layer
The convolution operation here is to cross-correlate a 4D
input tensor (batch, input height, input width, input channels)
with a 4D weight tensor (weight height, weight width, in-
put channels, output channels). We use H to denote in-
put height, W to denote input width, N to denote batch, C
to denote input channel, K to denote weight size and O to
denote output channel. TensorFlow thus uses NHWC for input
and KKCO for filter. PyTorch uses NCHW for input and OCKK
for filter. Traditionally, a 2D convolution can be transformed
7into GEMM through the im2col() process [28], [29], which
can then be accelerated by the tensorcores. However, for
BConv, directly converting to BMM is not feasible due to the
challenge in padding [30]. Different from normal convolu-
tion where the padded zeros shall not affect the correctness,
in BConv the element zero actually denotes -1. Therefore,
after the im2col() process, we are unable to distinguish
the padded 0s from the meaningful zeros representing -1,
leading to inaccurate results.
Thus, the objective here is how to design BConv so that
the padding issue can be well-managed but can still be
accelerated by the bit tensorcores of Turing GPUs. On one
hand, motivated by existing work [26], if the entire filter
window is processed sequentially by a single GPU thread,
a status variable can be allocated to track how many entries
of the filter window fall out of the frame of the input image,
which can be used later to make an amendment accordingly
for ensuring the correctness of bit-padding. On the other
hand, if we ignore the image size and filter size for now
but looking at a particular point [i,j] of the input image, the
batch of N images at that point cross-correlating with an
entry of the filter window [r,s] is essentially to calculate the
following output point [p,q]:
Output[p,q] =
C∑
k=1
input(N, k)[i,j] × filter(k,O)[r,s] (3)
This is just equivalent to multiplying a bit matrix in size
(N,C) with another matrix in size (C,O), which can be
performed by the bit-tensorcores. To summarize, our idea
is to change the input tensor to HWNC, the filter tensor to
KKCO, and perform BMM along the last two dimensions of
these two tensors.
Our first design is shown in Listing 6. We use each warp
to traverse the input channel space at Line 28 and perform
the computation for 8 input images over 8 output channels
(i.e., (8, C)× (C, 8)) using the bit-tensorcores in Line 30-32.
We use ”exclude” to track the number of entries outside the
filter frame at Line 33 and amendment the results at Line 36
for padding and the ±1 logic (see Eq 2). We use c frag for
storing the partial results of convolution. Eventually, the
8×8 resulting matrix tile stored in c frag is written back to
the global memory in row-major at Line 38-39.
Our second BConv design leverages the new bit data
format. We reform the last two dimensions of the input
tensor (N,C) in a bit-tile of 128×8 bits in row-major, and the
filter tensor (C,O) in a bit-tile of 128×8 bits in column-major.
Then, we can adjust the ldm in Line 30-31 from ”in channels”
to 128, later we will show the impact of this adjustment.
6 BNN DESIGN WITH BTC
We present the overall BNN structure and our BTC-based
BNN implementation.
6.1 BNN Network Structure
Figure 15 illustrates the network structure of an example
ResNet. To avoid losing too much non-recoverable infor-
mation at the beginning, if the input images are in full-
precision (e.g., after preprocessing), the first layer of BNN
is not binarized [2], [3], [7]. BWN is adopted here in which
only the weight matrix is binarized. Consequently, we are
unable to use BTC to accelerate the first layer. Also because
the input channels of the first layer is usually very small
(e.g., red, green, blue), to avoid alignment issue and fully
leverage data locality, we binarized the weight matrix into
a 4D bit tensor in KKCO format and buffer the weight into
the shared memory for reuse. Then, by extracting each bit
of the weight, depending on whether it is 1 or 0, we add
or subtract the corresponding element of the input matrix.
The output matrix is binarized and stored in particular bit-
format as the input for the next layer.
Shown in Figure 15, regarding training, a BNN con-
volution layer typically comprises binarization (sign),
bit-convolution (conv), batch-normalization (bn), hard-tanh
(tanh), and pooling (pool). Binarization is the sign func-
tion following Eq 1. Batch-normalization [57] is to reduce
the batch noise:
yi,j =
(
xi,j − E[x∗,j ]√
V ar[x∗,j ] + 
)
· γj + βj (4)
Note that bn is essential for BNNs, as missing it will render
the training unable to converge. Additionally, having bn
brings two extra benefits: (1) bias is thus not necessary for
the bit convolution or fully-connected layer, as bias can be
integrated with βj in Eq 4; (2) the scaling layer proposed in
[3], [7] for BNN is also not necessary as it can be integrated
with γj in Eq 4. Hard tanh is a piecewise linear function:
Htanh(x) = Clip(x,−1, 1) (5)
Since tanh is immediately followed by the sign function, it
has none effect on inference or the forward pass of training.
The major purpose of tanh is to constrain the gradient of
the sign function between -1 and +1 in the backward pass
[1]. Otherwise, if the full-precision activation is too large,
the gradient will be zeroed-out. Additionally, since the sign
binarization function has already imposed non-linearity into
the network, no other activation function such as ReLU [3]
and PReLU [7] is actually needed for BNN. Conversely, extra
activation functions can be harmful based on our tests.
Regarding the order of these functions, it should be
tanh→sign→bconv→pool→bn→tanh→sign for the
training, as it has already been shown that placing pool
before bn can lead to increased training accuracy [3], [30].
However, for inference it would be much faster if equiva-
lently pool is located after bn and even the binarization
of the next layer to convert a max pooling into a logic-
OR operation [21], [26]. Additionally, for inference, bn and
sign of the next layer can be aggregated as a simple
threshold comparison operation (i.e., returns +1 if greater
than a threshold τ and -1 otherwise) [21], [26], labeled as
thrd in Figure 15. In this way, thrd can be further fused
with bconv or bmm to reduce the volume of data access if
the residual is not saved. Finally, tanh is not required for
inference as discussed. Consequently, the ultimate function
order becomes thrd→bconv→thrd→pool→bconv for
inference. Similar condition is also applied for the FC layers.
Traditionally, the last layer of BNN is also in full-
precision [1], [3]. However, Tang et al. [7] showed that
binarizing the final layer with a learned scaling layer could
significantly compact the model as FC layers comprises the
most parameters. Our observation here is that such a scaling
layer can be absorbed by adding a bn function for the last
layer, which may provide even better performance due to
8convimg32b bnpool tanh convsign bn tanh convsign sign fc bnbn
Residual 32bit
32b 32b 32b 32b 32b 32b 32b 32b 32b 32bout1b 1b 1b
sign
1b
W32b sign
1b
W32b sign
1b
W32b sign
1b
W32b
conv
1b
Wb
img32b thrd32b 1b pool 1b conv
1b
Wb
thrd32b 1b conv
1b
Wb
thrd32b 1b fc bn32b 32bout1b
Residual 32bit
32b32bTraining
Inference
1b
Wb
Fig. 15: A typical network structure of a ResNet BNN including 3 Conv layers and 1 FC layer for training and inference.
more constraint output range for the following softmax
function. Note, for the final layer, since the output is real-
valued and there is no future binarization, bn cannot be
converted into a thrd function.
In terms of more advanced models such as ResNet, to
avoid gradient diminishing or explosion, the cross-layer
shortcut connections become vital. Here, the main perfor-
mance concern is that these residuals are real-valued (bit-
residual cannot convey gradient), which may incur sub-
stantial extra memory load & store compared with directly
saving the bits after thrd. In addition, the residual may
need a pooling layer before the injection. Furthermore, it is
also possible that the number of channels needs to adjust. In
those scenarios, we use the type-A shortcut of ResNet [58].
6.2 BTC based Implementation
Similar to [26], we have also fused all the layer functions
into a single GPU kernel so the repeated kernel invocation
& release overhead (as long as 20 µs per invocation [59])
can be eliminated. We implement each layer function as
a GPU device function. These device functions are called
from a global function where the BNN network model
is defined. Due to data dependency across the layers, to
ensure consistency, we rely on CUDA’s cooperative-groups
for global synchronization among all SMs. There are two
major challenges for the overall design here: (i) Achieving
high SM utilization. Since WMMA is executed at the warp
level, with 32 warps per SM for Turing GPUs and 68 SMs
in RTX2080Ti for instance, the overall parallelism offered
by the hardware is 2176 warps, implying 2176 BMMs sized
(8,128)×(128,8) per round. Consequently, the task granular-
ity per warp should be as small as possible in order to use
all the SM warp slots and achieve workload balance; (ii)
Adapting to WMMA format. As BTC can only process BMM
sized (8,128)×(128,8)=(8,8), we need to ensure the row of
the FC input matrix, the column of the weight, the batch
of the BConv image, and the output channel can all divide
8, while the column of the FC input matrix, the row of the
weight, the input channel of BConv can all divide 128. Both
requirements need to be consistent across all layers. Given
the BNN model can be in arbitrary configuration and we
internally use our own FSB format, the address translation
and calculation become more complicated. There is another
format change after the final Conv layer and ahead of the
first FC layer to ensure correct format transition.
7 EVALUATION
We evaluate our BTC-based BNN design in this section.
We first describe the experiment configurations. Then, we
show the evaluation results for BMM and BConv. Finally,
we discuss the performance of BNN using different models
and datasets.
7.1 Experiment Configuration
We use two NVIDIA Turing GPUs with CC-7.5 for evalu-
ation. Their information is listed in Table 2. The RTX2080
GPU is in a Linux 3.10.0 system with Intel Xeon E5-2680
CPU at 2.80 GHz, 128 GB DDR3 DRAM and gcc-4.8.5. The
RTX2080Ti GPU is in a Linux 2.6.32 system with Intel Xeon
E5-6230 CPU at 2.10 GHz, 384 GB DDR4 DRAM and gcc-
4.8.5. All the results reported are the average of 10 times’
execution.
7.2 BMM Evaluation
For BMM evaluation, we randomly generate square matri-
ces with increased sizes from from 128 until 16K. We use
the full-precision GEMM from cuBLAS as the baseline for
validation and performance comparison. We compare our
three BTC-based BMM designs with the BMM approach
from [3], the four BSTC BMM designs from [26], and the BTC
uint-4 and BMM designs from Cutlass [31]. We conduct two
types of testing: (1) General BMM where both the input ma-
trices and the output matrix are floating-points. It includes
binarization for A and B, but excludes the binarization for
C. The tested schemes are listed in Table 3. (2) BNN-specific
BMM where both the input matrices and the output matrix
are binarized. It includes binarization for C but excludes A
and B. This test reflects how BMM actually behaves in a
BNN FC layer. The schemes are listed in Table 4.
Figure 16, 17, 18 and 19 show the results of the two BMM
tests on TU104 RTX2080 GPU and TU102 RTX2080Ti GPU,
respectively. For general BMM in Figure 16 and 18, we have
three major observations: (I) No single approach dominates
the entire matrix range — For small matrices (n≤1K), the
fine-grained 64bit BSTC is relatively better although the ad-
vantage is marginal. This might be due to more fine-grained
thread-block tasks to leverage all SMs; For medium matrices
(1K<n≤4K), Design-3 based on the proposed FSB-format
obtains the best performance, particularly at 4K. For large
matrices (n>4K), the performance of all BTC based designs
drop. This is due to the fierce competition in BTC and
reduced data reuse in the L0/L1 cache. Nevertheless, the
size of FC layers of most BMMs fall in the medium range.
(II) Comparing among Design-1, 2, and 3, while Design-2 is
always better than Design-1 due to improved load efficiency
and shared memory reuse, the new-format based Design-3
significantly outperforms Design-1/2 except on very large
matrices. Overall, without this new FSB format, BTC may
not deliver any performance advantage over existing BSTC
software solutions. For BNN-specific BMM in Figure 17 and
19, the avoidance of binarizing A & B, and reduced memory
store after binarizing C, dramatically amplify the supremacy
of Design-3. The speedup is more than 20× over the full-
precision cuBLAS at 4K on RTX2080. (III) Comparing be-
tween BMMs and uint-4 based GEMM over the same TCUs,
9TABLE 2: Evaluation Platforms. ”Reg” refers to the number of 4-byte registers. ”Thds” refer to threads. ”Dri/Rtm” refer to CUDA driver and runtime versions.
GPU Arch/CC Code SMs CTAs/SM Warps/SM Thds/CTA Regs/SM Shared/SM TCUs/SM Memory Mem Bandwidth Dri/Rtm
RTX-2080Ti Turing-7.5 TU102 68 16 32 1024 64K 64K 8 11GB GDDR6 616 GB/s 10.1/10.0
RTX-2080 Turing-7.5 TU104 46 16 32 1024 64K 64K 8 8GB GDDR6 448 GB/s 10.0/10.0
128 256 512 1K 2K 4K 8K 16K
Matrix Size (X*X)
0
2
4
6
8
10
12
Sp
ee
du
p 
ov
er
 cu
BL
AS
xnor
bmm32
bmm64
bmms32
bmms64
bmma128
bmma
bmmafmt
Fig. 16: General BMM on RTX2080
128 256 512 1K 2K 4K 8K 16K
Matrix Size (X*X)
0
4
8
12
16
20
24
Sp
ee
du
p 
ov
er
 cu
BL
AS
bmm32_b
bmm64_b
bmms32_b
bmms64_b
bmma128_b
bmma_b
bmmafmt_b
Fig. 17: Specific BMM on RTX2080
128 256 512 1K 2K 4K 8K 16K
Matrix Size (X*X)
0
2
4
6
8
10
12
Sp
ee
du
p 
ov
er
 cu
BL
AS
xnor
u4
cutlass
bmm32
bmm64
bmms32
bmms64
bmma128
bmma
bmmafmt
Fig. 18: General BMM on RTX2080Ti
128 256 512 1K 2K 4K 8K 16K
Matrix Size (X*X)
0
4
8
12
16
20
24
Sp
ee
du
p 
ov
er
 cu
BL
AS
bmm32_b
bmm64_b
bmms32_b
bmms64_b
bmma128_b
bmma_b
bmmafmt_b
Fig. 19: Specific BMM on RTX2080Ti
12
8
25
6
38
4
51
2
64
0
76
8
89
6
10
24
11
52
12
80
14
08
15
36
16
64
17
92
19
20
20
48
Input & Output Channel Size
0
5
10
15
20
25
Sp
ee
du
p 
ov
er
 cu
DN
N-
ba
se cudnnfast
bconv32
bconv64
bmma
bmmafmt
Fig. 20: General BConv on RTX2080
12
8
25
6
38
4
51
2
64
0
76
8
89
6
10
24
11
52
12
80
14
08
15
36
16
64
17
92
19
20
20
48
Input & Output Channel Size
0
5
10
15
20
25
Sp
ee
du
p 
ov
er
 cu
DN
N-
ba
se cudnnfast
bconv32_b
bconv64_b
bmma_b
bmmafmt_b
Fig. 21: Specific BConv on RTX2080
12
8
25
6
38
4
51
2
64
0
76
8
89
6
10
24
11
52
12
80
14
08
15
36
16
64
17
92
19
20
20
48
Input & Output Channel Size
0
5
10
15
20
25
Sp
ee
du
p 
ov
er
 cu
DN
N-
ba
se cudnnfast
bconv32
bconv64
bmma
bmmafmt
Fig. 22: General BConv on RTX2080Ti
12
8
25
6
38
4
51
2
64
0
76
8
89
6
10
24
11
52
12
80
14
08
15
36
16
64
17
92
19
20
20
48
Input & Output Channel Size
0
5
10
15
20
25
Sp
ee
du
p 
ov
er
 cu
DN
N-
ba
se cudnnfast
bconv32_b
bconv64_b
bmma_b
bmmafmt_b
Fig. 23: Specific BConv on RTX2080Ti
TABLE 3: BMM full-precision output schemes.
Schemes Description Algorithm Input Output
cuBLAS Simulating BMM via SGEMM SGEMM 32bit 32bit
xnor BMM design in [1] BMM 32bit 32bit
bmm32 32bit BSTC BMM in [26] BMM 32bit 32bit
bmm64 64bit BSTC BMM in [26] BMM 32bit 32bit
bmms32 Fine-grained 32bit BSTC BMM in [26] BMM 32bit 32bit
bmms64 Fine-grained 64bit BSTC BMM in [26] BMM 32bit 32bit
cutlass BTC BMM in Cutlass library [31] BMM 1bit 32bit
u4 BMM via unsigned 4-bits MM [31] 4bits-MM 4bit 32bit
bmma Design-1: basic BTC implementation BMM 32bit 32bit
bmma128 Design-2: 128bit load and shared memory BMM 32bit 32bit
bmmafmt Design-3: new format BMM 32bit 32bit
TABLE 4: BMM bit output schemes.
Schemes Description Input Output
bmm32 b 32bit BSTC BMM in [26] with bin output 1bit 1bit
bmm64 b 32bit BSTC BMM in [26] with bin output 1bit 1bit
bmms32 b Fine-grained 32bit BSTC BMM in [26] with bin output 1bit 1bit
bmms64 b Fine-grained 64bit BSTC BMM in [26] with bin output 1bit 1bit
bmma b Design-1: basic BTC implementation with bin output 1bit 1bit
bmma128 b Design-2: 128bit load with bin output 1bit 1bit
bmmafmt b Design-3: new format with bin output 1bit 1bit
we can observe that BMMs demonstrate obvious advantage.
This is largely because (a) the smaller memory footprint (1-
bit vs 4-bits) reduces the bandwidth and storage pressure
over the data-path and registers; (b) with the same bit-width
for the ALUs in the TCUs, using 1-bit can compact 4×more
data elements than using int-4 or uint-4. Similar conditions
apply to other types such as int-8 and FP16.
7.3 BConv Evaluation
For BConv, there are much more parameters than BMM:
input height, input width, weight height, weight width, batch,
input channels, output channels, stride, pooling, etc. We com-
pare our two BTC-based designs (Note that we use bmma to
denote Design-1 and bmmafmt to denote Design-2) with full-
precision cuDNN-base (no workspace), cudnn-fast (plenty
workspace), and two BSTC designs (bconv32 and bconv64)
from [26]. We use cuDNN-base as the baseline and perform
the two types of test: (1) General BConv where the input,
filter and output tensors are all floating-points; (2) BNN-
specific BConv where all of them are binarized.
Figure 20, 21, 22 and 23 show the results of the two
types of tests with batch=16, input size=64, weight height=3
and stride=1 on the two GPUs. We increase both in-
put channels (C) and output channels (O) from 128 to 2048.
As is shown, our two BTC-based approaches exhibit con-
siderable speedups over existing methods. Particularly, the
FSB new format design achieves about 25× over the full-
precision cuDNN with C=O=640 on RTX2080Ti. Comparing
between the two BTC designs, we can see that (i) when
C=O=128, the two designs are just equivalent, so they show
similar performance; (ii) When C=O=384, Design-1 is better
possibly because ldm=384 is also a good choice for memory
load (see Section IV). (iii) For the other points, Design-2
shows obvious advantages.
7.4 BNN Evaluation
Finally, we evaluate the overall BNN implementation. Ta-
ble 5 lists the six models we used for evaluation. Table 6
and 7 list the latency and throughput we obtained for the
six models on the two NVIDIA Turing GPUs, respectively.
The latency is measured under a batch size of 8 since
8 is the smallest value to leverage the bit-tensorcores so
essentially the latency is for the inference of 8 images. The
throughput is measured under a batch of 1024 images for
MNIST and Cifar10, and 512 for ImageNet. We compare our
performance with the four approaches from the latest BSTC
SBNN work (from http://github.com/uuudown/SBNN.)
[26]. Overall, compared with the best approach SBNN-
64-Fine from SBNN, our BTC using the default format
design achieves on average 2.10× in latency and 1.65×
in throughput on RTX2080Ti, and 2.08× in latency and
1.62× in throughput on RTX2080 across the six models. Our
proposed BTC new format achieves 2.33× in latency and
1.81× in throughput on RTX2080Ti, and 2.25× in latency
and 1.77× in throughput on RTX2080. The best speedup
has been achieved by the FSB-format based design on
RTX2080Ti for ResNet-14 on Cifar10 — 3.79× in latency and
2.84× in throughput.
Regarding this result, we have three observations: (I) Our
BTC design generally achieves more than 2× over existing
work except for MNIST-MLP and ImageNet-Alexnet where
the throughput is actually a little bit worse. The reason is
10
TABLE 5: BTC Evaluation. “1024FC” refers to a fully-connected layer with 1024 neutrons. “2x128C3” refers to 2 convolution layer with 128 output channels
and 3x3 filter. “P2” refers to a 2x2 pooling layer. “128C11/4” refers to a convolution layer with 128 output channels, 11x11 filter size and stride=4. ”Input
size” is of input height×input width×input channels format. ”Output” is the number of categories to classify. “Ref” is short for references. “BNN” refers
to state-of-the-art BNN training accuracy from existing work. “Our BNN” is the BNN training accuracy we obtained from our own BNN implementation.
“Full-Precision” is the 32 bits full-precision training accuracy from existing works.
Dataset Ref Network Ref Network Structure Input Size Out BNN Our BNN Full-Precision
MNIST [60] MLP [1] 1024FC-1024FC-1024FC-1024FC 28x28x1 10 98.6% [1] 97.6% 99.1% [1]
Cifar-10 [61] VGG [4] (2x128C3)-MP2-(2x256C3)-MP2-(2x512C3)-MP2-(3x1024FC) 32x32x3 10 89.9% [1] 88.7% 90.9% [8]
Cifar-10 [61] ResNet-14 [3] 128C3/2-4x128C3-4x256C3-4x512C3-(2x512FC) 32x32x3 10 N/A 91.6% N/A
ImageNet [62] AlexNet [63] (128C11/4)-P2-(256C5)-P2-(3x256C3)-P2-(3x4096FC) 224x224x3 1000 75.7/46.1% [8] 74.2/44.7% 80.2/56.6% [8]
ImageNet [62] VGG-16 [64] (2x64C3)-P2-(2x128C3)-P2-(3x256C3)-P2-2x(3x512C3-P2)-(3x4096FC) 224x224x3 1000 76.8%/NA [41] 77.7/53.4% 88.4%/NA [41]
ImageNet [62] ResNet-18 [3] 64C7/4-4x64C3-4x128C3-4x256C3-4x512C3-(2x512FC) 224x224x3 1000 73.2/51.2% [3] 72.7/48.6% 89.2/69.3% [3]
TABLE 6: BTC Inference Performance on NVIDIA Turing RTX2080 GPU.
MNIST-MLP Cifar10-VGG Cifar10-ResNet14 ImageNet-AlexNet ImageNet-VGG ImageNet-ResNet18
Schemes 8 Latency Throughput 8 Latency Throughput 8 Latency Throughput 8 Latency Throughput 8 Latency Throughput 8 Latency Throughput
SBNN-32 0.227ms 2.88×106fps 1.891ms 1.06×104fps 5.138ms 4.17×103fps 4.494ms 3.18×103fps 27.638ms 4.26×102fps 6.550ms 2.60×103fps
SBNN-32-Fine 0.082ms 1.97×106fps 1.536ms 1.05×104fps 4.382ms 3.97×103fps 3.928ms 3.05×103fps 27.009ms 4.25×102fps 5.944ms 2.37×103fps
SBNN-64 0.908ms 8.44×104fps 2.816ms 1.06×104fps 8.132ms 4.59×103fps 17.258ms 1.95×103fps 40.247ms 5.23×102fps 8.108ms 3.01×103fps
SBNN-64-Fine 0.074ms 5.51×106fps 0.999ms 1.63×104fps 2.550ms 6.52×103fps 2.871ms 3.79×103fps 16.68ms 6.65×102fps 3.736ms 3.42×103fps
BTC 0.061ms 3.37×106fps 0.364ms 3.62×104fps 0.827ms 1.58×104fps 2.367ms 3.85×103fps 7.449ms 1.24×103fps 1.869ms 5.48×103fps
BTC-FMT 0.055ms 5.48×106fps 0.338ms 3.85×104fps 0.724ms 1.71×104fps 2.326ms 3.77×103fps 7.021ms 1.34×103fps 1.833ms 5.55×103fps
TABLE 7: BTC Inference Performance on NVIDIA Turing RTX2080Ti GPU.
MNIST-MLP Cifar10-VGG Cifar10-ResNet14 ImageNet-AlexNet ImageNet-VGG ImageNet-ResNet18
Schemes 8 Latency Throughput 8 Latency Throughput 8 Latency Throughput 8 Latency Throughput 8 Latency Throughput 8 Latency Throughput
SBNN-32 0.252ms 2.99×106fps 1.596ms 1.19×104fps 4.909ms 4.68×103fps 1.937ms 7.62×103fps 23.788ms 5.13×102fps 5.863ms 3.08×103fps
SBNN-32-Fine 0.082ms 2.43×106fps 1.548ms 1.19×104fps 4.061ms 4.65×103fps 1.733ms 7.16×103fps 22.722ms 5.06×102fps 5.145ms 3.10×103fps
SBNN-64 0.952ms 1.03×105fps 2.921ms 1.37×104fps 8.633ms 5.95×103fps 14.214ms 2.59×103fps 31.561ms 7.08×102fps 8.092ms 3.91×103fps
SBNN-64-Fine 0.070ms 6.87×106fps 0.926ms 2.09×104fps 2.341ms 7.86×103fps 2.017ms 4.99×103fps 12.057ms 8.83×102fps 3.233ms 4.52×103fps
BTC 0.057ms 4.38×106fps 0.317ms 4.69×104fps 0.728ms 2.06×104fps 1.878ms 4.87×103fps 5.840ms 1.62×103fps 1.538ms 7.23×103fps
BTC-FMT 0.053ms 6.78×106fps 0.276ms 5.06×104fps 0.618ms 2.24×104fps 1.862ms 4.87×103fps 5.466ms 1.76×103fps 1.438ms 7.34×103fps
TABLE 8: Comparing with FPGA works using AlexNet on ImageNet.
AlexNet/ImageNet Platform Raw Latency Throughput
RebNet [65] Xilinx Virtex VCU108 FPGA 1902 µs 521 img/s
FP-BNN [23] Intel Stratix-V FPGA 1160 µs 862 img/s
O3BNN [25] Xilinx Zynq ZC706 FPGA 774 µs 1292 img/s
SBNN [26] NVIDIA Tesla V100 GPU 979 µs 4400 img/s
BTC NVIDIA RTX2080Ti GPU 559 µs 4869 img/s
TABLE 9: Comparing with CPU, GPU and FPGA using VGG-16 on ImageNet.
Vgg-16/ImageNet Platform Raw Latency Throughput
BitFlow [41] NVIDIA GTX1080 12.87 ms 78 img/s
BitFlow [41] Intel i7-7700 HQ 16.10 ms 62 img/s
BitFlow [41] Intel Xeon-Phi 7210 11.82 ms 85 img/s
O3BNN [25] Xilinx Zynq ZC706 FPGA 5.626 ms 178 img/s
SBNN [26] NVIDIA Tesla V100 GPU 3.208 ms 312 img/s
BTC NVIDIA RTX2080Ti GPU 3.570 ms 1760 img/s
that for MLP, a batch of 1024 is still insufficient for fully
leveraging the bit-tensorcores, as will be discussed later.
For Alexnet, the delay of the first layer remains too large
(77.4%) while the other convolution layers are relatively
smaller than alternative networks, which cannot fully utilize
the BTCs. (II) Although showing better performance, the
speedup led by the new FSB format is not as good as
in BMM and BConv, the major reason is that both the
batch size and the channels are relatively small (batch≤1K,
channels≤512) which is not the region that the FSB format
can demonstrate its best speedups (Section VII).
Table 8 and 9 compare the single image raw latency
and throughput of our BTC-based new format design with
existing BNN approaches for CPU, GPU, Xeon-Phi and
FPGA using Alexnet and VGG-16 on ImageNet. As can be
seen, our design achieves the best single-image raw latency
and throughput on Alexnet, and more than 5× throughput
enhancement on VGG-16 over the existing works as listed.
7.5 Sensitivity Study
To further investigate the performance delivery, we perform
several sensitivity studies in this subsection.
Latency Breakdown: Figure 24 illustrates the percentage
breakdown of the latency (measured by clock() on GPU)
TABLE 10: Layer-wise Synchronization Overhead.
Sync Mnist-MLP Cifar10-vgg Cifar10-ResNet14 Alexnet VGG ResNet18
% 8.26% 14.10% 13.20% 1.36% 1.72% 5.67%
for the inference of 8 images over the six models on the
RTX-2080 GPU. Clearly, the first layer contributes the most
delay for the three ImageNet models due significantly larger
image size than the other two datasets. For AlexNet, the
percentage can be as high as 77.4%. It is also over 35% for
VGG-16 and ResNet-18. This is different from existing belief
that the first layer is often not a big issue due to the least
parameters and computation [2], [26]. The latency for other
layers are roughly balanced.
Synchronization Overhead: As we enforce global synchro-
nization through cooperative-groups per layer to ensure
data consistency, such global synchronizations can intro-
duce extra overhead and idle waiting of SMs. Table 10 shows
the percentage of this synchronization overhead, which is
measured by removing all the synchronization primitives.
As can be seen, this overhead is the most for the medium
network models, e.g., the two on Cifar10, which are 14.1%
and 13.2%, respectively.
Shortcut Overhead: We then focus on the two ResNet
models and measure the overhead incurred by handling
the cross-layer residual. Figure 26 show the latency and
throughput of the two ResNet models on RTX-2080 regard-
ing four scenarios: (a) with residual; (b) save the residual
without fetching them; (c) fetch the residual without saving
them; and (d) without the residual at all. For ResNet-14 on
Cifar10, if eliminating the residual-related operations, we
can gain 9.7% speedup in latency and 14% in throughput.
For ResNet-18 on ImageNet, we can gain 9.0% in latency
and 8.3% in throughput.
Utilization: Finally we investigate the impact of batch
size over the throughput. If the batch size is too small,
the hardware such as the bit-tensorcores might be under-
utilized. Figure 25 shows the inference throughput of the
11
L
1
L
2
L
3
L
4
O
u
t
L
1
L
2
L
3
L
4
L
5
L
6
L
7
L
8
O
u
t
L
1
L
2
L
3
L
4
L
5
L
6
L
7
L
8
L
9
L
1
0
L
1
1
L
1
2
L
1
3
L
1
4
O
u
t
L
1
L
2
L
3
L
4
L
5
L
6
L
7
O
u
t
L
1
L
2
L
3
L
4
L
5
L
6
L
7
L
8
L
9
L
1
0
L
1
1
L
1
2
L
1
3
L
1
4
L
1
5
O
u
t
L
1
L
2
L
3
L
4
L
5
L
6
L
7
L
8
L
9
L
1
0
L
1
1
L
1
2
L
1
3
L
1
4
L
1
5
L
1
6
L
1
7
L
1
8
O
u
t
0%
10%
20%
30%
40%
77.4%
Mnist_MLP
Cifar10_VGG
Cifar10_ResNet-14
ImageNet_Alexnet
ImageNet_VGG
ImageNet_ResNet-18
Fig. 24: Per-layer latency breakdown of our BTC new-format based BNN design on the 6 models.
1 2 4 8
1
6
3
2
6
4
1
2
8
2
5
6
5
1
2 1 2 4 8
1
6
3
2
6
4
1
2
8
2
5
6
5
1
2 1 2 4 8
1
6
3
2
6
4
1
2
8
2
5
6
5
1
2 1 2 4 8
1
6
3
2
6
4
1
2
8
2
5
6
5
1
2 1 2 4 8
1
6
3
2
6
4
1
2
8
2
5
6
5
1
2 1 2 4 8
1
6
3
2
6
4
1
2
8
2
5
6
5
1
2
0%
20%
40%
60%
80%
100%
Mnist_MLP Cifar10_VGG Cifar10_ResNet-14 ImageNet_Alexnet ImageNet_VGG ImageNet_ResNet-18
Fig. 25: Normalized throughput with respect to increased batch size for the BTC new-format BNN design on the 6 models.
ResNet-14 ResNet-18
0.0
0.5
1.0
1.5
2.0
8img Latency (ms)
w/ res
w/o load res
w/o store res
w/o res
ResNet-14 ResNet-18
0
5K
10K
15K
20K
Throughput (fps)
w/ res
w/o load res
w/o store res
w/o res
Fig. 26: Cifar-ResNet14 Latency: 9.7% speedup, Throughput: 14%, ImageNet-
ResNet18 Latency: 9.0% speedup, Throughput: 8.3%
six models with different batch sizes (normalized to the
throughput with a batch of 1024 for MNIST and Cifar10,
and 512 for ImageNet) on RTX2080. As can be seen, for
ImageNet, a batch of 128 is sufficient to achieve the best
throughput while for Cifar10, a batch of 512 is necessary.
For MNIST, even with batch size arising from 16K to 32K,
the throughput is still increasing. The maximum throughput
is obtained at 32K, which is 7.62×106 fps.
8 CONCLUSION
In this paper we investigate and characterize the new bit
computation capability of the tensorcores in NVIDIA Turing
GPUs. We found that the stride of memory access can
significantly impact the performance of memory access.
Based on this observation, we propose a new bit data
format for efficient design of Bit-Matrix-Multiplication and
Bit-Convolution. We built the full implementation for the
inference of binarized neural networks. Evaluations us-
ing six network models (MLP, VGG-like, AlexNet, VGG-
16, ResNet-14/18) on three datasets (MNIST, Cifar10 and
ImageNet) over two latest Turing GPUs (RTX2080 and
RTX2080Ti) show that our design can bring on average
2.33× (up to 3.79×) in latency and 1.81× (up to 2.84×) in
throughput compared with state-of-the-art BNN design for
GPUs, leading to super realtime performance. As a future
work, we are planning to exploit the bit-tensorcore for alter-
native utilization such as BLAS-based graph computation.
REFERENCES
[1] Matthieu Courbariaux, Itay Hubara, Daniel Soudry, Ran El-Yaniv,
and Yoshua Bengio. Binarized neural networks: Training deep
neural networks with weights and activations constrained to+ 1
or-1. arXiv preprint arXiv:1602.02830, 2016.
[2] Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv,
and Yoshua Bengio. Binarized neural networks. In Advances in
Neural Information Processing Systems, pages 4107–4115, 2016.
[3] Mohammad Rastegari, Vicente Ordonez, Joseph Redmon, and Ali
Farhadi. Xnor-net: Imagenet classification using binary convolu-
tional neural networks. In European Conference on Computer Vision,
pages 525–542. Springer, 2016.
[4] Matthieu Courbariaux, Yoshua Bengio, and Jean-Pierre David. Bi-
naryconnect: Training deep neural networks with binary weights
during propagations. In Advances in Neural Information Processing
Systems, pages 3123–3131, 2015.
[5] Angus Galloway, Graham W Taylor, and Medhat Moussa. Attack-
ing binarized neural networks. arXiv:1711.00449, 2017.
[6] Elias B Khalil, Amrita Gupta, and Bistra Dilkina. Combinatorial
attacks on binarized neural networks. arXiv:1810.03538, 2018.
[7] Wei Tang, Gang Hua, and Liang Wang. How to train a compact
binary neural network with high accuracy? In AAAI, 2017.
[8] Sajad Darabi, Mouloud Belbahri, Matthieu Courbariaux, and
Vahid Partovi Nia. BNN+: Improved binary network training.
arXiv preprint arXiv:1812.11800, 2018.
[9] Fayez Lahoud, Radhakrishna Achanta, Pablo Ma´rquez-Neila,
and Sabine Su¨sstrunk. Self-binarizing networks. arXiv preprint
arXiv:1902.00730, 2019.
[10] Zechun Liu, Baoyuan Wu, Wenhan Luo, Xin Yang, Wei Liu, and
Kwang-Ting Cheng. Bi-real net: Enhancing the performance of
1-bit CNNs with improved representational capability and ad-
vanced training algorithm. In Proceedings of the European Conference
on Computer Vision (ECCV), pages 722–737, 2018.
[11] Shilin Zhu, Xin Dong, and Hao Su. Binary ensemble neural
network: More bits per network or more networks per bit? In
IEEE Conference on Computer Vision and Pattern Recognition, 2019.
[12] Joseph Bethge, Haojin Yang, Marvin Bornstein, and Christoph
Meinel. BinaryDenseNet: Developing an Architecture for Binary
Neural Networks. In Proceedings of the IEEE International Conference
on Computer Vision Workshops, 2019.
[13] Joseph Bethge, Christian Bartz, Haojin Yang, Ying Chen, and
Christoph Meinel. MeliusNet: Can Binary Neural Networks
Achieve MobileNet-level Accuracy? arXiv:2001.05936, 2020.
[14] Pierre Baldi, Kyle Cranmer, Taylor Faucett, Peter Sadowski, and
Daniel Whiteson. Parameterized machine learning for high-energy
physics. arXiv preprint arXiv:1601.07913, 2016.
[15] Buser Say and Scott Sanner. Planning in factored state and action
spaces with learned binarized neural network transition models.
In IJCAI, pages 4815–4821, 2018.
[16] Svyatoslav Korneev, Nina Narodytska, Luca Pulina, Armando
Tacchella, Nikolaj Bjorner, and Mooly Sagiv. Constrained image
generation using binarized neural networks with decision pro-
cedures. In International Conference on Theory and Applications of
Satisfiability Testing, pages 438–449. Springer, 2018.
[17] Chao Ma, Yulan Guo, Yinjie Lei, and Wei An. Binary volumetric
convolutional neural networks for 3-d object recognition. IEEE
Transactions on Instrumentation and Measurement, (99):1–11, 2018.
[18] Paul Covington, Jay Adams, and Emre Sargin. Deep neural
networks for youtube recommendations. In Proceedings of the 10th
ACM conference on recommender systems. ACM, 2016.
[19] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu,
and Tat-Seng Chua. Neural collaborative filtering. In Proceedings
of the 26th International Conference on World Wide Web. International
World Wide Web Conferences Steering Committee, 2017.
12
[20] Eriko Nurvitadhi, David Sheffield, Jaewoong Sim, Asit Mishra,
Ganesh Venkatesh, and Debbie Marr. Accelerating binarized
neural networks: comparison of FPGA, CPU, GPU, and ASIC. In
International Conference on Field-Programmable Technology, 2016.
[21] Yaman Umuroglu, Nicholas J Fraser, Giulio Gambardella,
Michaela Blott, Philip Leong, Magnus Jahre, and Kees Vissers.
Finn: A framework for fast, scalable binarized neural network
inference. In Proceedings of International Symposium on Field-
Programmable Gate Arrays. ACM, 2017.
[22] Ritchie Zhao, Weinan Song, Wentao Zhang, Tianwei Xing, Jeng-
Hau Lin, Mani Srivastava, Rajesh Gupta, and Zhiru Zhang. Ac-
celerating binarized convolutional neural networks with software-
programmable FPGAs. In Proceedings of International Symposium on
Field-Programmable Gate Arrays. ACM, 2017.
[23] Shuang Liang, Shouyi Yin, Leibo Liu, Wayne Luk, and Shaojun
Wei. FP-BNN: Binarized neural network on FPGA. Neurocomput-
ing, 275:1072–1086, 2018.
[24] Tong Geng, Tianqi Wang, Chunshu Wu, Chen Yang, Shuai-
wen Leon Song, Ang Li, and Martin Herbordt. LP-BNN: Ultra-
low-Latency BNN Inference with Layer Parallelism. In Proceedings
of the 30th IEEE International Conference on Application-specific Sys-
tems, Architectures, and Processors. IEEE, 2019.
[25] Tong Geng, Tianqi Wang, Chunshu Wu, Chen Yang, Wei Wu, Ang
Li, and Martin C Herbordt. O3BNN: an out-of-order architecture
for high-performance binarized neural network inference with
fine-grained pruning. In Proceedings of the ACM International
Conference on Supercomputing, pages 461–472. ACM, 2019.
[26] Ang Li, Tong Geng, Tianqi Wang, Martin Herbordt, Shuai-
wen Leon Song, and Kevin Barker. Bstc: a novel binarized-soft-
tensor-core design for accelerating bit-based approximated neural
nets. In Proceedings of International Conference for High Performance
Computing, Networking, Storage and Analysis. ACM, 2019.
[27] NVIDIA. NVIDIA Turing GPU Architecture, 2019.
[28] Kumar Chellapilla, Sidd Puri, and Patrice Simard. High perfor-
mance convolutional neural networks for document processing. In
Tenth International Workshop on Frontiers in Handwriting Recognition.
Suvisoft, 2006.
[29] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan
Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cudnn:
Efficient primitives for deep learning. arXiv:1410.0759, 2014.
[30] Taylor Simons and Dah-Jye Lee. A review of binarized neural
networks. Electronics, 8(6), 2019.
[31] NIVIDA. CUDA Template Library for Dense Linear Algebra at All
Levels and Scales (CUTLASS), 2018.
[32] Wenbin Fang, Mian Lu, Xiangye Xiao, Bingsheng He, and Qiong
Luo. Frequent itemset mining on graphics processors. In Proceed-
ings of the fifth international workshop on data management on new
hardware, pages 34–42. ACM, 2009.
[33] Benjamin Block, Peter Virnau, and Tobias Preis. Multi-GPU accel-
erated multi-spin Monte Carlo simulations of the 2D Ising model.
Computer Physics Communications, 181(9):1549–1556, 2010.
[34] Martı´n Pedemonte, Enrique Alba, and Francisco Luna. Bitwise
operations for GPU implementation of genetic algorithms. In
Proceedings of the 13th annual conference companion on Genetic and
evolutionary computation, pages 439–446. ACM, 2011.
[35] Francesco Fusco, Michail Vlachos, Xenofontas Dimitropoulos, and
Luca Deri. Indexing million of packets per second using GPUs. In
Conference on Internet measurement conference. ACM, 2013.
[36] Kefu Xu, Wenke Cui, Yue Hu, and Li Guo. Bit-parallel multiple
approximate string matching based on GPU. Procedia Computer
Science, 17:523–529, 2013.
[37] Eli Ben-Sasson, Matan Hamilis, Mark Silberstein, and Eran Tromer.
Fast multiplication in binary fields on GPUs via register cache. In
International Conference on Supercomputing. ACM, 2016.
[38] Jingkuan Song. Binary generative adversarial networks for image
retrieval. arXiv preprint arXiv:1708.04150, 2017.
[39] Saman Ashkiani, Martin Farach-Colton, and John D Owens. A
dynamic hash table for the GPU. In IEEE International Parallel and
Distributed Processing Symposium (IPDPS). IEEE, 2018.
[40] Bradley McDanel, Surat Teerapittayanon, and HT Kung. Embed-
ded binarized neural networks. arXiv:1709.02260, 2017.
[41] Yuwei Hu, Jidong Zhai, Dinghua Li, Yifan Gong, Yuhao Zhu,
Wei Liu, Lei Su, and Jiangming Jin. BitFlow: Exploiting Vector
Parallelism for Binary Neural Networks on CPU. In International
Parallel and Distributed Processing Symposium (IPDPS). IEEE, 2018.
[42] Xiaofan Lin, Cong Zhao, and Wei Pan. Towards accurate binary
convolutional neural network. In Advances in Neural Information
Processing Systems, pages 345–353, 2017.
[43] Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson,
Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia,
Nan Boden, Al Borchers, et al. In-datacenter performance analysis
of a tensor processing unit. In Proceedings of the 44th Annual
International Symposium on Computer Architecture. ACM, 2017.
[44] NVIDIA. Volta Architecture White Paper, 2018.
[45] Zhe Jia, Marco Maggioni, Benjamin Staiger, and Daniele P
Scarpazza. Dissecting the NVIDIA Volta GPU architecture via
microbenchmarking. arXiv preprint arXiv:1804.06826, 2018.
[46] Zhe Jia, Marco Maggioni, Jeffrey Smith, and Daniele Paolo
Scarpazza. Dissecting the NVidia Turing T4 GPU via Microbench-
marking. arXiv preprint arXiv:1903.07486, 2019.
[47] Stefano Markidis, Steven Wei Der Chien, Erwin Laure, Ivy Bo
Peng, and Jeffrey S Vetter. Nvidia tensor core programmability,
performance & precision. In International Parallel and Distributed
Processing Symposium Workshops. IEEE, 2018.
[48] Md Aamir Raihan, Negar Goli, and Tor M Aamodt. Modeling
Deep Learning Accelerator Enabled GPUs. In International Sympo-
sium on Performance Analysis of Systems and Software. IEEE, 2019.
[49] Brian Hickmann and Dennis Bradford. Experimental Analysis
of Matrix Multiplication Functional Units. In 26th Symposium on
Computer Arithmetic (ARITH). IEEE, 2019.
[50] Azzam Haidar, Stanimire Tomov, Jack Dongarra, and Nicholas J
Higham. Harnessing GPU tensor cores for fast FP16 arithmetic to
speed up mixed-precision iterative refinement solvers. In Proceed-
ings of the International Conference for High Performance Computing,
Networking, Storage, and Analysis, page 47. IEEE Press, 2018.
[51] Anumeena Sorna, Xiaohe Cheng, Eduardo D’Azevedo, Kwai Won,
and Stanimire Tomov. Optimizing the Fast Fourier Transform Us-
ing Mixed Precision on Tensor Core Hardware. In 2018 IEEE 25th
International Conference on High Performance Computing Workshops
(HiPCW), pages 3–7. IEEE, 2018.
[52] Pierre Blanchard, Nicholas J Higham, Florent Lopez, Theo Mary,
and Srikara Pranesh. Mixed Precision Block Fused Multiply-Add:
Error Analysis and Application to GPU Tensor Cores. 2019.
[53] Abdul Dakkak, Cheng Li, Jinjun Xiong, Isaac Gelado, and Wen-
mei Hwu. Accelerating reduction and scan using tensor core units.
In Proceedings of the ACM International Conference on Supercomput-
ing, pages 46–57. ACM, 2019.
[54] NVIDIA. NVIDIA Tesla V100 GPU Architecture, 2017.
[55] Guangming Tan, Linchuan Li, Sean Triechle, Everett Phillips,
Yungang Bao, and Ninghui Sun. Fast implementation of DGEMM
on Fermi GPU. In Proceedings of International Conference for High
Performance Computing, Networking, Storage and Analysis, 2011.
[56] NVIDIA. CUDA Programming Guide, 2018.
[57] Sergey Ioffe and Christian Szegedy. Batch normalization: Acceler-
ating deep network training by reducing internal covariate shift.
arXiv preprint arXiv:1502.03167, 2015.
[58] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep
residual learning for image recognition. In Proceedings of the IEEE
conference on computer vision and pattern recognition, 2016.
[59] Michael LeBeane, Khaled Hamidouche, Brad Benton, Mauricio
Breternitz, Steven K Reinhardt, and Lizy K John. GPU triggered
networking for intra-kernel communications. In Proceedings of the
International Conference for High Performance Computing, Networking,
Storage and Analysis, pages 1–12, 2017.
[60] Yann LeCun, Corinna Cortes, and CJ Burges. MNIST handwritten
digit database. AT&T Labs [Online]. Available: http://yann. lecun.
com/exdb/mnist, 2, 2010.
[61] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. The CIFAR-10
dataset. online: http://www.cs.toronto.edu/kriz/cifar.html, 2014.
[62] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-
Fei. Imagenet: A large-scale hierarchical image database. In
Conference on Computer Vision and Pattern Recognition. IEEE, 2009.
[63] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Ima-
genet classification with deep convolutional neural networks. In
Advances in neural information processing systems, 2012.
[64] Karen Simonyan and Andrew Zisserman. Very deep convolu-
tional networks for large-scale image recognition. arXiv preprint
arXiv:1409.1556, 2014.
[65] Mohammad Ghasemzadeh, Mohammad Samragh, and Farinaz
Koushanfar. Rebnet: Residual binarized neural network. In 2018
IEEE 26th Annual International Symposium on Field-Programmable
Custom Computing Machines (FCCM), pages 57–64. IEEE, 2018.
