It has long been believed that Binary Neural Networks (BNNs) can drastically accelerate inference by replacing the arithmetic operations in float-valued Deep Neural Networks (DNNs) with bit-wise operations. Nevertheless, there has been no open-source implementation supporting this idea on low-end ARM devices (e.g., mobile phones and embedded devices). In this work, we propose daBNN, a super fast inference framework that implements BNNs on ARM devices. Several speed-up and memory refinement strategies for bit-packing, binarized convolution, and memory layout are uniquely devised to enhance inference efficiency. Compared to the recent open-source BNN inference framework BMXNet, our daBNN is 7×-23× faster on a single binary convolution, and about 6× faster on Bi-Real Net 18 (a BNN variant of ResNet-18). daBNN is a BSD-licensed inference framework, and its source code, sample projects and pre-trained models are available on-line: https://github.com/JDAI-CV/dabnn.
INTRODUCTION
The advances in Deep Neural Networks (DNNs) have substantially pushed the limits and reached new state-of-the-arts in multimedia and computer vision. These advances rely heavily on high-performance computational accelerators, i.e., GPUs. Nevertheless, there has recently been exponential growth in DNN-based apps for low-end ARM devices (e.g., mobile phones). For instance, in the month of Sep. 2018, Android users downloaded 221 DNN-based apps around 13 million times from the official Google Play market, and the number of DNN-based apps increased by 27% over 3 months [13]. Meanwhile, most major vendors have developed DNN inference frameworks tailored to ARM devices, e.g., TensorFlow Lite [2] from Google and Caffe2 [1] from Facebook, which quickly gained popularity for their stronger privacy protection and lower cost of data transmission. However, the inference efficiency of DNNs on ARM devices is often limited by the relatively small memory and inferior computing power of mobile phones or embedded devices, which hinders the deployment of heavy DNNs on such devices. One feasible way to alleviate this problem is to utilize Binary Neural Networks (BNNs), which quantize the activations and weights of DNNs to 1 bit and lead to a significant speed-up over the full-precision counterpart through efficient bit-wise operations.
In the literature, there are several inference frameworks for BNNs: BitStream [15], BitFlow [7], and BMXNet [14]. Among them, BitStream and BitFlow are not open-source and not available to the public. To the best of our knowledge, the only open-source BNN inference framework is BMXNet [14], which we found to be even slower than full-precision TensorFlow Lite in our experiments. These facts motivate and highlight the exploration of an open-source BNN inference framework highly optimized for ARM devices.
To address these problems, we present daBNN, a super fast inference framework that implements BNNs on ARM devices with several uniquely devised technologies for speeding up inference. In particular, an upgraded bit-packing scheme packs multiple elements simultaneously, which improves the speed of the naive sequential method by about 4×. Moreover, daBNN capitalizes on "binary direct convolution" to squeeze out the cost of the extra instructions in binary convolution, and meanwhile a novel memory layout is leveraged to reduce memory access. daBNN is written in C++ and ARM assembly, and provides a Java binding and an Android package. Experiments demonstrate that our daBNN is extremely fast. Specifically, compared to BMXNet, our daBNN is 7×-23× faster on a single binary convolution, and about 6× faster on Bi-Real Net 18 [9] (a BNN variant of ResNet-18 [6]). Compared to full-precision TensorFlow Lite, our daBNN is 8×-10× faster on a single binary convolution, and about 3× faster on Bi-Real Net 18. We believe daBNN will offer fertile ground for deploying BNNs in industry and for designing novel BNN structures in academia.
DABNN
In this section, we present the implementation details of daBNN, the comparison to existing software, and its potential to help the architecture design of BNNs.
Implementation Details
Bit-packing.
Bit-packing is a common scheme in BNNs: N (e.g., 128) elements are binarized to 1 bit each (i.e., 1 or 0) and packed into an N-bit vector, so that xnor can be performed directly between these binarized vectors. Previous works [7, 14, 15] often perform bit-packing in a naive way. For example, [7] compares every 32-bit element with zero and sets the corresponding bit sequentially. Unlike them, we directly utilize the existing sign bit of int32 and IEEE 754 float numbers, without any additional comparison against zero. Moreover, we adopt the "right-shift-and-overwrite" SIMD (single instruction, multiple data) instruction, which gathers multiple sign bits simultaneously. This instruction takes three operands: two vectors α and β (each containing several M-bit elements) and a scalar k. It shifts every M-bit element of vector α right by k bits and overwrites the rightmost M − k bits of the corresponding M-bit element of vector β. In addition, we further upgrade bit-packing by scattering these instructions across different registers to avoid write-after-write data hazards, as shown in Figure 1. The experimental results in Figure 2 illustrate that our upgraded bit-packing scheme is ∼4× as fast as the naive way. Note that our method is also compatible with the fused-BN-binarization layer in [15].
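To make the sign-bit idea concrete, below is a minimal scalar sketch, not daBNN's actual kernel (which packs many elements at once with the SIMD shift-right-and-overwrite instruction in assembly). It simply extracts the existing sign bit of each IEEE 754 float and packs the bits into 64-bit words; the function name and the bit convention (1 for a negative value) are illustrative assumptions.

```cpp
// Minimal scalar sketch of sign-bit packing (illustration only, not daBNN's kernel).
// The sign bit of an IEEE 754 float (bit 31) already encodes sign(x), so no
// comparison with zero is needed.
#include <cstdint>
#include <cstring>
#include <vector>

// Packs the sign bits of `n` floats (n assumed to be a multiple of 64) into
// 64-bit words. Bit = 1 when the float is negative, 0 otherwise.
std::vector<uint64_t> pack_signs(const float* x, size_t n) {
    std::vector<uint64_t> packed(n / 64, 0);
    for (size_t i = 0; i < n; ++i) {
        uint32_t bits;
        std::memcpy(&bits, &x[i], sizeof(bits));   // reinterpret the float bits
        uint64_t sign = bits >> 31;                // grab the existing sign bit
        packed[i / 64] |= sign << (i % 64);
    }
    return packed;
}
```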
Binary Direct Convolution.
SGEMM (Single-precision GEneral Matrix Multiplication) is a widely adopted approach to implementing float convolutions in various high-performance scientific programs.
In the context of BNNs, the counterpart of SGEMM is BGEMM, which performs binary matrix multiplication for binary convolution. In addition to the common multiplication and addition operations, BGEMM includes extra operations that count how many 1s are in a vector. Specifically, we denote by $\mathbb{U}^{M \times N}$ the space of matrices of dimension $M \times N$ in which every element is a bit-packed vector. Given two matrices $A \in \mathbb{U}^{M \times K}$ and $B \in \mathbb{U}^{K \times N}$, and $C \in \mathbb{N}^{M \times N}$ ($\mathbb{N}$ denotes the set of non-negative integers), $C = \mathrm{BGEMM}(A, B)$ is measured as

$C_{i,j} = \sum_{k=1}^{K} \mathrm{bitcount}(\mathrm{xnor}(\vec{A}_{i,k}, \vec{B}_{k,j}))$   (1)
where $\vec{A}_{i,k}$ and $\vec{B}_{k,j}$ denote the elements of $A$ and $B$. In SGEMM, to amortize the cost of loading memory, $C$ is often calculated by accumulating outer products:

$C^{(k)} = m_k n_k$   (2)

$C \mathrel{+}= C^{(k)}$   (3)
where $m_k$ is the $k$-th column of $A$ and $n_k$ is the $k$-th row of $B$. We argue that this scheme is sub-optimal for BGEMM, especially on ARM devices. In particular, on ARMv8 (the 64-bit ARM architecture) devices, the bitcount operation consists of two instructions: "cnt" and "addv". "cnt" takes an $N$-byte vector $\alpha$ as input and outputs an $N$-byte vector $\beta$ such that $\beta_i = \mathrm{the\_number\_of\_1s}(\alpha_i)$, where $\alpha_i$ and $\beta_i$ are the $i$-th bytes of $\alpha$ and $\beta$ respectively. "addv" sums up all bytes in a vector and outputs the aggregated scalar. Eq. 3 is then expanded as

$C_{i,j} \mathrel{+}= \mathrm{addv}(\mathrm{cnt}(\mathrm{xnor}(\vec{A}_{i,k}, \vec{B}_{k,j})))$   (4)
Thus, Eq. 4 shows that one binary multiply-add on ARMv8 devices consists of four instructions: xnor, cnt, addv, and addition. Moreover, on ARMv7 (the 32-bit ARM architecture) devices there is no "addv" instruction at all, and $\lceil \log_2 N \rceil$ instructions are needed to sum up all bytes of an $N$-byte vector, so one binary multiply-add costs $\lceil \log_2 N \rceil + 3$ instructions on these devices. To improve the efficiency of this operation, we re-arrange the calculation order and compute each element of $C = \mathrm{BGEMM}(A, B)$ as the product of a row vector $\vec{p}_i \in \mathbb{U}^{1 \times K}$ and a column vector $\vec{q}_j \in \mathbb{U}^{K \times 1}$:

$C_{i,j} = \vec{p}_i \, \vec{q}_j$   (5)
where $\vec{p}_i$ is the $i$-th row of $A$ and $\vec{q}_j$ is the $j$-th column of $B$. With the instructions above, Eq. 5 expands on ARMv8 as

$C_{i,j} = \sum_{k=1}^{K} \mathrm{addv}(\mathrm{cnt}(\mathrm{xnor}(\vec{p}_{i,k}, \vec{q}_{j,k})))$   (6)
In this way, the cost of the "addv" instructions can be mostly squeezed out by summing up the results of "cnt" in advance:

$C_{i,j} = \mathrm{addv}\Big(\sum_{k=1}^{K} \mathrm{cnt}(\mathrm{xnor}(\vec{p}_{i,k}, \vec{q}_{j,k}))\Big)$   (7)
Please note that the same transformation cannot be applied to Eq. 4, because there $C$ is stored as 32-bit integers to save the valuable registers; in Eq. 4 we therefore have to use "addv" to reduce the vector into an integer before every "addition" instruction. Taking a closer look at Eq. 6 and 7, we can observe an interesting connection between them and the operation of convolution. Specifically, if we treat $A \in \mathbb{U}^{M \times K}$ as the weight and $B \in \mathbb{U}^{K \times N}$ as the im2col-ed input ($M$: the number of output channels, $N$: output height × output width, and $K$: the number of bit-packed vectors in a weight filter), Eq. 6 and 7 can be directly interpreted as the definition of convolution. As such, we dub this refined operation of binary convolution "binary direct convolution".
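As a concrete illustration of the reordering in Eq. 7, the following sketch shows an AArch64 NEON intrinsics inner loop, written under our own assumptions rather than taken from daBNN's hand-written assembly: the per-byte "cnt" results are kept in a vector accumulator, and the across-vector reduction ("addv") is performed only once after the loop over $K$. The function name binary_dot and the data layout are illustrative.

```cpp
// Hedged sketch (AArch64 only, not daBNN's actual kernel): one binary dot
// product following Eq. 7, i.e., accumulate per-byte popcounts in a vector
// register and reduce across the vector a single time at the end.
#include <arm_neon.h>
#include <cstdint>

// a, b: K bit-packed 128-bit vectors each (stored as 16*K bytes).
// Returns sum_k bitcount(xnor(a_k, b_k)). Assumes K < 4096 so the 16-bit
// accumulator lanes cannot overflow (fine for one output pixel of a conv).
uint32_t binary_dot(const uint8_t* a, const uint8_t* b, int K) {
    uint16x8_t acc = vdupq_n_u16(0);
    for (int k = 0; k < K; ++k) {
        uint8x16_t va = vld1q_u8(a + 16 * k);
        uint8x16_t vb = vld1q_u8(b + 16 * k);
        uint8x16_t x  = vmvnq_u8(veorq_u8(va, vb));  // xnor
        uint8x16_t c  = vcntq_u8(x);                 // per-byte popcount ("cnt")
        acc = vpadalq_u8(acc, c);                    // widen and accumulate, no "addv" yet
    }
    return vaddlvq_u16(acc);                         // single across-vector reduction
}
```

In the BGEMM-style schedule of Eq. 4, by contrast, the reduction to a 32-bit integer has to happen inside the loop, once per multiply-add.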
Though binary direct convolution improves efficiency by eliminating most "addv" instructions, it incurs more memory accesses. To compensate for this increase, we devise a novel memory layout, $NC_1HWC_2$, which exploits the spatial redundancy of convolutions, where $C_1 = C/C_2$ and $C_2$ is the length of a register (i.e., 128 bits on ARM devices). $NC_1HWC_2$ can be regarded as a refinement of NHWC that splits the channel dimension into several groups of $C_2$ bits, so that each register holds all channels of one group. A diagram of $NC_1HWC_2$ is illustrated in Figure 3, which shows that 2/3 of the registers loaded for the previous location are reused at the current one (i.e., 2/3 of the memory accesses are saved). The experimental results in Figure 4 clearly show that the extra "addv" instructions slow BGEMM down, and that our binary direct convolution with $NC_1HWC_2$ is faster than BGEMM. Note that while we only present the implementation details for ARMv8 here, the proposed binary direct convolution and the $NC_1HWC_2$ memory layout are also compatible and effective on ARMv7.
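For clarity, here is a small sketch of how a channel group is addressed under a $NC_1HWC_2$-style layout with $C_2 = 128$ packed bits per group. The constant kC2, the function name, and the parameter conventions are our illustrative assumptions, not daBNN's internal API.

```cpp
// Illustrative addressing sketch for an NC1HWC2-style layout: channels are
// split into groups of kC2 bits, and each group (one 128-bit register worth
// of data) is stored contiguously at every spatial location.
#include <cstddef>

constexpr size_t kC2 = 128;  // bits per group = register width (assumption)

// Offset, in units of 128-bit groups, of the group holding channel c at
// spatial position (h, w) of sample n, for a tensor with C channels
// (C assumed divisible by kC2), height H, and width W.
size_t nc1hwc2_offset(size_t n, size_t c, size_t h, size_t w,
                      size_t C, size_t H, size_t W) {
    size_t C1 = C / kC2;   // number of channel groups
    size_t c1 = c / kC2;   // which group the channel falls into
    return ((n * C1 + c1) * H + h) * W + w;
}
```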
Comparison to Existing Software
We compare our daBNN with other BNN inference frameworks in Table 1, and summarize the following two key differences: (1) daBNN is open-source. Although BitStream and BitFlow are claimed to be fast, neither of them is available on-line.
(2) daBNN is super fast. To the best of our knowledge, daBNN is the first BNN implementation that is compared against a modern mobile inference framework (e.g., TensorFlow Lite) rather than Caffe [8] or OpenBLAS. Figure 5 compares the latency of our daBNN with that of existing software (i.e., TensorFlow Lite, BMXNet, and Caffe). Compared to TensorFlow Lite, our daBNN is 8×-10× faster on a single binary convolution, and about 3× faster on Bi-Real Net 18. Note that the only existing open-source BNN implementation, BMXNet, is even slower than TensorFlow Lite on Bi-Real Net 18 and on several convolutions.
Figure 4: Latency comparison between different convolution methods. "BGEMM without addv" denotes an abnormal implementation of BGEMM with the "addv" instructions removed. It clearly shows that the "addv" instructions make BGEMM slower than our binary direct convolution.
Help to Network Design
daBNN is the first highly-optimized open-source BNN inference framework. It not only enables the deployment of BNNs in industry, but also helps researchers design BNNs. For example, it is common practice to keep the first and last layers of a BNN in full precision. However, it has gone largely unnoticed that the first layer, which is usually a convolution with a large kernel, often accounts for more than half of the latency of a binary neural network. Given this observation, we replace the first 7×7 convolution layer in Bi-Real Net 18 [9] with a STEM module [12] as shown in Figure 6, and obtain a 30% speed-up on Google Pixel 1 without any accuracy loss, as shown in Table 2. We could not have made this simple but effective improvement with an under-optimized framework like BMXNet, whose binary convolutions are even slower than float convolutions.
MODEL CONVERSION TOOL
We present a model conversion tool, named onnx2bnn, to convert trained BNN models into the daBNN format. We provide pre-built onnx2bnn binaries for Linux, macOS and Windows, so no compilation is required on the user side. The tool supports the ONNX (Open Neural Network Exchange) [3] format, which is supported by or officially integrated into many frameworks and tools [4, 5, 10, 11]. We depend only on standard ONNX operators to ensure interoperability. By contrast, BMXNet implements and depends on custom MXNet operators (like "QConvolution"), so BNN models trained in other deep learning frameworks (e.g., TensorFlow [4] and PyTorch [10]) cannot be easily deployed on BMXNet.
Our model conversion tool recognizes whether a tensor is binary in several ways, e.g., by checking whether the tensor is the output of a Sign operator. Convolutions whose input and weight are both binary are then converted into binary convolutions in the daBNN format. The weights of binary convolutions are bit-packed, so the model size is drastically compressed (32× if all weights are packed).
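The 32× figure follows directly from packing: every 32-bit float weight is reduced to a single bit. A tiny back-of-the-envelope sketch (the layer sizes are arbitrary examples, not measurements):

```cpp
// Back-of-the-envelope check of the 32x weight compression mentioned above:
// a binary 3x3 convolution filter bank stored as packed bits vs. 32-bit floats.
#include <cstdio>

int main() {
    const long out_c = 256, in_c = 256, k = 3;       // example layer shape
    const long n_weights    = out_c * in_c * k * k;  // 589,824 weights
    const long float_bytes  = n_weights * 4;         // 32-bit float weights
    const long packed_bytes = n_weights / 8;         // 1 bit per weight after packing
    std::printf("float: %ld bytes, packed: %ld bytes, ratio: %ldx\n",
                float_bytes, packed_bytes, float_bytes / packed_bytes);
    return 0;
}
```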
CONCLUSION AND FUTURE WORK
We presented daBNN, a super fast binary neural network inference framework for ARM devices. We implement binary convolution and other operators in ARM assembly. Extensive experiments show that our daBNN is substantially faster than both BMXNet and modern full-precision inference frameworks. We believe that daBNN will greatly help both the deployment and the design of BNNs. For ease of use, we publish pre-built binaries, libraries and a sample Android project. Our source code, sample project, documentation and pre-trained models are published on GitHub. In the future, we plan to implement BNN support on X86 and RISC-V. We also look forward to cooperating with research teams to design or search for better BNN structures.
