Deep neural networks (DNNs) are widely used in most current automatic speech recognition (ASR) systems. To guarantee good recognition performance, DNNs usually require significant computational resources, which limits their application to low-power devices. Thus, it is appealing to reduce the computational cost while keeping the accuracy. In this work, in light of their success in image recognition, binary DNNs are applied to speech recognition, where they achieve competitive performance and a substantial speedup. To our knowledge, this is the first time that binary DNNs have been used in speech recognition. In binary DNNs, network weights and activations are constrained to binary values, which enables faster matrix multiplication based on bit operations. By exploiting hardware population count instructions, the proposed binary matrix multiplication achieves a 5 to 7 times speedup over highly optimized floating-point matrix multiplication. This results in much faster DNN inference, since matrix multiplication is the most computationally expensive operation. Experiments on TIMIT phone recognition and on a 50-hour Switchboard speech recognition task show that binary DNNs run about 4 times faster than standard DNNs during inference, with a relative accuracy reduction of roughly 10.0%.
Introduction
Deep neural networks (DNNs) have been shown to outperform Gaussian mixture models (GMMs) and have become the standard for acoustic modeling in speech recognition [1] . In recent large vocabulary continuous speech recognition systems, DNNs usually have six or more hidden layers with several thousands of neurons per layer. These models incur excessively high computational cost (mainly due to large matrix multiplication), which makes it infeasible to deploy such large models on low-power devices such as mobile phones or tablets. Hence, it is of great interest to speed up the calculation of DNNs while keeping the performance degradation in an acceptable range.
It has been reported that there is a significant redundancy in the parameterization of deep learning models [2] , which leads to a waste of computation. Based on this phenomenon, different approaches have been proposed to reduce the redundancy. Xue et al. [3] applied singular value decomposition (SVD) on weight matrices to exploit the internal sparseness. Yu et al. [4] incorporated a soft regularization during training to minimize the number of nonzero elements of weight matrices. He et al. [5] and Qian et al. [6] proposed a method based on node pruning and arc restructuring to prune DNNs for fast inference. Han et al. [7] explored an iterative process to prune unimportant connections of DNNs, which reduced the storage and computation by an order of magnitude. Novikov et al. [8] represented weight matrices in a multi-linear format such that the number of parameters was largely reduced.
In addition to these approaches, which are based on the transformation of matrices, the redundancy can also be reduced by quantization. Binary DNNs [9] are a recent technique that quantizes both network weights and activations to binary values. On a variety of image recognition benchmarks, they gain significant acceleration during inference while still achieving competitive accuracy. Hence, it is attractive to validate the effectiveness of binary DNNs in speech-related tasks.
In this paper, the use of binary DNNs in speech recognition is studied in depth. The contributions of this work are as follows: 1) it is the first time that binary DNNs are used in speech recognition; 2) binary matrix multiplication implementations are proposed that run 5-7 times faster than aggressively optimized floating-point baselines on Intel CPUs, ARM CPUs and NVIDIA GPUs; 3) detailed comparisons between binary DNNs and standard DNNs in speed and accuracy are given on two speech recognition tasks.
The rest of the paper is organized as follows. Section 2 describes the architecture of binary DNNs. Section 3 analyzes the theoretical peak performance of floating-point and binary matrix multiplication on specific hardware, then compares the real performance between the proposed binary implementations and the state-of-the-art floating-point implementations. Experimental setup and results on TIMIT phone recognition and Switchboard conversational speech recognition are given in Section 4. Finally, Section 5 concludes the whole paper.
Binary DNNs
This section gives the algorithmic description of binary DNNs and insights on the optimization for fast inference.
Binarization of weights and activations
In a feedforward neural network with L layers, let the activation in layer l be $a_l$, the weight matrix between layers l and l+1 be $W_{l+1,l}$, and the bias in layer l+1 be $b_{l+1}$, where 1 ≤ l ≤ L−1. The binarized weight matrices and activations are denoted by $\hat{W}_{l+1,l}$ and $\hat{a}_l$. Algorithm 1 describes the forward pass of binary DNNs. Note that the weight matrix $W_{2,1}$ is not binarized, since $a_1$ (i.e., the network input) cannot be properly binarized or quantized.
In Algorithm 1, the function Binarize(·) transforms each element of $W_{l+1,l}$ or $a_l$ to +1 or −1. The deterministic binarization function is defined by

$$\mathrm{Binarize}(x) = \begin{cases} +1, & x \ge 0 \\ -1, & \text{otherwise} \end{cases}$$
where x is a floating-point value. The stochastic version used in this work differs from the one in [9] and is defined by
where p is a random value drawn from a normal distribution with zero mean and unit variance. Stochastic binarization is more computationally expensive than the deterministic version, but it reduces overfitting; hence it is used to binarize the activations during training. Note that although the function HardTanh(x) = max(−1, min(x, 1)) has no effect in the forward pass, since Binarize(HardTanh(·)) = Binarize(·), it plays an important role in the backward pass.
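The identity Binarize(HardTanh(·)) = Binarize(·) can be checked numerically. The snippet below is a minimal sketch that assumes the usual sign convention for the deterministic version (non-negative inputs map to +1, all others to −1); the function names and sample values are illustrative only, and the stochastic variant is not covered.

```python
def binarize(x: float) -> int:
    """Deterministic binarization (assumed convention: x >= 0 maps to +1)."""
    return 1 if x >= 0 else -1


def hard_tanh(x: float) -> float:
    """HardTanh(x) = max(-1, min(x, 1)), as defined in the text."""
    return max(-1.0, min(x, 1.0))


# Illustrative inputs covering both saturated regions and the linear region.
samples = [-3.2, -1.0, -0.4, 0.0, 0.7, 2.5]
signs = [binarize(x) for x in samples]
signs_after_clip = [binarize(hard_tanh(x)) for x in samples]
```

Clipping to [−1, 1] never changes the sign of its input, so the two lists agree element by element.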
Algorithm 1 Forward pass of binary DNNs
Input: input $a_1$ (which is also the activation of the input layer), weight matrices $W_{2,1}, \cdots, W_{L,L-1}$, and biases $b_2, \cdots, b_L$
Output: activations $a_2, \cdots, a_L$
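As an illustration of the forward pass, the following Python sketch chains the modules named in the text (Binarize, BatchNorm, HardTanh). It is not a transcription of Algorithm 1: the module order, the use of batch normalization on the output layer, the choice to return binarized hidden activations, and all function and parameter names are assumptions made for this example.

```python
import math


def binarize(x):
    return 1.0 if x >= 0 else -1.0


def hard_tanh(x):
    return max(-1.0, min(x, 1.0))


def affine(W, a, b):
    # Plain matrix-vector product plus bias, with W given as a list of rows.
    return [sum(w * x for w, x in zip(row, a)) + bias for row, bias in zip(W, b)]


def batch_norm(z, gamma, beta, mean, var, eps=1e-5):
    return [g * (v - m) / math.sqrt(s + eps) + bt
            for v, g, bt, m, s in zip(z, gamma, beta, mean, var)]


def forward(a1, weights, biases, bn_params):
    """Return the activations of layers 2..L (hidden ones binarized)."""
    a_hat, outputs = a1, []
    last = len(weights) - 1
    for idx, (W, b, bn) in enumerate(zip(weights, biases, bn_params)):
        # The first weight matrix stays real-valued because the input is not binarized.
        W_used = W if idx == 0 else [[binarize(w) for w in row] for row in W]
        z = batch_norm(affine(W_used, a_hat, b), *bn)
        if idx < last:
            a_hat = [binarize(hard_tanh(v)) for v in z]
            outputs.append(a_hat)
        else:
            outputs.append(z)  # the output layer is left real-valued (assumption)
    return outputs


# Tiny hypothetical network: 2 inputs, 2 hidden units, 1 output.
weights = [[[1.0, 2.0], [-1.0, 0.5]], [[0.3, -0.7]]]
biases = [[0.0, 0.0], [0.1]]
bn_params = [([1.0, 1.0], [0.0, 0.0], [0.0, 0.0], [1.0, 1.0]),
             ([1.0], [0.0], [0.0], [1.0])]
outputs = forward([0.5, -1.0], weights, biases, bn_params)
```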
Straight through estimator
While the forward pass is straightforward, there is an issue in the backward pass: mathematically, the gradient of Binarize(·) with respect to its input is zero almost everywhere, which makes gradient-based training impossible. However, this can be resolved by using the "straight through estimator" (STE) [10]. Here a variant of the STE from [9] is used, which cancels the gradient when the magnitude of the input is too large:

$$\frac{\partial C}{\partial p} = \frac{\partial C}{\partial q} \cdot \mathbf{1}_{|p| \le 1}$$

where q = Binarize(p) and C denotes the training cost.
The indicator function $\mathbf{1}_{|p| \le 1}$ is exactly the gradient of HardTanh(·). Thus, technically, in the backward pass Binarize(·) can be treated as an identity function.
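A minimal sketch of this gradient rule is given below. It assumes p is the input to Binarize(·) and that the upstream gradient with respect to the binarized output is already available; the function name and the numbers are illustrative.

```python
def ste_backward(p, grad_output):
    """Pass the upstream gradient through where |p| <= 1 and cancel it elsewhere."""
    return [g if abs(x) <= 1.0 else 0.0 for x, g in zip(p, grad_output)]


# Illustrative pre-binarization values and upstream gradients.
pre_activations = [-2.0, -1.0, 0.3, 1.5]
upstream = [0.5, -0.2, 0.8, 1.0]
grads = ste_backward(pre_activations, upstream)
```

Entries whose input magnitude exceeds 1 receive a zero gradient, which is the same mask that differentiating HardTanh would produce.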
Optimization for fast inference

Module reforming
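Batch normalization [11] used in Algorithm 1 can be described as

$$\mathrm{BatchNorm}(x) = \gamma \, \frac{x - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta$$

where x is the layer input, γ and β are the learnable parameters, and ε is a small value to avoid underflow. During inference, the mean µ and variance σ² are replaced with fixed values estimated over the training data, which leads to an efficient and compact representation:

$$\mathrm{BatchNorm}(x) = \gamma' x + \beta'$$

where $\gamma' = \gamma / \sqrt{\sigma^2 + \epsilon}$ and $\beta' = \beta - \gamma \mu / \sqrt{\sigma^2 + \epsilon}$; both can be precomputed before deployment.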
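The folding of batch normalization into a single scale and shift can be sketched as follows, assuming the standard form γ(x − µ)/√(σ² + ε) + β. The names `fold_batch_norm`, `scale` and `shift`, and the parameter values, are illustrative.

```python
import math


def batch_norm(x, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch normalization with fixed statistics."""
    return gamma * (x - mean) / math.sqrt(var + eps) + beta


def fold_batch_norm(gamma, beta, mean, var, eps=1e-5):
    """Precompute (scale, shift) so that batch_norm(x) == scale * x + shift."""
    scale = gamma / math.sqrt(var + eps)
    shift = beta - scale * mean
    return scale, shift


# Hypothetical parameters for one unit.
gamma, beta, mean, var = 1.5, -0.3, 0.8, 4.0
scale, shift = fold_batch_norm(gamma, beta, mean, var)
max_error = max(abs(scale * x + shift - batch_norm(x, gamma, beta, mean, var))
                for x in [-2.0, 0.0, 0.8, 3.5])
```

After folding, each unit costs one multiplication and one addition at inference time instead of a subtraction, a square root, a division, a multiplication and an addition.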
Module compacting
During training, floating-point weights are essential for the update, but during inference only their binarized versions are used. For this reason, by replacing $W_{l,l-1}$ with $\hat{W}_{l,l-1}$ in Algorithm 1, the module $\hat{W}_{l,l-1}$ = Binarize($W_{l,l-1}$) can be removed. Moreover, since Binarize(HardTanh(·)) = Binarize(·), the module $a_l$ = HardTanh($a_l$) can also be removed. Furthermore, the modules $a_l$ = BatchNorm($a_l$) and $a_l$ = Binarize($a_l$) can be seamlessly integrated into one module so that unnecessary computation is avoided.
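One possible way to merge BatchNorm and Binarize into a single module is a threshold comparison. The sketch below is an assumption about how such a fusion could look, not the implementation used in this work; it takes batch normalization in its folded inference form scale · x + shift, and `scale`, `shift` and the function names are hypothetical.

```python
def bn_then_binarize(x, scale, shift):
    """Reference: apply folded batch normalization, then take the sign."""
    return 1 if scale * x + shift >= 0 else -1


def fused_bn_binarize(x, scale, shift):
    """Fused version: compare x against a precomputable threshold."""
    if scale > 0:
        return 1 if x >= -shift / scale else -1
    if scale < 0:
        return 1 if x <= -shift / scale else -1
    return 1 if shift >= 0 else -1  # degenerate case: output does not depend on x


inputs = [-1.0, 0.0, 0.5, 1.0]
fused_pos = [fused_bn_binarize(x, 2.0, -1.0) for x in inputs]
fused_neg = [fused_bn_binarize(x, -2.0, 1.0) for x in inputs]
```

Because the threshold −shift/scale can be stored per unit ahead of time, the fused module needs only one comparison per activation.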
Binary matrix multiplication
This section describes population count based binary matrix multiplication, analyzes its performance gain and compares it with floating-point matrix multiplication in both theory and practice.
Population count based binary matrix multiplication
In practical applications, multiplying two matrices $A \in \mathbb{R}^{m \times k}$ and $B \in \mathbb{R}^{k \times n}$ requires 2 × m × n × k floating-point arithmetic operations (i.e., multiplications and additions). Because of hardware limits on these operations, the speed of floating-point matrix multiplication is hard to improve.
However, in the binary case, it is possible to replace most multiplications and additions with bit operations, as each element of the resulting matrix can be represented by an inner product: $C_{ij} = \sum_k A_{ik} \cdot B_{kj}$. To explain this in detail, two bit operations, xor and popcnt, are defined as follows: 1) xor(x, y) is the bitwise exclusive or of integers x and y; 2) popcnt(x) is the number of bits set to 1 in integer x. Assume there are two vectors a and b of length n whose elements are constrained to be either +1 or −1 (binary values). Such a representation of a or b is not suitable for fast computation, so an extra processing step is needed. First, all −1s are replaced with 0s; then the n bits are packed into n/k k-bit integers (k = 32 or 64 is the word width). This process converts a and b to $\bar{a}$ and $\bar{b}$. The inner product of a and b can then be calculated as

$$a \cdot b = n - 2 \times \mathrm{popcnt}(\mathrm{xor}(\bar{a}, \bar{b}))$$
For a better understanding, an example is given here. Let a and b be two vectors of length 8: a = (1, −1, 1, 1, 1, 1, 1, 1) and b = (−1, 1, 1, −1, −1, 1, −1, 1), so that $\bar{a} = 10111111_2$ and $\bar{b} = 01100101_2$.
It is easy to verify that their inner product is −2, which is equal to 8 − 2 × popcnt(xor($\bar{a}$, $\bar{b}$)) = 8 − 2 × popcnt(xor($10111111_2$, $01100101_2$)) = 8 − 2 × popcnt($11011010_2$) = 8 − 2 × 5.
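The worked example can be reproduced with a few lines of Python. This is a minimal sketch rather than the optimized implementation: the packing order (first element in the most significant bit) and the helper names are assumptions, `b` is the vector whose packed form is 01100101 in binary, and `bin(x).count("1")` stands in for a hardware population count instruction.

```python
WORD_BITS = 64  # word width; 32 would work the same way


def pack_bits(vec):
    """Pack a +1/-1 vector into integers, mapping -1 to bit 0 and +1 to bit 1."""
    words = []
    for start in range(0, len(vec), WORD_BITS):
        word = 0
        for v in vec[start:start + WORD_BITS]:
            word = (word << 1) | (1 if v == 1 else 0)
        words.append(word)
    return words


def popcnt(x):
    """Number of bits set to 1 in a non-negative integer."""
    return bin(x).count("1")


def binary_dot(a, b):
    """Inner product of two +1/-1 vectors via xor and popcnt."""
    assert len(a) == len(b)
    mismatches = sum(popcnt(x ^ y) for x, y in zip(pack_bits(a), pack_bits(b)))
    return len(a) - 2 * mismatches


a = [1, -1, 1, 1, 1, 1, 1, 1]
b = [-1, 1, 1, -1, -1, 1, -1, 1]
result = binary_dot(a, b)
```

The xor marks the positions where the two vectors disagree. Each disagreement contributes −1 and each agreement contributes +1 to the inner product, which gives n minus twice the number of disagreements.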
Binary matrix multiplication on CPU
Recent CPUs have built-in support for population count instructions on both 32-bit and 64-bit operands. On the Intel Haswell microarchitecture, two 8-wide FMA (fused multiply-add) instructions can be performed every cycle, yielding 32 32-bit floating-point operations per cycle [12]. By contrast, one 64-bit population count instruction can be issued every cycle, yielding 128 binary operations per cycle (other instructions such as xor can be issued simultaneously thanks to superscalar execution). In other words, population count based binary matrix multiplication has 128/32 = 4.0 times the theoretical throughput of optimized floating-point matrix multiplication. On the ARM Cortex A72 microarchitecture, one 4-wide FMA instruction, or up to 8 floating-point operations, can be performed per cycle. Although the precise timing of the population count instruction on ARM cannot be estimated, empirical evaluation shows that the speedup it delivers is considerable. To achieve compute-bound performance, the implementation of binary matrix multiplication uses packing to ensure access to consecutive memory locations, and cache- and register-aware blocking to maximize data reuse. For more details, please refer to [13, 14].
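The Haswell peak-throughput figures quoted above can be spelled out as follows. The decomposition of the 128 binary operations into 64 bit positions times two operations (one multiplication and one addition per position) is an interpretation of the text rather than something it states explicitly.

```latex
\begin{align*}
\text{floating point:}\quad & 2~\text{FMA/cycle} \times 8~\text{lanes} \times 2~\text{ops/FMA} = 32~\text{ops/cycle} \\
\text{binary:}\quad & 1~\text{popcnt/cycle} \times 64~\text{bits} \times 2~\text{ops/bit} = 128~\text{ops/cycle} \\
\text{ratio:}\quad & 128 / 32 = 4
\end{align*}
```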
An Intel i3-4150 CPU (Intel Haswell microarchitecture) running at 3.50 GHz and a HiSilicon Kirin 950 CPU (ARM Cortex A72 microarchitecture) running at 2.30 GHz were used to compare the speed of binary and floating-point matrix multiplication. The floating-point baseline used Intel Math Kernel Library 11.3 Update 3 on the Intel platform and OpenBLAS 0.2.19 on the ARM platform to achieve maximum speed. To determine the maximum real performance, a matrix multiplication of size (m, n, k) = (2048, 2048, 2048) is commonly used. It is worth noting that, in practice, the batch size m is much smaller than 2048 during inference. Hence, the performance of a matrix multiplication of size (16, 2048, 2048), corresponding to m = 16, was also evaluated. Table 1 reports the single-thread GOPS (i.e., billions of floating-point or binary arithmetic operations, or ops, per second) on the Intel i3-4150 CPU. When the batch size is 16, binary matrix multiplication achieves a 7.2× speedup. Table 2 reports the single-thread GOPS on the HiSilicon Kirin 950 CPU. Similar to the results in Table 1, binary matrix multiplication is 6.7× faster when the batch size is 16.
On the NVIDIA GPU, the implementation of binary matrix multiplication is a modified version of an example program in the CUDA toolkit documentation, with multiply-add operations replaced by bit operations. The floating-point baseline used the NVIDIA cuBLAS library to achieve the best performance. Table 3 reports the GOPS of both implementations. When the batch size is 16, the proposed binary matrix multiplication achieves a 5.4× speedup.
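For reference, the GOPS metric follows directly from the 2 × m × n × k operation count. The sketch below shows the conversion from a measured run time; the helper names are illustrative and the 1 ms timing is a made-up placeholder, not a measurement from the tables.

```python
def matmul_ops(m, n, k):
    """Arithmetic operations (multiplications plus additions) in an m x k by k x n product."""
    return 2 * m * n * k


def gops(m, n, k, seconds):
    """Billions of operations per second for one matrix multiplication."""
    return matmul_ops(m, n, k) / seconds / 1e9


small_batch_ops = matmul_ops(16, 2048, 2048)
large_batch_ops = matmul_ops(2048, 2048, 2048)
example_gops = gops(16, 2048, 2048, seconds=1e-3)  # hypothetical 1 ms run time
```

The same operation count is used for the floating-point and binary variants, so the ratio of two GOPS values on the same problem size equals the ratio of the run times.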
Experiments and results
This section describes the experiments on the TIMIT phone recognition task and a 50-hour Switchboard speech recognition task. Standard DNNs and binary DNNs are compared in speed and recognition accuracy on both tasks. All DNN models were trained using Torch7 [15]. The classical GMM-HMM models that generate the alignments for DNN training were trained with Kaldi [16]. Kaldi was also used for decoding.
Experiments on TIMIT phone recognition task
The TIMIT corpus of read speech contains 4288 sentences spoken by 630 speakers selected from 8 major dialect regions of American English. The training set contains 3696 sentences from 462 speakers. The development set contains 400 sentences from 50 speakers and the test set contains 192 sentences from 24 speakers. 95% of the training set sentences were used as training data and the remaining 5% were used as validation data. Recognition accuracies are reported on the development set and the test set. First, 13-dimensional mel-frequency cepstral coefficient (MFCC) features were extracted with per-speaker cepstral mean normalization (CMN), then spliced in time with a context of ±3 frames and projected to 40 dimensions with linear discriminant analysis (LDA). The resulting features were de-correlated using a maximum likelihood linear transform (MLLT) and then speaker-normalized with feature-space maximum likelihood linear regression (fMLLR).
The input vector was constructed by concatenating 11 consecutive frames (a central frame with a context of ±5 frames) of 40-dimensional fMLLR features, and the one-hot target vector corresponding to the central frame was generated by state-level forced alignment of a GMM-HMM model with 1947 tied triphone states. To classify the central frame, a model consisting of an input layer with 440 units, 6 hidden layers with 1024 units per layer, and an output layer with 1947 units was used. ReLU was used as the non-linear function in the hidden layers of the standard model. Cross entropy (CE) was used as the training criterion. Stochastic gradient descent (SGD) was used to train the standard model, with an initial learning rate of 0.1, while AdaMax [17] was used to train the binary model, with an initial learning rate of 0.001. During training, the batch size was 256.
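The construction of the 440-dimensional input from 11 frames of 40-dimensional features can be sketched as below. How the first and last frames of an utterance are handled is not specified in the text; repeating the edge frame is an assumption of this example, as are the function name and the toy data.

```python
def splice_frames(frames, context=5):
    """Concatenate each frame with `context` frames on both sides.

    Frames beyond the utterance boundaries are replaced by the nearest
    edge frame (an assumption, not taken from the text).
    """
    n = len(frames)
    spliced = []
    for t in range(n):
        window = []
        for offset in range(-context, context + 1):
            idx = min(max(t + offset, 0), n - 1)
            window.extend(frames[idx])
        spliced.append(window)
    return spliced


# Toy utterance: 20 frames of 40-dimensional features, frame t filled with the value t.
frames = [[float(t)] * 40 for t in range(20)]
inputs = splice_frames(frames)
```

Each spliced vector has (2 × 5 + 1) × 40 = 440 entries, matching the size of the input layer.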
Both models were evaluated on an Intel i3-4150 CPU. During inference, the batch size was set to 16. Phone error rate (PER) and frames per second (FPS) are shown in Table 4. Although the first layer is not binarized, in the forward pass the binary model is still 4.0× faster than the highly optimized standard model. The accuracy gap between the two models is small: the relative PER increase of the binary model is only 7% on both sets.
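The relative error increase and the speedup are simple ratios against the standard model. The following sketch shows the arithmetic; the helper names are illustrative, and the PER and FPS values are hypothetical placeholders chosen for the example, not the figures from Table 4.

```python
def relative_increase(baseline, value):
    """Relative increase of `value` over `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


def speedup(baseline_fps, fps):
    """How many times faster a model is, given frames per second."""
    return fps / baseline_fps


# Hypothetical numbers for illustration only.
per_increase = relative_increase(baseline=20.0, value=21.4)
fps_ratio = speedup(baseline_fps=1000.0, fps=4000.0)
```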
Experiments on 50-hour Switchboard speech recognition task
For fast development, a subset of the Switchboard corpus was used in this experiment. The subset contains 50 hours of audio spoken by 810 speakers randomly chosen from the whole 309-hour Switchboard dataset. The Switchboard/CallHome (referred to as SWB and CH) portion of the NIST Hub5 2000 evaluation set and the Fisher/Switchboard (referred to as FSH and SWB) portion of the Rich Transcription 2003 evaluation set were used as the test sets.
36-dimensional log mel-frequency filter bank (FBANK) features, along with their first- and second-order derivatives, were extracted, and per-speaker CMN was applied to the features.
The input was formed from 11 consecutive frames of these acoustic features, and the corresponding target was built by state-level forced alignment of a GMM-HMM model with 2723 tied triphone states. The model had 6 hidden layers with 2048 units per layer. Sigmoid was used as the non-linear function in the hidden layers of the standard model. The training criterion, learning rate and batch size were the same as those of the TIMIT task. A trigram language model trained on the Switchboard transcripts was used for decoding. Word error rate (WER) was used as the performance measure. During inference, the batch size was set to 16. Table 5 shows the results of the standard model and the binary model, both evaluated on an Intel i3-4150 CPU. Similar to the results of the TIMIT task, the proposed binary model achieves a substantial 3.7× speedup over the standard model. On the four test sets (SWB/CH of Hub5'00 and FSH/SWB of RT03S), the performance of the binary model is slightly worse than that of the standard model: the relative WER increases are 12.9%, 7.6%, 10.9%, and 6.7%, respectively.
Conclusions
This paper presents detailed work on using binary DNNs for fast inference in speech recognition. First, based on bit operations such as population count, efficient binary matrix multiplication is proposed for specific hardware. Compared with a highly optimized floating-point baseline, the implementation achieves a substantial 5 to 7 times speedup when the corresponding batch size is 16. As matrix multiplications account for more than 90% of the overall computation, DNNs with binary weights and activations benefit greatly from this acceleration. Then, the proposed binary model is evaluated on the TIMIT phone recognition task and a 50-hour Switchboard speech recognition task to verify the effectiveness of binary DNNs. The results show that the binary model runs about 4 times faster than the standard model during inference, with a relative accuracy reduction of only around 10.0%.
