To improve the throughput and energy efficiency of Deep Neural Networks (DNNs) on customized hardware, lightweight neural networks constrain the weights of DNNs to be a sum of a limited number (denoted as k ∈ {1, 2}) of powers of 2. In such networks, the multiply-accumulate operation can be replaced with a single shift operation, or with two shifts and an add operation. To provide even more design flexibility, the k for each convolutional filter can be optimally chosen instead of being fixed for every filter. In this paper, we formulate the selection of k to be differentiable, and describe model training for determining k-based weights on a per-filter basis. More than 46 FPGA-design experiments involving eight configurations and four data sets reveal that lightweight neural networks with a flexible k value (dubbed FLightNNs) fully utilize the hardware resources on Field Programmable Gate Arrays (FPGAs). Our experimental results show that FLightNNs can achieve a 2× speedup when compared to lightweight NNs with k = 2, with only 0.1% accuracy degradation. Compared to a 4-bit fixed-point quantization, FLightNNs achieve higher accuracy and up to 2× inference speedup, due to their lightweight shift operations. In addition, our experiments demonstrate that FLightNNs can achieve higher computational energy efficiency for ASIC implementation.
INTRODUCTION
Emerging vision, speech and natural language applications have widely adopted deep learning models and, as a result, have achieved state-of-the-art accuracy. Furthermore, recent industrial efforts have focused on implementing the models on mobile devices [1]. However, real-time applications based on these deep models may incur unacceptably large latencies and can easily drain the battery on energy-limited devices. For example, smartphones can only run the AlexNet-based object detection for one hour [2]. Therefore, prior research has proposed model compression techniques, including pruning and quantization, to satisfy the stringent energy and speed requirements [3]. One of the recently proposed quantization approaches, LightNN, constrains the weights of DNNs to be a sum of k powers of 2, and therefore can use shift and add operations to replace the multiplications between activations and weights [4]. For LightNN-1, every multiplication in the DNN is replaced by a single shift operation, while for LightNN-2, two shifts and an add replace each multiplication.

Figure 1: A discrete Pareto-optimal curve for LightNN models w.r.t. test error and latency/energy. More continuous Pareto-optimal points are needed to adapt to the latency/energy constraints determined by the hardware and application.

DAC '19
Since shift operations are much more lightweight on customized hardware (e.g., FPGA or ASIC), LightNNs can achieve faster speed and lower energy consumption, and generally maintain accuracy for over-parameterized models [4, 5] . Although LightNNs provide better energy-efficiency, they lack the flexibility to provide fine-grained trade-offs between energy and accuracy. As shown in Fig. 1 , the energy efficiency for these models also exhibits gaps, making the Pareto front of accuracy and energy discrete. However, a continuous accuracy and energy/latency trade-off is an important feature for designers to target different market segments (e.g., IoT devices, edge devices, and mobile devices).
To provide a more flexible Pareto front for the LightNN framework, we propose to equip each convolutional filter with the freedom to use a different number of shift-and-add operations to approximate multiplications. Specifically, we introduce a set of free variables $k = \{k_1, \ldots, k_F\}$, where each element represents the number of shift-and-add operations for the corresponding convolutional filter. As a result, a more continuous Pareto front can be achieved. For example, if we constrain $k \in \{1, 2\}^F$, then the throughput and energy consumption of the new model will sit between those of LightNN-1 ($k = \{1\}^F$) and LightNN-2 ($k = \{2\}^F$). Formally, we are solving $\min_{w,k} L(w, k)$, where $L$ is the loss function and $w$ is the weight vector. However, the commonly adopted stochastic gradient descent (SGD) algorithm does not apply in this case since $L$ is non-differentiable w.r.t. $k$. In this paper, we propose a differentiable training algorithm which enables end-to-end optimization with standard SGD. The resulting network is dubbed FLightNN for its flexible k values.
Prior work has extensively explored approaches to reduce latency and energy consumption of DNNs on hardware, through both algorithmic [2, 6] and hardware [7, 8] efforts. Since the latency and energy consumption of DNNs generally stem from computational cost and memory accesses, prior work in the algorithmic domain mainly focuses on the reduction of FLOPs and model size. Some work reduces the number of parameters through weight pruning [9] , while some other work introduces structural sparsity via filter pruning for Convolutional Neural Networks (CNNs) [10] to enable speedup on general hardware platforms incorporating CPUs and GPUs. To reduce the model size, previous work has also conducted neural architecture search with energy constraint [2] . In addition to algorithmic advances, prior art has also proposed methodologies to achieve fast and energy-efficient DNNs. Some previous work proposes the co-design of the hardware platform and the architecture of the neural network running on it [11] . Some work proposes more lightweight DNN units for faster inference on general-purpose hardware [12] , while others propose hardware-friendly DNN computation units to enable energy-efficient implementation on customized hardware [13] .
By reducing the weight and activation precision, DNN quantization has proved to be an effective technique to improve the speed and energy efficiency of DNNs on customized hardware, due to its lower computational cost and fewer memory accesses [14] . Gupta et al. show that a DNN with 16-bit fixed-point representation can achieve competitive accuracy compared to the full-precision network [14] . In the same vein, Zhou et al. explored the DNN accuracy w.r.t. a wide range of bit widths [15] . These uniform quantization approaches enable fixed-point hardware implementation for DNNs. Courbariaux et al. propose BinaryConnect, which uses only 1 bit for the DNN parameters, turning multiplications into XNOR operations on customized hardware [16] . However, these models require an over-parameterized model size to maintain a high accuracy [4] .
LightNNs constrain the model weights to be a power of 2, or the sum of a limited number of powers of 2 [4], while the activations use fixed-point quantization. Therefore, the multiplication between weights and activations can be implemented in hardware by shift operations and fixed-point additions. Compared to DNNs with fixed-point quantization, LightNNs replace the fixed-point multipliers with more lightweight shift operators, or with shifts and additions. Since the shift operators can be implemented using Look-Up Tables (LUTs) on an FPGA while fixed-point multipliers require Digital Signal Processing (DSP) units, LightNNs can have higher inference speed than fixed-point DNNs when run on DSP-bounded FPGAs. In addition, in an ASIC implementation, shift operations are more lightweight than multiplications, making LightNNs more energy and area efficient than fixed-point DNNs.
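As a minimal software illustration of this idea (not the hardware design evaluated in this paper), the following Python sketch multiplies an integer activation by a weight expressed as a sum of signed powers of 2. The function name `shift_add_multiply` and the `(sign, exponent)` encoding are assumptions made for this example, and exponents are restricted to non-negative integers for simplicity.

```python
def shift_add_multiply(activation, terms):
    """Multiply an integer activation by a weight given as signed powers of 2.

    `terms` is a list of (sign, exponent) pairs with sign in {+1, -1} and
    exponent >= 0, so the weight equals sum(sign * 2**exponent).
    This encoding is hypothetical and chosen only for illustration.
    """
    acc = 0
    for sign, exponent in terms:
        # One left shift per power-of-2 term; terms are combined by addition.
        acc += sign * (activation << exponent)
    return acc


# Weight 6 = 2**2 + 2**1 (k = 2): two shifts and one add.
product_k2 = shift_add_multiply(5, [(1, 2), (1, 1)])
# Weight 4 = 2**2 (k = 1): a single shift.
product_k1 = shift_add_multiply(5, [(1, 2)])
```

With k = 1 only one shift is needed, whereas k = 2 adds a second shift and an addition, which is the source of the cost gap between LightNN-1 and LightNN-2.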
However, LightNNs use a single k value (i.e., the number of shifts per multiplication) across the whole network, and therefore lack flexibility to provide a fine-grained energy/latency and accuracy trade-off for hardware designers. Therefore, we propose FLightNNs which use customized k values for each convolutional filter to enable a more continuous Pareto front. Recent work has explored the idea of differentiable training for architecture search [17] and neural network pruning [18] . In this paper, we propose an end-to-end differentiable training algorithm for FLightNNs via approximate gradient computation for non-differentiable operations and regularization to encourage sparsity. Moreover, the proposed differentiable training approach uses gradual quantization, which can achieve higher accuracy than LightNN-1 without increasing latency. In summary, this paper has the following key contributions:
(i) We propose a differentiable training algorithm for FLightNNs, which provides a continuous Pareto front for hardware designers to search for a highly accurate model under the hardware resource constraints.
(ii) The differentiable training for FLightNNs enables gradual quantization, and further pushes forward the Pareto-optimal curve.
LIGHTNN OVERVIEW
As a quantized DNN model, LightNNs constrain the weights of a network to be the sum of k powers of 2, denoted as LightNN-k. Thus, the multiplications between weights and activations can be implemented with k shift operations and k − 1 additions. Specifically, LightNN-1 constrains the weights to be a power of 2, and only uses a shift for a multiplication. The approximation function used by LightNN-k to quantize a full-precision weight $w$ can be formulated in a recursive way:

$$Q_k(w) = Q_{k-1}(w) + Q_1\big(w - Q_{k-1}(w)\big), \quad k > 1,$$

where $Q_1(w) = \mathrm{sign}(w) \cdot 2^{[\log_2 |w|]}$, with $[\cdot]$ denoting rounding to an integer, rounds the weight $w$ to a nearest power of 2.
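The recursion can be sketched in a few lines of Python. This is an illustrative scalar version that assumes the rounding is done on the base-2 exponent, i.e., $Q_1(w) = \mathrm{sign}(w) \cdot 2^{\mathrm{round}(\log_2 |w|)}$; the helper names `round_pow2` and `quantize` are hypothetical.

```python
import math


def round_pow2(w):
    """Round w to a nearby power of 2 by rounding its base-2 exponent (assumed form of Q_1)."""
    if w == 0.0:
        return 0.0  # log2(0) is undefined; keep zero as zero
    exponent = round(math.log2(abs(w)))
    return math.copysign(2.0 ** exponent, w)


def quantize(w, k):
    """Approximate w as a sum of at most k signed powers of 2, recursively."""
    if k == 0:
        return 0.0
    previous = quantize(w, k - 1)
    # Add one more power of 2 that approximates the remaining residual.
    return previous + round_pow2(w - previous)


q1 = quantize(0.75, 1)  # single power of 2
q2 = quantize(0.75, 2)  # sum of two signed powers of 2
```

For w = 0.75, one term gives 1.0, while two terms give 1.0 − 0.25 = 0.75, showing how the second term corrects the residual left by the first.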
LightNNs are trained with a modified backpropagation algorithm. In the forward phase of each training iteration, the parameters are first approximated using the $Q_k$ function. Then, in the backward phase, the gradients of the loss w.r.t. the quantized weights are computed, and applied to the full-precision weights in the weight update phase. LightNNs have been shown to be accurate and energy-efficient on customized hardware [4]. LightNN-2 can generally achieve an accuracy close to that of full-precision DNNs, while LightNN-1 can achieve higher energy efficiency than LightNN-2. Due to the discrete nature of the k values, there exists a gap between LightNN-1 and LightNN-2 w.r.t. accuracy and energy. We propose to customize the k values for each convolutional filter, and thus achieve a smoother energy-accuracy trade-off that provides hardware designers with more design options.
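To make the three phases concrete, the sketch below performs one such training step for a single scalar weight and a squared-error loss. It is only an illustration under stated assumptions: the one-weight model, the loss, the learning rate, and the function names are all hypothetical, and the power-of-2 rounding is assumed to act on the base-2 exponent.

```python
import math


def round_pow2(w):
    """Round w to a nearby power of 2 by rounding its base-2 exponent (assumed form)."""
    if w == 0.0:
        return 0.0
    return math.copysign(2.0 ** round(math.log2(abs(w))), w)


def quantize(w, k):
    """Sum of at most k signed powers of 2 approximating w."""
    q = 0.0
    for _ in range(k):
        q += round_pow2(w - q)
    return q


def train_step(w_full, x, y, lr, k):
    """One modified-backpropagation step for the toy model pred = w * x."""
    w_q = quantize(w_full, k)        # forward phase: use the quantized weight
    pred = w_q * x
    grad_q = 2.0 * (pred - y) * x    # backward phase: dLoss/dw_q for (pred - y)**2
    return w_full - lr * grad_q      # update phase: apply it to the full-precision weight


w_next = train_step(w_full=0.3, x=1.0, y=1.0, lr=0.1, k=1)
```

Keeping a full-precision copy lets small gradient updates accumulate until they are large enough to move the quantized weight to a different power of 2.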
DIFFERENTIABLE TRAINING FOR FLIGHTNNS
In this section, we first define the quantization function, and then introduce the end-to-end training algorithm for FLightNNs, equipped with a regularization loss to penalize large k values.
Quantization function
We first denote the $i$-th filter of the network as $w_i$ and the quantization function for the filter $w_i$ as $Q_k(w_i \mid t)$, where $k = \max_i k_i$ is the maximum number of shifts used for this network, and the vector $t$ is a latent variable that controls the approximation (e.g., some threshold value). Also, we denote the residual resulting from the approximation as $r_{i,k} = w_i - Q_k(w_i \mid t)$, so that $r_{i,0} = w_i$. Then, we formally define the quantization function as follows:

$$Q_k(w_i \mid t) = \sum_{j=0}^{k-1} \mathbb{1}\big(\|r_{i,j}\|_2 > t_j\big)\, R(r_{i,j}),$$

where $R(x) = \mathrm{sign}(x) \cdot 2^{[\log_2 |x|]}$ rounds the input variable to a nearest power of 2, and $[\cdot]$ is a rounding-to-integer function. This quantization flow is shown in Fig. 2. To interpret the thresholds $t$: $t_0$ determines whether this filter is pruned out, $t_1$ determines whether one shift is enough, etc. Then, the number of shifts for the $i$-th filter is

$$k_i = \sum_{j=0}^{k-1} \mathbb{1}\big(\|r_{i,j}\|_2 > t_j\big),$$

and thus choosing the optimal $k_i$ per filter is equivalent to finding optimal thresholds $t$.
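The thresholded quantization of a single filter can be sketched as follows. This is a hedged illustration that assumes the quantized filter is the sum over stages $j$ of $\mathbb{1}(\|r_{i,j}\|_2 > t_j)\, R(r_{i,j})$, with $R$ applied element-wise and rounding done on the base-2 exponent; the function names and the flat-list representation of a filter are hypothetical.

```python
import math


def round_pow2(x):
    """Element-level R(x): round to a nearby power of 2 via the base-2 exponent (assumed)."""
    if x == 0.0:
        return 0.0
    return math.copysign(2.0 ** round(math.log2(abs(x))), x)


def l2_norm(values):
    return math.sqrt(sum(v * v for v in values))


def flight_quantize(filter_weights, thresholds):
    """Return (quantized filter, number of shifts k_i) for one filter.

    Stage j adds a power-of-2 term only if the L2 norm of the current
    residual exceeds thresholds[j]; otherwise the stage is skipped.
    """
    quantized = [0.0] * len(filter_weights)
    shifts = 0
    for t_j in thresholds:
        residual = [w - q for w, q in zip(filter_weights, quantized)]
        if l2_norm(residual) > t_j:
            quantized = [q + round_pow2(r) for q, r in zip(quantized, residual)]
            shifts += 1
    return quantized, shifts


two_shift = flight_quantize([0.75, -0.3], [0.0, 0.0])    # both stages active
one_shift = flight_quantize([0.75, -0.3], [0.0, 0.5])    # second stage skipped
pruned = flight_quantize([0.75, -0.3], [10.0, 10.0])     # filter pruned out
```

Raising $t_1$ turns a two-shift filter into a one-shift filter, and raising $t_0$ removes the filter entirely, which is how the thresholds control $k_i$.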
The FLightNN quantization approach targets efficient hardware implementation. Instead of assigning a customized $k_i$ for each weight, FLightNNs have customized $k_i$ values per filter, and therefore preserve the structural sparsity. As shown in Fig. 3, the convolution with a $k_i = 2$ filter can be equivalently converted to the sum of two convolutions, each with a $k_i = 1$ filter. Thus, FLightNNs can be efficiently implemented as LightNN-1 with an extra summation of feature maps per layer.
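This equivalence follows from the linearity of convolution and can be checked numerically. The sketch below uses a hypothetical 1-D signal and kernel (the values are chosen for illustration and are not taken from the experiments) and a plain-Python valid-mode correlation.

```python
def conv1d(signal, kernel):
    """Valid-mode 1-D correlation, as used in CNN layers."""
    n_out = len(signal) - len(kernel) + 1
    return [
        sum(signal[i + j] * kernel[j] for j in range(len(kernel)))
        for i in range(n_out)
    ]


signal = [1.0, 2.0, 3.0, 4.0]

# A k_i = 2 kernel: every weight is a sum of two signed powers of 2.
kernel_k2 = [0.75, -0.3125]
# Its decomposition into two k_i = 1 kernels (one power of 2 per weight).
kernel_a = [1.0, -0.25]
kernel_b = [-0.25, -0.0625]

direct = conv1d(signal, kernel_k2)
split = [a + b for a, b in zip(conv1d(signal, kernel_a), conv1d(signal, kernel_b))]
```

Both `direct` and `split` contain the same feature map, so a two-shift filter costs one extra single-shift convolution plus an element-wise addition.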
Differentiable training
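Instead of picking the thresholds $t$ by hand, we consider them as trainable parameters. Therefore, the loss function $L(w, t)$ is a function of both weights and thresholds (the bias term is omitted for simplicity). Similar to prior work on DNN quantization [15, 16], we use the straight-through estimator (STE) [19] to compute the gradient of the loss w.r.t. the weights through the quantized filter $w_i^q = Q_k(w_i \mid t)$, which becomes a differentiable expression.

To compute the gradient for thresholds, i.e., $\partial L / \partial t_j$, we relax the indicator function $g(x, t_j) = \mathbb{1}(x > t_j)$ to a sigmoid function [20], $\sigma(\cdot)$, when computing gradients, i.e., $\hat{g}(x, t_j) = \sigma(x - t_j)$. In addition, we use STE to compute the gradient for $R(x)$. Thus, the gradient $\partial w_i^q / \partial t_j$ can be computed by:

$$\frac{\partial w_i^q}{\partial t_j} = \sum_{l=j}^{k-1} \left[ R(r_{i,l})\, \frac{\partial g(\|r_{i,l}\|_2, t_l)}{\partial t_j} + g(\|r_{i,l}\|_2, t_l)\, \frac{\partial r_{i,l}}{\partial t_j} \right],$$

where $\partial g(\|r_{i,l}\|_2, t_l) / \partial t_j$ and $\partial r_{i,l} / \partial t_j$ are 0 for $l < j$; otherwise, they can be computed with the results of $\partial g(\|r_{i,m}\|_2, t_m) / \partial t_j$ and $\partial r_{i,m} / \partial t_j$ for $m < l$.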
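The sigmoid relaxation of the indicator can be sketched as below. This minimal example keeps the hard indicator for the forward value and uses the derivative of $\sigma(x - t_j)$ w.r.t. the threshold in the backward pass; the function names are hypothetical, and the snippet covers only the direct dependence on $t_j$, not the full chain through the residuals.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def indicator(x, t):
    """Forward pass: hard decision 1(x > t)."""
    return 1.0 if x > t else 0.0


def relaxed_indicator_grad_t(x, t):
    """Backward pass: d/dt of sigmoid(x - t), used in place of the
    zero-almost-everywhere derivative of the hard indicator."""
    s = sigmoid(x - t)
    return -s * (1.0 - s)


forward_value = indicator(0.7, 0.2)
threshold_grad = relaxed_indicator_grad_t(0.7, 0.2)
```

The surrogate gradient is always negative, so increasing a threshold lowers the (relaxed) probability that the corresponding shift stage is used, and its magnitude peaks when the residual norm is close to the threshold.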
Regularization
To encourage smaller $k_i$ for the filters, we also add a regularization loss:

$$L_{reg,k}(w) = \sum_{j=0}^{k-1} \lambda_j \sum_i \|r_{i,j}\|_2,$$

where $\lambda_j$ acts as a knob to balance accuracy and model sparsity. This regularization loss is the sum of several group Lasso losses, which are used because they can introduce structural sparsity [10]. The first term, $\lambda_0 \sum_i \|r_{i,0}\|_2 = \lambda_0 \sum_i \|w_i\|_2$, is used to prune whole filters out, while the other terms ($j > 0$) regularize the residuals. Fig. 4 shows the two terms of the regularization loss and their sum for the case $k = 2$, with $\lambda_0 = 10^{-5}$ and $\lambda_1 = 3 \times 10^{-5}$. Therefore, the total loss for training a FLightNN is the sum of the loss $L(w, t)$ and the regularization loss:

$$L_{total}(w, t) = L(w, t) + L_{reg,k}(w).$$
The new training algorithm is summarized in Algorithm 1. It is the same as the conventional backpropagation algorithm for full-precision DNNs, except that in the forward phase, the weights are quantized given the thresholds $t$. Then, due to the differentiability of the quantization function w.r.t. $w$ and $t$, one can compute their gradients and update their values in each training iteration.
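The regularization term used during training can be computed as in the following sketch, which sums group Lasso penalties over the residual stages. The nested-list layout `residuals_per_stage[j][i]` (the residual $r_{i,j}$ of filter $i$ at stage $j$) and the function name are assumptions for illustration; the example values are hypothetical.

```python
import math


def group_lasso_reg(residuals_per_stage, lambdas):
    """Sum over stages j of lambda_j * sum over filters i of ||r_{i,j}||_2.

    residuals_per_stage[j][i] is the residual vector of filter i at stage j;
    stage 0 holds the full-precision filters themselves (r_{i,0} = w_i).
    """
    total = 0.0
    for lam, stage in zip(lambdas, residuals_per_stage):
        stage_penalty = sum(math.sqrt(sum(v * v for v in r)) for r in stage)
        total += lam * stage_penalty
    return total


# Hypothetical example: one filter, two stages (k = 2).
residuals = [
    [[3.0, 4.0]],   # stage 0: r_{0,0} = w_0, L2 norm 5
    [[0.6, 0.8]],   # stage 1: residual after one shift, L2 norm 1
]
reg = group_lasso_reg(residuals, lambdas=[1.0, 2.0])
```

Because each penalty is an unsquared L2 norm over a whole filter, the gradient pushes entire residual vectors toward zero together, which is what makes a stage (or a filter) drop out as a unit.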
EXPERIMENTAL RESULTS
In this section, we first introduce the experiment setup. Then, we show the accuracy results of different quantized DNN models by software training, as well as their throughput on the FPGA and energy efficiency on the ASIC, to verify the effectiveness of FLightNNs. 
Setup
We conduct experiments on both small and large CNNs for the CIFAR-10, SVHN, CIFAR-100 and ImageNet datasets. The eight adopted network configurations are shown in Table 1. To explore the FLightNN performance on different types of network structures, we use a VGG structure with a series of stacked convolutional layers for networks 1, 3, 4 and 5, and adopt the ResNet structure with skip connections across layers for networks 2, 6, 7 and 8. Networks 1, 2 and 3 are used for experiments on CIFAR-10; networks 4 and 5 are used for SVHN; networks 6 and 7 are used for CIFAR-100; the last one, network 8, is used for ImageNet. For all networks, each convolutional layer is followed by a batch normalization layer and a Leaky ReLU activation function [21], and optionally followed by a max-pooling layer. We use the Adam optimizer [22] to train the networks. For each of the networks, we train different quantized models including full-precision DNNs, fixed-point DNNs with 4-bit weights and 8-bit activations, LightNN-2 with 8-bit weights and 8-bit activations, LightNN-1 with 4-bit weights and 8-bit activations, and FLightNNs with 8-bit activations. Due to long training times and limited computing resources, for ImageNet we train a ResNet-10 with reduced width (i.e., network 8) only for LightNN-1, LightNN-2 and FLightNNs. For all FLightNNs, we initialize the thresholds $t$ to 0, and set the maximum number of shifts $k$ to 2. For all models except the 32-bit full-precision model, we use 8-bit fixed-point quantization for the activations. By varying $\lambda$, we can obtain different accuracy-throughput or accuracy-energy trade-offs for FLightNNs. All these networks are trained in software using PyTorch.
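For readers unfamiliar with fixed-point activations, the sketch below quantizes a real value to a signed 8-bit fixed-point code. The split into 4 fractional bits, the saturation behavior, and the function name are assumptions made for illustration; the exact fixed-point format used in the experiments is not stated here.

```python
def to_fixed_point(x, total_bits=8, frac_bits=4):
    """Quantize x to a signed fixed-point code with saturation.

    Returns (integer code, value represented by that code).
    The 8-bit width matches the activation precision used in the setup;
    frac_bits=4 is an assumed format, not one taken from the paper.
    """
    scale = 1 << frac_bits
    code_min = -(1 << (total_bits - 1))
    code_max = (1 << (total_bits - 1)) - 1
    code = max(code_min, min(code_max, round(x * scale)))
    return code, code / scale


code, value = to_fixed_point(1.3)          # representable range is [-8, 7.9375]
sat_code, sat_value = to_fixed_point(100)  # saturates at the largest code
```

An integer code of this kind is what a shift operator acts on, so multiplying it by a power-of-2 weight reduces to moving the binary point.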
Accuracy-throughput trade-off on FPGA
To show the accuracy-throughput trade-off of the models, we implement the inference of each network's largest convolutional layer for each of the quantized DNN models on FPGA, since prior work has shown that convolution operations typically take over 90% of the computation time of a CNN [23]. Our implementation is built on the Xilinx Zynq ZC706 evaluation board, with a working frequency of 100 MHz. Pre-synthesis is executed on an Intel i7-4790 CPU (3.6 GHz) with 16 GB RAM. We use Vivado HLS [24] for the FPGA implementation. The C code of the DNN designs is parallelized by adding HLS-defined pragmas, and the parallel version is validated with the Vivado HLS timing analysis tool. To make a fair comparison, the same pragmas and directives are used for full-precision DNNs, fixed-point DNNs, LightNNs and FLightNNs, and we follow the same scheduling settings as prior work [5]. Batched inference is adopted, and the batch size is set to the maximum that fits within the FPGA resources in order to obtain the highest throughput.

Tables 2, 3, 4 and 5 show the accuracy and throughput comparison for full-precision DNNs, fixed-point DNNs, LightNNs and FLightNNs. For all the datasets in our experiments, FLightNNs show the advantage of flexible accuracy-speed trade-offs. In most of the networks (e.g., networks 1, 3, 6 and 7), FLightNNs can achieve an accuracy close to that of LightNN-2 while running much faster than LightNN-2. Thus, FLightNNs provide continuous trade-offs between accuracy and speed. Compared to fixed-point quantization, FLightNNs can achieve higher accuracy, and up to 2.0×, 1.8× and 1.8× speedup for the CIFAR-10, SVHN and CIFAR-100 datasets, respectively. This is because the multiplications are replaced by shift operators, which require only LUT resources on the FPGA, while multipliers require DSP units, which are generally scarcer than LUTs. Therefore, the computation for FLightNNs allows larger batch sizes than that of fixed-point DNNs, increasing data parallelism and thus improving the throughput.
It is also interesting to note that by comparing some FLightNNs (e.g., $\mathrm{FL}_{1a}$, $\mathrm{FL}_{2a}$, $\mathrm{FL}_{3a}$, $\mathrm{FL}_{6a}$ and $\mathrm{FL}_{7a}$) with LightNN-1, we find that FLightNNs can achieve higher accuracy with the same or even lower storage than LightNN-1. This is because FLightNNs initially quantize all the filters with two shifts (since $t$ is initialized as 0), and gradually add constraints to the filters. This gradual quantization may be better than training a network with only one shift from scratch, as LightNN-1 does. The benefit of gradual quantization has also been observed by prior work [25], which shows that gradually imposing quantization constraints can achieve better accuracy than directly quantizing with a strict constraint.

Table 6 shows the FPGA resource utilization for networks 7 and 8. Since full-precision and fixed-point DNNs require DSP units for both multiplication and addition, while LightNNs and FLightNNs only need DSP units for addition, full-precision and fixed-point DNNs have larger DSP resource utilization. Compared to full-precision DNNs, which use 32-bit floating-point operations, fixed-point DNNs only use 4-bit weights and 8-bit activations, and therefore consume fewer DSP units. LightNNs and FLightNNs use LUTs to implement the shift operators that replace the multipliers, and have a higher LUT utilization than full-precision and fixed-point DNNs. However, the performance of (F)LightNNs is not bounded by LUT resources, since the maximum LUT usage by LightNN-2 is only 42% and 17% for networks 7 and 8, respectively. Instead, the memory resource (BRAM) bounds the performance of (F)LightNNs, while for full-precision and fixed-point DNNs, the performance is bounded by both BRAM and DSP.
Accuracy-energy trade-off on ASIC
For all quantized DNNs, we designed pipelined implementations with one stage per neuron, where the computation unit is reused for each neuron. A 65nm commercial standard library is adopted. The Synopsys Design Compiler [26] is used to generate the gate-level netlist of the computation units. The power consumption of all computation operations within one layer is calculated using Synopsys Primetime. We keep all the DNN architectures implemented in an unoptimized fashion because our main objective is to compare how different quantized DNNs impact computational energy.
The accuracy and computational energy trade-offs for the quantized DNN models are shown in

Table 2: Accuracy and FPGA throughput for CIFAR-10. In the "Model" column, "Full", "L-2", "L-1", "FP" and "FL" indicate full-precision DNN, LightNN-2, LightNN-1, fixed-point DNN, and FLightNN, respectively. The subscript "xWyA" indicates x bits for weights and y bits for activations. The FLightNN results are shown in bold face. We use the subscripts a and b to denote the two trained FLightNNs for each network. These notations also apply for
DISCUSSION
Since FLightNNs customize the $k_i$ for each filter, LightNN-1 and LightNN-2 can be considered as two special cases of FLightNNs. Therefore, the Pareto front created by the searched FLightNN solutions should be an upper bound for the front of LightNN-1 and LightNN-2 with varied parameter numbers. We test this hypothesis on the CIFAR-100 dataset using networks with varied numbers of convolutional filters. As shown in Fig. 6, the accuracy-storage Pareto front created by FLightNNs is consistently higher than that of the LightNNs. This indicates that instead of only filling in the Pareto front of LightNNs, FLightNNs can push the Pareto front forward, due to their larger design space. The proposed differentiable training algorithm optimizes both the $k_i$ and the weight values in an end-to-end fashion, and therefore significantly reduces the search effort compared to exhaustive or heuristic methods with multiple rounds of training. Future work will further improve training efficiency by using an optimized training loss [27] or proper labels [28].
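For concreteness, an accuracy-storage Pareto front can be extracted from a set of trained models as in the sketch below. The `(storage, accuracy)` pairs are hypothetical placeholders, not measurements from this paper, and the function name is an assumption.

```python
def pareto_front(points):
    """Return the non-dominated (storage, accuracy) points, sorted by storage.

    A point is dominated if another point has storage no larger and accuracy
    no lower, and is strictly better in at least one of the two.
    """
    front = []
    for storage, accuracy in points:
        dominated = any(
            (s <= storage and a >= accuracy) and (s < storage or a > accuracy)
            for s, a in points
        )
        if not dominated:
            front.append((storage, accuracy))
    return sorted(front)


# Hypothetical (storage in MB, accuracy in %) pairs for illustration only.
candidates = [(1.0, 70.0), (2.0, 75.0), (2.0, 72.0), (3.0, 74.0)]
front = pareto_front(candidates)
```

Adding per-filter $k_i$ choices enlarges the candidate set, so the resulting front can only match or improve on the front obtained from the LightNN-1 and LightNN-2 points alone.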
CONCLUSION
In this paper, we propose FLightNNs, which customize the number of shift operations for each filter of LightNNs. Equipped with the proposed differentiable training algorithm, FLightNNs can achieve a flexible trade-off between accuracy and speed/energy. Our experimental results on FPGA and ASIC simulations show that FLightNNs can provide a more continuous Pareto front for LightNN models and consistently outperform fixed-point DNNs w.r.t. both accuracy and speed/energy. Moreover, due to the gradual quantization nature of the differentiable training, FLightNNs can achieve higher accuracy than LightNN-1 without sacrificing speed and energy efficiency, and thus push the Pareto-optimal front forward. These promising results suggest the potential of FLightNNs to achieve fast and accurate inference on customized hardware for learning-based applications.
