Deep neural networks have proven to be particularly effective in visual and audio recognition tasks. Existing models tend to be computationally expensive and memory intensive, however, and so methods for hardware-oriented approximation have become a hot topic. Research has shown that custom hardware-based neural network accelerators can surpass their general-purpose processor equivalents in terms of both throughput and energy efficiency. Application-tailored accelerators, when co-designed with approximation-based network training methods, transform large, dense and computationally expensive networks into small, sparse and hardware-efficient alternatives, increasing the feasibility of network deployment. In this article, we provide a comprehensive evaluation of approximation methods for high-performance network inference along with in-depth discussion of their effectiveness for custom hardware implementation. We also include proposals for future research based on a thorough analysis of current trends. This article represents the first survey providing detailed comparisons of custom hardware accelerators featuring approximation for both convolutional and recurrent neural networks, through which we hope to inspire exciting new developments in the field.
INTRODUCTION
The exponentially growing availability of digital data such as images, videos and speech from myriad sources, including social media and the Internet of Things, is driving the demand for high-performance data analysis. Compared to other machine learning algorithms, deep neural networks (DNNs) have achieved dramatic accuracy improvements over the past decade. They have now been employed in a vast range of application domains, from image classification [137] and object detection [90] to autonomous driving [19] and drone navigation [42]. Two classes of DNN, convolutional and recurrent (CNNs and RNNs), are particularly popular. While CNNs excel in learning spatial features, RNNs are more suited to problems involving time series.
As tasks increase in complexity, inference architectures become deeper and more computationally expensive. For example, a small LeNet-5 model targeting the simple MNIST handwritten digit-classification task requires 680 kop/cl (thousand arithmetic operations per classification, where an arithmetic operation is either an addition or multiplication), while a VGG16 implementation executing the 1000-class ImageNet task requires 31 Gop/cl along with 550 MiB of 32-bit floating-point weight storage [136]. The development of algorithms for reducing the computational and storage costs of DNN inference is therefore essential for throughput-, latency- and energy-critical applications. Recent work has shown that, with the use of approximation, DNN deployment becomes more feasible thanks to its resultant reductions in memory use and compute complexity.
DNN approximation algorithms can be classified into two broad categories: quantisation and weight reduction.
Quantisation methods reduce the precision of weights, activations (neuron outputs) or both, while weight reduction removes redundant parameters through pruning and structural simplification. By doing so, the latter commonly leads to reductions in numbers of activations per network as well. We assess methods of both types in this article since they both contribute to DNN acceleration.
For many years, general-purpose processors (GPPs), particularly multi-core CPUs and GPUs, have been the dominant hardware platforms for DNN inference. For uncompressed DNN models, layer operations are mapped to dense floating-point matrix multiplications, which can be efficiently processed in parallel by GPPs following the single-instruction, multiple-data (SIMD) or single-instruction, multiple-thread (SIMT) parallel-processing paradigms. With DNN approximation, however, there is an emerging trend of using custom hardware platforms, such as field-programmable gate arrays (FPGAs) and application-specific integrated circuits (ASICs), to accelerate inference instead. While GPUs still excel at dense floating-point computation, researchers have reported higher throughput and energy efficiency with custom hardware through the use of low-precision fixed-point quantisation [66, 127]. Moreover, SIMD and SIMT architectures often perform poorly when operating on sparse data; DNNs compressed via fine-grained weight reduction have been shown to execute more efficiently in custom hardware [52, 109]. Logic and memory hierarchy customisability often make custom hardware DNN inference faster and significantly more energy efficient than through the use of GPPs.
A significant number of world-leading information technology firms have selected custom hardware over GPPs for the implementation of their next-generation DNN architectures. These include ASICs, e.g. Google's Tensor Processing Unit (TPU) [65], Intel Nervana [1] and IBM TrueNorth [2], as well as FPGA-based designs such as Microsoft Brainwave [28] and Xilinx Everest [151]. In general, ASIC designs can achieve state-of-the-art throughput and energy efficiency. Their time-consuming and resource-demanding design and fabrication processes, however, make it hard for them to keep up with the rapid evolution of DNN algorithms [28, 66].
High-level implementation tools, including Intel's OpenCL Software Development Kit and Xilinx Vivado High-Level Synthesis, and Python-to-netlist neural network frameworks, such as DNNWeaver [125], make the DNN hardware design process for both FPGAs and ASICs faster and simpler. Such software allows DNN architects unfamiliar with hardware development to migrate their designs to custom hardware with relative ease. Reconfigurability, meanwhile, enables rapid design iteration, making FPGAs ideal prototyping and deployment devices for cutting-edge DNNs.
Through this survey, we aim to equip researchers new to the field with a comprehensive grounding of DNN approximation, revealing how custom hardware is able to achieve greater performance than GPPs for inference. More specifically, we make the following novel contributions:
• We motivate DNN approximation for custom hardware by comparing the so-called roofline models [107] of comparable FPGA, ASIC, CPU and GPU platforms of different scales.
• We survey key trends in approximation for state-of-the-art DNNs. We detail low-precision quantisation and weight-reduction methods, introducing recent algorithmic developments and assessing their relative strengths and weaknesses.
• We evaluate the performance of custom hardware implementations of each method, focussing on accuracy, compression, throughput, latency and energy efficiency.
• Based on identified trends, we propose several promising directions for future research.
There are some existing surveys on DNN approximation. Cheng et al. [25], Guo et al. [49], Cheng et al. [24] and Sze et al. [136] surveyed algorithms for DNN compression and acceleration. Of these, Cheng et al. [24] briefly evaluated system-level designs for FPGA implementation. Guo et al. only surveyed quantisation methods; weight reduction was not mentioned. Nurvitadhi et al. compared Intel FPGA performance to that of GPU platforms for CNN inference benchmarks [104].
This article represents the first survey that provides not only a comprehensive evaluation of approximation algorithms for efficient DNN inference, but also in-depth analysis and comparison of these algorithms' implementations in custom hardware, covering both CNNs and RNNs.
PERFORMANCE EVALUATION METRICS
We evaluate the effectiveness of DNN approximation by considering the following factors.
• Accuracy. The two accuracy metrics commonly used in machine learning research are training and testing accuracy, which respectively capture the proportions of correct classifications over training and testing datasets. Throughout this article, "accuracy" always refers to testing accuracy, which is indicative of a particular DNN's generalisability. Top-n accuracy captures the proportion of testing data for which any of the n highest-probability predictions match the correct results. Accuracies are reported as percentages, with changes expressed in percentage points (pp). Where comparisons are drawn against baselines, these are uncompressed implementations of the same networks, trained and tested using identical datasets, with all data in IEEE-754 single-precision floating-point format (FP32).
• Compression ratio. A network's weight storage requirement vs that of the above baseline.
• Throughput. Classifications produced per second (cl/s). Also known as classification rate.
• Latency. The end-to-end processing time for one classification, in seconds (s).
• Energy efficiency. The throughput obtained per unit power, expressed in cl/J.
We also discuss application-specific considerations, e.g. parameter tuning time and design flexibility.
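The ratio-style metrics above reduce to simple arithmetic. The following minimal sketch assumes that the compression ratio is expressed as baseline storage divided by compressed storage, so that larger is better; the function names and the example figures are hypothetical and chosen only for illustration.

```python
def compression_ratio(baseline_bits, compressed_bits):
    """FP32 baseline weight storage divided by compressed weight storage."""
    return baseline_bits / compressed_bits


def energy_efficiency(throughput_cl_per_s, power_w):
    """Classifications per joule: (cl/s) / (J/s) = cl/J."""
    return throughput_cl_per_s / power_w


# Hypothetical network: one million weights quantised from FP32 to eight bits.
n_weights = 1_000_000
ratio = compression_ratio(32 * n_weights, 8 * n_weights)

# Hypothetical accelerator: 100 cl/s while drawing 5 W.
efficiency = energy_efficiency(100.0, 5.0)

print(f"compression ratio: {ratio}x")
print(f"energy efficiency: {efficiency} cl/J")
```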
WHY CUSTOM HARDWARE? A ROOFLINE MODEL ANALYSIS
For DNN inference, approximation contributes to increases in throughput in three ways: increased parallelism, memory transfer reductions and workload reductions. With the help of roofline modelling, we can explain each factor's contribution, revealing why custom hardware can squeeze more speedup from approximation than GPPs.

[Figure 1: estimated rooflines of DNN inference accelerators on several hardware platforms, with series including one-bit fixed-point weights [58].]

A roofline model captures the theoretical peak performance of an acceleration platform while also reflecting the effects of off-chip memory data transfers. For any high-performance computing engine, the peak arithmetic performance, expressed in op/s, is limited by two factors: memory bandwidth and the amount of available compute resources. In the context of DNN inference, memory bandwidth limits the rate at which activations can be read and written, as well as that at which parameters stored off-chip can be fetched. By compute resources, we mean on-chip parallel-processing units able to perform operations: chiefly multiplication. When memory bound, the arithmetic performance of a platform does not scale with any increase in parallelism. At the compute bound, meanwhile, all available processing resources are saturated.

Figure 1 overlays the estimated rooflines of DNN inference accelerators on several hardware platforms. The abscissa shows the arithmetic intensity of DNN inference, while the ordinate indicates the peak attainable arithmetic performance. Arithmetic intensity, also commonly referred to as operational intensity or compute-to-communication (CTC) ratio, is expressed as the number of arithmetic operations performed per byte of off-chip memory traffic (op/B). Arithmetic performance is memory bound when the arithmetic intensity is to the left of the break point. When to the right, it is compute bound: resource limitations prevent further scaling.
For fairness, platforms were divided into datacentre and embedded scales and compared accordingly. For FPGA-based accelerators, compute bounds were approximated under the assumption that the cost per fixed-point multiply-accumulate (MAC) unit was 2.5 lookup tables (LUTs) for one-bit (binary), 40 LUTs for eight-bit and eight LUTs and half a digital signal processing (DSP) block for 16-bit precision, as suggested by Umuroglu et al. [141]. Both weights and activations were quantised at the same precision. We assumed that both Xilinx FPGAs featured, the Kintex UltraScale KU115 and Zynq ZC706, had 4.8 GB/s of off-chip memory bandwidth, and that implementations on the two devices were clocked at 350 and 200 MHz, respectively [141].
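The roofline relationship described above can be sketched as the minimum of the compute bound and the product of memory bandwidth and arithmetic intensity. The sketch below uses the 4.8 GB/s bandwidth assumed above together with the 1.0 Top/s compute bound reported for the KU115 at 16-bit precision; treating 1 GB as 10^9 bytes, and the function names, are assumptions made for illustration.

```python
def attainable_performance(intensity, compute_bound, bandwidth):
    """Roofline: attainable op/s at a given arithmetic intensity (op/B)."""
    return min(compute_bound, bandwidth * intensity)


def break_point(compute_bound, bandwidth):
    """Arithmetic intensity (op/B) at which the memory and compute bounds meet."""
    return compute_bound / bandwidth


BANDWIDTH = 4.8e9       # off-chip bandwidth in B/s (assumes 1 GB = 1e9 B)
COMPUTE_BOUND = 1.0e12  # op/s (1.0 Top/s, KU115 at 16-bit precision)

knee = break_point(COMPUTE_BOUND, BANDWIDTH)
left_of_knee = attainable_performance(10.0, COMPUTE_BOUND, BANDWIDTH)
right_of_knee = attainable_performance(1000.0, COMPUTE_BOUND, BANDWIDTH)

print(f"break point: {knee:.1f} op/B")
print(f"at 10 op/B (memory bound): {left_of_knee:.3g} op/s")
print(f"at 1000 op/B (compute bound): {right_of_knee:.3g} op/s")
```

Left of the break point, performance grows linearly with arithmetic intensity; right of it, only a higher compute bound helps.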
Compute Bound Flexibility
From Figure 1, we can observe that, due to their specialised support for floating-point arithmetic operations, GPUs can deliver the highest arithmetic performance for FP32 DNN inference. When moving from floating-point to lower-precision fixed-point data representations, however, custom hardware design flexibility facilitates the trading off of precision for increased performance. Being robust to reductions in precision, DNNs can take great advantage of this flexibility [32]. The ASIC implementation featured, the TPU, has the greatest compute bound, 92 Top/s, following which is the KU115 FPGA. Since FPGAs afford their users total post-fabrication architectural freedom, different compute bounds are reachable, dependent upon the chosen precision, for the same device. As a result, the KU115 has compute bounds of 1.0 Top/s with 16-
Arithmetic Performance Increases from Network Compression
Reaching a platform's compute bound is only possible if the executing application is not limited by its memory. If it is, then, to achieve higher arithmetic performance, higher arithmetic intensity is required. With network compression in the form of precision reductions, less off-chip memory needs to be accessed per operation performed, hence higher arithmetic intensity (and subsequently performance, if the application is not compute bound) is achievable. Networks can also be compressed via weight reduction, which both saves memory and removes the need to perform the associated operations. This can also lead to increased arithmetic intensity and thus performance: a smaller network can use on-chip caching more efficiently, reducing, or even entirely eliminating, off-chip memory traffic [141]. These performance gains are supported by the roofline models, in which, when bounded by memory, an increase in arithmetic intensity means a rightward shift along a roofline, resulting in an increase in arithmetic performance. Although all hardware platforms can benefit from network compression, custom hardware implementations, featuring higher compute bounds than GPPs, stand to gain the most; GPPs hit their compute bounds earlier when arithmetic intensity increases.
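To make the effect of precision reduction on arithmetic intensity concrete, the sketch below computes the intensity of a hypothetical layer at several data widths and reads off the memory-bound performance. The layer size, the fixed compute bound and the function names are assumptions; in practice, the compute bound of an FPGA would itself change with precision.

```python
def arithmetic_intensity(ops, n_values, bits_per_value):
    """Operations per byte of off-chip traffic when n_values are transferred."""
    return ops / (n_values * bits_per_value / 8)


def attainable(intensity, compute_bound, bandwidth):
    """Roofline model: the lower of the compute and memory limits."""
    return min(compute_bound, bandwidth * intensity)


OPS = 2_000_000         # hypothetical layer: one million MACs = two million ops
N_VALUES = 1_000_000    # hypothetical number of values moved off-chip
BANDWIDTH = 4.8e9       # B/s, as assumed for the FPGAs in this section
COMPUTE_BOUND = 1.0e12  # op/s, held fixed here purely for illustration

for bits in (32, 16, 8, 1):
    intensity = arithmetic_intensity(OPS, N_VALUES, bits)
    perf = attainable(intensity, COMPUTE_BOUND, BANDWIDTH)
    print(f"{bits:>2}-bit: {intensity:5.1f} op/B -> {perf:.3g} op/s")
```

Each halving of data width doubles the intensity, which translates into a proportional performance gain for as long as the layer remains memory bound.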
Limitations
While roofline models can allow one to predict increases in arithmetic performance (in op/s) that will arise from increased parallelism and memory transfer reductions gained through approximation, they can capture the corresponding changes in throughput (in cl/s) to only a limited extent. To understand the throughput impacts of weight-reduction methods, we must consider an additional factor. Arithmetic performance and throughput are related by workload (op/cl): the number of arithmetic operations performed per classification. Since weight reduction removes unimportant parameters, these methods achieve simultaneous memory transfer and workload reductions. As memory transfer reductions can facilitate arithmetic performance increases, it is possible for throughput increases to outpace those in arithmetic performance realised through their employment.
Quantisation methods, on the other hand, do not cause reductions in workload since the numbers of operations performed per classification remain the same. For these, increases in arithmetic performance result in proportionate increases in throughput.
Roofline modelling does not account for the discrepancies in accuracy that arise from approximation. In general, while DNN approximation results in information loss and subsequent accuracy degradation, the majority of works surveyed in this article suggest that the acceptance of low to moderate sacrifices in accuracy can result in significant performance improvement. Some show that, in certain scenarios, the introduction of approximation can actually improve accuracy by reducing model overfitting. The remainder of this article places great emphasis on the analysis of tradeoffs between network compression and accuracy.
Latency-critical DNN applications, such as advanced driver assistance systems, require the swift production of classifications. Many user-interfacing applications also require low latency to maintain adequate user experience [121]. Roofline models do not inherently capture latency. Herein, we detail how custom hardware can achieve state-of-the-art DNN inference latency, as well as throughput, thanks to its flexibility.
Approximation in custom hardware can also achieve superior energy efficiency, another metric whose behaviour is not natively observable through roofline modelling, vs competing platforms. Custom hardware-based DNN inferencing applications operate at lower clock frequencies and hence consume less power, while also attaining higher throughput and/or lower latency, than those running on GPPs. Furthermore, some implementations, by exploiting customisability, outperform GPU-based versions in terms of memory energy efficiency.
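The relationship between arithmetic performance, workload and throughput described above can be illustrated numerically. The sketch below uses the 31 Gop/cl VGG16 workload quoted in the introduction; the baseline performance and the improvement factors are hypothetical values picked only to show the two cases.

```python
def throughput(arith_perf_op_per_s, workload_op_per_cl):
    """Throughput in cl/s: arithmetic performance divided by workload."""
    return arith_perf_op_per_s / workload_op_per_cl


VGG16_WORKLOAD = 31e9  # op/cl, as quoted for VGG16 on ImageNet
BASE_PERF = 1.0e12     # op/s, hypothetical baseline arithmetic performance

base = throughput(BASE_PERF, VGG16_WORKLOAD)

# Quantisation: workload unchanged, so a 2x performance gain gives 2x throughput.
quantised = throughput(2.0 * BASE_PERF, VGG16_WORKLOAD)

# Weight reduction: hypothetically, half the workload AND a 1.5x performance gain,
# so throughput (3x) outpaces the gain in arithmetic performance (1.5x).
pruned = throughput(1.5 * BASE_PERF, VGG16_WORKLOAD / 2.0)

print(f"baseline:  {base:.2f} cl/s")
print(f"quantised: {quantised / base:.1f}x throughput")
print(f"pruned:    {pruned / base:.1f}x throughput")
```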
QUANTISATION
The first major approximation theme we consider is that of quantisation. FPGA and ASIC flexibility permits the implementation of low-precision DNNs, thereby increasing throughput through parallelisation and by reducing reliance on slow off-chip memory.
Fixed-point Representation
4.1.1 Algorithmic Development. A floating-point-quantised DNN typically allows for an arbitrary binary point position, i.e. exponent value, for each individual parameter. This flexibility in data representation range comes at the expense of high resource use, power consumption and arithmetic operation latency, however. Fixed-point-quantised DNNs generally use consistent, predetermined precisions and binary point locations, i.e. equal maximum and minimum representable magnitudes, for entire networks. This allows for fast, cheap and power-efficient arithmetic operations in hardware, but enforces the use of constant data representation ranges. Early works, such as Courbariaux et al.'s [32], surveyed this topic, signalling that the accuracy of CNN inference can be preserved even with forward propagation conducted in low-precision fixed-point formats. Jacob et al. performed eight-bit quantisation of a popular CNN model, MobileNet, reporting an up-to 50% reduction in inference latency on an ARM CPU with only a 1.8 pp accuracy drop for the Common Objects in Context (COCO) dataset [62]. Thereafter, many authors presented FPGA-based CNN and RNN inference frameworks using low-precision fixed-point formats that achieved superior throughputs to their floating-point counterparts with negligible accuracy drops [94, 155]. However, since data in different layers can have very different ranges, using a constant quantisation resolution for an entire network can provide suboptimal bandwidth efficiency.
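As a minimal sketch of fixed-point quantisation with a single, predetermined binary point location, the function below rounds a real value to a signed two's-complement grid and saturates at the representable range. The function name, the choice of saturation on overflow and the example word format (eight bits, six of them fractional) are assumptions for illustration; Python's built-in `round` resolves ties to the nearest even integer.

```python
def quantise_fixed_point(x, total_bits, frac_bits):
    """Return the real value represented after signed fixed-point quantisation."""
    scale = 1 << frac_bits                # 2**frac_bits steps per unit
    q_min = -(1 << (total_bits - 1))      # most negative integer code
    q_max = (1 << (total_bits - 1)) - 1   # most positive integer code
    code = round(x * scale)               # round to the nearest grid point
    code = max(q_min, min(q_max, code))   # saturate instead of wrapping around
    return code / scale


# Eight-bit words with six fractional bits cover [-2.0, 1.984375] in steps of 1/64.
print(quantise_fixed_point(0.3, 8, 6))   # small value: rounding error only
print(quantise_fixed_point(3.0, 8, 6))   # out of range: saturates at the maximum
print(quantise_fixed_point(-3.0, 8, 6))  # out of range: saturates at the minimum
```

The saturated cases show why a constant representation range can suit one layer well and another poorly.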
Courbariaux et al. [32], Qiu et al. [111] and Shin et al. [129] explored using block floating point (BFP) for weight and activation quantisation. With BFP, often unfortunately referred to as "dynamic fixed point" [148], groups of variables share common binary point locations represented as scaling factors updated during training based on data distributions. As such, it can be seen as a compromise between fully floating- and fixed-point formats. These authors associated each layer's parameters with a scaling factor, updated after each arithmetic operation by checking the parameters' overflow status during training. Their experiments showed that, for both CNNs and RNNs, BFP quantisation of both weights and activations can incur accuracy losses of below 1.0 pp. Since then, BFP has become common in the hardware inference of DNNs as well.
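The idea of a block sharing one binary point can be sketched as follows. This is a simplification under stated assumptions: the shared exponent is chosen once from the largest magnitude in the block, rather than being updated from overflow status during training as in the works above, and the function names are hypothetical.

```python
import math


def bfp_quantise(block, mantissa_bits):
    """Quantise a block to signed integer mantissas sharing one power-of-two scale."""
    max_mag = max(abs(v) for v in block)
    if max_mag == 0.0:
        return [0] * len(block), 0
    q_max = (1 << (mantissa_bits - 1)) - 1
    # Smallest exponent for which the largest magnitude still fits in q_max.
    exponent = math.ceil(math.log2(max_mag / q_max))
    scale = 2.0 ** exponent
    mantissas = [max(-q_max - 1, min(q_max, round(v / scale))) for v in block]
    return mantissas, exponent


def bfp_dequantise(mantissas, exponent):
    """Recover real values from integer mantissas and the shared exponent."""
    return [m * 2.0 ** exponent for m in mantissas]


mantissas, exponent = bfp_quantise([0.5, -0.25, 0.125], 8)
print(mantissas, exponent)
print(bfp_dequantise(mantissas, exponent))

# A block mixing large and small magnitudes loses the small ones.
print(bfp_quantise([100.0, 0.5], 8))
```

The second call illustrates the compromise: range adapts per block, but values much smaller than the block maximum are rounded away.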
Many authors have explored methods allowing for the automatic selection of layer-wise precision. Inspired by Sung et al. [135], Shin et al. proposed the exhaustive search for cost-optimal precisions to use within long short-term memories (LSTMs) through analysis of the tradeoff between signal-to-quantisation-noise ratio (SQNR) and precision [129]. The time complexity of such searches is too high to be practical, however. Qiu et al. formulated an optimisation problem for minimising quantisation error with respect to changes in precision and binary point location [111]. A greedy method was proposed for its solution, resulting in desirable layer-wise CNN quantisations. Lin et al. [84] formulated and solved an SQNR-based optimisation problem to identify the optimal fixed-point precision per layer of a custom-designed CNN, showing that the proposed scheme offered over 1.2× compression for the CIFAR-10 dataset with no loss in accuracy. Their method converts pretrained networks from FP32 into further-quantised equivalents without retraining.
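The tradeoff between SQNR and precision that underlies these searches can be illustrated with a brute-force sweep. This sketch is not the method of any of the works cited above: the synthetic signal, the 40 dB target and the function names are assumptions, and the search simply returns the smallest number of fractional bits meeting the target.

```python
import math


def quantise(x, frac_bits):
    """Round x to the nearest multiple of 2**-frac_bits (no saturation)."""
    scale = 1 << frac_bits
    return round(x * scale) / scale


def sqnr_db(signal, frac_bits):
    """Signal-to-quantisation-noise ratio in decibels."""
    signal_power = sum(v * v for v in signal)
    noise_power = sum((v - quantise(v, frac_bits)) ** 2 for v in signal)
    if noise_power == 0.0:
        return math.inf
    return 10.0 * math.log10(signal_power / noise_power)


def smallest_frac_bits(signal, target_db, max_bits=16):
    """Smallest fractional bit count whose SQNR meets the target, else None."""
    for bits in range(max_bits + 1):
        if sqnr_db(signal, bits) >= target_db:
            return bits
    return None


# Synthetic stand-in for one layer's data; a real search would use actual tensors.
signal = [math.sin(0.37 * k) for k in range(200)]
for bits in (2, 4, 8):
    print(bits, "fractional bits:", round(sqnr_db(signal, bits), 1), "dB")
print("bits needed for 40 dB:", smallest_frac_bits(signal, 40.0))
```

Repeating such a sweep jointly over every layer is what makes exhaustive search expensive, motivating the greedy and optimisation-based alternatives.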
Many authors have focussed on reducing accuracy losses through the modification of rounding schemes. Gupta et al. trained CNNs with 16-bit fixed-point weight representation using stochastic rounding, achieving lossless compression for the MNIST and CIFAR-10 datasets [51]. By following [72]. Experiments with AlexNet showed that the use of seven-bit floating-point weights could achieve the same accuracy as 11-bit fixed-point representation with ImageNet. The authors suggested that weight range is more important than precision in preserving accuracy. This observation laid the foundations for logarithmic quantisation (Section 4.3), which trades off precision for range.

The authors of Adaptive Quantisation investigated quantisation at a finer granularity than the aforementioned down-to layer-wise methods [69]. During retraining, networks adapt, with each filter allowed to assume an independent precision. Experiments with small-scale datasets and models showed that Adaptive Quantisation, when combined with pruning, is able to achieve accuracies and compression ratios superior to binarised neural networks, for which each datum is represented using only a single bit. A framework for implementing low-precision quantisation, DoReFa-Net, supports arbitrary precisions for weights, activations and gradients, from 32-bit fixed point down to binary [161]. Its authors conducted empirical analysis of various data precision combinations, concluding that accuracy deteriorates rapidly when weights and/or activations are quantised to fewer than four bits.

fixed-point data. The throughput advantages and energy savings of FPGAs become more significant as precision decreases. Colangelo et al. presented an Intel FPGA-based inference framework taking advantage of bandwidth and computation savings from low-precision data [29].
Their experimental results for AlexNet, as presented in Figure 2, showed that, as precision fell, the throughput of their FPGA implementation improved and eventually exceeded that of a GPU of similar scale, supporting the conclusions by Nurvitadhi et al. The FPGA achieved an order-of-magnitude throughput improvement over the GPU at binary precision. Zhang et al. showed that a fixed-point-quantised long-term recurrent convolutional network (LRCN) implementation on a Xilinx Virtex 7 VC709 FPGA could achieve a 3.1× throughput speedup vs an Nvidia K80 GPU equivalent [157].
Köster et al. presented Flexpoint, another BFP variant, for CNN training and inference [70]. Using the "flex16+5" (16-bit mantissa and five-bit shared exponent) data format, Intel's neural network ASIC, Nervana, was shown to achieve the same accuracy as FP32, while reducing memory bandwidth by around 50%, for the training of AlexNet and ResNet with ImageNet. The latest-generation Intel FPGAs can pack up to either one 27 × 27-bit or two 18 × 19-bit MAC(s) per DSP block. When using lower precisions on FPGAs, many authors have implemented multipliers using LUTs instead of DSPs to achieve higher resource efficiency. Boutros et al. [15] proposed the enhancement of DSP blocks to support low-precision MACs with some 12% area overhead and no drop in achievable frequency. One such enhanced DSP can perform one 27 × 27 or two 18 × 19, four 9 × 9 or eight 4 × 4 parallel MAC(s). The authors implemented AlexNet, VGG-16 and ResNet-50 using the enhanced DSPs. On average, they improved the throughput of eight-bit and four-bit DNNs by 1.3× and 1.6×, respectively, while correspondingly reducing the occupied area by 15% and 30% compared to the default use of DSPs in the Intel Arria 10 they targeted.
Sharma et al. [126] and Moons et al. [99] both introduced variable-precision bit-parallel ASIC implementations. Sharma et al.'s Bit Fusion consists of an array of bit-level MACs that dynamically fuse to match the precisions of individual DNN layers [126]. Experiments with AlexNet showed that Bit Fusion, while consuming only 900 mW of power, is only 16% slower than an Nvidia Titan Xp implementation using its native eight-bit vector instructions. The Titan Xp can consume up to 250 W of power. Moons et al. used similar ideas, with their implementation consuming 76 mW to achieve 47 cl/s for AlexNet, outperforming static-precision Eyeriss by 3.9× in energy efficiency [99].
Having realised the importance of flexibility of precision in achieving high DNN inference efficiency, GPP manufacturers have recently begun to offer support for low-precision MACs. Intel Cascade Lake CPUs provide so-called Vector Neural Network Instructions in 16- and eight-bit formats [61], while Nvidia Turing GPUs support TensorRT, a deep learning platform integrable with TensorFlow, allowing for low-precision arithmetic down to as few as four bits [106].
We can categorise MACs into two families: bit-parallel and bit-serial. FPGA- and ASIC-based DNN inference architectures with consistent precision generally use bit-parallel MACs for performance and/or simplicity of reuse. fpgaConvNet [142], Angel-eye [48], ESE [52] and works by Chang et al. [18] and Shen et al. [127] represent the state-of-the-art in FPGA-based CNN and RNN implementation using low-precision bit-parallel MACs. DaDianNao [22], Cnvlutin [4], NeuFlow [40] and the TPU [66], meanwhile, are cutting-edge ASIC-based bit-parallel DNN inference platforms. For bit-parallel MACs, DNN hardware is typically designed to natively support the maximum precision of an entire network. However, as suggested by Khoram et al. [69] and Li et al. [82], since actual precision requirements vary considerably across DNN layers, bit-parallel DNN hardware typically processes an excess of bits per operation. Bit-serial alternatives, however, allow precision to be trivially varied at runtime, making their use suitable for fine-grained mixed-precision networks.
Stripes [67], Loom [123] and Bit Pragmatic (PRA) [3] are ASIC-based DNN accelerators that perform layer-wise mixed-precision inference using bit-serial MACs. Among these, experiments showed that Stripes achieved a 1.3× throughput increase over bit-parallel DaDianNao with VGG-19 [67]. Based on Stripes, Albericio et al. proposed an ASIC implementation, PRA, which performs bit-serial neuron activations by shifting inputs with respect to the indices of non-zero bits in the weights [3]. Experiments showed that PRA could achieve 2.6× and 2.0× increases in throughput and energy efficiency, respectively, vs DaDianNao. Gudovskiy et al. proposed an FPGA implementation, ShiftCNN, using similar ideas to PRA [47]. ShiftCNN was shown to obtain 4.2× and 3.8× energy efficiency savings over two baseline CNN platforms using DSP- and LUT-based bit-parallel MACs, respectively. Moss et al. presented an FPGA-based customisable matrix multiplication framework dedicated to DNN inference [100]. Their implementation allows for the runtime switching between static-precision bit-parallel and dynamic-precision bit-serial MAC implementations. They observed up-to 50× throughput increases vs FP32 baselines for AlexNet, VGGNet and ResNet.
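The bit-serial principle described above can be sketched in software as a shift-and-add loop that consumes one weight bit per step, so that the number of steps follows the chosen precision. The sketch assumes unsigned integer weights and hypothetical function names; it models the arithmetic only, not the architecture of Stripes, PRA or any other accelerator.

```python
def bit_serial_multiply(activation, weight, weight_bits):
    """Multiply by scanning the low weight_bits bits of weight, one per step."""
    acc = 0
    for bit in range(weight_bits):
        if (weight >> bit) & 1:
            acc += activation << bit  # add the activation shifted by the bit index
    return acc


def nonzero_bit_positions(weight):
    """Indices of the set bits in an unsigned weight."""
    return [b for b in range(weight.bit_length()) if (weight >> b) & 1]


def shift_add_multiply(activation, weight):
    """Process only the non-zero weight bits, skipping the zero ones."""
    return sum(activation << b for b in nonzero_bit_positions(weight))


print(bit_serial_multiply(13, 10, 4))  # all four bits of 0b1010 scanned
print(bit_serial_multiply(13, 10, 2))  # precision cut to two bits at runtime
print(nonzero_bit_positions(10))       # steps actually needed when zeros are skipped
print(shift_add_multiply(13, 10))
```

Varying `weight_bits` per call mirrors how bit-serial hardware can change precision per layer without any change to the datapath.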
Binarisation and Ternarisation
4.2.1 Algorithmic Development. Binarisation is the quantisation of parameters into just two values, typically {−1, 1} with a scaling factor. Although binary quantisation leads to the incursion of greater error than non-binary fixed-point quantisation, inference operations can be substantially simplified. Early works, such as BinaryConnect, focussed on partial binarisation, for which only weights are binarised [31]. Full binarisation of CNNs was proposed in BinaryNet: both weights and activations are binarised [30]. For binarised training, weights are binarised only during forward propagation; they are not binarised during backward propagation since stochastic gradient descent is sensitive to quantisation and does not work well with very low precisions. The authors of BinaryConnect and BinaryNet proposed binarisation of two types: deterministic and stochastic. For deterministic binarisation, a simple sign function is used, while the stochastic binarisation process is equivalent to stochastic rounding, as was shown in Equation 1. Since the derivative of the sign function is a Dirac delta function with zero everywhere but the origin, rendering the training process impossible, the authors of BinaryNet resorted to using a hard hyperbolic tangent (tanh) function to cope with this problem during backward propagation [30]:

$$\operatorname{tanh}_{\text{hard}}(x) = \max(-1, \min(1, x)).$$

In this way, the gradient of their cost function could be preserved for weights within [−1, 1] during training. Clipping was also applied to the real-valued weights to constrain them to [−1, 1].

Experiments with the MNIST and CIFAR-10 datasets on unidentified networks showed that BinaryConnect achieved around 1-2 pp higher prediction accuracies than FP32 baselines. The authors suggested that this was due to stochastic rounding's regularisation effect, whence randomisation is injected into a network in a similar way to "dropout" in the form of per-neuron binarisation noise [133].
Experiments with BinaryNet with MNIST, CIFAR-10 and SVHN, also on unknown networks, showed accuracy losses of less than 1 pp compared to baseline cases. However, this regularisation effect was only seen for small datasets. For large-scale ones such as ImageNet, although BinaryNet with AlexNet achieved significant memory and computational complexity reductions, this was accompanied by around 30 pp top-one accuracy drops. Binarisation's high error inducement outweighed the positives of regularisation in these cases.
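The two binarisation types and the hard tanh workaround can be sketched for a single scalar weight. The sketch assumes that the probability of drawing +1 is the "hard sigmoid" clip((w + 1)/2, 0, 1), that the hard tanh gradient is 1 inside [−1, 1] and 0 outside, and that plain gradient descent is used; the function names are hypothetical.

```python
import random


def binarise_deterministic(w):
    """Sign function: map a real-valued weight to -1 or +1."""
    return 1.0 if w >= 0.0 else -1.0


def hard_sigmoid(x):
    """Clip (x + 1) / 2 to [0, 1]; used as the probability of drawing +1."""
    return max(0.0, min(1.0, (x + 1.0) / 2.0))


def binarise_stochastic(w, rng):
    """Draw +1 with probability hard_sigmoid(w), otherwise -1."""
    return 1.0 if rng.random() < hard_sigmoid(w) else -1.0


def hard_tanh(x):
    """Clip x to [-1, 1]."""
    return max(-1.0, min(1.0, x))


def hard_tanh_grad(x):
    """Gradient used in place of the sign function's: 1 inside [-1, 1], else 0."""
    return 1.0 if -1.0 <= x <= 1.0 else 0.0


def update_weight(w_real, grad_wrt_binary, learning_rate):
    """Update the real-valued weight, then clip it back to [-1, 1]."""
    w_new = w_real - learning_rate * grad_wrt_binary * hard_tanh_grad(w_real)
    return hard_tanh(w_new)


rng = random.Random(0)  # fixed seed so that the demonstration is repeatable
samples = [binarise_stochastic(0.5, rng) for _ in range(20000)]
mean_sample = sum(samples) / len(samples)

print(binarise_deterministic(0.3), binarise_deterministic(-0.2))
print(round(mean_sample, 2))  # close to 0.5 in expectation, i.e. unbiased
print(update_weight(0.9, -1.0, 0.5))
```

The sample mean approaching the real-valued weight is what allows stochastic binarisation to act as noise rather than bias.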
In an effort to improve BinaryNet's data representation, XNOR-Net features trainable filter-wise scaling factors for forward propagation [112]. These scaling factors retain the average magnitudes of weights and activations in order to improve the expressiveness of binarised networks. Experiments with XNOR-Net inferencing AlexNet with the ImageNet dataset showed that this method successfully improved top-one accuracy by around 20 pp compared with BinaryNet, while there was still an accuracy drop of over 10 pp vs a FP32 baseline. XNOR-Net does, however, require averaging operations over input features, adding costly high-precision dividers [44].
ABC-Net alleviates the information loss from binarisation by approximating FP32 parameters and activations as linear combinations of multiple binary values [86]. Its authors pointed out that, during forward propagation, their K-binarisation scheme (K parallel bitwise XNORs) is cheaper than performing K-bit fixed-point multiplication, emphasising ABC-Net's superior resource efficiency over conventional fixed-point CNN implementations. A five-bit weight/activation ABC-Net achieved a 14 pp top-one accuracy improvement vs XNOR-Net with ImageNet on ResNet-18.
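A filter-wise scaling factor that retains average magnitude, together with the XNOR-and-popcount dot product that binarisation enables, can be sketched as below. The sketch assumes that the scaling factor is the mean absolute weight of the filter and that +1 and −1 are encoded as bits 1 and 0; the function names are hypothetical.

```python
def binarise_filter(weights):
    """Return (alpha, signs) such that weights are approximated by alpha * signs."""
    alpha = sum(abs(w) for w in weights) / len(weights)
    signs = [1 if w >= 0 else -1 for w in weights]
    return alpha, signs


def pack_bits(signs):
    """Encode +1 as bit 1 and -1 as bit 0, element i at bit position i."""
    word = 0
    for i, s in enumerate(signs):
        if s > 0:
            word |= 1 << i
    return word


def xnor_popcount_dot(a_word, b_word, n):
    """Dot product of two packed {-1, +1} vectors of length n."""
    mask = (1 << n) - 1
    agreements = bin(~(a_word ^ b_word) & mask).count("1")
    return 2 * agreements - n  # agreements contribute +1, disagreements -1


alpha, w_signs = binarise_filter([0.5, -1.5, 0.25, -0.25])
x_signs = [1, 1, -1, -1]
dot = xnor_popcount_dot(pack_bits(w_signs), pack_bits(x_signs), 4)

print(alpha, w_signs)
print(dot, alpha * dot)  # binary dot product, then rescaled by alpha
```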
Tang et al. proposed a number of improvements to the binarised retraining process [139]. One of their discoveries was that a low learning rate is preferable in order to avoid frequent parameter oscillation, which leads to prolonged and inefficient training. Furthermore, a binary-constrained regulariser was added to their training loss function to encourage more bipolar weight values (closer to ±1). This was implemented within the function as

$$\mathrm{loss}_{\text{post-reg}}(W, b) = \mathrm{loss}_{\text{task}}(W, b) + \lambda \sum_{l=1}^{L} \sum_{i=1}^{N_l} \sum_{j=1}^{M_l} \left(1 - W_{l,ij}^2\right),$$

wherein $W$, $b$ and $\lambda$ represent weight, bias and regularisation factor, respectively. $L$ is the network's depth and $M_l$ and $N_l$ are the input and output channel numbers in the $l$th layer. $\mathrm{loss}_{\text{task}}(W, b)$ returns the task-related loss based on the original network settings, while $\mathrm{loss}_{\text{post-reg}}(W, b)$ gives the post-regularisation loss. Tang et al.'s regulariser penalised with respect to the implemented network's overall quantisation loss. These optimisations, together with multi-bit activation representation, resulted in a 6.4 pp top-one AlexNet accuracy increase over XNOR-Net for ImageNet. Going further, HWGQ addressed the problem of mismatching gradients between the binarised forward activation function, sign, and the backward activation function, hard tanh [16]. HWGQ uses a half-wave Gaussian-quantised (HWGQ) rectified linear unit (ReLU) for forward propagation and a standard ReLU function for backward propagation. The authors' experiments with AlexNet produced a 47% top-one ImageNet error rate: the lowest achieved for a binary network to date.
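A binary-constrained regulariser of the kind Tang et al. describe above can be sketched in a few lines. The sketch assumes a per-weight penalty of 1 − w², which is zero at ±1 and largest at 0, summed over all layers and scaled by the regularisation factor; this penalty form, the flat per-layer weight lists and the function names are assumptions for illustration.

```python
def binary_regulariser(weights_per_layer):
    """Sum of (1 - w^2) over every weight in every layer."""
    return sum(1.0 - w * w for layer in weights_per_layer for w in layer)


def loss_post_reg(loss_task, weights_per_layer, lam):
    """Task loss plus the scaled binary-constrained penalty."""
    return loss_task + lam * binary_regulariser(weights_per_layer)


bipolar = [[1.0, -1.0], [-1.0, 1.0]]   # already at +/-1: no penalty
near_zero = [[0.1, -0.1], [0.0, 0.2]]  # far from +/-1: heavily penalised

print(loss_post_reg(2.0, bipolar, 0.1))
print(loss_post_reg(2.0, near_zero, 0.1))
```

Weights that already sit at ±1 add nothing to the loss, so minimising the penalty pushes the remaining weights towards the bipolar values.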
Ott et al. suggested that RNNs are not amenable to binarisation since the large quantisation losses of near-zero values forced to ±1 get amplified over their recursions [108]. Nevertheless, Liu et al. implemented binarisation in LSTMs targeting English and Chinese language modelling, although they only applied it to input and output embedding layers (those that encode text as vectors) [91]. The authors reported up to 11× compression of those layers without accuracy loss. Given these seemingly conflicting conclusions, further experiments are required to establish the effectiveness of binarisation in RNNs.
Adding zero to the binary value set gives ternary representation. TernaryConnect [87] and Ott et al.'s work [108] introduced ternary CNNs and RNNs, respectively, for improved accuracy. The accuracies of TernaryConnect exceeded the previous-best results for MNIST, CIFAR-10 and SVHN reported by the authors of BinaryConnect [31]. For each layer $l$, Ternary Weight Networks (TWNs) use tunable symmetric thresholds $\pm\delta_l$ to differentiate 0 from ±1 [78]. For an AlexNet implementation classifying ImageNet, TWNs achieved a 46% top-one error rate: lower than all binarised neural networks reported thus far. In Trained Ternary Quantization, parameters are represented in the form $\{w_l^-, 0, w_l^+\}$, wherein $w_l^-$ and $w_l^+$ are trainable [162]. Compared with TWNs, a further accuracy improvement of around 5 pp was reported for AlexNet with ImageNet.
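Threshold-based ternarisation of the kind described for TWNs can be sketched as follows. The threshold `delta` is supplied by the caller, and the choice of per-layer scale (the mean magnitude of the weights that survive thresholding) is an assumption made for illustration; both function names are hypothetical.

```python
def ternarise(weights, delta):
    """Map each weight to -1, 0 or +1 using symmetric thresholds +/-delta."""
    codes = []
    for w in weights:
        if w > delta:
            codes.append(1)
        elif w < -delta:
            codes.append(-1)
        else:
            codes.append(0)
    return codes


def layer_scale(weights, codes):
    """Assumed scale: mean magnitude of the weights that were not zeroed."""
    kept = [abs(w) for w, c in zip(weights, codes) if c != 0]
    return sum(kept) / len(kept) if kept else 0.0
```

A layer's weights are then approximated as `layer_scale(...) * code`, so only two bits per weight plus one scale per layer need to be stored.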
Mellempudi et al. presented Fine-grained Quantisation (FGQ) [95], which involves partitioning the weights of a pretrained FP32 network into groups, then ternarising each group independently. Within a group $g$, the ternary weights can have distinct quantisation levels $\{-w_g, 0, w_g\}$. Although groups can be determined arbitrarily, in this case the authors grouped by channel, both for simplicity and to promote implementational efficiency. Assuming that a network has G such groups, there are 2G + 1 distinct levels with which to represent weights in total, increasing the model's representation capacity over ternarisation with equal granularity. Experiments with ImageNet showed that an FGQ-quantised AlexNet with ternary weights and four-bit activations suffered 7.8 pp accuracy loss compared to the baseline.
Alemdar et al. combined ternarisation with knowledge distillation, in which shallower "student" networks are used to mimic deeper "teachers" [5]. In hardware, ternarisation requires cheaper arithmetic operators than higher-than-two-bit fixed-point quantisation. To improve the accuracy of a ternary student network, stochastic rounding (Equation 1) is used while ternarising during teacher network backward propagation. Experiments with MNIST, CIFAR-10 and SVHN on arbitrarily chosen models showed that ASIC implementations of this work achieved 3.1× greater energy efficiency, on average, than IBM TrueNorth executing the same benchmarks with ternary data [2].
While low-precision networks lead to significant network compression, they often require higher numbers of neurons to achieve accuracies comparable to their floating-point counterparts. For the CIFAR-10 dataset, for example, binary networks such as FINN and ReBNet require a wider and deeper model, CNV, in order to achieve similar accuracy to an FP32 baseline with CifarNet, a much thinner and shallower model [130]. Zhu et al. proposed the Binary Ensemble Neural Network (BENN), in which multiple binarised networks are aggregated by "boosting" (parallel ensemble with trained weights) [163]. The authors showed that their network ensembles exhibited lower bias and variance than their individual constituents while also having improved robustness to noise. Experiments with AlexNet on the ImageNet dataset showed that the use of BENN, with AdaBoost (adaptive boosting) and an ensemble of six binarised networks, led to only 2.3 pp of top-one accuracy loss vs an FP32 baseline. The authors of WRPN explored the same phenomenon by gradually reducing network precision and increasing the number of channels of an originally FP32 network, finding that, by increasing model complexity, a low-precision network can eventually match or even surpass the accuracy of its baseline. Further research is required to identify models that are particularly amenable to low-precision inference [96].
[...] fixed-point multiplication. Furthermore, accumulation becomes a population count (popcount) operation, which, on an FPGA, requires half the LUTs of an equivalent adder tree [141]. Umuroglu et al. [141] and Ghasemzadeh et al. [44] suggested that, during binary inference, operations in batch normalisation can be simplified to binary thresholding, where $y = \text{sign}(\alpha x - b) = \text{sign}(x - b/\alpha)$. $x$, $\alpha$, $b$ and $y$ are the input, scaling factor, bias and output, respectively. A max-, min- or average-pooling layer in a binary network can be efficiently implemented using OR, AND or majority functions.
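The two simplifications above, popcount-based accumulation and batch normalisation as thresholding, can be sketched in a few lines. This is an illustrative software model rather than any framework's implementation: the encoding of +1 as bit 1 and all function names are assumptions, and the thresholding equivalence as written relies on a positive scaling factor.

```python
def pack_bits(values):
    """Pack a list of +1/-1 values into an int, encoding +1 as bit 1."""
    word = 0
    for i, v in enumerate(values):
        if v == 1:
            word |= 1 << i
    return word


def binary_dot(a_word, b_word, n):
    """Dot product of two packed {-1, +1} vectors of length n via XNOR and popcount."""
    mask = (1 << n) - 1
    xnor = ~(a_word ^ b_word) & mask
    matches = bin(xnor).count("1")   # popcount of agreeing positions
    return 2 * matches - n           # matches minus mismatches


def batchnorm_sign(x, alpha, b):
    """Reference form: sign(alpha * x - b), with sign(0) taken as +1."""
    return 1 if alpha * x - b >= 0 else -1


def threshold_sign(x, alpha, b):
    """Equivalent comparison against b / alpha, valid when alpha > 0."""
    return 1 if x >= b / alpha else -1
```

Since `b / alpha` can be precomputed per channel, the hardware only needs a comparator where a multiplier and subtractor would otherwise be used.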
On GPUs, 32 one-bit activations and weights can be packed into each word to perform bit-wise XNORs. On a Titan X Pascal GPU, 32 32-bit popcounts can be issued per cycle per streaming multiprocessor (SM). Thus, up to 512 binary MAC operations can be performed per cycle per SM. As the same GPU can issue up to 128 FP32 MAC instructions per cycle per SM, however, it can be estimated that the theoretical peak throughput gain of a binary network over FP32 for that GPU is only 4× [104].
On FPGAs, binary network inference can show more significant performance gains. Many frameworks, including FINN [141], FP-BNN [83] and that from Moss et al. [100], have been built to achieve this, resulting in orders of magnitude higher throughput and energy efficiency than floating-point counterparts of comparable scale. FINN's authors constructed small binary networks for the MNIST, CIFAR-10 and SVHN datasets targeting the Xilinx Zynq ZC706 FPGA. Experiments with the CNV network (110 Mop/cl) resulted in sustained throughput of 22 kcl/s, the highest throughput at the time of publication, while consuming as little as 25 W of power. The authors of FP-BNN implemented AlexNet (2.3 Gop/cl), one of the larger CNNs, on an Intel Stratix V FPGA, reporting a throughput of 870 cl/s, 2.7× faster than a 235 W-consuming Tesla K40 GPU executing the same binary network, while drawing only 26 W of power. On a smaller custom network designed for CIFAR-10 inference (1.2 Gop/cl), in which arithmetic intensity was higher, FP-BNN achieved a peak throughput of 7.6 kcl/s. Moss et al. showed that, with binarisation, the HARPv2 heterogeneous platform could achieve a peak throughput of 110 cl/s for VGGNet, with 1.2× greater energy efficiency than a Titan X Pascal GPU-based alternative [100]. The authors of ReBNet implemented "residual binarisation" on FPGAs [44], which is similar to ABC-Net's aforementioned K-binarisation scheme [86]. They observed accuracy improvements when higher data widths were used, as was the case for ABC-Net. ReBNet's authors reported that their work exposes a continuum between accuracy and area, making it amenable to a wide range of application requirements and hardware constraints.
Prost-Boucle et al. implemented ternary CNNs on a Xilinx Virtex-7 VC709 FPGA, presenting both high-performance- and low-power-targeting designs [110]. Their experiments with the CNV model classifying CIFAR-10 demonstrated a 6.6 pp accuracy improvement compared to FINN's binarised inference. In high-performance mode, up to 27 kcl/s was achieved with around 13 W of power consumption while, in low-power mode, 14 kcl/s was obtained for half the power. The authors of YodaNN introduced a 65 nm ASIC implementation featuring partial binarisation, in which activations and weights are quantised to 12 and one bit(s), respectively [9]. Experiments with AlexNet and the ImageNet dataset showed that YodaNN achieved a throughput of 0.50 cl/s and an energy efficiency of 2.0 kcl/J at 0.60 V.
4.3 Logarithmic Quantisation
4.3.1 Algorithmic Development. In a base-two logarithmic representation, parameters are quantised into powers of two with a scaling factor. Suiting the observation that a weight's representation range is more important than its precision in preserving network accuracy, logarithmic representations can cover wide ranges using few bits [72]. While logarithmic representation can also be used for activations, this has yet to be explored. LogNet's authors quantised CNNs with weights encoded in a four-bit logarithmic format, after which they performed retraining to recover some lost accuracy [76]. Their experiments with the ImageNet dataset revealed 4.9 pp and 4.6 pp top-five accuracy drops for AlexNet and VGG16, respectively. In Incremental Quantisation (INQ), weights are iteratively quantised into a logarithmic format, with activations left as eight-bit fixed-point values [159]. In each iteration, parameters in each layer are partitioned into two groups using a threshold on absolute parameter values. The group with higher absolute values is quantised into powers of two directly, whereas the other is retrained in the following iteration in FP32 to compensate for losses. This process repeats until all parameters are quantised. Experiments with ImageNet on AlexNet showed a negligible (∼0.1 pp) accuracy loss against the baseline while using only five and eight bits per weight and activation, respectively.
4.3.2 Hardware Implementation. For hardware inference, base-two logarithmic representations see multiplications converted into binary shifts for greater area and energy efficiencies as well as speed. GPPs perform binary shifts using shifters embedded in arithmetic and logic units, most of which can move their operands by an arbitrary number of bits per operation. On an Nvidia Maxwell GPU, the theoretical peak throughput of 32-bit binary shifts is 50% of that of FP32 MACs [105].
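The conversion of multiplications into shifts can be made concrete with a small sketch. Rounding in the log domain and the exponent range `[min_exp, max_exp]` are assumptions chosen for illustration, as are the function names; real designs differ in how they encode zero and clip exponents.

```python
import math


def log2_quantise(weight, min_exp=-7, max_exp=0):
    """Return (sign, exponent) so that weight ~= sign * 2**exponent."""
    if weight == 0:
        return 0, min_exp
    sign = 1 if weight > 0 else -1
    exponent = round(math.log2(abs(weight)))
    exponent = max(min_exp, min(max_exp, exponent))
    return sign, exponent


def shift_multiply(activation, sign, exponent):
    """Multiply an integer activation by sign * 2**exponent using shifts only."""
    if exponent >= 0:
        magnitude = activation << exponent
    else:
        magnitude = activation >> -exponent   # right shift floors the result
    return sign * magnitude
```

For example, a weight of 0.24 is stored as the pair `(1, -2)`, and multiplying an activation of 64 by it becomes a right shift by two bits.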
In custom hardware, a multiplication between an exponentially quantised weight parameter and an activation can be implemented cheaply using a variable-length binary shifter. With LogNet, CNN inference is performed on FPGAs with four-bit logarithmic-quantised weights [76]. Experiments with three convolutional layers showed an over-3.0× energy efficiency improvement vs an Nvidia Titan X GPU implementation, while a four-bit logarithmic implementation of AlexNet demonstrated an around-5 pp accuracy loss for ImageNet. Wang et al. implemented base-two logarithmic quantisation on weights associated with input, output and forget gates in LSTMs while leaving the remaining gates in non-logarithmic eight-bit fixed-point precision [146]. In their 90 nm ASIC implementation, multiplications with logarithmic-quantised weights are implemented with shift-and-add operations, which occupy significantly less area than MACs using non-logarithmic fixed-point quantisation. Wang et al.'s ASIC was able to process a 512 × 512 LSTM layer within 1.7 µs at a silicon area cost of 31 mm².
The implementations mentioned above reuse binary shifters over different groups of weights for scalability. For custom hardware, if shift amounts are constant, no logic is required for multiplications: they can be performed in routing alone. This means that fixing DNN parameters using constant-length shifts instead of multiplications can result in significant resource and latency savings. Server-scale platforms with massive resource availability, such as Microsoft Catapult [17] and Amazon Web Services, should be able to benefit hugely from such optimisations.
WEIGHT REDUCTION
Let us now turn to DNN approximation's second key subject: weight reduction. Here, parameters deemed unimportant are eliminated entirely. Weight reduction improves the performance of hardware inference by reducing both workload and off-chip memory traffic.
5.1 Pruning
Pruning is the process of removing redundant connections in a DNN. Inspired by early works including Optimal Brain Damage [75] and Optimal Brain Surgeon [56], Srinivas et al. proposed a retraining-free method for removing redundant neurons in trained CNNs [132]. Similar neurons can be wired together and hence pruned away. The authors proposed the similarity evaluation of neurons using a matrix of their squared Euclidean distances. This method resulted in 6.7× and 1.5× compression for the MNIST and AlexNet networks, respectively. Experiments with AlexNet revealed 2.2 pp of ImageNet accuracy loss.
Han et al. were the first to propose an iterative pruning process [55]. In their work, one iteration consists of pruning followed by retraining, allowing the remaining connections to learn to compensate for the pruning loss. After many such iterations, lossless compression ratios of 9.0 and 13 were achieved for AlexNet and VGG16, respectively, both classifying the ImageNet dataset. The authors attempted to promote sparsity in the networks by penalising non-zero parameters with an $\ell_1$ or $\ell_2$ norm-based sparsity regulariser [34] during retraining. An $\ell_2$ norm-based sparsity regulariser can be implemented as
$$\text{loss}_{\text{post-reg}}(W, b) = \text{loss}_{\text{task}}(W, b) + \lambda \sqrt{\sum_{l=1}^{L} \sum_{m=1}^{M_l} \sum_{n=1}^{N_l} W_{lmn}^2}.$$
Inspired by Han et al.'s work, the authors of Dynamic Network Surgery (DNS) performed pruning followed by "splicing," wherein the salience (importance) of the remaining parameters is evaluated; parameters' salience varies when others are removed [50]. DNS achieved 110× and 18× compression for LeNet-5 and AlexNet, respectively.
The proposals above all see DNNs pruned at element-wise granularity, often referred to as fine-grained pruning. Although pruning at the finest granularity leads to excellent compression ratios, it can also result in significant irregularities in weight distribution, which, in turn, can make it difficult for the inference hardware to convert compression into increased throughput. Coarse-grained pruning methods have hence been proposed, which produce larger but denser networks than those resulting from fine-grained pruning. Lebedev et al. introduced Structured Brain Damage, wherein a group-wise sparsification regulariser (Equation 3) shapes each weight matrix's non-zeroes into a regular, dense pattern [74]. Experiments showed 3.0× improvements in both compression ratio and throughput with sub-1.5 pp accuracy degradation for AlexNet classifying ImageNet. Wen et al. [147], Li et al. [80], He et al. [57] and Su et al. [134] performed structured pruning along channels, filters, layers and shapes (arbitrary groups of parameters) of CNNs.
All of these works proposed the pruning of groups of redundant parameters based on sums of parameter magnitudes where, intuitively, those with lower values are deemed less important. The authors of Network Slimming argued that, although sparsity can be realised at different granularities, pruning at the channel level provides a tradeoff between flexibility and ease of hardware implementation [92]. The output of Network Slimming is simply a "thinned" version of an unpruned network. With every convolutional and fully connected layer followed by a batch-normalisation layer, networks are trained before pruning such that batch normalisation scaling factors represent the relative importance of each channel. Layer-wise pruning is then performed by thresholding them. An $\ell_1$ sparsity regulariser is used on the scaling factors, instead of each parameter, in order to promote channel-wise sparsity. 20× compression and a 5× workload reduction were reported against an unpruned baseline for VGGNet. Experiments with ImageNet on the VGG-A model demonstrated about-5.8× compression with less than 0.1 pp of accuracy loss.
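Two of the pruning granularities discussed above can be contrasted in a short sketch: element-wise magnitude pruning, and channel-level pruning by thresholding per-channel scaling factors. Retraining between pruning steps is omitted, and the function names, the `sparsity` fraction and the `threshold` value are illustrative assumptions rather than any cited work's settings.

```python
def magnitude_prune(weights, sparsity):
    """Zero the smallest-magnitude fraction of weights (fine-grained pruning)."""
    num_pruned = int(len(weights) * sparsity)
    order = sorted(range(len(weights)), key=lambda i: abs(weights[i]))
    pruned = set(order[:num_pruned])
    return [0.0 if i in pruned else w for i, w in enumerate(weights)]


def prune_channels(channels, scales, threshold):
    """Drop whole channels whose scaling-factor magnitude is below threshold."""
    return [ch for ch, s in zip(channels, scales) if abs(s) >= threshold]
```

The first function leaves an irregular pattern of zeroes in a tensor of unchanged shape, whereas the second returns a smaller, dense tensor that standard dense kernels can process directly.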
Decisions on whether to prune specific parameters are based on parameter salience. Establishing accurate salience estimations is thus crucial for pruning effectiveness. Molchanov et al. proposed and compared various criteria for determining weight salience, including pruning by the magnitude, mutual information (against classification ground truth) and Taylor expansion of quantisation noise [97]. Of these, the Taylor expansion-based criterion was found to perform particularly well. Unlike the works above, which all defined parameter salience as the impact on accuracy, Yang et al. defined it as the impact on energy efficiency, achieving an energy saving of 3.7× with ImageNet on AlexNet against an Nvidia Titan X GPU equivalent [152].
5.1.2 Hardware Implementation. Coarse-grained pruning produces outputs in structured and dense patterns such that the Basic Linear Algebra Subprograms (BLAS) for GPPs can directly benefit from reductions in workload. It is more challenging for GPPs to benefit from fine-grained pruning, however. Modern GPUs follow a SIMT execution model, in which threads execute the same sequence of instructions on different data. Compute speed is thus bottlenecked by the slowest thread; others remain idle until synchronisation points are reached. Checking for zeroes in matrices adds extra instructions to each thread, further reducing computational efficiency. An alternative approach is to use linear algebra libraries supporting zero-skipping, such as sparse matrix-vector multiplication (SPMV). Monakov et al. proposed a matrix storage format that improves locality and enables automatic parameter tuning on GPUs [98]. Bell et al. implemented data structures and algorithms for SPMV on an Nvidia GeForce GTX 280 GPU, with which they achieved state-of-the-art FP32 performance [12]. For SPMV to show performance and/or memory storage advantages, however, matrices need to be highly sparse. This is often the case for RNNs, which normally have over 80% sparsity [52], but is not usually true for CNNs (typically only 5-50% sparsity) [74].
Custom hardware can handle irregular, sparse data more efficiently than GPPs for fine-grained-pruned DNNs. Li et al. presented an FPGA design framework for CNN sparsification and acceleration [81]. Their work features a load balancing-aware sparsification training scheme facilitating efficient parallelism. Their FPGA implementation of AlexNet achieved 12× throughput acceleration over an Intel Xeon CPU-based benchmark. Posewsky et al. presented an FPGA implementation of high-throughput zero-skipping suiting fine-grained pruning [109]. The authors proposed that, post-pruning, each non-zero weight be encoded as a two-element tuple $(w_i, z_i)$ containing weight value $w_i$ and number of preceding zeroes $z_i$, where $i$ is the weight's index. In this way, when a batch of input activations is buffered on-chip, the hardware will only fetch the weights pointed to by $z_i$, corresponding to non-zeroes only. Experiments with an unidentified model showed that their Xilinx Zynq XC7Z020 FPGA implementation surpassed the throughput of ARM Cortex-A9 and Intel Core i7-5600U CPU equivalents, with > 85% energy savings.
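The tuple encoding described above can be modelled in software as follows. This sketch assumes that $z_i$ counts the zeroes since the previous non-zero weight (or since the start of the row); the function names are hypothetical and the code is a behavioural model, not the authors' hardware design.

```python
def encode_sparse(weights):
    """Encode non-zero weights as (value, number_of_preceding_zeroes) tuples."""
    encoded, zero_run = [], 0
    for w in weights:
        if w == 0:
            zero_run += 1
        else:
            encoded.append((w, zero_run))
            zero_run = 0
    return encoded


def sparse_dot(encoded, activations):
    """Dot product that touches only activations paired with non-zero weights."""
    total, position = 0, -1
    for w, zeros in encoded:
        position += zeros + 1          # skip over the run of zero weights
        total += w * activations[position]
    return total
```

The number of multiply-accumulates performed equals the number of stored tuples, so the work done scales with the count of non-zero weights rather than with the dense row length.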
ESE's authors reported that, with pruning and retraining, more than 90% of the parameters of an arbitrarily chosen LSTM trained on the TIMIT dataset could be pruned away without harming accuracy [52]. They proposed "balance-aware" pruning to shape weight matrices into equal workloads for parallel compute units during retraining. On FPGAs, weight matrices are stored and computed in a compressed sparse column format to skip zeroes under this proposal. ESE demonstrated 3.0× throughput acceleration vs an Nvidia Pascal Titan X GPU implementation.
The authors of Eyeriss [23], EIE [53], Cnvlutin [4] and Laconic [124] sought to remove multiplications by zero-valued activations. The authors of Cnvlutin achieved this by computing only non-zero inputs and using an "offset" buffer, alongside the input buffer, to store the indices of each input's corresponding weights after zero-skipping. A hardware controller fills the offset buffer on the fly such that it does not consume extra bandwidth. To further increase acceleration, Cnvlutin prunes near-zero outputs during inference in order to increase the sparsity of the next layer's input buffer. Experiments with several CNNs, including AlexNet, GoogleNet and VGG-19, showed 1.2-1.6× throughput increases over DaDianNao [22] without any loss in accuracy for ImageNet. While Cnvlutin incurred an area overhead of 4.5% over DaDianNao, it beat it by 1.5× in terms of energy efficiency for an unnamed model. Eyeriss, EIE and Laconic's authors achieved benefits from pruning using similar strategies to those employed by Cnvlutin's.
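Both forms of zero-skipping mentioned above, over weights stored in compressed sparse column (CSC) form and over zero-valued activations, can be combined in one matrix-vector routine. The sketch below is a plain software model with hypothetical function names; it does not reproduce the scheduling or load balancing of any of the cited accelerators.

```python
def to_csc(matrix):
    """Convert a dense row-major matrix into (values, row_indices, col_pointers)."""
    values, row_indices, col_pointers = [], [], [0]
    num_rows, num_cols = len(matrix), len(matrix[0])
    for c in range(num_cols):
        for r in range(num_rows):
            if matrix[r][c] != 0:
                values.append(matrix[r][c])
                row_indices.append(r)
        col_pointers.append(len(values))
    return values, row_indices, col_pointers


def csc_matvec(csc, x, num_rows):
    """Compute y = W @ x, skipping zero weights and zero activations."""
    values, row_indices, col_pointers = csc
    y = [0] * num_rows
    for c, x_c in enumerate(x):
        if x_c == 0:
            continue                   # a zero activation contributes nothing
        for k in range(col_pointers[c], col_pointers[c + 1]):
            y[row_indices[k]] += values[k] * x_c
    return y
```

Column-major storage suits this access pattern because each input activation selects exactly one contiguous run of stored weights.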
Unlike the previous proposals, which all prune parameters to achieve throughput speedups, the authors of Eyeriss and Minerva targeted power savings through the elimination of redundant off-chip memory fetches [23, 114]. Experiments with Minerva showed that their 40 nm ASIC implementation achieved an 8.1× energy efficiency improvement (also for an unidentified model) compared with an ASIC baseline.
5.2 Weight Sharing
5.2.1 Algorithmic Development. Weight sharing groups parameters into buckets, reducing network size as well as enabling multiplications to be converted into cheaper table lookups. In HashedNets, a low-cost hash function is used to randomly group connection weights into buckets, with all connections in each bucket sharing a single value [21]. These parameters are then trained to adjust to the weight sharing with standard backward propagation. Experiments with the MNIST dataset showed that HashedNets achieved a compression ratio of 64 with an around-0.7 pp accuracy improvement against a five-layer CNN baseline. The authors suggested that the accuracy rise could be attributed to the "virtual" connections created that seemingly increased expressiveness.
Ullrich et al. performed retraining using soft weight sharing on pretrained networks in order to fine-tune the centroids used for parameter clustering [140]. Soft weight sharing was originally proposed by Nowlan and Hinton, who modelled cluster centroids with a mixture of Gaussians [102]. When retraining with this constraint, weights tend to concentrate very tightly around a number of cluster components, the centroids of which optimise to improve accuracy. Experiments showed 160× compression for MNIST on LeNet-5 with an accuracy loss of ∼0.1 pp.
With Deep Compression, weight sharing is performed in several steps [54]. A network is first pruned with iterative retraining [55], after which weights are quantised via k-means clustering. The quantised network is then retrained to fine-tune the remaining connections and update the cluster centroids. Finally, the quantised weights are compressed with Huffman coding to save memory. With k-means clustering, the spatial complexity of a size-K weight matrix reduces from $O(K^2)$ to $O(k)$. Using their basket of approximation techniques, the authors of Deep Compression achieved 35× overall compression for AlexNet with no drop in ImageNet accuracy.
The proposals above only encode weights. Both LookNN [113] and Quantised CNN [149] follow the "product quantisation" algorithm [64], which encodes both weights and activations. Rather than operating element-wise, this method does so on subvectors of weight matrices. Experiments with Quantised CNN revealed 19× AlexNet compression in return for 1.5 pp of ImageNet accuracy loss.
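The k-means quantisation step used for weight sharing can be sketched on scalar weights as follows. The linear centroid initialisation, the fixed iteration count and the function names are illustrative assumptions (the sketch also requires `k >= 2`); pruning, retraining and Huffman coding are left out.

```python
def kmeans_1d(weights, k, iterations=20):
    """Cluster scalar weights into k centroids with plain Lloyd iterations."""
    lo, hi = min(weights), max(weights)
    # Linear initialisation across the weight range (an illustrative choice).
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    assignments = [0] * len(weights)
    for _ in range(iterations):
        assignments = [
            min(range(k), key=lambda c: abs(w - centroids[c])) for w in weights
        ]
        for c in range(k):
            members = [w for w, a in zip(weights, assignments) if a == c]
            if members:
                centroids[c] = sum(members) / len(members)
    return centroids, assignments


def decode(centroids, assignments):
    """Rebuild shared weights: each index selects one codebook entry."""
    return [centroids[a] for a in assignments]
```

After clustering, each weight is stored as a short index into the codebook of `k` centroids, which is what allows multiplications to be replaced by table lookups.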
5.2.2 Hardware Implementation. During inference, weight sharing-based implementations require a large number of lookup operations, which can be performed significantly more efficiently on FPGAs than GPPs. Samragh et al. implemented weight sharing on FPGAs [120]. Here, k-means cluster centroids are determined with tunable parameters during retraining, eliminating almost all multiplications. An up-to 15× improvement in throughput and compression ratio of 9.0 were reported along with sub-0.1 pp of accuracy losses for small DNN datasets such as MNIST and ISOLET on unidentified network models. The authors of PQ-CNN presented a hardware-software framework for compressing and accelerating CNNs on FPGAs using product quantisation [64], adopting a similar idea to that used in Quantised CNN [149, 156]. Going further, the authors implemented an extra codebook to compress encoding parameters, increasing the compression of the original algorithm. During inference, since all possible multiplication outputs with every codeword are precomputed and stored on-chip, PQ-CNN sees dot products for both convolutions and fully connected layers converted into table lookups and accumulations. The authors' Amazon F1 implementation achieved 4.6 kcl/s for the VGG16 model with a sub-0.5 pp drop in top-five accuracy for ImageNet.
5.3 Low-rank Factorisation
[...] [36]. A biclustering approximation performs k-means clustering on rows and columns of weight matrices [64]. These methods were tested with a 15-layer CNN classifying the ImageNet dataset. Among them, SVD achieved the best performance: 13× compression of the first fully connected layer with 0.84 pp of top-one accuracy loss. Tai et al. also performed network decomposition using SVD [138]. They achieved up to 5.0× compression and a 1.8× throughput speedup for ImageNet on AlexNet, reporting a top-five accuracy reduction below 0.5 pp.
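SVD-based compression of a fully connected layer can be sketched with NumPy as below. Truncating to a fixed `rank` and folding the singular values into the left factor are illustrative choices, and the function names are hypothetical; the cited works additionally fine-tune after decomposition.

```python
import numpy as np


def low_rank_factors(weight, rank):
    """Factorise weight (M x N) into A (M x rank) and B (rank x N) via SVD."""
    u, s, vt = np.linalg.svd(weight, full_matrices=False)
    a = u[:, :rank] * s[:rank]         # scale the kept left singular vectors
    b = vt[:rank, :]
    return a, b


def compression_ratio(shape, rank):
    """Parameters before factorisation divided by parameters after."""
    m, n = shape
    return (m * n) / (rank * (m + n))
```

Inference then computes `a @ (b @ x)` instead of `weight @ x`, which reduces both storage and multiply-accumulate count whenever `rank * (M + N) < M * N`.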
While post-training decomposition is simple and flexible, many works have shown that training after decomposition can recover compression losses. As suggested by Jaderberg et al., weight matrices can be decomposed into several low-rank matrices to enable workload and/or memory reductions [63]. The authors proposed the factorisation of each of their four-dimensional layers into a sequence of two regular convolutional layers, each of three dimensions. Experiments with various nonstandard scene text character recognition datasets showed that this method achieved, on average, a 4.5× increase in throughput with around-1 pp falls in accuracy for some unidentified networks.
This factorisation scheme inspired MobileNet, which uses one three-dimensional "depthwise" and one two-dimensional "pointwise" separable convolutional layer to approximate each original layer [60]. Assume that a convolutional layer contains K × K × M × N values, where K, M and N are the size of the kernel and numbers of input and output channels, respectively. In MobileNet, this is factorised into a depthwise convolutional layer with K × K × M × 1 values and a pointwise convolutional layer of size 1 × 1 × M × N.
This method effectively reduces the complexity of forward propagation from $O(MD^2K^2N)$ to $O(MD^2(K^2 + N))$, where D is the size of the input feature map. Experiments with ImageNet showed that MobileNet can achieve a 3.0 pp top-one accuracy improvement with 46× compression for AlexNet.
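The saving from depthwise separable factorisation can be checked with simple counting. The sketch below assumes that the output feature map has the same D × D size as the input (stride one with padding); the function names and the example layer dimensions are illustrative.

```python
def standard_conv_cost(k, m, n, d):
    """Weights and multiply-accumulates for a K x K x M x N convolution."""
    params = k * k * m * n
    macs = d * d * params
    return params, macs


def separable_conv_cost(k, m, n, d):
    """Depthwise (K x K x M x 1) plus pointwise (1 x 1 x M x N) layers."""
    params = k * k * m + m * n
    macs = d * d * params
    return params, macs
```

For K = 3, M = 64 and N = 128, the weight count falls from 73,728 to 8,768, a ratio of (K² + N)/(K²N), which approaches 1/K² as N grows.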
Ba et al. combined low-rank factorisation with knowledge distillation, where a deep and complex neural network is mimicked with a simpler, shallower one [11]. More detail on knowledge distillation is given in Section 5.5. The authors noticed that learning is very slow for the weight matrices of shallow networks. Since there are many highly correlated parameters, gradient descent converges slowly, with the majority of training time spent on matrix-vector multiplication. They suggested that forward and backward propagation could be sped up by approximating each large weight matrix as the product of two low-rank matrices. Increases in convergence rate of the network mimicking and reductions in memory space complexity were observed. Lebedev et al. presented a CP decomposition-based retraining method facilitating greater workload reductions, achieving a 4.5× throughput boost with ∼1 pp of top-five ImageNet accuracy loss for layer two of AlexNet [73].
Following the logic that learnt weight matrices tend to be structured and can be decomposed using low-rank factorisation, Denil et al. suggested the storage of only parts of weight matrices, predicting the remainder using a second learning model [35]. They reported that, in the best case (with small-scale datasets), more than 95% of weights can be predicted without accuracy loss. The networks used therein were nonstandard.
Rather than compressing layers individually, Kim et al. performed "one-shot" whole-network compression using Tucker decomposition. Here, the post-decomposition ranks of all layers are determined all at once through global Bayesian matrix factorisation. Experiments showed that, while this method requires at least 10 retraining epochs for accuracy recovery, the inference of AlexNet on an Nvidia Titan X GPU achieved 1.8× speedup, with 1.7 pp of top-five ImageNet accuracy loss, against an FP32 baseline on the same platform.
5.3.2 Hardware Implementation. Low-rank factorisation methods produce structured DNN models on which inference can be performed efficiently on GPPs with dense matrix-vector BLAS. Li et al. presented a CNN compression framework combining coarse-grained pruning using sparsification with low-rank factorisation [77]. Similar to the idea proposed by Jaderberg et al. [63], the authors represented filters as linear combinations of lower-rank basis filters. GPU experiments with AlexNet, GoogleNet and VGGNet-A revealed about-2× throughput speedups without accuracy loss for ImageNet.
Custom hardware implementations, however, can achieve comparable performance with lower power envelopes. Rizakis et al. implemented SVD-factorised gates for LSTMs [115]. In their proposal, SVD is performed on the weights of the four LSTM gates independently. For each gate, the weights associated with both the current input and previous output are concatenated together to form a large weight matrix, which is then SVD-factorised. Pruning is also performed by retaining only rows with a majority of non-zeroes in each weight matrix. The authors implemented their design on an FPGA platform, achieving a 6.5× throughput increase for an arbitrarily chosen LSTM compared with an uncompressed FPGA-based LSTM baseline.
5.4 Structured Matrices
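5.4.1 Algorithmic Development. A weight matrix can be represented as a structure of repeated patterns such that it can be expressed with fewer parameters. The use of circulant matrices for representing weight matrices W in CNNs and RNNs has proven to be a very popular proposal [26, 27, 93, 131, 146]. A circulant matrix $W_{\text{circ}}$ of size K is square, with all rows being a shifted version of the first, $w_{0*}$, thereby reducing spatial complexity from $O(K^2)$ to $O(K)$. It is constructed as such:
$$W_{\text{circ}} = \begin{pmatrix} w_0 & w_{K-1} & \cdots & w_1 \\ w_1 & w_0 & \cdots & w_2 \\ \vdots & \vdots & \ddots & \vdots \\ w_{K-1} & w_{K-2} & \cdots & w_0 \end{pmatrix}.$$
The multiplication of $W_{\text{circ}}$ by input vector x can thus be computed using a fast Fourier transform (FFT) of the first row of $W_{\text{circ}}$, reducing inference time complexity from $O(K^2)$ to $O(K \log K)$, as
$$W_{\text{circ}}\, x = \text{ifft}\left(\overline{\text{fft}(w_{0*})} \odot \text{fft}(x)\right),$$
in which $\odot$ is element-wise multiplication and the overline denotes complex conjugation (the weights being real).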
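The FFT-based product can be checked numerically with NumPy. The sketch below fixes one convention as an assumption: row i of the matrix is the first row cyclically shifted right by i positions, and the weights are real, in which case the first row's spectrum enters through its complex conjugate. The function names are hypothetical.

```python
import numpy as np


def circulant_from_first_row(first_row):
    """Build the K x K matrix whose row i is the first row shifted right by i."""
    return np.stack([np.roll(first_row, i) for i in range(len(first_row))])


def circulant_matvec_fft(first_row, x):
    """Compute W_circ @ x in O(K log K) using FFTs of the first row and of x."""
    spectrum = np.conj(np.fft.fft(first_row)) * np.fft.fft(x)
    return np.real(np.fft.ifft(spectrum))
```

Only the K values of the first row need to be stored, and the dense K × K matrix is never formed during inference; it is built here solely to validate the fast path.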
While the circulant matrix method has shown outstanding memory and computational complexity reductions, its application also introduces accuracy degradation. For example, the AlexNet implementation of a circulant matrix-based framework, CirCNN, achieved compression of 40× with 16-bit fixed-point quantisation, yet its use also resulted in 2.2 pp of ImageNet accuracy degradation against an FP32 baseline [37]. An alternative transformation, the Adaptive Fastfood transform (AFT), achieved a compression ratio of 3.7, but only about 0.1 pp of accuracy loss with ImageNet, for AlexNet [154]. In an AFT, a weight matrix W is approximated as
$$W \approx SHG\Pi HB,$$
in which S, G and B are trainable diagonal matrices, H a Hadamard matrix and $\Pi \in \{0, 1\}^{K \times K}$ a trainable permutation matrix. This and the circulant method have equal complexities.
For both of the aforementioned structures, generality is not guaranteed when dealing with classification tasks of varying scales. Sindhwani et al. proposed structured transformations characterised by the notion of a displacement rank parameter [131]. With different displacement ranks, a continuum is exposed from fully structured to completely unstructured. With displacement rank less than or equal to two, weight matrices become Toeplitz matrices, which have the form
$$W_{\text{Top}} = \begin{pmatrix} w_0 & w_{-1} & \cdots & w_{-(K-1)} \\ w_1 & w_0 & \cdots & w_{-(K-2)} \\ \vdots & \vdots & \ddots & \vdots \\ w_{K-1} & w_{K-2} & \cdots & w_0 \end{pmatrix}.$$
Different to a circulant matrix, a Toeplitz matrix $W_{\text{Top}}$ of size K has element values $w_{-(K-1)}$ to $w_{K-1}$. Matrix-vector multiplications can still take advantage of FFTs by embedding Toeplitz matrices into larger circulant matrices, as in
$$W_{\text{circ,Top}} = \begin{pmatrix} W_{\text{Top}} & V \\ V & W_{\text{Top}} \end{pmatrix}, \qquad V = \begin{pmatrix} 0 & w_{K-1} & \cdots & w_1 \\ w_{-(K-1)} & 0 & \cdots & w_2 \\ \vdots & \vdots & \ddots & \vdots \\ w_{-1} & w_{-2} & \cdots & 0 \end{pmatrix},$$
and exploiting the relationship
$$W_{\text{Top}}\, x = \begin{pmatrix} I & 0 \end{pmatrix} W_{\text{circ,Top}} \begin{pmatrix} x \\ 0 \end{pmatrix},$$
wherein I and 0 are identity and zero matrices, respectively [45]. A family of Toeplitz-like matrices can be generated by increasing rank beyond two. With rank K, a matrix becomes unstructured and uncompressed. Lu et al. applied Toeplitz-like matrices in LSTMs, with weight matrices of gates trained in Toeplitz-like structures of various ranks [93]. The authors compressed the first two layers of an unidentified five-layer LSTM into structures of rank five, achieving a compression ratio of around 1.7 with ∼0.3 pp loss in speech recognition accuracy for a dataset consisting of some 300 hours of English utterances.
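The circulant embedding of a Toeplitz matrix can likewise be verified numerically. In this sketch, which uses hypothetical function names, the Toeplitz matrix is assumed to be indexed so that entry (i, j) holds $w_{i-j}$; it is specified by its first column ($w_0, \dots, w_{K-1}$) and first row ($w_0, w_{-1}, \dots, w_{-(K-1)}$), and the single padding entry of the 2K-element circulant is set to zero.

```python
import numpy as np


def toeplitz_dense(first_col, first_row):
    """Dense K x K Toeplitz matrix with T[i, j] = w_(i-j)."""
    k = len(first_col)
    return np.array([[first_col[i - j] if i >= j else first_row[j - i]
                      for j in range(k)] for i in range(k)])


def toeplitz_matvec_fft(first_col, first_row, x):
    """Compute T @ x by embedding T in a 2K x 2K circulant matrix."""
    k = len(x)
    # First column of the circulant: w_0..w_(K-1), one zero, then w_-(K-1)..w_-1.
    c = np.concatenate([first_col, [0.0], first_row[:0:-1]])
    x_padded = np.concatenate([x, np.zeros(k)])
    y = np.fft.ifft(np.fft.fft(c) * np.fft.fft(x_padded))
    return np.real(y[:k])              # keep the first K entries of the product
```

The cost is that of FFTs of length 2K, so the asymptotic complexity remains O(K log K) while 2K − 1 parameters are stored instead of K².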
While the authors of the works mentioned above reported that circulant matrix-based methods incurred accuracy drops of at least 2 pp for large-scale CNN image classification, their accuracies for RNN tasks are significantly superior. Wang et al. implemented circulant matrices together with non-linear function approximation and quantisation for LSTMs [146]. Language modelling and speech recognition were performed by their 90 nm ASIC, achieving more than 20× compression with a 2.8 pp loss in accuracy for classification of the AN4 speech database.
C-LSTM features block-circulant matrices, each of which consists of circulant submatrices of arbitrary size [145]. Tunable block size facilitates a tradeoff between storage requirements and accuracy. Experiments with the Google LSTM architecture revealed a linear relationship between block size and compression ratio, as well as a clear tradeoff between block size and TIMIT phone error rate (PER) increase. For an LSTM model with block size of eight on the TIMIT dataset, C-LSTM exhibited 7.6× compression and a 2.6× workload reduction while incurring a 0.32 pp PER rise.
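The linear relationship between block size and compression can be seen from a simple parameter count. The sketch below is illustrative only: the function names are hypothetical, each block is assumed to be defined by its first column, and the product is computed directly rather than with FFTs for brevity.

```python
def block_circulant_matvec(blocks, x):
    """blocks[p][q] is the length-b defining vector (first column) of block (p, q)."""
    b = len(blocks[0][0])
    y = [0.0] * (len(blocks) * b)
    for p, block_row in enumerate(blocks):
        for q, vec in enumerate(block_row):
            for i in range(b):
                for j in range(b):
                    # Entry (i, j) of a circulant block is vec[(i - j) mod b].
                    y[p * b + i] += vec[(i - j) % b] * x[q * b + j]
    return y

def compression_ratio(rows, cols, b):
    """Dense parameter count divided by the number of stored block values."""
    dense = rows * cols
    stored = (rows // b) * (cols // b) * b
    return dense / stored
```

Under this idealised count, which ignores any parameters left uncompressed, the ratio equals the block size: `compression_ratio(1024, 512, 8)` gives 8.0.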
Hardware Implementation. Convolutions on GPPs are normally performed after unrolling: flattening four-dimensional inputs and kernels into two-dimensional matrices. This converts four-dimensional tensor operations into two-dimensional matrix multiplications, trading off memory use for performance. For block-circulant matrix methods, since each two-dimensional slice of a kernel is circulant, the two-dimensional unrolled version of that kernel is also block-circulant. Time complexity reductions from the FFT-based method for block-circulant matrix inference are hence achievable for DNN inference performed on both GPPs and in custom hardware. Despite this, custom hardware implementations still excel in terms of energy efficiency [37].
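The unrolling step described above is often called im2col. A minimal sketch follows, assuming a single input of shape [C][H][W], unit stride and no padding; the function names are hypothetical and nested lists stand in for tensors.

```python
def im2col(x, k):
    """Unroll x of shape [C][H][W] into a (C*k*k) x (out_h*out_w) matrix."""
    c, h, w = len(x), len(x[0]), len(x[0][0])
    oh, ow = h - k + 1, w - k + 1
    cols = []
    for ch in range(c):
        for di in range(k):
            for dj in range(k):
                cols.append([x[ch][i + di][j + dj]
                             for i in range(oh) for j in range(ow)])
    return cols, oh, ow

def conv_unrolled(x, kernels):
    """kernels has shape [N][C][k][k]; the convolution becomes one matrix product."""
    k = len(kernels[0][0])
    cols, oh, ow = im2col(x, k)
    # Flatten each kernel in the same (channel, row, column) order as im2col.
    flat = [[v for ch in ker for row in ch for v in row] for ker in kernels]
    prod = [[sum(f[r] * cols[r][p] for r in range(len(f)))
             for p in range(oh * ow)] for f in flat]
    return [[row[i * ow:(i + 1) * ow] for i in range(oh)] for row in prod]

def conv_direct(x, kernels):
    """Reference sliding-window convolution for comparison."""
    c, k = len(x), len(kernels[0][0])
    oh, ow = len(x[0]) - k + 1, len(x[0][0]) - k + 1
    return [[[sum(ker[ch][di][dj] * x[ch][i + di][j + dj]
                  for ch in range(c) for di in range(k) for dj in range(k))
              for j in range(ow)] for i in range(oh)] for ker in kernels]
```

The unrolled matrix repeats each input pixel up to k² times, which is the memory cost traded for a regular matrix multiplication.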
Combined with 16-bit fixed-point quantisation, FPGA-based C-LSTM [145] achieved a 10× throughput speedup and 34× energy efficiency improvement over ESE, the prior state of the art, for the Google LSTM. Ding et al. presented implementations using similar methods for CNNs and RNNs on both FPGAs and ASICs [38]. Using Intel Cyclone V FPGAs, the authors achieved at least 150× and 72× improvements in performance and energy efficiency, respectively, over IBM TrueNorth implementations [2] of some unidentified networks. For Xilinx Kintex UltraScale FPGA LSTM implementation, the proposed architecture achieved up to 21× and 34× improvements in throughput and energy efficiency, respectively, over ESE for the Google LSTM [52]. The authors also experimented with a LeNet-5 ASIC implementation, achieving a throughput of 1. [146]. They adopted a hybrid strategy in their work, also exploiting fixed-point quantisation and activation function approximation. With a 520 KiB on-chip memory allocation, the authors were able to process a 512 × 512 compressed layer of an arbitrarily chosen LSTM in 1.7 µs: equivalent to 580 kcl/s. Fox et al. implemented AFTs for accelerating matrix-vector multiplication on FPGAs [41]. Although their work was not presented in the context of DNN inference, its results on matrix-vector multiplication are still relevant. The authors concluded that the AFT's small memory complexity allows for the processing of input matrices some 1000× larger than previous online kernel methods with the same area occupancy.
Knowledge Distillation
5.5.1 Algorithmic Development. Knowledge distillation mimics large, complex DNNs using simpler and shallower networks in order to achieve network compression. In one of the earliest works in this field, Hinton et al. suggested that knowledge could be distilled from an ensemble of models (teachers) into a simple model (student) by training the student model with outputs from the teachers [59]. Ba et al. provided empirical evidence showing that, in simple machine learning tasks, a student network can mimic teacher networks with comparable performance [11]. In FITNet, intermediate outputs of these teacher models are used as "hints" for training the student model to improve its accuracy [116]. Experiments with the CIFAR-10 dataset showed that a FITNet trained from an unidentified 9M-parameter teacher CNN could achieve 10× compression and a 1.4 pp accuracy improvement vs the teacher network. The authors explained that a reduction in network complexity from teacher to student led to less overfitting, causing the accuracy increase.
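Training a student on teacher outputs is commonly expressed as a loss over softened class probabilities. The sketch below follows that common formulation and is not taken from the text: the temperature, the weighting factor `alpha` and all function names are assumptions made for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over logits divided by a temperature (higher values soften the output)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(target, predicted):
    """Cross-entropy between a target distribution and a predicted distribution."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)

def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.7):
    """Weighted sum of a soft (teacher-matching) term and a hard (label) term."""
    soft_teacher = softmax(teacher_logits, temperature)
    soft_student = softmax(student_logits, temperature)
    soft_term = cross_entropy(soft_teacher, soft_student)
    hard = [1.0 if i == label else 0.0 for i in range(len(student_logits))]
    hard_term = cross_entropy(hard, softmax(student_logits))
    # The T^2 factor keeps the soft term's gradient scale comparable across temperatures.
    return alpha * temperature ** 2 * soft_term + (1 - alpha) * hard_term
```

A student whose logits track the teacher's receives a lower loss than one that disagrees, even when both are evaluated against the same label.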
Chen et al. proposed various optimisations for improving the performance of network mimicking [20]. Unlike carefully selected image classification datasets with uniform class distributions, object detection problems need to deal with dominant background classes. Class-weighted cross-entropy can be introduced to handle such scenarios, wherein a background class is assigned an appropriate scaling factor to correct for class imbalances. When teacher overfitting occurs, hints from a teacher network may "mislead" a student into even more severe overfitting. In an effort to avoid this, Chen et al. used their teacher network's original regression curve as an upper bound for student network training. Experiments with the PASCAL, KITTI and COCO datasets showed that these optimisations improved accuracies by 3-5 pp.
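Class-weighted cross-entropy can be illustrated with a short sketch. The function name, the normalisation by total weight and the example weights are assumptions for illustration rather than the formulation used in [20].

```python
import math

def weighted_cross_entropy(probs, labels, class_weights):
    """Mean cross-entropy in which each sample is scaled by its class weight."""
    total = sum(-class_weights[y] * math.log(p[y]) for p, y in zip(probs, labels))
    norm = sum(class_weights[y] for y in labels)
    return total / norm

# Class 0 is the dominant background class; class 1 is a foreground object.
probs = [[0.9, 0.1], [0.9, 0.1], [0.9, 0.1], [0.5, 0.5]]
labels = [0, 0, 0, 1]
unweighted = weighted_cross_entropy(probs, labels, {0: 1.0, 1: 1.0})
reweighted = weighted_cross_entropy(probs, labels, {0: 0.25, 1: 1.0})
```

Scaling down the background class makes the poorly classified foreground sample contribute a larger share of the loss, so `reweighted` exceeds `unweighted` here.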
Alemdar et al. introduced a framework for knowledge distillation in which ternary student networks were trained from a ternarised teacher [5]. During ternarisation, two thresholds for each weight index $i$, $\delta_i^-$ and $\delta_i^+$, are used to differentiate quantisation levels $w_i^-$ and $w_i^+$ from zero. The authors suggested that the use of well selected thresholds should result in outputs from the student network perfectly matching those of the teacher network. A greedy method was proposed to search for thresholds by minimising the difference between the probability distribution functions of layer-wise outputs from the student and teacher networks. Experiments with MNIST, CIFAR-10 and SVHN showed that this work achieved higher accuracies than IBM TrueNorth classifying the same datasets on VGG-like models with ternary data [2].
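The two-threshold ternarisation rule can be written down directly. The sketch below assumes per-index lists for the thresholds and quantisation levels; the function name is hypothetical, and the greedy threshold search itself is not shown.

```python
def ternarise(weights, delta_neg, delta_pos, w_neg, w_pos):
    """Map each weight to w_neg[i], 0 or w_pos[i] using its two thresholds."""
    out = []
    for i, w in enumerate(weights):
        if w > delta_pos[i]:
            out.append(w_pos[i])
        elif w < delta_neg[i]:
            out.append(w_neg[i])
        else:
            out.append(0.0)  # values between the thresholds collapse to zero
    return out
```

For example, with thresholds of ±0.2 and levels of ±1 at every index, the weights [0.8, -0.05, -0.6, 0.1] map to [1.0, 0.0, -1.0, 0.0].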
Hardware Implementation. Knowledge distillation essentially converts deep DNNs into shallow ones, which, from a hardware perspective, allows the replacement of deep, sequential processing with parallel, distributed processing. This structural conversion greatly facilitates the acceleration of DNN training and inference using GPPs. Ba et al. even observed that some shallow, mimicked models reached similar accuracies for TIMIT to deep models about 8× more quickly [11].
While some acceleration can be achieved with knowledge distillation on GPPs, further benefit can be realised given the flexibility of custom hardware by taking advantage of additional approximation. Alemdar et al. presented a hardware mapping framework in which student networks trained through network mimicking are translated into hardware descriptions for FPGA or ASIC implementation [5]. Their Xilinx Virtex-7 FPGA prototype achieved a throughput improvement of over 30× and comparable energy efficiency vs IBM TrueNorth [2] executing VGG-like models. A 28 nm ASIC implementation was also presented and compared against a state-of-the-art ASIC implementation, EIE [53]. While their ASIC did not beat EIE in terms of throughput, it did achieve 1.2× energy efficiency and 2.9× area occupancy improvements for an unidentified network model.
INPUT-DEPENDENT COMPUTATION

Algorithmic Development
Different regions of a DNN's input data may have differing levels of contribution to its output. Input-dependent computation exploits this observation by assigning compute proportionally to the input data's relative importance. Stochastic Times Smooth units mask CNN input frames with a pretrained binary decision matrix to facilitate conditional computation, which was shown to give 10× compression for a nonstandard CNN classifying the MNIST dataset with a 0.2 pp accuracy improvement [14]. Karpathy et al. allocated more resources to the centres of CNN input frames for improved video classification accuracy [68]. Their implementation consists of two CNNs in parallel, with a "context stream" CNN processing entire frames and a "fovea stream" CNN processing only the centre of each. The authors reported a 65% prediction accuracy on the UCF-101 video prediction dataset: state-of-the-art performance at the time of publication.
Low-rank approximation has not just been studied in the parameter space; it has been used for input compression as well. In Deep3 [118] and DeLight [117], input data matrices are factorised into lower-rank matrices using an "embedding matrix." These are iteratively updated to reduce the Frobenius norm of factorisation errors. Experiments with Deep3 on GPUs on various deep learning tasks, including audio classification, demonstrated up to 11× inference speedups compared to a TensorFlow baseline running the same models [118].
While the aforementioned static computation allocation schemes can achieve significant resource savings and/or accuracy improvements, recent research, such as Dynamic Capacity Networks, has introduced dynamic input-dependent allocation, guided at runtime by additional pretrained subnetworks [6]. In Bengio et al. [13] and Liu et al.'s [89] proposals, and Runtime Neural Pruning (RNP) [85], partial execution of DNNs is performed using pretrained Markov decision process reinforcement learning. RNP was shown to achieve a 10× workload reduction and 5.9× latency reduction for VGG16 with ImageNet in return for a 4.9 pp drop in top-five accuracy. Runtime methods achieve superior accuracy to their static counterparts at the expense of an extra network.

The works discussed above all targeted CNNs, for which computation is dependent upon the spatial features of their inputs. The authors of DeltaRNN, on the other hand, reduced RNN workload based on inputs' temporal behaviour [43]. DeltaRNN updates the output of an RNN only when its input changes by more than some threshold. The authors reported 9.8× throughput and 130× efficiency improvements for an arbitrarily chosen network, with a 1.5 pp accuracy drop, against their baseline classifying the TIDIGITS dataset.
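Threshold-gated updating can be sketched for a single matrix-vector product. The per-element formulation, the class name and the update counter below are assumptions made for illustration and are not taken from the DeltaRNN paper.

```python
class DeltaMatVec:
    """Maintains y = W @ x_hat, touching only columns whose input changed enough."""

    def __init__(self, weights, threshold):
        self.w = weights                       # list of rows
        self.theta = threshold
        self.x_hat = [0.0] * len(weights[0])   # last input values that were propagated
        self.y = [0.0] * len(weights)          # running product W @ x_hat
        self.col_updates = 0                   # number of column operations performed

    def step(self, x):
        for j, xj in enumerate(x):
            delta = xj - self.x_hat[j]
            if abs(delta) > self.theta:        # small changes are skipped entirely
                for i, row in enumerate(self.w):
                    self.y[i] += row[j] * delta
                self.x_hat[j] = xj
                self.col_updates += 1
        return list(self.y)
```

With a slowly varying input, most columns are skipped at each time step, which is where the workload reduction comes from; the skipped changes are the source of the accuracy loss.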
Hardware Implementation
Since input-dependent computation involves frequent dynamic branching during inference, these implementations are not likely to pipeline efficiently, especially for deep CNNs. Hence, for CNN implementations exploiting this method, throughput is not their greatest advantage. They are instead focussed more on latency-critical applications, which generally do not require high throughput. Custom hardware, unlike GPPs, allows for specially designed dynamic branching mechanisms which can process fine-grained, irregular data patterns more efficiently. The authors of CascadeCNN presented the input-dependent computation of CNN inference on FPGAs [71]. Similar to Dynamic Capacity Networks [6], CascadeCNN features a high-precision subnetwork in addition to a low-precision main network. The former is activated when there is a potential misclassification in the latter, i.e. when the confidence of the main network's best guess is low. Experiments showed that CascadeCNN achieved latency reductions of up to 55% for VGG-16 and 48% for AlexNet over the baseline design for the same resource budget and accuracy for ImageNet. The FPGA implementation of DeltaRNN on an LSTM requiring 5.6 Gop/cl demonstrated reduced off-chip memory bandwidth, achieving a throughput of 220 cl/s and an energy efficiency of 29 cl/J: state-of-the-art performance for RNN inference at the time [43].
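The confidence-gated cascade can be sketched as follows. This is a minimal illustration: both networks are passed in as callables returning class probabilities, and the use of the top-class probability against a fixed threshold is an assumption, since the text does not state how confidence is measured.

```python
def cascade_classify(x, low_precision_net, high_precision_net,
                     confidence_threshold=0.8):
    """Return (predicted class, which network produced the answer)."""
    probs = low_precision_net(x)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] >= confidence_threshold:
        return best, "low"           # confident: the cheap network's answer stands
    probs = high_precision_net(x)    # otherwise re-evaluate at high precision
    return max(range(len(probs)), key=probs.__getitem__), "high"
```

Latency falls when most inputs exit through the low-precision path, while inputs that fall below the threshold pay for both evaluations.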
ACTIVATION FUNCTION APPROXIMATION
Algorithmic Development
With non-linear functions such as sigmoid and tanh, computations including exponentiation and division are expensive to perform. Piecewise Linear Approximation of Non-linear Functions (PLAN) simplifies such functions into series of table lookups [8]. In turn, this leads to the quantisation of activations in subsequent layers, reducing both memory requirements and numbers of arithmetic operations to perform. PLAN appears more often in RNN implementations than CNNs; mainstream CNNs use ReLU as the activation function, which can be cheaply implemented by comparing outputs with zero. In RNNs, on the other hand, empirical analysis suggests that sigmoid and tanh provide better performance, whereas ReLU not only performs poorly but also diverges frequently, partly because it is positively unbounded [39].
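A piecewise-linear sigmoid can be sketched as below. The breakpoints and slopes are the commonly quoted PLAN values, reproduced here as an assumption rather than taken from the text; the tanh variant relies on the identity tanh(x) = 2·sigmoid(2x) − 1, and both function names are hypothetical.

```python
import math

def plan_sigmoid(x):
    """Piecewise-linear sigmoid; every slope is a power of two (shift-friendly)."""
    a = abs(x)
    if a >= 5.0:
        y = 1.0
    elif a >= 2.375:
        y = 0.03125 * a + 0.84375
    elif a >= 1.0:
        y = 0.125 * a + 0.625
    else:
        y = 0.25 * a + 0.5
    # Negative inputs use the symmetry sigmoid(-x) = 1 - sigmoid(x).
    return y if x >= 0 else 1.0 - y

def plan_tanh(x):
    """tanh built from the approximate sigmoid via tanh(x) = 2*sigmoid(2x) - 1."""
    return 2.0 * plan_sigmoid(2.0 * x) - 1.0
```

Because the slopes are powers of two, each segment needs only a comparison, a shift and an addition in fixed-point hardware.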
Hardware Implementation
PLAN can be efficiently implemented in custom hardware. Guan et al. implemented PLAN within an FPGA-based inference framework for unidentified LSTMs, and their experiments showed that its use introduced only 0.63 pp of TIMIT accuracy degradation [46]. Li et al. [82] and the authors of ESE [52], C-LSTM [145] and DeltaRNN [43] implemented arbitrarily chosen RNNs on FPGAs with PLAN, reporting increases in throughput with negligible accuracy losses for the same dataset.

[Fig. 3. Axes: compression vs baseline; top-one error rate (%). (a) Quantisation methods: baseline, eight-bit fixed point, logarithmic, ternary and binary. Labelled points: WAGE [150], TWN [78], TTQ [162], FGQ [95], Binary-Net [30], XNOR-Net [112], HWGQ [16]. (b) Weight-reduction methods: baseline, hybrid, weight sharing, pruning, structured matrix and factorisation. Labelled points: CirCNN [37], Quantised CNN [149], Structured Brain Damage [74], Less is More [160], Data Free [132], Network Pruning [55], DNS [50], Fastfood-16-AD [154], SVD CP [138].]
TRADEOFFS AND CURRENT TRENDS
Thus far, we have detailed DNN approximation techniques and their hardware implementations on different platforms. Performance evaluations were made against benchmarks and baseline implementations of their authors' choosing, which are inconsistent and often not particularly useful when attempting to perform comparisons. We now quantitatively evaluate the hardware and software performance of those works using common DNN models and datasets as benchmarks. By doing so, we analyse the compression-accuracy tradeoffs of the approximation techniques and their design-space exploration for custom hardware, from which we explain current research trends.

Compression vs Accuracy

Fig. 3a compares the compression-accuracy behaviour of key quantisation methods introduced in Section 4 for ImageNet on AlexNet, indicating a clear relationship between precision and error rate. Among the methods, binary networks exhibit greater accuracy degradations (≥ 4.5 pp) than the remainder (< 3.0 pp), while also achieving the greatest compression ratios: 32 vs an FP32 baseline. The parameters of trained DNNs usually have Gaussian-like distributions, wherein the majority of data have near-zero values. For this reason, binary networks exhibit high quantisation error for values with small magnitudes because they are unable to represent zeroes. Compared to binarisation, ternarisation generally results in better accuracy, with compression ratios of 16. Among all methods compared, TTQ has the highest accuracy at a reasonably high compression ratio, suggesting that the ability to represent zeroes has significant implications for network performance [162]. INQ reached a similar level of accuracy to TTQ, but with a lower compression ratio (6.4) [159]. The accuracy of INQ is higher than that of fixed-point-quantised networks with similar precisions, supporting the conclusion by Lai et al. that it is weights' representation range, rather than precision, that is crucial to the preservation of accuracy [72].

Fig. 3b facilitates comparison of the compression-accuracy tradeoffs, also for ImageNet on AlexNet, of the key weight-reduction methods introduced in Section 5. It shows that the reported compression ratios for weight sharing, e.g. Quantised CNN [149], and structured matrices, e.g. CirCNN [37], are higher than the alternatives. This observation supports the theoretical analysis in Sections 5.2 and 5.4 that these methods have good memory complexity reduction capabilities.
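The compression ratios quoted for the quantisation methods of Fig. 3a follow from bit widths alone. A minimal sketch, assuming that only weight storage is counted and that every weight uses the same width; the function name is hypothetical.

```python
def compression_vs_fp32(bits_per_weight):
    """Storage compression relative to 32-bit floating-point weights."""
    return 32 / bits_per_weight

binary = compression_vs_fp32(1)     # one bit per weight
ternary = compression_vs_fp32(2)    # two bits encode three levels
five_bit = compression_vs_fp32(5)
```

These evaluate to 32, 16 and 6.4, respectively. Under this simple model, the ratio of 6.4 reported for INQ would correspond to five-bit weights, an inference from the figure rather than a statement made in the text.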
Structured matrix methods induce significant accuracy degradation in CNNs [37], but not so much in LSTMs [37, 145]. This phenomenon is not yet well understood.
Pruning-based methods also achieve good accuracies at high compression ratios. Among them, fine-grained methods (DNS [50] and Network Pruning [55]) show more promising tradeoffs than coarse-grained alternatives (Structured Brain Damage [74] and Less is More [160]). This suggests that higher pruning granularities, despite inducing significant irregularity, possess greater potential for network compression and memory transfer reductions.
Deep Compression exhibited both outstanding accuracy and compression [54]. As a hybrid strategy, multiple quantisation methods work together to provide high compression.
We can conclude that (re)training has proven to be effective in compensating for accuracy losses incurred due to approximation [55, 122]. The authors of methods exploiting binarisation, ternarisation, structured matrices, low-rank factorisation and knowledge distillation trained their networks from scratch, while the remaining methods, apart from Data Free [132], use post-approximation retraining. Although Data Free featured pruning of similar neurons without the employment of retraining, retraining was used for all of the other implementations in Fig. 3, suggesting that it has become a standard accuracy-recovery approach in state-of-the-art proposals.

Design-space Exploration

Table 1 shows how each approximation method contributes to DNN inference acceleration in custom hardware. Increases in parallelism and reductions in model memory use increase compute bounds and arithmetic intensities, respectively, which, in turn, increase throughput.
Quantisation-based methods allow for increased parallelism through the use of cheaper arithmetic units. They also facilitate memory transfer reductions. With extremely low-precision quantisation, it becomes feasible to fix parameters in hardware such that weights do not need to be stored in, or fetched from, off-chip memory. Weight-reduction methods reduce numbers of parameters, saving memory while simultaneously decreasing workload. Weight sharing is slightly different from the other weight-reduction methods because it does not necessarily cause a reduction in workload. The number of operations to be performed per classification can be reduced if results are precomputed and stored on-chip, such as in PQ-CNN, however [156]. Unlike weight-reduction methods, input-dependent methods reduce workload without decreasing memory occupancy. Through precomputation, activation function approximation only reduces workload. Hybrid strategies have been commonly adopted recently; these can benefit from all three factors, achieving greater performance than could be realised through the use of any single method.

Table 2 details the performance of state-of-the-art FPGA-based DNN inference engines targeting the CIFAR-10 (CNN), ImageNet (CNN) and TIMIT (RNN) datasets. Implementations are ordered according to power consumption, thus platforms of similar scales are adjacent. While categorised with respect to their target datasets, frameworks accelerating the inference of the same dataset may have been benchmarked using different DNN models and hence with dissimilar workloads. Some works did not report full-network workload information, making it impossible for us to quantify their throughputs. We thus detail arithmetic performance, which captures raw computational speed, as well.
Throughput.
In general, custom hardware implementations exhibit throughput up to orders of magnitude higher than GPP equivalents of similar scales, corresponding to the conclusions drawn in Section 3. Among the custom hardware implementations, the throughput of ASIC platforms is higher than other works with similar power consumption, largely due to their higher clock frequencies.
By comparing Wang et al. [144] and Zhao et al.'s [158] CIFAR-10-targeting CNN implementations with the Going Deeper [111], fpgaConvNet [142] and FP-BNN [83] ImageNet CNNs, all of which used FPGAs of similar scales, we can observe that, as precision is reduced, linear or even superlinear throughput increases can be achieved. Superlinear increases can be explained using the roofline modelling in Section 3. With quantisation on FPGAs, the use of cheaper fixed-point processing units allows for increased parallel-computing capability via area savings, in turn leading to increases in compute bounds. Arithmetic intensity can also be increased as model size decreases due to the opportunities presented by on-chip caching. The combined effect of these factors allows inference throughput to increase linearly if the baseline is memory bound, or superlinearly if compute bound. The accuracy-throughput tradeoff exposed through quantisation makes it possible for embedded-scale custom hardware implementations to beat even high-end GPPs in terms of inference throughput. This is evident throughout Table 2, in which the performance of schemes employing binarisation on custom hardware can be seen to have achieved either superior or comparable throughput to that of popular high-performance GPPs.
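The roofline reasoning used above reduces to taking the lower of two bounds. A minimal sketch with made-up platform figures follows; the function name and all numbers are hypothetical and chosen only to show how raising both bounds moves the attainable performance.

```python
def roofline(peak_gops, bandwidth_gbytes_per_s, intensity_op_per_byte):
    """Attainable arithmetic performance (Gop/s): the lower of the two bounds."""
    memory_bound = bandwidth_gbytes_per_s * intensity_op_per_byte
    return min(peak_gops, memory_bound)

# Hypothetical baseline: 1000 Gop/s peak, 10 GB/s off-chip, 50 op/B intensity.
baseline = roofline(1000, 10, 50)
# Hypothetical quantised design: cheaper units raise the compute bound fourfold,
# and smaller operands raise arithmetic intensity fourfold.
quantised = roofline(4000, 10, 200)
```

Here the baseline is limited by memory traffic to 500 Gop/s and the quantised design to 2000 Gop/s; lifting only one of the two bounds would have left part of the gain unrealised.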
EIE [53] and Li et al.'s work [81] used pruning with fixed-point quantisation in ASICs and FPGAs, respectively, for CNN weight reduction. Comparing these against other works listed that used the same platform but without pruning, NeuFlow in ASICs [40] and Going Deeper in FPGAs [111], significantly superior arithmetic performance was obtained. This supports the other conclusion drawn from the roofline modelling in Section 3: with network compression, operational intensity increases due to reduced off-chip memory traffic, facilitating speedups. EIE, using fine-grained pruning with runtime zero-skipping, achieved a 19× improvement in arithmetic performance over NeuFlow, whereas Li et al.'s work, using coarse-grained pruning, achieved only 2× improvement over Going Deeper. This seems to support the conclusion in Sections 3 and 8.1 that fine-grained pruning results in more workload reduction than coarse-grained, and that custom hardware allows for the design of efficient mechanisms to convert these reductions into speedups.
As mentioned in Section 5.4, circulant matrix-based methods do not work well with CNNs due to their significant accuracy losses, yet they provide exceptionally good accuracy and compression for RNNs. This is reflected in Table 2, in which it is shown that C-LSTM exhibited 47× and 390× gains in throughput and efficiency, respectively, compared to a GPU implementation [145]. Among all RNN implementations listed, those that employed block-circulant matrices or input-dependent computation achieved superior throughputs and efficiencies vs the remainder since the use of these methods resulted in the greatest workload reductions. Almost all of the listed RNN FPGA frameworks made use of hybrid strategies, featuring processing elements tailored to low-precision computation along with weight reduction, achieving significant throughput improvements compared to GPU alternatives.

fixed-point quantisation for ImageNet classification [94]. Tradeoffs between resource consumption and throughput were systematically analysed, with high performance achieved by balancing memory traffic and computation. The authors reported throughput of 21 cl/s and latency of 48 ms, both of which represent 4.7× improvements over the previous state of the art, Going Deeper [111]. The earliest version of fpgaConvNet was throughput-oriented [142]. The authors later extended their design-space exploration tool to optimise for latency in addition to throughput, demonstrating outstanding latency-critical application performance vs alternative embedded implementations [143]. Zhang et al. also presented an FPGA-based RNN/CNN inference framework, providing highly configurable layer templates and a design-space exploration engine for resource allocation management facilitating design optimisation for resource-constrained latency minimisation [157].
Hardware implementations of input-dependent computation methods have an intrinsic emphasis on latency. Due to their conditional computation nature, pipeline stalls happen frequently, reducing throughput. This is not a problem for latency-driven applications, however, in which the inference batch size is normally one. Implementations based on input-dependent methods, e.g. CascadeCNN [71], are able to achieve significant latency reductions.

Energy Efficiency. Table 2 also facilitates the energy efficiency comparison of DNN inference implementations. Given a constant power budget, higher throughput translates to higher energy efficiency. Thus, approximation methods leading to higher parallelism and workload and/or off-chip memory transfer reductions, such as binarisation [9], logarithmic quantisation [146] and block-circulant matrices [145], tend to result in higher energy efficiencies over alternative techniques with comparable network topologies and power consumptions.
When comparing platforms with similar throughput, the efficiency of power-hungry high-end GPPs tends to be lower than that of custom hardware implementations. Custom hardware facilitates parallelism at low precisions, achieving high throughput when running at a few hundred MHz, while CPUs and GPUs tend to operate at speeds on the order of GHz. For example, a binary HARPv2 implementation can provide comparable throughput to a Titan X Pascal GPU's, but is 24% more energy efficient [100]. The ASIC implementations achieve the highest energy efficiencies, primarily because they are not configurable and thus have lower capacitive loading than FPGA equivalents. Due to hardware overheads allowing for arbitrary logic and routing configurations and their lack of clock tree customisability, FPGAs can never compete with ASICs in terms of energy efficiency, yet FPGA implementations are still significantly more efficient than GPPs [7]. Memory hierarchy customisability also facilitates efficiency improvements, as was shown for YodaNN [9].
Application-specific Considerations
8.3.1 Retraining Time and Parameter Fine-tuning. Fixed-point and logarithmic quantisation, pruning and input-dependent compute methods require post-approximation retraining. The majority of the pruning methods captured in Fig. 3b use $\ell_1$ and $\ell_2$ regularisers. Their employment, however, tends to result in more iterations being required to achieve convergence, increasing training time. Ullrich et al. reported that training of networks exploiting the soft weight-sharing method is very slow for large-scale datasets [140]. Furthermore, the search for so-called hyper-parameters, such as pruning thresholds and quantisation precisions, can be cumbersome and expensive [55, 69]. The use of low-rank factorisation tends to necessitate more retraining iterations for convergence than alternative methods since layer-wise factorisation results in increased network depth, exacerbating the problem of vanishing gradients in DNNs. Factorisation is also compute-intensive.
Parameterisation.
During hardware design-space exploration, ASIC designs and some early FPGA-based works were only optimised for a single design metric: usually throughput. Many recent FPGA-based works have introduced general-purpose DNN accelerator frameworks which can cater to different design considerations based on desired application requirements. As a follow-up to FPGA-based framework fpgaConvNet [142], Venieris et al. extended their automatic design-space exploration algorithm to also support area and latency optimisation [143].
Hardware Design and Turnaround. Due to the rapidly evolving landscape of DNN algorithmic development, the flexibility of the hardware design process becomes a practical issue. With a time- and resource-consuming process, an inference platform could well become obsolete before it is manufactured. The design, fabrication and validation of ASICs normally take months, if not years, to complete. Such slow turnarounds expose DNN application designers to high risks in terms of time and monetary investment. GPPs, on the other hand, are well supported by full-stack DNN design frameworks using high-level front ends, with which approximation methods can be prototyped in weeks. Compared with these two families of platforms, FPGAs provide a useful tradeoff between performance and design costs. High-level synthesis tools reduce design difficulty and lead time while still allowing high throughput and energy efficiency to be obtained.
8.3.4 Regularisation. The authors of works exploiting many approximation methods, including low-precision quantisation [31, 101, 108], pruning [55, 122] and weight sharing [21], reported accuracies greater than FP32 baselines after their application. Courbariaux et al. explained that low-precision quantisation limits network capacity, forcing networks to leave local minima and find broader minima instead, improving generalisability by avoiding overfitting [31]. Similarly, in FITNet, the student network achieved 10× compression but a 1.4 pp accuracy improvement over its teacher due to the regularisation effect from reduced network complexity [116]. The authors of HashedNets explained that the random "virtual" connections generated by their parameter hashing increased network expressiveness [21]. Similar to dropout layers in DNN training, the introduction of randomness from approximation, in the form of either quantisation noise or connections, creates regularisation that improves the accuracy of smaller networks.
FUTURE DIRECTIONS
Now that we have evaluated the current trends in the field of DNN approximation algorithms and their implementations, we are in a position to propose some promising future research directions.
Evaluation Methodologies
In the development of throughput-oriented DNN algorithm implementations, being able to identify bottlenecks is crucial to the efficiency of research. A misidentification of a bottleneck's source usually leads to wasted design effort. In many publications to date, authors have employed ad hoc evaluation methodologies, reporting improvements against seemingly arbitrary DNN benchmarks without systematically determining their baselines' bottlenecks, how the characteristics of the selected models affect those bottlenecks or how far away design points are from theoretical maxima.
One of the major issues with DNN evaluation is the emphasis currently placed by many authors on peak arithmetic performance (in op/s). For example, the authors of the TPU stated that their architecture can achieve 92 Top/s [66]. When tested with real DNN layers, however, the performance actually achieved was below 15 Top/s due to memory bandwidth limits for all cases but one with a particularly high operational intensity. A focus on peak op/s can potentially lead to ignorance of the importance of microarchitectural design, making post-deployment accelerator efficiency underwhelming.
In Section 3, we compared the acceleration potential of DNN inference platforms using roofline modelling. For cross-platform evaluation, such models are useful since they present major bottlenecks in uniform and comparable formats, allowing the relative strengths and weaknesses of those platforms to be contrasted. Some authors have extended roofline modelling in order to capture other metrics. For example, in an attempt to analyse the tradeoff between energy efficiency and performance, Sayed et al. added frequency as a third axis, allowing power draw estimation [10].
For comparison of implementations, however, particularly those on the same platform, we are of the opinion that the use of roofline modelling is misguided. While points showing achieved arithmetic performance could be added to roofline plots, showing how much of their compute and memory bandwidth potential particular implementations achieve, the methodology's inherent orientation to arithmetic performance obscures other factors affecting analysis: chiefly workload. Two otherwise identical implementations with different levels of pruning, for example, may well exhibit negatively correlated op/s and cl/s, potentially making comparison of arithmetic performance misleading. In an attempt to tackle this, metrics including "equivalent throughput" (the arithmetic performance of a post-pruned network using the pre-pruning workload) have been introduced and are unfortunately now commonplace [37, 53]. We consider these to be not meaningful and to needlessly distract from consideration of fundamental measures, particularly classification rate.
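The divergence between arithmetic performance and classification rate can be made concrete with a small worked example. All figures below are hypothetical, and the function names are introduced only for illustration.

```python
def classification_rate(arith_perf_gops, workload_gop_per_cl):
    """Classifications per second from sustained Gop/s and Gop per classification."""
    return arith_perf_gops / workload_gop_per_cl

def equivalent_throughput(rate_cl_per_s, dense_workload_gop_per_cl):
    """'Equivalent' Gop/s: the achieved rate multiplied by the pre-pruning workload."""
    return rate_cl_per_s * dense_workload_gop_per_cl

# Hypothetical dense design: 1000 Gop/s sustained on a 10 Gop/cl network.
dense_rate = classification_rate(1000.0, 10.0)
# Hypothetical pruned design: irregularity lowers sustained performance to
# 600 Gop/s, but only 2 Gop/cl of work remains.
pruned_rate = classification_rate(600.0, 2.0)
pruned_equivalent = equivalent_throughput(pruned_rate, 10.0)
```

The pruned design has lower arithmetic performance (600 vs 1000 Gop/s) yet classifies three times as fast (300 vs 100 cl/s), while its "equivalent throughput" of 3000 Gop/s describes work that is never performed.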
We encourage the community to report sustained throughput (in cl/s or similar) for standard, up-to-date models and datasets in preference to (peak) arithmetic performance. In conducting the research for this article, we encountered many issues with performance comparison owing to authors evaluating their works very di erently, with some of the benchmarks used unpopular or even obsolete. Emerging benchmark suites such as MLPerf and DeepBench, which provide selections of widely accepted and current test cases, should be used for comprehensive evaluation, thereby also facilitating apples-to-apples comparison.
Research Objectives
9.2.1 Convergence Guarantees and Optimal Design Choices. Many approximation methods do not yet have mathematical proofs of guaranteed convergence, meaning that existing methods may not be applicable to new DNN models. We are therefore of the opinion that theoretical investigation into each such method's convergence would be a very useful endeavour. As a counterexample, Li et al. provided derivations for quantised DNNs' convergence criteria [79]. Sakr et al. also investigated analytical guarantees on the numerical precision of DNNs with fixed-point quantisation [119].
It would also be interesting to prove the existence of optimal design choices for each method. For example, Tai et al. [138] suggested that the CP decomposition proposed by Lebedev et al. [73] does not guarantee an optimal rank-r factorisation since the problem of finding the best low-rank CP factorisation is ill-posed [33]. Similarly, for circulant matrix methods, we can clearly observe a difference in accuracy degradation between CNNs and RNNs, but it is not yet possible to explain this discrepancy mathematically. A good understanding of the convergence and applicability of the various approximation methods would be beneficial to allow for their generalisation.
9.2.2 Self-adaptive Hyper-parameter Fine-tuning. During quantisation and pruning, many hyper-parameters need to be determined through extensive manual fine-tuning with a validation dataset. This will become infeasible as networks deepen. Methods with dynamic fine-tuning mechanisms are therefore potentially more scalable than those requiring manual intervention. As examples of the former, Bengio et al. [13] and Lin et al. [85] made pruning decisions using a Markov decision process, Liu et al. performed filter pruning using trainable scaling factors [92], Shin et al. learnt quantisation granularities via retraining [128] and Yang et al. removed filters to meet resource constraints [153]. If self-adaptive network fine-tuning can be generalised to different hyper-parameters and network models, the latency of DNN application design could be significantly reduced.
9.2.3 FPGA-ASIC Heterogeneous Systems. From Table 2, we can conclude that, while FPGAs are extremely flexible, ASICs offer the greatest performance. Instead of focussing on purely FPGA- or ASIC-only solutions, Nurvitadhi et al. proposed the single-package, heterogeneous integration of FPGAs and ASICs using Intel's Embedded Multi-die Interconnect Bridge [103]. In their system, the ASIC components, called TensorTiles, execute typical DNN operations such as matrix-vector MACs at eight-bit or lower precision, while the FPGA enables the application-specific optimisation of data management and scheduling. With two TensorTiles and one FPGA, this design demonstrated 3.3× and 4.0× improvements in energy efficiency and throughput, respectively, with AlexNet against an FPGA-only implementation on an Intel Stratix 10. This work proved that such heterogeneous systems are promising platforms for DNN applications and thus deserve particular attention. Xilinx's recently announced Adaptive Compute Acceleration Platform, featuring a hardened array of processors suited to neural network compute interfaced with soft logic through a network on chip, was designed to simultaneously achieve high performance and flexibility [151].
9.2.4 Hardware Inference of Irregular Data Patterns. While fine-grained pruning can lead to high compression, it also produces data distribution irregularity, making conversion of compression into speedups challenging [52, 55, 122]. For example, for AlexNet on GPUs with structured pruning, a compression ratio of 3.0 led to 3.0× greater throughput [74], while, in contrast, element-wise pruning resulted in superior compression (9.0×) but the same throughput [55]. In this context, there is an emerging need for hardware accelerators to support compressed and sparse networks to become competitive high-performance, low-power GPP alternatives. Works based on custom hardware, such as ESE [52] on FPGAs and Cnvlutin [4] and Minerva [114] on ASICs, featured fast and dynamic arithmetic operation avoidance suiting fine-grained pruning, achieving superior throughput and energy efficiency to GPP implementations. Future works should explore the further use of design flexibility to realise more acceleration from sparsity.
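The contrast between the two pruning styles can be sketched on a toy weight matrix. This is an illustration under stated assumptions and not the procedure of any cited work: both criteria are magnitude-based (largest absolute values for element-wise pruning, largest row L1 norms for structured pruning), the matrix values are invented, and the function names are hypothetical.

```python
def prune_elementwise(weights, keep):
    """Fine-grained pruning: zero all but the `keep` largest-magnitude entries.

    Assumes magnitudes are distinct; ties at the threshold would keep extra entries.
    """
    magnitudes = sorted((abs(w) for row in weights for w in row), reverse=True)
    threshold = magnitudes[keep - 1]
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in weights]


def prune_rows(weights, keep_rows):
    """Structured pruning: keep only the rows (e.g. filters) with the largest L1 norms."""
    ranked = sorted(range(len(weights)),
                    key=lambda i: sum(abs(w) for w in weights[i]),
                    reverse=True)
    return [weights[i] for i in sorted(ranked[:keep_rows])]


def nonzero_positions(weights):
    """Coordinates that a sparse format would need to store alongside the values."""
    return [(r, c) for r, row in enumerate(weights)
            for c, w in enumerate(row) if w != 0.0]


W = [  # hypothetical 4x4 layer, one row per filter
    [0.9, -0.1, 0.05, 0.8],
    [0.02, 0.7, -0.03, 0.04],
    [-0.6, 0.01, 0.5, -0.06],
    [0.07, -0.08, 0.09, 0.4],
]

sparse = prune_elementwise(W, keep=6)   # same shape, scattered nonzeros
dense = prune_rows(W, keep_rows=2)      # smaller but fully dense matrix

print(nonzero_positions(sparse))
print(len(dense), "x", len(dense[0]))
```

Element-wise pruning retains fewer values here, but their positions follow no regular pattern and must be tracked explicitly, whereas row pruning leaves a smaller dense matrix that maps directly onto regular compute.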
9.2.5 Parameter Hardening. Almost all works exploiting existing approximation still store parameters in DRAM for hardware reusability and scalability. With the large memory transfer reductions achievable through the use of aggressive methods including binarisation, logarithmic quantisation and weight sharing, however, smaller-sized parameters can fit on-chip more easily. It has thus become increasingly sensible to harden parameters into logic, reducing off-chip memory fetches. In some cases, memory fetching can be eliminated entirely. With base-two logarithmic quantisation, for example, multiplications are converted into binary shifts, which, when hardened, can be implemented without consuming any logic. Industrial firms such as Microsoft and Google have focussed their efforts on the optimisation of datacentre-scale DNN inference with custom ASIC [66] and FPGA [28] designs. Their huge throughput and energy efficiency requirements justify the use of extremely large and specialised accelerators employing loop unrolling and parameter hardening. Future research can explore the feasibility of this approach, showing how it trades off design reusability and scalability for throughput and efficiency.
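The replacement of multiplications by shifts under base-two logarithmic quantisation can be sketched as follows. This is a minimal software illustration, not a hardware description: it assumes nonzero weights, integer activations and rounding to the nearest exponent in the log domain, and both function names are hypothetical.

```python
import math


def quantise_log2(weight):
    """Approximate a nonzero weight as sign * 2**exponent (rounding in the log domain)."""
    sign = -1 if weight < 0 else 1
    exponent = round(math.log2(abs(weight)))
    return sign, exponent


def shift_multiply(activation, sign, exponent):
    """Multiply an integer activation by sign * 2**exponent using shifts only.

    A right shift discards low-order bits, so negative exponents round toward
    negative infinity for the magnitude being shifted.
    """
    if exponent >= 0:
        shifted = activation << exponent
    else:
        shifted = activation >> -exponent
    return sign * shifted


activation = 96  # hypothetical integer activation
for weight in (0.23, -3.7):  # hypothetical weights
    sign, exponent = quantise_log2(weight)
    product = shift_multiply(activation, sign, exponent)
    print(f"w={weight}: ~ {sign} * 2**{exponent}, product {product}")
```

When the exponent is fixed at design time, as it is for a hardened weight, a shift of this kind amounts to a fixed rewiring of bits, which is the sense in which the text describes it as consuming no logic.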
SUMMARY
In this article, we discussed the past, present and future of DNN approximation for custom hardware. With a roofline model analysis, we explained why DNNs' algorithmic advancement favours custom implementations, demonstrating how FPGAs and ASICs can offer performance superior to that of alternative platforms through the exploitation of approximation. With a comprehensive selection of state-of-the-art publications, we presented in-depth evaluations and comparisons of DNN approximation algorithms along with their respective hardware implementations. We summarised the current trends in the field, based on which we proposed several research questions which are yet to be sufficiently answered. Through this work, we hope to inspire new and exciting developments in DNN approximation that tap into the full potential offered by custom hardware platforms.
