Despite recent developments, deploying deep neural networks on resource-constrained general-purpose hardware remains a significant challenge. There has been much work on methods for reshaping neural networks, usually with a focus on minimising total parameter count. These methods are typically developed in a hardware-agnostic manner and do not exploit hardware behaviour. In this paper we propose a new approach, Hardware Aware Knowledge Distillation (HAKD), which uses empirical observations of hardware behaviour to design efficient student networks that are then trained with knowledge distillation. This allows the trade-off between accuracy and performance to be managed explicitly. We have applied this approach across three platforms and evaluated it on two networks, MobileNet and DenseNet, on CIFAR-10. We show that HAKD outperforms Deep Compression and Fisher pruning in terms of size, accuracy and performance.
Introduction
Deep neural networks (DNNs) have recently experienced a surge in popularity due to their exceptional performance on a range of tasks [1], such as image classification, language translation, and speech recognition. This is partly due to the availability of GPU clusters which provide the computational power that DNNs typically demand. Such demands present a significant barrier to the adoption of DNNs in embedded edge devices (e.g. mobile phones), where their most interesting applications lie [2]. Given such large workloads, it is important to utilise any available hardware efficiently when deploying DNNs on resource-constrained devices. The most prevalent solution to this problem is to reduce the parameter count, and hence the number of weights stored in the DNN, which reduces its memory footprint.
It is well established that significant redundancy exists in many network architectures [3] , and that often up to 90% of parameters, or weights, stored in the network can be removed with minimal effect on accuracy. A variety of frameworks have since been developed for removing redundant weights systematically to accelerate inference. These promising techniques may allow us to deploy highly accurate neural networks with fast inference times, opening the door to a wide range of potential applications.
One such compression technique which has proved exceptionally popular is Deep Compression [4] , a weight pruning method that induces very sparse weight matrices for inference acceleration on low power devices. Deep Compression was later extended in Scalpel [5] to utilise device-specific parallelism; for example, when deploying on CPUs, Scalpel prunes weights into groups that match the width of the SIMD lanes on the device. By properly exploiting device-level parallelism, Scalpel is able to give up to 2× speedup over the device-agnostic methodology of Deep Compression.
However, recent work has shown that reducing network size while keeping weight matrices dense performs better than introducing weight sparsity on resource-constrained devices [6] , exposing channel pruning methods (e.g. Fisher pruning [7] ) as a preferable means of compression.
Although channel pruning is effective, hardware structure has a significant impact on its performance. In this paper, we propose a novel channel pruning approach that uses hardware behaviour to reshape networks for accelerating inference. We use knowledge distillation [8, 9] to train students that are hardware-aware reductions of their original teacher models; we call this approach Hardware Aware Knowledge Distillation (HAKD). HAKD achieves between 1.1× and 7× speedup over the original models and up to 3× speedup over Deep Compression while also improving performance and accuracy relative to other pruning methods.
HAKD provides the ability to optimise for different objectives. For instance, in some cases we may be willing to sacrifice a small amount of accuracy for a large inference time speedup, while in other scenarios, maximal accuracy for a given inference time may be critical. We combine insights from developments in neural compression techniques and computer architecture to develop a new methodology for reshaping neural networks to fit specific hardware platforms, tasks, and deployment priorities.
The contributions of this paper are as follows:
• A novel device-aware channel optimisation approach based on empirical execution time, incorporating knowledge distillation.
• Evaluation and comparison against popular pruning techniques on CIFAR-10 classification for MobileNet and DenseNet architectures.
• Evaluation on resource-constrained hardware (Nvidia Jetson TX2, ARM Cortex-A57), showing significant performance improvement.
Motivation
At the heart of our approach is the need to guide channel pruning based on actual hardware behaviour. To illustrate this, consider the graph in Figure 1. It shows the inference time on an Intel Core i7 of a single layer of a ResNet-50 [10] network pruned to different layer widths (number of channels).
With a larger number of channels, we have more weights and would expect greater accuracy, but as the figure shows, the associated extra computation increases execution time. However, as the figure also shows, the relationship between the number of channels and inference time is not linear. In fact a staircase pattern appears, where weight matrices stop utilising memory efficiently. The figure highlights the optimal points on each step of the staircase. These are the points that allow for the maximum number of channels (and hence accuracy) for a given inference time budget.
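The selection of the optimal point on each step can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name `staircase_optima`, the 5% relative tolerance used to decide when a new step begins, and the timing values are all hypothetical.

```python
from typing import List, Tuple

def staircase_optima(timings: List[Tuple[int, float]],
                     rel_tol: float = 0.05) -> List[Tuple[int, float]]:
    """Group (channels, time) measurements into steps and return the
    rightmost (widest) measurement of each step.

    A new step is assumed to begin when the time exceeds the time at the
    start of the current step by more than rel_tol (an illustrative rule).
    """
    points = sorted(timings)
    if not points:
        return []
    optima = []
    step_start_time = points[0][1]
    last = points[0]
    for channels, t in points[1:]:
        if t > step_start_time * (1.0 + rel_tol):
            optima.append(last)      # widest layer on the step just finished
            step_start_time = t      # a new step starts here
        last = (channels, t)
    optima.append(last)              # close the final step
    return optima

# Made-up timings (channels, milliseconds) with three visible steps.
measurements = [(60, 1.00), (62, 1.01), (64, 1.02),
                (66, 1.30), (68, 1.31), (70, 1.60)]
print(staircase_optima(measurements))
```

With these made-up measurements, widths 64 and 68 are the widest layers that still sit on the first and second steps, so they offer the most channels for those inference time budgets.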
Our approach is to first apply channel pruning to reach a step of the staircase, and then use knowledge distillation to learn a new model with the number of channels set to the optimal point on that step. This is described in greater detail in Section 4. A student network can be derived from the structure of a teacher network by reducing the layer widths (number of channels) and by reducing the number of layers (depth). In this paper, we only consider width reduction.
Related Work
Neural Compression Techniques

Weight Pruning
Weight pruning is a popular technique for reducing the parameter count in neural networks, and has been shown to reduce overfitting in some cases [11]. Weight pruning techniques generally fall into two categories: (i) unstructured removal of weights deemed unnecessary and (ii) structured regularisation, which pushes the network weights towards a regular sparse format. [12] first proposed the removal of redundant weights by penalising the error function with a count of the parameters being optimised. Minimising this error function thus jointly optimises accuracy and parameter count, analogous to finding the minimum description length of the network.
In Deep Compression [13] , the authors promote an iterative process which separates the optimisation of parameter count and accuracy. In order to learn a minimal weight structure, they propose cycling between steps of traditional training and single steps where low magnitude weights are removed. This is applied layer-by-layer, with the threshold for removal tuned by the standard deviation of the layer. The authors show that it is possible to prune up to 90% of the weights without affecting accuracy. In [4] , this is developed using a three stage method for storing the network involving pruning, quantisation, and Huffman coding.
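A single magnitude-pruning step with a threshold tied to the layer's standard deviation might look like the sketch below. The scaling factor `quality` and the random layer are assumptions made for illustration; the retraining that follows each pruning step is omitted.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, quality: float = 1.0) -> np.ndarray:
    """Zero every weight whose magnitude is below quality * std(weights).

    `quality` is a hypothetical per-layer scaling factor; larger values
    remove more weights.
    """
    threshold = quality * weights.std()
    mask = np.abs(weights) >= threshold
    return weights * mask

rng = np.random.default_rng(0)
layer = rng.normal(size=(64, 128))          # stand-in for one layer's weights
pruned = magnitude_prune(layer, quality=1.0)
sparsity = 1.0 - np.count_nonzero(pruned) / pruned.size
print(f"sparsity after one pruning step: {sparsity:.2f}")
```

In the iterative scheme described above, a step like this would alternate with ordinary training so that the remaining weights can compensate for those removed.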
Deep Compression relies on the assumption that low magnitude weights are unimportant. However, the validity of this assumption has recently been challenged [14] and it has been shown that these small weights often carry more importance than their magnitude suggests. The large FLOP saving Deep Compression provides is also often not leveraged into speedup because of the sparse format the weight matrices are left in [5, 6] . Moreover, the process of iteratively pruning and retraining is very expensive.
Gaussian dropout [15] is a form of dropout that applies noise sampled from a Gaussian distribution instead of using a binary mask. In [16], the authors show that Gaussian dropout can be modelled as a form of stochastic gradient variational Bayes using a specific prior and variational posterior over the weights. This means that they can optimise a dropout rate for each individual weight using the variational lower bound. This was extended in [17], where a new method for approximating the KL divergence of the prior and posterior distribution was exploited to remove weights from the network.
The new approximation allows for the noise variance to be unbounded, and in such cases that the noise variance for a parameter tends to infinity, the parameter is removed. This yields impressive results and was shown to reduce the parameters up to 280× on LeNet [18] architectures, but has yet to be extended to larger models.
Channel Pruning
A natural extension of weight pruning is to not only remove individual weights in a network, but entire channels from the convolutional layers and nodes from the fully connected layers. A common approach to this is to apply Lasso regression to the problem of channel selection [19] since low regression coefficients may be viewed as corresponding to low impact on the error function. This view of channel pruning was questioned by [14] , who applied Lasso instead to the batch-norm [20] layers of the network and then removed any channels with low batch-norm coefficients, resulting in greater accuracy.
Another approach altogether is to approximate the effect of removing certain channels on the error using a Taylor series expansion [21, 7] (which may be seen as an extension of [22] ) to give a pruning signal. [7] weight this pruning signal with a FLOP penalty to bias the method towards removing computationally expensive neurons. While effective at providing a very regular set of small, dense matrices, we show that the solution found by Fisher pruning is very rarely optimal for specific deployment targets. Indeed, recent work has advocated training such networks from scratch [23] .
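As a rough illustration of a Taylor-based pruning signal, the sketch below scores each channel by the squared sum of activation times gradient, averaged over examples. This particular form, the function names, and the way the FLOP penalty enters (subtracting `beta` times the FLOPs saved) are assumptions made for illustration and should be checked against [7] before use.

```python
import numpy as np

def fisher_signal(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Per-channel estimate of the loss increase caused by removing a channel.

    activations, gradients: arrays of shape (N, C, H, W), where gradients
    holds d(loss)/d(activation) for each example. Returns shape (C,).
    """
    n = activations.shape[0]
    per_example = (activations * gradients).sum(axis=(2, 3))   # (N, C)
    return (per_example ** 2).sum(axis=0) / (2.0 * n)

def penalised_signal(signal: np.ndarray, flops_per_channel: np.ndarray,
                     beta: float) -> np.ndarray:
    """Bias selection towards expensive channels (hypothetical weighting)."""
    return signal - beta * flops_per_channel

rng = np.random.default_rng(0)
acts = rng.normal(size=(8, 4, 5, 5))
grads = rng.normal(size=(8, 4, 5, 5))
scores = fisher_signal(acts, grads)
print("channel to prune:", int(np.argmin(scores)))
```

The channel with the smallest (penalised) score is the candidate for removal at each pruning step.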
Hardware Aware Reshaping
DPP-Net [24] is a platform-aware method for performing neural architecture search that uses Sequential Model-Based Optimisation to generate candidate architectures and a Recurrent Neural Network (RNN) to estimate potential accuracy that could be reached by training the network. This search includes soft targets of memory budget, energy consumption, and inference time. It is able to generate highly inference-efficient network architectures that also achieve high accuracy on the example datasets. However, the architecture search process is both very expensive and usually only applicable to constrained spaces.
A search-free approach was outlined in NetAdapt [25], which takes as input pre-trained models and iteratively applies pruning and retraining steps to each layer, suggesting several new layer sizes and empirically testing inference time on the target hardware. The authors compare their empirical testing against using FLOPs as a proxy for inference time and show that their method finds networks with faster inference times. While this is a very effective step towards hardware aware network architecture design, the process of iteratively pruning and retraining several different layer sizes is extremely computationally expensive, and this approach quickly becomes intractable for very large networks.
Knowledge Distillation
Networks can also be made smaller through knowledge distillation [8, 9] whereby a small student network is trained both on the data and on the outputs of a larger pre-trained teacher network. This has been shown empirically to yield higher accuracy than training on just the data. Alternatively, the student can utilise the teacher's intermediate activations [26, 27] for the same purpose. Typically the student network is thinner or has fewer layers than the teacher (as shown in Figure 2 ), but recent work has demonstrated the effectiveness of simply using the same network with inexpensive convolutions [28] .
Parallelism in Neural Networks
The most common operation performed in state-of-the-art image processing networks is a 2D convolution, which demands a significant majority of the computation time. Unfortunately, direct convolution has received very little attention from the perspective of acceleration and is therefore not well-optimised for a range of devices. For this reason, deep learning acceleration libraries often reduce convolution to matrix-matrix multiplication by stretching out the input image and kernels into matrices via the im2col algorithm [29] .
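The reduction of convolution to matrix-matrix multiplication can be made concrete with a small NumPy sketch. It assumes stride 1 and no padding, and the function names are illustrative; optimised libraries use far more elaborate layouts.

```python
import numpy as np

def im2col(x: np.ndarray, k: int) -> np.ndarray:
    """Unroll a (C, H, W) input into a (C*k*k, out_h*out_w) matrix."""
    c, h, w = x.shape
    out_h, out_w = h - k + 1, w - k + 1
    cols = np.empty((c * k * k, out_h * out_w), dtype=x.dtype)
    row = 0
    for ci in range(c):
        for i in range(k):
            for j in range(k):
                # Every output position reads this (channel, i, j) offset once.
                cols[row] = x[ci, i:i + out_h, j:j + out_w].reshape(-1)
                row += 1
    return cols

def conv2d_gemm(x: np.ndarray, kernels: np.ndarray) -> np.ndarray:
    """Convolve x (C, H, W) with kernels (F, C, k, k) via one matrix product."""
    f, _, k, _ = kernels.shape
    out_h, out_w = x.shape[1] - k + 1, x.shape[2] - k + 1
    weight_matrix = kernels.reshape(f, -1)            # (F, C*k*k)
    return (weight_matrix @ im2col(x, k)).reshape(f, out_h, out_w)

rng = np.random.default_rng(0)
image = rng.normal(size=(3, 6, 6))
filters = rng.normal(size=(4, 3, 3, 3))
print(conv2d_gemm(image, filters).shape)
```

Note that the number of filters F sets the number of rows of the weight matrix, so changing a layer's width directly changes the shape of the matrix multiplication that the hardware executes.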
Using the matrix-matrix multiplication form of convolution makes data movement the bottleneck of inference for many hardware platforms. However, matrix-matrix multiplication has been heavily optimised by the compiler community and many techniques have been developed for hiding the latency of memory accesses while doing large multiplications. On GPUs, the most common approach is to use tiling and register blocking to maximise the ratio of data reuse to data movement (i.e. to keep parameters in memory for as long as possible and reuse them where possible to avoid reloading them later); instead of reading one element at a time, blocks of elements are read together so that they can be reused in the calculation. However, tile size selection can have a very significant impact on the efficiency of the operation [30] . Finding a tile size that correctly balances efficient use of the cache line width on the device and efficient fragmentation of the matrix is often a challenging task; given that GPUs usually have cache line widths set to powers of two, it is common to see significant speedups when the tiles can fit into power-of-two length words. Thus, when a tile does not perfectly fit the available cache it is common practice to pad the line of cache until it fits the word length (a similar technique is used to accelerate convolution on CPUs [31] ).
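The cost of padding can be illustrated with simple arithmetic. The sketch below assumes a hypothetical tile width of 32 elements, which is not a value taken from the paper, and rounds a matrix dimension up to the next multiple of the tile.

```python
def padded_size(n: int, tile: int) -> int:
    """Round n up to the next multiple of tile."""
    return -(-n // tile) * tile

def padding_fraction(n: int, tile: int) -> float:
    """Fraction of the padded dimension that holds padding rather than data."""
    padded = padded_size(n, tile)
    return (padded - n) / padded

# Widths at and just above a multiple of the (assumed) tile width.
for width in (64, 65, 96):
    print(width, padded_size(width, 32), round(padding_fraction(width, 32), 3))
```

Under this assumption a width of 65 is padded to 96, the same size as a width of 96, which suggests why nearby widths can share an inference time while a single extra channel can move the layer onto the next step.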
However, with convolutional neural networks as our workload we are in a unique position, because we have shown that we are effectively able to choose the sizes of the matrices to fit the device we are deploying to by reshaping the widths of the layers.
Hardware Aware Distillation
Neural compression techniques have largely been developed in a hardware-agnostic manner, with a strong focus on reducing the number of parameters (/weights) without affecting accuracy. Conversely, systems researchers have focused on improving inference time for a fixed network structure.
We aim to unify both approaches using hardware aware knowledge distillation. Networks can be made smaller through knowledge distillation [8, 9] where a small student network is trained both on the data and on the outputs of a larger teacher network (Figure 2). We use knowledge of hardware behaviour to distil, from a teacher, a student network of reduced width that is likely to perform well on a specific device.
Figure 3: The Hardware Aware Distillation pipeline. In step 1, we use the Fisher pruned model and our staircase optimiser to design a student network that fits a specific deployment platform. In step 2 we use the original model to train the hardware constrained student using knowledge distillation.
Distillation
To describe knowledge distillation [8, 9] more precisely, let's assume we have a dataset of pairs of images and class labels, where each image is denoted as x and each label is a one-hot vector y. Given a pre-trained teacher network that outputs logits t = teacher(x), we can learn the parameters for a student network (outputting logits s = student(x)) by minimising:
where $\mathcal{L}_{CE}(p, q) = -\sum_k p_k \log q_k$ is the cross-entropy function, and $\sigma(\cdot)$ is the softmax function (transforming the logits for each network into a vector of probabilities). $T$ is a fixed temperature parameter [9].
The first term is a standard cross-entropy loss for the student network, whereas the second term is minimised when the student network emulates the teacher. This technique allows the student to achieve a greater generalisation capability than training without a teacher. Note that α controls the ratio of the two terms.
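A two-term loss of this kind can be sketched in NumPy. The exact weighting is an assumption here: the sketch uses $(1-\alpha)\,\mathcal{L}_{CE}(y, \sigma(s)) + \alpha T^2\,\mathcal{L}_{CE}(\sigma(t/T), \sigma(s/T))$, with the $T^2$ scaling suggested in [9], and the paper's own Equation 1 may differ by a constant factor.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=-1, keepdims=True)     # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """L_CE(p, q) = -sum_k p_k log q_k, computed per example."""
    return -np.sum(p * np.log(q + eps), axis=-1)

def kd_loss(y, s, t, alpha: float = 0.9, T: float = 4.0) -> float:
    """Assumed distillation loss, averaged over the batch.

    y: one-hot labels, s: student logits, t: teacher logits, all (batch, classes).
    """
    hard = cross_entropy(y, softmax(s))                      # fit the labels
    soft = cross_entropy(softmax(t / T), softmax(s / T))     # match the teacher
    return float(np.mean((1.0 - alpha) * hard + alpha * T ** 2 * soft))

labels = np.array([[0.0, 1.0, 0.0]])
teacher_logits = np.array([[1.0, 3.0, 0.5]])
student_logits = np.array([[0.2, 1.5, 0.1]])
print(kd_loss(labels, student_logits, teacher_logits))
```

The default values of `alpha` and `T` follow the settings reported in the experimental setup; the logits are made up.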
Hardware Aware Knowledge Distillation
HAKD is a two-step process, as shown in Figure 3. First, we determine the best number of channels for an inference task based on hardware behaviour; we then distil a network that best matches the function represented by the original teacher network.
Step 1 We first iteratively apply Fisher pruning [7] to our target network. This means we incrementally remove a channel from the network design, retrain the model and measure its inference time.
From this we have a curve describing accuracy vs. channel count.
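Measuring inference time empirically needs some care with warm-up and noise. The helper below is a generic sketch rather than the measurement procedure used in the paper: the name `measure_latency`, the warm-up count, and the use of the median are assumptions.

```python
import statistics
import time
from typing import Callable

def measure_latency(run_inference: Callable[[], object],
                    warmup: int = 3, repeats: int = 10) -> float:
    """Return the median wall-clock time in seconds of one inference call."""
    for _ in range(warmup):
        run_inference()                    # populate caches before timing
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_inference()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)

# Stand-in workload; in practice this would run the pruned network once.
latency = measure_latency(lambda: sum(i * i for i in range(10_000)))
print(f"median latency: {latency * 1e3:.3f} ms")
```

Repeating such a measurement after each pruning step yields the inference time for every candidate width on the target device.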
Given this trade-off curve, we then determine the best channel count. This is scenario specific. There are two possible strategies: maximising accuracy for a given inference time, and minimising inference time for a given accuracy.
The first strategy is to cluster the layer width choices towards maximising parameter counts; this means that any time a pruning technique converges to a point that sits on a step (see Figure 1), we choose the rightmost element on that step as our layer width. This technique will maximise the number of parameters in the model without affecting the inference time, usually resulting in higher accuracy (lower error) for a moderate increase in memory cost. We refer to this policy as Accuracy-maximisation (ACC-MAX).
The second possible strategy is to minimise inference time in scenarios where lower accuracy is acceptable. In this case, we cluster the decisions of a pruning technique towards the rightmost element of the step below the current step. This means that we incur a slight hit in accuracy (due to a decreased parameter count) but leverage this into a significant speedup on the hardware in question. We refer to this clustering policy as Inference-minimisation (INF-MIN).
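The two policies can be expressed as lookups into the sorted list of rightmost widths of each step. This is a sketch under stated assumptions: the step edges below are hypothetical, and the behaviour at the boundaries (raising an error above the measured range, leaving the width unchanged when no lower step exists) is a design choice made here, not one described in the paper.

```python
from bisect import bisect_left
from typing import Sequence

def acc_max(width: int, step_edges: Sequence[int]) -> int:
    """ACC-MAX: move to the rightmost width of the step containing `width`."""
    i = bisect_left(step_edges, width)
    if i == len(step_edges):
        raise ValueError("width lies beyond the measured staircase")
    return step_edges[i]

def inf_min(width: int, step_edges: Sequence[int]) -> int:
    """INF-MIN: move to the rightmost width of the step below `width`."""
    i = bisect_left(step_edges, width)
    if i == 0:
        return width                 # no lower step exists; keep the width
    return step_edges[i - 1]

edges = [32, 64, 96, 128]            # hypothetical rightmost width of each step
print(acc_max(70, edges), inf_min(70, edges))
```

For a pruned width of 70 with these edges, ACC-MAX widens the layer to 96 at no extra inference cost, while INF-MIN narrows it to 64 to drop onto the faster step.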
Step 2 Once we have determined the ideal number of channels, we apply knowledge distillation to train a student which has this fixed number of channels. At this stage, as the size and cost of the network is fixed, we simply aim to maximise accuracy.
In the remainder of the paper we consider only ACC-MAX and leave INF-MIN to be explored in future work.
Experimental Setup
In this section we describe the network architectures (MobileNet and DenseNet) and the dataset (CIFAR-10) we use for our experiments. We also provide details of how we train and prune these networks, as well as the hardware platforms we evaluate them on.
Networks MobileNet [32] is an architecture developed specifically for mobile devices. Instead of full convolutional kernels it utilises depthwise separable convolutions: a convolutional filter is applied individually to each input channel, the channels are then mixed through a 1 × 1 convolution [33] . DenseNet [34] is a very popular modern architecture that consists of convolutional blocks. We use bottleneck blocks whereby each block contains a 1 × 1 convolution, followed by a 3 × 3 convolution. Block outputs are concatenated and form the input to later blocks, and this encourages feature reuse to allow for powerful representations. Specifically, we use a 121 layer DenseNet with a growth rate of 32, and a transition reduction rate of 0.5.
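The saving from depthwise separable convolutions follows from a simple parameter count. The sketch below ignores biases and batch-norm parameters, and the channel counts are chosen only for illustration.

```python
def full_conv_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Weights in a standard k x k convolution."""
    return k * k * c_in * c_out

def separable_conv_params(c_in: int, c_out: int, k: int = 3) -> int:
    """Weights in a depthwise k x k convolution followed by a 1 x 1 convolution."""
    depthwise = k * k * c_in          # one k x k filter per input channel
    pointwise = c_in * c_out          # 1 x 1 convolution mixes the channels
    return depthwise + pointwise

full = full_conv_params(128, 128)
separable = separable_conv_params(128, 128)
print(full, separable, round(full / separable, 2))
```

For 128 input and 128 output channels this gives 147,456 weights for the full convolution against 17,536 for the separable pair, roughly an 8.4× reduction.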
CIFAR-10
For our experiments we train and evaluate network models on CIFAR-10 [35] , a standard machine learning benchmark dataset. It consists of 60,000 32 × 32 pixel images across 10 object categories split into a train and test set of 50,000 and 10,000 images respectively.
Training To train networks from scratch, we use Stochastic Gradient Descent (SGD) with momentum to minimise the cross-entropy loss between class labels and network outputs. We use padding and left-right flips to augment the training data. We train for 200 epochs using mini-batches of size 128. The initial learning rate is 0.1 and is decayed by a factor of 0.2 every 60 epochs. We use weight decay of 0.0005 and momentum of 0.9. When performing knowledge distillation, we train the student network to instead minimise the loss in Equation 1 with α as 0.9 and T as 4.
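The step decay described above can be written as a one-line schedule. The sketch assumes zero-indexed epochs and that the decay is applied at the start of epochs 60, 120 and 180; the function name is illustrative.

```python
def learning_rate(epoch: int, base_lr: float = 0.1,
                  gamma: float = 0.2, step: int = 60) -> float:
    """Learning rate after decaying by `gamma` every `step` epochs."""
    return base_lr * gamma ** (epoch // step)

for epoch in (0, 60, 120, 180):
    print(epoch, f"{learning_rate(epoch):.4f}")
```

Under this reading the final learning rate is 0.1 × 0.2³ = 0.0008, which matches the learning rate quoted for Fisher pruning fine-tuning.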
Deep Compression
To perform Deep Compression, we take the trained network and lower the learning rate to 1/100th of the last recorded learning rate. Every 30 epochs we increase the sparsity level by 10% by increasing the magnitude threshold, and retrain using L2 regularisation with a penalty of 0.0005. When the weights for a convolutional filter are all zero-valued, we remove the whole filter.
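The sparsity schedule and filter removal can be sketched as follows. Several details are assumptions: the threshold is taken as a quantile of the absolute weights, sparsity is assumed to be 0% for the first 30 epochs, and the function names are hypothetical. Retraining between increases is omitted.

```python
import numpy as np

def sparsity_at_epoch(epoch: int, step: float = 0.10, interval: int = 30) -> float:
    """Target sparsity, raised by `step` every `interval` epochs (assumed start at 0)."""
    return min(step * (epoch // interval), 1.0)

def prune_to_sparsity(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero the smallest-magnitude weights until roughly `sparsity` are zero."""
    threshold = np.quantile(np.abs(weights), sparsity)
    return np.where(np.abs(weights) > threshold, weights, 0.0)

def live_filters(weights: np.ndarray) -> np.ndarray:
    """Indices of filters in a (F, C, k, k) tensor that keep a non-zero weight."""
    per_filter = np.abs(weights).reshape(weights.shape[0], -1).sum(axis=1)
    return np.flatnonzero(per_filter > 0)

rng = np.random.default_rng(0)
conv_weights = rng.normal(size=(16, 8, 3, 3))
half_sparse = prune_to_sparsity(conv_weights, 0.5)
print(1.0 - np.count_nonzero(half_sparse) / half_sparse.size)
print(len(live_filters(half_sparse)), "filters still have non-zero weights")
```

Filters absent from `live_filters` are the ones that can be removed entirely, which is the only point at which this scheme yields smaller dense matrices.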
Fisher Pruning We perform Fisher pruning by fine-tuning the trained network with a learning rate of 0.0008. Every 10 steps, a single channel (the channel with the smallest effect on the loss) is pruned. For MobileNet we prune the connections between adjacent depthwise separable pairs. For DenseNet we prune the connections between the 1 × 1 and 3 × 3 convolution within each block.
Hardware Platforms We evaluated our approach on 3 different hardware platforms. The first one is an Intel Core i7-3820 CPU with 4 cores @ 3.60GHz and 16GB of DDR2 RAM. We also consider the quad-core ARM Cortex-A57 CPU, and the 256-core Pascal GPU, both present on the Nvidia Jetson TX2 development board with 8GB of LPDDR4 RAM.
Results
In this section, we first consider the performance of our distillation technique on the test set of CIFAR-10 in terms of error. We then explore the inference time for each approach and present speedups over the default networks. This is followed by an evaluation of how network size and accuracy trade off for the main compression techniques.
Accuracy
Consider the bar plot in Figure 5. Here we plot the CIFAR-10 top-1 classification error for each compression technique when applied to DenseNet-121 and MobileNet. The original DenseNet-121 network has an error rate of 4.7%. Fisher pruning reduces the size of the network but at the cost of an increased error rate (5.5%). Deep Compression increases error still further (5.8%). The three HAKD schemes have superior accuracy: between 5% and 5.25% error depending on platform.
A similar trend is observed for MobileNet. Once again HAKD has improved error-rates relative to Deep Compression and Fisher pruning.
We show this again in Figure 4 , which shows the error curves for each compression technique on DenseNet-121 and MobileNet. If we slice the architecture found by Fisher pruning at the highlighted points and reshape it for each of our hardware platforms then we are able to leverage increased parameter count into a significant decrease in error, for a fixed inference time.
Speedup over Original Networks
Reducing the number of weights in a network reduces the amount of computation and hence we expect to reduce inference time at the cost of the increased error rate shown above. In Figure 6 we show the relative performance of the 3 compression techniques on 3 different platforms for DenseNet-121. While Deep Compression is able to decrease inference time by 1.1× to 2×, this is outperformed by Fisher pruning, which gives speedups from 2× to 7×. Given that Fisher pruning also has reduced error, it clearly outperforms Deep Compression.
A similar pattern holds for MobileNet (Figure 7). However, given that the Jetson board and the i7 platform are large relative to the network, pruning has little impact on execution time here.
In all cases HAKD has the same speedup as Fisher pruning, while having superior accuracy as shown above.
Parameter Size
Here we explore in more detail the trade off in network size vs. error. Consider Figure 4 : this shows the error rate for two networks for a varying number of network parameters. As expected, Fisher pruning gives increased accuracy over Deep Compression for a particular network size. On both graphs are the solution points found by HAKD for the 3 platforms after network size has been selected. In each case HAKD gives reduced error rates for a particular size. Alternatively, this graph can be read as: for a particular error rate, HAKD gives a smaller network.
Conclusion
It has long been understood that significant parameter redundancy exists in many deep neural networks. Now that compression techniques have matured, we are able to take advantage of insights from both developments in compiler optimisation and improvements in neural network acceleration schemes to provide an across-stack approach [6] to optimising CNNs for specific tasks and devices. We show that taking an across-stack approach observing hardware behaviour allows us to outperform popular pruning techniques in terms of both accuracy and inference speed.
