Staircase: Distilling with Performance Enhanced Students for Hardware by Turner, Jack et al.
• Idea: propose a novel channel pruning approach that uses hardware 
behaviour to reshape networks
• A small student network is trained both on the data and outputs of a 
larger pre-trained teacher network
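A minimal sketch of this kind of training objective, assuming the classic softened-logit distillation loss: cross-entropy on the labels plus a temperature-scaled KL term against the teacher's outputs. The function names and the `temperature` and `alpha` values are illustrative, not taken from the poster, and the pipeline's Step 2 trains the student with attention transfer rather than this exact loss.

```python
import numpy as np


def log_softmax(logits, axis=-1):
    """Numerically stable log-softmax."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.9):
    """Weighted sum of a hard-label term and a soft teacher term.

    temperature and alpha are hypothetical defaults chosen for illustration.
    """
    n = student_logits.shape[0]

    # Hard-label term: cross-entropy between student predictions and the data.
    log_p = log_softmax(student_logits)
    cross_entropy = -log_p[np.arange(n), labels].mean()

    # Soft term: KL(teacher || student) on temperature-softened outputs.
    log_p_student = log_softmax(student_logits / temperature)
    log_p_teacher = log_softmax(teacher_logits / temperature)
    p_teacher = np.exp(log_p_teacher)
    kl = (p_teacher * (log_p_teacher - log_p_student)).sum(axis=1).mean()

    # The T^2 factor keeps the two terms' gradients on a comparable scale.
    return (1.0 - alpha) * cross_entropy + alpha * temperature ** 2 * kl


# Toy usage: a batch of 8 examples over 10 classes with random logits.
rng = np.random.default_rng(0)
student = rng.normal(size=(8, 10))
teacher = rng.normal(size=(8, 10))
labels = rng.integers(0, 10, size=8)
loss = distillation_loss(student, teacher, labels)
```

When the student reproduces the teacher's logits exactly, the soft term vanishes and only the hard-label cross-entropy remains.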
Staircase: Distilling with Performance Enhanced Students for Hardware
Jack Turner1, Elliot J. Crowley1, Valentin Radu1, José Cano2, Amos Storkey1, Michael O'Boyle1
1School of Informatics, University of Edinburgh, UK - 2School of Computing Science, University of Glasgow, UK
Motivation
Results: CIFAR-10
ARM Research Summit - Austin, USA
September 15-18, 2019
Discovery and optimisation pipeline (1)
Model distillation
Discovery and optimisation pipeline (2)
• Inference time for a layer of ResNet-34 vs #channels on Intel Core i7
• Staircase pattern: For a given inference time, the green points maximise 









• We have described a simple method for discovering 
performance-enhanced reductions of large baseline neural networks
• We have compared our technique to common pruning approaches and 
demonstrated its superiority on both CIFAR-10 and ImageNet






• Step 1: Using channel saliency and empirical latency, design student
• Step 2: Train via attention transfer
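As a rough illustration of Step 2, the sketch below assumes the activation-based attention transfer formulation: each feature map is collapsed over channels into a spatial attention map, L2-normalised, and the student is penalised for deviating from the teacher at matching points in the network. The function names and tensor shapes are hypothetical; the poster does not specify the exact loss or its weighting.

```python
import numpy as np


def attention_map(features, eps=1e-12):
    """Collapse (N, C, H, W) activations into L2-normalised spatial maps.

    Averaging squared activations over channels removes the channel
    dimension, so teacher and student may have different widths.
    """
    n = features.shape[0]
    amap = (features ** 2).mean(axis=1).reshape(n, -1)
    norm = np.linalg.norm(amap, axis=1, keepdims=True)
    return amap / np.maximum(norm, eps)


def attention_transfer_loss(student_feats, teacher_feats):
    """Sum of mean squared differences between paired attention maps."""
    return sum(
        ((attention_map(s) - attention_map(t)) ** 2).mean()
        for s, t in zip(student_feats, teacher_feats)
    )


# Toy usage: three attention points, with a narrower student at each one.
rng = np.random.default_rng(0)
teacher_feats = [rng.normal(size=(2, c, 8, 8)) for c in (16, 32, 64)]
student_feats = [rng.normal(size=(2, c, 8, 8)) for c in (6, 12, 24)]
at_loss = attention_transfer_loss(student_feats, teacher_feats)
```

In training this term would typically be added to the usual classification loss with a weighting coefficient, which is omitted here.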
Network                   Params   MACs    Top-1 Err (%)   Top-5 Err (%)   Speed    MACs/s
Baseline ResNet-34        21.3M    4.12G   21.84           5.71            0.122s   33.77G
Fisher-pruned ResNet-34   5.3M     1.44G   43.43           18.87           0.038s   37.89G
Our ResNet-34             6.8M     1.58G   31.29           11.16           0.040s   39.50G
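The MACs/s column follows from dividing MACs by inference time. The short check below recomputes it from the table's own numbers; the dictionary layout and function name are illustrative only.

```python
# (MACs, inference time in seconds) copied from the table above.
results = {
    "Baseline ResNet-34": (4.12e9, 0.122),
    "Fisher-pruned ResNet-34": (1.44e9, 0.038),
    "Our ResNet-34": (1.58e9, 0.040),
}


def macs_per_second(macs, seconds):
    """Hardware throughput achieved by a network: MACs executed per second."""
    return macs / seconds


throughput = {name: macs_per_second(m, t) for name, (m, t) in results.items()}

for name, value in throughput.items():
    print(f"{name}: {value / 1e9:.2f} GMACs/s")

# Speed-up of the staircase-adapted network over the baseline.
speedup = results["Baseline ResNet-34"][1] / results["Our ResNet-34"][1]
```

The recomputed values match the table (33.77G, 37.89G, 39.50G). Although our ResNet-34 performs more MACs than the Fisher-pruned network, it takes almost the same time to run, so it achieves the highest throughput of the three.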
• Student discovery algorithm
o Starting: base model, a Fisher-pruned 
reduction of the base model, and a 
target hardware platform
o We iterate over all prunable layers in 
the base model and construct a set of 
optimal points
o We then adapt the pruned layer 
widths to their nearest optimal point 
and return the resulting architecture
• Example: block diagram of a WideResNet (attention maps: 1, 2, 3)
[Figure: block diagrams of the Teacher and Student networks]
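The student discovery steps listed above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes latency has already been measured on the target hardware for every candidate channel count of each prunable layer, and it treats an "optimal point" as a channel count for which no wider layer runs at the same or lower latency (the right-hand edge of each stair). All function names, the tie-breaking rule, and the synthetic latency table are hypothetical.

```python
import math


def staircase_optimal_points(latency_by_channels, tol=0.0):
    """Return channel counts that no wider layer matches or beats on latency."""
    optimal = []
    best_latency_above = float("inf")
    # Scan from widest to narrowest, tracking the fastest wider layer seen.
    for channels in sorted(latency_by_channels, reverse=True):
        latency = latency_by_channels[channels]
        if latency < best_latency_above - tol:
            optimal.append(channels)
        best_latency_above = min(best_latency_above, latency)
    return sorted(optimal)


def snap_to_nearest(width, optimal_points):
    """Move a pruned width to its nearest optimal point.

    Ties go to the wider option (an assumption made for this sketch).
    """
    return min(optimal_points, key=lambda p: (abs(p - width), -p))


def discover_student(pruned_widths, latency_tables, tol=0.0):
    """Adapt every pruned layer width to its nearest optimal point."""
    student = {}
    for layer, width in pruned_widths.items():
        points = staircase_optimal_points(latency_tables[layer], tol)
        student[layer] = snap_to_nearest(width, points)
    return student


# Synthetic staircase: latency jumps each time the width passes a multiple of 4.
latency = {c: float(math.ceil(c / 4)) for c in range(1, 13)}
pruned = {"conv1": 5, "conv2": 10}  # widths from a Fisher-pruned model
student = discover_student(pruned, {"conv1": latency, "conv2": latency})
```

With this synthetic table the optimal points are 4, 8 and 12 channels, so a pruned width of 5 moves to 4 and a pruned width of 10 moves to 12. On real hardware the measurements are noisy, so a non-zero `tol` would likely be needed to avoid treating jitter as a stair.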
