Large convolutional neural network models have recently demonstrated
impressive performance on video attention prediction. However, these models
typically require intensive computation and a large memory footprint. To address
these issues, we design an extremely lightweight and ultrafast network, named
UVA-Net. The network is built on depth-wise convolutions and takes
low-resolution images as input. However, this straightforward acceleration
scheme alone degrades performance dramatically. We therefore propose a
coupled knowledge distillation strategy to augment and train the network
effectively. With this strategy, the model can automatically discover
and emphasize useful cues implicit in the data. Both spatial and
temporal knowledge learned by high-resolution, complex teacher networks can
also be distilled and transferred to the proposed low-resolution, lightweight
spatiotemporal network. Experimental results show that our
model achieves performance comparable to ten state-of-the-art models in video
attention prediction, while requiring only a 0.68 MB memory footprint and
running at about 10,106 FPS on GPU and 404 FPS on CPU, which is 206 times
faster than previous models.
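
To make the two ingredients named above concrete, the following is a minimal
sketch in PyTorch, not the authors' released implementation, of a depth-wise
separable convolution block and a coupled distillation loss that supervises a
low-resolution student with both a spatial and a temporal teacher; the class
and function names, the use of MSE terms, and the weights alpha and beta are
illustrative assumptions.

    # Minimal sketch of the two ideas in the abstract (assumed names and losses).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class DepthwiseSeparableBlock(nn.Module):
        """Depth-wise 3x3 convolution followed by a 1x1 point-wise convolution."""
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=1, groups=in_ch)
            self.pointwise = nn.Conv2d(in_ch, out_ch, 1)

        def forward(self, x):
            return F.relu(self.pointwise(F.relu(self.depthwise(x))))

    def coupled_distill_loss(student_map, spatial_teacher_map, temporal_teacher_map,
                             gt_map, alpha=0.5, beta=0.5):
        """Combine ground-truth supervision with spatial and temporal teacher terms
        (an assumed form of the coupled distillation objective)."""
        task = F.mse_loss(student_map, gt_map)
        spatial = F.mse_loss(student_map, spatial_teacher_map)
        temporal = F.mse_loss(student_map, temporal_teacher_map)
        return task + alpha * spatial + beta * temporal

In this sketch the student predicts an attention map from low-resolution frames,
while the two teacher maps come from high-resolution spatial and temporal
networks; the extra loss terms are what transfers their knowledge into the
lightweight student.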