Video Anomaly Detection (VAD) is an important topic in computer vision.
Motivated by the recent advances in self-supervised learning, this paper
addresses VAD by solving an intuitive yet challenging pretext task, i.e.,
spatio-temporal jigsaw puzzles, which is cast as a multi-label fine-grained
classification problem. Our method exhibits several advantages over existing
works: 1) the spatio-temporal jigsaw puzzles are decoupled in terms of spatial
and temporal dimensions, responsible for capturing highly discriminative
appearance and motion features, respectively; 2) full permutations are used to
provide abundant jigsaw puzzles covering various difficulty levels, allowing
the network to distinguish subtle spatio-temporal differences between normal
and abnormal events; and 3) the pretext task is tackled in an end-to-end manner
without relying on any pre-trained models. Our method outperforms
state-of-the-art counterparts on three public benchmarks. Especially on
ShanghaiTech Campus, the result is superior to reconstruction and
prediction-based methods by a large margin.Comment: Accepted by ECCV'2022; Code is available at
https://github.com/gdwang08/Jigsaw-VA