Video anomaly detection (VAD) is an essential yet challenging task in signal
processing. Since certain anomalies cannot be detected by analyzing temporal or
spatial information alone, the interaction between these two types of
information is considered crucial for VAD. However, current dual-stream
architectures either confine this interaction to the bottleneck of an
autoencoder or incorporate background pixels irrelevant to anomalies into the
interaction. To address these limitations, we propose a multi-scale
spatial-temporal interaction
network (MSTI-Net) for VAD. First, to pay particular attention to objects and
reconcile the significant semantic differences between the two types of
information, we propose an attention-based spatial-temporal fusion module
(ASTM) as a substitute for conventional direct fusion.
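As a rough illustration only (not the authors' code), the following PyTorch sketch shows one way such an attention-based fusion of appearance and motion features could be realized; the particular channel/spatial attention design and layer sizes are our assumptions.

    import torch
    import torch.nn as nn

    class ASTM(nn.Module):
        """Fuses appearance (spatial) and motion (temporal) feature maps.

        Attention weights are derived from the concatenated streams so that
        object regions, rather than static background pixels, dominate the
        fused representation.
        """
        def __init__(self, channels):
            super().__init__()
            # Channel attention over the concatenated streams.
            self.channel_att = nn.Sequential(
                nn.AdaptiveAvgPool2d(1),
                nn.Conv2d(2 * channels, channels, kernel_size=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, 2 * channels, kernel_size=1),
                nn.Sigmoid(),
            )
            # Spatial attention highlighting object regions.
            self.spatial_att = nn.Sequential(
                nn.Conv2d(2 * channels, 1, kernel_size=7, padding=3),
                nn.Sigmoid(),
            )
            self.project = nn.Conv2d(2 * channels, channels, kernel_size=1)

        def forward(self, appearance, motion):
            x = torch.cat([appearance, motion], dim=1)  # (B, 2C, H, W)
            x = x * self.channel_att(x)                 # reweight channels
            x = x * self.spatial_att(x)                 # focus on objects
            return self.project(x)                      # (B, C, H, W)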
Furthermore, we insert multiple ASTM-based connections between the appearance
and motion pathways of a dual-stream network to facilitate spatial-temporal
interaction at all possible scales. Finally, the regular information learned at
multiple scales is recorded in memory to enhance the differentiation between
anomalous and normal events during the testing phase.
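For intuition, a minimal sketch of a memory module that records normal patterns is given below; the addressing scheme (softmax over similarities to learnable memory items) follows common memory-augmented VAD designs and is an assumption, not necessarily the paper's exact formulation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class MemoryModule(nn.Module):
        """Records prototypical normal patterns; queries are re-expressed as
        combinations of memory items, so anomalous features reconstruct
        poorly at test time."""
        def __init__(self, num_items, dim):
            super().__init__()
            self.memory = nn.Parameter(torch.randn(num_items, dim))

        def forward(self, query):                        # query: (B, C, H, W)
            b, c, h, w = query.shape
            q = query.permute(0, 2, 3, 1).reshape(-1, c) # (B*H*W, C)
            attn = F.softmax(q @ self.memory.t(), dim=1) # address memory items
            out = attn @ self.memory                     # read normal patterns
            return out.reshape(b, h, w, c).permute(0, 3, 1, 2)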
Experimental results on three standard datasets validate the effectiveness of
our approach, which achieves AUCs of 96.8% on UCSD Ped2, 87.6% on CUHK Avenue,
and 73.9% on the ShanghaiTech dataset.