Exploiting Spatial-temporal Correlations for Video Anomaly Detection
Video anomaly detection (VAD) remains a challenging task in the pattern
recognition community due to the ambiguity and diversity of abnormal events.
Existing deep learning-based VAD methods usually leverage proxy tasks to learn
the normal patterns and discriminate the instances that deviate from such
patterns as abnormal. However, most of them do not take full advantage of
spatial-temporal correlations among video frames, which is critical for
understanding normal patterns. In this paper, we address unsupervised VAD by
learning the long- and short-term evolution regularities of appearance and
motion, exploiting the spatial-temporal correlations among consecutive
frames in normal videos more fully. Specifically, we propose to use
the spatiotemporal long short-term memory (ST-LSTM) to extract and memorize
spatial appearances and temporal variations in a unified memory cell. In
addition, inspired by the generative adversarial network, we introduce a
discriminator to perform adversarial learning with the ST-LSTM to enhance the
learning capability. Experimental results on standard benchmarks demonstrate
the effectiveness of spatial-temporal correlations for unsupervised VAD. Our
method achieves competitive performance compared to the state-of-the-art
methods with AUCs of 96.7%, 87.8%, and 73.1% on the UCSD Ped2, CUHK Avenue, and
ShanghaiTech datasets, respectively.
Comment: This paper is accepted at the IEEE 26th International Conference on Pattern Recognition (ICPR) 202
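The unified memory cell described above can be illustrated in miniature: alongside the usual temporal cell state, a second, spatiotemporal memory carries appearance information, and an output gate fuses both into the hidden state. This is a minimal NumPy sketch of the idea, not the authors' implementation; all gate names, weight shapes, and the fusion scheme are our assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def st_lstm_step(x, h, c, m, W):
    """One step of an ST-LSTM-style cell (simplified sketch).

    Alongside the usual temporal cell state `c`, a spatiotemporal
    memory `m` carries spatial appearance information; both memories
    are gated and fused into the new hidden state `h`.
    """
    z = np.concatenate([x, h])
    # temporal pathway: standard LSTM input/forget/candidate gates
    c_new = sigmoid(W["f"] @ z) * c + sigmoid(W["i"] @ z) * np.tanh(W["g"] @ z)
    # spatiotemporal memory pathway: parallel gates driven by x and m
    zm = np.concatenate([x, m])
    m_new = sigmoid(W["fm"] @ zm) * m + sigmoid(W["im"] @ zm) * np.tanh(W["gm"] @ zm)
    # output gate reads both memories and fuses them into h
    o = sigmoid(W["o"] @ np.concatenate([x, h, c_new, m_new]))
    h_new = o * np.tanh(W["fuse"] @ np.concatenate([c_new, m_new]))
    return h_new, c_new, m_new
```

In the paper this cell sits inside a predictive network whose outputs are additionally judged by an adversarial discriminator; the sketch shows only the unified-memory update itself.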
Generalized Video Anomaly Event Detection: Systematic Taxonomy and Comparison of Deep Models
Video Anomaly Detection (VAD) serves as a pivotal technology in intelligent
surveillance systems, enabling the temporal or spatial
identification of anomalous events within videos. While existing reviews
predominantly concentrate on conventional unsupervised methods, they often
overlook the emergence of weakly-supervised and fully-unsupervised approaches.
To address this gap, this survey extends the conventional scope of VAD beyond
unsupervised methods, encompassing a broader spectrum termed Generalized Video
Anomaly Event Detection (GVAED). By skillfully incorporating recent
advancements rooted in diverse assumptions and learning frameworks, this survey
introduces an intuitive taxonomy that seamlessly navigates through
unsupervised, weakly-supervised, supervised and fully-unsupervised VAD
methodologies, elucidating the distinctions and interconnections within these
research trajectories. In addition, this survey facilitates prospective
researchers by assembling a compilation of research resources, including public
datasets, available codebases, programming tools, and pertinent literature.
Furthermore, this survey quantitatively assesses model performance, delves into
research challenges and directions, and outlines potential avenues for future
exploration.
Comment: Accepted by ACM Computing Surveys. For more information, please see
our project page: https://github.com/fudanyliu/GVAE
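The four trajectories in the GVAED taxonomy differ mainly in the label information assumed available at training time. A minimal sketch of that distinction as a lookup table (descriptions paraphrase the usual framing of these settings, not the survey's text):

```python
# Each GVAED trajectory keyed by its training-time label assumption.
GVAED_TAXONOMY = {
    "unsupervised": "train on normal videos only; no frame-level labels",
    "weakly-supervised": "video-level normal/abnormal tags; no frame-level labels",
    "supervised": "frame- or pixel-level anomaly annotations",
    "fully-unsupervised": "unlabeled training data that may itself contain anomalies",
}

def assumed_labels(trajectory):
    """Return the label assumption for a given trajectory."""
    return GVAED_TAXONOMY[trajectory]
```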
Cloze Test Helps: Effective Video Anomaly Detection via Learning to Complete Video Events
As a vital topic in media content interpretation, video anomaly detection
(VAD) has made fruitful progress via deep neural network (DNN). However,
existing methods usually follow a reconstruction or frame prediction routine.
They suffer from two gaps: (1) They cannot localize video activities in a manner
that is both precise and comprehensive. (2) They lack sufficient ability to exploit
high-level semantics and temporal context information. Inspired by
the cloze test frequently used in language study, we propose a brand-new VAD
solution named Video Event Completion (VEC) to bridge these gaps: First, we
propose a novel pipeline to achieve both precise and comprehensive enclosure of
video activities. Appearance and motion are exploited as mutually complementary
cues to localize regions of interest (RoIs). A normalized spatio-temporal cube
(STC) is built from each RoI as a video event, which lays the foundation of VEC
and serves as a basic processing unit. Second, we encourage DNN to capture
high-level semantics by solving a visual cloze test. To build such a visual
cloze test, a certain patch of STC is erased to yield an incomplete event (IE).
The DNN learns to restore the original video event from the IE by inferring the
missing patch. Third, to incorporate richer motion dynamics, another DNN is
trained to infer erased patches' optical flow. Finally, two ensemble strategies
using different types of IE and modalities are proposed to boost VAD
performance, so as to fully exploit the temporal context and modality
information for VAD. VEC can consistently outperform state-of-the-art methods
by a notable margin (typically 1.5%-5% AUROC) on commonly-used VAD benchmarks.
Our codes and results can be verified at github.com/yuguangnudt/VEC_VAD.
Comment: To be published as an oral paper in Proceedings of the 28th ACM
International Conference on Multimedia (ACM MM '20). 9 pages, 7 figures
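The event-completion pipeline above can be sketched in a few lines: build a spatio-temporal cube (STC) from an RoI, erase one patch to form an incomplete event (IE), and score anomalies by how badly the erased patch is restored. A minimal NumPy illustration assuming grayscale frames and a precomputed RoI; function names are ours, not the paper's, and the restoration error stands in for the full ensemble score.

```python
import numpy as np

def build_stc(frames, roi):
    """Stack a normalized spatio-temporal cube (STC) from an RoI.

    `frames` is a (T, H, W) grayscale array; `roi` = (y0, y1, x0, x1).
    Patches are taken by simple cropping here for brevity (the paper
    normalizes each patch to a fixed size).
    """
    y0, y1, x0, x1 = roi
    return frames[:, y0:y1, x0:x1].astype(np.float64) / 255.0

def make_cloze_sample(stc, erase_t):
    """Erase one temporal patch to form an incomplete event (IE)."""
    ie = stc.copy()
    target = stc[erase_t].copy()
    ie[erase_t] = 0.0  # the erased patch the network must restore
    return ie, target

def completion_error(pred, target):
    """Anomaly cue: restoration error of the erased patch."""
    return float(np.mean((pred - target) ** 2))
```

In the full method a DNN infers the missing patch (and a second DNN its optical flow); here a perfect prediction simply yields zero error while any deviation raises the score.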
Multitarget Tracking in Nonoverlapping Cameras Using a Reference Set
Tracking multiple targets across nonoverlapping cameras is challenging, since observations of the same target are often separated in time and space. A target's appearance may change significantly across camera views owing to variations in illumination conditions, poses, and camera imaging characteristics; consequently, the same target may look very different in two cameras. Associating tracks across camera views directly by appearance similarity is therefore difficult and prone to error. In most previous methods, appearance similarity is computed either from color histograms or from a pretrained brightness transfer function that maps colors between cameras. In this paper, a novel reference-set-based appearance model is proposed to improve multitarget tracking in a network of nonoverlapping cameras. Contrary to previous work, a reference set is constructed for each pair of cameras, containing subjects that appear in both camera views. For track association, instead of comparing the appearance of two targets in different camera views directly, they are compared indirectly via the reference set. Besides global color histograms, texture and shape features are extracted at different locations on a target, and AdaBoost is used to learn the discriminative power of each feature. Thorough experiments on two challenging real-world multicamera video datasets demonstrate the effectiveness of the proposed method over the state of the art.
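The indirect comparison can be sketched as follows: each target is described by its similarity to the reference subjects seen in its own camera, and two tracks are matched by correlating those similarity vectors, which sidesteps the direct cross-camera appearance gap. A minimal NumPy sketch, with plain cosine similarity standing in for the paper's AdaBoost-weighted color, texture, and shape features.

```python
import numpy as np

def ref_signature(feat, ref_feats):
    """Describe a target by its similarity to every reference subject.

    `feat` is the target's appearance feature in one camera view;
    `ref_feats` (n_refs, d) holds the reference subjects' features
    from that same camera.
    """
    sims = ref_feats @ feat
    return sims / (np.linalg.norm(ref_feats, axis=1) * np.linalg.norm(feat) + 1e-12)

def indirect_similarity(feat_a, refs_a, feat_b, refs_b):
    """Compare two tracks via the reference set instead of directly:
    targets that resemble the same reference subjects in their own
    cameras are likely the same person."""
    sa = ref_signature(feat_a, refs_a)
    sb = ref_signature(feat_b, refs_b)
    # correlate the two similarity vectors
    sa, sb = sa - sa.mean(), sb - sb.mean()
    return float(sa @ sb / (np.linalg.norm(sa) * np.linalg.norm(sb) + 1e-12))
```

Because only within-camera similarities are ever computed, a global appearance change between cameras (e.g. a brightness shift) leaves the signatures, and hence the match score, largely intact.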
Multi-scale Spatial-temporal Interaction Network for Video Anomaly Detection
Video anomaly detection (VAD) is an essential yet challenging task in signal
processing. Since certain anomalies cannot be detected by analyzing temporal or
spatial information alone, the interaction between two types of information is
considered crucial for VAD. However, current dual-stream architectures either
limit interaction between the two types of information to the bottleneck of
autoencoder or incorporate background pixels irrelevant to anomalies into the
interaction. To this end, we propose a multi-scale spatial-temporal interaction
network (MSTI-Net) for VAD. First, to pay particular attention to objects and
reconcile the significant semantic differences between the two types of information, we
propose an attention-based spatial-temporal fusion module (ASTM) as a
substitute for conventional direct fusion. Furthermore, we inject multiple
ASTM-based connections between the appearance and motion pathways of a
dual-stream network to facilitate spatial-temporal interaction at all possible
scales. Finally, the regular information learned from multiple scales is
recorded in memory to enhance the differentiation between anomalies and normal
events during the testing phase. Solid experimental results on three standard
datasets validate the effectiveness of our approach, which achieves AUCs of
96.8% on UCSD Ped2, 87.6% on CUHK Avenue, and 73.9% on the ShanghaiTech
dataset.
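The attention-based fusion idea, in miniature: a motion-derived spatial attention map re-weights appearance features so that moving objects, rather than background pixels, dominate the fused representation. This is our own single-scale NumPy stand-in for the ASTM module, not the paper's architecture; the attention scheme and residual motion pathway are assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_fusion(appearance, motion):
    """Fuse appearance and motion feature maps with spatial attention.

    Both inputs are (C, H, W). Motion energy at each location is
    normalized into an attention map that re-weights the appearance
    features, keeping background pixels from dominating the fusion.
    """
    C, H, W = appearance.shape
    energy = (motion ** 2).sum(axis=0).reshape(-1)    # (H*W,) motion energy
    attn = softmax(energy).reshape(1, H, W)           # spatial attention map
    # attended appearance (rescaled so uniform attention is a no-op),
    # plus a residual motion pathway
    return appearance * attn * (H * W) + motion
```

In MSTI-Net such fusion happens at multiple scales along the dual-stream network, with the resulting regular patterns recorded in memory for test-time discrimination; the sketch shows one fusion step only.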