Exploiting Spatial-temporal Correlations for Video Anomaly Detection
Video anomaly detection (VAD) remains a challenging task in the pattern
recognition community due to the ambiguity and diversity of abnormal events.
Existing deep learning-based VAD methods usually leverage proxy tasks to learn
the normal patterns and discriminate the instances that deviate from such
patterns as abnormal. However, most of them do not take full advantage of
spatial-temporal correlations among video frames, which is critical for
understanding normal patterns. In this paper, we address unsupervised VAD by
learning the long- and short-term evolution regularities of appearance and
motion, exploiting the spatial-temporal correlations among consecutive
frames in normal videos more fully. Specifically, we propose to use
the spatiotemporal long short-term memory (ST-LSTM) to extract and memorize
spatial appearances and temporal variations in a unified memory cell. In
addition, inspired by the generative adversarial network, we introduce a
discriminator to perform adversarial learning with the ST-LSTM to enhance the
learning capability. Experimental results on standard benchmarks demonstrate
the effectiveness of spatial-temporal correlations for unsupervised VAD. Our
method achieves competitive performance compared to the state-of-the-art
methods with AUCs of 96.7%, 87.8%, and 73.1% on the UCSD Ped2, CUHK Avenue, and
ShanghaiTech datasets, respectively.
Comment: This paper is accepted at the IEEE 26th International Conference on Pattern Recognition (ICPR) 202
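The unified memory cell described above can be illustrated in miniature: alongside the usual temporal cell state, a second, spatiotemporal memory carries appearance information, and an output gate fuses both into the hidden state. This is a minimal NumPy sketch of the idea, not the authors' implementation; all gate names, weight shapes, and the fusion scheme are our assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def st_lstm_step(x, h, c, m, W):
    """One step of an ST-LSTM-style cell (simplified sketch).

    Alongside the usual temporal cell state `c`, a spatiotemporal
    memory `m` carries spatial appearance information; both memories
    are gated and fused into the new hidden state `h`.
    """
    z = np.concatenate([x, h])
    # temporal pathway: standard LSTM input/forget/candidate gates
    c_new = sigmoid(W["f"] @ z) * c + sigmoid(W["i"] @ z) * np.tanh(W["g"] @ z)
    # spatiotemporal memory pathway: parallel gates driven by x and m
    zm = np.concatenate([x, m])
    m_new = sigmoid(W["fm"] @ zm) * m + sigmoid(W["im"] @ zm) * np.tanh(W["gm"] @ zm)
    # output gate reads both memories and fuses them into h
    o = sigmoid(W["o"] @ np.concatenate([x, h, c_new, m_new]))
    h_new = o * np.tanh(W["fuse"] @ np.concatenate([c_new, m_new]))
    return h_new, c_new, m_new
```

In the paper this cell sits inside a predictive network whose outputs are additionally judged by an adversarial discriminator; the sketch shows only the unified-memory update itself.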
Generalized Video Anomaly Event Detection: Systematic Taxonomy and Comparison of Deep Models
Video Anomaly Detection (VAD) serves as a pivotal technology in intelligent
surveillance systems, enabling the temporal or spatial
identification of anomalous events within videos. While existing reviews
predominantly concentrate on conventional unsupervised methods, they often
overlook the emergence of weakly-supervised and fully-unsupervised approaches.
To address this gap, this survey extends the conventional scope of VAD beyond
unsupervised methods, encompassing a broader spectrum termed Generalized Video
Anomaly Event Detection (GVAED). By skillfully incorporating recent
advancements rooted in diverse assumptions and learning frameworks, this survey
introduces an intuitive taxonomy that seamlessly navigates through
unsupervised, weakly-supervised, supervised and fully-unsupervised VAD
methodologies, elucidating the distinctions and interconnections within these
research trajectories. In addition, this survey facilitates prospective
researchers by assembling a compilation of research resources, including public
datasets, available codebases, programming tools, and pertinent literature.
Furthermore, this survey quantitatively assesses model performance, delves into
research challenges and directions, and outlines potential avenues for future
exploration.
Comment: Accepted by ACM Computing Surveys. For more information, please see
our project page: https://github.com/fudanyliu/GVAE
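The four trajectories in the GVAED taxonomy differ mainly in the label information assumed available at training time. A minimal sketch of that distinction as a lookup table (descriptions paraphrase the usual framing of these settings, not the survey's text):

```python
# Each GVAED trajectory keyed by its training-time label assumption.
GVAED_TAXONOMY = {
    "unsupervised": "train on normal videos only; no frame-level labels",
    "weakly-supervised": "video-level normal/abnormal tags; no frame-level labels",
    "supervised": "frame- or pixel-level anomaly annotations",
    "fully-unsupervised": "unlabeled training data that may itself contain anomalies",
}

def assumed_labels(trajectory):
    """Return the label assumption for a given trajectory."""
    return GVAED_TAXONOMY[trajectory]
```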
Cloze Test Helps: Effective Video Anomaly Detection via Learning to Complete Video Events
As a vital topic in media content interpretation, video anomaly detection
(VAD) has made fruitful progress via deep neural network (DNN). However,
existing methods usually follow a reconstruction or frame prediction routine.
They suffer from two gaps: (1) They cannot localize video activities in a manner
that is both precise and comprehensive. (2) They lack sufficient ability to exploit
high-level semantics and temporal context information. Inspired by
the cloze test frequently used in language study, we propose a brand-new VAD
solution named Video Event Completion (VEC) to bridge these gaps: First, we
propose a novel pipeline to achieve both precise and comprehensive enclosure of
video activities. Appearance and motion are exploited as mutually complementary
cues to localize regions of interest (RoIs). A normalized spatio-temporal cube
(STC) is built from each RoI as a video event, which lays the foundation of VEC
and serves as a basic processing unit. Second, we encourage DNN to capture
high-level semantics by solving a visual cloze test. To build such a visual
cloze test, a certain patch of STC is erased to yield an incomplete event (IE).
The DNN learns to restore the original video event from the IE by inferring the
missing patch. Third, to incorporate richer motion dynamics, another DNN is
trained to infer erased patches' optical flow. Finally, two ensemble strategies
using different types of IE and modalities are proposed to boost VAD
performance, so as to fully exploit the temporal context and modality
information for VAD. VEC can consistently outperform state-of-the-art methods
by a notable margin (typically 1.5%-5% AUROC) on commonly-used VAD benchmarks.
Our codes and results can be verified at github.com/yuguangnudt/VEC_VAD.
Comment: To be published as an oral paper in Proceedings of the 28th ACM
International Conference on Multimedia (ACM MM '20). 9 pages, 7 figures
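The event-completion pipeline above can be sketched in a few lines: build a spatio-temporal cube (STC) from an RoI, erase one patch to form an incomplete event (IE), and score anomalies by how badly the erased patch is restored. A minimal NumPy illustration assuming grayscale frames and a precomputed RoI; function names are ours, not the paper's, and the restoration error stands in for the full ensemble score.

```python
import numpy as np

def build_stc(frames, roi):
    """Stack a normalized spatio-temporal cube (STC) from an RoI.

    `frames` is a (T, H, W) grayscale array; `roi` = (y0, y1, x0, x1).
    Patches are taken by simple cropping here for brevity (the paper
    normalizes each patch to a fixed size).
    """
    y0, y1, x0, x1 = roi
    return frames[:, y0:y1, x0:x1].astype(np.float64) / 255.0

def make_cloze_sample(stc, erase_t):
    """Erase one temporal patch to form an incomplete event (IE)."""
    ie = stc.copy()
    target = stc[erase_t].copy()
    ie[erase_t] = 0.0  # the erased patch the network must restore
    return ie, target

def completion_error(pred, target):
    """Anomaly cue: restoration error of the erased patch."""
    return float(np.mean((pred - target) ** 2))
```

In the full method a DNN infers the missing patch (and a second DNN its optical flow); here a perfect prediction simply yields zero error while any deviation raises the score.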
Multitarget Tracking in Nonoverlapping Cameras Using a Reference Set
Tracking multiple targets across nonoverlapping cameras is challenging, since observations of the same target are often separated in time and space. A target's appearance may change significantly across camera views owing to variations in illumination conditions, poses, and camera imaging characteristics; consequently, the same target may look very different in two cameras. Associating tracks across camera views directly by appearance similarity is therefore difficult and prone to error. In most previous methods, appearance similarity is computed either from color histograms or from a pretrained brightness transfer function that maps colors between cameras. In this paper, a novel reference-set-based appearance model is proposed to improve multitarget tracking in a network of nonoverlapping cameras. Contrary to previous work, a reference set is constructed for each pair of cameras, containing subjects that appear in both camera views. For track association, instead of comparing the appearance of two targets in different camera views directly, they are compared indirectly via the reference set. Besides global color histograms, texture and shape features are extracted at different locations on a target, and AdaBoost is used to learn the discriminative power of each feature. Thorough experiments on two challenging real-world multicamera video datasets demonstrate the effectiveness of the proposed method over the state of the art.
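The indirect comparison can be sketched as follows: each target is described by its similarity to the reference subjects seen in its own camera, and two tracks are matched by correlating those similarity vectors, which sidesteps the direct cross-camera appearance gap. A minimal NumPy sketch, with plain cosine similarity standing in for the paper's AdaBoost-weighted color, texture, and shape features.

```python
import numpy as np

def ref_signature(feat, ref_feats):
    """Describe a target by its similarity to every reference subject.

    `feat` is the target's appearance feature in one camera view;
    `ref_feats` (n_refs, d) holds the reference subjects' features
    from that same camera.
    """
    sims = ref_feats @ feat
    return sims / (np.linalg.norm(ref_feats, axis=1) * np.linalg.norm(feat) + 1e-12)

def indirect_similarity(feat_a, refs_a, feat_b, refs_b):
    """Compare two tracks via the reference set instead of directly:
    targets that resemble the same reference subjects in their own
    cameras are likely the same person."""
    sa = ref_signature(feat_a, refs_a)
    sb = ref_signature(feat_b, refs_b)
    # correlate the two similarity vectors
    sa, sb = sa - sa.mean(), sb - sb.mean()
    return float(sa @ sb / (np.linalg.norm(sa) * np.linalg.norm(sb) + 1e-12))
```

Because only within-camera similarities are ever computed, a global appearance change between cameras (e.g. a brightness shift) leaves the signatures, and hence the match score, largely intact.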
Multi-scale Spatial-temporal Interaction Network for Video Anomaly Detection
Video anomaly detection (VAD) is an essential yet challenging task in signal
processing. Since certain anomalies cannot be detected by analyzing temporal or
spatial information alone, the interaction between two types of information is
considered crucial for VAD. However, current dual-stream architectures either
limit interaction between the two types of information to the bottleneck of
autoencoder or incorporate background pixels irrelevant to anomalies into the
interaction. To this end, we propose a multi-scale spatial-temporal interaction
network (MSTI-Net) for VAD. First, to pay particular attention to objects and
reconcile the significant semantic differences between the two types of information, we
propose an attention-based spatial-temporal fusion module (ASTM) as a
substitute for conventional direct fusion. Furthermore, we inject multiple
ASTM-based connections between the appearance and motion pathways of a
dual-stream network to facilitate spatial-temporal interaction at all possible
scales. Finally, the regular information learned from multiple scales is
recorded in memory to enhance the differentiation between anomalies and normal
events during the testing phase. Solid experimental results on three standard
datasets validate the effectiveness of our approach, which achieves AUCs of
96.8% on UCSD Ped2, 87.6% on CUHK Avenue, and 73.9% on the ShanghaiTech
dataset.
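The attention-based fusion idea, in miniature: a motion-derived spatial attention map re-weights appearance features so that moving objects, rather than background pixels, dominate the fused representation. This is our own single-scale NumPy stand-in for the ASTM module, not the paper's architecture; the attention scheme and residual motion pathway are assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def attention_fusion(appearance, motion):
    """Fuse appearance and motion feature maps with spatial attention.

    Both inputs are (C, H, W). Motion energy at each location is
    normalized into an attention map that re-weights the appearance
    features, keeping background pixels from dominating the fusion.
    """
    C, H, W = appearance.shape
    energy = (motion ** 2).sum(axis=0).reshape(-1)    # (H*W,) motion energy
    attn = softmax(energy).reshape(1, H, W)           # spatial attention map
    # attended appearance (rescaled so uniform attention is a no-op),
    # plus a residual motion pathway
    return appearance * attn * (H * W) + motion
```

In MSTI-Net such fusion happens at multiple scales along the dual-stream network, with the resulting regular patterns recorded in memory for test-time discrimination; the sketch shows one fusion step only.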