Weakly and Partially Supervised Learning Frameworks for Anomaly Detection
The automatic detection of abnormal events in surveillance footage remains an open concern for the
research community. Since protection is the primary purpose of installing video-surveillance systems, monitoring footage to keep the public safe, and responding quickly enough to
serve that purpose, is a significant challenge even for humans. Human capacity has not kept pace with the growing use of surveillance systems: identifying unusual
events that could put a person or company at risk demands constant supervision, and because anomalous events are extremely
rare compared to normal ones, a substantial amount of labor and time is wasted. Consequently,
an algorithm for the automatic detection of abnormal events has become crucial in
video surveillance. Although this problem has been the subject of many research works published in the last
decade, state-of-the-art performance is still unsatisfactory and far below what is required
for effective deployment of this kind of technology in fully unconstrained scenarios.
Despite all the research done in this area, the automatic detection of abnormal events remains challenging for many reasons. Environmental diversity, the
resemblance between movements in different actions, and crowded scenes all complicate the task, and enumerating every standard pattern that defines a normal action is undoubtedly
difficult, if not impossible. Beyond these difficulties, the substantive
problem lies in obtaining sufficient amounts of labeled abnormal samples, which is fundamental for computer-vision algorithms. Moreover, collecting an extensive set of distinct videos that satisfy the above conditions is no
simple task: in addition to the effort and time it consumes, the boundary between
normal and abnormal actions is usually unclear.
In this work, the main objective is to provide several solutions to the
problems mentioned above, by analyzing previous state-of-the-art methods
and presenting an extensive overview that clarifies the concepts used to capture normal and abnormal patterns. By exploring different strategies, we develop new approaches that consistently advance state-of-the-art performance. Moreover, we announce the availability of a new large-scale, first-of-its-kind dataset, fully annotated at the frame level for a specific anomaly-detection event with a wide diversity of fighting scenarios, which can be freely used by the research community.
With the goal of requiring minimal supervision, two different proposals
are described in this document. The first method employs the recent technique of self-supervised learning
to avoid the laborious task of annotation: the training set is labeled autonomously
by an iterative learning framework composed of two independent experts that feed
data to each other through a Bayesian framework. The second proposal explores a new
method for learning an anomaly ranking model in the multiple-instance learning paradigm by
leveraging weakly labeled videos, where training labels are given only at the video level.
Experiments were conducted on several well-known datasets, and our solutions solidly outperform the state of the art. Additionally, as a proof-of-concept system, we present the results of real-world simulations collected in different environments as a field
test of our learned models.
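The ranking objective of the second proposal can be sketched as follows. This is a minimal NumPy illustration assuming the common weakly supervised MIL formulation, in which the top-scoring segment of an anomalous video must be ranked above every segment of a normal video; the function name, the margin, and the smoothness and sparsity terms are illustrative assumptions, not necessarily the exact objective used in this work.

```python
import numpy as np

def mil_ranking_loss(pos_bag_scores, neg_bag_scores, margin=1.0,
                     lambda_smooth=8e-5, lambda_sparse=8e-5):
    """Hinge ranking loss between the top-scoring segment of an anomalous
    (positive) bag and of a normal (negative) bag, plus the temporal
    smoothness and sparsity terms commonly added in weakly supervised
    MIL anomaly ranking."""
    pos = np.asarray(pos_bag_scores, dtype=float)
    neg = np.asarray(neg_bag_scores, dtype=float)
    # Ranking term: the most anomalous segment of the anomalous video
    # should score higher than any segment of the normal video.
    rank = max(0.0, margin - pos.max() + neg.max())
    # Smoothness: anomaly scores should vary gradually between segments.
    smooth = lambda_smooth * np.sum(np.diff(pos) ** 2)
    # Sparsity: only a few segments of an anomalous video are anomalous.
    sparse = lambda_sparse * np.sum(pos)
    return rank + smooth + sparse
```

Because only the video-level label is needed, the loss never requires knowing which segment inside the anomalous video is abnormal.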
Contracting Skeletal Kinematic Embeddings for Anomaly Detection
Detecting anomalous human behavior is paramount to the timely recognition of
dangerous situations, such as street fights or elderly falls. However,
anomaly detection is complex, since anomalous events are rare and since it is
an open-set recognition task, i.e., what is anomalous at inference has not been
observed at training. We propose COSKAD, a novel model which encodes skeletal
human motion by an efficient graph convolutional network and learns to COntract
SKeletal kinematic embeddings onto a latent hypersphere of minimum volume for
Anomaly Detection. We propose and analyze three latent space designs for
COSKAD: the commonly-adopted Euclidean, and the new spherical-radial and
hyperbolic volumes. All three variants outperform the state of the art,
including video-based techniques, on ShanghaiTech Campus, Avenue, and on
the most recent UBnormal dataset, for which we contribute novel skeleton
annotations and a selection of human-related videos. The source code and
dataset will be released upon acceptance.
Comment: Submitted to Pattern Recognition Journal
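The contraction objective of COSKAD's Euclidean variant can be sketched as a Deep-SVDD-style one-class loss: embeddings of normal skeletal motion are pulled toward a fixed center so they occupy a hypersphere of minimum volume, and the distance to the center serves as the anomaly score at inference. The graph-convolutional encoder and the spherical-radial and hyperbolic variants are omitted here; all names are illustrative, not the paper's actual API.

```python
import numpy as np

def hypersphere_contraction_loss(embeddings, center):
    """Mean squared Euclidean distance of latent vectors to a fixed center:
    minimizing it contracts normal embeddings onto a hypersphere of
    minimum volume around the center."""
    z = np.asarray(embeddings, dtype=float)   # (N, D) latent vectors
    c = np.asarray(center, dtype=float)       # (D,) hypersphere center
    return float(np.mean(np.sum((z - c) ** 2, axis=1)))

def anomaly_score(embedding, center):
    """At inference, a test embedding far from the center is anomalous."""
    diff = np.asarray(embedding, dtype=float) - np.asarray(center, dtype=float)
    return float(np.sqrt(np.sum(diff ** 2)))
```

The open-set nature of the task motivates this design: the model only ever learns where normal motion lives, so anything that lands far from the center at test time is flagged without having been seen in training.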
A Hierarchical Spatio-Temporal Graph Convolutional Neural Network for Anomaly Detection in Videos
Deep learning models have been widely used for anomaly detection in
surveillance videos. Typical models are equipped with the capability to
reconstruct normal videos and evaluate the reconstruction errors on anomalous
videos to indicate the extent of abnormalities. However, existing approaches
suffer from two disadvantages. First, they encode the movements of each
identity independently, ignoring the interactions among identities, which may
also indicate anomalies. Second, they rely on inflexible models whose
structures are fixed across different scenes, which prevents any understanding
of the scene. In this paper, we propose a
Hierarchical Spatio-Temporal Graph Convolutional Neural Network (HSTGCNN) to
address these problems, the HSTGCNN is composed of multiple branches that
correspond to different levels of graph representations. High-level graph
representations encode the trajectories of people and the interactions among
multiple identities while low-level graph representations encode the local body
postures of each person. Furthermore, we combine the branches with weights
that reflect which branch suits each scene, improving over any single-level
graph representation and giving the model an understanding of the scene that
serves anomaly detection: high-level graph representations receive higher
weights to encode the moving speed and direction of people in low-resolution
videos, while low-level graph representations receive higher weights to encode
human skeletons in high-resolution videos. Experimental results show that the
proposed HSTGCNN significantly outperforms current state-of-the-art models on
four benchmark datasets (UCSD Pedestrian, ShanghaiTech, CUHK Avenue and
IITB-Corridor) while using far fewer learnable parameters.
Comment: Accepted to IEEE Transactions on Circuits and Systems for Video Technology (T-CSVT)
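The scene-adaptive weighting described above can be sketched as follows. The sigmoid weighting on apparent person size, the threshold value, and the function name are assumptions for illustration, not the paper's actual scheme.

```python
import numpy as np

def combine_branch_scores(high_level_score, low_level_score,
                          person_pixel_height, resolution_threshold=80.0):
    """Blend the two branches' anomaly scores by scene resolution: when
    people appear small (low-resolution scenes) the high-level branch
    (trajectories, interactions) dominates; when people appear large, the
    low-level branch (body posture) dominates."""
    # Weight for the low-level branch grows with apparent person size.
    w_low = 1.0 / (1.0 + np.exp(-(person_pixel_height - resolution_threshold) / 10.0))
    w_high = 1.0 - w_low
    return w_high * high_level_score + w_low * low_level_score
```

In the paper the weights are learned per scene rather than set by a hand-picked threshold; the sketch only shows why a weighted combination lets each branch contribute where it is most reliable.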
Discovery and recognition of motion primitives in human activities
We present a novel framework for the automatic discovery and recognition of
motion primitives in videos of human activities. Given the 3D pose of a human
in a video, human motion primitives are discovered by optimizing the `motion
flux', a quantity which captures the motion variation of a group of skeletal
joints. A normalization of the primitives is proposed to make them invariant
with respect to a subject's anatomical variations and the data sampling rate.
The discovered primitives are unknown and unlabeled, and are grouped into
classes, without supervision, via a hierarchical non-parametric Bayesian
mixture model. Once classes are determined and labeled, they are further
analyzed to establish models for recognizing the discovered primitives. Each
primitive model is defined by a set of learned parameters.
Given new video data and the estimated pose of the subject appearing in the
video, the motion is segmented into primitives, which are recognized with a
probability computed from the parameters of the learned models.
Using our framework, we build a publicly available dataset of human motion
primitives from sequences taken from well-known motion-capture datasets. We
expect that our framework, by providing an objective way to discover and
categorize human motion, will be a useful tool in numerous research fields,
including video analysis, human-inspired motion generation, learning by
demonstration, intuitive human-robot interaction, and human behavior analysis.
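A rough proxy for the 'motion flux' of a group of joints might look like the sketch below. The paper's precise definition and its normalization scheme are not reproduced here, so treat both functions, and their names, as illustrative assumptions only.

```python
import numpy as np

def motion_flux(joint_positions, fps=30.0):
    """Aggregate motion variation of a group of skeletal joints, computed
    here as the summed per-joint speed over a window.

    joint_positions: array of shape (T, J, 3) -- T frames, J joints, 3D."""
    p = np.asarray(joint_positions, dtype=float)
    velocities = np.diff(p, axis=0) * fps          # (T-1, J, 3) displacement/s
    speeds = np.linalg.norm(velocities, axis=2)    # (T-1, J) per-joint speed
    return float(speeds.sum())

def normalize_by_anatomy(flux, limb_length_sum, frame_count, fps=30.0):
    """Hypothetical normalization making the flux comparable across subjects
    of different size and sequences of different length and sampling rate,
    in the spirit of the invariance the paper proposes."""
    duration = frame_count / fps
    return flux / (limb_length_sum * duration)
```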
A Study on Verification of CCTV Image Data through Unsupervised Learning Model of Deep Learning
Abnormal behavior is behavior that deviates from the normal standard, i.e., from the average. The installation of public CCTV to prevent crime is increasing, yet the crime rate has recently risen as well. In line with this situation, research on artificial intelligence that uses deep learning to automatically find abnormal behavior in CCTV footage is growing. Deep learning is a type of artificial intelligence designed around artificial neural networks, and the quality of the training data is critical to achieving high accuracy. This paper verifies whether a dataset under construction for abnormal-behavior detection is suitable as training data, using an autoencoder-based MPED-RNN model for per-frame binary classification on human skeleton data. The experiments show that the unsupervised-learning-based MPED-RNN model used in this paper is not suitable for verifying videos in which frames with and without abnormal behavior occur in similar numbers, as in the corresponding data, and we judge that appropriate results can be derived only when verification uses a supervised-learning-based model.
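The per-frame decision rule in such autoencoder-based verification can be sketched as thresholding the reconstruction error. The mean-squared-error choice, the fixed threshold, and the function name are simplifying assumptions, not details from the study.

```python
import numpy as np

def frame_anomaly_labels(originals, reconstructions, threshold):
    """Binary per-frame decision from an autoencoder: frames whose skeleton
    reconstruction error exceeds the threshold are flagged abnormal (1),
    the rest normal (0). Returns (labels, per-frame errors)."""
    x = np.asarray(originals, dtype=float)
    r = np.asarray(reconstructions, dtype=float)
    # Mean squared error per frame, averaged over all non-frame axes.
    errors = np.mean((x - r) ** 2, axis=tuple(range(1, x.ndim)))
    return (errors > threshold).astype(int), errors
```

The study's negative finding fits this picture: when abnormal frames are roughly as frequent as normal ones, the autoencoder learns to reconstruct both, so the error gap the threshold relies on collapses.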
Modeling Interacting Time-Series Signals
Many real-life systems consist of multiple information signals that may interact with each other over time. These interactions can be estimated and modeled with techniques such as Pearson correlation (PC), time-lagged cross-correlation (TLCC) and windowed TLCC, dynamic time warping (DTW), and the coupled hidden Markov model (cHMM). Excluding the cHMM, these techniques cannot capture non-linear interactions and do not work well with multivariate data. Although the cHMM can capture interactions effectively, it is bound by the Markov property and by assumptions about latent variables, prior distributions, etc., which significantly influence its performance. The recurrent neural network (RNN) is a variant of neural networks that can model time-series data, and RNN-based architectures are the new state of the art for complex tasks such as machine translation. In this research, we explore techniques for extending RNNs to model interacting time-series signals. We propose architectures with coupling and attention mechanisms, evaluate them on synthetically generated and real-life datasets, and compare their performance to similar architectures in the literature. The goal of this exercise is to determine the most effective architecture for capturing interaction information in the given interrelated time-series signals.
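The windowed TLCC baseline mentioned above can be sketched in a few lines; the helper names `lagged_cross_correlation` and `windowed_tlcc` are illustrative.

```python
import numpy as np

def lagged_cross_correlation(x, y, lag):
    """Pearson correlation between x and a lag-shifted copy of y
    (positive lag pairs x[t + lag] with y[t], i.e. y leads x)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if lag > 0:
        x, y = x[lag:], y[:-lag]
    elif lag < 0:
        x, y = x[:lag], y[-lag:]
    return float(np.corrcoef(x, y)[0, 1])

def windowed_tlcc(x, y, lag, window, step):
    """Windowed TLCC: the lagged correlation computed over sliding windows,
    showing how the interaction strength evolves over time."""
    return [lagged_cross_correlation(x[s:s + window], y[s:s + window], lag)
            for s in range(0, len(x) - window + 1, step)]
```

As the abstract notes, a correlation of this kind only captures linear, pairwise dependence; the proposed coupled and attention-based RNNs are meant to go beyond exactly this limitation.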