1,837 research outputs found
Large-Scale Mapping of Human Activity using Geo-Tagged Videos
This paper is the first work to perform spatio-temporal mapping of human
activity using the visual content of geo-tagged videos. We utilize a recent
deep-learning based video analysis framework, termed hidden two-stream
networks, to recognize a range of activities in YouTube videos. This framework
is efficient and can run in real time or faster, which is important for
recognizing events as they occur in streaming video or for reducing latency in
analyzing already captured video. This is, in turn, important for using video
in smart-city applications. We perform a series of experiments to show our
approach is able to accurately map activities both spatially and temporally. We
also demonstrate the advantages of using the visual content over the
tags/titles.
Comment: Accepted at ACM SIGSPATIAL 201
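The spatio-temporal mapping described in this abstract can be illustrated with a minimal sketch. It assumes a video classifier has already produced one activity label per geo-tagged video; the function name `map_activities`, the tuple layout, and the one-degree grid are hypothetical choices for illustration and are not taken from the paper.

```python
from collections import Counter
from datetime import datetime, timezone


def map_activities(predictions, cell_deg=1.0):
    """Count predicted activities per (grid cell, UTC hour, label).

    predictions: iterable of (lat, lon, unix_timestamp, label) tuples,
    where label is assumed to come from an upstream video classifier.
    """
    counts = Counter()
    for lat, lon, timestamp, label in predictions:
        # Spatial bin: floor the coordinates onto a regular lat/lon grid.
        cell = (int(lat // cell_deg), int(lon // cell_deg))
        # Temporal bin: hour of day in UTC.
        hour = datetime.fromtimestamp(timestamp, tz=timezone.utc).hour
        counts[(cell, hour, label)] += 1
    return counts


# Hypothetical classifier outputs for three geo-tagged videos.
predictions = [
    (37.77, -122.42, 0, "surfing"),
    (37.10, -122.90, 1800, "surfing"),
    (40.71, -74.00, 3600, "basketball"),
]
activity_map = map_activities(predictions)
```

A coarser or finer grid only requires changing `cell_deg`; a real system would likely use a proper geospatial index rather than raw degree cells.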
A fully integrated violence detection system using CNN and LSTM
Recently, the number of violence-related cases in places such as remote roads, pathways, shopping malls, elevators, sports stadiums, and liquor shops has increased drastically, and such incidents are unfortunately often discovered only after it is too late. The aim is to create a complete system that performs real-time video analysis to recognize violent activity and notify the concerned authority, such as the police department of the corresponding area. Using the deep learning networks CNN and LSTM along with a well-defined system architecture, we have achieved an efficient solution for real-time analysis of video footage, so that the concerned authority can monitor the situation through a mobile application that immediately notifies them of a violent event.
Spatio-temporal action localization with Deep Learning
Master's dissertation in Informatics Engineering.
Systems that detect and identify human activities are known as human action recognition systems.
In video-based approaches, human activity is classified into four categories, depending
on the complexity of the steps and the number of body parts involved in the action: gestures,
actions, interactions, and activities. Capturing valuable and discriminative features is
challenging for video-based human action recognition because of the variations of the human
body. Deep learning techniques have provided practical applications in multiple fields
of signal processing, usually surpassing traditional signal processing on a large scale.
Recently, several applications, namely surveillance, human-computer interaction, and
content-based video retrieval, have studied the detection and recognition of violence. In recent
years there has been rapid growth in the production and consumption of a wide variety of
video data due to the popularization of high-quality and relatively low-priced video devices.
Smartphones and digital cameras have contributed greatly to this growth. At the same time,
about 300 hours of video are uploaded to YouTube every minute. Along with the growing
production of video data, new technologies such as video captioning, video question answering,
and video-based activity/event detection are emerging every day. Given input video data,
human activity detection indicates which activity is contained in the video and locates
the regions in the video where the activity occurs.
This dissertation conducted an experiment to identify and detect violence with spatial action localization, adapting a public dataset for this purpose. The idea was to take an annotated
dataset for general action recognition and adapt it solely for violence detection.
Unified Keypoint-based Action Recognition Framework via Structured Keypoint Pooling
This paper simultaneously addresses three limitations associated with
conventional skeleton-based action recognition: skeleton detection and tracking
errors, poor variety of the targeted actions, as well as person-wise and
frame-wise action recognition. A point cloud deep-learning paradigm is
introduced to the action recognition, and a unified framework along with a
novel deep neural network architecture called Structured Keypoint Pooling is
proposed. The proposed method sparsely aggregates keypoint features in a
cascaded manner based on prior knowledge of the data structure (which is
inherent in skeletons), such as the instances and frames to which each keypoint
belongs, and achieves robustness against input errors. Its less constrained and
tracking-free architecture enables time-series keypoints consisting of human
skeletons and nonhuman object contours to be efficiently treated as an input 3D
point cloud and extends the variety of the targeted action. Furthermore, we
propose a Pooling-Switching Trick inspired by Structured Keypoint Pooling. This
trick switches the pooling kernels between the training and inference phases to
detect person-wise and frame-wise actions in a weakly supervised manner using
only video-level action labels. This trick enables our training scheme to
naturally introduce novel data augmentation, which mixes multiple point clouds
extracted from different videos. In the experiments, we comprehensively verify
the effectiveness of the proposed method against the limitations, and the
method outperforms state-of-the-art skeleton-based action recognition and
spatio-temporal action localization methods.
Comment: CVPR 202
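The cascaded aggregation idea in this abstract can be sketched in a toy form. The sketch assumes each keypoint carries a frame index, an instance index, and a feature vector, and it uses plain max pooling at both stages. The names `max_pool` and `cascaded_pool` are hypothetical, and the learned layers that the actual architecture presumably applies between pooling stages are omitted, so this is not the paper's network.

```python
from collections import defaultdict


def max_pool(vectors):
    """Element-wise maximum over a list of equal-length feature vectors."""
    return [max(column) for column in zip(*vectors)]


def cascaded_pool(keypoints):
    """Pool keypoint features in two stages.

    keypoints: list of (frame_id, instance_id, feature_vector) tuples.
    Stage 1 pools keypoints belonging to the same instance in the same frame;
    stage 2 pools the resulting instance-level features into one video-level
    feature. No tracking across frames is required.
    """
    groups = defaultdict(list)
    for frame_id, instance_id, feature in keypoints:
        groups[(frame_id, instance_id)].append(feature)
    per_instance = {key: max_pool(feats) for key, feats in groups.items()}
    return max_pool(list(per_instance.values()))


# Hypothetical 2-D features for four keypoints across two frames.
keypoints = [
    (0, 0, [1.0, 0.0]),
    (0, 0, [0.5, 2.0]),
    (0, 1, [0.0, 3.0]),
    (1, 0, [4.0, 1.0]),
]
video_feature = cascaded_pool(keypoints)
```

Without learned transformations between the stages, two rounds of max pooling reduce to a single global maximum; the grouping only becomes meaningful once each stage feeds a trainable layer.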
Automatic video censoring system using deep learning
Due to the extensive use of video-sharing platforms and services, the amount of all kinds of video content on the web has become massive. This abundance makes it difficult to control the kind of content that may be present in a video. Beyond deciding whether content is suitable for children and sensitive viewers, it is also important to determine which parts of a video contain such content, so that parts that would be discarded in a simple broad analysis can be preserved. To tackle this problem, popular image deep learning models (MobileNetV2, Xception, InceptionV3, VGG16, VGG19, ResNet101, and ResNet50) were compared to find the one most suitable for the application. In addition, a system was developed that automatically censors inappropriate content, such as violent scenes, with the help of deep learning. The system uses transfer learning with the VGG16 model. The experiments suggest that the model performs excellently for automatic censoring and that it could also be used in other similar applications.
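The goal of censoring only the offending parts of a video can be illustrated with a short sketch. It assumes a frame-level classifier (for example, a fine-tuned image model) has already produced one score per frame; the function name `censor_intervals`, the 0.5 threshold, and the example scores are hypothetical and not taken from the paper.

```python
def censor_intervals(frame_scores, fps, threshold=0.5):
    """Turn per-frame scores into (start_sec, end_sec) intervals to censor.

    A frame is flagged when its score is at or above the threshold;
    consecutive flagged frames are merged into one interval.
    """
    intervals = []
    start = None
    for i, score in enumerate(frame_scores):
        flagged = score >= threshold
        if flagged and start is None:
            start = i  # a new flagged run begins
        elif not flagged and start is not None:
            intervals.append((start / fps, i / fps))
            start = None
    if start is not None:
        # The video ends while a flagged run is still open.
        intervals.append((start / fps, len(frame_scores) / fps))
    return intervals


# Hypothetical classifier scores for six frames sampled at 2 frames per second.
scores = [0.1, 0.2, 0.9, 0.8, 0.3, 0.7]
segments = censor_intervals(scores, fps=2.0)
```

Everything outside the returned intervals is left untouched, which is the point of localizing the content rather than rejecting the whole video.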
DyAnNet: A Scene Dynamicity Guided Self-Trained Video Anomaly Detection Network
Unsupervised approaches for video anomaly detection may not perform as well
as supervised approaches. However, learning unknown types of anomalies using an
unsupervised approach is more practical than a supervised approach as
annotation is an extra burden. In this paper, we use isolation tree-based
unsupervised clustering to partition the deep feature space of the video
segments. The RGB stream generates a pseudo anomaly score and the flow stream
generates a pseudo dynamicity score of a video segment. These scores are then
fused using a majority voting scheme to generate preliminary bags of positive
and negative segments. However, these bags may not be accurate as the scores
are generated only using the current segment which does not represent the
global behavior of a typical anomalous event. We then use a refinement strategy
based on a cross-branch feed-forward network designed using a popular I3D
network to refine both scores. The bags are then refined through a segment
re-mapping strategy. The intuition of adding the dynamicity score of a segment
with the anomaly score is to enhance the quality of the evidence. The method
has been evaluated on three popular video anomaly datasets, i.e., UCF-Crime,
CCTV-Fights, and UBI-Fights. Experimental results reveal that the proposed
framework achieves competitive accuracy as compared to the state-of-the-art
video anomaly detection methods.
Comment: 10 pages, 8 figures, and 4 tables. (Accepted at WACV 2023)
A Survey of Deep Learning in Sports Applications: Perception, Comprehension, and Decision
Deep learning has the potential to revolutionize sports performance, with
applications ranging from perception and comprehension to decision. This paper
presents a comprehensive survey of deep learning in sports performance,
focusing on three main aspects: algorithms, datasets and virtual environments,
and challenges. Firstly, we discuss the hierarchical structure of deep learning
algorithms in sports performance which includes perception, comprehension and
decision while comparing their strengths and weaknesses. Secondly, we list
widely used existing datasets in sports and highlight their characteristics and
limitations. Finally, we summarize current challenges and point out future
trends of deep learning in sports. Our survey provides valuable reference
material for researchers interested in deep learning in sports applications.
CrimeNet: Neural Structured Learning using Vision Transformer for violence detection
The state of the art in violence detection in videos has improved in recent years thanks to deep learning models, but it is still below 90% average precision on the most complex datasets, which may cause frequent false alarms in video surveillance environments and lead security guards to disable the artificial intelligence system.
In this study, we propose a new neural network based on Vision Transformer (ViT) and Neural Structured Learning (NSL) with adversarial training. This network, called CrimeNet, outperforms previous works by a large margin and reduces false positives practically to zero. Our tests on the four most challenging violence-related datasets (binary and multi-class) show the effectiveness of CrimeNet, improving the state of the art by 9.4 to 22.17 percentage points in ROC AUC, depending on the dataset. In addition, we present a generalisation study of our model by training and testing it on different datasets. The results show that CrimeNet improves over competing methods with a gain of between 12.39 and 25.22 percentage points, showing remarkable robustness.
MCIN/AEI/ 10.13039/501100011033 and by the “European Union NextGenerationEU/PRTR” HORUS project - Grant n. PID2021-126359OB-I0
Enhancing camera surveillance using computer vision: a research note
- The growth of police operated surveillance cameras has
out-paced the ability of humans to monitor them effectively. Computer vision is
a possible solution. An ongoing research project on the application of computer
vision within a municipal police department is described. The paper aims to
discuss these issues.
- Following the demystification of
computer vision technology, its potential for police agencies is developed
within a focus on computer vision as a solution for two common surveillance
camera tasks (live monitoring of multiple surveillance cameras and summarizing
archived video files). Three unaddressed research questions (can specialized
computer vision applications for law enforcement be developed at this time, how
will computer vision be utilized within existing public safety camera
monitoring rooms, and what are the system-wide impacts of a computer vision
capability on local criminal justice systems) are considered.
- Despite computer vision becoming accessible to law
enforcement agencies, the impact of computer vision has not been discussed or
adequately researched. There is little knowledge of computer vision or its
potential in the field.
- This paper introduces and discusses computer
vision from a law enforcement perspective and will be valuable to police
personnel tasked with monitoring large camera networks and considering computer
vision as a system upgrade
- …