
    Large-Scale Mapping of Human Activity using Geo-Tagged Videos

    This paper is the first work to perform spatio-temporal mapping of human activity using the visual content of geo-tagged videos. We utilize a recent deep-learning-based video analysis framework, termed hidden two-stream networks, to recognize a range of activities in YouTube videos. This framework is efficient and can run in real time or faster, which is important both for recognizing events as they occur in streaming video and for reducing latency when analyzing already captured video. This, in turn, is important for using video in smart-city applications. We perform a series of experiments to show that our approach is able to accurately map activities both spatially and temporally. We also demonstrate the advantages of using the visual content over the tags/titles. Comment: Accepted at ACM SIGSPATIAL 201
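    A minimal sketch of the mapping step, assuming each recognized activity has already been reduced to a latitude, longitude, timestamp, and label. The `Detection` record, the 0.01-degree grid, and hour-of-day binning are illustrative assumptions, not the paper's method; the sketch only shows one simple way to turn per-video predictions into a spatio-temporal activity map by counting labels per grid cell and time bin.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Detection:
    # Hypothetical record: one activity recognized in one geo-tagged video.
    lat: float
    lon: float
    timestamp: datetime
    activity: str


def cell_key(det, cell_deg=0.01, hours_per_bin=1):
    """Map a detection to a (lat cell, lon cell, hour-of-day bin) key.

    Grid size and hour-of-day binning are illustrative assumptions;
    hour-of-day bins pool detections from different days together.
    """
    lat_cell = int(det.lat // cell_deg)
    lon_cell = int(det.lon // cell_deg)
    time_bin = det.timestamp.hour // hours_per_bin
    return (lat_cell, lon_cell, time_bin)


def build_activity_map(detections, cell_deg=0.01, hours_per_bin=1):
    """Count recognized activities per spatio-temporal cell."""
    grid = defaultdict(Counter)
    for det in detections:
        grid[cell_key(det, cell_deg, hours_per_bin)][det.activity] += 1
    return grid


if __name__ == "__main__":
    dets = [
        Detection(40.7128, -74.0060, datetime(2020, 6, 1, 14, 5), "running"),
        Detection(40.7129, -74.0061, datetime(2020, 6, 2, 14, 40), "running"),
        Detection(34.0522, -118.2437, datetime(2020, 6, 1, 9, 0), "cycling"),
    ]
    for key, counts in build_activity_map(dets).items():
        print(key, dict(counts))
```

    Binning by hour of day, rather than by absolute timestamp, is a design choice that surfaces recurring daily patterns; an absolute time bin would instead track how activity changes over a specific period.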

    A fully integrated violence detection system using CNN and LSTM

    Recently, the number of violence-related cases in places such as remote roads, pathways, shopping malls, elevators, sports stadiums, and liquor shops has increased drastically, and unfortunately such cases are often discovered only when it is too late. The aim is to create a complete system that performs real-time video analysis to recognize the presence of any violent activity and notify the concerned authority, such as the police department of the corresponding area. Using the deep learning networks CNN and LSTM along with a well-defined system architecture, we have achieved an efficient solution for real-time analysis of video footage, so that the concerned authority can monitor the situation through a mobile application that immediately notifies them when a violent event occurs.
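    The notification side of such a system can be sketched independently of the network. The sketch assumes each incoming `score` is a violence probability already produced by a CNN+LSTM classifier for the latest clip; the class name, window size, threshold, and `notify` callback are hypothetical. Averaging over a sliding window and suppressing repeats is one common way to avoid alerting on a single noisy prediction.

```python
from collections import deque


class ViolenceAlerter:
    """Sliding-window alert logic on top of per-clip violence scores.

    Hypothetical sketch: `score` is assumed to come from a CNN+LSTM
    classifier; window size and threshold are illustrative values.
    """

    def __init__(self, window=16, threshold=0.8, notify=print):
        self.scores = deque(maxlen=window)  # keeps only the most recent scores
        self.threshold = threshold
        self.notify = notify  # e.g. a push-notification callback (hypothetical)
        self.active = False  # True while an alert for the current event is raised

    def update(self, score, camera_id):
        """Add one score; return True if a new alert was sent."""
        self.scores.append(score)
        mean = sum(self.scores) / len(self.scores)
        window_full = len(self.scores) == self.scores.maxlen
        if window_full and mean >= self.threshold and not self.active:
            self.active = True
            self.notify(f"Possible violence on camera {camera_id} (mean score {mean:.2f})")
            return True
        if mean < self.threshold:
            self.active = False  # event has ended; allow the next alert
        return False
```

    Requiring a full window trades a short detection delay for fewer false alarms, which matters when every alert reaches a mobile device.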

    Spatio-temporal action localization with Deep Learning

    Master's dissertation in Informatics Engineering. A system that detects and identifies human activities is called a human action recognition system. In the video approach, human activity is classified into four categories, depending on the complexity of the steps and the number of body parts involved in the action: gestures, actions, interactions, and activities. Because of the variations of the human body, capturing valuable and discriminative features is challenging for video-based human action recognition. Deep learning techniques have provided practical applications in multiple fields of signal processing, often surpassing traditional signal processing at large scale. Recently, several application areas, namely surveillance, human-computer interaction, and content-based video retrieval, have studied the detection and recognition of violence. In recent years there has been rapid growth in the production and consumption of a wide variety of video data due to the popularization of high-quality, relatively low-priced video devices; smartphones and digital cameras have contributed a lot to this. At the same time, about 300 hours of video are uploaded to YouTube every minute. Along with the growing production of video data, new technologies such as video captioning, video question answering, and video-based activity/event detection are emerging every day. Given video input, human activity detection indicates which activity the video contains and locates the regions in the video where the activity occurs. This dissertation conducted an experiment to identify and detect violence with spatial action localization, adapting a public dataset for the purpose. The idea was to take an annotated dataset for general action recognition and adapt it solely for violence detection.
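    One plausible reading of "adapting a public dataset" is relabeling: keep the spatio-temporal box annotations of a general action recognition dataset and map each action class to violence or non-violence. The sketch assumes that reading; the class names, the `BoxAnnotation` record, and `to_violence_labels` are hypothetical and do not describe the dissertation's actual dataset or procedure.

```python
from dataclasses import dataclass

# Hypothetical class names: which general actions count as violent is a
# labeling decision, not something fixed by any particular dataset.
VIOLENT_ACTIONS = frozenset({"punch", "kick", "push"})


@dataclass(frozen=True)
class BoxAnnotation:
    video_id: str
    timestamp: float  # seconds into the video
    box: tuple  # (x1, y1, x2, y2) in normalized coordinates
    action: str


def to_violence_labels(annotations, violent=VIOLENT_ACTIONS):
    """Keep each person box but collapse its action label to a binary class."""
    return [
        BoxAnnotation(a.video_id, a.timestamp, a.box,
                      "violence" if a.action in violent else "non-violence")
        for a in annotations
    ]
```

    Keeping the boxes unchanged preserves the localization supervision, so a spatio-temporal detector can be trained on the adapted labels without re-annotating the videos.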

    Unified Keypoint-based Action Recognition Framework via Structured Keypoint Pooling

    This paper simultaneously addresses three limitations associated with conventional skeleton-based action recognition: skeleton detection and tracking errors, the poor variety of targeted actions, and person-wise and frame-wise action recognition. A point cloud deep-learning paradigm is introduced to action recognition, and a unified framework along with a novel deep neural network architecture called Structured Keypoint Pooling is proposed. The proposed method sparsely aggregates keypoint features in a cascaded manner based on prior knowledge of the data structure inherent in skeletons, such as the instances and frames to which each keypoint belongs, and achieves robustness against input errors. Its less constrained, tracking-free architecture enables time-series keypoints consisting of human skeletons and nonhuman object contours to be efficiently treated as an input 3D point cloud, extending the variety of targeted actions. Furthermore, we propose a Pooling-Switching Trick inspired by Structured Keypoint Pooling. This trick switches the pooling kernels between the training and inference phases to detect person-wise and frame-wise actions in a weakly supervised manner using only video-level action labels. It also enables our training scheme to naturally introduce a novel data augmentation that mixes multiple point clouds extracted from different videos. In the experiments, we comprehensively verify the effectiveness of the proposed method against these limitations, and the method outperforms state-of-the-art skeleton-based action recognition and spatio-temporal action localization methods. Comment: CVPR 202
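    A rough sketch of the grouping idea behind cascaded keypoint pooling and pooling-kernel switching, in plain Python. Assumptions: each keypoint arrives with a precomputed feature vector and its (frame, instance) indices, pooling is element-wise max, and the function names are hypothetical. It illustrates how switching the pooling groups yields video-, frame-, or person-level outputs; it is not the paper's Structured Keypoint Pooling network.

```python
from collections import defaultdict


def elementwise_max(vectors):
    """Max over a group of equal-length feature vectors."""
    return [max(column) for column in zip(*vectors)]


def pool_by(items, group_of):
    """Max-pool (key, feature) pairs whose keys map to the same group."""
    groups = defaultdict(list)
    for key, feat in items:
        groups[group_of(key)].append(feat)
    return {g: elementwise_max(feats) for g, feats in groups.items()}


def structured_pool(keypoints, mode="video"):
    """Cascaded pooling over ((frame, instance), feature) keypoint pairs.

    Stage 1 pools keypoints into one feature per instance per frame.
    mode="video"    -> one vector for the clip (training with video labels)
    mode="frame"    -> one vector per frame (frame-wise inference)
    mode="instance" -> one vector per instance (person-wise inference)
    In a real network, learned layers would transform features between
    stages; here they are the identity, so only the grouping is shown.
    """
    per_instance_frame = pool_by(keypoints, lambda k: k)
    if mode == "instance":
        return pool_by(per_instance_frame.items(), lambda k: k[1])
    per_frame = pool_by(per_instance_frame.items(), lambda k: k[0])
    if mode == "frame":
        return per_frame
    if mode == "video":
        return elementwise_max(per_frame.values())
    raise ValueError(f"unknown mode: {mode}")
```

    Because the grouping keys come from the data structure rather than from tracking, nothing here requires consistent identities across frames beyond the instance index supplied with each keypoint.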

    Automatic video censoring system using deep learning

    Due to the extensive use of video-sharing platforms and services, the amount of content of all kinds on the web has become massive. This abundance makes it difficult to control the kind of content that may be present in a given video. Beyond telling whether the content is suitable for children and sensitive people, it is also important to determine which parts of a video contain such content, so that parts a simple broad analysis would discard can be preserved. To tackle this problem, popular deep learning image models (MobileNetV2, Xception, InceptionV3, VGG16, VGG19, ResNet101, and ResNet50) were compared to find the one most suitable for the application. In addition, a system was developed that automatically censors inappropriate content, such as violent scenes, with the help of deep learning. The system uses transfer learning based on the VGG16 model. The experiments suggested that the model performs excellently for automatic censoring and could also be used in other similar applications.
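    Censoring only the offending parts requires turning frame-level predictions into time intervals. The sketch assumes per-frame "inappropriate" probabilities are already available (for instance from a fine-tuned image classifier such as VGG16); the function name, threshold, and `min_gap` merge rule are hypothetical choices, not the paper's system.

```python
def censor_intervals(frame_probs, fps, threshold=0.5, min_gap=0.5):
    """Turn per-frame probabilities into (start_s, end_s) intervals to censor.

    Frames with probability >= threshold are flagged; consecutive flagged
    frames form one interval, and intervals separated by less than
    `min_gap` seconds are merged so short clean gaps are also hidden.
    """
    intervals = []
    start = None
    for i, p in enumerate(frame_probs):
        if p >= threshold and start is None:
            start = i  # a flagged run begins
        elif p < threshold and start is not None:
            intervals.append((start / fps, i / fps))  # run ended at frame i
            start = None
    if start is not None:  # run lasts until the end of the video
        intervals.append((start / fps, len(frame_probs) / fps))

    merged = []
    for s, e in intervals:
        if merged and s - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], e)
        else:
            merged.append((s, e))
    return merged
```

    Everything outside the returned intervals is left untouched, which is what preserves the parts that a whole-video accept/reject decision would discard.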

    DyAnNet: A Scene Dynamicity Guided Self-Trained Video Anomaly Detection Network

    Unsupervised approaches for video anomaly detection may not perform as well as supervised approaches. However, learning unknown types of anomalies with an unsupervised approach is more practical than with a supervised one, because annotation is an extra burden. In this paper, we use isolation tree-based unsupervised clustering to partition the deep feature space of the video segments. The RGB stream generates a pseudo anomaly score and the flow stream generates a pseudo dynamicity score for each video segment. These scores are then fused using a majority voting scheme to generate preliminary bags of positive and negative segments. However, these bags may not be accurate, because the scores are generated using only the current segment, which does not represent the global behavior of a typical anomalous event. We then use a refinement strategy based on a cross-branch feed-forward network designed using the popular I3D network to refine both scores. The bags are then refined through a segment re-mapping strategy. The intuition behind combining a segment's dynamicity score with its anomaly score is to strengthen the evidence. The method has been evaluated on three popular video anomaly datasets: UCF-Crime, CCTV-Fights, and UBI-Fights. Experimental results reveal that the proposed framework achieves competitive accuracy compared to state-of-the-art video anomaly detection methods. Comment: 10 pages, 8 figures, and 4 tables. (ACCEPTED AT WACV 2023)
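    The bag-forming step can be illustrated with a generic majority vote over thresholded per-segment scores. This is a sketch under stated assumptions: the thresholds, the strict-majority rule, and the function name are hypothetical, and with only two scorers (anomaly and dynamicity) a strict majority reduces to requiring both to agree.

```python
def majority_vote_bags(score_streams, thresholds):
    """Split segments into positive and negative bags by majority vote.

    score_streams: one list of per-segment scores per scorer, e.g. a pseudo
    anomaly score and a pseudo dynamicity score (hypothetical inputs).
    thresholds: one threshold per scorer; a score >= threshold is a vote.
    A segment is positive when strictly more than half of the scorers vote.
    """
    n_segments = len(score_streams[0])
    positive, negative = [], []
    for seg in range(n_segments):
        votes = sum(scores[seg] >= thr
                    for scores, thr in zip(score_streams, thresholds))
        if votes * 2 > len(score_streams):
            positive.append(seg)
        else:
            negative.append(seg)
    return positive, negative
```

    These preliminary bags are exactly what the abstract says may be inaccurate, since each vote looks at one segment in isolation; the refinement and re-mapping stages are not sketched here.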

    A Survey of Deep Learning in Sports Applications: Perception, Comprehension, and Decision

    Deep learning has the potential to revolutionize sports performance, with applications ranging from perception and comprehension to decision. This paper presents a comprehensive survey of deep learning in sports performance, focusing on three main aspects: algorithms, datasets and virtual environments, and challenges. First, we discuss the hierarchical structure of deep learning algorithms in sports performance, which comprises perception, comprehension, and decision, and compare their strengths and weaknesses. Second, we list widely used existing datasets in sports and highlight their characteristics and limitations. Finally, we summarize current challenges and point out future trends of deep learning in sports. Our survey provides valuable reference material for researchers interested in deep learning in sports applications.

    CrimeNet: Neural Structured Learning using Vision Transformer for violence detection

    The state of the art in violence detection in videos has improved in recent years thanks to deep learning models, but average precision is still below 90% on the most complex datasets, which may cause frequent false alarms in video surveillance environments and may lead security guards to disable the artificial intelligence system. In this study, we propose a new neural network based on Vision Transformer (ViT) and Neural Structured Learning (NSL) with adversarial training. This network, called CrimeNet, outperforms previous works by a large margin and practically eliminates false positives. Our tests on the four most challenging violence-related datasets (binary and multi-class) show the effectiveness of CrimeNet, improving the state of the art by 9.4 to 22.17 percentage points in ROC AUC depending on the dataset. In addition, we present a generalisation study on our model by training and testing it on different datasets. The obtained results show that CrimeNet improves over competing methods by between 12.39 and 25.22 percentage points, showing remarkable robustness. Funded by MCIN/AEI/10.13039/501100011033 and by the “European Union NextGenerationEU/PRTR” HORUS project - Grant n. PID2021-126359OB-I0
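    The gains above are stated in ROC AUC percentage points, i.e., absolute differences: a gain of 9.4 points means the AUC rose by 0.094, for example from 0.80 to 0.894. As a generic illustration of the metric (not CrimeNet's evaluation code; function names are hypothetical), ROC AUC equals the probability that a randomly chosen positive clip receives a higher score than a randomly chosen negative one, with ties counted as half:

```python
def roc_auc(labels, scores):
    """ROC AUC as P(score of a random positive > score of a random negative).

    Ties count as half a win. This pairwise loop is O(P * N), which is
    fine for a small illustration but slow for large evaluation sets.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def gain_in_points(auc_new, auc_old):
    """Absolute difference between two AUCs, in percentage points."""
    return (auc_new - auc_old) * 100.0
```

    Because AUC is threshold-free, it summarizes ranking quality; the false-alarm rate seen by security guards still depends on the operating threshold chosen at deployment.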

    Enhancing camera surveillance using computer vision: a research note

    Purpose - The growth of police-operated surveillance cameras has outpaced the ability of humans to monitor them effectively. Computer vision is a possible solution. An ongoing research project on the application of computer vision within a municipal police department is described. The paper aims to discuss these issues. Design/methodology/approach - Following the demystification of computer vision technology, its potential for police agencies is developed with a focus on computer vision as a solution for two common surveillance camera tasks: live monitoring of multiple surveillance cameras and summarizing archived video files. Three unaddressed research questions are considered: can specialized computer vision applications for law enforcement be developed at this time, how will computer vision be utilized within existing public safety camera monitoring rooms, and what are the system-wide impacts of a computer vision capability on local criminal justice systems? Findings - Although computer vision is becoming accessible to law enforcement agencies, its impact has not been discussed or adequately researched. There is little knowledge of computer vision or its potential in the field. Originality/value - This paper introduces and discusses computer vision from a law enforcement perspective and will be valuable to police personnel tasked with monitoring large camera networks and considering computer vision as a system upgrade.