1,191 research outputs found

    Learning to Detect Violent Videos using Convolutional Long Short-Term Memory

    Full text link
    Developing a technique for the automatic analysis of surveillance videos in order to identify the presence of violence is of broad interest. In this work, we propose a deep neural network for the purpose of recognizing violent videos. A convolutional neural network is used to extract frame level features from a video. The frame level features are then aggregated using a variant of the long short term memory that uses convolutional gates. The convolutional neural network along with the convolutional long short term memory is capable of capturing localized spatio-temporal features which enables the analysis of local motion taking place in the video. We also propose to use adjacent frame differences as the input to the model thereby forcing it to encode the changes occurring in the video. The performance of the proposed feature extraction pipeline is evaluated on three standard benchmark datasets in terms of recognition accuracy. Comparison of the results obtained with the state of the art techniques revealed the promising capability of the proposed method in recognizing violent videos.Comment: Accepted in International Conference on Advanced Video and Signal based Surveillance(AVSS 2017

    Holistic features for real-time crowd behaviour anomaly detection

    Get PDF
    This paper presents a new approach to crowd behaviour anomaly detection that uses a set of efficiently computed, easily interpretable, scene-level holistic features. This low-dimensional descriptor combines two features from the literature: crowd collectiveness [1] and crowd conflict [2], with two newly developed crowd features: mean motion speed and a new formulation of crowd density. Two different anomaly detection approaches are investigated using these features. When only normal training data is available we use a Gaussian Mixture Model (GMM) for outlier detection. When both normal and abnormal training data is available we use a Support Vector Machine (SVM) for binary classification. We evaluate on two crowd behaviour anomaly detection datasets, achieving both state-of-the-art classification performance on the violent-flows dataset [3] as well as better than real-time processing performance (40 frames per second)

    Differential Recurrent Neural Networks for Human Activity Recognition

    Get PDF
    Human activity recognition has been an active research area in recent years. The difficulty of this problem lies in the complex dynamical motion patterns embedded through the sequential frames. The Long Short-Term Memory (LSTM) recurrent neural network is capable of processing complex sequential information since it utilizes special gating schemes for learning representations from long input sequences. It has the potential to model various time-series data, where the current hidden state has to be considered in the context of the past hidden states. Unfortunately, the conventional LSTMs do not consider the impact of spatio-temporal dynamics corresponding to the given salient motion patterns, when they gate the information that ought to be memorized through time. To address this problem, we propose a differential gating scheme for the LSTM neural network, which emphasizes the change in information gain caused by the salient motions between the successive video frames. This change in information gain is quantified by Derivative of States (DoS), and thus the proposed LSTM model is termed differential Recurrent Neural Network (dRNN). Based on the energy profiling of DoS, we further propose to employ the State Energy Profile (SEP) to search for salient dRNN states and construct more informative representations. To better understand the scene and human appearance information, the dRNN model is extended by connecting Convolutional Neural Networks (CNN) and stacked dRNNs into an end-to-end model. Lastly, the dissertation continues to discuss and compare the combined and the individual orders of DoS used within the dRNN. We propose to control the LSTM gates via individual order of DoS and stack multiple levels of LSTM cells in increasing orders of state derivatives. To this end, we have introduced a new family of LSTMs, expanding the applications of LSTMs and advancing the performances of the state-of-the-art methods

    Detecção de eventos violentos em sequências de vídeos baseada no operador histograma da transformada census

    Get PDF
    Orientador: Hélio PedriniDissertação (mestrado) - Universidade Estadual de Campinas, Instituto de ComputaçãoResumo: Sistemas de vigilância em sequências de vídeo têm sido amplamente utilizados para o monitoramento de cenas em diversos ambientes, tais como aeroportos, bancos, escolas, indústrias, estações de ônibus e trens, rodovias e lojas. Devido à grande quantidade de informação obtida pelas câmeras de vigilância, o uso de inspeção visual por operadores de câmera se torna uma tarefa cansativa e sujeita a falhas, além de consumir muito tempo. Um desafio é o desenvolvimento de sistemas inteligentes de vigilância capazes de analisar longas sequências de vídeos capturadas por uma rede de câmeras de modo a identificar um determinado comportamento. Neste trabalho, foram propostas e avaliadas diversas técnicas de classificação, tendo como base o operador CENTRIST (Histograma da Transformada Census), no contexto de identificação de eventos violentos em cenas de vídeo. Adicionalmente, foram avaliados outros descritores tradicionais, como HoG (Histograma de Gradientes Orientados), HOF (Histograma do Fluxo Óptico) e descritores extraídos a partir de modelos de aprendizado de máquina profundo pré-treinados. De modo a permitir a avaliação apenas em regiões de interesse presentes nos quadros dos vídeos, técnicas para remoção do fundo da cena. Uma abordagem baseada em janela deslizante foi utilizada para avaliar regiões menores da cena em combinação com um critério de votação. A janela deslizante é então aplicada juntamente com uma filtragem de blocos utilizando fluxo óptico da cena. Para demonstrar a efetividade de nosso método para discriminar violência em cenas de multidões, os resultados obtidos foram comparados com outras abordagens disponíveis na literatura em duas bases de dados públicas (Violence in Crowds e Hockey Fights). A eficácia da combinação entre CENTRIST e HoG foi demonstrada em comparação com a utilização desses operadores individualmente. A combinação desses operadores obteve aproximadamente 88% contra 81% utilizando apenas HoG e 86% utilizando CENTRIST. A partir do refinamento do método proposto, foi identificado que avaliar blocos do quadro com a abordagem de janela deslizante tornou o método mais eficaz. Técnicas para geração de palavras visuais com codificação esparsa, medida de distância com um modelo de misturas Gaussianas e medida de distância entre agrupamentos também foram avaliadas e discutidas. Além disso, também foi avaliado calcular dinamicamente o limiar de votação, o que trouxe resultados melhores em alguns casos. Finalmente, formas de restringir os atores presentes nas cenas utilizando fluxo óptico foram analisadas. Utilizando o método de Otsu para calcular o limiar do fluxo óptico da cena a eficiência supera nossos resultados mais competitivos: 91,46% de acurácia para a base Violence in Crowds e 92,79% para a base Hockey FightsAbstract: Surveillance systems in video sequences have been widely used to monitor scenes in various environments, such as airports, banks, schools, industries, bus and train stations, highways and stores. Due to the large amount of information obtained via surveillance cameras, the use of visual inspection by camera operators becomes a task subject to fatigue and failure, in addition to consuming a lot of time. One challenge is the development of intelligent surveillance systems capable of analyzing long video sequences captured by a network of cameras in order to identify a certain behavior. In this work, we propose and analyze the use of several classification techniques, based on the CENTRIST (Transformation Census Histogram) operator, in the context of identifying violent events in video scenes. Additionally, we evaluated other traditional descriptors, such as HoG (Oriented Gradient Histogram), HOF (Optical Flow Histogram) and descriptors extracted from pre-trained deep machine learning models. In order to allow the evaluation only in regions of interest present in the video frames, we investigated techniques for removing the background from the scene. A sliding window-based approach was used to assess smaller regions of the scene in combination with a voting criterion. The sliding window is then applied along with block filtering using the optical flow of the scene. To demonstrate the effectiveness of our method for discriminating violence in crowd scenes, we compared the results to other approaches available in the literature in two public databases (Violence in Crowds and Hockey Fights). The combination of CENTRIST and HoG was demonstrated in comparison to the use of these operators individually. The combination of both operators obtained approximately 88% against 81% using only HoG and 86% using CENTRIST. From the refinement of the proposed method, we identified that evaluating blocks of the frame with the sliding window-based approach made the method more effective. Techniques for generating a codebook with sparse coding, distance measurement with a Gaussian mixture model and distance measurement between clusters were evaluated and discussed. Also we dynamically calculate the threshold for class voting, which obtained superior results in some cases. Finally, strategies for restricting the actors present in the scenes using optical flow were analyzed. By using the Otsu¿s method to calculate the threshold from the optical flow at the scene, the effectiveness surpasses our most competitive results: 91.46% accuracy for the Violence in Crowds dataset and 92.79% for the Hockey Fights datasetMestradoCiência da ComputaçãoMestre em Ciência da Computaçã

    Discovery and recognition of motion primitives in human activities

    Get PDF
    We present a novel framework for the automatic discovery and recognition of motion primitives in videos of human activities. Given the 3D pose of a human in a video, human motion primitives are discovered by optimizing the `motion flux', a quantity which captures the motion variation of a group of skeletal joints. A normalization of the primitives is proposed in order to make them invariant with respect to a subject anatomical variations and data sampling rate. The discovered primitives are unknown and unlabeled and are unsupervisedly collected into classes via a hierarchical non-parametric Bayes mixture model. Once classes are determined and labeled they are further analyzed for establishing models for recognizing discovered primitives. Each primitive model is defined by a set of learned parameters. Given new video data and given the estimated pose of the subject appearing on the video, the motion is segmented into primitives, which are recognized with a probability given according to the parameters of the learned models. Using our framework we build a publicly available dataset of human motion primitives, using sequences taken from well-known motion capture datasets. We expect that our framework, by providing an objective way for discovering and categorizing human motion, will be a useful tool in numerous research fields including video analysis, human inspired motion generation, learning by demonstration, intuitive human-robot interaction, and human behavior analysis

    Audio-visual multi-modality driven hybrid feature learning model for crowd analysis and classification

    Get PDF
    The high pace emergence in advanced software systems, low-cost hardware and decentralized cloud computing technologies have broadened the horizon for vision-based surveillance, monitoring and control. However, complex and inferior feature learning over visual artefacts or video streams, especially under extreme conditions confine majority of the at-hand vision-based crowd analysis and classification systems. Retrieving event-sensitive or crowd-type sensitive spatio-temporal features for the different crowd types under extreme conditions is a highly complex task. Consequently, it results in lower accuracy and hence low reliability that confines existing methods for real-time crowd analysis. Despite numerous efforts in vision-based approaches, the lack of acoustic cues often creates ambiguity in crowd classification. On the other hand, the strategic amalgamation of audio-visual features can enable accurate and reliable crowd analysis and classification. Considering it as motivation, in this research a novel audio-visual multi-modality driven hybrid feature learning model is developed for crowd analysis and classification. In this work, a hybrid feature extraction model was applied to extract deep spatio-temporal features by using Gray-Level Co-occurrence Metrics (GLCM) and AlexNet transferrable learning model. Once extracting the different GLCM features and AlexNet deep features, horizontal concatenation was done to fuse the different feature sets. Similarly, for acoustic feature extraction, the audio samples (from the input video) were processed for static (fixed size) sampling, pre-emphasis, block framing and Hann windowing, followed by acoustic feature extraction like GTCC, GTCC-Delta, GTCC-Delta-Delta, MFCC, Spectral Entropy, Spectral Flux, Spectral Slope and Harmonics to Noise Ratio (HNR). Finally, the extracted audio-visual features were fused to yield a composite multi-modal feature set, which is processed for classification using the random forest ensemble classifier. The multi-class classification yields a crowd-classification accurac12529y of (98.26%), precision (98.89%), sensitivity (94.82%), specificity (95.57%), and F-Measure of 98.84%. The robustness of the proposed multi-modality-based crowd analysis model confirms its suitability towards real-world crowd detection and classification tasks

    Automotive Interior Sensing - Anomaly Detection

    Get PDF
    Com o surgimento dos veículos autónomos partilhados não haverá condutores nos veículos capazes de manter o bem-estar dos passageiros. Por esta razão, é imperativo que exista um sistema preparado para detetar comportamentos anómalos, por exemplo, violência entre passageiros, e que responda de forma adequada. O tipo de anomalias pode ser tão diverso que ter um "dataset" para treino que contenha todas as anomalias possíveis neste contexto é impraticável, implicando que algoritmos tradicionais de classificação não sejam ideais para esta aplicação. Por estas razões, os algoritmos de deteção de anomalias são a melhor opção para construir um bom modelo discriminativo. Esta dissertação foca-se na utilização de técnicas de "deep learning", mais precisamente arquiteturas baseadas em "Spatiotemporal auto-encoders" que são treinadas apenas com sequências de "frames" de comportamentos normais e testadas com sequências normais e anómalas dos "datasets" internos da Bosch. O modelo foi treinado inicialmente com apenas uma categoria das ações não violentas e as iterações finais foram treinadas com todas as categorias de ações não violentas. A rede neuronal contém camadas convolucionais dedicadas à compressão e descompressão dos dados espaciais; e algumas camadas dedicadas à compressão e descompressão temporal dos dados, implementadas com células LSTM ("Long Short-Term Memory") convolucionais, que extraem informações relativas aos movimentos dos passageiros. A rede define como reconstruir corretamente as sequências de "frames" normais e durante os testes, cada sequência é classificada como normal ou anómala de acordo com o seu erro de reconstrução. Através dos erros de reconstrução são calculados os "regularity scores" que indicam a regularidade que o modelo previu para cada "frame". A "framework" resultante é uma adição viável aos algoritmos tradicionais de reconhecimento de ações visto que pode funcionar como um sistema que serve para detetar ações desconhecidas e contribuir para entender o significado de tais interações humanas.With the appearance of SAVs (Shared Autonomous Vehicles) there will no longer be a driver responsible for maintaining the car interior and well-being of passengers. To counter this, it is imperative to have a system that is able to detect any abnormal behaviours, e.g., violence between passengers, and trigger the appropriate response. Furthermore, the type of anomalous activities can be so diverse, that having a dataset that incorporates most use cases is unattainable, making traditional classification algorithms not ideal for this kind of application. In this sense, anomaly detection algorithms are a good approach in order to build a discriminative model. Taking this into account, this work focuses on the use of deep learning techniques, more precisely Spatiotemporal auto-encoder based frameworks, which are trained on human behavior video sequences and tested on use cases with normal and abnormal human interactions from Bosch's internal datasets. Initially, the model was trained on a single non-violent action category. Final iterations considered all of the identified non-violent actions as normal data. The network architecture presents a group of convolutional layers which encode and decode spatial data; and a temporal encoder/decoder structure, implemented as a convolutional Long Short Term Memory network, responsible for learning motion information. The network defines how to properly reconstruct the 'normal' frame sequences and during testing, each sequence is classified as normal or abnormal based on its reconstruction error. Based on these values, regularity scores are inferred showing the predicted regularity of each frame. The resulting framework is a viable addition to traditional action recognition algorithms since it can work as a tool for detecting unknown actions, strange/abnormal behaviours and aid in understanding the meaning of such human interactions

    Applying computer analysis to detect and predict violent crime during night time economy hours

    Get PDF
    The Night-Time Economy is characterised by increased levels of drunkenness, disorderly behaviour and assault-related injury. The annual cost associated with violent incidents is approximately £14 billion, with the cost of violence with injury costing approximately 6.6 times more than violence without injury. The severity of an injury can be reduced by intervening in the incident as soon as possible. Both understanding where violence occurs and detecting incidents can result in quicker intervention through effective police resource deployment. Current systems of detection use human operators whose detection ability is poor in typical surveillance environments. This is used as motivation for the development of computer vision-based detection systems. Alternatively, a predictive model can estimate where violence is likely to occur to help law enforcement with the tactical deployment of resources. Many studies have simulated pedestrian movement through an environment to inform environmental design to minimise negative outcomes. For the main contributions of this thesis, computer vision analysis and agent-based modelling are utilised to develop methods for the detection and prediction of violent behaviour respectively. Two methods of violent behaviour detection from video data are presented. Treating violence detection as a classification task, each method reports state-of-the-art classification performance and real-time performance. The first method targets crowd violence by encoding crowd motion using temporal summaries of Grey Level Co-occurrence Matrix (GLCM) derived features. The second method, aimed at detecting one-on-one violence, operates by locating and subsequently describing regions of interest based on motion characteristics associated with violent behaviour. Justified using existing literature, the characteristics are high acceleration, non-linear movement and convergent motion. Each violence detection method is used to evaluate the intrinsic properties of violent behaviour. We demonstrate issues associated with violent behaviour datasets by showing that state-of-the-art classification is achievable by exploiting data bias, highlighting potential failure points for feature representation learning schemes. Using agent-based modelling techniques and regression analysis, we discovered that including the effects of alcohol when simulating behaviour within city centre environments produces a more accurate model for predicting violent behaviour
    corecore