511 research outputs found

    Irish Machine Vision and Image Processing Conference Proceedings 2017

    Deep neural networks for video classification in ecology

    Analyzing large volumes of video data is a challenging and time-consuming task. Automating this process would be very valuable, especially in ecology, where massive amounts of video can unlock new avenues of research into the behaviour of animals in their environments. Deep Neural Networks, particularly Deep Convolutional Neural Networks (CNNs), are a powerful class of models for computer vision. When combined with Recurrent Neural Networks (RNNs), deep convolutional models can be applied to video for frame-level video classification. This research studies two datasets: penguins and seals. The purpose of the research is to compare the performance of image-only CNNs, which treat each frame of a video independently, against a combined CNN-RNN approach, and to assess whether incorporating the motion information in the temporal dimension of video improves classification accuracy on these two datasets. Video and image-only models offered similar out-of-sample performance on the simpler seals dataset, but the video model led to moderate performance improvements on the more complex penguin action recognition dataset.
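
    The abstract above pairs a per-frame CNN with an RNN for frame-level video classification. The following is a minimal sketch of that general pattern, assuming PyTorch and a recent torchvision; the ResNet-18 backbone, layer sizes, and class names are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn
from torchvision import models

class CNNRNNClassifier(nn.Module):
    """Per-frame CNN features fed to an LSTM for frame-level classification."""
    def __init__(self, num_classes, hidden_size=256):
        super().__init__()
        backbone = models.resnet18(weights=None)   # any image CNN would do here
        feat_dim = backbone.fc.in_features         # 512 for ResNet-18
        backbone.fc = nn.Identity()                # keep the pooled image features
        self.cnn = backbone
        self.rnn = nn.LSTM(feat_dim, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, num_classes)

    def forward(self, clip):                       # clip: (batch, time, 3, H, W)
        b, t = clip.shape[:2]
        feats = self.cnn(clip.flatten(0, 1))       # (batch*time, feat_dim)
        hidden, _ = self.rnn(feats.view(b, t, -1)) # temporal context per frame
        return self.head(hidden)                   # (batch, time, num_classes)

# Usage: per-frame class logits for a batch of two 8-frame clips.
logits = CNNRNNClassifier(num_classes=4)(torch.randn(2, 8, 3, 224, 224))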

    Machine Learning in Image Analysis and Pattern Recognition

    This book charts the progress in applying machine learning, including deep learning, to a broad range of image analysis and pattern recognition problems and applications. In this book, we have assembled original research articles making unique contributions to the theory, methodology and applications of machine learning in image analysis and pattern recognition.

    Model-driven and Data-driven Methods for Recognizing Compositional Interactions from Videos

    The ability to accurately understand how humans interact with their surroundings is critical for many vision-based intelligent systems. Compared to simple atomic actions (e.g., raise hand), many interactions found in our daily lives are defined as a composition of an atomic action with a variety of arguments (e.g., pick up a pen). Despite recent progress in the literature, there still remain fundamental challenges unique to recognizing interactions from videos. First, most of the action recognition literature assumes a problem setting where a pre-defined set of action labels is supported by a large and relatively balanced set of training examples for those labels. There are many realistic cases where this data assumption breaks down, either because the application demands fine-grained classification of a potentially combinatorial number of activities, and/or because the problem at hand is an "open-set" problem where new labels may be defined at test time. Second, many deep video models simply represent video as a three-dimensional tensor and ignore the differences between the spatial and temporal dimensions during the representation learning stage. As a result, data-driven bottom-up action models frequently over-fit to the static content of the video and fail to accurately capture the dynamic changes in relations among actors in the video. In this dissertation, we address the aforementioned challenges of recognizing fine-grained interactions from videos by developing solutions that explicitly represent interactions as compositions of simpler static and dynamic elements. By exploiting the power of composition, our "detection by description" framework expresses a very rich space of interactions using only a small set of static visual attributes and a few dynamic patterns. A definition of an interaction is constructed on the fly from first-principles state machines which leverage bottom-up deep-learned components such as object detectors. Compared to existing model-driven methods for video understanding, we introduce the notion of dynamic action signatures, which allows a practitioner to express the expected temporal behavior of the various elements of an interaction. We show that our model-driven approach using dynamic action signatures outperforms other zero-shot methods on multiple public action classification benchmarks, and even some fully supervised baselines, under realistic problem settings. Next, we extend our approach to a setting where the static and dynamic action signatures are not given by the user but rather learned from data. We do so by borrowing ideas from data-driven, two-stream action recognition and model-driven, structured human-object interaction detection. The key idea behind our approach is that we can learn the static and dynamic decomposition of an interaction with a dual-pathway network by leveraging object detections. To do so, we introduce the Motion Guided Attention Fusion mechanism, which transfers the motion-centric features learned using object detections to the representation learned from the RGB-based motion pathway. Finally, we conclude with a comprehensive case study on vision-based activity detection applied to video surveillance. Using the methods presented in this dissertation, we step towards an intelligent vision system that can detect a particular interaction instance given only a description from a user, departing from the requirement for a massive dataset of labeled training videos. Moreover, as our framework naturally defines a decompositional structure of activities into detectable static/visual attributes, we show that we can simulate the necessary training data to acquire attribute detectors when the desired detector is otherwise unavailable. Our approach achieves competitive or superior performance over existing approaches for recognizing fine-grained interactions from realistic videos.
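
    The "detection by description" idea above composes static visual attributes (from object detectors) with dynamic action signatures inside a small state machine. The sketch below illustrates that style of composition for a hypothetical "pick up an object" interaction; the attribute thresholds, detection format, and function names are assumptions for illustration, not the dissertation's code.

# Illustrative sketch only: a hypothetical "pick up an object" interaction
# expressed as a small state machine over per-frame detections, composing a
# static attribute (hand near object) with a dynamic signature (object rises).

def near(hand, obj, thresh=40):
    """Static attribute: centers of two detected boxes are close (in pixels)."""
    (hx, hy), (ox, oy) = hand["center"], obj["center"]
    return abs(hx - ox) + abs(hy - oy) < thresh

def moving_up(track, min_rise=20):
    """Dynamic signature: the tracked center has risen over recent frames."""
    ys = [y for _, y in track]
    return len(ys) >= 2 and (ys[0] - ys[-1]) > min_rise    # image y decreases upward

def detect_pick_up(frames):
    """frames: per-frame dicts like {"hand": {"center": (x, y)}, "object": {...}}.
    Returns True if the composed interaction is observed."""
    state, obj_track = "idle", []
    for det in frames:
        hand, obj = det.get("hand"), det.get("object")
        if hand is None or obj is None:
            continue
        obj_track.append(obj["center"])
        if state == "idle" and near(hand, obj):
            state = "contact"                              # hand reaches the object
        elif state == "contact" and moving_up(obj_track[-5:]):
            return True                                    # object rises while held
    return False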

    Exploratory search through large video corpora

    Activity retrieval is a growing field in electrical engineering that specializes in the search and retrieval of relevant activities and events in video corpora. With the affordability and popularity of cameras for government, personal and retail use, the quantity of available video data is rapidly outpacing our ability to reason over it. To empower users to navigate and interact with the contents of these video corpora, we propose a framework for exploratory search that emphasizes activity structure and search space reduction over complex feature representations. Exploratory search is a user-driven process wherein a person provides a system with a query describing the activity, event, or object they are interested in finding. Typically, this description takes the implicit form of one or more exemplar videos, but it can also involve an explicit description. The system returns candidate matches, followed by query refinement and iteration. System performance is judged by the run-time of the system and the precision/recall curve of the query matches returned. Scaling is one of the primary challenges in video search. From vast web-video archives like YouTube (1 billion videos and counting) to the 30 million active surveillance cameras shooting an estimated 4 billion hours of footage every week in the United States, trying to find a set of matches can be like looking for a needle in a haystack. Our goal is to create an efficient archival representation of video corpora that can be calculated in real time as video streams in, and that then enables a user to quickly get a set of matching results. First, we design a system for rapidly identifying simple queries in large-scale video corpora. Instead of focusing on feature design, our system focuses on the spatiotemporal relationships between those features as a means of disambiguating an activity of interest from background. We define a semantic feature vocabulary of concepts that are both readily extracted from video and easily understood by an operator. As data streams in, features are hashed to an inverted index and retrieved in constant time after the system is presented with a user's query. We take a zero-shot approach to exploratory search: the user manually assembles vocabulary elements like color, speed, size and type into a graph. Given that information, we perform an initial downsampling of the archived data, and design a novel dynamic programming approach based on genome sequencing to search for similar patterns. Experimental results indicate that this approach outperforms other methods for detecting activities in surveillance video datasets. Second, we address the problem of representing complex activities that take place over long spans of space and time. Subgraph and graph matching methods have seen limited use in exploratory search because both problems are provably NP-hard. In this work, we render these problems computationally tractable by identifying the maximally discriminative spanning tree (MDST), and using dynamic programming to optimally reduce the archive data based on a custom algorithm for tree matching in attributed relational graphs. We demonstrate the efficacy of this approach on popular surveillance video datasets in several modalities. Finally, we design an approach for successive search space reduction in subgraph matching problems. Given a query graph and archival data, our algorithm iteratively selects spanning trees from the query graph that optimize the expected search space reduction at each step until the archive converges. We use this approach to efficiently reason over video surveillance datasets, simulated data, as well as large graphs of protein data.
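
    The first contribution above hashes a semantic feature vocabulary (color, speed, size, type) into an inverted index so that a zero-shot query can be answered in constant time per lookup. Below is a minimal Python sketch of such an index; the attribute names and tracklet format are assumptions, not the thesis implementation.

# Illustrative sketch only: an inverted index over a small semantic vocabulary,
# built as video streams in and queried in (average) constant time per lookup.
from collections import defaultdict

class SemanticIndex:
    def __init__(self):
        self.index = defaultdict(set)              # (attribute, value) -> tracklet ids

    def add(self, tracklet_id, attributes):
        """attributes: e.g. {"color": "red", "type": "car", "speed": "fast"}."""
        for key, value in attributes.items():
            self.index[(key, value)].add(tracklet_id)

    def query(self, constraints):
        """Return tracklet ids matching every attribute constraint in the query."""
        sets = [self.index[(k, v)] for k, v in constraints.items()]
        return set.intersection(*sets) if sets else set()

# Usage: archive two tracklets, then answer a zero-shot attribute query.
idx = SemanticIndex()
idx.add("t1", {"color": "red", "type": "car", "speed": "fast"})
idx.add("t2", {"color": "blue", "type": "person", "speed": "slow"})
print(idx.query({"type": "car", "speed": "fast"}))   # {'t1'}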

    Deep representation learning for action recognition : a dissertation presented in partial fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science at Massey University, Auckland, New Zealand

    Figures 2.2 through 2.7, and 2.9 through 2.11 were removed for copyright reasons. Figures 2.8, and 2.12 through 2.16 are licensed on the arXiv repository under a Creative Commons Attribution licence (https://arxiv.org/help/license). This research focuses on deep representation learning for human action recognition based on emerging deep learning techniques using RGB and skeleton data. The output of such deep learning techniques is a parameterised hierarchical model representing the knowledge learnt from the training dataset. It is similar to the knowledge stored in our brain, which is learned from our experience. Currently, the computer's ability to perform such abstraction is far behind the human level, perhaps due to the complexity of processing spatio-temporal knowledge. A discriminative spatio-temporal representation of human actions is the key to human action recognition systems. Different feature encoding approaches and different learning models may lead to quite different performance, and at present there is no approach that can accurately model the cognitive processing of human actions. This thesis presents several novel approaches that allow computers to learn discriminative, compact and representative spatio-temporal features for human action recognition from multiple input features, aiming at enhancing the performance of an automated system for human action recognition. The input features for the proposed approaches in this thesis are derived from signals captured by a depth camera, e.g., RGB video and skeleton data. In this thesis, I developed several geometric features and proposed the following models for action recognition: CVR-CNN, SKB-TCN, Multi-Stream CNN and STN. These proposed models are inspired by the visual attention mechanisms that are inherently present in human beings. In addition, I discussed the performance of the geometric features that I developed along with the proposed models. Superior experimental results for the proposed geometric features and models were obtained and verified on several benchmark human action recognition datasets. On the most challenging benchmark dataset, NTU RGB+D, the accuracy obtained surpassed the performance of the existing RNN-based and ST-GCN models. This study provides a deeper understanding of the spatio-temporal representation of human actions, and it has significant implications for explaining the inner workings of deep learning models in learning patterns from time series data. The findings of these proposed models can lay a solid foundation for further developments and for the guidance of future human action-related studies.
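
    The thesis above builds action models on geometric features derived from skeleton data. As one simple, commonly used example of such a feature, the sketch below computes per-frame pairwise joint distances with NumPy; the exact geometric features and models (CVR-CNN, SKB-TCN, Multi-Stream CNN, STN) in the thesis are not reproduced here.

# Illustrative sketch only: per-frame pairwise joint distances, one simple
# family of translation-invariant geometric features computable from skeletons.
import numpy as np

def pairwise_joint_distances(skeleton):
    """skeleton: array of shape (T, J, 3) -- T frames, J joints, xyz coordinates.
    Returns an array of shape (T, J*(J-1)//2) of per-frame joint distances."""
    T, J, _ = skeleton.shape
    i, j = np.triu_indices(J, k=1)                 # all joint pairs with i < j
    diffs = skeleton[:, i, :] - skeleton[:, j, :]  # (T, num_pairs, 3)
    return np.linalg.norm(diffs, axis=-1)          # (T, num_pairs)

# Usage: a random 30-frame sequence with 25 joints (NTU RGB+D uses 25 joints).
features = pairwise_joint_distances(np.random.randn(30, 25, 3))
print(features.shape)                              # (30, 300)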