
    Learning space-time structures for action recognition and localization

    In this thesis the problem of automatic human action recognition and localization in videos is studied. The goal is to recognize the category of the human action occurring in a video and to localize the action in space and/or time. The problem is challenging due to the complexity of human actions, large intra-class variations and distracting backgrounds. Human actions are inherently structured patterns of body movements, yet past work is inadequate in learning the space-time structures of human actions and exploiting them for better recognition and localization. In this thesis new methods are proposed that exploit such space-time structures for effective human action recognition and localization in videos, including sports videos, YouTube videos, TV programs and movies.

    A new local space-time video representation, the hierarchical Space-Time Segments, is first proposed. Using this representation, ensembles of hierarchical spatio-temporal trees, discovered directly from the training videos, are constructed to model the hierarchical, spatial and temporal structures of human actions. The proposed approach achieves promising performance in action recognition and localization on challenging benchmark datasets. Moreover, the discovered trees show good cross-dataset generalizability: trees learned on one dataset can be used to recognize and localize similar actions in another dataset.

    To handle large-scale data, a deep model is explored that learns the temporal progression of actions using Long Short-Term Memory (LSTM), a type of Recurrent Neural Network (RNN). Two novel ranking losses are proposed to train the model to better capture the temporal structures of actions for accurate action recognition and temporal localization. This model achieves state-of-the-art performance on a large-scale video dataset.

    A deep model usually employs a Convolutional Neural Network (CNN) to learn visual features from video frames. The problem of utilizing web action images for training a CNN is also studied: training a CNN typically requires a large number of training videos, but the findings of this study show that web action images can be utilized as additional training data to significantly reduce the burden of video training data collection.
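    The exact form of the two ranking losses is not spelled out in this abstract. As a rough illustration of the general idea only, the sketch below (PyTorch assumed; all dimensions, names and the particular loss form are hypothetical) runs an LSTM over per-frame CNN features and adds a simple monotonicity-style ranking term that penalizes any drop over time in the score of the ground-truth action class.

```python
import torch
import torch.nn as nn

class LSTMActionScorer(nn.Module):
    """Scores each frame of a clip against a set of action classes."""
    def __init__(self, feat_dim=4096, hidden_dim=512, num_classes=20):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, frame_feats):                 # (batch, time, feat_dim) CNN features
        hidden, _ = self.lstm(frame_feats)          # per-frame hidden states
        return self.classifier(hidden)              # per-frame class scores

def monotonicity_ranking_loss(scores, labels):
    """Penalize any decrease over time in the ground-truth class score (illustrative)."""
    idx = labels.view(-1, 1, 1).expand(-1, scores.size(1), 1)
    gt = scores.gather(2, idx).squeeze(2)           # (batch, time) ground-truth class scores
    drops = (gt[:, :-1] - gt[:, 1:]).clamp(min=0)   # positive wherever the score drops
    return drops.mean()

if __name__ == "__main__":
    model = LSTMActionScorer()
    feats = torch.randn(2, 30, 4096)                # two clips, 30 frames of CNN features each
    labels = torch.tensor([3, 7])
    scores = model(feats)
    cls_loss = nn.CrossEntropyLoss()(scores[:, -1, :], labels)   # classify at the last frame
    loss = cls_loss + 0.5 * monotonicity_ranking_loss(scores, labels)
    loss.backward()
```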

    Exploratory search through large video corpora

    Activity retrieval is a growing field in electrical engineering that specializes in the search and retrieval of relevant activities and events in video corpora. With the affordability and popularity of cameras for government, personal and retail use, the quantity of available video data is rapidly outpacing our ability to reason over it. To empower users to navigate and interact with the contents of these video corpora, we propose a framework for exploratory search that emphasizes activity structure and search space reduction over complex feature representations. Exploratory search is a user-driven process wherein a person provides a system with a query describing the activity, event, or object they are interested in finding. Typically, this description takes the implicit form of one or more exemplar videos, but it can also involve an explicit description. The system returns candidate matches, followed by query refinement and iteration. System performance is judged by the run-time of the system and the precision/recall curve of the query matches returned.

    Scaling is one of the primary challenges in video search. From vast web-video archives like YouTube (1 billion videos and counting) to the 30 million active surveillance cameras shooting an estimated 4 billion hours of footage every week in the United States, trying to find a set of matches can be like looking for a needle in a haystack. Our goal is to create an efficient archival representation of video corpora that can be calculated in real time as video streams in, and that then enables a user to quickly get a set of matching results.

    First, we design a system for rapidly identifying simple queries in large-scale video corpora. Instead of focusing on feature design, our system focuses on the spatiotemporal relationships between those features as a means of disambiguating an activity of interest from background. We define a semantic feature vocabulary of concepts that are both readily extracted from video and easily understood by an operator. As data streams in, features are hashed to an inverted index and retrieved in constant time after the system is presented with a user's query. We take a zero-shot approach to exploratory search: the user manually assembles vocabulary elements like color, speed, size and type into a graph. Given that information, we perform an initial downsampling of the archived data, and design a novel dynamic programming approach based on genome sequencing to search for similar patterns. Experimental results indicate that this approach outperforms other methods for detecting activities in surveillance video datasets.

    Second, we address the problem of representing complex activities that take place over long spans of space and time. Subgraph and graph matching methods have seen limited use in exploratory search because both problems are provably NP-hard. In this work, we render these problems computationally tractable by identifying the maximally discriminative spanning tree (MDST), and using dynamic programming to optimally reduce the archive data based on a custom algorithm for tree matching in attributed relational graphs. We demonstrate the efficacy of this approach on popular surveillance video datasets in several modalities.

    Finally, we design an approach for successive search space reduction in subgraph matching problems. Given a query graph and archival data, our algorithm iteratively selects spanning trees from the query graph that optimize the expected search space reduction at each step until the archive converges. We use this approach to efficiently reason over video surveillance datasets, simulated data, as well as large graphs of protein data.
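    As a rough illustration of the first component, the hashing of semantic vocabulary elements into an inverted index, the sketch below (plain Python; the attribute vocabulary, cell keys and example values are hypothetical, not the system's actual schema) indexes attributes per spatiotemporal cell as data streams in and answers a conjunctive zero-shot query by intersecting posting lists, so candidate cells come back without re-scanning the archive.

```python
from collections import defaultdict

class SemanticInvertedIndex:
    """Toy inverted index over a semantic attribute vocabulary."""
    def __init__(self):
        # attribute value -> set of (video_id, cell_id) postings
        self.postings = defaultdict(set)

    def ingest(self, video_id, cell_id, attributes):
        """Index one spatiotemporal cell as its features are extracted from the stream."""
        for attr in attributes:                      # e.g. "color:red", "speed:fast"
            self.postings[attr].add((video_id, cell_id))

    def query(self, required_attributes):
        """Return cells matching every required attribute (posting-list intersection)."""
        candidate_sets = [self.postings[a] for a in required_attributes]
        if not candidate_sets:
            return set()
        return set.intersection(*candidate_sets)

if __name__ == "__main__":
    index = SemanticInvertedIndex()
    index.ingest("cam01", (120, 4), {"type:person", "color:red", "speed:fast"})
    index.ingest("cam01", (121, 4), {"type:car", "color:red", "speed:slow"})
    index.ingest("cam02", (10, 1), {"type:person", "color:red", "speed:fast"})
    print(index.query({"type:person", "color:red"}))   # both person cells
```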

    A Big Bang Big Crunch Type-2 Fuzzy Logic System for Machine Vision-Based Event Detection and Summarization in Real-world Ambient Assisted Living

    Recent years have witnessed the prevalence and abundance of vision sensors in applications such as security surveillance, healthcare and Ambient Assisted Living (AAL), among others. The aim is to realize intelligent environments capable of detecting users' actions and gestures so that the needed services can be provided automatically and instantly, maximizing user comfort and safety while minimizing energy use. However, it is very challenging to automatically detect important events and human behaviour from vision sensors and summarize them in real time, due to the massive data sizes involved in video analysis and the high level of uncertainty associated with the unstructured real-world environments occupied by various users. Machine-vision-based systems can help detect and summarize important information which cannot be detected by any other sensor; for example, how much water a person drank and whether or not they had something to eat. However, conventional non-fuzzy methods are not robust enough to recognize the various complex types of behaviour in AAL applications. Fuzzy logic systems (FLSs) are an established field of research for robustly handling uncertainties in complicated real-world problems.

    In this thesis, I will present a general recognition and classification framework based on fuzzy logic systems which allows for behaviour recognition and event summarization using 2D/3D video sensors in AAL applications. I started by investigating the use of a 2D CCTV-camera-based system, for which I proposed and developed novel IT2FLS-based methods for silhouette extraction and 2D behaviour recognition that outperform traditional approaches on the publicly available Weizmann human action dataset. I will also present a novel system based on 3D RGB-D vision sensors and Interval Type-2 Fuzzy Logic based Systems (IT2FLSs) generated by the Big Bang Big Crunch (BB-BC) algorithm for the real-time automatic detection and summarization of important events and human behaviour.

    I will present several real-world experiments which were conducted for AAL-related behaviour with various users. It will be shown that the proposed BB-BC IT2FLSs outperform their Type-1 FLS (T1FLS) counterparts as well as other conventional non-fuzzy methods, and that the performance improvement grows as the number of subjects increases. It will also be shown that, by utilizing the recognized output activity together with relevant event descriptions (such as video data, timestamp, location and user identification), detailed events are efficiently summarized and stored in our back-end SQL event database, which provides services including event searching, activity retrieval and high-definition video playback to the front-end user interfaces.
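    As a rough illustration of the interval type-2 machinery the thesis builds on, the sketch below (plain Python; the linguistic labels, membership-function parameters and rule are hypothetical stand-ins, not the BB-BC-optimized system) evaluates lower and upper triangular memberships, i.e. a footprint of uncertainty, and min-combines them into an interval firing strength for a single rule.

```python
def triangular(x, a, b, c):
    """Standard triangular membership function with support [a, c] and peak at b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

class IT2FuzzySet:
    """Interval type-2 fuzzy set defined by a lower and an upper triangular MF."""
    def __init__(self, lower_params, upper_params):
        self.lower_params = lower_params   # (a, b, c) of the lower MF
        self.upper_params = upper_params   # (a, b, c) of the upper MF

    def membership(self, x):
        """Return the (lower, upper) membership interval for input x."""
        return (triangular(x, *self.lower_params),
                triangular(x, *self.upper_params))

def rule_firing_strength(antecedent_intervals):
    """Min t-norm applied separately to the lower and upper bounds."""
    lowers, uppers = zip(*antecedent_intervals)
    return (min(lowers), min(uppers))

if __name__ == "__main__":
    # Hypothetical antecedents: hand-to-cup distance (cm) and hand speed, e.g. for a 'drinking' rule.
    near_cup = IT2FuzzySet(lower_params=(0, 5, 12), upper_params=(0, 5, 18))
    slow_motion = IT2FuzzySet(lower_params=(0, 2, 5), upper_params=(0, 2, 8))
    firing = rule_firing_strength([near_cup.membership(7.0),
                                   slow_motion.membership(3.0)])
    print("IF near_cup AND slow_motion ->", firing)
```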

    Learning a Discriminative Hidden Part Model for Human Action Recognition

    We present a discriminative part-based approach for human action recognition from video sequences using motion features. Our model is based on the recently proposed hidden conditional random field (hCRF) for object recognition. As in hCRF-based object recognition, we model a human action by a flexible constellation of parts conditioned on image observations. Unlike in object recognition, our model combines both large-scale global features and local patch features to distinguish various actions. Our experimental results show that our model is comparable to other state-of-the-art approaches in action recognition. In particular, they demonstrate that combining large-scale global features and local patch features performs significantly better than directly applying the hCRF to local patches alone.
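    As a rough illustration of how global and local evidence can be combined in a part-based score, the sketch below (NumPy; all weights, dimensions and class names are hypothetical, and the pairwise terms of a full hCRF are omitted) adds a root score on a large-scale global motion descriptor to per-patch scores in which each local patch is assigned to its best hidden part.

```python
import numpy as np

def score_action(global_feat, patch_feats, root_w, part_w):
    """
    global_feat: (D_g,)  large-scale motion descriptor for the whole clip
    patch_feats: (P, D_l) local motion-patch descriptors
    root_w:      (D_g,)  root (global) weights for one action class
    part_w:      (H, D_l) weights for H hidden parts of that class
    """
    root_score = float(root_w @ global_feat)
    # Each patch picks the hidden part that explains it best; the max over parts
    # stands in for maximizing over hidden part assignments in a full hCRF.
    part_scores = patch_feats @ part_w.T            # (P, H)
    local_score = float(part_scores.max(axis=1).sum())
    return root_score + local_score

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    global_feat = rng.normal(size=64)
    patch_feats = rng.normal(size=(12, 32))         # 12 local motion patches
    scores = {c: score_action(global_feat, patch_feats,
                              rng.normal(size=64), rng.normal(size=(6, 32)))
              for c in ["walk", "run", "wave"]}
    print(max(scores, key=scores.get), scores)      # highest-scoring (toy) action class
```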