
    Action Recognition Using Particle Flow Fields

    In recent years, research in human action recognition has advanced on multiple fronts to address various types of actions, including simple, isolated actions in staged data (e.g., KTH dataset), complex actions (e.g., Hollywood dataset), and naturally occurring actions in surveillance videos (e.g., VIRAT dataset). Several techniques, including those based on gradients, flow, and interest points, have been developed for their recognition. Most perform very well on standard action recognition datasets but fail to produce similar results on more complex, large-scale datasets. Action recognition on large categories of unconstrained videos taken from the web is a very challenging problem compared to datasets like KTH (six actions), IXMAS (thirteen actions), and Weizmann (ten actions). Challenges such as camera motion, different viewpoints, huge interclass variations, cluttered backgrounds, occlusions, bad illumination conditions, and the poor quality of web videos cause the majority of state-of-the-art action recognition approaches to fail. An increasing number of categories and the inclusion of actions with high confusion also increase the difficulty of the problem. The approach taken to solve this action recognition problem depends primarily on the dataset and the possibility of detecting and tracking the object of interest. In this dissertation, a new method for video representation is proposed and three new approaches to perform action recognition in different scenarios using varying prerequisites are presented. The prerequisites have decreasing levels of difficulty to obtain: 1) the scenario requires human detection and tracking to perform action recognition; 2) the scenario requires background and foreground separation to perform action recognition; and 3) no pre-processing is required for action recognition. First, we propose a new video representation using optical flow and particle advection. The proposed "Particle Flow Field" (PFF) representation has been used to generate motion descriptors and tested in a Bag of Video Words (BoVW) framework on the KTH dataset. We show that particle flow fields have better performance than other low-level video representations, such as 2D-Gradients, 3D-Gradients, and optical flow. Second, we analyze the performance of the state-of-the-art technique based on the histogram of oriented 3D-Gradients in spatio-temporal volumes, where human detection and tracking are required. We use the proposed particle flow field and show superior results compared to the histogram of oriented 3D-Gradients in spatio-temporal volumes. The proposed method, when used for human action recognition, needs only human detection and does not necessarily require human tracking or figure-centric bounding boxes. It has been tested on the KTH (six actions), Weizmann (ten actions), and IXMAS (thirteen actions, four different views) action recognition datasets. Third, we propose using the scene context information obtained from moving and stationary pixels in the key frames, in conjunction with motion descriptors obtained using the Bag of Words framework, to solve the action recognition problem on a large (50 actions) dataset with videos from the web. We perform a combination of early and late fusion on multiple features to handle the huge number of categories. We demonstrate that scene context is a very important feature for performing action recognition on huge datasets. The proposed method needs separation of moving and stationary pixels, and does not require any kind of video stabilization, person detection, or tracking and pruning of features. Our approach obtains good performance on a huge number of action categories. It has been tested on the UCF50 dataset with 50 action categories, which is an extension of the UCF YouTube Action (UCF11) dataset containing 11 action categories. We also tested our approach on the KTH and HMDB51 datasets for comparison. Finally, we focus on solving practical problems in representing actions by bags of spatio-temporal features (i.e., cuboids), which have proven valuable for action recognition in recent literature. We observed that the visual vocabulary based (bag of video words) method suffers from many drawbacks in practice, such as: (i) it requires an intensive training stage to obtain good performance; (ii) it is sensitive to the vocabulary size; (iii) it is unable to cope with incremental recognition problems; (iv) it is unable to recognize simultaneous multiple actions; and (v) it is unable to perform recognition frame by frame. In order to overcome these drawbacks, we propose a framework to index large-scale motion features using a Sphere/Rectangle-tree (SR-tree) for incremental action detection and recognition. The recognition comprises the following two steps: 1) recognizing the local features by non-parametric nearest neighbor (NN), and 2) using a simple voting strategy to label the action. It can also provide localization of the action. Since it does not require feature quantization, it can efficiently grow the feature tree by adding features from new training actions or categories. Our method provides an effective way for practical incremental action recognition. Furthermore, it can handle large-scale datasets because the SR-tree is a disk-based data structure. We tested our approach on two publicly available datasets, the KTH dataset and the IXMAS multi-view dataset, and achieved promising results.
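    For readers unfamiliar with particle advection, the following is a minimal sketch of the idea behind a particle-flow-style representation, assuming dense optical flow computed with OpenCV; the grid spacing, nearest-pixel flow sampling, and lack of particle reinitialization are simplifications for illustration, not the dissertation's exact formulation.

```python
import numpy as np
import cv2

def particle_flow_field(frames, step=4):
    """Advect a grid of particles through dense optical flow and record
    each particle's per-frame displacement (a rough PFF-style sketch)."""
    h, w = frames[0].shape[:2]
    ys, xs = np.mgrid[0:h:step, 0:w:step]
    particles = np.stack([xs, ys], axis=-1).reshape(-1, 2).astype(np.float32)
    trajectories = [particles.copy()]
    prev = cv2.cvtColor(frames[0], cv2.COLOR_BGR2GRAY)
    for frame in frames[1:]:
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        flow = cv2.calcOpticalFlowFarneback(prev, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        # Sample the flow at each particle's current position (nearest pixel).
        x = np.clip(particles[:, 0], 0, w - 1).astype(int)
        y = np.clip(particles[:, 1], 0, h - 1).astype(int)
        particles = particles + np.stack([flow[y, x, 0], flow[y, x, 1]], axis=-1)
        trajectories.append(particles.copy())
        prev = gray
    # Per-particle displacement vectors over time form the motion descriptor.
    return np.diff(np.stack(trajectories, axis=0), axis=0)
```

In the pipeline described above, such per-particle motion descriptors would then be quantized into a Bag of Video Words vocabulary before classification.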

    Symbolic and Deep Learning Based Data Representation Methods for Activity Recognition and Image Understanding at Pixel Level

    Efficient representation of large amounts of data, particularly images and video, helps in the analysis, processing, and overall understanding of the data. In this work, we present two frameworks that encapsulate the information present in such data. First, we present an automated symbolic framework to recognize particular activities in real time from videos. The framework uses regular expressions for symbolically representing (possibly infinite) sets of motion characteristics obtained from a video. It is a uniform framework that handles trajectory-based and periodic articulated activities and provides polynomial-time graph algorithms for fast recognition. The regular expressions representing motion characteristics can either be provided manually or learnt automatically from positive and negative examples of strings (that describe dynamic behavior) using offline automata learning frameworks. Confidence measures are associated with recognitions using the Levenshtein distance between a string representing a motion signature and the regular expression describing an activity. We have used our framework to recognize trajectory-based activities like vehicle turns (U-turns, left and right turns, and K-turns), vehicle start and stop, person running and walking, and periodic articulated activities like digging, waving, boxing, and clapping in videos from the VIRAT public dataset, the KTH dataset, and a set of videos obtained from YouTube. Next, we present a core sampling framework that is able to use activation maps from several layers of a Convolutional Neural Network (CNN) as features to another neural network using transfer learning to provide an understanding of an input image. The intermediate map responses of a CNN contain information about an image that can be used to extract contextual knowledge about it. Our framework creates a representation that combines features from the test data and the contextual knowledge gained from the responses of a pretrained network, processes it, and feeds it to a separate Deep Belief Network. We use this representation to extract more information from an image at the pixel level, hence gaining understanding of the whole image. We experimentally demonstrate the usefulness of our framework using a pretrained VGG-16 model to perform segmentation on the BAERI dataset of Synthetic Aperture Radar (SAR) imagery and the CAMVID dataset. Using this framework, we also reconstruct images by removing noise from noisy character images. The reconstructed images are encoded using quadtrees, which can be an efficient representation for learning from sparse features. Handwritten character images are quite susceptible to noise, so preprocessing stages that make the raw data cleaner can improve the efficacy of their use. We improve upon the efficiency of probabilistic quadtrees by using a pixel-level classifier to extract the character pixels and remove noise from the images. The pixel-level denoiser uses a pretrained CNN trained on a large image dataset and uses transfer learning to aid the reconstruction of characters. In this work, we primarily deal with classification of noisy characters; we create noisy versions of the handwritten Bangla Numeral and Basic Character datasets and use them, along with the Noisy MNIST dataset, to demonstrate the usefulness of our approach.
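    As an illustration of the symbolic matching idea, here is a minimal sketch that checks a string of discretized motion symbols against a regular expression for a left turn and falls back to a Levenshtein-distance-based confidence; the motion alphabet, the pattern, and the template string are hypothetical, not the thesis's actual grammar.

```python
import re

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1,
                                     prev + (ca != cb))
    return dp[-1]

# Hypothetical motion alphabet: F = forward, L = slight left turn.
LEFT_TURN = re.compile(r"F+L{3,}F+")          # illustrative pattern only

def recognize(symbols, pattern=LEFT_TURN, template="FFFLLLFFF"):
    if pattern.fullmatch(symbols):
        return 1.0                             # exact match of the activity
    # Otherwise, derive a distance-based confidence in [0, 1].
    d = levenshtein(symbols, template)
    return max(0.0, 1.0 - d / max(len(symbols), len(template)))

print(recognize("FFFFLLLLFF"))   # matches the pattern -> 1.0
print(recognize("FFFLLFFF"))     # near miss -> confidence below 1.0
```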

    Interactive tracking and action retrieval to support human behavior analysis

    The goal of this thesis is to develop a set of tools for continuous tracking of behavioral phenomena in videos to support human behavior study. Current standard practices for extracting useful behavioral information from a video are typically difficult to replicate and require a lot of human time. For example, extensive training is typically required for a human coder to reliably code a particular behavior/interaction. Also, manual coding typically takes much longer than the actual length of the video (e.g., it can take up to 6 times the actual length of the video to do human-assisted single-object tracking). The time-intensive nature of this process (due to the need for expert training and manual coding) puts a strong burden on the research process. In fact, it is not uncommon for an institution that heavily uses videos for behavioral research to have a massive backlog of unprocessed video data. To address this issue, I have developed an efficient behavior retrieval and interactive tracking system. These tools allow behavioral researchers/clinicians to more easily extract relevant behavioral information and more objectively analyze behavioral data from videos. I have demonstrated that my behavior retrieval system achieves state-of-the-art performance for retrieving stereotypical behaviors of individuals with autism in real-world video data captured in a classroom setting. I have also demonstrated that my interactive tracking system is able to produce high-precision tracking results with less human effort compared to the state of the art. I further show that by leveraging the tracking results, we can extract an objective measure based on proximity between people that is useful for analyzing certain social interactions. I validated this new measure by showing that we can use it to predict qualitative expert ratings in the Strange Situation (a procedure for studying infant attachment security), a quantity that is difficult to obtain due to the difficulty of training human experts.
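    A minimal sketch of the kind of proximity measure that can be derived from tracking output, assuming each track is a sequence of per-frame (x, y, w, h) bounding boxes; normalizing by mean box height as a rough scale proxy is an assumption made here for illustration, not necessarily the thesis's exact formulation.

```python
import numpy as np

def center(box):
    """Center of an (x, y, w, h) bounding box."""
    x, y, w, h = box
    return np.array([x + w / 2.0, y + h / 2.0])

def proximity_series(track_a, track_b):
    """Per-frame distance between two tracked people, scaled by mean box
    height as a crude proxy for scene scale."""
    dists = []
    for box_a, box_b in zip(track_a, track_b):
        scale = (box_a[3] + box_b[3]) / 2.0
        dists.append(np.linalg.norm(center(box_a) - center(box_b)) / scale)
    return np.array(dists)

# Example: two short tracks of (x, y, w, h) boxes, one entry per frame.
infant = [(100, 200, 40, 80), (110, 200, 40, 80), (130, 200, 40, 80)]
parent = [(300, 180, 60, 120), (290, 180, 60, 120), (280, 180, 60, 120)]
print(proximity_series(infant, parent))  # decreasing values = approaching
```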

    Novel Texture-based Probabilistic Object Recognition and Tracking Techniques for Food Intake Analysis and Traffic Monitoring

    More complex image understanding algorithms are increasingly practical in a host of emerging applications. Object tracking has value in surveillance and data farming, and object recognition has applications in surveillance, data management, and industrial automation. In this work we introduce an object recognition application in automated nutritional intake analysis and a tracking application intended for surveillance in low-quality videos. Automated food recognition is useful for personal health applications as well as nutritional studies used to improve public health or inform lawmakers. We introduce a complete, end-to-end system for automated food intake measurement. Images taken by a digital camera are analyzed, plates and food are located, the food type is determined by a neural network, the distance and angle of the food are determined and its 3D volume estimated, the results are cross-referenced with a nutritional database, and before- and after-meal photos are compared to determine nutritional intake. We compare against contemporary systems and provide detailed experimental results of our system's performance. Our tracking systems consider the problem of car and human tracking in potentially very low-quality surveillance videos, from a fixed camera or a high-flying unmanned aerial vehicle (UAV). Our agile framework switches among different simple trackers to find the most applicable tracker based on the object and video properties. Our MAPTrack is an evolution of the agile tracker that uses soft switching to optimize between multiple pertinent trackers, and tracks objects based on motion, appearance, and positional data. In both cases we provide comparisons against trackers intended for similar applications (i.e., trackers that stress robustness in bad conditions), with competitive results.
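    A minimal sketch of the soft-switching idea, assuming each candidate tracker exposes an update step returning a bounding box and a confidence score; the confidence-weighted averaging below is an illustrative guess, not MAPTrack's actual fusion rule.

```python
import numpy as np

class SoftSwitchTracker:
    """Blend the outputs of several simple trackers, weighted by confidence."""

    def __init__(self, trackers):
        self.trackers = trackers  # each has .update(frame) -> (box, confidence)

    def update(self, frame):
        boxes, confs = [], []
        for tracker in self.trackers:
            box, conf = tracker.update(frame)
            boxes.append(box)
            confs.append(conf)
        weights = np.array(confs, dtype=float)
        if weights.sum() == 0:
            weights = np.ones_like(weights)      # fall back to equal weighting
        weights /= weights.sum()
        # Soft switch: confidence-weighted average of the candidate boxes.
        return tuple(np.average(np.array(boxes, dtype=float),
                                axis=0, weights=weights))
```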

    Aerial detection of ground moving objects

    Automatic detection of ground moving objects (GMOs) from aerial camera platforms (ACPs) is essential in many video processing applications, both civilian and military. However, the extremely small size of GMOs and the continuous shaky motion of ACPs present challenges for traditional detection methods. In particular, existing detection methods fail to balance high detection accuracy and real-time performance. This thesis investigates the problem of GMO detection from ACPs and addresses the challenges and drawbacks that exist in traditional detection methods. The underlying assumption used in this thesis is based on principal component pursuit (PCP), in which the background of an aerial video is modelled as a low-rank matrix and the moving objects as a sparse matrix corrupting this video. The research in this thesis investigates the proposed problem in three directions: (1) handling the shaky motion of ACPs robustly with minimal computational cost, (2) improving the detection accuracy and radically lowering false detections via a penalization term, and (3) extending PCP's formulation to achieve adequate real-time performance. In this thesis, a series of novel algorithms are proposed to show the evolution of our research towards the development of KR-LNSP, a novel robust detection method characterized by high detection accuracy, low computational cost, adaptability to the shaky motion of ACPs, and adequate real-time performance. Each of the proposed algorithms is intensively evaluated using different challenging datasets and compared with current state-of-the-art methods.
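    For context, here is a compact sketch of the standard PCP decomposition that underlies this line of work, solved with the common inexact augmented Lagrangian iteration; the parameter defaults follow the usual robust PCA literature, not necessarily KR-LNSP. Each column of D would be a vectorized video frame, so L recovers the background and S the moving objects.

```python
import numpy as np

def shrink(X, tau):
    """Element-wise soft thresholding."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def svd_threshold(X, tau):
    """Singular value thresholding."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(shrink(s, tau)) @ Vt

def pcp(D, lam=None, tol=1e-7, max_iter=500):
    """Decompose D into low-rank L (background) + sparse S (moving objects)
    by solving min ||L||_* + lam * ||S||_1 subject to D = L + S."""
    m, n = D.shape
    lam = lam or 1.0 / np.sqrt(max(m, n))
    norm_D = np.linalg.norm(D, ord="fro")
    S = np.zeros_like(D)
    Y = np.zeros_like(D)
    mu = 1.25 / np.linalg.norm(D, ord=2)   # spectral norm of D
    mu_bar = mu * 1e7
    for _ in range(max_iter):
        L = svd_threshold(D - S + Y / mu, 1.0 / mu)
        S = shrink(D - L + Y / mu, lam / mu)
        Y = Y + mu * (D - L - S)
        mu = min(mu * 1.5, mu_bar)
        if np.linalg.norm(D - L - S, ord="fro") / norm_D < tol:
            break
    return L, S
```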

    Automatic Analysis of People in Thermal Imagery


    Model-driven and Data-driven Methods for Recognizing Compositional Interactions from Videos

    The ability to accurately understand how humans interact with their surroundings is critical for many vision-based intelligent systems. Compared to simple atomic actions (e.g., raise hand), many interactions found in our daily lives are defined as a composition of an atomic action with a variety of arguments (e.g., pick up a pen). Despite recent progress in the literature, there still remain fundamental challenges unique to recognizing interactions from videos. First, most of the action recognition literature assumes a problem setting where a pre-defined set of action labels is supported by a large and relatively balanced set of training examples for those labels. There are many realistic cases where this data assumption breaks down, either because the application demands fine-grained classification of a potentially combinatorial number of activities, and/or because the problem at hand is an "open-set" problem where new labels may be defined at test time. Second, many deep video models simply represent video as a three-dimensional tensor and ignore the differences between spatial and temporal dimensions during the representation learning stage. As a result, data-driven bottom-up action models frequently over-fit to the static content of the video and fail to accurately capture the dynamic changes in relations among actors in the video. In this dissertation, we address the aforementioned challenges of recognizing fine-grained interactions from videos by developing solutions that explicitly represent interactions as compositions of simpler static and dynamic elements. By exploiting the power of composition, our "detection by description" framework expresses a very rich space of interactions using only a small set of static visual attributes and a few dynamic patterns. A definition of an interaction is constructed on the fly from first-principles state machines that leverage bottom-up deep-learned components such as object detectors. Compared to existing model-driven methods for video understanding, we introduce the notion of dynamic action signatures, which allows a practitioner to express the expected temporal behavior of various elements of an interaction. We show that our model-driven approach using dynamic action signatures outperforms other zero-shot methods on multiple public action classification benchmarks and even some fully supervised baselines under realistic problem settings. Next, we extend our approach to a setting where the static and dynamic action signatures are not given by the user but rather learned from data. We do so by borrowing ideas from data-driven, two-stream action recognition and model-driven, structured human-object interaction detection. The key idea behind our approach is that we can learn the static and dynamic decomposition of an interaction using a dual-pathway network by leveraging object detections. To do so, we introduce the Motion Guided Attention Fusion mechanism, which transfers the motion-centric features learned using object detections to the representation learned from the RGB-based motion pathway. Finally, we conclude with a comprehensive case study on vision-based activity detection applied to video surveillance. Using the methods presented in this dissertation, we step towards an intelligent vision system that can detect a particular interaction instance given only a description from a user, departing from the need for massive datasets of labeled training videos. Moreover, as our framework naturally defines a decompositional structure of activities into detectable static/visual attributes, we show that we can simulate the necessary training data to acquire attribute detectors when the desired detector is otherwise unavailable. Our approach achieves performance that is competitive with or superior to existing approaches for recognizing fine-grained interactions from realistic videos.
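    A minimal sketch of the "detection by description" intuition, assuming per-frame object detections are already available; the "pick up" state machine, the attribute names, and the proximity threshold are purely illustrative, not the dissertation's actual dynamic action signatures.

```python
from dataclasses import dataclass

@dataclass
class Detection:
    label: str
    x: float
    y: float

def near(a, b, thresh=50.0):
    """Crude spatial-proximity attribute on two detections."""
    return abs(a.x - b.x) < thresh and abs(a.y - b.y) < thresh

def detect_pickup(frames, obj_label="pen"):
    """Tiny first-principles state machine: a hand comes into contact with
    the object, and the contact persists across frames -> 'pick up'."""
    state = "IDLE"
    for dets in frames:  # dets: list of Detection objects for one frame
        hand = next((d for d in dets if d.label == "hand"), None)
        obj = next((d for d in dets if d.label == obj_label), None)
        if hand is None or obj is None:
            continue
        if state == "IDLE" and near(hand, obj):
            state = "CONTACT"
        elif state == "CONTACT" and near(hand, obj):
            return True          # sustained contact: interaction detected
        elif not near(hand, obj):
            state = "IDLE"
    return False
```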