    Fast human behavior analysis for scene understanding

    Human behavior analysis has become an active topic of great interest and relevance for a number of applications and areas of research. Research in recent years has been driven considerably by growing levels of criminal behavior in large urban areas and an increase in terrorist actions. Accurate behavior analysis has also been applied to sports analysis systems and is emerging in healthcare. Compared to conventional action recognition used in security applications, human behavior analysis techniques designed for embedded applications should satisfy the following technical requirements: (1) behavior analysis should provide scalable and robust results; (2) high processing efficiency to achieve (near) real-time operation on low-cost hardware; (3) extensibility to multiple-camera setups, including 3-D modeling, to facilitate human behavior understanding and description of various events. The key to our problem statement is that we intend to improve behavior-analysis performance while preserving the efficiency of the designed techniques, to allow implementation in embedded environments. More specifically, we look into (1) fast multi-level algorithms incorporating specific domain knowledge, and (2) 3-D configuration techniques for overall enhanced performance. Where possible, we examine how current behavior-analysis techniques can be improved in accuracy and scalability. To fulfill the above technical requirements and tackle the research problems, we propose a flexible behavior-analysis framework consisting of three processing layers: (1) pixel-based processing (background modeling with pixel labeling), (2) object-based modeling (human detection, tracking and posture analysis), and (3) event-based analysis (semantic event understanding). In Chapter 3, we specifically contribute to the analysis of individual human behavior. A novel body representation is proposed for posture classification based on a silhouette feature. Only pure binary-shape information is used for posture classification, without texture/color or any explicit body models. To this end, we have studied an efficient HV-PCA shape-based descriptor with temporal modeling, which achieves a posture-recognition accuracy of about 86% and outperforms other existing proposals. As our human-motion scheme is efficient and fast (6-8 frames/second), it enables a fast surveillance system or further analysis of human behavior. In addition, a body-part detection approach is presented, in which color and body ratio are combined to provide clues for human body detection and classification; the conventional assumption of an upright body posture is not required. Afterwards, we design and construct a specific framework for fast algorithms and apply it in two applications: tennis sports analysis and surveillance. Chapter 4 deals with tennis sports analysis and presents an automatic real-time system for multi-level analysis of tennis video sequences. First, we employ a 3-D camera model to bridge the pixel level, object level and scene level of tennis sports analysis. Second, a weighted linear model combining visual cues in the real-world domain is proposed to identify various events (a sketch of this kind of model follows this abstract). The experimentally found event-extraction rate of the system is about 90%. Also, audio signals are combined to enhance the scene-analysis performance.
The complete proposed application is efficient enough to obtain real-time or near real-time performance (2-3 frames/second at 720×576 resolution, and 5-7 frames/second at 320×240 resolution, on a P-IV PC running at 3 GHz). Chapter 5 addresses surveillance and presents a full real-time behavior-analysis framework, featuring layers at the pixel, object, event and visualization levels. More specifically, this framework captures human motion, classifies posture, infers semantic events by exploiting interaction modeling, and performs 3-D scene reconstruction. We have introduced our system design based on a specific software architecture, employing the well-known "4+1" view model. In addition, the human behavior analysis algorithms are directly designed for real-time operation and embedded in an experimental runtime AV content-analysis architecture. This executable system is designed to be generic for multiple streaming applications with component-based architectures. To evaluate the performance, we have applied this networked system in a single-camera setup. The experimental platform operates with two Pentium Quad-core engines (2.33 GHz) and 4 GB of memory. Performance evaluations have shown that this networked framework is efficient and fast (13-15 frames/second) for monocular video sequences. Moreover, a dual-camera setup is tested within the behavior-analysis framework. After automatic camera calibration, 3-D reconstruction and communication among the different cameras are achieved. The extra view in the multi-camera setup improves human tracking and event detection in case of occlusion. This extension to multiple-view fusion improves the event-based semantic analysis by 8.3-16.7% in accuracy. The detailed studies of two experimental intelligent applications, i.e., tennis sports analysis and surveillance, have proven their value in several extensive tests within the framework of the European Candela and Cantata ITEA research programs, where our proposed system has demonstrated competitive performance with respect to accuracy and efficiency.
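
    As an illustration of the weighted linear event model mentioned above, the following Python sketch scores normalized real-world cues against per-event weight vectors and picks the best-scoring event. The cue names, weight values and event labels are hypothetical assumptions for illustration; they are not taken from the thesis.

        import numpy as np

        # Illustrative real-world-domain cues for tennis analysis (assumed names):
        # normalized player position, ball landing point, player speed, audio energy.
        def event_score(cues: np.ndarray, weights: np.ndarray) -> float:
            """Linear combination of normalized visual/audio cues."""
            return float(np.dot(weights, cues))

        def classify_event(cues, event_weights):
            """Pick the event whose weighted cue combination scores highest."""
            scores = {name: event_score(cues, w) for name, w in event_weights.items()}
            return max(scores, key=scores.get), scores

        # Per-event weight vectors (toy values, not from the thesis):
        event_weights = {
            "serve":          np.array([0.5, 0.2, 0.1, 0.2]),
            "baseline_rally": np.array([0.2, 0.4, 0.2, 0.2]),
            "net_approach":   np.array([0.1, 0.2, 0.5, 0.2]),
        }

        cues = np.array([0.8, 0.3, 0.2, 0.6])  # cues normalized to [0, 1]
        label, scores = classify_event(cues, event_weights)
        print(label)

    A model of this form keeps the per-frame cost to a handful of dot products, which is consistent with the real-time constraints reported above.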

    Weakly Labeled Action Recognition and Detection

    Research in human action recognition strives to develop increasingly generalized methods that are robust to intra-class variability and inter-class ambiguity. Recent years have seen tremendous strides in improving recognition accuracy on ever larger and more complex benchmark datasets, comprising realistic actions in videos captured in the wild. Unfortunately, the all-encompassing, dense, global representations that bring about such improvements often benefit from inherent characteristics, specific to datasets and classes, that do not necessarily reflect knowledge about the entity to be recognized. This results in specific models that perform well within datasets but generalize poorly. Furthermore, training supervised action recognition and detection methods requires many precise spatio-temporal manual annotations to achieve good recognition and detection accuracy. For instance, current deep learning architectures require millions of accurately annotated videos to learn robust action classifiers. However, such annotations are quite difficult to obtain. In the first part of this dissertation, we explore the reasons for poor classifier performance when tested on novel datasets, and quantify the effect of scene backgrounds on action representations and recognition. We attempt to address the problem of recognizing human actions while training and testing on distinct datasets, when test videos are neither labeled nor available during training. In this scenario, learning a joint vocabulary or domain-transfer techniques are not applicable. We perform different types of partitioning of the GIST feature space for several datasets and compute measures of background scene complexity, as well as the extent to which scenes are helpful in action classification. We then propose a new process to obtain a measure of confidence in each pixel of the video being a foreground region, using motion, appearance, and saliency together in a 3D Markov Random Field (MRF) based framework. We also propose multiple ways to exploit the foreground confidence: to improve the bag-of-words vocabulary, the histogram representation of a video, and a novel histogram-decomposition-based representation and kernel (a minimal sketch of this confidence weighting follows this abstract). The above-mentioned work provides a probability of each pixel belonging to the actor; however, it does not give the precise spatio-temporal location of the actor. Furthermore, the above framework would require precise spatio-temporal manual annotations to train an action detector, and manual annotations in videos are laborious, require several annotators and contain human biases. Therefore, in the second part of this dissertation, we propose a weakly labeled approach to automatically obtain spatio-temporal annotations of actors in action videos. We first obtain a large number of action proposals in each video. To capture a few of the most representative action proposals in each video and avoid processing thousands of them, we rank them using optical flow and saliency in a 3D-MRF based framework and select a few proposals using a MAP-based proposal subset selection method. We demonstrate that this ranking preserves the high-quality action proposals. Several such proposals are generated for each video of the same action. Our next challenge is to iteratively select one proposal from each video so that all proposals are globally consistent. We formulate this as a Generalized Maximum Clique Problem (GMCP) using shape, global and fine-grained similarity of proposals across the videos.
The output of our method is the most action-representative proposal from each video. Our method can also annotate multiple instances of the same action in a video. Moreover, action detection experiments using annotations obtained by our method and several baselines demonstrate the superiority of our approach. The above-mentioned annotation method uses multiple videos of the same action. Therefore, in the third part of this dissertation, we tackle the problem of spatio-temporal action localization in a video without assuming the availability of multiple videos or any prior annotations. The action is localized by employing images downloaded from the Internet using the action label. Given web images, we first dampen image noise using a random walk and avoid distracting backgrounds within images using image action proposals. Then, given a video, we generate multiple spatio-temporal action proposals. We suppress camera- and background-generated proposals by exploiting optical-flow gradients within proposals. To obtain the most action-representative proposals, we propose to reconstruct action proposals in the video by leveraging the action proposals in images. Moreover, we preserve the temporal smoothness of the video and reconstruct all proposal bounding boxes jointly, using constraints that push the coefficients for each bounding box toward a common consensus, thus enforcing coefficient similarity across multiple frames. We solve this optimization problem using a variant of the two-metric projection algorithm. Finally, the video proposal that has the lowest reconstruction cost and is motion salient is used to localize the action. Our method is not only applicable to trimmed videos, but can also be used for action localization in untrimmed videos, which is a very challenging problem. Finally, in the fourth part of this dissertation, we propose a novel approach to generate a few properly ranked action proposals from a large number of noisy proposals. The proposed approach begins by dividing each proposal into sub-proposals. We assume that the quality of a proposal remains the same within each sub-proposal. We then employ a graph optimization method to recombine the sub-proposals of all action proposals in a single video, in order to optimally build new action proposals and rank them by the combined node and edge scores. For an untrimmed video, we first divide the video into shots and then build the above-mentioned graph within each shot. Our method generates a few ranked proposals that can be better than all the existing underlying proposals. Our experimental results validate that properly ranked action proposals can significantly boost action detection results. Our extensive experimental results on different challenging and realistic action datasets, comparisons with several competitive baselines, and detailed analysis of each step of the proposed methods validate the proposed ideas and frameworks.
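
    To make the foreground-confidence idea concrete, here is a minimal Python sketch of a confidence-weighted bag-of-words histogram: each local feature votes for its visual word in proportion to the probability that it lies on the actor. The input names and format are assumptions for illustration, not the dissertation's code.

        import numpy as np

        def weighted_bow_histogram(assignments, fg_confidence, vocab_size):
            """Bag-of-words histogram with soft, confidence-weighted votes.

            assignments: (N,) nearest visual-word index for each local feature
            fg_confidence: (N,) probability that the feature lies on the actor
            """
            hist = np.zeros(vocab_size)
            for word, conf in zip(assignments, fg_confidence):
                hist[word] += conf  # soft weighting instead of a hard count
            total = hist.sum()
            return hist / total if total > 0 else hist

        # Toy usage: 5 local features, a 4-word vocabulary.
        words = np.array([0, 2, 2, 3, 1])
        conf = np.array([0.9, 0.1, 0.8, 0.5, 0.2])
        print(weighted_bow_histogram(words, conf, vocab_size=4))

    Background features (low confidence) then contribute little to the video representation, which is the intended effect of the 3D-MRF foreground confidence.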

    Automatic annotation of tennis games: An integration of audio, vision, and learning

    Fully automatic annotation of tennis games from broadcast video is a task with great potential but enormous challenges. In this paper we describe our approach to this task, which integrates computer vision, machine listening, and machine learning. At the low-level processing stage, we improve upon our previously proposed state-of-the-art tennis-ball tracking algorithm and employ audio signal processing techniques to detect key events and construct features for classifying those events. At the high-level analysis stage, we model event classification as a sequence-labelling problem, and investigate four machine learning techniques using simulated event sequences. Finally, we evaluate our proposed approach on three real-world tennis games, and discuss the interplay between audio, vision and learning. To the best of our knowledge, our system is the only one that can annotate tennis games at such a detailed level.
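
    The abstract names four sequence-labelling techniques without committing to one here; as one standard instance of the problem setup, the sketch below decodes a sequence of observed audio/visual symbols into hidden event labels with a hidden Markov model and the Viterbi algorithm. All probabilities are toy values, not fitted to tennis data.

        import numpy as np

        def viterbi(obs, start_p, trans_p, emit_p):
            """Most likely hidden event sequence for observed symbol indices."""
            n_states, T = len(start_p), len(obs)
            logv = np.zeros((T, n_states))
            back = np.zeros((T, n_states), dtype=int)
            logv[0] = np.log(start_p) + np.log(emit_p[:, obs[0]])
            for t in range(1, T):
                for s in range(n_states):
                    cand = logv[t - 1] + np.log(trans_p[:, s])
                    back[t, s] = int(np.argmax(cand))
                    logv[t, s] = cand[back[t, s]] + np.log(emit_p[s, obs[t]])
            path = [int(np.argmax(logv[-1]))]
            for t in range(T - 1, 0, -1):     # follow back-pointers
                path.append(back[t, path[-1]])
            return path[::-1]

        # Toy demo: 2 hidden events ("hit", "no-hit"), 3 observation symbols.
        start = np.array([0.6, 0.4])
        trans = np.array([[0.7, 0.3],         # trans[i, j] = P(next=j | current=i)
                          [0.4, 0.6]])
        emit = np.array([[0.5, 0.4, 0.1],
                         [0.1, 0.3, 0.6]])
        print(viterbi([0, 1, 2], start, trans, emit))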

    Exemplar-Based Recognition of Human-Object Interactions


    An empirical study of inter-concept similarities in multimedia ontologies

    Generic concept detection has been a widely studied topic in recent research on multimedia analysis and retrieval, but the issue of how to exploit the structure of a multimedia ontology, as well as different inter-concept relations, has not received similar attention. In this paper, we present results from our empirical analysis of different types of similarity among semantic concepts in two multimedia ontologies, LSCOM-Lite and CDVP-206. The results suggest that the proposed methods can provide insight into the existing inter-concept relations within an ontology and help select the most useful set of concepts and hierarchical relations. Such an analysis can be utilized in various tasks, such as building more reliable concept detectors and designing large-scale ontologies.
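
    As an example of one similarity type that such an analysis might compute, the sketch below derives a Jaccard co-occurrence similarity between concepts from a binary annotation matrix. The paper studies several similarity types; this particular measure and the input format are assumptions for illustration.

        import numpy as np

        def jaccard_concept_similarity(anno):
            """Pairwise Jaccard similarity between concepts.

            anno: (n_samples, n_concepts) binary matrix; anno[i, c] = 1 if
            concept c is annotated in sample i.
            """
            anno = anno.astype(float)
            inter = anno.T @ anno                    # co-occurrence counts
            counts = anno.sum(axis=0)
            union = counts[:, None] + counts[None, :] - inter
            with np.errstate(invalid="ignore", divide="ignore"):
                return np.where(union > 0, inter / union, 0.0)

        # Toy usage: 4 annotated samples, 3 concepts.
        anno = np.array([[1, 1, 0],
                         [1, 0, 0],
                         [0, 1, 1],
                         [1, 1, 0]])
        print(jaccard_concept_similarity(anno))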

    The THUMOS Challenge on Action Recognition for Videos "in the Wild"

    Automatically recognizing and localizing a wide range of human actions is of crucial importance for video understanding. Towards this goal, the THUMOS challenge was introduced in 2013 to serve as a benchmark for action recognition. Until then, video action recognition, including the THUMOS challenge, had focused primarily on the classification of pre-segmented (i.e., trimmed) videos, which is an artificial task. In THUMOS 2014, we elevated action recognition to a more practical level by introducing temporally untrimmed videos. These also include 'background videos' which share similar scenes and backgrounds with action videos, but are devoid of the specific actions. The three editions of the challenge organized in 2013-2015 have made THUMOS a common benchmark for action classification and detection, and the annual challenge is widely attended by teams from around the world. In this paper we describe the THUMOS benchmark in detail and give an overview of the data collection and annotation procedures. We present the evaluation protocols used to quantify results in the two THUMOS tasks of action classification and temporal detection. We also present results of submissions to the THUMOS 2015 challenge and review the participating approaches. Additionally, we include a comprehensive empirical study evaluating the differences in action recognition between trimmed and untrimmed videos, and how well methods trained on trimmed videos generalize to untrimmed videos. We conclude by proposing several directions and improvements for future THUMOS challenges. Preprint submitted to Computer Vision and Image Understanding.
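
    Temporal detection benchmarks of this kind typically score a predicted segment against ground truth by temporal intersection-over-union (tIoU): a prediction counts as correct when its tIoU with a ground-truth instance exceeds a threshold, which then feeds per-class average precision. A minimal sketch follows; the 0.5 threshold is a commonly used value assumed here, not quoted from the paper.

        def temporal_iou(seg_a, seg_b):
            """Intersection-over-union of two (start, end) segments in seconds."""
            inter = max(0.0, min(seg_a[1], seg_b[1]) - max(seg_a[0], seg_b[0]))
            union = (seg_a[1] - seg_a[0]) + (seg_b[1] - seg_b[0]) - inter
            return inter / union if union > 0 else 0.0

        prediction, ground_truth = (2.0, 7.0), (4.0, 9.0)
        tiou = temporal_iou(prediction, ground_truth)
        print(tiou, tiou >= 0.5)  # ~0.429 -> not a match at the 0.5 threshold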

    Weakly supervised learning of interactions between humans and objects

    We introduce a weakly supervised approach for learning human actions modeled as interactions between humans and objects. Our approach is human-centric: we first localize a human in the image and then determine the object relevant for the action and its spatial relation with the human. The model is learned automatically from a set of still images annotated only with the action label. Our approach relies on a human detector to initialize the model learning. For robustness to various degrees of visibility, we build a detector that learns to combine a set of existing part detectors. Starting from humans detected in a set of images depicting the action, our approach determines the action object and its spatial relation to the human. Its final output is a probabilistic model of the human-object interaction, i.e., the spatial relation between the human and the object. We present an extensive experimental evaluation on the sports action dataset from Gupta et al., the PASCAL Action 2010 dataset, and a new human-object interaction dataset.
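
    The abstract does not specify the form of the probabilistic spatial-relation model; one plausible minimal form is a 2-D Gaussian over the object's center offset, normalized by the human bounding-box size, as sketched below. The class name and normalization scheme are assumptions for illustration, not the paper's actual parameterization.

        import numpy as np

        class SpatialRelationModel:
            """Toy Gaussian model of object position relative to the human box."""

            def fit(self, offsets):
                # offsets: (N, 2) object-center offsets divided by human box size
                self.mean = offsets.mean(axis=0)
                self.cov = np.cov(offsets, rowvar=False) + 1e-6 * np.eye(2)
                self.inv_cov = np.linalg.inv(self.cov)
                return self

            def log_likelihood(self, offset):
                # log of a 2-D Gaussian density at the given relative offset
                d = offset - self.mean
                _, logdet = np.linalg.slogdet(self.cov)
                return -0.5 * (d @ self.inv_cov @ d + logdet + 2 * np.log(2 * np.pi))

        # Toy usage: offsets observed in weakly labeled training images.
        train = np.array([[0.4, -0.1], [0.5, 0.0], [0.45, -0.05], [0.55, 0.05]])
        model = SpatialRelationModel().fit(train)
        print(model.log_likelihood(np.array([0.5, 0.0])))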