    Modeling and Recognizing Binary Human Interactions

    Recognizing human activities from video is an important step towards the long-term goal of performing scene understanding fully automatically. Applications in this domain include, but are not limited to, the automated analysis of video surveillance footage for public and private monitoring, remote patient and elderly home monitoring, video archiving, search and retrieval, human-computer interaction, and robotics. While recent years have seen a concentration of works focusing on modeling and recognizing either group activities or actions performed by people in isolation, modeling and recognizing binary human-human interactions is a fundamental building block that has only recently started to catalyze the attention of researchers.

    This thesis introduces a new modeling framework for binary human-human interactions. The main idea is to describe interactions with spatio-temporal trajectories. Interaction trajectories can then be modeled as the output of dynamical systems, and recognizing interactions entails designing a suitable way of comparing them. This poses several challenges, starting from the type of information that should be captured by the trajectories, which defines the geometric structure of the output space of the systems. In addition, decision functions performing the recognition should account for the fact that the interacting people do not have a predefined ordering. This work addresses those challenges by carefully designing a kernel-based approach that combines non-linear dynamical system modeling with kernel PCA. Experimental results on three recently published datasets clearly show the promise of this approach, with classification accuracy and retrieval precision comparable to or better than the state of the art.
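    The abstract does not give implementation details, but its two key ingredients, an ordering-invariant kernel between trajectory pairs and kernel PCA, can be sketched as follows. Everything here is an assumption for illustration: the trajectory kernel is a toy RBF stand-in for the dynamical-system kernels the thesis actually uses, and the symmetrization simply averages over the two ways of pairing the people.

        import numpy as np

        def traj_kernel(t1, t2, gamma=0.1):
            # Toy RBF kernel between two equal-length trajectories
            # (T x D arrays); a stand-in for the kernels derived from
            # non-linear dynamical system models in the thesis.
            return np.exp(-gamma * np.linalg.norm(t1 - t2) ** 2)

        def symmetric_pair_kernel(pair1, pair2, k=traj_kernel):
            # An interaction is a pair of person trajectories with no
            # predefined order, so average the kernel over both pairings.
            (a1, b1), (a2, b2) = pair1, pair2
            return 0.5 * (k(a1, a2) * k(b1, b2) + k(a1, b2) * k(b1, a2))

        def kernel_pca_embed(K, n_components=2):
            # Embed samples given their Gram matrix K: center the kernel,
            # take leading eigenvectors, scale by sqrt of the eigenvalues.
            n = K.shape[0]
            H = np.eye(n) - np.ones((n, n)) / n
            vals, vecs = np.linalg.eigh(H @ K @ H)
            idx = np.argsort(vals)[::-1][:n_components]
            return vecs[:, idx] * np.sqrt(np.maximum(vals[idx], 0.0))

    Recognition would then amount to, for example, nearest-neighbor classification in the embedded space built from the symmetrized Gram matrix.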

    Human Interaction Recognition with Audio and Visual Cues

    The automated recognition of human activities from video is a fundamental problem with applications in several areas, ranging from video surveillance and robotics to smart healthcare and multimedia indexing and retrieval, to name just a few. However, the pervasive diffusion of cameras capable of recording audio also makes a complementary modality available to those applications. Despite the sizable progress made in modeling and recognizing group activities, and actions performed by people in isolation, from video, the availability of audio cues has rarely been leveraged. This is even more so in the area of modeling and recognizing binary interactions between humans, where even the use of video has been limited.

    This thesis introduces a modeling framework for binary human interactions based on audio and visual cues. The main idea is to describe an interaction with a spatio-temporal trajectory modeling the visual motion cues and a temporal trajectory modeling the audio cues. This poses the problem of how to fuse temporal trajectories from multiple modalities for the purpose of recognition. We propose a solution whereby trajectories are modeled as the output of kernel state space models. We then develop kernel-based methods for audio-visual fusion that act at the feature level, as well as at the kernel level, by exploiting multiple kernel learning techniques. The approaches have been extensively tested and evaluated on a dataset of videos from TV shows and Hollywood movies containing five different interactions. The results show the promise of this approach by producing a significant improvement in recognition rate when audio cues are exploited, clearly setting the state of the art in this particular application.
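    As a rough illustration of the kernel-level fusion mentioned above, the sketch below combines precomputed audio and visual Gram matrices with a single mixing weight chosen by cross-validation. This is a deliberately lightweight stand-in for the multiple kernel learning techniques the thesis uses; the function names and the convex-combination scheme are assumptions, not the thesis's actual method.

        import numpy as np
        from sklearn.model_selection import cross_val_score
        from sklearn.svm import SVC

        def fuse_kernels(K_video, K_audio, beta):
            # Kernel-level fusion: a convex combination of modality Gram
            # matrices, which is itself a valid kernel.
            return beta * K_video + (1.0 - beta) * K_audio

        def select_fusion_weight(K_video, K_audio, y):
            # Pick the mixing weight by cross-validated SVM accuracy --
            # a lightweight stand-in for full multiple kernel learning,
            # which would learn the weights jointly with the classifier.
            best = (None, -np.inf)
            for beta in np.linspace(0.0, 1.0, 11):
                K = fuse_kernels(K_video, K_audio, beta)
                score = cross_val_score(SVC(kernel="precomputed"), K, y, cv=3).mean()
                if score > best[1]:
                    best = (beta, score)
            return best

    Feature-level fusion, the other option the abstract names, would instead combine the modality descriptors before a single kernel is computed.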

    Activity recognition from videos with parallel hypergraph matching on GPUs

    In this paper, we propose a method for activity recognition from videos based on sparse local features and hypergraph matching. We benefit from special properties of the temporal domain in the data to derive a sequential and fast graph matching algorithm for GPUs. Graphs and hypergraphs are frequently used to recognize complex and often non-rigid patterns in computer vision, either through graph matching or point-set matching with graphs. Most formulations resort to the minimization of a difficult discrete energy function mixing geometric or structural terms with data-attached terms involving appearance features. Traditional methods solve this minimization problem approximately, for instance with spectral techniques. In this work, instead of approximating, we calculate the exact solution for the optimal assignment in parallel on GPUs. The graphical structure is simplified and regularized, which allows us to derive an efficient recursive minimization algorithm. The algorithm distributes subproblems over the calculation units of a GPU, which solve them in parallel, allowing the system to run faster than real time on mid-range GPUs.
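    The exact recursive minimization is easiest to see on a simplified model. The sketch below assumes the structure has been reduced to a temporally ordered chain with unary appearance costs and pairwise geometric-deformation costs, so the optimal assignment follows from a Viterbi-style dynamic program. The paper's actual formulation uses higher-order hypergraph terms and runs each step's subproblems in parallel on a GPU; this CPU sketch is sequential and purely illustrative.

        import numpy as np

        def chain_match(model_feats, scene_feats, model_pos, scene_pos, lam=1.0):
            # Exact assignment of an ordered chain of M model points to S
            # scene candidates by dynamic programming. model_feats: (M, F),
            # scene_feats: (S, F), model_pos: (M, 2), scene_pos: (S, 2).
            M, S = len(model_feats), len(scene_feats)
            # Unary appearance cost: descriptor distance, all pairs at once.
            unary = np.linalg.norm(
                model_feats[:, None, :] - scene_feats[None, :, :], axis=2)
            D = np.empty((M, S))
            D[0] = unary[0]
            back = np.zeros((M, S), dtype=int)
            for i in range(1, M):
                step = model_pos[i] - model_pos[i - 1]  # expected displacement
                for s in range(S):
                    # Geometric cost of arriving at candidate s from every
                    # previous candidate; each such column is one of the
                    # independent subproblems a GPU could solve in parallel.
                    deform = np.linalg.norm(scene_pos[s] - scene_pos - step, axis=1)
                    total = D[i - 1] + lam * deform
                    back[i, s] = int(np.argmin(total))
                    D[i, s] = unary[i, s] + total[back[i, s]]
            # Backtrack the globally optimal assignment.
            assign = [int(np.argmin(D[-1]))]
            for i in range(M - 1, 0, -1):
                assign.append(int(back[i, assign[-1]]))
            return assign[::-1], float(D[-1].min())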

    Approximate Image Matching using Strings of Bag-of-Visual Words Representation

    The Spatial Pyramid Matching approach has become very popular for modeling images as sets of local bags-of-words. Images are then compared region by region with an intersection kernel. Despite its success, this model has some limitations: the grid partitioning is predefined and identical for all images, and the matching is sensitive to intra- and inter-class variations. In this paper, we propose a novel approach based on approximate string matching to overcome these limitations and improve the results. First, we introduce a new image representation as strings of ordered bags-of-words. Second, we present a new edit distance specifically adapted to strings of histograms in the context of image comparison. This distance identifies local alignments between subregions and allows sequences of similar subregions to be removed so that two images match better. Experiments on 15 Scenes and Caltech 101 show that the proposed approach outperforms the classical spatial pyramid representation and most competing classification methods presented in recent years.
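    As an illustration of the second contribution, the sketch below implements a plain Levenshtein-style dynamic program over two strings of L1-normalized bag-of-words histograms, with a substitution cost of one minus the histogram intersection so that similar subregions align cheaply. The uniform gap cost and the omission of the paper's removal of sequences of similar subregions are simplifications made here for brevity.

        import numpy as np

        def hist_intersection(h1, h2):
            # Similarity between two L1-normalized bag-of-words histograms.
            return np.minimum(h1, h2).sum()

        def histogram_edit_distance(s1, s2, gap=1.0):
            # Levenshtein-style DP over two strings of histograms; the
            # substitution cost 1 - intersection makes similar subregions
            # cheap to align. A simplified sketch of the paper's adapted
            # edit distance.
            n, m = len(s1), len(s2)
            D = np.zeros((n + 1, m + 1))
            D[:, 0] = np.arange(n + 1) * gap
            D[0, :] = np.arange(m + 1) * gap
            for i in range(1, n + 1):
                for j in range(1, m + 1):
                    sub = 1.0 - hist_intersection(s1[i - 1], s2[j - 1])
                    D[i, j] = min(D[i - 1, j] + gap,      # deletion
                                  D[i, j - 1] + gap,      # insertion
                                  D[i - 1, j - 1] + sub)  # substitution
            return D[n, m]

    The resulting distance can be turned into a similarity and used in place of the spatial pyramid intersection kernel inside a kernel classifier.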

    Vision Science and Technology at NASA: Results of a Workshop

    A broad review is given of vision science and technology within NASA. The subject is defined and its applications in both NASA and the nation at large are noted. A survey of current NASA efforts is given, noting strengths and weaknesses of the NASA program.

    String representations and distances in deep Convolutional Neural Networks for image classification

    Recent advances in image classification mostly rely on the use of powerful local features combined with an adapted image representation. Although Convolutional Neural Network (CNN) features learned from ImageNet have been shown to be generic and very efficient, they still lack the flexibility to take into account variations in the spatial layout of visual elements. In this paper, we investigate the use of structural representations on top of pre-trained CNN features to improve image classification. Images are represented as strings of CNN features. Similarities between such representations are computed using two new edit distance variants adapted to the image classification domain. Our algorithms have been implemented and tested on several challenging datasets: 15 Scenes, Caltech 101, Pascal VOC 2007, and MIT Indoor. The results show that our idea of using structural string representations and distances clearly improves classification performance over standard approaches based on CNN features with a linear-kernel SVM, as well as over other established methods from the literature.
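    The edit-distance machinery sketched earlier for bag-of-words strings carries over here once images are encoded as strings of CNN descriptors. The snippet below shows one plausible encoding, row-major reading of a convolutional feature map, which is an assumption rather than the paper's exact scheme, together with a cosine-based substitution cost replacing histogram intersection.

        import numpy as np

        def image_to_string(feature_map):
            # Turn an (H, W, C) CNN feature map into a string of
            # L2-normalized region descriptors read in row-major order;
            # the ordering scheme is an assumption for illustration.
            H, W, C = feature_map.shape
            cells = feature_map.reshape(H * W, C)
            norms = np.linalg.norm(cells, axis=1, keepdims=True) + 1e-8
            return list(cells / norms)

        def cosine_sub_cost(f1, f2):
            # Substitution cost for CNN descriptors: 1 - cosine similarity,
            # replacing the histogram intersection used for bag-of-words.
            return 1.0 - float(np.dot(f1, f2))

    Substituting cosine_sub_cost into the dynamic program shown above yields an edit distance between two such strings of CNN features.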