1,076 research outputs found

    Sympathy for the Details: Dense Trajectories and Hybrid Classification Architectures for Action Recognition

    Action recognition in videos is a challenging task due to the complexity of the spatio-temporal patterns to model and the difficulty of acquiring and learning from large quantities of video data. Deep learning, although a breakthrough for image classification and showing promise for videos, has still not clearly superseded action recognition methods using hand-crafted features, even when training on massive datasets. In this paper, we introduce hybrid video classification architectures based on carefully designed unsupervised representations of hand-crafted spatio-temporal features classified by supervised deep networks. As we show in our experiments on five popular benchmarks for action recognition, our hybrid model combines the best of both worlds: it is data efficient (trained on 150 to 10,000 short clips) and yet improves significantly on the state of the art, including recent deep models trained on millions of manually labelled images and videos.

    Self-Supervised Joint Encoding of Motion and Appearance for First Person Action Recognition

    Wearable cameras are becoming more and more popular in several applications, increasing the interest of the research community in developing approaches for recognizing actions from the first-person point of view. An open challenge in egocentric action recognition is that videos lack detailed information about the main actor's pose and thus tend to record only parts of the movement when focusing on manipulation tasks. The amount of information about the action itself is therefore limited, making it crucial to understand the manipulated objects and their context. Many previous works addressed this issue with two-stream architectures, where one stream is dedicated to modeling the appearance of objects involved in the action, and another to extracting motion features from optical flow. In this paper, we argue that learning features jointly from these two information channels helps to better capture the spatio-temporal correlations between them. To this end, we propose a single-stream architecture able to do so, thanks to the addition of a self-supervised block that uses a pretext motion prediction task to intertwine motion and appearance knowledge. Experiments on several publicly available databases show the power of our approach.

    Deep Learning for Action and Gesture Recognition in Image Sequences: A Survey

    Interest in automatic action and gesture recognition has grown considerably in the last few years. This is due in part to the large number of application domains for this type of technology. As in many other computer vision areas, deep learning based methods have quickly become a reference methodology for obtaining state-of-the-art performance in both tasks. This chapter is a survey of current deep learning based methodologies for action and gesture recognition in sequences of images. The survey reviews both fundamental and cutting-edge methodologies reported in the last few years. We introduce a taxonomy that summarizes important aspects of deep learning for approaching both tasks. Details of the proposed architectures, fusion strategies, main datasets, and competitions are reviewed. We also summarize and discuss the main works proposed so far, with particular interest in how they treat the temporal dimension of data, their distinguishing features, and opportunities and challenges for future research. To the best of our knowledge, this is the first survey on the topic. We foresee that this survey will become a reference in this ever-dynamic field of research.

    Egocentric Vision-based Action Recognition: A survey

    The field of egocentric action recognition (EAR) has recently grown in popularity thanks to the affordable and lightweight wearable cameras now available, such as the GoPro and similar devices. As a result, the amount of egocentric data generated has increased, triggering interest in the understanding of egocentric videos. More specifically, the recognition of actions in egocentric videos has gained popularity due to the challenge that it poses: the wild movement of the camera and the lack of context make it hard to recognise actions with a performance similar to that of third-person vision solutions. This has ignited research interest in the field and, nowadays, many public datasets and competitions can be found in both the machine learning and the computer vision communities. In this survey, we aim to analyse the literature on egocentric vision methods and algorithms. To that end, we propose a taxonomy that divides the literature into various categories with subcategories, contributing a more fine-grained classification of the available methods. We also provide a review of the zero-shot approaches used by the EAR community, a methodology that could help to transfer EAR algorithms to real-world applications. Finally, we summarise the datasets used by researchers in the literature. We gratefully acknowledge the support of the Basque Government's Department of Education for the predoctoral funding of the first author. This work has been supported by the Spanish Government under the FuturAAL-Context project (RTI2018-101045-B-C21) and by the Basque Government under the Deustek project (IT-1078-16-D).

    Learning with Privileged Information using Multimodal Data

    Computer vision is the science related to teaching machines to see and understand digital images or videos. During the last decade, computer vision has seen tremendous progress on perception tasks such as object detection, semantic segmentation, and video action recognition, which has led to the development and improvement of important industrial applications such as self-driving cars and medical image analysis. These advances are mainly due to the fast computation offered by GPUs, the development of high-capacity models such as deep neural networks, and the availability of large datasets, often composed of a variety of modalities. In this thesis, we explore how multimodal data can be used to train deep convolutional neural networks. Humans perceive the world through multiple senses, and reason over the multimodal space of stimuli to act and understand the environment. One way to improve the perception capabilities of deep learning methods is to use different modalities as input, as they offer different and complementary information about the scene. Recent multimodal datasets for computer vision tasks include modalities such as depth maps, infrared, skeleton coordinates, and others, besides the traditional RGB. This thesis investigates deep learning systems that learn from multiple visual modalities. In particular, we are interested in a very practical scenario in which an input modality is missing at test time. The question we address is the following: how can we take advantage of multimodal datasets for training our model, knowing that, at test time, a modality might be missing or too noisy? The case of having access to more information at training time than at test time is referred to as learning using privileged information. In this work, we develop methods to address this challenge, with special focus on the tasks of action and object recognition, and on the modalities of depth, optical flow, and RGB, that we use for inference at test time.
This thesis advances the art of multimodal learning in three different ways. First, we develop a deep learning method for video classification that is trained on RGB and depth data, and is able to hallucinate depth features and predictions at test time. Second, we build on this method and propose a more generic mechanism, based on adversarial learning, that learns to mimic the predictions produced by the depth modality and can automatically switch from true depth features to generated depth features in the case of a noisy sensor. Third, we develop a method that learns a single network trained on RGB data, enriched at training time with additional supervision from other modalities such as depth and optical flow, that outperforms an ensemble of networks trained independently on these modalities.

    Deep Learning-Based Action Recognition

    The classification of human action or behavior patterns is very important for analyzing situations in the field and maintaining social safety. This book focuses on recent research findings on recognizing human action patterns. Technology for the recognition of human action patterns includes the processing of human behavior data for learning, the expression of feature values of images, the extraction of spatiotemporal information from images, the recognition of human posture, and gesture recognition. Research on these technologies has recently been conducted using general deep learning network modeling, and excellent research results have been included in this edition.

    Personalized face and gesture analysis using hierarchical neural networks

    The video-based computational analyses of human face and gesture signals encompass a myriad of challenging research problems involving computer vision, machine learning and human-computer interaction. In this thesis, we focus on the following challenges: a) the classification of hand and body gestures along with the temporal localization of their occurrence in a continuous stream, b) the recognition of facial expressivity levels in people with Parkinson's disease using multimodal feature representations, c) the prediction of student learning outcomes in intelligent tutoring systems using affect signals, and d) the personalization of machine learning models, which can adapt to subject and group-specific nuances in facial and gestural behavior. Specifically, we first conduct a quantitative comparison of two approaches to the problem of segmenting and classifying gestures on two benchmark gesture datasets: a method that simultaneously segments and classifies gestures versus a cascaded method that performs the tasks sequentially. Second, we introduce a framework that computationally predicts an accurate score for facial expressivity and validate it on a dataset of interview videos of people with Parkinson's disease. Third, based on a unique dataset of videos of students interacting with MathSpring, an intelligent tutoring system, collected by our collaborative research team, we build models to predict learning outcomes from their facial affect signals. Finally, we propose a novel solution to a relatively unexplored area in automatic face and gesture analysis research: personalization of models to individuals and groups. We develop hierarchical Bayesian neural networks to overcome the challenges posed by group or subject-specific variations in face and gesture signals.
We successfully validate our formulation on the problems of personalized subject-specific gesture classification, context-specific facial expressivity recognition and student-specific learning outcome prediction. We demonstrate the flexibility of our hierarchical framework by validating the utility of both fully connected and recurrent neural architectures.