8,412 research outputs found

    Pedestrian Attribute Recognition: A Survey

    Full text link
    Recognizing pedestrian attributes is an important task in computer vision community due to it plays an important role in video surveillance. Many algorithms has been proposed to handle this task. The goal of this paper is to review existing works using traditional methods or based on deep learning networks. Firstly, we introduce the background of pedestrian attributes recognition (PAR, for short), including the fundamental concepts of pedestrian attributes and corresponding challenges. Secondly, we introduce existing benchmarks, including popular datasets and evaluation criterion. Thirdly, we analyse the concept of multi-task learning and multi-label learning, and also explain the relations between these two learning algorithms and pedestrian attribute recognition. We also review some popular network architectures which have widely applied in the deep learning community. Fourthly, we analyse popular solutions for this task, such as attributes group, part-based, \emph{etc}. Fifthly, we shown some applications which takes pedestrian attributes into consideration and achieve better performance. Finally, we summarized this paper and give several possible research directions for pedestrian attributes recognition. The project page of this paper can be found from the following website: \url{https://sites.google.com/view/ahu-pedestrianattributes/}.Comment: Check our project page for High Resolution version of this survey: https://sites.google.com/view/ahu-pedestrianattributes

    Multi-Task Domain Adaptation for Deep Learning of Instance Grasping from Simulation

    Full text link
    Learning-based approaches to robotic manipulation are limited by the scalability of data collection and accessibility of labels. In this paper, we present a multi-task domain adaptation framework for instance grasping in cluttered scenes by utilizing simulated robot experiments. Our neural network takes monocular RGB images and the instance segmentation mask of a specified target object as inputs, and predicts the probability of successfully grasping the specified object for each candidate motor command. The proposed transfer learning framework trains a model for instance grasping in simulation and uses a domain-adversarial loss to transfer the trained model to real robots using indiscriminate grasping data, which is available both in simulation and the real world. We evaluate our model in real-world robot experiments, comparing it with alternative model architectures as well as an indiscriminate grasping baseline.Comment: ICRA 201

    Recurrent Attention Models for Depth-Based Person Identification

    Get PDF
    We present an attention-based model that reasons on human body shape and motion dynamics to identify individuals in the absence of RGB information, hence in the dark. Our approach leverages unique 4D spatio-temporal signatures to address the identification problem across days. Formulated as a reinforcement learning task, our model is based on a combination of convolutional and recurrent neural networks with the goal of identifying small, discriminative regions indicative of human identity. We demonstrate that our model produces state-of-the-art results on several published datasets given only depth images. We further study the robustness of our model towards viewpoint, appearance, and volumetric changes. Finally, we share insights gleaned from interpretable 2D, 3D, and 4D visualizations of our model's spatio-temporal attention.Comment: Computer Vision and Pattern Recognition (CVPR) 201

    Segmental Spatiotemporal CNNs for Fine-grained Action Segmentation

    Full text link
    Joint segmentation and classification of fine-grained actions is important for applications of human-robot interaction, video surveillance, and human skill evaluation. However, despite substantial recent progress in large-scale action classification, the performance of state-of-the-art fine-grained action recognition approaches remains low. We propose a model for action segmentation which combines low-level spatiotemporal features with a high-level segmental classifier. Our spatiotemporal CNN is comprised of a spatial component that uses convolutional filters to capture information about objects and their relationships, and a temporal component that uses large 1D convolutional filters to capture information about how object relationships change across time. These features are used in tandem with a semi-Markov model that models transitions from one action to another. We introduce an efficient constrained segmental inference algorithm for this model that is orders of magnitude faster than the current approach. We highlight the effectiveness of our Segmental Spatiotemporal CNN on cooking and surgical action datasets for which we observe substantially improved performance relative to recent baseline methods.Comment: Updated from the ECCV 2016 version. We fixed an important mathematical error and made the section on segmental inference cleare

    Every Smile is Unique: Landmark-Guided Diverse Smile Generation

    Full text link
    Each smile is unique: one person surely smiles in different ways (e.g., closing/opening the eyes or mouth). Given one input image of a neutral face, can we generate multiple smile videos with distinctive characteristics? To tackle this one-to-many video generation problem, we propose a novel deep learning architecture named Conditional Multi-Mode Network (CMM-Net). To better encode the dynamics of facial expressions, CMM-Net explicitly exploits facial landmarks for generating smile sequences. Specifically, a variational auto-encoder is used to learn a facial landmark embedding. This single embedding is then exploited by a conditional recurrent network which generates a landmark embedding sequence conditioned on a specific expression (e.g., spontaneous smile). Next, the generated landmark embeddings are fed into a multi-mode recurrent landmark generator, producing a set of landmark sequences still associated to the given smile class but clearly distinct from each other. Finally, these landmark sequences are translated into face videos. Our experimental results demonstrate the effectiveness of our CMM-Net in generating realistic videos of multiple smile expressions.Comment: Accepted as a poster in Conference on Computer Vision and Pattern Recognition (CVPR), 201
    • …
    corecore