3,467 research outputs found

    DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs

    Get PDF
    In this work we address the task of semantic image segmentation with Deep Learning and make three main contributions that are experimentally shown to have substantial practical merit. First, we highlight convolution with upsampled filters, or 'atrous convolution', as a powerful tool in dense prediction tasks. Atrous convolution allows us to explicitly control the resolution at which feature responses are computed within Deep Convolutional Neural Networks. It also allows us to effectively enlarge the field of view of filters to incorporate larger context without increasing the number of parameters or the amount of computation. Second, we propose atrous spatial pyramid pooling (ASPP) to robustly segment objects at multiple scales. ASPP probes an incoming convolutional feature layer with filters at multiple sampling rates and effective fields-of-views, thus capturing objects as well as image context at multiple scales. Third, we improve the localization of object boundaries by combining methods from DCNNs and probabilistic graphical models. The commonly deployed combination of max-pooling and downsampling in DCNNs achieves invariance but has a toll on localization accuracy. We overcome this by combining the responses at the final DCNN layer with a fully connected Conditional Random Field (CRF), which is shown both qualitatively and quantitatively to improve localization performance. Our proposed "DeepLab" system sets the new state-of-art at the PASCAL VOC-2012 semantic image segmentation task, reaching 79.7% mIOU in the test set, and advances the results on three other datasets: PASCAL-Context, PASCAL-Person-Part, and Cityscapes. All of our code is made publicly available online.Comment: Accepted by TPAM

    CERN: Confidence-Energy Recurrent Network for Group Activity Recognition

    Full text link
    This work is about recognizing human activities occurring in videos at distinct semantic levels, including individual actions, interactions, and group activities. The recognition is realized using a two-level hierarchy of Long Short-Term Memory (LSTM) networks, forming a feed-forward deep architecture, which can be trained end-to-end. In comparison with existing architectures of LSTMs, we make two key contributions giving the name to our approach as Confidence-Energy Recurrent Network -- CERN. First, instead of using the common softmax layer for prediction, we specify a novel energy layer (EL) for estimating the energy of our predictions. Second, rather than finding the common minimum-energy class assignment, which may be numerically unstable under uncertainty, we specify that the EL additionally computes the p-values of the solutions, and in this way estimates the most confident energy minimum. The evaluation on the Collective Activity and Volleyball datasets demonstrates: (i) advantages of our two contributions relative to the common softmax and energy-minimization formulations and (ii) a superior performance relative to the state-of-the-art approaches.Comment: Accepted to IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 201

    Learning Social Affordance Grammar from Videos: Transferring Human Interactions to Human-Robot Interactions

    Full text link
    In this paper, we present a general framework for learning social affordance grammar as a spatiotemporal AND-OR graph (ST-AOG) from RGB-D videos of human interactions, and transfer the grammar to humanoids to enable a real-time motion inference for human-robot interaction (HRI). Based on Gibbs sampling, our weakly supervised grammar learning can automatically construct a hierarchical representation of an interaction with long-term joint sub-tasks of both agents and short term atomic actions of individual agents. Based on a new RGB-D video dataset with rich instances of human interactions, our experiments of Baxter simulation, human evaluation, and real Baxter test demonstrate that the model learned from limited training data successfully generates human-like behaviors in unseen scenarios and outperforms both baselines.Comment: The 2017 IEEE International Conference on Robotics and Automation (ICRA
    • …
    corecore