36,027 research outputs found

    VQS: Linking Segmentations to Questions and Answers for Supervised Attention in VQA and Question-Focused Semantic Segmentation

    Full text link
    Rich and dense human labeled datasets are among the main enabling factors for the recent advance on vision-language understanding. Many seemingly distant annotations (e.g., semantic segmentation and visual question answering (VQA)) are inherently connected in that they reveal different levels and perspectives of human understandings about the same visual scenes --- and even the same set of images (e.g., of COCO). The popularity of COCO correlates those annotations and tasks. Explicitly linking them up may significantly benefit both individual tasks and the unified vision and language modeling. We present the preliminary work of linking the instance segmentations provided by COCO to the questions and answers (QAs) in the VQA dataset, and name the collected links visual questions and segmentation answers (VQS). They transfer human supervision between the previously separate tasks, offer more effective leverage to existing problems, and also open the door for new research problems and models. We study two applications of the VQS data in this paper: supervised attention for VQA and a novel question-focused semantic segmentation task. For the former, we obtain state-of-the-art results on the VQA real multiple-choice task by simply augmenting the multilayer perceptrons with some attention features that are learned using the segmentation-QA links as explicit supervision. To put the latter in perspective, we study two plausible methods and compare them to an oracle method assuming that the instance segmentations are given at the test stage.Comment: To appear on ICCV 201

    Multiresolution hierarchy co-clustering for semantic segmentation in sequences with small variations

    Full text link
    This paper presents a co-clustering technique that, given a collection of images and their hierarchies, clusters nodes from these hierarchies to obtain a coherent multiresolution representation of the image collection. We formalize the co-clustering as a Quadratic Semi-Assignment Problem and solve it with a linear programming relaxation approach that makes effective use of information from hierarchies. Initially, we address the problem of generating an optimal, coherent partition per image and, afterwards, we extend this method to a multiresolution framework. Finally, we particularize this framework to an iterative multiresolution video segmentation algorithm in sequences with small variations. We evaluate the algorithm on the Video Occlusion/Object Boundary Detection Dataset, showing that it produces state-of-the-art results in these scenarios.Comment: International Conference on Computer Vision (ICCV) 201

    VSSA-NET: Vertical Spatial Sequence Attention Network for Traffic Sign Detection

    Full text link
    Although traffic sign detection has been studied for years and great progress has been made with the rise of deep learning technique, there are still many problems remaining to be addressed. For complicated real-world traffic scenes, there are two main challenges. Firstly, traffic signs are usually small size objects, which makes it more difficult to detect than large ones; Secondly, it is hard to distinguish false targets which resemble real traffic signs in complex street scenes without context information. To handle these problems, we propose a novel end-to-end deep learning method for traffic sign detection in complex environments. Our contributions are as follows: 1) We propose a multi-resolution feature fusion network architecture which exploits densely connected deconvolution layers with skip connections, and can learn more effective features for the small size object; 2) We frame the traffic sign detection as a spatial sequence classification and regression task, and propose a vertical spatial sequence attention (VSSA) module to gain more context information for better detection performance. To comprehensively evaluate the proposed method, we do experiments on several traffic sign datasets as well as the general object detection dataset and the results have shown the effectiveness of our proposed method

    Multi-Context Attention for Human Pose Estimation

    Full text link
    In this paper, we propose to incorporate convolutional neural networks with a multi-context attention mechanism into an end-to-end framework for human pose estimation. We adopt stacked hourglass networks to generate attention maps from features at multiple resolutions with various semantics. The Conditional Random Field (CRF) is utilized to model the correlations among neighboring regions in the attention map. We further combine the holistic attention model, which focuses on the global consistency of the full human body, and the body part attention model, which focuses on the detailed description for different body parts. Hence our model has the ability to focus on different granularity from local salient regions to global semantic-consistent spaces. Additionally, we design novel Hourglass Residual Units (HRUs) to increase the receptive field of the network. These units are extensions of residual units with a side branch incorporating filters with larger receptive fields, hence features with various scales are learned and combined within the HRUs. The effectiveness of the proposed multi-context attention mechanism and the hourglass residual units is evaluated on two widely used human pose estimation benchmarks. Our approach outperforms all existing methods on both benchmarks over all the body parts.Comment: The first two authors contribute equally to this wor

    Semi-Global Stereo Matching with Surface Orientation Priors

    Full text link
    Semi-Global Matching (SGM) is a widely-used efficient stereo matching technique. It works well for textured scenes, but fails on untextured slanted surfaces due to its fronto-parallel smoothness assumption. To remedy this problem, we propose a simple extension, termed SGM-P, to utilize precomputed surface orientation priors. Such priors favor different surface slants in different 2D image regions or 3D scene regions and can be derived in various ways. In this paper we evaluate plane orientation priors derived from stereo matching at a coarser resolution and show that such priors can yield significant performance gains for difficult weakly-textured scenes. We also explore surface normal priors derived from Manhattan-world assumptions, and we analyze the potential performance gains using oracle priors derived from ground-truth data. SGM-P only adds a minor computational overhead to SGM and is an attractive alternative to more complex methods employing higher-order smoothness terms.Comment: extended draft of 3DV 2017 (spotlight) pape
    corecore