
    Learning Correspondence Structures for Person Re-identification

    Full text link
    This paper addresses the problem of handling spatial misalignments due to camera-view changes or human-pose variations in person re-identification. We first introduce a boosting-based approach to learn a correspondence structure which indicates the patch-wise matching probabilities between images from a target camera pair. The learned correspondence structure can not only capture the spatial correspondence pattern between cameras but also handle the viewpoint or human-pose variation in individual images. We further introduce a global constraint-based matching process. It integrates a global matching constraint over the learned correspondence structure to exclude cross-view misalignments during the image patch matching process, hence achieving a more reliable matching score between images. Finally, we also extend our approach by introducing a multi-structure scheme, which learns a set of local correspondence structures to capture the spatial correspondence sub-patterns between a camera pair, so as to handle the spatial misalignments between individual images in a more precise way. Experimental results on various datasets demonstrate the effectiveness of our approach.
    Comment: IEEE Trans. Image Processing, vol. 26, no. 5, pp. 2438-2453, 2017. The project page for this paper is available at http://min.sjtu.edu.cn/lwydemo/personReID.htm arXiv admin note: text overlap with arXiv:1504.0624
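    The core idea can be sketched in a few lines: score an image pair by weighting patch-level appearance similarity with a location-to-location correspondence structure. The sketch below is a minimal illustration under assumed inputs (random descriptors and a hand-built diagonal prior standing in for the learned structure); the greedy per-patch max is only a stand-in for the paper's global matching constraint.

```python
import numpy as np

def patch_match_score(feats_a, feats_b, corr_prob):
    """Score an image pair under a patch-wise correspondence structure.

    feats_a:   (Na, d) patch descriptors from one camera view
    feats_b:   (Nb, d) patch descriptors from the other view
    corr_prob: (Na, Nb) matching probabilities between patch locations
    """
    # Cosine similarity between every pair of patches.
    a = feats_a / np.linalg.norm(feats_a, axis=1, keepdims=True)
    b = feats_b / np.linalg.norm(feats_b, axis=1, keepdims=True)
    sim = a @ b.T  # (Na, Nb)
    # Weight appearance similarity by the correspondence structure so only
    # location pairs that plausibly correspond contribute to the score.
    weighted = corr_prob * sim
    # Greedy per-patch best match: a simplification of the paper's
    # global matching constraint.
    return weighted.max(axis=1).mean()

# Toy usage: random descriptors and a diagonal-ish correspondence prior.
rng = np.random.default_rng(0)
fa, fb = rng.normal(size=(48, 64)), rng.normal(size=(48, 64))
prior = 0.8 * np.eye(48) + 0.2 / 48
print(patch_match_score(fa, fb, prior))
```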

    Disparate View Matching

    Get PDF
    Matching of disparate views has gained significance in computer vision due to its role in many novel application areas. Being able to match images of the same scene captured during day and night, between a historic and a contemporary picture of a scene, and between aerial and ground-level views of a building facade all enable novel applications ranging from loop-closure detection for structure-from-motion and re-photography to geo-localization of a street-level image using reference imagery captured from the air. The goal of this work is to develop novel features and methods that address matching problems where direct appearance-based correspondences are either difficult to obtain or infeasible because of the lack of appearance similarity altogether. To address these problems, we propose methods that span the appearance-geometry spectrum, both in their use of these cues and in their ability to handle variations in appearance and geometry. First, we consider the problem of geo-localizing a query street-level image using a reference database of building facades captured from a bird's eye view. To address this wide-baseline facade matching problem, we present a novel scale-selective self-similarity feature that avoids direct comparison of appearance between disparate facade images. Next, to address image matching problems with more extreme appearance variation, we present a novel representation for matchable images expressed in terms of the eigen-functions of the joint graph of the two images. This representation is used to derive features that are persistent across wide variations in appearance. Next, we consider the problem setting of matching between a street-level image and a digital elevation map (DEM). Given the limited appearance information available in this scenario, the matching approach has to rely more significantly on geometric cues; therefore, we present a purely geometric method to establish correspondences between building corners in the DEM and the visible corners in the query image. Finally, to generalize this problem setting, we address the problem of establishing correspondences between 3D and 2D point clouds using geometric means alone. We present a novel framework for incorporating purely geometric constraints into a higher-order graph-matching framework, with specific formulations for the three-point calibrated absolute camera pose problem (P3P), the two-point upright camera pose problem (Up2p), and the three-plus-one relative camera pose problem.
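    Of the methods summarized above, the joint-graph representation is the easiest to sketch compactly. The toy code below builds a Gaussian-affinity joint graph over the local descriptors of two images and uses the low-order eigenvectors of its normalized Laplacian as appearance-robust per-node features; the names and parameters (k, sigma) are illustrative assumptions, not the thesis's exact formulation.

```python
import numpy as np

def joint_graph_features(desc_a, desc_b, k=4, sigma=1.0):
    """Per-node embeddings from the eigen-functions of the joint graph
    of two images (illustrative simplification).

    desc_a, desc_b: (Na, d) and (Nb, d) local descriptors.
    Returns the k smallest non-trivial Laplacian eigenvectors, split
    back into the two images' nodes.
    """
    X = np.vstack([desc_a, desc_b])
    # Gaussian affinity over all nodes of the joint graph.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    # Normalized Laplacian L = I - D^{-1/2} W D^{-1/2}.
    dinv = 1.0 / np.sqrt(W.sum(axis=1))
    L = np.eye(len(X)) - dinv[:, None] * W * dinv[None, :]
    _, vecs = np.linalg.eigh(L)
    emb = vecs[:, 1:k + 1]  # skip the trivial constant eigenvector
    return emb[:len(desc_a)], emb[len(desc_a):]
```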

    Aligning Books and Movies: Towards Story-Like Visual Explanations by Watching Movies and Reading Books

    Get PDF
    Books are a rich source of both fine-grained information (how a character, an object, or a scene looks) and high-level semantics (what someone is thinking or feeling, and how these states evolve through a story). This paper aims to align books to their movie releases in order to provide rich descriptive explanations for visual content that go semantically far beyond the captions available in current datasets. To align movies and books, we propose a neural sentence embedding that is trained in an unsupervised way from a large corpus of books, as well as a video-text neural embedding for computing similarities between movie clips and sentences in the book. We propose a context-aware CNN to combine information from multiple sources. We demonstrate good quantitative performance for movie/book alignment and show several qualitative examples that showcase the diversity of tasks our model can be used for.
    Funding: Natural Sciences and Engineering Research Council of Canada; Canadian Institute for Advanced Research; Samsung (Firm); Google (Firm); United States. Office of Naval Research (ONR-N00014-14-1-0232
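    Once both modalities are embedded in a joint space, the alignment step itself is conceptually simple. The sketch below is a hypothetical reduction: clip and sentence embeddings are assumed given, and a greedy per-clip nearest neighbour stands in for the paper's context-aware CNN that combines multiple similarity sources.

```python
import numpy as np

def align_clips_to_sentences(clip_emb, sent_emb):
    """Greedy alignment of movie clips to book sentences in a joint space.

    clip_emb: (Nc, d) video-text embeddings of movie clips
    sent_emb: (Ns, d) sentence embeddings from the book
    Returns, for each clip, the index of its most similar sentence.
    """
    c = clip_emb / np.linalg.norm(clip_emb, axis=1, keepdims=True)
    s = sent_emb / np.linalg.norm(sent_emb, axis=1, keepdims=True)
    sim = c @ s.T  # cosine similarities, shape (Nc, Ns)
    # A per-clip argmax ignores story context; the paper's context-aware
    # CNN smooths this into a coherent alignment.
    return sim.argmax(axis=1)
```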

    A New Computational Framework for Efficient Parallelization and Optimization of Large Scale Graph Matching

    Get PDF
    Many applications in data fusion, comparison, and recognition require a robust and efficient algorithm to match features across multiple images. To improve accuracy and obtain a more stable result, it is important to take into consideration both the local appearance and the pairwise relationships of features. Graphs are a powerful and flexible data structure that allows complex relationships between data elements to be described: nodes correspond to salient features and edges to relational aspects between features. The graph-matching problem is therefore to find a mapping between the two sets of nodes that preserves the relationships between them as much as possible. This problem is mathematically formulated as an integer quadratic program (IQP), which is NP-hard to solve; exact optima are attainable only for very small inputs. Handling large-scale scientific visual data is therefore quite limited, necessitating both efficient serial algorithms and scalable parallel formulations. In this thesis, we first focus on techniques to reduce the computational cost as well as the memory usage of pairwise graph matching by adopting a heuristic pruning strategy together with a redundancy-pattern suppression scheme. We also modify the structure of the affinity matrix to minimize memory requirements and parallelize our algorithm using CPU- and GPU-accelerated libraries. Any pair of features at the same distance in the first image yields identical sub-matrices, so instead of constructing the whole affinity matrix we build sub-blocked affinities only for the distinct feature distances. This scheme not only saves substantial memory and reduces computation time tremendously, but also lets the matrix-vector multiplication of the gradient computation run in parallel, with each block-vector product computed independently and without synchronization. Accelerated libraries such as MKL, cuSPARSE, cuBLAS, and Thrust are applied to solve the GM problem, following the scheme of the spectral matching algorithm (see the sketch below). We also extend our work to multi-graph matching, since many tasks require finding correspondences across multiple images, and considering more graphs improves matching accuracy. Most algorithms obtain approximate solutions to the NP-hard GM problem, resulting in weak optima. We therefore propose a new solver that iteratively modifies the affinity matrix and binarizes the solution by optimizing the original problem with its integer constraints.
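    For concreteness, here is a minimal sketch of the spectral matching scheme referenced above: power iteration recovers the leading eigenvector of the assignment affinity matrix, and a greedy pass binarizes it under one-to-one constraints. The block-structured affinity and the accelerated-library parallelization that are the thesis's actual contribution are omitted; M, n1, and n2 are assumed inputs.

```python
import numpy as np

def spectral_matching(M, n1, n2, iters=100):
    """Spectral graph matching via power iteration plus greedy binarization.

    M: (n1*n2, n1*n2) non-negative affinity matrix over candidate
       assignments (node i of graph 1 to node a of graph 2).
    """
    x = np.ones(n1 * n2)
    for _ in range(iters):
        x = M @ x                   # the matrix-vector product that the
        x /= np.linalg.norm(x)      # thesis parallelizes block by block
    # Greedy binarization under one-to-one constraints.
    scores = x.reshape(n1, n2).copy()
    match = {}
    for _ in range(min(n1, n2)):
        i, a = np.unravel_index(scores.argmax(), scores.shape)
        if scores[i, a] <= 0:
            break
        match[i] = a
        scores[i, :] = -np.inf      # node i of graph 1 is now taken
        scores[:, a] = -np.inf      # node a of graph 2 is now taken
    return match
```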

    Visual7W: Grounded Question Answering in Images

    Full text link
    We have seen great progress in basic perceptual tasks such as object recognition and detection. However, AI models still fail to match humans in high-level vision tasks due to the lack of capacity for deeper reasoning. Recently the new task of visual question answering (QA) has been proposed to evaluate a model's capacity for deep image understanding. Previous works have established a loose, global association between QA sentences and images. However, many questions and answers in practice relate to local regions in the images. We establish a semantic link between textual descriptions and image regions by object-level grounding. It enables a new type of QA with visual answers, in addition to the textual answers used in previous work. We study visual QA tasks in a grounded setting with a large collection of 7W multiple-choice QA pairs. Furthermore, we evaluate human performance and several baseline models on the QA tasks. Finally, we propose a novel LSTM model with spatial attention to tackle the 7W QA tasks.
    Comment: CVPR 201
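    To make the attention mechanism concrete, the sketch below shows one plausible form of a spatial-attention step: the LSTM hidden state is projected into a shared space with each cell of a convolutional feature map, and a softmax over the grid produces an attention-weighted image summary. Shapes and projection matrices (W_h, W_f) are assumptions for illustration, not the paper's exact architecture.

```python
import numpy as np

def spatial_attention(h, feat_map, W_h, W_f):
    """One spatial-attention step conditioned on an LSTM state.

    h:        (dh,)       current LSTM hidden state
    feat_map: (H, W, df)  convolutional feature map of the image
    W_h:      (k, dh)     projects the hidden state into a shared space
    W_f:      (df, k)     projects each image region into the same space
    """
    H, W, df = feat_map.shape
    regions = feat_map.reshape(H * W, df)
    # Unnormalized relevance of each region to the current question state.
    logits = (regions @ W_f) @ (W_h @ h)        # shape (H*W,)
    att = np.exp(logits - logits.max())
    att /= att.sum()                            # softmax over the grid
    # Attention map plus the weighted image summary fed back to the LSTM.
    return att.reshape(H, W), att @ regions
```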

    Learning Aligned Cross-Modal Representations from Weakly Aligned Data

    Get PDF
    People can recognize scenes across many different modalities beyond natural images. In this paper, we investigate how to learn cross-modal scene representations that transfer across modalities. To study this problem, we introduce a new cross-modal scene dataset. While convolutional neural networks can categorize cross-modal scenes well, they also learn an intermediate representation not aligned across modalities, which is undesirable for cross-modal transfer applications. We present methods to regularize cross-modal convolutional neural networks so that they have a shared representation that is agnostic of the modality. Our experiments suggest that our scene representation can help transfer representations across modalities for retrieval. Moreover, our visualizations suggest that units emerge in the shared representation that tend to activate on consistent concepts independently of the modality.
    Comment: Conference paper at CVPR 201
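    One simple way to encourage a shared, modality-agnostic layer, in the spirit of the regularization described above, is to penalize how far each modality's feature statistics drift apart. The sketch below aligns per-modality mean activations; this is an assumed stand-in, not necessarily the paper's exact regularizer.

```python
import numpy as np

def modality_alignment_penalty(feats_by_modality):
    """Penalty pushing a shared layer toward modality-agnostic features.

    feats_by_modality: dict of modality name -> (N_m, d) activations of
    the shared layer for a batch drawn from that modality.
    """
    means = {m: f.mean(axis=0) for m, f in feats_by_modality.items()}
    center = np.mean(list(means.values()), axis=0)
    # Squared distance of each modality's mean activation from the shared
    # center; added to the task loss during training (assumed regularizer).
    return sum(np.sum((mu - center) ** 2) for mu in means.values())
```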