46,568 research outputs found

    Discriminate-and-Rectify Encoders: Learning from Image Transformation Sets

    The complexity of a learning task is increased by transformations in the input space that preserve class identity. Visual object recognition, for example, is affected by changes in viewpoint, scale, illumination, or planar transformations. While drastically altering the visual appearance, these changes are orthogonal to recognition and should not be reflected in the representation or feature encoding used for learning. We introduce a framework for weakly supervised learning of image embeddings that are robust to transformations and selective to the class distribution, using sets of transforming examples (orbit sets), deep parametrizations, and a novel orbit-based loss. The proposed loss combines a discriminative, contrastive part for orbits with a reconstruction error that learns to rectify orbit transformations. The learned embeddings are evaluated in distance-metric-based tasks, such as one-shot classification under geometric transformations, as well as face verification and retrieval under more realistic visual variability. Our results suggest that orbit sets, suitably computed or observed, can be used for efficient, weakly supervised learning of semantically relevant image embeddings. This material is based upon work supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216.
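    A minimal sketch, in PyTorch, of how such an orbit-based loss could be assembled: a contrastive term pulls embeddings of transformed examples from the same orbit together and pushes different orbits apart, while a reconstruction term decodes the embedding back to a canonical (rectified) view. The encoder, decoder, margin, and weighting below are illustrative assumptions, not the paper's exact formulation.

    # Sketch of an orbit-style loss: contrastive over orbit membership plus
    # reconstruction of a canonical (untransformed) view. Names are illustrative.
    import torch.nn.functional as F

    def orbit_loss(encoder, decoder, x_anchor, x_same_orbit, x_other_orbit,
                   x_canonical, margin=1.0, recon_weight=0.5):
        z_a = encoder(x_anchor)          # embedding of a transformed example
        z_p = encoder(x_same_orbit)      # another transform of the same image
        z_n = encoder(x_other_orbit)     # a transform of a different image

        # Contrastive part: same-orbit pairs close, cross-orbit pairs at least `margin` apart.
        d_pos = F.pairwise_distance(z_a, z_p)
        d_neg = F.pairwise_distance(z_a, z_n)
        contrastive = d_pos.pow(2).mean() + F.relu(margin - d_neg).pow(2).mean()

        # Rectification part: decode the embedding back to the canonical view.
        recon = F.mse_loss(decoder(z_a), x_canonical)

        return contrastive + recon_weight * recon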

    3D Pose Estimation and 3D Model Retrieval for Objects in the Wild

    We propose a scalable, efficient, and accurate approach to retrieve 3D models for objects in the wild. Our contribution is twofold. We first present a 3D pose estimation approach for object categories which significantly outperforms the state of the art on Pascal3D+. Second, we use the estimated pose as a prior to retrieve 3D models which accurately represent the geometry of objects in RGB images. For this purpose, we render depth images from 3D models under our predicted pose and match learned image descriptors of RGB images against those of rendered depth images using a CNN-based multi-view metric learning approach. In this way, we are the first to report quantitative results for 3D model retrieval on Pascal3D+, where our method chooses the same models as human annotators for 50% of the validation images on average. In addition, we show that our method, which was trained purely on Pascal3D+, retrieves rich and accurate 3D models from ShapeNet given RGB images of objects in the wild. Comment: Accepted to the Conference on Computer Vision and Pattern Recognition (CVPR) 2018.
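    A minimal sketch of the retrieval step described above, assuming an existing pose estimator, depth renderer, and descriptor networks trained under a shared multi-view metric: each candidate 3D model is rendered as a depth image under the predicted pose, embedded, and compared to the RGB descriptor. All function names here are illustrative, not the authors' code.

    # Pick the 3D model whose rendered-depth descriptor is closest to the RGB descriptor.
    import numpy as np

    def retrieve_model(rgb_descriptor, candidate_models, render_depth, depth_cnn, pose):
        best_model, best_dist = None, np.inf
        for model in candidate_models:
            depth_image = render_depth(model, pose)       # depth rendering under predicted pose
            depth_descriptor = depth_cnn(depth_image)     # embedding in the shared metric space
            dist = np.linalg.norm(rgb_descriptor - depth_descriptor)
            if dist < best_dist:
                best_model, best_dist = model, dist
        return best_model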

    Sherlock: Scalable Fact Learning in Images

    We study scalable and uniform understanding of facts in images. Existing visual recognition systems are typically modeled differently for each fact type, such as objects, actions, and interactions. We propose a setting where all these facts can be modeled simultaneously, with the capacity to understand an unbounded number of facts in a structured way. The training data comes as structured facts in images, including (1) objects, (2) attributes, (3) actions, and (4) interactions. Each fact has a semantic language view and a visual view (an image with this fact). We show that learning visual facts in a structured way enables not only uniform but also generalizable visual understanding. We propose and investigate recent and strong approaches from the multi-view learning literature and also introduce two representation learning models as potential baselines. We applied the investigated methods on several datasets that we augmented with structured facts, as well as a large-scale dataset of more than 202,000 facts and 814,000 images. Our experiments show the advantage of relating facts through their structure: the proposed models outperform the designed baselines on bidirectional fact retrieval. Comment: Jan 7 Update.
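    As a rough illustration of bidirectional fact retrieval in a shared embedding space (not the paper's specific model): a language encoder embeds structured facts, an image encoder embeds pictures, and retrieval in either direction is nearest-neighbour search by cosine similarity. The encoders and the example fact string in the comments are assumptions.

    # Rank gallery items (facts or images) by cosine similarity to a query embedding.
    import torch
    import torch.nn.functional as F

    def retrieve(query_embedding, gallery_embeddings, top_k=5):
        q = F.normalize(query_embedding, dim=-1)      # (D,)
        g = F.normalize(gallery_embeddings, dim=-1)   # (N, D)
        scores = g @ q                                # cosine similarity per gallery item
        return torch.topk(scores, k=top_k).indices

    # image -> fact: retrieve(image_encoder(img), fact_embeddings)
    # fact -> image: retrieve(fact_encoder("<boy, playing>"), image_embeddings)  # illustrative fact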

    Searching objects of interest in large scale data

    Ph.D. dissertation, University of Missouri--Columbia, July 2012. Dissertation advisor: Dr. Tony X. Han. The research on object detection/tracking and large-scale visual search/recognition has recently made substantial progress and has started to improve the quality of life worldwide: real-time face detectors have been integrated into point-and-shoot cameras, smartphones, and tablets; content-based image search is available at Google and Amazon's SnapTell; and vision-based gesture recognition is an indispensable component of the popular Kinect game console. In this dissertation, we investigate computer vision problems related to object detection, adaptation, tracking, and content-based image retrieval, all of which are indispensable components of a video surveillance or robot system. Our contributions involve feature development, exploration of detection correlations, object modeling, and local context information of descriptors. More specifically, we designed a feature set for object detection with occlusion handling. To improve detection performance on video, we proposed a non-parametric detector adaptation algorithm that improves state-of-the-art detectors for each specific video. To effectively track detected objects, we introduce a metric learning framework that unifies appearance modeling and visual matching. Taking advantage of image descriptor appearance context as well as local spatial context, we achieved state-of-the-art retrieval performance within the vocabulary-tree-based image retrieval framework. All the proposed algorithms are validated by thorough experiments.
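    For context, a minimal sketch of the vocabulary-tree style scoring that the retrieval work builds on: quantized local descriptors vote for database images through an inverted index with TF-IDF weighting. This only illustrates the general framework; the dissertation's adaptation, tracking, and spatial-context contributions are not shown, and the data structures below are assumptions.

    # TF-IDF voting over an inverted index of visual words.
    import numpy as np
    from collections import Counter

    def rank_images(query_words, inverted_index, idf, num_db_images):
        """query_words: visual-word ids of the query; inverted_index: word -> {image_id: tf}."""
        scores = np.zeros(num_db_images)
        q_tf = Counter(query_words)
        for word, q_count in q_tf.items():
            for image_id, db_tf in inverted_index.get(word, {}).items():
                scores[image_id] += q_count * db_tf * idf[word] ** 2   # TF-IDF match score
        return np.argsort(-scores)          # database image ids, best match first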

    Multi-view Convolutional Neural Networks for 3D Shape Recognition

    A longstanding question in computer vision concerns the representation of 3D shapes for recognition: should 3D shapes be represented with descriptors operating on their native 3D formats, such as voxel grids or polygon meshes, or can they be effectively represented with view-based descriptors? We address this question in the context of learning to recognize 3D shapes from a collection of their rendered views on 2D images. We first present a standard CNN architecture trained to recognize the shapes' rendered views independently of each other, and show that a 3D shape can be recognized even from a single view at an accuracy far higher than using state-of-the-art 3D shape descriptors. Recognition rates further increase when multiple views of the shapes are provided. In addition, we present a novel CNN architecture that combines information from multiple views of a 3D shape into a single, compact shape descriptor offering even better recognition performance. The same architecture can be applied to accurately recognize human hand-drawn sketches of shapes. We conclude that a collection of 2D views can be highly informative for 3D shape recognition and is amenable to emerging CNN architectures and their derivatives. Comment: v1: Initial version. v2: An updated ModelNet40 training/test split is used; results with low-rank Mahalanobis metric learning are added. v3 (ICCV 2015): A second camera setup without the upright orientation assumption is added; some accuracy and mAP numbers are changed slightly because a small issue in mesh rendering related to specularities is fixed.
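    A minimal sketch, in PyTorch, of the view-aggregation idea described above: a shared CNN embeds each rendered view, and an element-wise max over views produces a single compact descriptor for classification. The per-view network and rendering pipeline are assumed to exist; this is not the authors' implementation.

    # Aggregate per-view CNN features with an element-wise max ("view pooling").
    import torch.nn as nn

    class MultiViewClassifier(nn.Module):
        def __init__(self, view_cnn, feature_dim, num_classes):
            super().__init__()
            self.view_cnn = view_cnn                      # shared per-view feature extractor
            self.classifier = nn.Linear(feature_dim, num_classes)

        def forward(self, views):                         # views: (batch, num_views, C, H, W)
            b, v = views.shape[:2]
            feats = self.view_cnn(views.flatten(0, 1))    # (batch * num_views, feature_dim)
            feats = feats.view(b, v, -1).max(dim=1).values    # pooled, compact shape descriptor
            return self.classifier(feats)                 # class scores from the pooled descriptor

    # Usage: logits = MultiViewClassifier(view_cnn, 512, 40)(views)  # e.g., 40 ModelNet classes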