8,014 research outputs found

    Discriminative Feature Learning with Application to Fine-grained Recognition

    Get PDF
    For various computer vision tasks, finding suitable feature representations is fundamental. Fine-grained recognition, distinguishing sub-categories under the same super-category (e.g., bird species, car makes and models, etc.), serves as a good task for studying discriminative feature learning for visual recognition. The main reason is that the inter-class variations between fine-grained categories are very subtle and often even smaller than the intra-class variations caused by pose or deformation. This thesis focuses on tasks mostly related to fine-grained categories. After briefly discussing our earlier attempt to capture subtle visual differences using sparse/low-rank analysis, the main part of the thesis reflects the trends of the past few years as deep learning prevails. In the first part of the thesis, we address the problem of fine-grained recognition via a patch-based framework built upon Convolutional Neural Network (CNN) features. We introduce triplets of patches with two geometric constraints to improve the accuracy of patch localization, and automatically mine discriminative geometrically-constrained triplets for recognition. In the second part we begin to learn discriminative features in an end-to-end fashion. We propose a supervised feature learning approach, Label Consistent Neural Network, which enforces direct supervision in late hidden layers. We associate each neuron in a hidden layer with a particular class and encourage it to be activated for input signals from that class by introducing a label consistency regularization. This label consistency constraint makes the features more discriminative and leads to faster convergence. The third part proposes a more sophisticated and effective end-to-end network specifically designed for fine-grained recognition, which learns discriminative patches within a CNN. We show that the patch-level learning capability of a CNN can be enhanced by learning a bank of convolutional filters that capture class-specific discriminative patches without extra part or bounding box annotations. Such a filter bank is well structured, properly initialized, and discriminatively learned through a novel asymmetric multi-stream architecture with convolutional filter supervision and a non-random layer initialization. In the last part we go beyond obtaining category labels and study the problem of continuous 3D pose estimation for fine-grained object categories. We augment three existing popular fine-grained recognition datasets by annotating each instance in the image with its fine-grained 3D shape and ground-truth 3D pose. We cast the problem into a detection framework based on Faster/Mask R-CNN. To utilize the 3D information, we also introduce a novel 3D representation, named the location field, that is effective for representing 3D shapes.
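    As an illustration of the label-consistency idea from the second part, here is a minimal PyTorch sketch: each unit of a late hidden layer is assigned to one class, and its squashed activation is pushed toward a per-class target pattern alongside the usual classification loss. The class name LabelConsistentMLP, the even partition of units across classes, and the lc_weight coefficient are assumptions for illustration, not the thesis's exact formulation.

    # Hypothetical sketch of a label-consistency regularizer, not the
    # thesis's actual implementation.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class LabelConsistentMLP(nn.Module):
        def __init__(self, in_dim, hidden_dim, num_classes):
            super().__init__()
            assert hidden_dim % num_classes == 0, "partition units evenly per class"
            self.num_classes = num_classes
            self.backbone = nn.Sequential(
                nn.Linear(in_dim, hidden_dim), nn.ReLU(),
                nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),  # "late" hidden layer
            )
            self.classifier = nn.Linear(hidden_dim, num_classes)
            # Fixed assignment: hidden unit i belongs to class i // units_per_class.
            self.register_buffer(
                "unit_class", torch.arange(hidden_dim) // (hidden_dim // num_classes)
            )

        def forward(self, x, labels=None, lc_weight=0.1):
            h = self.backbone(x)                       # late-layer activations
            logits = self.classifier(h)
            if labels is None:
                return logits
            # Ideal activation Q: 1 for units assigned to the sample's class, else 0.
            q = (self.unit_class.unsqueeze(0) == labels.unsqueeze(1)).float()
            lc_loss = F.mse_loss(torch.sigmoid(h), q)  # label-consistency term
            ce_loss = F.cross_entropy(logits, labels)
            return logits, ce_loss + lc_weight * lc_loss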

    Multi-View Priors for Learning Detectors from Sparse Viewpoint Data

    Full text link
    While the majority of today's object class models provide only 2D bounding boxes, far richer output hypotheses are desirable, including viewpoint, fine-grained category, and 3D geometry estimates. However, models trained to provide richer output require larger amounts of training data, preferably well covering the relevant aspects such as viewpoint and fine-grained categories. In this paper, we address this issue from the perspective of transfer learning, and design an object class model that explicitly leverages correlations between visual features. Specifically, our model represents prior distributions over permissible multi-view detectors in a parametric way: the priors are learned once from training data of a source object class, and can later be used to facilitate the learning of a detector for a target class. As we show in our experiments, this transfer is not only beneficial for detectors based on basic-level category representations, but also enables the robust learning of detectors that represent classes at finer levels of granularity, where training data is typically even scarcer and more unbalanced. As a result, we report largely improved performance in simultaneous 2D object localization and viewpoint estimation on a recent dataset of challenging street scenes.
    Comment: 13 pages, 7 figures, 4 tables, International Conference on Learning Representations 201
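    To make the transfer mechanism concrete, here is a hedged NumPy sketch: a parametric (here, diagonal Gaussian) prior over per-viewpoint detector weights is fit from a source class, then used as a MAP-style regularizer when training a target-class detector from scarce data. The Gaussian form, the logistic-loss detector, and all function names are assumptions for illustration; the paper's actual prior parameterization may differ.

    # Hypothetical sketch of learning and applying a multi-view detector prior.
    import numpy as np

    def fit_gaussian_prior(source_detectors):
        """source_detectors: (n_detectors, d) weight vectors, one per viewpoint."""
        mu = source_detectors.mean(axis=0)
        var = source_detectors.var(axis=0) + 1e-6    # diagonal covariance
        return mu, var

    def train_target_detector(X, y, mu, var, lam=1.0, lr=0.1, steps=500):
        """Logistic-loss detector w, MAP estimate under the N(mu, var) prior."""
        w = mu.copy()                                # warm-start at the prior mean
        for _ in range(steps):
            p = 1.0 / (1.0 + np.exp(-X @ w))         # sigmoid detection scores
            grad = X.T @ (p - y) / len(y)            # logistic-loss gradient
            grad += lam * (w - mu) / var             # pull toward the learned prior
            w -= lr * grad
        return w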

    RotationNet: Joint Object Categorization and Pose Estimation Using Multiviews from Unsupervised Viewpoints

    Full text link
    We propose a Convolutional Neural Network (CNN)-based model "RotationNet," which takes multi-view images of an object as input and jointly estimates its pose and object category. Unlike previous approaches that use known viewpoint labels for training, our method treats the viewpoint labels as latent variables, which are learned in an unsupervised manner during training on an unaligned object dataset. RotationNet is designed to use only a partial set of multi-view images for inference, and this property makes it useful in practical scenarios where only partial views are available. Moreover, our pose alignment strategy enables one to obtain view-specific feature representations shared across classes, which is important for maintaining high accuracy in both object categorization and pose estimation. The effectiveness of RotationNet is demonstrated by its superior performance over state-of-the-art methods for 3D object classification on the 10- and 40-class ModelNet datasets. We also show that RotationNet, even trained without known poses, achieves state-of-the-art performance on an object pose estimation dataset. The code is available at https://github.com/kanezaki/rotationnet
    Comment: 24 pages, 23 figures. Accepted to CVPR 201
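    A simplified sketch of RotationNet-style joint inference may help: per-view class scores are computed under every candidate viewpoint slot, and the category and pose are chosen together by maximizing the summed score over view-to-slot alignments. The tensor layout and the restriction to cyclic (turntable) alignments are assumptions for illustration, not the paper's exact inference procedure.

    # Hypothetical sketch of joint category/pose inference over view alignments.
    import torch

    def rotationnet_infer(view_log_probs):
        """view_log_probs: (M, M, C) log P(class c | view image i, viewpoint slot j)."""
        M, _, C = view_log_probs.shape
        best = (-float("inf"), None, None)
        for shift in range(M):                       # candidate pose alignments
            slots = (torch.arange(M) + shift) % M    # view i -> viewpoint slot
            # Sum log-probabilities of the aligned views for each class.
            joint = view_log_probs[torch.arange(M), slots, :].sum(dim=0)  # (C,)
            score, cls = joint.max(dim=0)
            if score.item() > best[0]:
                best = (score.item(), cls.item(), shift)
        _, pred_class, pred_pose = best
        return pred_class, pred_pose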

    Fast Single Shot Detection and Pose Estimation

    Full text link
    For applications in navigation and robotics, estimating the 3D pose of objects is as important as detection. Many approaches to pose estimation rely on detecting or tracking parts or keypoints [11, 21]. In this paper we build on a recent state-of-the-art convolutional network for sliding-window detection [10] to provide detection and rough pose estimation in a single shot, without intermediate stages of detecting parts or initial bounding boxes. While not the first system to treat pose estimation as a categorization problem, this is the first attempt to combine detection and pose estimation at the same level using a deep learning approach. The key to the architecture is a deep convolutional network where scores for the presence of an object category, the offset for its location, and the approximate pose are all estimated on a regular grid of locations in the image. The resulting system is as accurate as recent work on pose estimation (42.4% 8-view mAVP on Pascal 3D+ [21]) and significantly faster (46 frames per second (FPS) on a TITAN X GPU). This approach to detection and rough pose estimation is fast and accurate enough to be widely applied as a pre-processing step for tasks including high-accuracy pose estimation, object tracking and localization, and vSLAM.
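    The grid-based prediction can be illustrated with a hedged PyTorch sketch of a single-shot head that outputs, at each grid cell, category scores, box offsets, and a discrete pose bin (pose treated as categorization). The channel layout, bin count, and class name are assumptions for illustration, not the paper's exact architecture.

    # Hypothetical single-shot detection + pose head over a feature grid.
    import torch
    import torch.nn as nn

    class DetectionPoseHead(nn.Module):
        def __init__(self, in_ch, num_classes, num_pose_bins=8):
            super().__init__()
            self.num_classes, self.num_pose_bins = num_classes, num_pose_bins
            out_ch = num_classes + 4 + num_pose_bins   # scores + box + pose
            self.pred = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)

        def forward(self, feat):
            out = self.pred(feat)                      # (B, out_ch, H, W)
            cls = out[:, : self.num_classes]           # category scores per cell
            box = out[:, self.num_classes : self.num_classes + 4]   # box offsets
            pose = out[:, self.num_classes + 4 :]      # pose-bin scores per cell
            return cls, box, pose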