Hand-Shape Recognition Using the Distributions of Multi-Viewpoint Image Sets
This paper proposes a method for recognizing hand-shapes using multi-viewpoint image sets. Recognizing a hand-shape is a difficult problem, as the appearance of the hand changes significantly depending on viewpoint, illumination conditions, and individual characteristics. To overcome this problem, we apply the Kernel Orthogonal Mutual Subspace Method (KOMSM) to shift-invariant features obtained from multi-viewpoint images of a hand. When KOMSM is applied to hand recognition with many training images per class, the kernel trick makes its computational cost prohibitive, so reducing that cost becomes essential. We propose a new method that drastically reduces the computational cost of KOMSM by representing each class with the centroids obtained by k-means clustering, together with the number of images belonging to each centroid. The validity of the proposed method is demonstrated through evaluation experiments using multi-viewpoint image sets of 30 classes of hand-shapes.
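To make the cost-reduction idea concrete: each class's full image set is replaced by k-means centroids plus per-centroid member counts before any kernel computation, shrinking the Gram matrix. A minimal Python sketch, assuming scikit-learn is available; the function names and the sqrt-count weighting of the Gram matrix are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np
from sklearn.cluster import KMeans

def compress_class_images(features, k=32, seed=0):
    # Replace a class's feature vectors with k centroids and the
    # number of members per cluster (assumed interface, not KOMSM's).
    km = KMeans(n_clusters=k, random_state=seed, n_init=10).fit(features)
    counts = np.bincount(km.labels_, minlength=k)
    return km.cluster_centers_, counts

def weighted_rbf_gram(centroids, counts, gamma=1e-3):
    # Weighted RBF Gram matrix: each centroid stands in for counts[i]
    # samples; sqrt-count weighting is one plausible choice (assumption).
    sq = np.sum(centroids ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * centroids @ centroids.T
    w = np.sqrt(counts.astype(float))
    return (w[:, None] * w[None, :]) * np.exp(-gamma * d2)

# 1,000 images of one hand-shape class, 256-D features each:
feats = np.random.randn(1000, 256)
centers, counts = compress_class_images(feats, k=32)
gram = weighted_rbf_gram(centers, counts)  # 32 x 32 instead of 1000 x 1000
```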
One-shot learning of object categories
Learning visual models of object categories notoriously requires hundreds or thousands of training examples. We show that it is possible to learn a great deal about a category from just one, or a handful of, images. The key insight is that, rather than learning from scratch, one can take advantage of knowledge coming from previously learned categories, no matter how different these categories might be. We explore a Bayesian implementation of this idea. Object categories are represented by probabilistic models. Prior knowledge is represented as a probability density function on the parameters of these models. The posterior model for an object category is obtained by updating the prior in the light of one or more observations. We test a simple implementation of our algorithm on a database of 101 diverse object categories. We compare category models learned by an implementation of our Bayesian approach to models learned by maximum likelihood (ML) and maximum a posteriori (MAP) methods. We find that on a database of more than 100 categories, the Bayesian approach produces informative models when the number of training examples is too small for other methods to operate successfully.
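The prior-update machinery is easiest to see in a conjugate toy case. A minimal sketch, assuming a Gaussian prior on a scalar category parameter and known observation noise; the paper's models are far richer, but the one-shot behavior is the same: the posterior sits between the prior and the single example:

```python
import numpy as np

def posterior_gaussian(mu0, var0, obs, obs_var):
    # Conjugate update of a N(mu0, var0) prior on a category parameter,
    # given observations with known noise variance obs_var.
    obs = np.atleast_1d(obs)
    prec = 1.0 / var0 + len(obs) / obs_var     # posterior precision
    var_n = 1.0 / prec
    mu_n = var_n * (mu0 / var0 + obs.sum() / obs_var)
    return mu_n, var_n

# Prior distilled from previously learned categories: N(0, 1).
# One-shot observation at 2.0 with noise variance 0.5:
mu, var = posterior_gaussian(0.0, 1.0, [2.0], 0.5)
print(mu, var)  # ~1.33, ~0.33: pulled toward the example, anchored by the prior
```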
Consistent Generative Query Networks
Stochastic video prediction models take in a sequence of image frames, and
generate a sequence of consecutive future image frames. These models typically
generate future frames in an autoregressive fashion, which is slow and requires
the input and output frames to be consecutive. We introduce a model that
overcomes these drawbacks by generating a latent representation from an
arbitrary set of frames that can then be used to simultaneously and efficiently
sample temporally consistent frames at arbitrary time-points. For example, our
model can "jump" and directly sample frames at the end of the video, without
sampling intermediate frames. Synthetic video evaluations confirm substantial
gains in speed and functionality without loss in fidelity. We also apply our
framework to a 3D scene reconstruction dataset. Here, our model is conditioned
on camera location and can sample consistent sets of images for what an
occluded region of a 3D scene might look like, even if there are multiple
possibilities for what that region might contain. Reconstructions and videos
are available at https://bit.ly/2O4Pc4R
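A toy sketch of the interface this describes: an order-free encoder pools an arbitrary set of (frame, time) pairs into one latent, and a decoder queries that latent at any target time. The linear maps below are stand-ins for the paper's networks; all names and sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 64, 128                            # frame-feature and latent sizes
W_enc = 0.01 * rng.standard_normal((D + 1, H))
W_dec = 0.01 * rng.standard_normal((H + 1, D))

def encode(frames, times):
    # Sum-pool per-frame codes: order-free, so any subset of frames
    # taken at any time-points yields a single video latent.
    x = np.concatenate([frames, times[:, None]], axis=1)
    return np.tanh(x @ W_enc).sum(axis=0)

def decode(latent, t):
    # Query the latent at an arbitrary time-point t: no autoregressive
    # roll-out through intermediate frames is required.
    return np.concatenate([latent, [t]]) @ W_dec

z = encode(rng.standard_normal((3, D)), np.array([0.0, 0.1, 0.2]))
frame_at_end = decode(z, t=1.0)           # "jump" straight to the end
print(frame_at_end.shape)                 # (64,)
```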
Boosted Random ferns for object detection
In this paper we introduce Boosted Random Ferns (BRFs) to rapidly build discriminative classifiers for learning and detecting object categories. At the core of our approach we use standard random ferns, but we introduce four main innovations that let us bring ferns from the instance level to the category level while retaining efficiency. First, we define binary features in the histogram-of-oriented-gradients domain (as opposed to the intensity domain), allowing for a better representation of intra-class variability. Second, both the positions where ferns are evaluated within the sliding window and the locations of the binary features within each fern are not chosen completely at random; instead, we use a boosting strategy to pick the most discriminative combination of them. This is further enhanced by our third contribution, which is to adapt the boosting strategy to enable sharing of binary features among different ferns, yielding high recognition rates at a low computational cost. Finally, we show that training can be performed online, for sequentially arriving images. Overall, the resulting classifier can be trained very efficiently, densely evaluated at all image locations in about 0.1 seconds, and provides detection rates similar to competing approaches that require expensive and significantly slower processing. We demonstrate the effectiveness of our approach through thorough experimentation on publicly available datasets, comparing against the state of the art, for tasks of both 2D detection and 3D multi-view estimation.
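For readers unfamiliar with ferns, a minimal sketch of a single random fern over binarized HOG comparisons may help. Here the test locations are random rather than boosted and all names are illustrative, so this shows the underlying data structure rather than the paper's full BRF training:

```python
import numpy as np

class RandomFern:
    # One fern: k binary comparisons over a HOG descriptor. The k bits
    # index a 2^k-entry table of class log-odds. Boosting would choose
    # the comparison locations; here they are random for brevity.
    def __init__(self, dim, k=8, seed=0):
        rng = np.random.default_rng(seed)
        self.pairs = rng.integers(0, dim, size=(k, 2))  # HOG bins to compare
        self.table = np.zeros(2 ** k)

    def index(self, hog):
        bits = (hog[self.pairs[:, 0]] > hog[self.pairs[:, 1]]).astype(int)
        return int(bits @ (1 << np.arange(len(bits))))

    def fit(self, X, y, eps=1.0):
        pos, neg = np.zeros_like(self.table), np.zeros_like(self.table)
        for hog, label in zip(X, y):
            (pos if label else neg)[self.index(hog)] += 1
        self.table = np.log((pos + eps) / (neg + eps))

    def score(self, hog):
        return self.table[self.index(hog)]

X = np.random.rand(200, 36)               # toy 36-D HOG descriptors
y = np.random.randint(0, 2, 200)
fern = RandomFern(dim=36)
fern.fit(X, y)
print(fern.score(X[0]))                   # object-vs-background log-odds
```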
Learning Local Shape Descriptors from Part Correspondences With Multi-view Convolutional Networks
We present a new local descriptor for 3D shapes, directly applicable to a
wide range of shape analysis problems such as point correspondences, semantic
segmentation, affordance prediction, and shape-to-scan matching. The descriptor
is produced by a convolutional network that is trained to embed geometrically
and semantically similar points close to one another in descriptor space. The
network processes surface neighborhoods around points on a shape that are
captured at multiple scales by a succession of progressively zoomed out views,
taken from carefully selected camera positions. We leverage two extremely large
sources of data to train our network. First, since our network processes
rendered views in the form of 2D images, we repurpose architectures pre-trained
on massive image datasets. Second, we automatically generate a synthetic dense
point correspondence dataset by non-rigid alignment of corresponding shape
parts in a large collection of segmented 3D models. As a result of these design
choices, our network effectively encodes multi-scale local context and
fine-grained surface detail. Our network can be trained to produce either
category-specific descriptors or more generic descriptors by learning from
multiple shape categories. Once trained, at test time, the network extracts
local descriptors for shapes without requiring any part segmentation as input.
Our method can produce effective local descriptors even for shapes whose
category is unknown or different from the ones used while training. We
demonstrate through several experiments that our learned local descriptors are
more discriminative than state-of-the-art alternatives, and are
effective in a variety of shape analysis applications.
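The training signal described here, embedding geometrically and semantically similar points close in descriptor space, is commonly realized with a contrastive objective. A minimal sketch under that assumption (the paper's exact loss may differ):

```python
import numpy as np

def contrastive_loss(desc_a, desc_b, same, margin=1.0):
    # Pull corresponding point pairs (same=1) together in descriptor
    # space; push non-corresponding pairs apart past a margin.
    d = np.linalg.norm(desc_a - desc_b, axis=1)
    pos = same * d ** 2
    neg = (1 - same) * np.maximum(margin - d, 0.0) ** 2
    return float(np.mean(pos + neg))

# Toy 128-D descriptors for 4 point pairs; pairs 0 and 1 correspond:
a = np.random.randn(4, 128)
b = a + 0.05 * np.random.randn(4, 128)
print(contrastive_loss(a, b, np.array([1, 1, 0, 0])))
```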
Discovery of Latent 3D Keypoints via End-to-end Geometric Reasoning
This paper presents KeypointNet, an end-to-end geometric reasoning framework
to learn an optimal set of category-specific 3D keypoints, along with their
detectors. Given a single image, KeypointNet extracts 3D keypoints that are
optimized for a downstream task. We demonstrate this framework on 3D pose
estimation by proposing a differentiable objective that seeks the optimal set
of keypoints for recovering the relative pose between two views of an object.
Our model discovers geometrically and semantically consistent keypoints across
viewing angles and instances of an object category. Importantly, we find that
our end-to-end framework using no ground-truth keypoint annotations outperforms
a fully supervised baseline using the same neural network architecture on the
task of pose estimation. The discovered 3D keypoints on the car, chair, and
plane categories of ShapeNet are visualized at http://keypointnet.github.io/
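The differentiable pose objective can be grounded in a standard building block: the closed-form best-fit rotation between two keypoint sets (Kabsch/Procrustes), which is differentiable through the SVD. A minimal sketch of that block with illustrative names; how KeypointNet wraps it into a training loss is not reproduced here:

```python
import numpy as np

def relative_rotation(kp1, kp2):
    # Best-fit rotation aligning two (N, 3) keypoint sets via SVD
    # (Kabsch/Procrustes); every step is differentiable, so an angular
    # error on the result can back-propagate into a keypoint detector.
    p = kp1 - kp1.mean(axis=0)
    q = kp2 - kp2.mean(axis=0)
    u, _, vt = np.linalg.svd(p.T @ q)
    d = np.sign(np.linalg.det(vt.T @ u.T))  # guard against reflections
    return vt.T @ np.diag([1.0, 1.0, d]) @ u.T

# Sanity check on a known rotation about the z-axis:
t = 0.3
R = np.array([[np.cos(t), -np.sin(t), 0.0],
              [np.sin(t),  np.cos(t), 0.0],
              [0.0, 0.0, 1.0]])
kp = np.random.randn(10, 3)
print(np.allclose(relative_rotation(kp, kp @ R.T), R))  # True
```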
Purely Geometric Scene Association and Retrieval - A Case for Macro Scale 3D Geometry
We address the problems of measuring geometric similarity between 3D scenes,
represented through point clouds or range data frames, and associating them.
Our approach leverages macro-scale 3D structural geometry - the relative
configuration of arbitrary surfaces and relationships among structures that are
potentially far apart. We express such discriminative information in a
viewpoint-invariant feature space. These are subsequently encoded in a
frame-level signature that can be utilized to measure geometric similarity.
Such a characterization is robust to noise, incomplete and partially
overlapping data, and viewpoint changes. We show how it can be employed to
select a diverse set of data frames which have structurally similar content,
and how to validate whether views with similar geometric content are from the
same scene. The problem is formulated as one of general purpose retrieval from
an unannotated, spatio-temporally unordered database. Empirical analysis
indicates that the presented approach consistently outperforms baselines on depth
/ range data. Its depth-only performance is competitive with state-of-the-art
approaches with RGB or RGB-D inputs, including ones based on deep learning.
Experiments show retrieval performance to hold up well with much sparser
databases, which is indicative of the approach's robustness. The approach
generalized well: it required no dataset-specific training and scaled up in
our experiments. Finally, we also demonstrate how a geometrically diverse
selection of views can result in richer 3D reconstructions.
Comment: Accepted in ICRA '1
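As a much-simplified stand-in for the frame-level signature idea, here is a classic viewpoint-invariant descriptor, the D2 shape distribution (a histogram of pairwise point distances), with histogram intersection as the retrieval score; the paper's macro-scale descriptor is richer, and all parameters below are assumptions:

```python
import numpy as np

def d2_signature(points, n_pairs=20000, bins=64, r_max=10.0, seed=0):
    # Normalized histogram of distances between randomly sampled point
    # pairs: rigid motions (viewpoint changes) leave it unchanged.
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(points), n_pairs)
    j = rng.integers(0, len(points), n_pairs)
    d = np.linalg.norm(points[i] - points[j], axis=1)
    h, _ = np.histogram(d, bins=bins, range=(0.0, r_max))
    return h / h.sum()

def similarity(sig_a, sig_b):
    # Histogram intersection as a cheap frame-to-frame retrieval score.
    return np.minimum(sig_a, sig_b).sum()

cloud = np.random.rand(5000, 3) * 5.0
t = 1.0
R = np.array([[np.cos(t), -np.sin(t), 0.0],
              [np.sin(t),  np.cos(t), 0.0],
              [0.0, 0.0, 1.0]])
print(similarity(d2_signature(cloud), d2_signature(cloud @ R.T)))  # ~1.0
```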
Learning Where to Look: Data-Driven Viewpoint Set Selection for 3D Scenes
The use of rendered images, whether from completely synthetic datasets or
from 3D reconstructions, is increasingly prevalent in vision tasks. However,
little attention has been given to how the selection of viewpoints affects the
performance of rendered training sets. In this paper, we propose a data-driven
approach to view set selection. Given a set of example images, we extract
statistics describing their contents and generate a set of views matching the
distribution of those statistics. Motivated by semantic segmentation tasks, we
model the spatial distribution of each semantic object category within an image
view volume. We provide a search algorithm that generates a sampling of likely
candidate views according to the example distribution, and a set selection
algorithm that chooses a subset of the candidates that jointly cover the
example distribution. Results of experiments with these algorithms on SUNCG
indicate that they are indeed able to produce view distributions similar to an
example set from NYUDv2 according to the earth mover's distance. Furthermore,
the selected views improve performance on semantic segmentation compared to
alternative view selection algorithms.
Comment: ICCV submission, combined main paper and supplemental material
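A minimal sketch of the selection loop, assuming each view is summarized by a single scalar statistic and using SciPy's 1-D earth mover's distance; the paper's per-category spatial models are richer, but the greedy set-selection pattern is analogous:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def greedy_view_selection(example_stats, candidate_stats, k):
    # Greedily grow a view set so its pooled statistics match the
    # example distribution under the earth mover's distance.
    chosen, pool = [], []
    remaining = list(range(len(candidate_stats)))
    for _ in range(k):
        best_i, best_d = None, np.inf
        for i in remaining:
            d = wasserstein_distance(example_stats,
                                     pool + [candidate_stats[i]])
            if d < best_d:
                best_i, best_d = i, d
        chosen.append(best_i)
        pool.append(candidate_stats[best_i])
        remaining.remove(best_i)
    return chosen

# Example-image statistics (e.g., per-image fraction of pixels in one
# category, a stand-in for the paper's richer spatial statistics):
examples = np.random.beta(2.0, 5.0, 200)
candidates = list(np.random.rand(100))    # one scalar per candidate view
print(greedy_view_selection(examples, candidates, k=10))
```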
Occlusion Coherence: Detecting and Localizing Occluded Faces
The presence of occluders significantly impacts object recognition accuracy.
However, occlusion is typically treated as an unstructured source of noise and
explicit models for occluders have lagged behind those for object appearance
and shape. In this paper we describe a hierarchical deformable part model for
face detection and landmark localization that explicitly models part occlusion.
The proposed model structure makes it possible to augment positive training
data with large numbers of synthetically occluded instances. This allows us to
easily incorporate the statistics of occlusion patterns in a discriminatively
trained model. We test the model on several benchmarks for landmark
localization and detection including challenging new data sets featuring
significant occlusion. We find that the addition of an explicit occlusion model
yields a detection system that outperforms existing approaches for occluded
instances while maintaining competitive accuracy in detection and landmark
localization for unoccluded instances.
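The augmentation step, synthetically occluding positive examples and recording which parts fall under the occluder, can be sketched directly; the box shape, filler texture, and landmark flagging below are illustrative assumptions:

```python
import numpy as np

def occlude(image, parts, rng, size=0.3):
    # Paste a random-texture box over a positive example and flag the
    # parts/landmarks that fall under it. Box shape and filler are
    # illustrative; `parts` is an (N, 2) array of (x, y) locations.
    h, w = image.shape[:2]
    bw, bh = int(w * size), int(h * size)
    x0, y0 = rng.integers(0, w - bw), rng.integers(0, h - bh)
    out = image.copy()
    out[y0:y0 + bh, x0:x0 + bw] = rng.integers(
        0, 256, (bh, bw) + image.shape[2:], dtype=np.uint8)
    occluded = ((parts[:, 0] >= x0) & (parts[:, 0] < x0 + bw) &
                (parts[:, 1] >= y0) & (parts[:, 1] < y0 + bh))
    return out, occluded

rng = np.random.default_rng(0)
img = rng.integers(0, 256, (96, 96, 3), dtype=np.uint8)
landmarks = rng.uniform(0, 96, (68, 2))   # toy landmark positions
aug, occ = occlude(img, landmarks, rng)
print(int(occ.sum()), "of 68 landmarks synthetically occluded")
```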
SurfConv: Bridging 3D and 2D Convolution for RGBD Images
We tackle the problem of using 3D information in convolutional neural
networks for downstream recognition tasks. Using depth as an additional
channel alongside the RGB input suffers from the scale-variance problem
present in image-convolution-based approaches. On the other hand, 3D
convolution wastes a large amount of memory on mostly unoccupied 3D space,
since the observed data consist only of the surface visible to the sensor.
Instead, we propose SurfConv, which "slides"
compact 2D filters along the visible 3D surface. SurfConv is formulated as a
simple depth-aware multi-scale 2D convolution, through a new Data-Driven Depth
Discretization (D4) scheme. We demonstrate the effectiveness of our method on
indoor and outdoor 3D semantic segmentation datasets. Our method achieves
state-of-the-art performance with less than 30% of the parameters used by 3D
convolution-based approaches.
Comment: Published at CVPR 201
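A rough sketch of the D4 idea as described: pick depth-bin edges from quantiles so the levels are balanced, convolve once per level, and let each output pixel take its own level's response. This single-channel toy (with hypothetical names) only approximates "sliding filters along the visible surface":

```python
import numpy as np
from scipy.ndimage import convolve

def d4_bins(depth, n_levels=4):
    # Data-driven depth discretization (D4, sketched): bin edges from
    # depth quantiles so each level holds roughly equally many pixels.
    edges = np.quantile(depth, np.linspace(0, 1, n_levels + 1)[1:-1])
    return np.digitize(depth, edges)  # level index (0..n_levels-1) per pixel

def surfconv_like(img, depth, kernels):
    # Depth-aware multi-scale 2D convolution: one filter response per
    # depth level; each output pixel takes the response of its own level.
    levels = d4_bins(depth, n_levels=len(kernels))
    responses = np.stack([convolve(img, k, mode="nearest") for k in kernels])
    return np.take_along_axis(responses, levels[None], axis=0)[0]

img = np.random.rand(64, 64)              # single-channel toy image
depth = np.random.rand(64, 64) * 10.0     # toy depth map
kernels = [np.random.randn(3, 3) for _ in range(4)]
print(surfconv_like(img, depth, kernels).shape)  # (64, 64)
```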