Time-Agnostic Prediction: Predicting Predictable Video Frames
Prediction is arguably one of the most basic functions of an intelligent
system. In general, the problem of predicting events in the future or between
two waypoints is exceedingly difficult. However, most phenomena naturally pass
through relatively predictable bottlenecks---while we cannot predict the
precise trajectory of a robot arm between being at rest and holding an object
up, we can be certain that it must have picked the object up. To exploit this,
we decouple visual prediction from a rigid notion of time. While conventional
approaches predict frames at regularly spaced temporal intervals, our
time-agnostic predictors (TAP) are not tied to specific times so that they may
instead discover predictable "bottleneck" frames no matter when they occur. We
evaluate our approach for future and intermediate frame prediction across three
robotic manipulation tasks. Our predictions are not only of higher visual
quality, but also correspond to coherent semantic subgoals in temporally
extended tasks.
Comment: 8 pages, plus appendices
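The core of a time-agnostic predictor is a minimum-over-time loss: the model is penalized only against the candidate target frame it matches best, so it is free to latch onto whichever "bottleneck" frame is most predictable. A minimal NumPy sketch of that loss (the function name and squared-error choice are illustrative, not the paper's exact objective):

```python
import numpy as np

def time_agnostic_loss(pred, targets):
    """Minimum-over-time loss sketch: compare the single predicted frame
    against every candidate target frame and keep only the best match,
    so the predictor is not tied to one fixed time step."""
    # per-frame mean squared error, one entry per candidate target time
    errs = [np.mean((pred - t) ** 2) for t in targets]
    return min(errs)
```

If the prediction exactly matches any frame in the target window, the loss is zero regardless of when that frame occurs.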
L2AE-D: Learning to Aggregate Embeddings for Few-shot Learning with Meta-level Dropout
Few-shot learning focuses on learning a new visual concept with very limited
labelled examples. A successful approach to tackle this problem is to compare
the similarity between examples in a learned metric space based on
convolutional neural networks. However, existing methods typically suffer from
meta-level overfitting due to the limited amount of training tasks and do not
normally consider the importance of the convolutional features of different
examples within the same channel. To address these limitations, we make the
following two contributions: (a) We propose a novel meta-learning approach for
aggregating useful convolutional features and suppressing noisy ones based on a
channel-wise attention mechanism to improve class representations. The proposed
model does not require fine-tuning and can be trained in an end-to-end manner.
The main novelty lies in incorporating a shared weight generation module that
learns to assign different weights to the feature maps of different examples
within the same channel. (b) We also introduce a simple meta-level dropout
technique that reduces meta-level overfitting in several few-shot learning
approaches. In our experiments, we find that this simple technique
significantly improves the performance of the proposed method as well as
various state-of-the-art meta-learning algorithms. Applying our method to
few-shot image recognition using Omniglot and miniImageNet datasets shows that
it delivers state-of-the-art classification performance.
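The channel-wise aggregation can be sketched as a softmax over per-example attention logits within each channel, so useful feature maps dominate the class representation and noisy ones are suppressed. In the paper the logits come from a learned, shared weight-generation module; here they are simply an input (a hedged sketch, not the exact architecture):

```python
import numpy as np

def softmax(x, axis=0):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def aggregate_class_embedding(support, scores):
    """support: (K, C, H, W) feature maps of K examples of one class.
    scores:  (K, C) raw attention logits (assumed given here).
    Returns a (C, H, W) class representation: for each channel, a
    softmax-weighted sum over the K examples' feature maps."""
    w = softmax(scores, axis=0)            # (K, C), weights sum to 1 per channel
    return np.einsum('kc,kchw->chw', w, support)
```

With equal logits this reduces to plain averaging of the support examples; meta-level dropout would additionally zero out embedding units during meta-training.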
An empirical study on evaluation metrics of generative adversarial networks
Evaluating generative adversarial networks (GANs) is inherently challenging.
In this paper, we revisit several representative sample-based evaluation
metrics for GANs, and address the problem of how to evaluate the evaluation
metrics. We start with a few necessary conditions for metrics to produce
meaningful scores, such as distinguishing real from generated samples,
identifying mode dropping and mode collapsing, and detecting overfitting. With
a series of carefully designed experiments, we comprehensively investigate
existing sample-based metrics and identify their strengths and limitations in
practical settings. Based on these results, we observe that kernel Maximum Mean
Discrepancy (MMD) and the 1-Nearest-Neighbor (1-NN) two-sample test seem to
satisfy most of the desirable properties, provided that the distances between
samples are computed in a suitable feature space. Our experiments also unveil
interesting properties about the behavior of several popular GAN models, such
as whether they are memorizing training samples, and how far they are from
learning the target distribution.
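The two metrics the study singles out can be sketched directly on sample arrays. Below is a minimal NumPy version of a biased squared kernel MMD with a Gaussian kernel and a leave-one-out 1-NN two-sample test (the `gamma` value and kernel choice are illustrative, not the paper's exact configuration):

```python
import numpy as np

def mmd2(X, Y, gamma=1.0):
    """Biased squared kernel Maximum Mean Discrepancy with a Gaussian
    kernel; near 0 when X and Y come from the same distribution."""
    def k(A, B):
        d = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d)
    return k(X, X).mean() + k(Y, Y).mean() - 2 * k(X, Y).mean()

def one_nn_accuracy(X, Y):
    """Leave-one-out 1-NN two-sample test: label real samples 0 and
    generated samples 1, then classify each point by its nearest other
    point. Accuracy near 0.5 means the sets are indistinguishable
    (good for a generator); near 1.0 means clearly distinguishable."""
    Z = np.vstack([X, Y])
    lab = np.array([0] * len(X) + [1] * len(Y))
    d = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    np.fill_diagonal(d, np.inf)            # exclude each point itself
    pred = lab[d.argmin(axis=1)]
    return (pred == lab).mean()
```

As the paper notes, both only behave well when the distances are computed in a suitable feature space rather than raw pixels.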
Improving Nighttime Retrieval-Based Localization
Outdoor visual localization is a crucial component to many computer vision
systems. We propose an approach to localization from images that is designed to
explicitly handle the strong variations in appearance happening between daytime
and nighttime. As revealed by recent long-term localization benchmarks, both
traditional feature-based and retrieval-based approaches still struggle to
handle such changes. Our novel localization method combines a state-of-the-art
image retrieval architecture with condition-specific sub-networks allowing the
computation of global image descriptors that explicitly depend on the
capture conditions. We show that our approach improves localization by a
factor of almost 300% compared to popular VLAD-based methods on nighttime
localization.
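The condition-specific idea can be sketched as routing a shared backbone feature through a per-condition projection head before L2 normalization (all names here are hypothetical; the actual architecture is a retrieval network with learned sub-networks, not a single matrix per condition):

```python
import numpy as np

def global_descriptor(feats, condition, heads):
    """Sketch of condition-specific descriptor heads: a shared backbone
    feature vector is projected by a per-condition matrix (e.g. 'day' vs
    'night'), then L2-normalized for retrieval by cosine similarity."""
    W = heads[condition]                   # e.g. heads = {'day': Wd, 'night': Wn}
    d = W @ feats
    return d / np.linalg.norm(d)
```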
Learning a Multi-View Stereo Machine
We present a learnt system for multi-view stereopsis. In contrast to recent
learning based methods for 3D reconstruction, we leverage the underlying 3D
geometry of the problem through feature projection and unprojection along
viewing rays. By formulating these operations in a differentiable manner, we
are able to learn the system end-to-end for the task of metric 3D
reconstruction. End-to-end learning allows us to jointly reason about shape
priors while conforming to geometric constraints, enabling reconstruction from
far fewer images (even a single image) than required by classical approaches,
as well as completion of unseen surfaces. We thoroughly evaluate our approach
on the ShapeNet dataset and demonstrate the benefits over classical approaches
as well as recent learning-based methods.
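The unprojection step places 2D image features along viewing rays in a 3D grid by projecting each grid point into the image with a pinhole camera. A minimal sketch (nearest-neighbor sampling for brevity; the paper uses differentiable, bilinear sampling):

```python
import numpy as np

def unproject_features(feat, K, R, t, grid):
    """feat: (H, W, C) image feature map; K, R, t: pinhole camera
    intrinsics and pose; grid: (N, 3) world-space grid points.
    Each grid point is projected into the image and copies the feature
    at the nearest pixel, yielding (N, C) per-voxel features."""
    pts = grid @ R.T + t                   # (N, 3) camera-space points
    uvw = pts @ K.T                        # homogeneous pixel coordinates
    uv = np.round(uvw[:, :2] / uvw[:, 2:3]).astype(int)
    H, W = feat.shape[:2]
    u = np.clip(uv[:, 0], 0, W - 1)
    v = np.clip(uv[:, 1], 0, H - 1)
    return feat[v, u]
```

Because every operation here is a gather plus linear algebra, the bilinear variant admits gradients, which is what lets the full system be trained end-to-end.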
Semantic Instance Segmentation via Deep Metric Learning
We propose a new method for semantic instance segmentation, by first
computing how likely two pixels are to belong to the same object, and then by
grouping similar pixels together. Our similarity metric is based on a deep,
fully convolutional embedding model. Our grouping method is based on selecting
all points that are sufficiently similar to a set of "seed points", chosen from
a deep, fully convolutional scoring model. We show competitive results on the
Pascal VOC instance segmentation benchmark.
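The grouping step can be sketched as thresholded nearest-neighbor growth in embedding space: every pixel whose embedding is close enough to a seed pixel's embedding joins that seed's instance (a minimal sketch; the real method uses a learned scoring model to pick seeds and a learned metric):

```python
import numpy as np

def grow_mask(emb, seed, thresh=0.5):
    """emb: (H, W, D) per-pixel embeddings; seed: (row, col) seed pixel.
    Returns a boolean (H, W) instance mask of all pixels whose embedding
    lies within 'thresh' of the seed's embedding."""
    s = emb[seed]                          # seed embedding, shape (D,)
    d = np.linalg.norm(emb - s, axis=-1)   # distance of every pixel to seed
    return d < thresh
```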
Purely Geometric Scene Association and Retrieval - A Case for Macro Scale 3D Geometry
We address the problems of measuring geometric similarity between 3D scenes,
represented through point clouds or range data frames, and associating them.
Our approach leverages macro-scale 3D structural geometry - the relative
configuration of arbitrary surfaces and relationships among structures that are
potentially far apart. We express such discriminative information in a
viewpoint-invariant feature space. These are subsequently encoded in a
frame-level signature that can be utilized to measure geometric similarity.
Such a characterization is robust to noise, incomplete and partially
overlapping data besides viewpoint changes. We show how it can be employed to
select a diverse set of data frames which have structurally similar content,
and how to validate whether views with similar geometric content are from the
same scene. The problem is formulated as one of general purpose retrieval from
an unannotated, spatio-temporally unordered database. Empirical analysis
indicates that the presented approach substantially outperforms baselines on
depth/range data. Its depth-only performance is competitive with state-of-the-art
approaches with RGB or RGB-D inputs, including ones based on deep learning.
Experiments show retrieval performance to hold up well with much sparser
databases, which is indicative of the approach's robustness. The approach
generalizes well: it requires no dataset-specific training and scaled up in
our experiments. Finally, we also demonstrate how a geometrically diverse
selection of views can result in richer 3D reconstructions.
Comment: Accepted in ICRA '1
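A viewpoint-invariant frame signature can be sketched as a normalized histogram of pairwise point distances: pairwise distances are unchanged by rotation and translation, so two views of the same structure produce similar signatures (this stand-in descriptor is an assumption for illustration, not the paper's actual encoding):

```python
import numpy as np

def geometric_signature(points, bins=8, rmax=10.0):
    """points: (N, 3) point cloud. Returns a length-'bins' normalized
    histogram of pairwise point distances, which is invariant to rigid
    transforms of the input and so comparable across viewpoints."""
    d = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    iu = np.triu_indices(len(points), k=1)  # each pair counted once
    h, _ = np.histogram(d[iu], bins=bins, range=(0, rmax))
    return h / h.sum()
```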
Hyperbolic Image Embeddings
Computer vision tasks such as image classification, image retrieval and
few-shot learning are currently dominated by Euclidean and spherical
embeddings, so that the final decisions about class belongings or the degree of
similarity are made using linear hyperplanes, Euclidean distances, or spherical
geodesic distances (cosine similarity). In this work, we demonstrate that in
many practical scenarios hyperbolic embeddings provide a better alternative.
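The standard distance in the Poincaré ball model of hyperbolic space, which replaces the Euclidean or cosine distances above, is:

```python
import numpy as np

def poincare_distance(x, y):
    """Distance in the Poincare ball model (requires ||x||, ||y|| < 1):
    d(x, y) = arccosh(1 + 2||x - y||^2 / ((1 - ||x||^2)(1 - ||y||^2))).
    Points near the boundary are exponentially far apart, which suits
    hierarchical, tree-like data better than flat Euclidean distance."""
    nx, ny = np.dot(x, x), np.dot(y, y)
    sq = np.dot(x - y, x - y)
    return np.arccosh(1 + 2 * sq / ((1 - nx) * (1 - ny)))
```

For example, the hyperbolic distance from the origin to a point at Euclidean radius 0.9 is already about 2.94, illustrating how the metric stretches near the boundary.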
Scale-Robust Localization Using General Object Landmarks
Visual localization under large changes in scale is an important capability
in many robotic mapping applications, such as localizing at low altitudes in
maps built at high altitudes, or performing loop closure over long distances.
Existing approaches, however, are robust only up to about a 3x difference in
scale between map and query images.
We propose a novel combination of deep-learning-based object features and
state-of-the-art SIFT point-features that yields improved robustness to scale
change. This technique is training-free and class-agnostic, and in principle
can be deployed in any environment out-of-the-box. We evaluate the proposed
technique on the KITTI Odometry benchmark and on a novel dataset of outdoor
images exhibiting large changes in visual scale, which we
have released to the public. Our technique consistently outperforms
localization using either SIFT features or the proposed object features alone,
achieving both greater accuracy and much lower failure rates under large
changes in scale.
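One simple way to fuse the two cues is a weighted score combining a normalized SIFT inlier count (precise, but degrading under large scale change) with an object-feature similarity (coarser, but scale-robust). This scoring rule is a hypothetical sketch, not the paper's actual combination mechanism:

```python
def combined_match_score(sift_inliers, obj_sim, max_inliers=100, alpha=0.5):
    """Fuse point-feature and object-feature evidence for a map/query pair.
    sift_inliers: raw SIFT inlier count (capped at max_inliers);
    obj_sim: object-feature similarity in [0, 1];
    alpha: weight on the SIFT term (hypothetical parameter)."""
    return alpha * min(sift_inliers / max_inliers, 1.0) + (1 - alpha) * obj_sim
```

Under large scale change the SIFT term collapses toward zero, and the object term keeps the correct match ranked above unrelated images.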
Few-Shot Object Detection with Attention-RPN and Multi-Relation Detector
Conventional methods for object detection typically require a substantial
amount of training data and preparing such high-quality training data is very
labor-intensive. In this paper, we propose a novel few-shot object detection
network that aims at detecting objects of unseen categories with only a few
annotated examples. Central to our method are our Attention-RPN, Multi-Relation
Detector and Contrastive Training strategy, which exploit the similarity
between the few shot support set and query set to detect novel objects while
suppressing false detection in the background. To train our network, we
contribute a new dataset that contains 1000 categories of various objects with
high-quality annotations. To the best of our knowledge, this is one of the
first datasets specifically designed for few-shot object detection. Once our
few-shot network is trained, it can detect objects of unseen categories without
further training or fine-tuning. Our method is general and has a wide range of
potential applications. We produce a new state-of-the-art performance on
different datasets in the few-shot setting. The dataset link is
https://github.com/fanq15/Few-Shot-Object-Detection-Dataset.
Comment: CVPR 2020 camera-ready. (Fixed Figure 3 and Table 5; more
implementation details in the supplementary material.)
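The attention idea behind an Attention-RPN can be sketched as pooling the support feature map to a channel vector and correlating it channel-wise with the query feature map, highlighting query locations that resemble the support object (a minimal stand-in for the actual module, which is learned):

```python
import numpy as np

def attention_rpn_map(query, support):
    """query: (C, H, W) query feature map; support: (C, Hs, Ws) support
    feature map. Global-average-pool the support to a (C,) vector and
    correlate it channel-wise with the query, giving an (H, W) attention
    map used to bias region proposals toward support-like objects."""
    s = support.mean(axis=(1, 2))          # (C,) pooled support feature
    return np.einsum('c,chw->hw', s, query)
```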