Gather-Excite: Exploiting Feature Context in Convolutional Neural Networks
While the use of bottom-up local operators in convolutional neural networks
(CNNs) matches well some of the statistics of natural images, it may also
prevent such models from capturing contextual long-range feature interactions.
In this work, we propose a simple, lightweight approach for better context
exploitation in CNNs. We do so by introducing a pair of operators: gather,
which efficiently aggregates feature responses from a large spatial extent, and
excite, which redistributes the pooled information to local features. The
operators are cheap, both in the number of added parameters and in
computational complexity, and can be integrated directly into existing
architectures to improve their performance. Experiments on several datasets
show that gather-excite can bring benefits comparable to increasing the depth
of a CNN at a fraction of the cost. For example, we find ResNet-50 with
gather-excite operators is able to outperform its 101-layer counterpart on
ImageNet with no additional learnable parameters. We also propose a parametric
gather-excite operator pair which yields further performance gains, relate it
to the recently-introduced Squeeze-and-Excitation Networks, and analyse the
effects of these changes to the CNN feature activation statistics.
Comment: NeurIPS 2018
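As a rough illustration of the parameter-free operator pair described above, the following numpy sketch implements gather as global average pooling over the full spatial extent and excite as a per-channel sigmoid gate broadcast back onto the local features. The global extent and the sigmoid choice are assumptions of this sketch, not necessarily the paper's exact design.

```python
import numpy as np

def gather_excite(x):
    """Parameter-free gather-excite sketch (assumed global-extent variant).

    x: feature map of shape (C, H, W).
    gather: aggregate each channel's responses over the full spatial extent;
    excite: redistribute the pooled context as a per-channel gate.
    """
    # gather: global average pooling per channel -> shape (C, 1, 1)
    context = x.mean(axis=(1, 2), keepdims=True)
    # excite: squash the context into a (0, 1) gate and rescale local features
    gate = 1.0 / (1.0 + np.exp(-context))
    return x * gate

feat = np.random.randn(8, 4, 4)   # toy (C, H, W) feature map
out = gather_excite(feat)
assert out.shape == feat.shape
```

Because the gate is computed from the whole spatial extent, every local response is modulated by long-range context, which is the effect the operators are designed to add to purely local convolutions.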
SFNet: Learning Object-aware Semantic Correspondence
We address the problem of semantic correspondence, that is, establishing a
dense flow field between images depicting different instances of the same
object or scene category. We propose to use images annotated with binary
foreground masks and subjected to synthetic geometric deformations to train a
convolutional neural network (CNN) for this task. Using these masks as part of
the supervisory signal offers a good compromise between semantic flow methods,
where the amount of training data is limited by the cost of manually selecting
point correspondences, and semantic alignment ones, where the regression of a
single global geometric transformation between images may be sensitive to
image-specific details such as background clutter. We propose a new CNN
architecture, dubbed SFNet, which implements this idea. It leverages a new and
differentiable version of the argmax function for end-to-end training, with a
loss that combines mask and flow consistency with smoothness terms.
Experimental results demonstrate the effectiveness of our approach, which
significantly outperforms the state of the art on standard benchmarks.
Comment: CVPR 2019 oral paper
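The differentiable argmax mentioned above is commonly realised as a soft-argmax: the expected coordinate under a softmax over matching scores. The sketch below shows this idea on a 2-D correlation map; the temperature `beta` and the exact normalisation are assumptions, not necessarily SFNet's formulation.

```python
import numpy as np

def soft_argmax(corr, beta=10.0):
    """Soft (differentiable) argmax over a 2-D correlation map (sketch).

    corr: (H, W) matching scores. Returns the expected (row, col) position
    under a softmax distribution; as `beta` grows this approaches the hard
    argmax while remaining differentiable end-to-end.
    """
    h, w = corr.shape
    # numerically stable softmax over all score entries
    p = np.exp(beta * (corr - corr.max()))
    p /= p.sum()
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # expectation of coordinates under the softmax distribution
    return (p * rows).sum(), (p * cols).sum()
```

Unlike a hard argmax, this expectation has well-defined gradients with respect to the scores, which is what allows a flow field built from it to be trained with the mask- and flow-consistency losses described above.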
Unsupervised Intuitive Physics from Visual Observations
While learning models of intuitive physics is an increasingly active area of
research, current approaches still fall short of natural intelligences in one
important regard: they require external supervision, such as explicit access to
physical states, at training and sometimes even at test times. Some authors
have relaxed such requirements by supplementing the model with a handcrafted
physical simulator. Still, the resulting methods are unable to automatically
learn new complex environments and to understand physical interactions within
them. In this work, we demonstrate, for the first time, learning such predictors
directly from raw visual observations and without relying on simulators. We do
so in two steps: first, we learn to track mechanically-salient objects in
videos using causality and equivariance, two unsupervised learning principles
that do not require auto-encoding. Second, we demonstrate that the extracted
positions are sufficient to successfully train visual motion predictors that
can take the underlying environment into account. We validate our predictors on
synthetic datasets; then, we introduce a new dataset, ROLL4REAL, consisting of
real objects rolling on complex terrains (pool table, elliptical bowl, and
random height-field). We show that in all such cases it is possible to learn
reliable extrapolators of the object trajectories from raw videos alone,
without any form of external supervision and with no more prior knowledge than
the choice of a convolutional neural network architecture.
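The equivariance principle used above to supervise tracking can be stated very simply: if the input is transformed, the tracked position should transform consistently. The toy loss below checks this for 1-D shifts; the tracker callable and the shift-only transform family are stand-ins assumed for illustration, not the paper's actual formulation.

```python
import numpy as np

def equivariance_loss(track, frame, shift=3):
    """Sketch of an equivariance-based self-supervision signal.

    track: callable mapping a 1-D "frame" to an object position (a stand-in
    for a real video tracker). Shifting the input should shift the tracked
    position by the same amount; the loss penalises any violation, giving a
    training signal without any position labels.
    """
    shifted = np.roll(frame, shift)
    return (track(shifted) - (track(frame) + shift)) ** 2
```

A tracker that genuinely localises the object incurs zero loss under shifts, while a degenerate tracker that ignores its input is penalised, which is why this principle can select for mechanically meaningful representations without auto-encoding.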
Correspondence Networks with Adaptive Neighbourhood Consensus
In this paper, we tackle the task of establishing dense visual
correspondences between images containing objects of the same category. This is
a challenging task due to large intra-class variations and a lack of dense
pixel level annotations. We propose a convolutional neural network
architecture, called adaptive neighbourhood consensus network (ANC-Net), that
can be trained end-to-end with sparse key-point annotations, to handle this
challenge. At the core of ANC-Net is our proposed non-isotropic 4D convolution
kernel, which forms the building block for the adaptive neighbourhood consensus
module for robust matching. We also introduce a simple and efficient
multi-scale self-similarity module in ANC-Net to make the learned feature
robust to intra-class variations. Furthermore, we propose a novel orthogonal
loss that can enforce the one-to-one matching constraint. We thoroughly
evaluate the effectiveness of our method on various benchmarks, where it
substantially outperforms state-of-the-art methods.
Comment: CVPR 2020. Project page: https://ancnet.avlcode.org
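One plausible form of a loss that encourages one-to-one matching is an orthogonality penalty on the soft assignment matrix: if each source point claims a distinct target, the rows of the matrix are (near-)orthogonal. The sketch below implements that idea; the exact formulation of ANC-Net's orthogonal loss may differ.

```python
import numpy as np

def orthogonal_loss(assign):
    """Sketch of a one-to-one matching penalty (assumed form).

    assign: (N, M) soft assignment matrix, rows summing to 1.
    Penalises overlap between rows via the off-diagonal entries of the
    Gram matrix, pushing the assignment toward a permutation-like matrix.
    """
    gram = assign @ assign.T
    off_diag = gram - np.diag(np.diag(gram))
    return off_diag.sum() / assign.shape[0]
```

A hard one-to-one assignment (a permutation matrix) incurs zero penalty, while many-to-one assignments are penalised in proportion to how much their rows overlap.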
Hyperpixel Flow: Semantic Correspondence with Multi-layer Neural Features
Establishing visual correspondences under large intra-class variations requires
analyzing images at different levels, from features linked to semantics and
context to local patterns, while being invariant to instance-specific details.
To tackle these challenges, we represent images by "hyperpixels" that leverage
a small number of relevant features selected among early to late layers of a
convolutional neural network. Taking advantage of the condensed features of
hyperpixels, we develop an effective real-time matching algorithm based on
Hough geometric voting. The proposed method, hyperpixel flow, sets a new state
of the art on three standard benchmarks as well as a new dataset, SPair-71k,
which contains a significantly larger number of image pairs than existing
datasets, with more accurate and richer annotations for in-depth analysis.
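Hough geometric voting of the kind described above can be sketched in a few lines: each candidate match votes for its displacement in a quantised offset space, and matches consistent with the dominant offset are re-weighted upward. The binning scheme and re-scoring rule here are simplifying assumptions, not the paper's exact algorithm.

```python
def hough_offset_vote(matches, scores, bin_size=2):
    """Hough geometric voting sketch over candidate correspondences.

    matches: list of ((x1, y1), (x2, y2)) candidate pairs.
    scores: appearance confidence per pair.
    Each pair votes for its quantised displacement; a pair's final score is
    its appearance score times the total vote mass of its displacement bin,
    so geometrically consistent matches reinforce each other.
    """
    votes = {}
    for ((x1, y1), (x2, y2)), s in zip(matches, scores):
        key = ((x2 - x1) // bin_size, (y2 - y1) // bin_size)
        votes[key] = votes.get(key, 0.0) + s
    best = max(votes, key=votes.get)
    rescored = [s * votes[((x2 - x1) // bin_size, (y2 - y1) // bin_size)]
                for ((x1, y1), (x2, y2)), s in zip(matches, scores)]
    return rescored, best
```

Outlier matches with rare displacements end up in sparsely populated bins and are suppressed relative to the geometrically consistent majority, which is what makes voting robust without any learned geometric model.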
From Coarse to Fine: Robust Hierarchical Localization at Large Scale
Robust and accurate visual localization is a fundamental capability for
numerous applications, such as autonomous driving, mobile robotics, or
augmented reality. It remains, however, a challenging task, particularly for
large-scale environments and in the presence of significant appearance changes.
State-of-the-art methods not only struggle with such scenarios, but are often
too resource intensive for certain real-time applications. In this paper we
propose HF-Net, a hierarchical localization approach based on a monolithic CNN
that simultaneously predicts local features and global descriptors for accurate
6-DoF localization. We exploit the coarse-to-fine localization paradigm: we
first perform a global retrieval to obtain location hypotheses and only later
match local features within those candidate places. This hierarchical approach
incurs significant runtime savings and makes our system suitable for real-time
operation. By leveraging learned descriptors, our method achieves remarkable
localization robustness across large variations of appearance and sets a new
state of the art on two challenging benchmarks for large-scale localization.
Comment: Camera-ready for CVPR 2019
Probabilistic Pixel-Adaptive Refinement Networks
Encoder-decoder networks have found widespread use in various dense
prediction tasks. However, the strong reduction of spatial resolution in the
encoder leads to a loss of location information as well as boundary artifacts.
To address this, image-adaptive post-processing methods have proven beneficial
by leveraging the high-resolution input image(s) as guidance data. We extend
such approaches by considering an important orthogonal source of information:
the network's confidence in its own predictions. We introduce probabilistic
pixel-adaptive convolutions (PPACs), which not only depend on image guidance
data for filtering, but also respect the reliability of per-pixel predictions.
As such, PPACs allow for image-adaptive smoothing and simultaneously
propagating pixels of high confidence into less reliable regions, while
respecting object boundaries. We demonstrate their utility in refinement
networks for optical flow and semantic segmentation, where PPACs lead to a
clear reduction in boundary artifacts. Moreover, our proposed refinement step
is able to substantially improve the accuracy on various widely used
benchmarks.
Comment: To appear at CVPR 2020
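The core idea of confidence-respecting, image-adaptive filtering can be sketched in 1-D: each filter weight combines a guidance-similarity term (which stops smoothing across object boundaries) with the neighbour's confidence (which lets reliable predictions propagate into unreliable regions). The kernel size, Gaussian guidance term, and `sigma_g` below are assumptions of this sketch, not the PPAC layer's exact parameterisation.

```python
import numpy as np

def ppac_smooth(pred, guide, conf, sigma_g=0.1):
    """1-D sketch of probabilistic pixel-adaptive filtering.

    pred: noisy per-pixel predictions (numpy array).
    guide: guidance signal, e.g. image intensity.
    conf: per-pixel confidence in `pred`.
    Each output pixel averages its 3-neighbourhood with weights that drop
    across guidance edges and down-weight low-confidence neighbours.
    """
    n = len(pred)
    out = np.empty(n)
    for i in range(n):
        idx = [j for j in (i - 1, i, i + 1) if 0 <= j < n]
        # guidance similarity (edge-stopping) times neighbour confidence
        w = np.array([np.exp(-(guide[i] - guide[j]) ** 2 / (2 * sigma_g ** 2))
                      * conf[j] for j in idx])
        out[i] = (w * pred[idx]).sum() / w.sum()
    return out
```

With a flat guidance signal, a zero-confidence pixel is simply replaced by its confident neighbours; with a strong guidance edge, smoothing stops at the edge, which matches the boundary-preserving behaviour described above.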