Multi-Task Learning as Multi-Objective Optimization
In multi-task learning, multiple tasks are solved jointly, sharing inductive
bias between them. Multi-task learning is inherently a multi-objective problem
because different tasks may conflict, necessitating a trade-off. A common
compromise is to optimize a proxy objective that minimizes a weighted linear
combination of per-task losses. However, this workaround is only valid when the
tasks do not compete, which is rarely the case. In this paper, we explicitly
cast multi-task learning as multi-objective optimization, with the overall
objective of finding a Pareto optimal solution. To this end, we use algorithms
developed in the gradient-based multi-objective optimization literature. These
algorithms are not directly applicable to large-scale learning problems since
they scale poorly with the dimensionality of the gradients and the number of
tasks. We therefore propose an upper bound for the multi-objective loss and
show that it can be optimized efficiently. We further prove that optimizing
this upper bound yields a Pareto optimal solution under realistic assumptions.
We apply our method to a variety of multi-task deep learning problems including
digit classification, scene understanding (joint semantic segmentation,
instance segmentation, and depth estimation), and multi-label classification.
Our method produces higher-performing models than recent multi-task learning
formulations or per-task training.

Comment: In Neural Information Processing Systems (NeurIPS) 2018.
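For the two-task case, the min-norm convex combination of the task gradients has a closed form, which makes the idea easy to illustrate. A minimal PyTorch sketch, assuming `shared` is a list of shared parameters and `opt` an optimizer; names are illustrative, and this is not the paper's full solver:

    import torch

    def min_norm_coeff(g1, g2):
        # Minimize ||a*g1 + (1-a)*g2||^2 over a in [0, 1] (closed form).
        diff = g1 - g2
        denom = diff.dot(diff)
        if denom.item() == 0.0:          # identical gradients: any weight works
            return 0.5
        a = (g2 - g1).dot(g2) / denom    # unconstrained minimizer
        return float(a.clamp(0.0, 1.0))  # project onto [0, 1]

    def mgda_step(l1, l2, shared, opt):
        g1 = torch.cat([g.reshape(-1) for g in
                        torch.autograd.grad(l1, shared, retain_graph=True)])
        g2 = torch.cat([g.reshape(-1) for g in
                        torch.autograd.grad(l2, shared, retain_graph=True)])
        a = min_norm_coeff(g1, g2)
        opt.zero_grad()
        (a * l1 + (1 - a) * l2).backward()  # descend along the min-norm direction
        opt.step()

The weighting adapts per step: when the task gradients conflict, the combined direction is a descent direction for both losses.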
Multi-Scale Context Aggregation by Dilated Convolutions
State-of-the-art models for semantic segmentation are based on adaptations of
convolutional networks that had originally been designed for image
classification. However, dense prediction and image classification are
structurally different. In this work, we develop a new convolutional network
module that is specifically designed for dense prediction. The presented module
uses dilated convolutions to systematically aggregate multi-scale contextual
information without losing resolution. The architecture is based on the fact
that dilated convolutions support exponential expansion of the receptive field
without loss of resolution or coverage. We show that the presented context
module increases the accuracy of state-of-the-art semantic segmentation
systems. In addition, we examine the adaptation of image classification
networks to dense prediction and show that simplifying the adapted network can
increase accuracy.

Comment: Published as a conference paper at ICLR 2016.
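The exponential receptive-field growth is easy to see in code. A sketch of such a context module in PyTorch; depth and channel width are illustrative:

    import torch
    import torch.nn as nn

    # 3x3 convolutions with dilation doubling per layer. Padding equal to the
    # dilation keeps the resolution fixed while the receptive field grows
    # exponentially: 3, 5, 9, 17, 33, 65 pixels across the six layers below.
    def context_module(C=64, dilations=(1, 1, 2, 4, 8, 16)):
        layers = []
        for d in dilations:
            layers += [nn.Conv2d(C, C, 3, padding=d, dilation=d),
                       nn.ReLU(inplace=True)]
        layers.append(nn.Conv2d(C, C, 1))  # 1x1 fusion layer
        return nn.Sequential(*layers)

    x = torch.randn(1, 64, 128, 128)  # dense feature map from a segmentation net
    y = context_module()(x)
    assert y.shape == x.shape         # same resolution, far larger receptive field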
Direct Sparse Odometry
We propose a novel direct sparse visual odometry formulation. It combines a
fully direct probabilistic model (minimizing a photometric error) with
consistent, joint optimization of all model parameters, including geometry --
represented as inverse depth in a reference frame -- and camera motion. This is
achieved in real time by omitting the smoothness prior used in other direct
methods and instead sampling pixels evenly throughout the images. Since our
method does not depend on keypoint detectors or descriptors, it can naturally
sample pixels from across all image regions that have intensity gradient,
including edges or smooth intensity variations on mostly white walls. The
proposed model integrates a full photometric calibration, accounting for
exposure time, lens vignetting, and non-linear response functions. We
thoroughly evaluate our method on three different datasets comprising several
hours of video. The experiments show that the presented approach significantly
outperforms state-of-the-art direct and indirect methods in a variety of
real-world settings, both in terms of tracking accuracy and robustness.

Comment: ** Corrected a bug which caused the real-time results for ORB-SLAM (dashed lines in Fig. 10 and 12) to be much worse than they should be. ** Added references [12], [13], [19], and Fig. 11. ** Partly re-formulated and extended [5. Conclusion]. ** Fixed typos and minor re-formulations.
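The photometric error such direct methods minimize can be sketched for a single pixel. A toy NumPy version, assuming a pinhole camera with intrinsics K and relative pose (R, t); DSO's exposure, vignetting, and response-function terms are omitted:

    import numpy as np

    def photometric_residual(I_ref, I_tgt, u, v, inv_depth, K, R, t):
        # Back-project reference pixel (u, v) using its inverse depth,
        # transform it into the target frame, project, and compare intensities.
        x = np.linalg.inv(K) @ np.array([u, v, 1.0]) / inv_depth  # 3D point
        x2 = K @ (R @ x + t)                                      # into target view
        u2, v2 = x2[0] / x2[2], x2[1] / x2[2]
        return I_tgt[int(round(v2)), int(round(u2))] - I_ref[v, u]

In the full method, residuals of this form are summed over the sampled pixels and minimized jointly over all inverse depths and camera poses.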
Semi-parametric Topological Memory for Navigation
We introduce a new memory architecture for navigation in previously unseen
environments, inspired by landmark-based navigation in animals. The proposed
semi-parametric topological memory (SPTM) consists of a (non-parametric) graph
with nodes corresponding to locations in the environment and a (parametric)
deep network capable of retrieving nodes from the graph based on observations.
The graph stores no metric information, only connectivity of locations
corresponding to the nodes. We use SPTM as a planning module in a navigation
system. Given only 5 minutes of footage of a previously unseen maze, an
SPTM-based navigation agent can build a topological map of the environment and
use it to confidently navigate towards goals. The average success rate of the
SPTM agent in goal-directed navigation across test environments is higher than
the best-performing baseline by a factor of three. A video of the agent is
available at https://youtu.be/vRF7f4lhswo

Comment: Published at International Conference on Learning Representations (ICLR) 2018. Project website at https://sites.google.com/view/SPTM
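The planning logic reduces to localization followed by graph search. A minimal sketch, where `retrieval_net` is a hypothetical stand-in for the learned retrieval network and returns the graph node whose stored observation best matches the input:

    import networkx as nx

    def plan_waypoint(graph, memory, obs, goal_obs, retrieval_net, lookahead=5):
        node = retrieval_net(obs, memory)           # localize the agent
        goal = retrieval_net(goal_obs, memory)      # localize the goal image
        path = nx.shortest_path(graph, node, goal)  # connectivity only, no metric info
        return path[min(len(path) - 1, lookahead)]  # waypoint a few edges ahead

A separate locomotion policy then drives the agent toward the returned waypoint, and the loop repeats until the goal is reached.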
Does computer vision matter for action?
Computer vision produces representations of scene content. Much computer
vision research is predicated on the assumption that these intermediate
representations are useful for action. Recent work at the intersection of
machine learning and robotics calls this assumption into question by training
sensorimotor systems directly for the task at hand, from pixels to actions,
with no explicit intermediate representations. Thus the central question of our
work: Does computer vision matter for action? We probe this question and its
offshoots via immersive simulation, which allows us to conduct controlled
reproducible experiments at scale. We instrument immersive three-dimensional
environments to simulate challenges such as urban driving, off-road trail
traversal, and battle. Our main finding is that computer vision does matter.
Models equipped with intermediate representations train faster, achieve higher
task performance, and generalize better to previously unseen environments. A
video that summarizes the work and illustrates the results can be found at
https://youtu.be/4MfWa2yZ0Jc

Comment: Published in Science Robotics, 4(30), May 2019.
Learning to Inpaint for Image Compression
We study the design of deep architectures for lossy image compression. We
present two architectural recipes in the context of multi-stage progressive
encoders and empirically demonstrate their importance on compression
performance. Specifically, we show that: (a) predicting the original image data
from residuals in a multi-stage progressive architecture facilitates learning
and leads to improved performance at approximating the original content and (b)
learning to inpaint (from neighboring image pixels) before performing
compression reduces the amount of information that must be stored to achieve a
high-quality approximation. Incorporating these design choices in a baseline
progressive encoder yields an average reduction of over 60% in file size with similar quality compared to the original residual encoder.

Comment: Published in Advances in Neural Information Processing Systems (NIPS 2017).
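Recipe (a) can be sketched compactly: every stage consumes the residual left by the previous stages but is supervised to reconstruct the original image. The PyTorch model below is deliberately minimal; the actual architecture also propagates features between stages and includes the inpainting component:

    import torch
    import torch.nn as nn

    class ProgressiveCoder(nn.Module):
        def __init__(self, stages=3):
            super().__init__()
            self.stages = nn.ModuleList(
                nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                              nn.Conv2d(32, 3, 3, padding=1))
                for _ in range(stages))

        def forward(self, x):
            recon = torch.zeros_like(x)
            losses = []
            for stage in self.stages:
                residual = x - recon                      # still-unexplained content
                recon = recon + stage(residual)           # refine toward the image
                losses.append((recon - x).pow(2).mean())  # every stage targets x
            return recon, sum(losses)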
OpenBot: Turning Smartphones into Robots
Current robots are either expensive or make significant compromises on
sensory richness, computational power, and communication capabilities. We
propose to leverage smartphones to equip robots with extensive sensor suites,
powerful computational abilities, state-of-the-art communication channels, and
access to a thriving software ecosystem. We design a small electric vehicle
that costs $50 and serves as a robot body for standard Android smartphones. We
develop a software stack that allows smartphones to use this body for mobile
operation and demonstrate that the system is sufficiently powerful to support
advanced robotics workloads such as person following and real-time autonomous
navigation in unstructured environments. Controlled experiments demonstrate
that the presented approach is robust across different smartphones and robot
bodies. A video of our work is available at
https://www.youtube.com/watch?v=qc8hFLyWDO
Free View Synthesis
We present a method for novel view synthesis from input images that are
freely distributed around a scene. Our method does not rely on a regular
arrangement of input views, can synthesize images for free camera movement
through the scene, and works for general scenes with unconstrained geometric
layouts. We calibrate the input images via SfM and erect a coarse geometric
scaffold via MVS. This scaffold is used to create a proxy depth map for a novel
view of the scene. Based on this depth map, a recurrent encoder-decoder network
processes reprojected features from nearby views and synthesizes the new view.
Our network does not need to be optimized for a given scene. After training on
a dataset, it works in previously unseen environments with no fine-tuning or
per-scene optimization. We evaluate the presented approach on challenging
real-world datasets, including Tanks and Temples, where we demonstrate
successful view synthesis for the first time and substantially outperform prior
and concurrent work.

Comment: Published at ECCV 2020, https://youtu.be/JDJPn3ZtfZ
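The central warping step, pulling features from a nearby view into the novel view through the proxy depth map, can be sketched in PyTorch. Pinhole cameras without distortion are assumed and shapes are illustrative:

    import torch
    import torch.nn.functional as F

    def reproject_features(src_feat, depth_tgt, K, T_tgt_to_src):
        # src_feat: (B, C, H, W) features of a source view
        # depth_tgt: (H, W) proxy depth for the target view
        B, C, H, W = src_feat.shape
        v, u = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                              torch.arange(W, dtype=torch.float32), indexing="ij")
        pix = torch.stack([u, v, torch.ones_like(u)], -1).reshape(-1, 3)  # (HW, 3)
        pts = (torch.linalg.inv(K) @ pix.T) * depth_tgt.reshape(1, -1)    # target frame
        pts = T_tgt_to_src[:3, :3] @ pts + T_tgt_to_src[:3, 3:4]          # source frame
        proj = K @ pts
        uv = proj[:2] / proj[2:].clamp(min=1e-6)
        grid = torch.stack([uv[0] / (W - 1) * 2 - 1,      # normalize to [-1, 1]
                            uv[1] / (H - 1) * 2 - 1], -1).reshape(1, H, W, 2)
        return F.grid_sample(src_feat, grid.expand(B, -1, -1, -1),
                             align_corners=True)

Features warped this way from several nearby views are what the recurrent encoder-decoder then fuses into the new image.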
Stable View Synthesis
We present Stable View Synthesis (SVS). Given a set of source images
depicting a scene from freely distributed viewpoints, SVS synthesizes new views
of the scene. The method operates on a geometric scaffold computed via
structure-from-motion and multi-view stereo. Each point on this 3D scaffold is
associated with view rays and corresponding feature vectors that encode the
appearance of this point in the input images. The core of SVS is view-dependent
on-surface feature aggregation, in which directional feature vectors at each 3D
point are processed to produce a new feature vector for a ray that maps this
point into the new target view. The target view is then rendered by a
convolutional network from a tensor of features synthesized in this way for all
pixels. The method is composed of differentiable modules and is trained
end-to-end. It supports spatially-varying view-dependent importance weighting
and feature transformation of source images at each point; spatial and temporal
stability due to the smooth dependence of on-surface feature aggregation on the
target view; and synthesis of view-dependent effects such as specular
reflection. Experimental results demonstrate that SVS outperforms
state-of-the-art view synthesis methods both quantitatively and qualitatively
on three diverse real-world datasets, achieving unprecedented levels of realism
in free-viewpoint video of challenging large-scale scenes.

Comment: https://youtu.be/gqgXIY09ht
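A minimal sketch of view-dependent on-surface aggregation, with an MLP-plus-mean operator standing in for the actual SVS aggregation network:

    import torch
    import torch.nn as nn

    class OnSurfaceAggregation(nn.Module):
        def __init__(self, feat_dim=64):
            super().__init__()
            self.mlp = nn.Sequential(nn.Linear(feat_dim + 6, 128), nn.ReLU(),
                                     nn.Linear(128, feat_dim))

        def forward(self, feats, src_dirs, tgt_dir):
            # feats:    (P, V, F) features of P surface points in V source views
            # src_dirs: (P, V, 3) unit rays from each point to the source cameras
            # tgt_dir:  (P, 3)    unit ray from each point to the target camera
            tgt = tgt_dir.unsqueeze(1).expand_as(src_dirs)
            x = torch.cat([feats, src_dirs, tgt], dim=-1)  # (P, V, F + 6)
            return self.mlp(x).mean(dim=1)                 # (P, F) per-point feature

Because the output varies smoothly with the target ray direction, renderings stay spatially and temporally stable as the camera moves, which is also the source of the view-dependent effects.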
Learning to Guide Random Search
We are interested in derivative-free optimization of high-dimensional
functions. The sample complexity of existing methods is high and depends on
problem dimensionality, unlike the dimensionality-independent rates of
first-order methods. The recent success of deep learning suggests that many
datasets lie on low-dimensional manifolds that can be represented by deep
nonlinear models. We therefore consider derivative-free optimization of a
high-dimensional function that lies on a latent low-dimensional manifold. We
develop an online learning approach that learns this manifold while performing
the optimization. In other words, we jointly learn the manifold and optimize
the function. Our analysis suggests that the presented method significantly
reduces sample complexity. We empirically evaluate the method on continuous
optimization benchmarks and high-dimensional continuous control problems. Our
method achieves significantly lower sample complexity than Augmented Random
Search, Bayesian optimization, covariance matrix adaptation (CMA-ES), and other
derivative-free optimization algorithms.

Comment: Published at ICLR 2020. Code is available at: https://github.com/intel-isl/LMRS
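A toy version of the idea, random search whose perturbations live on a learned low-dimensional manifold, with a linear decoder refit online by PCA standing in for the paper's jointly learned nonlinear model; all names and hyperparameters are illustrative:

    import numpy as np

    def latent_random_search(f, dim, latent=8, iters=200, step=0.05, sigma=0.1):
        theta = np.zeros(dim)
        M = np.random.randn(dim, latent) / np.sqrt(dim)  # initial decoder
        history = []
        for _ in range(iters):
            z = np.random.randn(latent)
            d = M @ z                                    # perturb on the manifold
            fp, fm = f(theta + sigma * d), f(theta - sigma * d)
            theta = theta - step * (fp - fm) * d         # antithetic estimate
            history.append(theta.copy())
            if len(history) >= latent:                   # refit decoder online
                H = np.asarray(history[-50:])
                H = H - H.mean(axis=0)
                _, _, Vt = np.linalg.svd(H, full_matrices=False)
                if Vt.shape[0] >= latent:
                    M = Vt[:latent].T
        return theta

    # Usage: minimize a 1000-dimensional quadratic.
    theta = latent_random_search(lambda t: float(np.sum(t ** 2)), dim=1000)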