Bayesian Pose Graph Optimization via Bingham Distributions and Tempered Geodesic MCMC
We introduce Tempered Geodesic Markov Chain Monte Carlo (TG-MCMC) algorithm
for initializing pose graph optimization problems, arising in various scenarios
such as SfM (structure from motion) or SLAM (simultaneous localization and
mapping). TG-MCMC is the first of its kind, as it unites asymptotically global
non-convex optimization on the spherical manifold of quaternions with posterior
sampling, in order to provide both reliable initial poses and uncertainty
estimates that are informative about the quality of individual solutions. We
devise rigorous theoretical convergence guarantees for our method and
extensively evaluate it on synthetic and real benchmark datasets. Besides its
elegance in formulation and theory, we show that our method is robust to
missing data and noise, and that the estimated uncertainties capture intuitive
properties of the data.
Comment: Published at NeurIPS 2018, 25 pages with supplement.
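As a rough, hedged illustration of the kind of sampler described above, the sketch below runs a toy tempered geodesic Langevin-style chain on the unit-quaternion sphere: gradients of the log-posterior are projected onto the tangent space, each move follows an exact great-circle geodesic so the iterate stays on the sphere, and an inverse-temperature parameter trades off optimization against posterior sampling. Every function name, the Bingham-like target, and all parameter values are our own illustrative assumptions, not the authors' TG-MCMC implementation.

```python
import numpy as np

def tangent_project(q, v):
    """Project an ambient vector v onto the tangent space of the unit sphere at q."""
    return v - np.dot(q, v) * q

def geodesic_step(q, v, h):
    """Move along the great circle starting at q with tangent velocity v for time h."""
    speed = np.linalg.norm(v)
    if speed < 1e-12:
        return q, v
    d = v / speed
    q_new = np.cos(speed * h) * q + np.sin(speed * h) * d
    v_new = speed * (-np.sin(speed * h) * q + np.cos(speed * h) * d)
    return q_new, v_new

def tempered_geodesic_sampler(log_post_grad, q0, n_iter=1000, h=1e-2, beta=2.0, gamma=1.0, rng=None):
    """Toy tempered geodesic sampler on the unit-quaternion sphere S^3.
    beta is an inverse temperature: large beta pushes towards optimization,
    beta near 1 towards posterior sampling."""
    rng = np.random.default_rng() if rng is None else rng
    q = q0 / np.linalg.norm(q0)
    v = np.zeros_like(q)
    samples = []
    for _ in range(n_iter):
        # Langevin-style velocity update, kept in the tangent space at q.
        force = tangent_project(q, log_post_grad(q))
        noise = rng.standard_normal(q.shape)
        v = tangent_project(q, v + h * force - gamma * h * v
                            + np.sqrt(2.0 * gamma * h / beta) * noise)
        # Exact great-circle move keeps q exactly on the sphere.
        q, v = geodesic_step(q, v, h)
        samples.append(q.copy())
    return np.array(samples)

# Example: sample from a Bingham-like density proportional to exp(q^T A q) on S^3.
A = np.diag([4.0, 1.0, 0.5, 0.0])
samples = tempered_geodesic_sampler(lambda q: 2.0 * A @ q, q0=np.ones(4))
print(samples[-1], np.linalg.norm(samples[-1]))  # the last iterate is still unit-norm
```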
Supervised Symbolic Music Style Translation Using Synthetic Data
Research on style transfer and domain translation has clearly demonstrated the ability of deep learning-based algorithms to manipulate images in terms of artistic style. More recently, several attempts have been made to extend such approaches to music (both symbolic and audio) in order to enable transforming musical style in a similar manner. In this study, we focus on symbolic music with the goal of altering the 'style' of a piece while keeping its original 'content'. As opposed to the current methods, which are inherently restricted to be unsupervised due to the lack of 'aligned' data (i.e. the same musical piece played in multiple styles), we develop the first fully supervised algorithm for this task. At the core of our approach lies a synthetic data generation scheme which allows us to produce virtually unlimited amounts of aligned data, and hence avoid the above issue. In view of this data generation scheme, we propose an encoder-decoder model for translating symbolic music accompaniments between a number of different styles. Our experiments show that our models, although trained entirely on synthetic data, are capable of producing musically meaningful accompaniments even for real (non-synthetic) MIDI recordings.
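As a hedged sketch of the supervised setup this abstract describes, assuming aligned synthetic pairs (the same piece rendered in several styles) are available, the snippet below wires up a small encoder-decoder over symbolic-music tokens, conditions the decoder on a target-style embedding, and trains with ordinary cross-entropy. The StyleTranslator class, the token vocabulary size, and all hyperparameters are placeholders rather than the authors' architecture.

```python
import torch
import torch.nn as nn

class StyleTranslator(nn.Module):
    """Toy encoder-decoder mapping a source-style token sequence to the
    aligned target-style token sequence of the same piece."""

    def __init__(self, vocab_size, n_styles, emb_dim=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.style_embed = nn.Embedding(n_styles, hidden)
        self.encoder = nn.GRU(emb_dim, hidden, batch_first=True)
        self.decoder = nn.GRU(emb_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, src_tokens, tgt_tokens, tgt_style):
        # Encode the source-style accompaniment.
        _, h = self.encoder(self.embed(src_tokens))
        # Condition the decoder on the target style via its initial hidden state.
        h = h + self.style_embed(tgt_style).unsqueeze(0)
        # Teacher forcing: feed the target tokens as decoder inputs
        # (start-token shifting omitted in this sketch).
        dec_out, _ = self.decoder(self.embed(tgt_tokens), h)
        return self.out(dec_out)

# One synthetic aligned batch (random placeholder tokens).
B, T, V, S = 8, 32, 130, 4
model = StyleTranslator(V, S)
src = torch.randint(0, V, (B, T))
tgt = torch.randint(0, V, (B, T))
style = torch.randint(0, S, (B,))
logits = model(src, tgt, style)
loss = nn.functional.cross_entropy(logits.reshape(-1, V), tgt.reshape(-1))
loss.backward()
print(float(loss))
```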
Generalization Bounds with Data-dependent Fractal Dimensions
Providing generalization guarantees for modern neural networks has been a
crucial task in statistical learning. Recently, several studies have attempted
to analyze the generalization error in such settings by using tools from
fractal geometry. While these works have successfully introduced new
mathematical tools to apprehend generalization, they heavily rely on a
Lipschitz continuity assumption, which in general does not hold for neural
networks and might make the bounds vacuous. In this work, we address this issue
and prove fractal geometry-based generalization bounds without requiring any
Lipschitz assumption. To achieve this goal, we build upon a classical covering
argument in learning theory and introduce a data-dependent fractal dimension.
Despite introducing significant technical complications, this new
notion lets us control the generalization error (over either fixed or random
hypothesis spaces) along with certain mutual information (MI) terms. To provide
a clearer interpretation of the newly introduced MI terms, as a next step, we
introduce a notion of "geometric stability" and link our bounds to the prior
art. Finally, we make a rigorous connection between the proposed data-dependent
dimension and topological data analysis tools, which then enables us to compute
the dimension in a numerically efficient way. We support our theory with
experiments conducted in various settings.
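The topological-data-analysis connection mentioned above suggests estimating a fractal dimension of a finite point cloud (for instance, a set of optimization iterates) from degree-0 persistent homology, which for Euclidean point clouds reduces to minimum-spanning-tree edge lengths. The sketch below implements that standard PH_0/MST estimator as a stand-in; it is not the paper's data-dependent dimension, and the function names, subsampling grid, and scaling-law fit are our own assumptions.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

def ph0_weighted_length(points, alpha=1.0):
    """Total alpha-weighted edge length of the Euclidean minimum spanning tree,
    i.e. the degree-0 persistent-homology lifetime sum of the point cloud."""
    mst = minimum_spanning_tree(squareform(pdist(points)))
    return float(np.power(mst.data, alpha).sum())

def ph_dimension_estimate(points, alpha=1.0, n_grid=8, rng=None):
    """Estimate the PH_0 fractal dimension from how the MST length grows with the
    sample size: E_alpha(m) ~ m^((d - alpha)/d), hence d ~ alpha / (1 - slope)."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(points)
    sizes = np.unique(np.linspace(n // 4, n, n_grid, dtype=int))
    lengths = [ph0_weighted_length(points[rng.choice(n, size=m, replace=False)], alpha)
               for m in sizes]
    slope, _ = np.polyfit(np.log(sizes), np.log(lengths), 1)
    return alpha / (1.0 - slope)

# Sanity check: points filling a 2-D square should give an estimate close to 2.
pts = np.random.default_rng(0).random((1500, 2))
print(ph_dimension_estimate(pts))
```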
The Heavy-Tail Phenomenon in SGD
In recent years, various notions of capacity and complexity have been
proposed for characterizing the generalization properties of stochastic
gradient descent (SGD) in deep learning. Some of the popular notions that
correlate well with the performance on unseen data are (i) the `flatness' of
the local minimum found by SGD, which is related to the eigenvalues of the
Hessian, (ii) the ratio of the stepsize $\eta$ to the batch-size $b$, which
essentially controls the magnitude of the stochastic gradient noise, and (iii)
the `tail-index', which measures the heaviness of the tails of the network
weights at convergence. In this paper, we argue that these three seemingly
unrelated perspectives for generalization are deeply linked to each other. We
claim that depending on the structure of the Hessian of the loss at the
minimum, and the choices of the algorithm parameters $\eta$ and $b$, the SGD
iterates will converge to a \emph{heavy-tailed} stationary distribution. We
rigorously prove this claim in the setting of quadratic optimization: we show
that even in a simple linear regression problem with independent and
identically distributed data whose distribution has finite moments of all
orders, the iterates can be heavy-tailed with infinite variance. We further
characterize the behavior of the tails with respect to algorithm parameters,
the dimension, and the curvature. We then translate our results into insights
about the behavior of SGD in deep learning. We support our theory with
experiments conducted on synthetic data, fully connected, and convolutional
neural networks.
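A quick numerical illustration of the claim in the paper's linear-regression setting: run SGD with i.i.d. Gaussian data for a small and a large stepsize-to-batch-size ratio and compare Hill estimates of the tail index of the iterates' deviation from the minimizer. The specific constants, the number of iterations, the burn-in, and the Hill-estimator choice of k are illustrative assumptions; heavier tails (a smaller estimated index) are expected for the larger ratio.

```python
import numpy as np

def sgd_linear_regression(n_iters, eta, batch, d=10, rng=None):
    """SGD on linear regression with fresh i.i.d. Gaussian data at every step;
    returns the distance of each iterate from the true weight vector."""
    rng = np.random.default_rng() if rng is None else rng
    w_star = np.ones(d)
    w = np.zeros(d)
    dev = np.empty(n_iters)
    for t in range(n_iters):
        X = rng.standard_normal((batch, d))
        y = X @ w_star + rng.standard_normal(batch)
        grad = X.T @ (X @ w - y) / batch
        w = w - eta * grad
        dev[t] = np.linalg.norm(w - w_star)
    return dev

def hill_tail_index(samples, k=200):
    """Hill estimator of the tail index based on the k largest observations."""
    x = np.sort(samples)[-k:]
    return 1.0 / np.mean(np.log(x[1:] / x[0]))

rng = np.random.default_rng(1)
for eta, batch in [(0.02, 20), (0.15, 1)]:   # small vs. large stepsize/batch-size ratio
    dev = sgd_linear_regression(20_000, eta, batch, rng=rng)
    tail = hill_tail_index(dev[5_000:])      # discard a burn-in period
    print(f"eta/b = {eta / batch:.3f}  ->  Hill tail-index estimate {tail:.2f}")
```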
Generalization Bounds for Stochastic Gradient Descent via Localized $\varepsilon$-Covers
In this paper, we propose a new covering technique localized for the
trajectories of SGD. This localization provides an algorithm-specific
complexity measured by the covering number, which can have
dimension-independent cardinality in contrast to standard uniform covering
arguments that result in exponential dimension dependency. Based on this
localized construction, we show that if the objective function is a finite
perturbation of a piecewise strongly convex and smooth function with $P$
pieces, i.e. non-convex and non-smooth in general, the generalization error can
be upper bounded by $O(\sqrt{\log(nP)/n})$, where $n$ is the number of
data samples. In particular, this rate is independent of dimension and does not
require early stopping and decaying step size. Finally, we employ these results
in various contexts and derive generalization bounds for multi-index linear
models, multi-class support vector machines, and $k$-means clustering for both
hard and soft label setups, improving the known state-of-the-art rates.
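As a toy numerical illustration of the quantity such bounds control, the sketch below measures the gap between the training and held-out $k$-means risk as the sample size grows. The Gaussian-mixture data, the plain Lloyd iteration, and the chosen sample sizes are our own assumptions and are unrelated to the paper's proof technique.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=50, rng=None):
    """Plain Lloyd iterations; returns the cluster centers."""
    rng = np.random.default_rng() if rng is None else rng
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        labels = np.argmin(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = X[labels == j].mean(axis=0)
    return centers

def kmeans_risk(X, centers):
    """Average squared distance to the nearest center (the k-means objective)."""
    return float(np.min(((X[:, None, :] - centers[None]) ** 2).sum(-1), axis=1).mean())

rng = np.random.default_rng(0)
means = rng.normal(scale=4.0, size=(5, 20))          # 5 components in 20 dimensions
sample = lambda n: means[rng.integers(0, 5, n)] + rng.standard_normal((n, 20))
X_test = sample(50_000)                              # large held-out set as a proxy for the population risk
for n in [100, 1_000, 10_000]:
    X_train = sample(n)
    centers = lloyd_kmeans(X_train, k=5, rng=rng)
    gap = kmeans_risk(X_test, centers) - kmeans_risk(X_train, centers)
    print(f"n = {n:6d}   generalization gap = {gap:.4f}")
```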
Learning via Wasserstein-Based High Probability Generalisation Bounds
Minimising upper bounds on the population risk or the generalisation gap has
been widely used in structural risk minimisation (SRM) -- this is in particular
at the core of PAC-Bayesian learning. Despite its successes and unfailing surge
of interest in recent years, a limitation of the PAC-Bayesian framework is that
most bounds involve a Kullback-Leibler (KL) divergence term (or its
variations), which might exhibit erratic behavior and fail to capture the
underlying geometric structure of the learning problem -- hence restricting its
use in practical applications. As a remedy, recent studies have attempted to
replace the KL divergence in the PAC-Bayesian bounds with the Wasserstein
distance. Even though these bounds alleviated the aforementioned issues to a
certain extent, they either hold in expectation, are for bounded losses, or are
nontrivial to minimize in an SRM framework. In this work, we contribute to this
line of research and prove novel Wasserstein distance-based PAC-Bayesian
generalisation bounds for both batch learning with independent and identically
distributed (i.i.d.) data, and online learning with potentially non-i.i.d.
data. Contrary to previous art, our bounds are stronger in the sense that (i)
they hold with high probability, (ii) they apply to unbounded (potentially
heavy-tailed) losses, and (iii) they lead to optimizable training objectives
that can be used in SRM. As a result we derive novel Wasserstein-based
PAC-Bayesian learning algorithms and we illustrate their empirical advantage on
a variety of experiments.
Comment: Accepted to NeurIPS 2023.
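A minimal sketch of the kind of SRM training objective such bounds make optimizable: an empirical risk under a mean-field Gaussian posterior plus a Wasserstein term to a fixed Gaussian prior, using the closed-form 2-Wasserstein distance between diagonal Gaussians. The linear model, the trade-off weight lam, and all hyperparameters are illustrative assumptions; the paper's actual algorithms follow its specific bounds.

```python
import torch
import torch.nn as nn

def w2_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form 2-Wasserstein distance between diagonal Gaussians:
    W2^2 = ||mu_q - mu_p||^2 + ||sigma_q - sigma_p||^2."""
    return torch.sqrt(((mu_q - mu_p) ** 2).sum() + ((sigma_q - sigma_p) ** 2).sum())

class GaussianLinearPosterior(nn.Module):
    """Mean-field Gaussian posterior over the weights of a linear predictor."""
    def __init__(self, d):
        super().__init__()
        self.mu = nn.Parameter(torch.zeros(d))
        self.log_sigma = nn.Parameter(torch.full((d,), -2.0))

    def sampled_risk(self, X, y, n_samples=8):
        sigma = self.log_sigma.exp()
        risks = []
        for _ in range(n_samples):
            w = self.mu + sigma * torch.randn_like(sigma)  # reparameterization trick
            risks.append(((X @ w - y) ** 2).mean())
        return torch.stack(risks).mean(), sigma

# Toy regression data and a data-free Gaussian prior.
torch.manual_seed(0)
d, n = 20, 500
X = torch.randn(n, d)
y = X @ torch.ones(d) + 0.1 * torch.randn(n)
mu_prior, sigma_prior = torch.zeros(d), torch.ones(d)

post = GaussianLinearPosterior(d)
opt = torch.optim.Adam(post.parameters(), lr=1e-2)
lam = 0.1                                              # risk/Wasserstein trade-off (illustrative)
for step in range(500):
    opt.zero_grad()
    emp_risk, sigma = post.sampled_risk(X, y)
    objective = emp_risk + lam * w2_diag_gaussians(post.mu, sigma, mu_prior, sigma_prior)
    objective.backward()
    opt.step()
print(float(emp_risk), float(w2_diag_gaussians(post.mu, post.log_sigma.exp(), mu_prior, sigma_prior)))
```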