Fractal Structure and Generalization Properties of Stochastic Optimization Algorithms
Understanding generalization in deep learning has been one of the major
challenges in statistical learning theory over the last decade. While recent
work has illustrated that the dataset and the training algorithm must be taken
into account in order to obtain meaningful generalization bounds, it is still
theoretically not clear which properties of the data and the algorithm
determine the generalization performance. In this study, we approach this
problem from a dynamical systems theory perspective and represent stochastic
optimization algorithms as random iterated function systems (IFS). Well studied
in the dynamical systems literature, under mild assumptions, such IFSs can be
shown to be ergodic with an invariant measure that is often supported on sets
with a fractal structure. As our main contribution, we prove that the
generalization error of a stochastic optimization algorithm can be bounded
based on the 'complexity' of the fractal structure that underlies its invariant
measure. Leveraging results from dynamical systems theory, we show that the
generalization error can be explicitly linked to the choice of the algorithm
(e.g., stochastic gradient descent -- SGD), algorithm hyperparameters (e.g.,
step-size, batch-size), and the geometry of the problem (e.g., Hessian of the
loss). We further specialize our results to specific problems (e.g.,
linear/logistic regression, one hidden-layered neural networks) and algorithms
(e.g., SGD and preconditioned variants), and obtain analytical estimates for
our bound. For modern neural networks, we develop an efficient algorithm to
compute the developed bound and support our theory with various experiments on
neural networks.
Comment: 34 pages including Supplement, 4 figures
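As a minimal illustration of the random-IFS viewpoint described in the abstract above (a sketch under our own assumptions, not the paper's code): for the one-dimensional least-squares loss 0.5*(a*w - b)^2, a single-sample SGD step is the affine map w -> (1 - eta*a^2)*w + eta*a*b, so running SGD amounts to applying a randomly chosen affine map at every step. The dataset, step size, and function names below are hypothetical, chosen so that the two maps are w/3 and w/3 + 2/3, the classical IFS whose attractor is the middle-thirds Cantor set.

```python
import random


def make_affine_maps(data, step_size):
    """Return one affine map (slope, offset) per data point (a, b).

    For the loss 0.5 * (a * w - b) ** 2, one SGD step with step size eta is
    w -> w - eta * a * (a * w - b) = (1 - eta * a**2) * w + eta * a * b.
    """
    return [(1.0 - step_size * a * a, step_size * a * b) for a, b in data]


def run_ifs(maps, w0, n_steps, rng):
    """Iterate the random IFS: at each step apply one uniformly chosen map."""
    w = w0
    trajectory = []
    for _ in range(n_steps):
        slope, offset = rng.choice(maps)  # corresponds to sampling one data point
        w = slope * w + offset
        trajectory.append(w)
    return trajectory


# Hypothetical toy problem: two data points and step size 2/3, which gives the
# maps w/3 and w/3 + 2/3.
toy_data = [(1.0, 0.0), (1.0, 1.0)]
toy_maps = make_affine_maps(toy_data, step_size=2.0 / 3.0)
trajectory = run_ifs(toy_maps, w0=0.5, n_steps=1000, rng=random.Random(0))

# Starting from [0, 1], every iterate avoids the open middle third (1/3, 2/3).
in_middle_third = [w for w in trajectory if 1 / 3 + 1e-9 < w < 2 / 3 - 1e-9]
print(len(in_middle_third))
```

In this toy case the iterates never settle on a single minimiser; they keep moving over a fractal set, which is the kind of structure the abstract's bound is stated in terms of.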
A Variable Density Sampling Scheme for Compressive Fourier Transform Interferometry
Fourier Transform Interferometry (FTI) is an appealing Hyperspectral (HS)
imaging modality for many applications demanding high spectral resolution,
e.g., in fluorescence microscopy. However, the effective resolution of FTI is
limited by the durability of biological elements when exposed to illuminating
light. Overexposed elements are subject to photo-bleaching and become unable to
fluoresce. In this context, the acquisition of biological HS volumes based on
sampling the Optical Path Difference (OPD) axis at Nyquist rate leads to
unpleasant trade-offs between spectral resolution, quality of the HS volume,
and light exposure intensity. We propose two variants of the FTI imager, i.e.,
Coded Illumination-FTI (CI-FTI) and Structured Illumination FTI (SI-FTI), based
on the theory of compressive sensing (CS). These schemes efficiently modulate
light exposure temporally (in CI-FTI) or spatiotemporally (in SI-FTI).
Leveraging a variable density sampling strategy recently introduced in CS, we
provide near-optimal illumination strategies, so that the light exposure
imposed on a biological specimen is minimized while the spectral resolution is
preserved. Our analysis focuses on two criteria: (i) a trade-off between
exposure intensity and the quality of the reconstructed HS volume for a given
spectral resolution; (ii) maximizing HS volume quality for a fixed spectral
resolution and constrained exposure budget. Our contributions can be adapted to
an FTI imager without hardware modifications. The reconstruction of HS volumes
from CS-FTI measurements relies on an $\ell_1$-norm minimization problem promoting
a spatiospectral sparsity prior. Numerically, we support the proposed methods
on synthetic data and simulated CS measurements (from actual FTI measurements)
under various scenarios. In particular, the biological HS volumes can be
reconstructed with a three-to-ten-fold reduction in the light exposure.
Comment: 45 pages, 11 figures
Trust-Region Variational Inference with Gaussian Mixture Models
Many methods for machine learning rely on approximate inference from intractable probability distributions. Variational inference approximates such distributions by tractable models that can be subsequently used for approximate inference. Learning sufficiently accurate approximations requires a rich model family and careful exploration of the relevant modes of the target distribution. We propose a method for learning accurate Gaussian mixture model (GMM) approximations of intractable probability distributions based on insights from policy search by using information-geometric trust regions for principled exploration. For efficient improvement of the GMM approximation, we derive a lower bound on the corresponding optimization objective enabling us to update the components independently. Our use of the lower bound ensures convergence to a stationary point of the original objective. The number of components is adapted online by adding new components in promising regions and by deleting components with negligible weight. We demonstrate on several domains that we can learn approximations of complex, multimodal distributions with a quality that is unmet by previous variational inference methods, and that the GMM approximation can be used for drawing samples that are on par with samples created by state-of-the-art MCMC samplers while requiring up to three orders of magnitude less computational resources.
Advances in scalable learning and sampling of unnormalised models
We study probabilistic models that are known incompletely, up to an intractable normalising constant. To reap the full benefit of such models, two
tasks must be solved: learning and sampling. These two tasks have been
subject to decades of research, and yet significant challenges still persist.
Traditional approaches often suffer from poor scalability with respect to
dimensionality and model complexity, generally rendering them inapplicable to models parameterised by deep neural networks. In this thesis, we
contribute a new set of methods for addressing this scalability problem.
We first explore the problem of learning unnormalised models. Our investigation begins with a well-known learning principle, Noise-contrastive
Estimation, whose underlying mechanism is that of density-ratio estimation.
By examining why existing density-ratio estimators scale poorly, we identify a new framework, telescoping density-ratio estimation (TRE), that can
learn ratios between highly dissimilar densities in high-dimensional spaces.
Our experiments demonstrate that TRE not only yields substantial improvements for the learning of deep unnormalised models, but can do the
same for a broader set of tasks including mutual information estimation and
representation learning.
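A compact way to state the telescoping idea named above, as a sketch in our own notation (the symbols p_0, p_m, and the intermediate densities p_1, ..., p_{m-1} are assumptions not defined in the abstract): the ratio between two dissimilar densities is written as a product of ratios between neighbouring, more similar densities, each of which can be estimated separately.

```latex
\frac{p_0(x)}{p_m(x)} = \prod_{k=0}^{m-1} \frac{p_k(x)}{p_{k+1}(x)}
```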
Subsequently, we explore the problem of sampling unnormalised models.
A large literature on Markov chain Monte Carlo (MCMC) can be leveraged here, and in continuous domains, gradient-based samplers such as
the Metropolis-adjusted Langevin algorithm (MALA) and Hamiltonian Monte
Carlo are excellent options. However, there has been substantially less
progress in MCMC for discrete domains. To advance this subfield, we introduce several discrete Metropolis-Hastings samplers that are conceptually
inspired by MALA, and demonstrate their strong empirical performance
across a range of challenging sampling tasks.
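For readers unfamiliar with MALA, the following is a minimal sketch of the standard continuous-domain algorithm that the abstract cites as inspiration. It is not one of the thesis's discrete samplers. The target (a one-dimensional standard normal, known only up to its normalising constant), the step size, and all function names are illustrative assumptions.

```python
import math
import random


def log_p(x):
    """Unnormalised log density of the assumed toy target (standard normal)."""
    return -0.5 * x * x


def grad_log_p(x):
    """Gradient of the unnormalised log density."""
    return -x


def log_q(x_to, x_from, step):
    """Log proposal density (up to a constant) of moving x_from -> x_to."""
    mean = x_from + 0.5 * step ** 2 * grad_log_p(x_from)
    return -((x_to - mean) ** 2) / (2.0 * step ** 2)


def mala(n_samples, step, x0, rng):
    """Metropolis-adjusted Langevin algorithm for a 1D target."""
    x = x0
    samples = []
    for _ in range(n_samples):
        # Langevin proposal: gradient drift plus Gaussian noise.
        proposal = x + 0.5 * step ** 2 * grad_log_p(x) + step * rng.gauss(0.0, 1.0)
        # Metropolis-Hastings correction for the asymmetric proposal.
        log_alpha = (log_p(proposal) + log_q(x, proposal, step)
                     - log_p(x) - log_q(proposal, x, step))
        if rng.random() < math.exp(min(0.0, log_alpha)):
            x = proposal
        samples.append(x)
    return samples


draws = mala(n_samples=5000, step=1.0, x0=0.0, rng=random.Random(0))
print(sum(draws) / len(draws))  # sample mean; should be near 0 for this target
```

The drift term relies on the gradient of the log density, which does not exist in discrete state spaces; this is the gap that MALA-inspired discrete samplers aim to address.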