83 research outputs found

    Fractal Structure and Generalization Properties of Stochastic Optimization Algorithms

    Understanding generalization in deep learning has been one of the major challenges in statistical learning theory over the last decade. While recent work has illustrated that the dataset and the training algorithm must be taken into account in order to obtain meaningful generalization bounds, it is still theoretically not clear which properties of the data and the algorithm determine the generalization performance. In this study, we approach this problem from a dynamical systems theory perspective and represent stochastic optimization algorithms as random iterated function systems (IFS). Well studied in the dynamical systems literature, under mild assumptions, such IFSs can be shown to be ergodic with an invariant measure that is often supported on sets with a fractal structure. As our main contribution, we prove that the generalization error of a stochastic optimization algorithm can be bounded based on the 'complexity' of the fractal structure that underlies its invariant measure. Leveraging results from dynamical systems theory, we show that the generalization error can be explicitly linked to the choice of the algorithm (e.g., stochastic gradient descent -- SGD), algorithm hyperparameters (e.g., step-size, batch-size), and the geometry of the problem (e.g., Hessian of the loss). We further specialize our results to specific problems (e.g., linear/logistic regression, one-hidden-layer neural networks) and algorithms (e.g., SGD and preconditioned variants), and obtain analytical estimates for our bound. For modern neural networks, we develop an efficient algorithm to compute the developed bound and support our theory with various experiments on neural networks. Comment: 34 pages including Supplement, 4 figures
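
As a rough illustration of the random-IFS view described in this abstract, the sketch below writes SGD on a one-dimensional least-squares problem as a random choice among affine maps, one map per data point. The toy data, the step size, and the names `make_maps` and `run_ifs` are assumptions made for this example and are not taken from the paper; the sketch does not compute the paper's generalization bound.

```python
import random

def make_maps(data, step_size):
    """One affine map per data point (x, y) for the loss 0.5 * (x*w - y)**2.

    An SGD step on that point is w -> w - eta * x * (x*w - y),
    i.e. w -> a*w + b with a = 1 - eta*x^2 and b = eta*x*y.
    """
    return [(1.0 - step_size * x * x, step_size * x * y) for x, y in data]

def run_ifs(maps, w0, n_steps, rng):
    """Iterate the random IFS: at each step apply one map chosen uniformly."""
    w = w0
    trajectory = []
    for _ in range(n_steps):
        a, b = rng.choice(maps)
        w = a * w + b
        trajectory.append(w)
    return trajectory

# Toy data and step size (illustrative assumptions).
data = [(1.0, 1.0), (2.0, 1.0), (0.5, -1.0)]
maps = make_maps(data, step_size=0.3)

# Every map is a contraction when |a| < 1 for all data points.
contraction = max(abs(a) for a, _ in maps)

traj = run_ifs(maps, w0=5.0, n_steps=2000, rng=random.Random(0))
```

With these values every slope satisfies |a| < 1, so the iterates are drawn into a bounded interval regardless of the starting point; the long-run distribution of `traj` is the invariant measure that the abstract refers to.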

    A Variable Density Sampling Scheme for Compressive Fourier Transform Interferometry

    Fourier Transform Interferometry (FTI) is an appealing Hyperspectral (HS) imaging modality for many applications demanding high spectral resolution, e.g., in fluorescence microscopy. However, the effective resolution of FTI is limited by the durability of biological elements when exposed to illuminating light. Overexposed elements are subject to photo-bleaching and become unable to fluoresce. In this context, the acquisition of biological HS volumes based on sampling the Optical Path Difference (OPD) axis at Nyquist rate leads to unpleasant trade-offs between spectral resolution, quality of the HS volume, and light exposure intensity. We propose two variants of the FTI imager, i.e., Coded Illumination-FTI (CI-FTI) and Structured Illumination FTI (SI-FTI), based on the theory of compressive sensing (CS). These schemes efficiently modulate light exposure temporally (in CI-FTI) or spatiotemporally (in SI-FTI). Leveraging a variable density sampling strategy recently introduced in CS, we provide near-optimal illumination strategies, so that the light exposure imposed on a biological specimen is minimized while the spectral resolution is preserved. Our analysis focuses on two criteria: (i) a trade-off between exposure intensity and the quality of the reconstructed HS volume for a given spectral resolution; (ii) maximizing HS volume quality for a fixed spectral resolution and constrained exposure budget. Our contributions can be adapted to an FTI imager without hardware modifications. The reconstruction of HS volumes from CS-FTI measurements relies on an $\ell_1$-norm minimization problem promoting a spatiospectral sparsity prior. Numerically, we support the proposed methods on synthetic data and simulated CS measurements (from actual FTI measurements) under various scenarios. In particular, the biological HS volumes can be reconstructed with a three-to-ten-fold reduction in the light exposure. Comment: 45 pages, 11 figures
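
To make the idea of variable density sampling along the OPD axis concrete, here is a minimal sketch that selects a subset of OPD indices with non-uniform probability. The abstract does not state the sampling density, so the weight `1 / (1 + |k - center|)`, the choice `center=0`, the four-fold budget, and the function name `variable_density_mask` are all assumptions made for illustration. The subset is drawn with the Efraimidis-Spirakis method for weighted sampling without replacement, which is a generic technique and not necessarily the one used in the paper.

```python
import random

def variable_density_mask(n, m, rng, center=0):
    """Pick m distinct indices out of range(n), favouring those near `center`.

    The weights are an illustrative assumption, not the paper's density.
    Efraimidis-Spirakis: draw u_k ~ U(0, 1) for each index and keep the
    m indices with the largest keys u_k ** (1 / w_k).
    """
    weights = [1.0 / (1.0 + abs(k - center)) for k in range(n)]
    keys = [(rng.random() ** (1.0 / w), k) for k, w in enumerate(weights)]
    keys.sort(reverse=True)
    return sorted(k for _, k in keys[:m])

n_opd = 512                      # number of Nyquist-rate OPD samples (assumed)
mask = variable_density_mask(n_opd, m=n_opd // 4, rng=random.Random(0))

# Fraction of OPD positions at which the specimen is illuminated.
exposure_fraction = len(mask) / n_opd
```

Here the specimen would be illuminated at a quarter of the OPD positions, which falls inside the three-to-ten-fold reduction range quoted in the abstract; recovering the HS volume from such a subset would still require the $\ell_1$-norm reconstruction step, which this sketch leaves out.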

    Trust-Region Variational Inference with Gaussian Mixture Models

    Many methods for machine learning rely on approximate inference from intractable probability distributions. Variational inference approximates such distributions by tractable models that can be subsequently used for approximate inference. Learning sufficiently accurate approximations requires a rich model family and careful exploration of the relevant modes of the target distribution. We propose a method for learning accurate GMM approximations of intractable probability distributions based on insights from policy search by using information-geometric trust regions for principled exploration. For efficient improvement of the GMM approximation, we derive a lower bound on the corresponding optimization objective enabling us to update the components independently. Our use of the lower bound ensures convergence to a stationary point of the original objective. The number of components is adapted online by adding new components in promising regions and by deleting components with negligible weight. We demonstrate on several domains that we can learn approximations of complex, multimodal distributions with a quality that is unmet by previous variational inference methods, and that the GMM approximation can be used for drawing samples that are on par with samples created by state-of-the-art MCMC samplers while requiring up to three orders of magnitude fewer computational resources.

    Advances in scalable learning and sampling of unnormalised models

    We study probabilistic models that are known incompletely, up to an intractable normalising constant. To reap the full benefit of such models, two tasks must be solved: learning and sampling. These two tasks have been subject to decades of research, and yet significant challenges still persist. Traditional approaches often suffer from poor scalability with respect to dimensionality and model complexity, generally rendering them inapplicable to models parameterised by deep neural networks. In this thesis, we contribute a new set of methods for addressing this scalability problem. We first explore the problem of learning unnormalised models. Our investigation begins with a well-known learning principle, Noise-contrastive Estimation, whose underlying mechanism is that of density-ratio estimation. By examining why existing density-ratio estimators scale poorly, we identify a new framework, telescoping density-ratio estimation (TRE), that can learn ratios between highly dissimilar densities in high-dimensional spaces. Our experiments demonstrate that TRE not only yields substantial improvements for the learning of deep unnormalised models, but can do the same for a broader set of tasks including mutual information estimation and representation learning. Subsequently, we explore the problem of sampling unnormalised models. A large literature on Markov chain Monte Carlo (MCMC) can be leveraged here, and in continuous domains, gradient-based samplers such as the Metropolis-adjusted Langevin algorithm (MALA) and Hamiltonian Monte Carlo are excellent options. However, there has been substantially less progress in MCMC for discrete domains. To advance this subfield, we introduce several discrete Metropolis-Hastings samplers that are conceptually inspired by MALA, and demonstrate their strong empirical performance across a range of challenging sampling tasks.
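
For readers unfamiliar with MALA, the continuous-domain sampler that this abstract cites as the inspiration for its discrete samplers, the following is a minimal sketch of the standard algorithm on a one-dimensional unnormalised target. The target (a standard normal known only up to its constant), the step size, and all function names are assumptions chosen for illustration; the thesis's discrete samplers are not reproduced here.

```python
import math
import random

def log_target(x):
    """Unnormalised log-density; a standard normal is assumed as a toy target."""
    return -0.5 * x * x

def grad_log_target(x):
    return -x

def log_proposal(x_to, x_from, step):
    """Log q(x_to | x_from), up to a constant, for N(x_from + step*grad, 2*step)."""
    mean = x_from + step * grad_log_target(x_from)
    return -((x_to - mean) ** 2) / (4.0 * step)

def mala_step(x, step, rng):
    """One Langevin proposal followed by a Metropolis-Hastings correction."""
    noise = math.sqrt(2.0 * step) * rng.gauss(0.0, 1.0)
    proposal = x + step * grad_log_target(x) + noise
    log_accept = (log_target(proposal) - log_target(x)
                  + log_proposal(x, proposal, step)
                  - log_proposal(proposal, x, step))
    if rng.random() < math.exp(min(0.0, log_accept)):
        return proposal, 1
    return x, 0

def run_chain(n_steps, step, x0, rng):
    x, samples, accepted = x0, [], 0
    for _ in range(n_steps):
        x, ok = mala_step(x, step, rng)
        samples.append(x)
        accepted += ok
    return samples, accepted / n_steps

samples, accept_rate = run_chain(20000, step=0.5, x0=3.0, rng=random.Random(0))
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

Only `log_target` and its gradient are needed, so the normalising constant never appears: it cancels in the acceptance ratio. The gradient is what has no direct counterpart in discrete domains, which is the gap the abstract says the thesis addresses.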