3,481 research outputs found

    Limitations of the Empirical Fisher Approximation for Natural Gradient Descent

    Natural gradient descent, which preconditions a gradient descent update with the Fisher information matrix of the underlying statistical model, is a way to capture partial second-order information. Several highly visible works have advocated an approximation known as the empirical Fisher, drawing connections between approximate second-order methods and heuristics like Adam. We dispute this argument by showing that the empirical Fisher, unlike the Fisher, does not generally capture second-order information. We further argue that the conditions under which the empirical Fisher approaches the Fisher (and the Hessian) are unlikely to be met in practice, and that, even on simple optimization problems, the pathologies of the empirical Fisher can have undesirable effects.
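    For context, the distinction at issue can be sketched with standard textbook definitions (not the paper's own notation). For a model p_\theta(y \mid x) and data \{(x_n, y_n)\}_{n=1}^N, the Fisher takes an expectation over the model's predictive distribution, whereas the empirical Fisher substitutes the observed labels:

    F(\theta) = \sum_n \mathbb{E}_{y \sim p_\theta(y \mid x_n)}\left[ \nabla_\theta \log p_\theta(y \mid x_n)\, \nabla_\theta \log p_\theta(y \mid x_n)^{\top} \right],
    \qquad
    \widetilde{F}(\theta) = \sum_n \nabla_\theta \log p_\theta(y_n \mid x_n)\, \nabla_\theta \log p_\theta(y_n \mid x_n)^{\top}.

    Natural gradient descent then preconditions the update as \theta_{t+1} = \theta_t - \eta\, F(\theta_t)^{-1} \nabla_\theta \mathcal{L}(\theta_t). Only the expectation under the model's own distribution links F(\theta) to second-order information; the empirical Fisher \widetilde{F}(\theta) coincides with it only under conditions that, as the abstract notes, are unlikely to hold in practice.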

    Towards Deep Learning with Competing Generalisation Objectives

    The unreasonable effectiveness of deep learning continues to deliver unprecedented artificial intelligence capabilities to billions of people. Growing datasets and technological advances keep extending the reach of expressive model architectures trained through efficient optimisation. Deep learning approaches thus continue to provide increasingly proficient subroutines for, among others, computer vision and natural interaction through speech and text. Owing to their scalable learning and inference priors, these approaches often gain higher performance cost-effectively through largely automatic training. As a result, new and improved capabilities empower more people while the costs of access drop. The resulting opportunities and challenges have profoundly influenced research. Quality attributes of scalable software, including reusability, efficiency, robustness and safety, have become central desiderata of deep learning paradigms. Ongoing research into continual, meta- and robust learning aims to maximise such scalability metrics in addition to multiple generalisation criteria, despite possible conflicts. A significant challenge is to satisfy competing criteria automatically and cost-effectively.

    In this thesis, we introduce a unifying perspective on learning with competing generalisation objectives and make three additional contributions. When autonomous learning through multi-criteria optimisation is impractical, it is reasonable to ask whether knowledge of appropriate trade-offs could make it simultaneously effective and efficient. Informed by explicit trade-offs of interest to particular applications, we develop and evaluate bespoke model architecture priors. We introduce a novel architecture for sim-to-real transfer of robotic control policies by learning progressively to generalise anew; competing desiderata of continual learning are balanced through disjoint capacity and hierarchical reuse of previously learnt representations. We then propose a new state-of-the-art meta-learning approach, showing that meta-trained hypernetworks efficiently store and flexibly reuse knowledge for new generalisation criteria through few-shot gradient-based optimisation. Finally, we characterise empirical trade-offs between the many desiderata of adversarial robustness and demonstrate a novel defensive capability of implicit neural networks to hinder many attacks simultaneously.
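    As a purely generic illustration of "few-shot gradient-based optimisation" (a model-agnostic meta-learning style objective, not the thesis's hypernetwork formulation), the meta-parameters \theta are trained so that a few inner gradient steps on each task \tau already generalise:

    \theta^{\ast} = \arg\min_{\theta} \sum_{\tau} \mathcal{L}^{\mathrm{val}}_{\tau}\!\left( \theta - \alpha\, \nabla_\theta \mathcal{L}^{\mathrm{train}}_{\tau}(\theta) \right),

    with inner step size \alpha and the outer minimisation carried out by stochastic gradient descent over tasks. The thesis meta-trains hypernetworks and reuses their stored knowledge through this kind of few-shot adaptation; its exact formulation is not reproduced here.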

    Scalable Stochastic Gradient Riemannian Langevin Dynamics in Non-Diagonal Metrics

    Stochastic-gradient sampling methods are often used to perform Bayesian inference on neural networks. Methods that incorporate notions of differential geometry have been observed to perform better, with the Riemannian metric improving posterior exploration by accounting for local curvature. However, existing methods often resort to simple diagonal metrics to remain computationally efficient, which forfeits some of these gains. We propose two non-diagonal metrics that can be used in stochastic-gradient samplers to improve convergence and exploration while incurring only a minor computational overhead over diagonal metrics. We show that using these metrics can provide improvements for fully connected neural networks (NNs) with sparsity-inducing priors and for convolutional NNs with correlated priors; for other choices, the posterior is simple enough that the simpler diagonal metrics suffice.
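    For reference, a generic form of the stochastic-gradient Riemannian Langevin update with a position-dependent metric G(\theta) (the paper's specific non-diagonal choices of G are not reproduced here) is

    \theta_{t+1} = \theta_t + \frac{\epsilon_t}{2} \left[ G(\theta_t)^{-1}\, \widehat{\nabla}_\theta \log p(\theta_t \mid \mathcal{D}) + \Gamma(\theta_t) \right] + \xi_t, \qquad \xi_t \sim \mathcal{N}\!\left(0,\ \epsilon_t\, G(\theta_t)^{-1}\right),

    where \widehat{\nabla}_\theta denotes a minibatch gradient estimate and \Gamma_i(\theta) = \sum_j \partial \left[ G(\theta)^{-1} \right]_{ij} / \partial \theta_j corrects for the position dependence of the metric. A diagonal G keeps the inverse and the noise covariance cheap but ignores correlations between parameters; non-diagonal metrics aim to close that gap at modest additional cost.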