468 research outputs found
Backpropagation Beyond the Gradient
Automatic differentiation is a key enabler of deep learning: previously, practitioners were limited to models
for which they could manually compute derivatives. Now, they can create sophisticated models with almost
no restrictions and train them using first-order, i.e. gradient, information. Popular libraries like PyTorch
and TensorFlow compute this gradient efficiently, automatically, and conveniently with a single line of
code. Under the hood, reverse-mode automatic differentiation, or gradient backpropagation, powers the
gradient computation in these libraries. Their entire design centers around gradient backpropagation.
These frameworks are specialized around one specific task—computing the average gradient in a mini-batch.
This specialization often complicates the extraction of other information like higher-order statistical moments
of the gradient, or higher-order derivatives like the Hessian. It limits practitioners and researchers to methods
that rely on the gradient. Arguably, this hampers the field from exploring the potential of higher-order
information, and there is evidence that focusing solely on the gradient has not led to significant recent
advances in deep learning optimization.
To advance algorithmic research and inspire novel ideas, information beyond the batch-averaged gradient
must be made available at the same level of computational efficiency, automation, and convenience.
This thesis presents approaches to simplify experimentation with rich information beyond the gradient
by making it more readily accessible. We present an implementation of these ideas as an extension to the
backpropagation procedure in PyTorch. Using this newly accessible information, we demonstrate possible use
cases by (i) showing how it can inform our understanding of neural network training by building a diagnostic
tool, and (ii) enabling novel methods to efficiently compute and approximate curvature information.
First, we extend gradient backpropagation for sequential feedforward models to Hessian backpropagation,
which enables computing approximate per-layer curvature. This perspective unifies recently proposed block-
diagonal curvature approximations. Like gradient backpropagation, the computation of these second-order
derivatives is modular, and therefore simple to automate and extend to new operations.
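
To make the modular recursion concrete: for a module that maps an input x to an output z inside a network with loss ℓ, the standard chain rule for second derivatives propagates the output Hessian back to an input Hessian (generic notation, not the thesis's own):

\[
\nabla_x^2 \ell \;=\; \big(J_x z\big)^\top \, \nabla_z^2 \ell \, \big(J_x z\big) \;+\; \sum_k \big(\nabla_z \ell\big)_k \, \nabla_x^2 z_k ,
\]

where J_x z is the module's Jacobian. Each module only needs its own Jacobian and local curvature, which is what makes the scheme modular and easy to extend to new operations.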
Based on the insight that rich information beyond the gradient can be computed efficiently and at the
same time, we extend the backpropagation in PyTorch with the BackPACK library. It provides efficient and
convenient access to statistical moments of the gradient and approximate curvature information, often at a
small overhead compared to computing just the gradient.
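
As a minimal sketch of the kind of access BackPACK provides (following the usage pattern from its documentation; the particular extensions shown are illustrative):

```python
import torch
from torch import nn
from backpack import backpack, extend
from backpack.extensions import BatchGrad, Variance

# Wrap model and loss function so BackPACK can hook into the backward pass.
model = extend(nn.Sequential(nn.Linear(10, 4), nn.ReLU(), nn.Linear(4, 1)))
lossfunc = extend(nn.MSELoss())

X, y = torch.randn(8, 10), torch.randn(8, 1)
loss = lossfunc(model(X), y)

# One backward pass yields the gradient plus the requested extra quantities.
with backpack(BatchGrad(), Variance()):
    loss.backward()

for p in model.parameters():
    print(p.grad.shape)        # usual mini-batch gradient
    print(p.grad_batch.shape)  # individual per-sample gradients
    print(p.variance.shape)    # elementwise gradient variance
```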
Next, we showcase the utility of such information to better understand neural network training. We build
the Cockpit library that visualizes what is happening inside the model during training through various
instruments that rely on BackPACK’s statistics. We show how Cockpit provides a meaningful statistical
summary report to the deep learning engineer to identify bugs in their machine learning pipeline, guide
hyperparameter tuning, and study deep learning phenomena.
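
Cockpit's own API is not reproduced here; the hedged sketch below merely illustrates the kind of statistic such an instrument computes, a per-parameter gradient signal-to-noise ratio built from BackPACK's per-sample gradients:

```python
import torch
from torch import nn
from backpack import backpack, extend
from backpack.extensions import BatchGrad

model = extend(nn.Sequential(nn.Linear(10, 1)))
lossfunc = extend(nn.MSELoss())
X, y = torch.randn(32, 10), torch.randn(32, 1)

with backpack(BatchGrad()):
    lossfunc(model(X), y).backward()

for p in model.parameters():
    g = p.grad_batch.flatten(start_dim=1)         # (N, #params) per-sample contributions
    snr = g.mean(0).pow(2) / (g.var(0) + 1e-12)   # elementwise signal-to-noise ratio
    print(snr.mean().item())                      # low values hint at noise-dominated gradients
```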
Finally, we use BackPACK’s extended automatic differentiation functionality to develop ViViT, an approach
to efficiently compute curvature information, in particular curvature noise. It uses the low-rank structure
of the generalized Gauss-Newton approximation to the Hessian and addresses shortcomings in existing
curvature approximations. By monitoring curvature noise, we demonstrate how ViViT's information
helps in understanding the challenges of making second-order optimization methods work in practice.
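
The low-rank structure ViViT exploits can be stated compactly. For N data points with model outputs f_n in R^C and per-sample loss Hessians, the generalized Gauss-Newton matrix has the standard form (generic notation):

\[
G \;=\; \frac{1}{N} \sum_{n=1}^{N} \big(J_\theta f_n\big)^\top \, \nabla_{f}^2 \ell_n \, \big(J_\theta f_n\big) \;=\; V V^\top ,
\qquad V \in \mathbb{R}^{D \times NC},
\]

so its rank is at most NC, typically far below the parameter dimension D. Spectral quantities of G can therefore be obtained from the small NC x NC Gram matrix V^T V.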
This work develops new tools to experiment more easily with higher-order information in complex deep
learning models. These tools have impacted works on Bayesian applications with Laplace approximations,
out-of-distribution generalization, differential privacy, and the design of automatic differentiation
systems. They constitute one important step towards developing and establishing more efficient deep
learning algorithms.
LIPIcs, Volume 251, ITCS 2023, Complete Volume
Decision-making with Gaussian processes: sampling strategies and Monte Carlo methods
We study Gaussian processes and their application to decision-making in the real world. We begin by reviewing the foundations of Bayesian decision theory and show how these ideas give rise to methods such as Bayesian optimization. We investigate practical techniques for carrying out these strategies, with an emphasis on estimating and maximizing acquisition functions. Finally, we introduce pathwise approaches to conditioning Gaussian processes and demonstrate key benefits for representing random variables in this manner.
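
The pathwise view mentioned above is commonly written via Matheron's rule: a posterior sample is a prior sample plus a data-driven correction (stated here for Gaussian observation noise with variance \sigma^2):

\[
(f \mid \mathbf{y})(\cdot) \;=\; f(\cdot) \;+\; k(\cdot, X)\big(K_{XX} + \sigma^2 I\big)^{-1}\big(\mathbf{y} - f(X) - \boldsymbol{\varepsilon}\big),
\qquad \boldsymbol{\varepsilon} \sim \mathcal{N}(0, \sigma^2 I),
\]

where f is a draw from the prior and k the covariance function. Representing random variables this way lets one sample entire functions and then evaluate or optimize them anywhere.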
Operational modal analysis and prediction of remaining useful life for rotating machinery
This thesis was submitted for the award of Doctor of Philosophy and was awarded by Brunel University London.

The significance of rotating machinery spans areas from household items to vital industry sectors, such as aerospace, automotive, railway, sea transport, resource extraction, and manufacturing. Hence, our technologised society depends on the efficient and reliable operation of rotating machinery. To contribute to this aim, this thesis leverages quantities measurable during operation for structural-mechanical evaluation, employing Operational Modal Analysis (OMA), and for the prediction of Remaining Useful Life (RUL). Modal parameters determined by OMA are central to the design, testing, and validation of rotating machinery. This thesis introduces the first open parametric simulation dataset of rotating machinery during an acceleration run. As there is a lack of similar open datasets suitable for OMA, it lays a foundation for improved reproducibility and comparability of future research. Based on this, the Averaged Order-Based Modal Analysis (AOBMA) method is developed. The novel addition of scaling and weighted averaging of individual machine orders in AOBMA alleviates the analysis effort of the existing Order-Based Modal Analysis (OBMA) method by providing a unified set of modal parameters with higher accuracy (a toy sketch of this averaging idea follows the abstract). AOBMA achieved a mean absolute relative error of 0.03 for damping ratio estimates across the compared modes, whereas OBMA's error was 0.32, depending on the processed order. Under excitation with high harmonic contributions, AOBMA also identified the highest number of modes accurately among the compared methods: at a harmonic ratio of 0.8, for example, AOBMA identified an average of 11.9 modes per estimation, while OBMA and baseline OMA followed with 9.5 and 9 modes, respectively. Moreover, this is the first study that systematically evaluates the impact of excitation conditions on the compared methods; it finds that OBMA and AOBMA outperform traditional OMA in mode shape estimation accuracy.

While OMA can be used to evaluate significant structural changes, Machine Learning (ML) methods have seen substantially greater success in condition monitoring, including RUL prediction. However, as these methods often require large amounts of time- and cost-intensive training data, a novel data-efficient RUL prediction methodology is introduced, taking advantage of distinct healthy and faulty condition data. When the number of training sequences from an open dataset is reduced to 5%, an average prediction Root Mean Square Error (RMSE) of 24.9 operation cycles is achieved, outperforming the baseline method's RMSE of 28.1. Motivated by environmental considerations, the impact of data reduction on the training duration of several method variants is quantified. When the full training set is utilised, the most resource-saving variant of the proposed approach achieves an average training duration that is 8.9% of the baseline method's.
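
The thesis's exact scaling and weighting scheme is not given in this abstract; the toy sketch below only illustrates the general idea of fusing per-order estimates into one unified value using quality-based weights (all numbers hypothetical):

```python
import numpy as np

# Hypothetical per-order damping ratio estimates from an order-based analysis.
damping_per_order = np.array([0.031, 0.028, 0.035])
# Hypothetical weights, e.g., derived from order energy or fit quality.
weights = np.array([0.5, 0.3, 0.2])

# A weighted average yields a single unified estimate across machine orders.
unified_damping = np.average(damping_per_order, weights=weights)
print(f"unified damping ratio: {unified_damping:.4f}")
```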
LIPIcs, Volume 261, ICALP 2023, Complete Volume
CoLA: Exploiting Compositional Structure for Automatic and Efficient Numerical Linear Algebra
Many areas of machine learning and science involve large linear algebra
problems, such as eigendecompositions, solving linear systems, computing matrix
exponentials, and trace estimation. The matrices involved often have Kronecker,
convolutional, block diagonal, sum, or product structure. In this paper, we
propose a simple but general framework for large-scale linear algebra problems
in machine learning, named CoLA (Compositional Linear Algebra). By combining a
linear operator abstraction with compositional dispatch rules, CoLA
automatically constructs memory- and runtime-efficient numerical algorithms.
Moreover, CoLA provides memory-efficient automatic differentiation,
low-precision computation, and GPU acceleration in both JAX and PyTorch, while also
accommodating new objects, operations, and rules in downstream packages via
multiple dispatch. CoLA can accelerate many algebraic operations, while making
it easy to prototype matrix structures and algorithms, providing an appealing
drop-in tool for virtually any computational effort that requires linear
algebra. We showcase its efficacy across a broad range of applications,
including partial differential equations, Gaussian processes, equivariant model
construction, and unsupervised learning.

Code available at https://github.com/wilson-labs/col
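
As a toy illustration of the kind of dispatch rule such a framework encodes (this is not CoLA's actual API), a linear solve against a Kronecker product A ⊗ B can be reduced to two small solves via the identity (A ⊗ B) vec(X) = vec(B X Aᵀ), never materializing the large matrix:

```python
import numpy as np

def kron_solve(A, B, c):
    """Solve (A kron B) x = c using two small solves instead of one big one."""
    p, q = A.shape[0], B.shape[0]
    C = c.reshape(q, p, order="F")    # unvec (column-major): c = vec(C)
    Y = np.linalg.solve(B, C)         # apply B^{-1} from the left
    X = np.linalg.solve(A, Y.T).T     # apply A^{-T} from the right
    return X.flatten(order="F")       # re-vec

rng = np.random.default_rng(0)
A, B = rng.normal(size=(3, 3)), rng.normal(size=(4, 4))
c = rng.normal(size=12)
x = kron_solve(A, B, c)
assert np.allclose(np.kron(A, B) @ x, c)  # matches the dense solve
```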
Structured prior distributions for the covariance matrix in latent factor models
Factor models are widely used for dimension reduction in the analysis of
multivariate data. This is achieved through decomposition of a p x p covariance
matrix into the sum of two components. Through a latent factor representation,
these components can be interpreted as a diagonal matrix of idiosyncratic variances and a
shared variation matrix, that is, the product of a p x k factor loadings matrix
and its transpose. If k << p, this defines a sparse factorisation of the
covariance matrix. Historically, little attention has been paid to
incorporating prior information in Bayesian analyses using factor models where,
at best, the prior for the factor loadings is order invariant. In this work, a
class of structured priors is developed that can encode ideas of dependence
structure about the shared variation matrix. The construction allows
data-informed shrinkage towards sensible parametric structures while also
facilitating inference over the number of factors. Using an unconstrained
reparameterisation of stationary vector autoregressions, the methodology is
extended to stationary dynamic factor models. For computational inference,
parameter-expanded Markov chain Monte Carlo samplers are proposed, including an
efficient adaptive Gibbs sampler. Two substantive applications showcase the
scope of the methodology and its inferential benefits.
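
In symbols, the decomposition described above is (generic notation):

\[
\Sigma \;=\; \Lambda \Lambda^\top + \Psi, \qquad \Lambda \in \mathbb{R}^{p \times k}, \quad \Psi = \mathrm{diag}(\psi_1, \dots, \psi_p),
\]

where \Lambda \Lambda^\top is the shared variation matrix and \Psi collects the idiosyncratic variances; k << p makes the factorisation of the covariance matrix sparse.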
A Survey of Geometric Optimization for Deep Learning: From Euclidean Space to Riemannian Manifold
Although Deep Learning (DL) has achieved success in complex Artificial
Intelligence (AI) tasks, it suffers from various notorious problems (e.g.,
feature redundancy and vanishing or exploding gradients), since updating
parameters in Euclidean space cannot fully exploit the geometric structure of
the solution space. As a promising alternative solution, Riemannian-based DL
uses geometric optimization to update parameters on Riemannian manifolds and
can leverage the underlying geometric information. Accordingly, this article
presents a comprehensive survey of applying geometric optimization in DL. First,
this article introduces the basic procedure of geometric optimization, including
various geometric optimizers and some concepts of Riemannian manifolds.
Subsequently, this article investigates the application of geometric optimization
in different DL networks for various AI tasks, e.g., convolutional neural
networks, recurrent neural networks, transfer learning, and
optimal transport. Additionally, typical public toolboxes that implement
optimization on manifolds are also discussed. Finally, this article makes a
performance comparison between different deep geometric optimization methods
in image recognition scenarios.
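
As a minimal, toolbox-independent sketch of the update such geometric optimizers perform, here is one Riemannian gradient step on the unit sphere: project the Euclidean gradient onto the tangent space, step, then retract back onto the manifold.

```python
import numpy as np

def sphere_step(x, egrad, lr=0.05):
    """One Riemannian gradient step on the unit sphere."""
    rgrad = egrad - np.dot(x, egrad) * x    # project gradient onto tangent space
    x_new = x - lr * rgrad                  # move along the tangent direction
    return x_new / np.linalg.norm(x_new)    # retract back onto the sphere

# Example: minimize f(x) = x^T M x on the sphere; the iterates approach an
# eigenvector of M's smallest eigenvalue.
rng = np.random.default_rng(0)
M = rng.normal(size=(5, 5)); M = M + M.T
x = rng.normal(size=5); x /= np.linalg.norm(x)
for _ in range(500):
    x = sphere_step(x, 2 * M @ x)
```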