Deep learning that scales: leveraging compute and data
Deep learning has revolutionized the field of artificial intelligence in the past decade. Although these techniques were developed over many years, the recent rise of deep learning is explained by the increased availability of data and compute, which has unlocked the potential of deep neural networks. These networks have become ubiquitous in domains where enough training data is available, such as natural language processing, computer vision, speech processing, and control. Recent years have seen continuous progress driven by ever-growing neural networks that benefit from large amounts of data and computing power.
This thesis is motivated by the observation that scale is one of the key factors driving progress in deep learning research, and it aims to devise deep learning methods that scale gracefully with the available data and compute. We narrow this scope down to two main research directions. The first is concerned with designing hardware-aware methods that make the most of the computing resources in current high-performance computing facilities. The second studies bottlenecks that prevent existing methods from scaling up as more data becomes available, and provides solutions that contribute towards enabling the training of more complex models.
This dissertation studies the aforementioned research questions for two different learning paradigms, each with its own algorithmic and computational characteristics. The first part of the thesis studies the paradigm in which a model learns from a fixed collection of examples, extracting as much information as possible from the given data. The second part is concerned with training agents that learn by interacting with a simulated environment, which introduces unique challenges such as efficient exploration and simulation.
Optimization Theory for ReLU Neural Networks Trained with Normalization Layers
The success of deep neural networks is in part due to the use of
normalization layers. Normalization layers like Batch Normalization, Layer
Normalization and Weight Normalization are ubiquitous in practice, as they
improve generalization performance and speed up training significantly.
Nonetheless, the vast majority of current deep learning theory and non-convex
optimization literature focuses on the un-normalized setting, where the
functions under consideration do not exhibit the properties of commonly
normalized neural networks. In this paper, we bridge this gap by giving the
first global convergence result for two-layer neural networks with ReLU
activations trained with a normalization layer, namely Weight Normalization.
Our analysis shows how the introduction of normalization layers changes the
optimization landscape and can enable faster convergence as compared with
un-normalized neural networks.
Comment: To be presented at ICML 202
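The Weight Normalization layer analyzed above reparameterizes each weight vector into a direction and a scale, w = g · v/‖v‖. A minimal NumPy sketch of a two-layer ReLU network with a weight-normalized first layer may help make this concrete; the network shape and variable names are our illustration, not the paper's exact setup:

```python
import numpy as np

def weight_norm_forward(V, g, c, x):
    """Two-layer ReLU network with Weight Normalization on the first layer.

    Each hidden unit's weight vector is reparameterized as
    w_i = g_i * v_i / ||v_i||, decoupling direction (v_i) from scale (g_i).
    V : (m, d) unnormalized direction vectors, one row per hidden unit
    g : (m,)   per-unit scale parameters
    c : (m,)   output-layer weights
    x : (d,)   input
    """
    norms = np.linalg.norm(V, axis=1, keepdims=True)  # ||v_i|| per hidden unit
    W = g[:, None] * V / norms                        # w_i = g_i * v_i / ||v_i||
    hidden = np.maximum(W @ x, 0.0)                   # ReLU activations
    return c @ hidden                                 # scalar network output

rng = np.random.default_rng(0)
d, m = 4, 8
V = rng.normal(size=(m, d))
g = np.ones(m)
c = rng.normal(size=m)
x = rng.normal(size=d)
y = weight_norm_forward(V, g, c, x)
```

A useful sanity check of the reparameterization: rescaling the direction vectors V leaves the output unchanged, since only g controls the effective weight magnitudes.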
Expressive Monotonic Neural Networks
The monotonic dependence of the outputs of a neural network on some of its
inputs is a crucial inductive bias in many scenarios where domain knowledge
dictates such behavior. This is especially important for interpretability and
fairness considerations. In a broader context, scenarios in which monotonicity
is important can be found in finance, medicine, physics, and other disciplines.
It is thus desirable to build neural network architectures that implement this
inductive bias provably. In this work, we propose a weight-constrained
architecture with a single residual connection to achieve exact monotonic
dependence in any subset of the inputs. The weight constraint scheme directly
controls the Lipschitz constant of the neural network and thus provides the
additional benefit of robustness. Compared to currently existing techniques
used for monotonicity, our method is simpler to implement, rests on simpler
theoretical foundations, has negligible computational overhead, is guaranteed
to produce monotonic dependence, and is highly expressive. We show how the algorithm is
used to train powerful, robust, and interpretable discriminators that achieve
competitive performance compared to current state-of-the-art methods across
various benchmarks, from social applications to the classification of the
decays of subatomic particles produced at the CERN Large Hadron Collider.
Comment: 9 pages, 4 figures, ICLR 2023 final submission
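The core idea above — bound the Lipschitz constant of a network by λ, then add a residual term λ·Σ x_i over the monotonic inputs so every partial derivative in that subset is non-negative — can be sketched as follows. This is an illustrative toy, not the paper's exact constraint scheme; the class name and the spectral-norm projection are our assumptions:

```python
import numpy as np

def project_lipschitz(W, bound):
    """Rescale W so its spectral norm is at most `bound`."""
    s = np.linalg.norm(W, 2)  # largest singular value
    return W if s <= bound else W * (bound / s)

class MonotonicNet:
    """Toy network monotone in a chosen subset of inputs.

    g(x) is a two-layer ReLU net whose L2 Lipschitz constant is forced
    <= lam by projecting each weight matrix to spectral norm sqrt(lam)
    (ReLU itself is 1-Lipschitz, so the product of layer norms bounds the
    whole net).  Since |dg/dx_i| <= lam everywhere, the residual term
    lam * sum_{i in S} x_i guarantees df/dx_i >= 0 for every i in S.
    """
    def __init__(self, d, m, lam, mono_idx, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = project_lipschitz(rng.normal(size=(m, d)), np.sqrt(lam))
        self.W2 = project_lipschitz(rng.normal(size=(1, m)), np.sqrt(lam))
        self.lam = lam
        self.mono_idx = mono_idx

    def __call__(self, x):
        g = (self.W2 @ np.maximum(self.W1 @ x, 0.0))[0]
        return g + self.lam * x[self.mono_idx].sum()

net = MonotonicNet(d=3, m=16, lam=1.0, mono_idx=[0])
```

Because the residual slope λ exactly matches the Lipschitz bound, monotonicity holds by construction, with no penalty terms or post-hoc verification needed.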
Deep Double Descent via Smooth Interpolation
The ability of overparameterized deep networks to interpolate noisy data,
while at the same time showing good generalization performance, has been
recently characterized in terms of the double descent curve for the test error.
Common intuition from polynomial regression suggests that overparameterized
networks are able to sharply interpolate noisy data, without considerably
deviating from the ground-truth signal, thus preserving generalization ability.
At present, a precise characterization of the relationship between
interpolation and generalization for deep networks is missing. In this work, we
quantify the sharpness with which neural network functions fit the training
data by studying the loss landscape w.r.t. the input variable locally around
each training point, over volumes around cleanly- and noisily-labelled
training samples, as we systematically increase the number of model parameters
and training epochs. Our findings show that loss sharpness in the input space
follows both model- and epoch-wise double descent, with worse peaks observed
around noisy labels. While small interpolating models sharply fit both clean
and noisy data, large interpolating models express a smooth loss landscape,
where noisy targets are predicted over large volumes around training data
points, in contrast to existing intuition.
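The measurement described above — loss sharpness in the input space, taken over volumes around individual training points — can be sketched with a simple Monte-Carlo estimator. The estimator below is our hedged illustration of the general idea, not the paper's exact definition: it samples perturbations in a ball around an input and reports the mean increase of the loss over its value at the point.

```python
import numpy as np

def input_space_sharpness(loss_fn, x, n_samples=256, radius=0.1, seed=0):
    """Monte-Carlo estimate of loss sharpness around one training input.

    Samples points uniformly in a ball of the given radius around x and
    returns the mean excess loss relative to the loss at x itself: large
    values mean the loss rises steeply as the input moves away from x.
    """
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    noise = rng.normal(size=(n_samples, d))
    # Scale directions to get points uniformly distributed in the ball.
    noise *= (radius * rng.uniform(size=(n_samples, 1)) ** (1.0 / d)
              / np.linalg.norm(noise, axis=1, keepdims=True))
    base = loss_fn(x)
    return np.mean([loss_fn(x + n) - base for n in noise])

# Example: a sharp loss basin (steep quadratic) vs. a flat one.
sharp = lambda x: 100.0 * np.sum(x ** 2)
flat = lambda x: np.sum(x ** 2)
x0 = np.zeros(4)
```

Sweeping such an estimator over clean and noisy training points, across model sizes and epochs, is the kind of experiment that reveals the model- and epoch-wise double descent of sharpness described above.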
Eigenvalue initialisation and regularisation for Koopman autoencoders
Regularising the parameter matrices of neural networks is ubiquitous in
training deep models. Typical regularisation approaches suggest initialising
weights with small random values and penalising weights to promote sparsity.
However, these widely used techniques may be less effective in certain
scenarios. Here, we study the Koopman autoencoder model which includes an
encoder, a Koopman operator layer, and a decoder. These models have been
designed and dedicated to tackle physics-related problems with interpretable
dynamics and an ability to incorporate physics-related constraints. However,
the majority of existing work employs standard regularisation practices. In our
work, we take a step toward augmenting Koopman autoencoders with initialisation
and penalty schemes tailored for physics-related settings. Specifically, we
propose the "eigeninit" initialisation scheme that samples initial Koopman
operators from specific eigenvalue distributions. In addition, we suggest the
"eigenloss" penalty scheme that penalises the eigenvalues of the Koopman
operator during training. We demonstrate the utility of these schemes on two
synthetic data sets: a driven pendulum and flow past a cylinder; and two
real-world problems: ocean surface temperatures and cyclone wind fields. We
find on these datasets that eigenloss and eigeninit improve the convergence
rate by up to a factor of 5, and that they reduce the cumulative long-term
prediction error by up to a factor of 3. Such a finding points to the utility
of incorporating similar schemes as an inductive bias in other physics-related
deep learning approaches.
Comment: 18 pages
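The two schemes can be sketched in a few lines: an "eigeninit"-style initializer builds a real Koopman operator from sampled complex-conjugate eigenvalue pairs (via 2×2 rotation-scaling blocks conjugated by a random orthogonal matrix), and an "eigenloss"-style penalty pushes eigenvalue magnitudes toward a target radius. The specific eigenvalue distributions and penalty below are our assumptions for illustration; the paper defines its own.

```python
import numpy as np

def eigeninit(k, radii, angles, seed=0):
    """Build a real k x k operator with eigenvalues r_j * exp(+/- i theta_j).

    One 2x2 rotation-scaling block per (radius, angle) pair, so k must be
    even.  Conjugating by a random orthogonal matrix randomizes the basis
    without changing the eigenvalues.
    """
    rng = np.random.default_rng(seed)
    K = np.zeros((k, k))
    for j, (r, th) in enumerate(zip(radii, angles)):
        K[2 * j:2 * j + 2, 2 * j:2 * j + 2] = r * np.array(
            [[np.cos(th), -np.sin(th)],
             [np.sin(th),  np.cos(th)]])
    Q, _ = np.linalg.qr(rng.normal(size=(k, k)))  # random orthogonal basis
    return Q @ K @ Q.T

def eigenloss(K, target_radius=1.0):
    """Penalise eigenvalue magnitudes drifting from a target radius
    (1.0 favours dynamics that neither decay nor explode)."""
    eig = np.linalg.eigvals(K)
    return np.sum((np.abs(eig) - target_radius) ** 2)

# Operator with eigenvalue magnitudes {1.0, 1.0, 0.9, 0.9}.
K = eigeninit(4, radii=[1.0, 0.9], angles=[0.3, 1.2])
```

In a training loop, `eigenloss(K)` would be added to the reconstruction and prediction losses of the autoencoder, keeping the learned dynamics near the unit circle throughout optimisation.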