The deep learning recipe of casting real-world problems as mathematical
optimisation and tackling that optimisation by training deep neural networks
with gradient-based methods has undoubtedly proven fruitful.
Our understanding of why deep learning works, however, has lagged behind
its practical significance. We aim to take steps towards an improved
understanding of deep learning, with a focus on optimisation and model
regularisation. We start by investigating gradient descent (GD), the
discrete-time algorithm underlying most popular deep learning optimisation
methods. Understanding the dynamics of GD has been hindered by the presence
of discretisation drift, the numerical integration error between GD and its
often-studied continuous-time counterpart, the negative gradient flow (NGF).
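Concretely, for a loss E(\theta) and learning rate h, the two objects are (a standard formulation, notation ours)

\[
\theta_{t+1} = \theta_t - h \, \nabla E(\theta_t) \quad \text{(GD)},
\qquad
\dot{\theta} = -\nabla E(\theta) \quad \text{(NGF)};
\]

GD is the explicit Euler discretisation of the NGF, so a single GD step started from the NGF's state drifts from the flow's solution at time h by O(h^2).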
To add to the toolkit available for studying GD, we derive novel
continuous-time flows that account for discretisation drift. Unlike the NGF,
these new flows can be used to describe learning-rate-specific behaviours of
GD, such as the training instabilities observed in supervised learning and
two-player games.
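As one illustration of the general idea, a first-order drift correction in the spirit of backward error analysis (a sketch of the approach, not necessarily the exact flows derived in the thesis) replaces the NGF with

\[
\dot{\theta} = -\nabla \tilde{E}(\theta),
\qquad
\tilde{E}(\theta) = E(\theta) + \frac{h}{4} \, \big\| \nabla E(\theta) \big\|^2 ,
\]

which tracks GD with a per-step error of O(h^3) rather than the NGF's O(h^2); crucially, the learning rate h now appears in the continuous-time model, so h-dependent behaviours become expressible.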
We then translate insights from continuous time into mitigation strategies for
unstable GD dynamics by constructing novel learning rate schedules and
regularisers that do not require additional hyperparameters.
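Purely for flavour, here is a hypothetical sketch of a stability-motivated, hyperparameter-free learning rate choice, built only on the classical GD stability condition h < 2/\lambda_max for quadratic losses; the names below are ours, and this is not the schedule proposed in the thesis:

    import jax
    import jax.numpy as jnp

    def stable_lr(loss_fn, theta, n_iters=20, seed=0):
        # Hypothetical helper: estimate the largest Hessian eigenvalue
        # lambda_max by power iteration on Hessian-vector products, then
        # return a learning rate just inside the classical stability
        # threshold h < 2 / lambda_max (exact for quadratic losses).
        hvp = lambda v: jax.jvp(jax.grad(loss_fn), (theta,), (v,))[1]
        v = jax.random.normal(jax.random.PRNGKey(seed), theta.shape)
        v = v / jnp.linalg.norm(v)
        for _ in range(n_iters):
            hv = hvp(v)
            v = hv / jnp.linalg.norm(hv)
        lam_max = v @ hvp(v)  # Rayleigh quotient
        return 1.9 / lam_max  # small safety margin below 2 / lambda_max

On the toy quadratic E(\theta) = \tfrac{1}{2} \theta^\top \mathrm{diag}(10, 1) \theta this returns roughly 0.19, just below the exact threshold 2/10.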
Like optimisation, smoothness regularisation is a pillar of deep learning's
success, with wide use in supervised learning and generative modelling.
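Smoothness regularisation takes several concrete forms; purely as an illustration, a gradient penalty on the inputs (one widely used form; spectral normalisation is another) can be sketched as follows, where apply_fn, base_loss, and the weight lam are our illustrative names:

    import jax
    import jax.numpy as jnp

    def smooth_loss(params, x, y, apply_fn, base_loss, lam=1.0):
        # Penalise the squared norm of the loss's gradient with respect
        # to the inputs, discouraging sharp input sensitivity. A sketch
        # of a common smoothness regulariser, not code from the thesis.
        input_grads = jax.grad(
            lambda xi: base_loss(apply_fn(params, xi), y))(x)
        return (base_loss(apply_fn(params, x), y)
                + lam * jnp.sum(input_grads ** 2))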
Despite their individual significance, the interactions between smoothness
regularisation and optimisation have yet to be explored. We find that
smoothness regularisation affects optimisation across multiple deep learning
domains, and that incorporating smoothness regularisation in reinforcement
learning leads to a performance boost that can be recovered through
adaptations to optimisation
methods.