196 research outputs found
Deep Learning and Linear Programming for Automated Ensemble Forecasting and Interpretation
This paper presents an ensemble forecasting method that shows strong results
on the M4 Competition dataset by decreasing feature and model selection
assumptions, termed DONUT (DO Not UTilize human beliefs). Our assumption
reductions, primarily consisting of auto-generated features and a more diverse
model pool for the ensemble, significantly outperform the statistical,
feature-based ensemble method FFORMA by Montero-Manso et al. (2020). We also
investigate feature extraction with a Long Short-term Memory Network (LSTM)
Autoencoder and find that such features contain crucial information not
captured by standard statistical feature approaches. The ensemble weighting
model uses LSTM and statistical features to combine the models accurately. The
analysis of feature importance and interaction shows a slight superiority for
LSTM features over the statistical ones alone. Clustering analysis shows that
the essential LSTM features differ both from most statistical features and from
each other.
We also find that increasing the solution space of the weighting model by
augmenting the ensemble with new models is something the weighting model learns
to use, thus explaining part of the accuracy gains. Moreover, we present a
formal ex-post-facto analysis of an optimal combination and selection for
ensembles, quantifying differences through linear optimization on the M4
dataset. Our findings indicate that classical statistical time series features,
such as trend and seasonality, alone do not capture all relevant information
for forecasting a time series. By contrast, our novel LSTM features contain
significantly more predictive power than the statistical ones alone, but
combining the two feature sets proved best in practice.
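The weighting scheme described above can be illustrated with a minimal sketch: a weighting model assigns each pool member a score (in the paper, from LSTM and statistical features), and the forecasts are combined with softmax weights. Function names and toy numbers here are illustrative, not taken from the paper.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def weighted_ensemble_forecast(model_forecasts, scores):
    """Combine per-model forecasts with softmax weights.

    model_forecasts: (n_models, horizon) array of point forecasts
    scores: (n_models,) unnormalised scores, e.g. from a weighting
            model fed LSTM and statistical features
    """
    weights = softmax(scores)            # non-negative, sum to 1
    return weights @ model_forecasts     # convex combination per step

# toy usage: three pool members, forecast horizon of 4
forecasts = np.array([[1.0, 1.1, 1.2, 1.3],
                      [0.9, 1.0, 1.1, 1.2],
                      [1.2, 1.3, 1.4, 1.5]])
scores = np.array([2.0, 0.5, 1.0])
combined = weighted_ensemble_forecast(forecasts, scores)
```

Because the weights form a convex combination, the combined forecast always stays within the range spanned by the pool members at each horizon step.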
On the Theoretical Properties of Noise Correlation in Stochastic Optimization
Studying the properties of stochastic noise to optimize complex non-convex
functions has been an active area of research in the field of machine learning.
Prior work has shown that the noise of stochastic gradient descent improves
optimization by overcoming undesirable obstacles in the landscape. Moreover,
injecting artificial Gaussian noise has become a popular idea to quickly escape
saddle points. Indeed, in the absence of reliable gradient information, the
noise is used to explore the landscape, but it is unclear what type of noise is
optimal in terms of exploration ability. In order to narrow this gap in our
knowledge, we study a general type of continuous-time non-Markovian process,
based on fractional Brownian motion, that allows for the increments of the
process to be correlated. This generalizes processes based on Brownian motion,
such as the Ornstein-Uhlenbeck process. We demonstrate how to discretize such
processes which gives rise to the new algorithm fPGD. This method is a
generalization of the known algorithms PGD and Anti-PGD. We study the
properties of fPGD both theoretically and empirically, demonstrating that it
possesses exploration abilities that, in some cases, are favorable over PGD and
Anti-PGD. These results open the field to novel ways to exploit noise for
training machine learning models.
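As a rough illustration of the construction above, the sketch below samples correlated fractional-Brownian-motion increments via a Cholesky factor of their covariance and feeds them into a perturbed gradient-descent loop. This is an illustrative one-dimensional toy under assumed names and scalings, not the paper's fPGD; `hurst = 0.5` recovers uncorrelated (standard Brownian) perturbations.

```python
import numpy as np

def fbm_increments(n, hurst, rng):
    """Sample n unit-time increments of fractional Brownian motion with
    Hurst index `hurst`, via a Cholesky factor of their covariance.
    hurst = 0.5 gives uncorrelated increments."""
    k = np.arange(n, dtype=float)
    gamma = 0.5 * ((k + 1) ** (2 * hurst) - 2 * k ** (2 * hurst)
                   + np.abs(k - 1) ** (2 * hurst))
    cov = gamma[np.abs(np.arange(n)[:, None] - np.arange(n)[None, :])]
    L = np.linalg.cholesky(cov + 1e-12 * np.eye(n))  # jitter for safety
    return L @ rng.standard_normal(n)

def perturbed_gd(grad, x0, lr, sigma, steps, hurst, rng):
    """Gradient descent on a scalar objective with fBM-correlated
    perturbations -- the spirit of fPGD, not the paper's algorithm."""
    noise = fbm_increments(steps, hurst, rng)
    x = x0
    for t in range(steps):
        x = x - lr * grad(x) + sigma * noise[t]
    return x

# toy usage: minimise f(x) = x**2 with mildly persistent noise
x_star = perturbed_gd(lambda x: 2 * x, x0=5.0, lr=0.1, sigma=0.01,
                      steps=200, hurst=0.7, rng=np.random.default_rng(0))
```

With `hurst > 0.5` successive increments are positively correlated, giving more persistent exploration than i.i.d. Gaussian perturbations; `hurst < 0.5` gives anti-correlated increments, in the direction of Anti-PGD.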
Learning Robust Statistics for Simulation-based Inference under Model Misspecification
Simulation-based inference (SBI) methods such as approximate Bayesian
computation (ABC), synthetic likelihood, and neural posterior estimation (NPE)
rely on simulating statistics to infer parameters of intractable likelihood
models. However, such methods are known to yield untrustworthy and misleading
inference outcomes under model misspecification, thus hindering their
widespread applicability. In this work, we propose the first general approach
to handle model misspecification that works across different classes of SBI
methods. Leveraging the fact that the choice of statistics determines the
degree of misspecification in SBI, we introduce a regularized loss function
that penalises those statistics that increase the mismatch between the data and
the model. Taking NPE and ABC as use cases, we demonstrate the superior
performance of our method on high-dimensional time-series models that are
artificially misspecified. We also apply our method to real data from the field
of radio propagation where the model is known to be misspecified. We show
empirically that the method yields robust inference in misspecified scenarios,
whilst still being accurate when the model is well-specified.
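The regularised objective can be sketched in the abstract: a base SBI fit loss plus a penalty on the mismatch between observed and simulated statistics, weighted by a hyperparameter `lam`. The concrete mismatch measure below (squared distance of statistic means) is a hypothetical stand-in for whatever discrepancy the method actually penalises, and the function names are invented for illustration.

```python
import numpy as np

def statistic_mismatch(sim_stats, obs_stats):
    """Squared distance between the mean simulated statistics and the
    observed statistics -- a simple stand-in for the data/model
    mismatch that the regulariser targets."""
    return float(np.sum((sim_stats.mean(axis=0) - obs_stats) ** 2))

def regularised_loss(fit_loss, sim_stats, obs_stats, lam):
    """Base SBI fit loss (e.g. an NPE objective) plus a penalty on
    statistics that widen the gap between data and model."""
    return fit_loss + lam * statistic_mismatch(sim_stats, obs_stats)

# toy usage: simulated statistics systematically shifted from the data,
# as under a misspecified simulator
rng = np.random.default_rng(0)
sim_stats = rng.standard_normal((100, 3)) + 1.0
obs_stats = np.zeros(3)
penalised = regularised_loss(0.5, sim_stats, obs_stats, lam=1.0)
```

The penalty pushes the learned statistics toward ones on which the (misspecified) model can still match the data, which is what makes the downstream inference robust.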
Stochastic Collapse: How Gradient Noise Attracts SGD Dynamics Towards Simpler Subnetworks
In this work, we reveal a strong implicit bias of stochastic gradient descent
(SGD) that drives overly expressive networks to much simpler subnetworks,
thereby dramatically reducing the number of independent parameters, and
improving generalization. To reveal this bias, we identify invariant sets, or
subsets of parameter space that remain unmodified by SGD. We focus on two
classes of invariant sets that correspond to simpler (sparse or low-rank)
subnetworks and commonly appear in modern architectures. Our analysis uncovers
that SGD exhibits a property of stochastic attractivity towards these simpler
invariant sets. We establish a sufficient condition for stochastic attractivity
based on a competition between the loss landscape's curvature around the
invariant set and the noise introduced by stochastic gradients. Remarkably, we
find that an increased level of noise strengthens attractivity, leading to the
emergence of attractive invariant sets associated with saddle-points or local
maxima of the train loss. We observe empirically the existence of attractive
invariant sets in trained deep neural networks, implying that SGD dynamics
often collapses to simple subnetworks with either vanishing or redundant
neurons. We further demonstrate how this simplifying process of stochastic
collapse benefits generalization in a linear teacher-student framework.
Finally, through this analysis, we mechanistically explain why early training
with large learning rates for extended periods benefits subsequent
generalization.
Comment: 37 pages, 12 figures, NeurIPS 202
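The notion of an invariant set can be made concrete with a toy example: in a two-layer ReLU network, a neuron whose incoming and outgoing weights are all zero receives exactly zero gradient, so SGD can never revive it. A minimal numpy sketch with hand-written gradients (illustrative only, not the paper's setup):

```python
import numpy as np

def sgd_step(W1, w2, x, y, lr):
    """One SGD step on f(x) = w2 @ relu(W1 @ x) with squared loss,
    gradients written out by hand."""
    h = np.maximum(W1 @ x, 0.0)                 # hidden activations
    err = w2 @ h - y                            # residual
    grad_w2 = err * h                           # dL/dw2
    grad_W1 = np.outer(err * w2 * (h > 0), x)   # dL/dW1
    return W1 - lr * grad_W1, w2 - lr * grad_w2

rng = np.random.default_rng(0)
W1 = rng.standard_normal((3, 2))
w2 = rng.standard_normal(3)
W1[0] = 0.0   # neuron 0 starts "dead": zero incoming ...
w2[0] = 0.0   # ... and outgoing weights
for _ in range(100):
    x, y = rng.standard_normal(2), rng.standard_normal()
    W1, w2 = sgd_step(W1, w2, x, y, lr=0.1)
# neuron 0 received zero gradient at every step: the set is invariant
```

The abstract's point is stronger than mere invariance: under sufficient gradient noise such sets become *attractive*, so nearby iterates collapse onto them rather than merely staying there once reached.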
Flatter, faster: scaling momentum for optimal speedup of SGD
Commonly used optimization algorithms often show a trade-off between good
generalization and fast training times. For instance, stochastic gradient
descent (SGD) tends to have good generalization; however, adaptive gradient
methods have superior training times. Momentum can help accelerate training
with SGD, but so far there has been no principled way to select the momentum
hyperparameter. Here we study training dynamics arising from the interplay
between SGD with label noise and momentum in the training of overparametrized
neural networks. We find that scaling the momentum hyperparameter $1-\beta$
with the learning rate to the power $2/3$ maximally accelerates training,
without sacrificing generalization. To analytically derive this result we
develop an architecture-independent framework, where the main assumption is the
existence of a degenerate manifold of global minimizers, as is natural in
overparametrized models. Training dynamics display the emergence of two
characteristic timescales that are well-separated for generic values of the
hyperparameters. The maximum acceleration of training is reached when these two
timescales meet, which in turn determines the scaling limit we propose. We
confirm our scaling rule for synthetic regression problems (matrix sensing and
teacher-student paradigm) and classification for realistic datasets (ResNet-18
on CIFAR10, 6-layer MLP on FashionMNIST), suggesting the robustness of our
scaling rule to variations in architectures and datasets.Comment: v2: expanded introduction section, corrected minor typos. v1: 12+13
pages, 3 figure
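The proposed rule ties the momentum hyperparameter to the learning rate through a power law. Below is a minimal sketch of heavy-ball SGD with that coupling on a noisy quadratic; the exponent is left as an explicit knob, and the toy problem and numbers are illustrative, not the paper's experiments.

```python
import numpy as np

def momentum_sgd(grad, x0, lr, exponent, steps, rng, noise_std=0.1):
    """Heavy-ball SGD with the momentum hyperparameter tied to the
    learning rate by a power law, 1 - beta = lr ** exponent (the form
    of scaling rule discussed in the abstract)."""
    beta = 1.0 - lr ** exponent
    x, v = x0, 0.0
    for _ in range(steps):
        g = grad(x) + noise_std * rng.standard_normal()  # noisy gradient
        v = beta * v - lr * g                            # velocity update
        x = x + v
    return x

# toy usage: noisy quadratic f(x) = x**2, illustrative exponent 2/3
x_final = momentum_sgd(lambda x: 2 * x, x0=5.0, lr=0.01, exponent=2 / 3,
                       steps=500, rng=np.random.default_rng(0))
```

Note that as `lr` shrinks, `beta` approaches 1, so the effective step `lr / (1 - beta) = lr ** (1 - exponent)` shrinks more slowly than the raw learning rate; this is how the coupling buys acceleration.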
Modelling of the In-Play Football Betting Market
This thesis is about modelling the in-play football betting market. Our aim is to apply and extend financial mathematical concepts and models to value and risk-manage in-play football bets. We also apply machine learning methods to predict the outcome of the game using in-play indicators. In-play football betting provides a unique opportunity to observe the interplay between a clearly defined fundamental process, that is, the game itself, and a market built on top of this process, the in-play betting market.
This is in contrast with classical finance, where the relationship between the fundamentals and the market is often indirect or unclear owing to the lack of a direct connection, lack of information, and infrequent or delayed information. What makes football betting unique is that the physical fundamentals are well observable thanks to rich high-frequency data sets, the games have a limited time horizon of usually 90 minutes, which avoids the build-up of long-term expectations, and the payoff of the traded products is directly linked to the fundamentals.
In the first part of the thesis we show that a number of results in financial mathematics developed for financial derivatives can be applied to value and risk-manage in-play football bets. In the second part we develop models to predict the outcomes of football games using in-play data.
First, we show that the concepts of risk-neutral measure, arbitrage freeness and completeness can also be applied to in-play football betting. This is achieved by assuming a model where the scores of the two teams follow standard Poisson processes with constant intensities. We note that this model is analogous to the Black-Scholes model in many ways.
Second, we observe that an implied intensity smile exists in football betting, and we propose the so-called Local Intensity model. This is motivated by the local volatility model from finance, which answered the problem of the implied volatility smile. We show that the counterparts of the Dupire formulae [31] can also be derived in this setting.
Third, we propose a Microscopic Model to describe not only the number of goals scored by the two teams but also two additional variables: the position of the ball and the team holding the ball. We start from a general model where the model parameters are multivariate functions of all the state variables. We then characterise the general parameter surfaces using in-play game data and arrive at a simplified model of only 13 scalar parameters. We show that a semi-analytic method can be used to solve the model, use it to predict scoring intensities for various time intervals in the future, and find that the initial ball position and the team holding the ball are relevant for time intervals of under 30 seconds.
Fourth, we consider in-play indicators observed at the end of the first half to predict the number of goals scored during the second half; we refer to this as the First Half Indicators Model. We use various feature selection methods to identify relevant indicators and different machine learning models to predict goal intensities for the second half. In our setting, a linear model with Elastic Net regularisation had the best performance.
Fifth, we compare the predictive powers of the Microscopic Model and the First Half Indicators Model, and we find that the Microscopic Model outperforms the First Half Indicators Model for delays of under 30 seconds, because this is the time frame where the initial team holding the ball and the initial position of the ball are relevant.
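The constant-intensity Poisson model described in the first part admits a compact illustration: given the current score, the remaining time, and per-minute scoring intensities, match-outcome probabilities follow by summing over the two teams' independent Poisson goal counts. The intensities and score below are hypothetical, and the truncation at `max_goals` is a practical approximation.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(N = k) for N ~ Poisson(lam)."""
    return lam ** k * exp(-lam) / factorial(k)

def outcome_probs(score_h, score_a, minutes_left, lam_h, lam_a,
                  max_goals=10):
    """Home-win / draw / away-win probabilities under the constant-
    intensity Poisson model: each team's remaining goals are independent
    Poisson with mean intensity * remaining time."""
    p = [0.0, 0.0, 0.0]                       # home, draw, away
    for gh in range(max_goals + 1):
        for ga in range(max_goals + 1):
            w = (poisson_pmf(gh, lam_h * minutes_left)
                 * poisson_pmf(ga, lam_a * minutes_left))
            diff = (score_h + gh) - (score_a + ga)
            p[0 if diff > 0 else (1 if diff == 0 else 2)] += w
    return tuple(p)

# hypothetical: 1-0 at half time, per-minute intensities, 45' remaining
p_home, p_draw, p_away = outcome_probs(1, 0, 45, 0.015, 0.012)
```

Under no-arbitrage, quoted in-play odds should reflect such probabilities, which is what makes the risk-neutral valuation machinery of the first part applicable here.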