
    Deep Learning and Linear Programming for Automated Ensemble Forecasting and Interpretation

    This paper presents an ensemble forecasting method, termed DONUT (DO Not UTilize human beliefs), that achieves strong results on the M4 Competition dataset by reducing feature and model selection assumptions. Our assumption reductions, primarily consisting of auto-generated features and a more diverse model pool for the ensemble, significantly outperform the statistical, feature-based ensemble method FFORMA by Montero-Manso et al. (2020). We also investigate feature extraction with a Long Short-Term Memory (LSTM) autoencoder and find that such features contain crucial information not captured by standard statistical feature approaches. The ensemble weighting model uses both LSTM and statistical features to combine the models accurately. An analysis of feature importance and interaction shows a slight advantage of LSTM features over the statistical features alone. Clustering analysis shows that essential LSTM features differ from most statistical features and from each other. We also find that the weighting model learns to exploit the larger solution space created by augmenting the ensemble with new models, which explains part of the accuracy gains. Moreover, we present a formal ex-post-facto analysis of optimal combination and selection for ensembles, quantifying differences through linear optimization on the M4 dataset. Our findings indicate that classical statistical time series features, such as trend and seasonality, alone do not capture all relevant information for forecasting a time series. On the contrary, our novel LSTM features contain significantly more predictive power than the statistical ones alone, but combining the two feature sets proved best in practice.
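    As a rough illustration of the ex-post-facto analysis mentioned above, the following sketch computes an oracle convex combination of ensemble members by linear programming. The mean-absolute-error objective and the function names are my own assumptions for illustration; the paper works with the M4 loss measures and its own formulation.

```python
import numpy as np
from scipy.optimize import linprog

def oracle_ensemble_weights(forecasts, actuals):
    """Ex-post optimal convex combination of ensemble members under a
    mean-absolute-error objective, solved as a linear program.

    forecasts: array of shape (T, M) -- M member forecasts over T periods.
    actuals:   array of shape (T,)   -- realised values.
    """
    T, M = forecasts.shape
    # Decision variables: [w_1..w_M, e_1..e_T]; minimise sum_t e_t,
    # where e_t bounds the absolute error of the combination at time t.
    c = np.concatenate([np.zeros(M), np.ones(T)])
    # Constraints: e_t >= (F w - y)_t  and  e_t >= -(F w - y)_t.
    A_ub = np.block([[forecasts, -np.eye(T)],
                     [-forecasts, -np.eye(T)]])
    b_ub = np.concatenate([actuals, -actuals])
    # Weights form a convex combination: sum to one, non-negative.
    A_eq = np.concatenate([np.ones(M), np.zeros(T)]).reshape(1, -1)
    b_eq = np.array([1.0])
    bounds = [(0.0, None)] * M + [(None, None)] * T
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:M]
```

    Comparing the oracle combination's error with the learned weighting model's error is one way to quantify how much of the attainable ensemble gain the learned weights actually capture.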

    On the Theoretical Properties of Noise Correlation in Stochastic Optimization

    Studying the properties of stochastic noise to optimize complex non-convex functions has been an active area of research in the field of machine learning. Prior work has shown that the noise of stochastic gradient descent improves optimization by overcoming undesirable obstacles in the landscape. Moreover, injecting artificial Gaussian noise has become a popular idea to quickly escape saddle points. Indeed, in the absence of reliable gradient information, noise is used to explore the landscape, but it is unclear what type of noise is optimal in terms of exploration ability. To narrow this gap in our knowledge, we study a general type of continuous-time non-Markovian process, based on fractional Brownian motion, that allows the increments of the process to be correlated. This generalizes processes based on Brownian motion, such as the Ornstein-Uhlenbeck process. We demonstrate how to discretize such processes, which gives rise to the new algorithm fPGD. This method is a generalization of the known algorithms PGD and Anti-PGD. We study the properties of fPGD both theoretically and empirically, demonstrating that it possesses exploration abilities that, in some cases, are favorable over PGD and Anti-PGD. These results open the field to novel ways to exploit noise for training machine learning models.
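    The core ingredient is injecting perturbations whose increments are correlated in time. Below is a minimal sketch of that idea: a Cholesky-based sample of fractional Gaussian noise used inside plain perturbed gradient descent. This is not the paper's exact fPGD discretization; the function names, the jitter term, and the O(n^2) sampler are my own assumptions.

```python
import numpy as np

def fgn_sample(n_steps, hurst, rng):
    """Sample fractional Gaussian noise (increments of fractional Brownian
    motion) via a Cholesky factor of its covariance; O(n^2), for illustration."""
    k = np.arange(n_steps)
    # Autocovariance of fGn with unit time step and Hurst index `hurst`.
    gamma = 0.5 * (np.abs(k + 1) ** (2 * hurst)
                   - 2 * np.abs(k) ** (2 * hurst)
                   + np.abs(k - 1) ** (2 * hurst))
    cov = gamma[np.abs(k[:, None] - k[None, :])]
    L = np.linalg.cholesky(cov + 1e-10 * np.eye(n_steps))
    return L @ rng.standard_normal(n_steps)

def correlated_perturbed_gd(grad, x0, lr, sigma, hurst, n_steps, seed=0):
    """Gradient descent with injected noise whose increments are fGn-correlated."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    # One independent fGn path per coordinate, sampled up front.
    noise = np.stack([fgn_sample(n_steps, hurst, rng) for _ in range(x.size)], axis=1)
    for t in range(n_steps):
        x = x - lr * grad(x) + sigma * noise[t]
    return x
```

    With hurst = 0.5 the increments are independent, recovering PGD-style isotropic perturbations; hurst < 0.5 yields anti-correlated increments in the spirit of Anti-PGD, and hurst > 0.5 yields persistent ones.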

    Learning Robust Statistics for Simulation-based Inference under Model Misspecification

    Simulation-based inference (SBI) methods such as approximate Bayesian computation (ABC), synthetic likelihood, and neural posterior estimation (NPE) rely on simulated summary statistics to infer the parameters of models with intractable likelihoods. However, such methods are known to yield untrustworthy and misleading inference outcomes under model misspecification, thus hindering their widespread applicability. In this work, we propose the first general approach to handle model misspecification that works across different classes of SBI methods. Leveraging the fact that the choice of statistics determines the degree of misspecification in SBI, we introduce a regularized loss function that penalises those statistics that increase the mismatch between the data and the model. Taking NPE and ABC as use cases, we demonstrate the superior performance of our method on high-dimensional time-series models that are artificially misspecified. We also apply our method to real data from the field of radio propagation where the model is known to be misspecified. We show empirically that the method yields robust inference in misspecified scenarios, whilst still being accurate when the model is well-specified.
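    One way to read the proposed regulariser: while learning the summary statistics, penalise any statistic under which simulated and observed data drift apart. The sketch below pairs an NPE-style fit with an MMD penalty on learned statistics; the exact penalty, the estimator, and the stat_net / posterior_log_prob interfaces are assumptions for illustration, not the paper's implementation.

```python
import torch

def mmd_rbf(x, y, bandwidth=1.0):
    """Squared maximum mean discrepancy between two batches of summary
    statistics, with an RBF kernel."""
    def k(a, b):
        return torch.exp(-torch.cdist(a, b) ** 2 / (2 * bandwidth ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

def regularised_sbi_loss(stat_net, posterior_log_prob, theta, x_sim, x_obs, lam=1.0):
    """NPE-style training loss plus a penalty on learned statistics that
    increase the mismatch between simulated and observed data.
    `stat_net` maps raw data to summary statistics; `posterior_log_prob`
    evaluates a conditional density estimator q(theta | s)."""
    s_sim = stat_net(x_sim)   # statistics of simulated data, shape (B, d_s)
    s_obs = stat_net(x_obs)   # statistics of observed data,  shape (B', d_s)
    nll = -posterior_log_prob(theta, s_sim).mean()
    return nll + lam * mmd_rbf(s_sim, s_obs)
```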

    Stochastic Collapse: How Gradient Noise Attracts SGD Dynamics Towards Simpler Subnetworks

    In this work, we reveal a strong implicit bias of stochastic gradient descent (SGD) that drives overly expressive networks to much simpler subnetworks, thereby dramatically reducing the number of independent parameters and improving generalization. To reveal this bias, we identify invariant sets, or subsets of parameter space that remain unmodified by SGD. We focus on two classes of invariant sets that correspond to simpler (sparse or low-rank) subnetworks and commonly appear in modern architectures. Our analysis uncovers that SGD exhibits a property of stochastic attractivity towards these simpler invariant sets. We establish a sufficient condition for stochastic attractivity based on a competition between the loss landscape's curvature around the invariant set and the noise introduced by stochastic gradients. Remarkably, we find that an increased level of noise strengthens attractivity, leading to the emergence of attractive invariant sets associated with saddle points or local maxima of the train loss. We observe empirically the existence of attractive invariant sets in trained deep neural networks, implying that SGD dynamics often collapse to simple subnetworks with either vanishing or redundant neurons. We further demonstrate how this simplifying process of stochastic collapse benefits generalization in a linear teacher-student framework. Finally, through this analysis, we mechanistically explain why early training with large learning rates for extended periods benefits subsequent generalization.

    Comment: 37 pages, 12 figures, NeurIPS 202
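    To make the claim about vanishing or redundant neurons concrete, here is a small diagnostic one could run on a trained layer; the thresholds and names are arbitrary illustrative choices, not the paper's protocol.

```python
import torch

def collapse_diagnostics(linear_layer, zero_tol=1e-3, cos_tol=0.999):
    """Count vanishing and (pairwise) redundant hidden units of a trained
    torch.nn.Linear layer; thresholds are arbitrary illustrative choices."""
    W = linear_layer.weight.detach()           # shape (out_features, in_features)
    norms = W.norm(dim=1)
    vanishing = int((norms < zero_tol).sum())  # units with ~zero incoming weights
    Wn = W / norms.clamp_min(1e-12).unsqueeze(1)
    cos = Wn @ Wn.T                            # cosine similarity between units
    iu = torch.triu_indices(cos.shape[0], cos.shape[1], offset=1)
    redundant = int((cos[iu[0], iu[1]].abs() > cos_tol).sum())
    return vanishing, redundant
```

    Running this on the hidden layers of a network before and after long SGD training would indicate whether the dynamics collapsed towards a sparser or lower-rank subnetwork, in the sense described above.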

    Flatter, faster: scaling momentum for optimal speedup of SGD

    Commonly used optimization algorithms often show a trade-off between good generalization and fast training times. For instance, stochastic gradient descent (SGD) tends to have good generalization; however, adaptive gradient methods have superior training times. Momentum can help accelerate training with SGD, but so far there has been no principled way to select the momentum hyperparameter. Here we study training dynamics arising from the interplay between SGD with label noise and momentum in the training of overparametrized neural networks. We find that scaling the momentum hyperparameter 1 − β with the learning rate to the power of 2/3 maximally accelerates training, without sacrificing generalization. To derive this result analytically we develop an architecture-independent framework, where the main assumption is the existence of a degenerate manifold of global minimizers, as is natural in overparametrized models. Training dynamics display the emergence of two characteristic timescales that are well separated for generic values of the hyperparameters. The maximum acceleration of training is reached when these two timescales meet, which in turn determines the scaling limit we propose. We confirm our scaling rule for synthetic regression problems (matrix sensing and teacher-student paradigm) and classification on realistic datasets (ResNet-18 on CIFAR10, 6-layer MLP on FashionMNIST), suggesting the robustness of our scaling rule to variations in architectures and datasets.

    Comment: v2: expanded introduction section, corrected minor typos. v1: 12+13 pages, 3 figures
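    In practice the proposed rule couples the momentum hyperparameter to the learning rate. A minimal sketch of how one might apply it when configuring an optimizer follows; the constant c and the placeholder model are my own assumptions, and the snippet only applies the scaling, it does not derive it.

```python
import torch

def momentum_from_lr(lr, c=1.0):
    """Proposed scaling: 1 - beta = c * lr**(2/3); the constant c is
    problem-dependent and set to 1 here purely for illustration."""
    return 1.0 - c * lr ** (2.0 / 3.0)

model = torch.nn.Linear(10, 1)   # placeholder model
lr = 0.05
optimizer = torch.optim.SGD(model.parameters(), lr=lr,
                            momentum=momentum_from_lr(lr))
# lr = 0.05 gives momentum ~= 0.864 under this scaling.
```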

    Modelling of the In-Play Football Betting Market

    This thesis is about modelling the in-play football betting market. Our aim is to apply and extend financial mathematical concepts and models to value and risk-manage in-play football bets. We also apply machine learning methods to predict the outcome of the game using in-play indicators. In-play football betting provides a unique opportunity to observe the interplay between a clearly defined fundamental process, that is the game itself, and a market on top of this process, the in-play betting market. This is in contrast with classical finance, where the relationship between the fundamentals and the market is often indirect or unclear due to lack of direct connection, lack of information, and infrequency or delay of information. What makes football betting unique is that the physical fundamentals are well observable because of the existence of rich high-frequency data sets, the games have a limited time horizon of usually 90 minutes, which avoids the build-up of long-term expectations, and the payoff of the traded products is directly linked to the fundamentals.

    In the first part of the thesis we show that a number of results in financial mathematics that have been developed for financial derivatives can be applied to value and risk-manage in-play football bets. In the second part we develop models to predict the outcomes of football games using in-play data.

    First, we show that the concepts of risk-neutral measure, arbitrage-freeness and completeness can also be applied to in-play football betting. This is achieved by assuming a model where the scores of the two teams follow standard Poisson processes with constant intensities. We note that this model is analogous to the Black-Scholes model in many ways. Second, we observe that an implied intensity smile does exist in football betting and we propose the so-called Local Intensity model. This is motivated by the local volatility model from finance, which was the answer to the problem of the implied volatility smile. We show that the counterparts of the Dupire formulae [31] can also be derived in this setting. Third, we propose a Microscopic Model to describe not only the number of goals scored by the two teams, but also two additional variables: the position of the ball and the team holding the ball. We start from a general model where the model parameters are multivariate functions of all the state variables. We then characterise the general parameter surfaces using in-play game data and arrive at a simplified model of only 13 scalar parameters. We show that a semi-analytic method can be used to solve the model, use it to predict scoring intensities for various time intervals in the future, and find that the initial ball position and the team holding the ball are relevant for time intervals of under 30 seconds. Fourth, we consider in-play indicators observed at the end of the first half to predict the number of goals scored during the second half; we refer to this as the First Half Indicators Model. We use various feature selection methods to identify relevant indicators and different machine learning models to predict goal intensities for the second half. In our setting, a linear model with Elastic Net regularisation had the best performance. Fifth, we compare the predictive powers of the Microscopic Model and the First Half Indicators Model and find that the Microscopic Model outperforms the First Half Indicators Model for horizons of under 30 seconds, because this is the time frame where the initial team holding the ball and the initial position of the ball are relevant.
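    As a concrete illustration of the constant-intensity Poisson model used in the first contribution, the sketch below computes fair in-play match-odds probabilities from the current score and the time remaining. The function name, the per-minute intensities, and the goal cap are my own illustrative choices, not the thesis' code.

```python
import numpy as np
from scipy.stats import poisson

def match_odds_probabilities(score_home, score_away, lam_home, lam_away,
                             minutes_left, max_goals=15):
    """Fair in-play home/draw/away probabilities under the constant-intensity
    Poisson model: further goals of each team are independent Poisson counts
    with mean equal to the per-minute intensity times the minutes remaining."""
    k = np.arange(max_goals + 1)
    p_home = poisson.pmf(k, lam_home * minutes_left)
    p_away = poisson.pmf(k, lam_away * minutes_left)
    joint = np.outer(p_home, p_away)       # independence of the two teams
    final_home = score_home + k[:, None]
    final_away = score_away + k[None, :]
    return {
        "home": joint[final_home > final_away].sum(),
        "draw": joint[final_home == final_away].sum(),
        "away": joint[final_home < final_away].sum(),
    }
```

    For example, match_odds_probabilities(1, 0, 0.015, 0.012, 30) values a match-odds bet for a side leading 1-0 with 30 minutes left, under those assumed per-minute scoring intensities.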