10 research outputs found
Parallelizable sparse inverse formulation Gaussian processes (SpInGP)
We propose a parallelizable sparse inverse formulation Gaussian process
(SpInGP) for temporal models. It uses a sparse precision GP formulation and
sparse matrix routines to speed up the computations. Due to the state-space
formulation used in the algorithm, the time complexity of the basic SpInGP is
linear, and because all the computations are parallelizable, the parallel form
of the algorithm is sublinear in the number of data points. We provide example
algorithms to implement the sparse matrix routines and experimentally test the
method using both simulated and real data. Comment: Presented at Machine Learning in Signal Processing (MLSP2017)
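As an illustration of why a state-space (Markov) prior leads to a sparse precision matrix, the sketch below builds the tridiagonal precision of an Ornstein-Uhlenbeck process (exponential covariance) in time linear in the number of points. The choice of kernel, the function names, and the parameters `variance` and `lengthscale` are assumptions made for this example; it is not the SpInGP algorithm itself.

```python
import math


def ou_precision(times, variance=1.0, lengthscale=1.0):
    """Tridiagonal precision matrix of an OU process at sorted, distinct times.

    Assumed model (illustration only):
        x_1 ~ N(0, variance)
        x_{k+1} = a_k * x_k + q_k,  a_k = exp(-dt_k / lengthscale),
        q_k ~ N(0, variance * (1 - a_k**2))
    """
    n = len(times)
    Q = [[0.0] * n for _ in range(n)]
    Q[0][0] = 1.0 / variance  # contribution of the prior on the first state
    for k in range(n - 1):
        a = math.exp(-(times[k + 1] - times[k]) / lengthscale)
        c = variance * (1.0 - a * a)  # process noise variance of this step
        # Each transition only couples neighbouring states, so only the
        # diagonal and first off-diagonals are ever touched.
        Q[k][k] += a * a / c
        Q[k + 1][k + 1] += 1.0 / c
        Q[k][k + 1] -= a / c
        Q[k + 1][k] -= a / c
    return Q


def ou_covariance(times, variance=1.0, lengthscale=1.0):
    """Dense covariance of the same process, used here only as a check."""
    return [[variance * math.exp(-abs(s - t) / lengthscale) for t in times]
            for s in times]


def matmul(A, B):
    """Plain dense matrix product for small verification examples."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]
```

Multiplying `ou_covariance(times)` by `ou_precision(times)` should give the identity up to rounding, while the precision needs only O(n) non-zero entries; sparse routines can exploit this structure.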
Forecasting of commercial sales with large scale Gaussian Processes
This paper argues that there has not been enough discussion of applications of
Gaussian processes in the fast-moving consumer goods industry. Yet this
technique can be important because it can, for example, provide automatic feature
relevance determination, and the posterior mean can unlock insights into the data.
Significant challenges are the large size and high dimensionality of commercial
data at a point of sale. The study reviews approaches to Gaussian process
modeling for large data sets, evaluates their performance on commercial sales
data, and shows the value of this type of model as a decision-making tool for
management. Comment: 10 pages, 5 figures
Ultra-fast Deep Mixtures of Gaussian Process Experts
Mixtures of experts have become an indispensable tool for flexible modelling
in a supervised learning context, and sparse Gaussian processes (GP) have shown
promise as a leading candidate for the experts in such models. In the present
article, we propose to design the gating network for selecting the experts from
such mixtures of sparse GPs using a deep neural network (DNN). This combination
provides a flexible, robust, and efficient model which is able to significantly
outperform competing models. We furthermore consider efficient approaches to
computing maximum a posteriori (MAP) estimators of these models by iteratively
maximizing the distribution of experts given allocations and allocations given
experts. We also show that a recently introduced method called
Cluster-Classify-Regress (CCR) is capable of providing a good approximation of
the optimal solution extremely quickly. This approximation can then be further
refined with the iterative algorithm.
Human motion estimation and controller learning
Humans are capable of complex manipulation and locomotion tasks. They are able to achieve energy-efficient gait, reject disturbances, handle changing loads, and adapt to environmental constraints. Using inspiration from the human body, robotics researchers aim to develop systems with similar capabilities. Research suggests that humans minimize a task specific cost function when performing movements. In order to learn this cost function from demonstrations and incorporate it into a controller, it is first imperative to accurately estimate the expert motion. The captured motions can then be analyzed to extract the objective function the expert was minimizing.
We propose a framework for human motion estimation from wearable sensors. Human body joints are modeled by matrix Lie groups, using special orthogonal groups SO(2) and SO(3) for joint pose and special Euclidean group SE(3) for base link pose representation. To estimate the human joint pose, velocity and acceleration, we provide the equations for employing the extended Kalman Filter on Lie Groups, thus explicitly accounting for the non-Euclidean geometry of the state space. Incorporating interaction constraints with respect to the environment or within the participant allows us to track global body position without an absolute reference and ensure a viable pose estimate. The algorithms are extensively validated in both simulation and real-world experiments.
Next, to learn underlying expert control strategies from the expert demonstrations, we present a novel fast approximate multivariate Gaussian Process regression. The method estimates the underlying cost function without making assumptions on its structure. The computational efficiency of the approach allows for real-time forward horizon prediction. Using a linear model predictive control framework, we then reproduce the demonstrated movements on a robot. The learned cost function captures the variability in expert motion as well as the correlations between states, leading to a controller that both produces motions and reacts to disturbances in a human-like manner. The model predictive control formulation allows the controller to satisfy task and joint space constraints, avoiding obstacles and self collisions, as well as torque constraints, ensuring operational feasibility. The approach is validated on the Franka Emika robot using real human motion exemplars.
Inference using Gaussian processes in animal movement modelling
In recent years, the field of movement ecology has been changed dramatically by the capacity to collect accurate high-frequency telemetry data. In this thesis I present new statistical methods that scale to the very large volumes of data being generated, motivated by the problem of scale dependence in most popular animal movement models. The most widely used movement models in ecology are discrete-time models, in which animals' positions are observed at discrete times. However, discrete-time models do not perform well when data are missing or irregularly sampled. A remedy is to use continuous-time movement models, although these are often difficult to formulate and hard to interpret. In this thesis, I first focus on discrete-time movement models and illustrate through a study one of the problems they pose: the need to specify the discretisation time-step in advance. I then move on to Gaussian processes (GPs), probabilistic methods widely used in the machine learning community, and show that they are equivalent to many continuous-time movement models. Given that the primary goal of machine learning methods is to learn from large-scale datasets, using robust continuous-time movement models such as Gaussian processes is highly advantageous for multiple reasons. These include their flexibility in choosing various covariance functions, their scalability to large datasets and their ability to analyse data, infer parameters of interest and quantify uncertainty within a nonparametric Bayesian approach. I extend the standard Gaussian process into a non-stationary hierarchical Gaussian process, where both the movement process and the dynamic parameters of the movement model are Gaussian processes, which allows for increased flexibility across the wide range of behaviour modes that animals can exhibit.
Throughout this thesis, I implement Gaussian processes on simulated and real tracking data using statistical libraries such as TensorFlow, which provide an accessible way to implement the model and gain access to GPU/HPC-accelerated machine learning libraries. I perform inference using optimisation methods such as maximum-a-posteriori (MAP) estimation, approximate sampling-based inference methods such as Markov Chain Monte Carlo (MCMC), and variational inference methods on both synthetic and real datasets.
Probabilistic Ordinary Differential Equation Solvers - Theory and Applications
Ordinary differential equations are ubiquitous in science and engineering, as they provide mathematical models for many physical processes. However, most practical purposes require the temporal evolution of a particular solution. Many relevant ordinary differential equations are known to lack closed-form solutions in terms of simple analytic functions. Thus, users rely on numerical algorithms to compute discrete approximations.
Numerical methods replace the intractable, and thus inaccessible, solution by an approximating model with known computational strategies. This is akin to a process in statistics where an unknown true relationship is modeled with access to instances of said relationship. One branch of statistics, Bayesian modeling, expresses degrees of uncertainty with probability distributions. In recent years, this idea has gained traction for the design and study of numerical algorithms which established probabilistic numerics as a research field in its own right.
The theory part of this thesis is concerned with bridging the gap between classical numerical methods for ordinary differential equations and probabilistic numerics. To this end, an algorithm is presented based on Gaussian processes, a general and versatile model for Bayesian regression. This algorithm is compared to two standard frameworks for the solution of initial value problems. It is shown that the maximum a-posteriori estimator of certain Gaussian process regressors coincides with certain multistep formulae. Furthermore, a particular initialization scheme based on an improper prior model coincides with a Runge-Kutta method for the first discretization step. This analysis provides a higher-order probabilistic numerical algorithm for initial value problems.
Based on the probabilistic description, an estimator of the local integration error is presented, which is used in a step size adaptation scheme. The completed algorithm is evaluated on a benchmark of initial value problems, confirming empirically the theoretically predicted error rates and displaying particularly efficient performance on domains with low accuracy requirements.
To establish the practical benefit of the probabilistic solution, a probabilistic boundary value problem solver is applied to a medical imaging problem. In tractography, diffusion-weighted magnetic resonance imaging data is used to infer connectivity of neural fibers. The first application of the probabilistic solver shows how the quantification of the discretization error can be used in subsequent estimation of fiber density. The second application additionally incorporates the measurement noise of the imaging data into the
tract estimation model. These two extensions of the shortest-path tractography method give more faithful data, modeling and algorithmic uncertainty representations in neural connectivity studies.
Probabilistic Approaches to Stochastic Optimization
Optimization is a cardinal concept in the sciences, and viable algorithms are of utmost importance as tools
for finding the solution to an optimization problem. Empirical risk minimization is a major workhorse,
in particular in machine learning applications, where an input-target relation is learned in a supervised
manner. Empirical risks with high-dimensional inputs are mostly optimized by greedy, gradient-based,
and possibly stochastic optimization routines, such as stochastic gradient descent.
Though popular and practically successful, this setup has major downsides, which often make it
finicky to work with, or at least the bottleneck in a larger chain of learning procedures. For instance,
typical issues are:
• Overfitting of a parametrized model to the data. This generally leads to poor generalization performance
on unseen data.
• Tuning of algorithmic parameters, such as learning rates, is tedious, inefficient, and costly.
• Stochastic losses and gradients occur due to sub-sampling of a large dataset. They only yield
incomplete, or corrupted information about the empirical risk, and are thus difficult to handle from a
decision making point of view.
This thesis consists of four conceptual parts.
In the first one, we argue that conditional distributions of local full and mini-batch evaluations of
losses and gradients can be well approximated by Gaussian distributions, since the losses themselves
are sums of independently and identically distributed random variables. We then provide a way of
estimating the corresponding sufficient statistics, i.e., variances and means, with low computational
overhead. This yields an analytic likelihood for the loss and gradient at every point of the input space,
which subsequently can be incorporated into active decision making at run-time of the optimizer.
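A minimal sketch of the estimation step described above, assuming the per-example losses of a mini-batch are available. Because the mini-batch loss is an average of i.i.d. terms, its distribution is approximately Gaussian, with a mean estimated by the sample mean and a variance estimated by the sample variance divided by the batch size. The function name and the use of the unbiased sample variance are assumptions of this example, not details taken from the thesis.

```python
def minibatch_loss_stats(per_example_losses):
    """Return (mean, variance_of_mean) for a mini-batch of losses.

    The mini-batch loss is the average of B i.i.d. terms, so the
    variance of that average is the per-example variance divided by B.
    Requires at least two examples.
    """
    b = len(per_example_losses)
    mean = sum(per_example_losses) / b
    # Unbiased estimate of the per-example variance (assumed choice).
    sample_var = sum((l - mean) ** 2 for l in per_example_losses) / (b - 1)
    return mean, sample_var / b


# Example usage with four hypothetical per-example losses.
mean, var_of_mean = minibatch_loss_stats([1.0, 2.0, 3.0, 4.0])
```

The same computation applies element-wise to per-example gradients, which adds little overhead when the individual terms are already computed in the backward pass.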
The second part focuses on estimating generalization performance, not by monitoring a validation
loss, but by assessing if stochastic gradients can be fully explained by noise that occurs due to the
finiteness of the training dataset, and not due to an informative gradient direction of the expected loss
(risk). This yields a criterion for early-stopping where no validation set is needed, and the full dataset
can be used for training.
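One way to make the idea above concrete is sketched below. It is an illustrative test under stated assumptions, not necessarily the exact criterion of the thesis: if the true gradient were zero, each component of the mini-batch gradient mean would have variance equal to the per-example variance divided by the batch size B, so the ratio B * mean^2 / variance would be about 1 on average. A score above zero then suggests that the observed gradient is explainable by sampling noise alone. The function names are hypothetical.

```python
def noise_only_score(per_example_grads):
    """Score > 0 suggests the mini-batch gradient is explainable by noise.

    per_example_grads: list of B gradient vectors (lists of floats), B >= 2.
    Assumes every component has non-zero sample variance.
    """
    b = len(per_example_grads)
    d = len(per_example_grads[0])
    total = 0.0
    for k in range(d):
        column = [g[k] for g in per_example_grads]
        mean = sum(column) / b
        var = sum((x - mean) ** 2 for x in column) / (b - 1)
        # Squared signal relative to the variance of the mean (var / b).
        total += b * mean * mean / var
    return 1.0 - total / d


def should_stop(per_example_grads):
    """Hypothetical early-stopping rule: stop once noise explains the gradient."""
    return noise_only_score(per_example_grads) > 0.0
```

No validation data enters the computation; only the per-example gradients of the training mini-batch are used.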
The third part is concerned with fully automated learning rate adaptation for stochastic gradient descent
(SGD). Global learning rates are arguably the most exposed manual tuning parameters of stochastic
optimization routines. We propose a cheap and self-contained sub-routine, called a "probabilistic
line search", that automatically adapts the learning rate in every step, based on a local probability of
descent. The result is an entirely parameter-free, stochastic optimizer that reaches comparable or better
generalization performance than SGD with a carefully hand-tuned learning rate on the tested problems.
The last part deals with noise-robust search directions. Inspired by classic first- and second-order
methods, we model the unknown dynamics of the gradient or Hessian-function on the optimization
path. The approach has strong connections to classic filtering frameworks and can incorporate noise-corrupted
evaluations of the gradient at successive locations. The benefits are twofold. Firstly, we gain
valuable insight into less accessible or ad-hoc design choices of classic optimizers as special cases. Secondly,
we provide the basis for a flexible, self-contained, and easy-to-use class of stochastic optimizers that
exhibit a higher degree of robustness and automation.