50 research outputs found

    Practical and Rigorous Uncertainty Bounds for Gaussian Process Regression

    Gaussian Process Regression is a popular nonparametric regression method based on Bayesian principles that provides uncertainty estimates for its predictions. However, these estimates are of a Bayesian nature, whereas for some important applications, such as learning-based control with safety guarantees, frequentist uncertainty bounds are required. Although such rigorous bounds are available for Gaussian Processes, they are too conservative to be useful in applications. This often leads practitioners to replace these bounds with heuristics, thereby breaking all theoretical guarantees. To address this problem, we introduce new uncertainty bounds that are rigorous yet practically useful at the same time. In particular, the bounds can be explicitly evaluated and are much less conservative than state-of-the-art results. Furthermore, we show that certain model misspecifications lead to only graceful degradation. We demonstrate these advantages and the usefulness of our results for learning-based control with numerical examples. (Comment: contains supplementary material and corrections to the original version.)
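    As a rough illustration of the kind of bound discussed (not the paper's construction), the sketch below evaluates an interval of the form mean ± β·std around a scikit-learn GP posterior; the scaling factor β = 2.0 is a placeholder where a rigorous frequentist analysis would supply a derived constant.

```python
# Minimal sketch, assuming a generic GP fit: evaluate mean +/- beta * std,
# where beta = 2.0 is a placeholder for a rigorously derived scaling factor.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, WhiteKernel

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(40, 1))
y = np.sin(X).ravel() + 0.1 * rng.standard_normal(40)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0) + WhiteKernel(0.01))
gp.fit(X, y)

X_test = np.linspace(-3, 3, 200).reshape(-1, 1)
mean, std = gp.predict(X_test, return_std=True)

beta = 2.0  # placeholder scaling factor, not the paper's bound
lower, upper = mean - beta * std, mean + beta * std
print(f"max interval width: {np.max(upper - lower):.3f}")
```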

    Bayesian scalar-on-image regression via random image partition models: automatic identification of regions of interest

    Scalar-on-image regression aims to investigate changes in a scalar response of interest based on high-dimensional imaging data. These problems are increasingly prevalent in numerous domains, particularly in biomedical studies, where the aim is to use medical imaging data to capture and study the complex pattern of changes associated with disease in order to improve diagnostic accuracy. The massive dimension of the images, often in the millions of pixels, combined with the modest sample sizes typical of biomedical studies, usually in the hundreds, poses serious challenges. Specifically, scalar-on-image regression belongs to the “large p, small n” paradigm, and hence many models rely on shrinkage methods. However, neighbouring pixels in images are highly correlated, making standard regression methods, even with shrinkage, problematic due to multicollinearity and the high number of nonzero coefficients. We propose a novel Bayesian scalar-on-image regression model that uses the spatial coordinates of the pixels to group pixels with similar effects on the response under a common coefficient, thereby allowing automatic identification of regions of interest in the image for predicting the response. In this thesis, we explore two classes of priors for the spatially dependent partition process, namely Potts-Gibbs random partition models and the Ewens-Pitman attraction (EPA) distribution, and provide a thorough comparison of the models. In addition, Bayesian shrinkage priors are used to identify the covariates and regions that are most relevant for prediction. The proposed model is illustrated on simulated data sets and applied to identify brain regions of interest in Alzheimer’s disease.
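    A minimal sketch of the shared-coefficient idea behind such partition models (the partition labels, block count and data below are hypothetical illustrations, not the thesis model):

```python
# Sketch: pixels assigned to the same partition block share one regression
# coefficient, so an image of p pixels reduces to K block-level features.
import numpy as np

rng = np.random.default_rng(1)
n, side = 100, 16                      # 100 images of 16 x 16 pixels
p = side * side
images = rng.standard_normal((n, p))

# Hypothetical partition: pixel -> block label in {0, ..., K-1}
K = 4
labels = (np.arange(p) * K) // p       # four contiguous blocks, for illustration only

# Block-level design matrix: sum of pixel values within each block
Z = np.stack([images[:, labels == k].sum(axis=1) for k in range(K)], axis=1)

true_beta = np.array([0.0, 1.5, 0.0, -0.8])   # only two blocks affect the response
y = Z @ true_beta + 0.1 * rng.standard_normal(n)

beta_hat, *_ = np.linalg.lstsq(Z, y, rcond=None)
print("estimated block coefficients:", np.round(beta_hat, 2))
```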

    Statistical signatures for adverse events in molecular life sciences

    The ongoing evolution of the computational sciences is helping to address the growing data-analytical needs in applications. For instance, in the biosciences, recent advances in measurement technologies have resulted in large amounts of data with domain-specific properties that are challenging to analyze with traditional statistical methods. An example of such a domain is microbiomics, the study of microbial communities, which, in humans, have been reported to be associated with health and disease. Despite advances in the field, further research is needed, as there is still a lack of understanding of how microbiome data should be processed and of the universal ecological properties of these complex systems. The objective of this thesis is to advance the field of microbiome data science by considering methods for predicting future outcomes based on current information. This is achieved by developing time series methods for complex systems and applying established statistical models in large population cohorts. The thesis consists of two complementary parts. The first part consists of analyses of two prospective human gut microbiome data sets and contains the first ever microbiome-based survival analysis. The second part focuses on the stability properties of dynamical systems. It shows that the Bayesian statistical framework can be used to improve accuracy in inferring stability features, such as systemic resilience and early warning signals for catastrophic state transitions. The results of this thesis contribute to best practices in human microbiome-related data science and demonstrate the advantages of the Bayesian framework in detecting adverse events in limited time series. Although the work was motivated by timely questions in microbiomics, the developed tools are generic and applicable in various contexts.
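    As an illustration of one ingredient mentioned above, the sketch below computes two classical early-warning indicators, rolling variance and lag-1 autocorrelation, on a toy time series; it is a generic example, not the Bayesian approach developed in the thesis.

```python
# Sketch: classical early-warning indicators on a sliding window.
import numpy as np

def rolling_indicators(x, window=50):
    """Return rolling variance and lag-1 autocorrelation over a sliding window."""
    var, ac1 = [], []
    for t in range(window, len(x) + 1):
        w = x[t - window:t]
        var.append(np.var(w))
        ac1.append(np.corrcoef(w[:-1], w[1:])[0, 1])
    return np.array(var), np.array(ac1)

rng = np.random.default_rng(2)
# Toy series whose fluctuations slow down over time (rising memory)
n = 500
phi = np.linspace(0.1, 0.95, n)        # increasing AR(1) coefficient
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi[t] * x[t - 1] + rng.standard_normal()

var, ac1 = rolling_indicators(x)
print(f"variance: {var[0]:.2f} -> {var[-1]:.2f}, lag-1 AC: {ac1[0]:.2f} -> {ac1[-1]:.2f}")
```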

    Statistical learning of random probability measures

    The study of random probability measures is a lively research topic that has attracted interest from different fields in recent years. In this thesis, we consider random probability measures in the context of Bayesian nonparametrics, where the law of a random probability measure is used as a prior distribution, and in the context of distributional data analysis, where the goal is to perform inference given a sample from the law of a random probability measure. The contributions of this thesis can be subdivided into three topics: (i) the use of almost surely discrete repulsive random measures (i.e., measures whose support points are well separated) for Bayesian model-based clustering, (ii) the proposal of new laws for collections of random probability measures for Bayesian density estimation of partially exchangeable data subdivided into different groups, and (iii) the study of principal component analysis and regression models for probability distributions seen as elements of the 2-Wasserstein space. Specifically, for point (i) we propose an efficient Markov chain Monte Carlo algorithm for posterior inference, which sidesteps the need for split-merge reversible jump moves that are typically associated with poor performance; we propose a model for clustering high-dimensional data by introducing a novel class of anisotropic determinantal point processes; and we study the distributional properties of the repulsive measures, shedding light on important theoretical results that enable more principled prior elicitation and more efficient posterior simulation algorithms. For point (ii), we consider several models suitable for clustering homogeneous populations, inducing spatial dependence across groups of data, and extracting the characteristic traits common to all data groups, and we propose a novel vector autoregressive model to study growth curves of Singaporean children. Finally, for point (iii), we propose a novel class of projected statistical methods for distributional data analysis for measures on the real line and on the unit circle.
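    One ingredient of point (iii) can be made concrete: for measures on the real line, the 2-Wasserstein distance reduces to an L2 distance between quantile functions. The sketch below approximates this for two empirical samples; it is an assumption-level illustration, not the thesis's projected methods.

```python
# Sketch: W2 between two empirical measures on the real line via quantile functions.
import numpy as np

def wasserstein2_1d(sample_a, sample_b, grid_size=200):
    """Approximate the 2-Wasserstein distance between two 1D empirical measures."""
    q = np.linspace(0.005, 0.995, grid_size)
    qa = np.quantile(sample_a, q)
    qb = np.quantile(sample_b, q)
    return np.sqrt(np.mean((qa - qb) ** 2))

rng = np.random.default_rng(3)
a = rng.normal(0.0, 1.0, size=1000)
b = rng.normal(1.0, 2.0, size=1000)
print(f"W2(a, b) ~ {wasserstein2_1d(a, b):.3f}")   # close to sqrt(2) for these Gaussians
```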

    Statistical Modelling

    The book collects the proceedings of the 19th International Workshop on Statistical Modelling, held in Florence in July 2004. Statistical modelling is an important cornerstone of many scientific disciplines, and the workshop has provided a rich environment for cross-fertilization of ideas from different disciplines. The proceedings consist of four invited lectures, 48 contributed papers and 47 posters. The contributions are arranged in sessions: Statistical Modelling; Statistical Modelling in Genomics; Semi-parametric Regression Models; Generalized Linear Mixed Models; Correlated Data Modelling; Missing Data, Measurement of Error and Survival Analysis; Spatial Data Modelling; and Time Series and Econometrics.

    Big Data Analytics and Information Science for Business and Biomedical Applications II

    The analysis of big data in biomedical, business and financial research has drawn much attention from researchers worldwide. This collection of articles aims to provide a platform for an in-depth discussion of novel statistical methods developed for the analysis of big data in these areas. Both applied and theoretical contributions are showcased.

    Automated Quality Control for In-Situ Water Temperature Sensors

    The identification of data from outdoor (in-situ) environmental sensors that are not representative of the target subject (bad data) is a topic that has been explored in the past. Many tools (such as data filters and computer models) have succeeded in correctly identifying incorrect data over 95% of the time. However, with the continuous increase in automated data collection, a simple flag on the bad data may no longer give the end user enough information to reduce the amount of time that must be spent on manual quality control. The purpose of this research was to devise and test a data classification technique capable of determining when and why water quality data are incorrect in an environment that experiences seasonal and daily fluctuations. This should reduce or eliminate the need for manual quality control (QC) in a large-volume data system where the range of good data is wide and changes often. The objectives of this project were: training a learning machine that could identify local maximum and minimum values as well as dulled signals, and forming a multi-class classifier that accurately placed sensor temperature data into three categories: good, bad because the temperature probe was exposed to ambient air, and bad because the sensor had become buried in sediment. This involved the development of a model using a Multi-Class Relevance Vector Machine (MCRVM) and the identification of parameters that would remove at least 90% of false negatives for Classes 2 and 3 (the bad data) using only 100 data points from each class for training. These objectives were met using the following methods: (1) manual QC of the water temperature sensor data, (2) an iterative process of selecting model inputs and then optimizing them based on the RVM's performance, and (3) evaluation of the best-performing machines on a small group of data and then on a full year of data.
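    A minimal sketch of the three-class QC setup described above (good / exposed to air / buried in sediment). A relevance vector machine is not available in scikit-learn, so a Gaussian process classifier stands in purely for illustration, and the three features and their distributions are hypothetical.

```python
# Sketch only: stand-in probabilistic classifier, hypothetical features.
import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier

rng = np.random.default_rng(4)
n_per_class = 100                       # mirrors the 100 training points per class

# Hypothetical features: [water temperature, air temperature, daily range]
good    = rng.normal([12.0, 15.0, 1.0], [2.0, 5.0, 0.5], size=(n_per_class, 3))
exposed = rng.normal([15.0, 15.0, 6.0], [5.0, 5.0, 1.5], size=(n_per_class, 3))
buried  = rng.normal([12.0, 15.0, 0.2], [2.0, 5.0, 0.1], size=(n_per_class, 3))

X = np.vstack([good, exposed, buried])
y = np.repeat([0, 1, 2], n_per_class)   # 0 = good, 1 = exposed to air, 2 = buried

clf = GaussianProcessClassifier().fit(X, y)
print("training accuracy:", clf.score(X, y))
```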

    Bayesian Nonparametric Methods for Cyber Security with Applications to Malware Detection and Classification

    The statistical approach to cyber security has become an active and important area of research due to the growing number and threat of cyber attacks perpetrated nowadays. In this thesis, we centre our attention on the Bayesian approach to cyber security, which provides several modelling advantages, such as the flexibility achieved through the probabilistic quantification of uncertainty. In particular, we have found that Bayesian models have mainly been used to detect volume-traffic anomalies, network anomalies and malicious software. To provide a unifying view of these ideas, we first present a thorough review of Bayesian methods applied to cyber security. Bayesian models for detecting malware and classifying it into known malicious classes are one of the cyber security areas discussed in our review. However, and contrary to the detection of traffic and network anomalies, this area has not been widely developed from a Bayesian perspective. That is why we have centred our attention on developing novel supervised-learning Bayesian nonparametric models to detect and classify malware using binary features built directly from the executables’ binary code. For these methods, important theoretical properties and simulation techniques are fully developed, and on real malware data we compare their performance against well-known machine learning models that have been widely applied in this area. With respect to our methodologies, we first present a new discrete nonparametric prior specifically designed for binary data that builds on an elegant nonparametric hierarchical structure, which allows us to study the importance of each individual feature across the groups found in the data. Moreover, due to the large, and possibly redundant, number of features, we have developed a generalised version of the model that allows the introduction of a feature selection step within the inferential learning. Finally, for more complex modelling where there is a need to introduce dependence across the features, we have extended the capabilities of this new class of nonparametric priors by using it as the building block of a latent feature model.
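    A minimal sketch of the kind of binary features described above (presence of byte bigrams in an executable's raw bytes); the bigram vocabulary and the toy byte string are hypothetical, and this is not the thesis's feature pipeline.

```python
# Sketch: 0/1 presence features for a chosen set of byte bigrams.
import numpy as np

def byte_bigram_features(raw_bytes: bytes, vocabulary) -> np.ndarray:
    """Return a 0/1 vector indicating which of the chosen byte bigrams occur."""
    seen = {raw_bytes[i:i + 2] for i in range(len(raw_bytes) - 1)}
    return np.array([1 if bg in seen else 0 for bg in vocabulary], dtype=np.uint8)

# Hypothetical vocabulary; in practice it would be selected from a malware corpus
vocabulary = [b"\x4d\x5a", b"\x00\x00", b"\xff\x15", b"\x8b\xec"]

sample = b"\x4d\x5a\x90\x00\x03\x00\x00\x00\x8b\xec"   # toy byte string
print(byte_bigram_features(sample, vocabulary))
```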

    Estimating Information in Earth System Data with Machine Learning

    Machine learning has made great strides in today's science and engineering in general and in the Earth sciences in particular. However, Earth data pose particularly challenging problems for machine learning due not only to the volume of data, but also to spatial-temporal nonlinear correlations, a wide variety of noise and uncertainty sources, and heterogeneous sources of information. More data does not necessarily imply more information. Therefore, extracting knowledge and information content through data analysis and modeling is crucial, especially in an era where data volume and heterogeneity are steadily increasing. This calls for advances in methods that can quantify information and characterize distributions and uncertainties accurately. Quantifying the information content of our system's data and models remains an unresolved problem in statistics and machine learning. This thesis introduces new machine learning models to extract knowledge and information from Earth observation data. We propose kernel methods, Gaussian processes and multivariate Gaussianization to handle uncertainty and information quantification, and we apply these methods to a wide range of Earth system science problems. These involve many types of learning problems, including classification, regression, density estimation, synthesis, error propagation and the estimation of information-theoretic measures. We also demonstrate how these methods perform with different data sources, including sensory data (radar, multispectral, hyperspectral, infrared sounders), data products (observations, reanalyses and model simulations) and data cubes (aggregates of various spatial-temporal data sources). The presented methodologies allow us to quantify and visualize the salient features driving kernel classifiers, regressors and dependence measures, to better propagate errors and distortions of input data with Gaussian processes, and to locate where and when more information can be found in arbitrary spatial-temporal data cubes. The presented techniques open a wide range of possible use cases and applications, and we anticipate their wider adoption in the Earth and climate sciences.
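    As a concrete example of a kernel dependence measure of the kind used in this line of work, the sketch below implements a biased estimator of the Hilbert-Schmidt Independence Criterion (HSIC) with Gaussian kernels; it is a generic illustration, not the specific estimators developed in the thesis.

```python
# Sketch: biased HSIC estimator, HSIC_b = tr(K H L H) / (n - 1)^2.
import numpy as np

def gaussian_kernel(x, sigma):
    """Gaussian (RBF) kernel matrix for samples stacked row-wise in x."""
    d2 = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def hsic(x, y, sigma=1.0):
    """Biased HSIC estimate between samples x and y (larger = more dependent)."""
    n = x.shape[0]
    K, L = gaussian_kernel(x, sigma), gaussian_kernel(y, sigma)
    H = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    return np.trace(K @ H @ L @ H) / (n - 1) ** 2

rng = np.random.default_rng(5)
x = rng.standard_normal((200, 1))
y_dep = np.sin(3 * x) + 0.1 * rng.standard_normal((200, 1))
y_ind = rng.standard_normal((200, 1))
print(f"HSIC(dependent):   {hsic(x, y_dep):.4f}")
print(f"HSIC(independent): {hsic(x, y_ind):.4f}")
```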

    Uncertainty quantification for an electric motor inverse problem - tackling the model discrepancy challenge

    In the context of complex applications from the engineering sciences, the solution of identification problems still poses a fundamental challenge. In terms of Uncertainty Quantification (UQ), the identification problem can be stated as a task of separating structural model uncertainty from parameter uncertainty. This thesis provides new insights and methods to tackle this challenge and demonstrates these developments on an industrial benchmark use case combining simulation and real-world measurement data. While significant progress has been made in the development of methods for model parameter inference, most of those methods still operate under the assumption of a perfect model. For a full, unbiased quantification of uncertainties in inverse problems, it is crucial to consider all uncertainty sources. The present work develops methods for the inference of deterministic and aleatoric model parameters from noisy measurement data, with explicit consideration of model discrepancy and additional quantification of the associated uncertainties using a Bayesian approach. A further important ingredient is surrogate modeling with Polynomial Chaos Expansion (PCE), enabling sampling from Bayesian posterior distributions with complex simulation models. Based on this, a novel identification strategy for the separation of different sources of uncertainty is presented. The discrepancy is approximated by orthogonal functions, with iterative determination of the optimal model complexity, which mitigates the problem-inherent identifiability issues. The model discrepancy quantification is complemented by studies to statistically approximate the numerical approximation error. Additionally, strategies for approximating aleatoric parameter distributions via hierarchical surrogate-based sampling are developed. The proposed method, based on Approximate Bayesian Computation (ABC) with summary statistics, estimates the posterior in a computationally efficient way, in particular for large data sets. Furthermore, the combination with divergence-based subset selection provides a novel methodology for UQ in stochastic inverse problems, inferring both model discrepancy and aleatoric parameter distributions. Detailed analysis in numerical experiments and successful application to the challenging industrial benchmark problem -- an electric motor test bench -- validates the proposed methods.
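    A minimal sketch of ABC rejection sampling with a summary statistic, the generic idea behind the approach mentioned above; the toy simulator, prior and tolerance are illustrative assumptions, not the thesis's surrogate-based scheme.

```python
# Sketch: keep parameter draws whose simulated summary falls close to the observed one.
import numpy as np

rng = np.random.default_rng(6)

def simulate(theta, n=200):
    """Toy simulator: data drawn around an unknown location theta."""
    return rng.normal(theta, 1.0, size=n)

def summary(data):
    """Summary statistic: here simply the sample mean."""
    return np.mean(data)

observed = simulate(theta=2.0)
s_obs = summary(observed)

accepted = []
for _ in range(20000):
    theta = rng.uniform(-5.0, 5.0)                      # prior draw
    if abs(summary(simulate(theta)) - s_obs) < 0.05:    # tolerance epsilon
        accepted.append(theta)

post = np.array(accepted)
print(f"accepted {post.size} draws, posterior mean ~ {post.mean():.2f}")
```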