190 research outputs found

    Generalised latent variable models for location, scale, and shape parameters

    Get PDF
    Latent Variable Models (LVM) are widely used in the social, behavioural, and educational sciences to uncover underlying associations in multivariate data using a smaller number of latent variables. However, the classical LVM framework rests on assumptions that can be restrictive in empirical applications: in particular, that the observed variables follow distributions from the exponential family and that the latent variables influence only the conditional mean of the observed variables. This thesis addresses these limitations and contributes to the current literature in two ways. First, we propose a novel class of models called Generalised Latent Variable Models for Location, Scale, and Shape parameters (GLVM-LSS). These models use linear functions of the latent factors to model the location, scale, and shape parameters of the items’ conditional distributions. By doing so, we model higher-order moments such as variance, skewness, and kurtosis in terms of the latent variables, providing a more flexible framework than classical factor models. The model parameters are estimated by maximum likelihood. Second, we address the challenge of interpreting the GLVM-LSS, which can be complex due to its increased number of parameters. We propose a penalised maximum likelihood estimation approach with automatic selection of tuning parameters, extending previous work on penalised estimation in the LVM literature to cases without closed-form solutions. Our findings suggest that modelling the entire distribution of the items, not just the conditional mean, leads to improved model fit and deeper insights into how the items reflect the latent constructs they are intended to measure. To assess the performance of the proposed methods, we conduct extensive simulation studies and apply them to real-world data from educational testing and public opinion research. The results highlight the efficacy of the GLVM-LSS framework in capturing complex relationships between observed variables and latent factors, providing valuable insights for researchers in various fields.
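
    To make the modelling idea concrete, the sketch below writes each distributional parameter of an item as a link-transformed linear function of the latent factors. The notation (f_j, g_k, nu, lambda) is assumed for illustration and is not taken from the thesis.

```latex
% Schematic GLVM-LSS structure (illustrative notation, not the thesis's own):
% item y_j has conditional density f_j with parameters theta_{j1},...,theta_{jK}
% (location, scale, shape, ...), each with its own linear predictor in the
% latent variables z_i = (z_{i1},...,z_{iq}).
\[
  y_{ij} \mid \mathbf{z}_i \sim f_j\!\left(y_{ij};\, \theta_{ij1}, \ldots, \theta_{ijK}\right),
  \qquad
  g_k\!\left(\theta_{ijk}\right) \;=\; \nu_{jk} + \boldsymbol{\lambda}_{jk}^{\top} \mathbf{z}_i,
  \quad k = 1, \ldots, K,
\]
% where g_k is a known link (e.g., identity for location, log for scale); the
% classical factor model is the special case in which only the location
% parameter depends on z_i.
```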

    Statistical Machine Learning Methodology for Individualized Treatment Rule Estimation in Precision Medicine

    Get PDF
    Precision medicine aims to deliver optimal, individualized treatments for patients by accounting for their unique characteristics. With a foundation in reinforcement learning, decision theory, and causal inference, the field of precision medicine has seen many advancements in recent years. Significant focus has been placed on creating algorithms to estimate individualized treatment rules (ITRs), which map from patient covariates to the space of available treatments with the goal of maximizing patient outcomes. In Chapter 1, we extend ITR estimation methodology to the scenario where the variance of the outcome is heterogeneous with respect to treatment and covariates. Accordingly, we propose Stabilized Direct Learning (SD-Learning), which exploits heteroscedasticity in the error term through a residual reweighting framework that models residual variance via flexible machine learning algorithms such as XGBoost and random forests. We also develop an internal cross-validation scheme that determines the best residual model among competing models. Further, we extend this methodology to multi-arm treatment scenarios. In Chapter 2, we develop ITR estimation methodology for situations where clinical decision-making involves balancing multiple outcomes of interest. Our proposed framework estimates an ITR that maximizes a combination of the multiple clinical outcomes, accounting for the fact that patients may ascribe importance to outcomes differently (utility heterogeneity). This approach employs inverse reinforcement learning (IRL) techniques through an expert-augmentation solution, whereby physicians provide input to guide the utility estimation and ITR learning processes. In Chapter 3, we apply an end-to-end precision medicine workflow to novel data from older adults with Type 1 Diabetes in order to understand the heterogeneous treatment effects of continuous glucose monitoring (CGM) and develop an interpretable ITR that identifies patients for whom CGM confers a major safety benefit. The results of this analysis elucidate the demographic and clinical markers that moderate CGM's success, provide the basis for using diagnostic CGM to inform therapeutic CGM decisions, and serve to augment clinical decision-making. Finally, in Chapter 4, as a future research direction, we propose a deep autoencoder framework that simultaneously performs feature selection and ITR optimization, contributing methodology built for direct consumption of unstructured, high-dimensional data in the precision medicine pipeline.
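
    As a rough illustration of the residual-reweighting idea in Chapter 1 (not the authors' SD-Learning implementation), the sketch below fits a flexible model to squared residuals and uses the inverse of the estimated variance as sample weights in a simple D-learning-style regression. The simulated data, the constant propensity of 0.5, and the linear decision rule are assumptions made for the example.

```python
# Illustrative sketch of inverse-variance reweighting with a flexible residual
# variance model; a toy stand-in for the stabilization idea, not SD-Learning itself.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n, p = 500, 5
X = rng.normal(size=(n, p))
A = rng.choice([-1, 1], size=n)                     # randomized treatment, P(A=1)=0.5
tau = X[:, 0] - X[:, 1]                             # true treatment effect (toy example)
sigma = 0.5 + np.abs(X[:, 2]) * (A == 1)            # heteroscedastic noise
Y = X[:, 0] + A * tau + sigma * rng.normal(size=n)  # observed outcome

# Step 1: model the conditional mean of Y and form squared residuals.
mean_model = RandomForestRegressor(n_estimators=200, random_state=0)
mean_model.fit(np.column_stack([X, A]), Y)
resid2 = (Y - mean_model.predict(np.column_stack([X, A]))) ** 2

# Step 2: model the residual variance flexibly in (X, A) and invert it as weights.
var_model = RandomForestRegressor(n_estimators=200, random_state=0)
var_model.fit(np.column_stack([X, A]), resid2)
w = 1.0 / np.clip(var_model.predict(np.column_stack([X, A])), 1e-3, None)

# Step 3: weighted D-learning-style regression of 2*A*Y/pi on X; the sign of the
# fitted value gives the estimated treatment rule.
target = 2.0 * A * Y / 0.5
rule = LinearRegression().fit(X, target, sample_weight=w)
recommend = np.sign(rule.predict(X))
print("agreement with sign of true effect:", np.mean(recommend == np.sign(tau)))
```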

    An uncertainty prediction approach for active learning - application to earth observation

    Get PDF
    Mapping land cover and land-use dynamics is crucial in remote sensing, since farmers are encouraged to either intensify or extend crop use due to the ongoing rise in the world’s population. A major issue in this area is interpreting and classifying a scene captured in high-resolution satellite imagery. Several methods have been put forth, including neural networks, which generate data-dependent models (i.e., the model is biased toward the data), and static rule-based approaches with thresholds, which are limited in terms of diversity (i.e., the model lacks diversity in terms of rules). However, the problem of building a machine learning model that, given a large amount of training data, can classify multiple classes over Sentinel-2 imagery from different geographic areas while outperforming existing approaches remains open. On the other hand, supervised machine learning has become an essential part of many areas due to the increasing number of labeled datasets. Examples include classifiers for applications that recognize images and voices, anticipate traffic, propose products, act as virtual personal assistants, and detect online fraud, among many more. Since these classifiers are highly dependent on the training datasets, without human interaction or accurate labels their performance on unseen observations is uncertain. Thus, researchers have attempted to evaluate a number of independent models using a statistical distance. However, the problem of, given a train-test split and classifiers modeled over the training set, identifying a prediction error using the relation between the training and test sets remains open. Moreover, while some training data is essential for supervised machine learning, what happens if there is insufficient labeled data? After all, assigning labels to unlabeled datasets is a time-consuming process that may require significant expert human involvement. When there are not enough expert manual labels available for the vast amount of openly available data, active learning becomes crucial. However, given large training and unlabeled datasets, building an active learning model that can reduce the training cost of the classifier and at the same time assist in labeling new data points remains an open problem. From the experimental approaches and findings, the main research contributions, which concentrate on optical satellite image scene classification, include: building labeled Sentinel-2 datasets with surface reflectance values; machine learning models for pixel-based image scene classification; a statistical-distance-based Evidence Function Model (EFM) to detect ML model misclassification; and a generalised sampling approach for active learning that, together with the EFM, enables a way of determining the most informative examples. Firstly, using a manually annotated Sentinel-2 dataset, Machine Learning (ML) models for scene classification were developed and their performance was compared to Sen2Cor, the reference package from the European Space Agency: a micro-F1 value of 84% was attained by the ML model, a significant improvement over the corresponding Sen2Cor performance of 59%. Secondly, to quantify the misclassification of the ML models, the Mahalanobis distance-based EFM was devised. This model achieved, for the labeled Sentinel-2 dataset, a micro-F1 of 67.89% for misclassification detection.
Lastly, the EFM was engineered as a sampling strategy for active learning, leading to an approach that attains the same level of accuracy as a classifier trained with the full training set while using only 0.02% of the total training samples. With the help of the above-mentioned research contributions, we were able to provide an open-source Sentinel-2 image scene classification package consisting of ready-to-use Python scripts and an ML model that classifies Sentinel-2 L1C images, generating a 20m-resolution RGB image with the six studied classes (Cloud, Cirrus, Shadow, Snow, Water, and Other) and giving academics a straightforward method for rapidly and effectively classifying Sentinel-2 scene images. Additionally, an active learning approach that uses the prediction uncertainty given by the EFM as its sampling strategy allows labeling only the most informative points to be used as input to build classifiers.
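
    The thesis's EFM is not reproduced here, but the sketch below illustrates the general flavour of a Mahalanobis-distance-based uncertainty score used to rank unlabeled points for annotation. The class-centroid formulation, the toy data, and the budget of 20 queries are assumptions for the example.

```python
# Illustrative sketch of Mahalanobis-distance-based sample selection for active
# learning (not the thesis's Evidence Function Model). Points far, in Mahalanobis
# distance, from every class centroid of the labeled set are treated as the most
# informative and queried first.
import numpy as np

rng = np.random.default_rng(0)
X_labeled = rng.normal(size=(200, 4))
y_labeled = (X_labeled[:, 0] + X_labeled[:, 1] > 0).astype(int)  # toy labels
X_pool = rng.normal(size=(1000, 4))                              # unlabeled pool

def mahalanobis_to_classes(X, X_lab, y_lab):
    """Distance of each row of X to the nearest class centroid of the labeled data."""
    dists = []
    for c in np.unique(y_lab):
        Xc = X_lab[y_lab == c]
        mu = Xc.mean(axis=0)
        cov_inv = np.linalg.pinv(np.cov(Xc, rowvar=False))
        diff = X - mu
        dists.append(np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff)))
    return np.min(np.stack(dists, axis=1), axis=1)

scores = mahalanobis_to_classes(X_pool, X_labeled, y_labeled)
budget = 20
query_idx = np.argsort(scores)[-budget:]   # most "surprising" points get labeled next
print("indices to send to the annotator:", query_idx)
```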

    Bayesian Parametric Financial Risk Forecasting Employing Multiple High-Frequency Realized Measures

    Get PDF
    This thesis aims to develop parametric volatility models that utilize multiple high-frequency realized volatility measures to forecast two types of tail risk: Value at Risk (VaR) and Expected Shortfall (ES). An extension of the realized exponential generalized autoregressive conditional heteroskedasticity (realized EGARCH) model is proposed, incorporating standardized Student t and skewed Student t distributions to model the return equation errors. The realized EGARCH model employs robust realized volatility measures: subsampled realized variance (RVSS), subsampled realized range (RRSS), and realized kernel (RK). A Bayesian estimation technique outperforms the maximum likelihood (ML) approach in simulation studies. The proposed models are empirically tested on seven market indices, demonstrating improved accuracy in tail risk prediction. Including RRSS, either individually or jointly with RVSS and/or RK, enhances the forecast performance of the model. A variable selection method based on the Least Absolute Shrinkage and Selection Operator (Lasso) is proposed, employing cross-validation (CV) and Bayesian Lasso (BLasso) approaches. In the empirical study, RVSS, RRSS, RK, and Range variables are incorporated as realized volatility measures. Both the CV and BLasso approaches indicate that RRSS and RVSS provide stronger signals about future volatility than RK and Range, with BLasso resulting in sparser realized EGARCH models. Furthermore, an extension of the realized EGARCH model using a standardized two-sided Weibull distribution for the return error distribution is proposed, with Bayesian estimation producing less biased and more precise parameter estimates than ML estimation. The empirical study demonstrates comparable forecast performance between the realized EGARCH models using standardized two-sided Weibull and standardized skewed Student t distributions, employing RVSS, RRSS, and RK.
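
    For orientation, the display below sketches the generic structure of a realized EGARCH model with K realized measures in the style of Hansen and Huang; the exact parameterization, leverage functions, and error distributions used in the thesis may differ.

```latex
% Generic realized EGARCH structure with K realized measures (schematic). r_t is
% the return, h_t the conditional variance, x_{k,t} the k-th realized measure
% (e.g., RVSS, RRSS, RK).
\begin{align*}
  r_t &= \mu + \sqrt{h_t}\, z_t, \qquad z_t \sim \text{i.i.d.}\,(0,1)
        \ \text{(e.g., standardized Student $t$ or skewed Student $t$)},\\
  \log h_t &= \omega + \beta \log h_{t-1} + \tau(z_{t-1})
              + \boldsymbol{\gamma}^{\top} \mathbf{u}_{t-1},\\
  \log x_{k,t} &= \xi_k + \varphi_k \log h_t + \delta_k(z_t) + u_{k,t},
  \qquad k = 1, \ldots, K,
\end{align*}
% where tau(.) and delta_k(.) are leverage functions (typically quadratic in z_t)
% and u_t = (u_{1,t},...,u_{K,t}) are measurement errors. VaR and ES forecasts
% follow from the predictive return distribution implied by h_{t+1} and z_{t+1}.
```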

    Proc. 33. Workshop Computational Intelligence, Berlin, 23.-24.11.2023

    Get PDF
    The workshop proceedings contain the contributions to the 33rd workshop "Computational Intelligence", which will take place in Berlin from 23 to 24 November 2023. The focus is on methods, applications, and tools for fuzzy systems, artificial neural networks, evolutionary algorithms, and data mining, as well as on the comparison of methods on the basis of industrial and benchmark problems.

    A Novel Embedded Feature Selection Framework for Probabilistic Load Forecasting With Sparse Data via Bayesian Inference

    Get PDF
    With the modernization of the power industry over recent decades, diverse smart technologies have been introduced to power systems. This transition has brought a significant level of variability and uncertainty to the networks, resulting in less predictable electricity demand. In this regard, load forecasting is at the forefront and has become even more challenging. Urgent needs have been raised from different sectors, especially for probabilistic analysis for industrial applications. Hence, attention has shifted from point load forecasting to probabilistic load forecasting (PLF) in recent years. This research proposes a novel embedded feature selection method for PLF to deal with sparse features and thus improve PLF performance. Firstly, the proposed method employs quantile regression to connect the predictor variables to each quantile of the distribution of the load. Thereafter, an embedded feature selection structure is incorporated to identify and select subsets of input features by introducing an inclusion indicator variable for each feature. Then, Bayesian inference is applied to the model with a sparseness-favoring prior placed over the inclusion indicator variables. A Markov chain Monte Carlo (MCMC) approach is adopted to sample the parameters from the posterior. Finally, the samples are used to approximate the posterior distribution: the integrals of interest are approximated by discrete formulas applied to these samples. The proposed approach allows each quantile of the distribution of the dependent load to be affected by a different set of features, and also gives every feature a chance to show its impact on the load. Consequently, this methodology leads to improved estimation of more complex predictive densities. The proposed framework has been successfully applied to a linear model, quantile linear regression, and extended to improve the performance of a nonlinear model. Three case studies were designed to validate the effectiveness of the proposed method. The first case study, performed on an open dataset, validates that the proposed feature selection technique can improve the performance of PLF based on quantile linear regression and outperforms the selected benchmarks. This case study does not consider any recency effect. The second case study further examines the impact of the recency effect using another open dataset which contains historical load and weather records for 10 different regions. The third case study explores the potential of extending the application of the proposed framework to nonlinear models. In this case study, the proposed method is used as a wrapper approach and applied to a nonlinear model. The simulation results show that the proposed method has the best overall performance among all the tested methods, with and without considering the recency effect, and that it can slightly improve the performance of other models when applied as a wrapper approach.
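
    As a schematic of the modelling idea (notation assumed for illustration, not taken from the thesis), each quantile of the load gets its own linear predictor whose features are switched on or off by binary inclusion indicators with a sparsity-favouring prior:

```latex
% Schematic of quantile regression with embedded feature selection via inclusion
% indicators (illustrative notation). Q_{y_t}(tau | x_t) is the tau-th conditional
% quantile of the load; gamma_j^(tau) in {0,1} switches feature j on or off.
\[
  Q_{y_t}(\tau \mid \mathbf{x}_t)
    \;=\; \beta_0^{(\tau)} + \sum_{j=1}^{p} \gamma_j^{(\tau)} \beta_j^{(\tau)} x_{t,j},
  \qquad
  \gamma_j^{(\tau)} \sim \mathrm{Bernoulli}\!\left(\pi^{(\tau)}\right),
\]
% with a sparseness-favoring prior on the inclusion probabilities pi^(tau) (or
% directly on gamma), and MCMC used to sample (beta, gamma) from the posterior;
% different quantiles tau may end up with different selected feature subsets.
```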

    Generalized Matrix Decomposition Regression: Estimation and Inference for Two-way Structured Data

    Full text link
    This paper studies high-dimensional regression with two-way structured data. To estimate the high-dimensional coefficient vector, we propose the generalized matrix decomposition regression (GMDR) to efficiently leverage any auxiliary information on row and column structures. The GMDR extends principal component regression (PCR) to two-way structured data, but unlike PCR, the GMDR selects the components that are most predictive of the outcome, leading to more accurate prediction. For inference on the regression coefficients of individual variables, we propose the generalized matrix decomposition inference (GMDI), a general high-dimensional inferential framework for a large family of estimators that includes the proposed GMDR estimator. GMDI provides more flexibility for modeling relevant auxiliary row and column structures. As a result, GMDI does not require the true regression coefficients to be sparse; it also allows dependent and heteroscedastic observations. We study the theoretical properties of GMDI in terms of both the type-I error rate and power, and demonstrate the effectiveness of GMDR and GMDI in simulation studies and an application to human microbiome data. Comment: 25 pages, 6 figures; accepted by the Annals of Applied Statistics.
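
    The contrast the abstract draws between PCR and GMDR can be sketched as follows; this toy example only illustrates supervised component selection and is not the paper's GMDR, which additionally uses auxiliary row and column structure via a generalized matrix decomposition.

```python
# Illustrative contrast: plain PCR keeps the top-variance components, whereas a
# GMDR-style rule keeps the components most predictive of the outcome.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
n, p, k = 200, 50, 5
X = rng.normal(size=(n, p))
y = X @ rng.normal(scale=0.5, size=p) + rng.normal(size=n)

pca = PCA(n_components=20).fit(X)
Z = pca.transform(X)                                   # component scores

pcr_idx = np.arange(k)                                 # PCR: first k components (largest variance)
corr = np.abs([np.corrcoef(Z[:, j], y)[0, 1] for j in range(Z.shape[1])])
sup_idx = np.argsort(corr)[::-1][:k]                   # supervised: k most outcome-predictive components

for name, idx in [("PCR-style", pcr_idx), ("supervised (GMDR-style)", sup_idx)]:
    r2 = LinearRegression().fit(Z[:, idx], y).score(Z[:, idx], y)
    print(f"{name}: in-sample R^2 = {r2:.3f}")
```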

    Methods in machine learning for probabilistic modelling of environment, with applications in meteorology and geology

    Get PDF
    Earth scientists increasingly deal with ‘big data’. Where once we may have struggled to obtain a handful of relevant measurements, we now often have data being collected from multiple sources: on the ground, in the air, and from space. These observations are accumulating at a rate that far outpaces our ability to make sense of them using traditional methods with limited scalability (e.g., mental modelling, or trial-and-error improvement of process-based models). The revolution in machine learning offers a new paradigm for modelling the environment: rather than focusing on tweaking every aspect of models developed from the top down, based largely on prior knowledge, we can now set up more abstract machine learning systems that ‘do the tweaking for us’, learning models from the bottom up that are optimal in terms of how well they agree with our (rapidly increasing number of) observations of reality, while still being guided by our prior beliefs. In this thesis, with the help of spatial, temporal, and spatio-temporal examples in meteorology and geology, I present methods for probabilistic modelling of environmental variables using machine learning, and explore the considerations involved in developing and adopting these technologies, as well as the potential benefits they stand to bring, which include improved knowledge acquisition and decision-making. In each application, the common theme is that we would like to learn predictive distributions for the variables of interest that are well-calibrated and as sharp as possible (i.e., that provide answers that are as precise as possible while remaining honest about their uncertainty). Achieving this requires the adoption of statistical approaches, but the volume and complexity of the data available mean that scalability is an important factor: we can only realise the value of the available data if it can be successfully incorporated into our models.
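
    The thesis's code is not shown here, but the sketch below illustrates the kind of check implied by "well-calibrated and as sharp as possible": probability integral transform (PIT) values should look uniform for a calibrated forecast, and the average predictive interval width measures sharpness. The Gaussian predictive form and the toy data are assumptions for the example.

```python
# Illustrative check of calibration (PIT uniformity) and sharpness (interval width)
# for Gaussian predictive distributions; a toy stand-in, not the thesis's models.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mu = rng.normal(loc=10.0, scale=3.0, size=5000)   # per-point "true" predictive means
sd = 2.0
y_true = rng.normal(loc=mu, scale=sd)             # held-out observations

pred_mean = mu                                    # a perfectly calibrated toy forecast
pred_sd = np.full_like(mu, sd)

# Calibration: PIT values F(y_true) should be roughly Uniform(0, 1).
pit = stats.norm.cdf(y_true, loc=pred_mean, scale=pred_sd)
hist, _ = np.histogram(pit, bins=10, range=(0.0, 1.0))
print("PIT histogram (should be roughly flat):", hist)

# Sharpness: average width of the central 90% predictive interval (smaller = sharper).
width_90 = (stats.norm.ppf(0.95, pred_mean, pred_sd)
            - stats.norm.ppf(0.05, pred_mean, pred_sd)).mean()
print(f"mean 90% interval width: {width_90:.2f}")
```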

    Learning Physical Models that Can Respect Conservation Laws

    Full text link
    Recent work in scientific machine learning (SciML) has focused on incorporating partial differential equation (PDE) information into the learning process. Much of this work has focused on relatively “easy” PDE operators (e.g., elliptic and parabolic), with less emphasis on relatively “hard” PDE operators (e.g., hyperbolic). Within numerical PDEs, the latter problem class requires control of a type of volume element or conservation constraint, which is known to be challenging. Delivering on the promise of SciML requires seamlessly incorporating both types of problems into the learning process. To address this issue, we propose ProbConserv, a framework for incorporating conservation constraints into a generic SciML architecture. To do so, ProbConserv combines the integral form of a conservation law with a Bayesian update. We provide a detailed analysis of ProbConserv on learning with the Generalized Porous Medium Equation (GPME), a widely applicable parameterized family of PDEs that illustrates the qualitative properties of both easier and harder PDEs. ProbConserv is effective for easy GPME variants, performing competitively with state-of-the-art competitors; for harder GPME variants, it outperforms other approaches that do not guarantee volume conservation. ProbConserv seamlessly enforces physical conservation constraints, maintains probabilistic uncertainty quantification (UQ), and deals well with shocks and heteroscedasticities. In each case, it achieves superior predictive performance on downstream tasks.
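
    To make "the integral form of a conservation law with a Bayesian update" concrete, the display below shows one standard way such a linear constraint can be enforced on a Gaussian belief by conditioning; this is a schematic in assumed notation, and ProbConserv's exact formulation (e.g., any constraint-noise term) may differ.

```latex
% Schematic: enforcing a discretized conservation constraint G u = b on a Gaussian
% belief u ~ N(mu, Sigma) produced by a probabilistic SciML model, via conditioning.
% G encodes the integral (conservation) form, b the known conserved quantity.
\begin{align*}
  \tilde{\boldsymbol{\mu}} &= \boldsymbol{\mu}
     + \Sigma G^{\top}\left(G \Sigma G^{\top}\right)^{-1}\!\left(\mathbf{b} - G\boldsymbol{\mu}\right),\\
  \tilde{\Sigma} &= \Sigma
     - \Sigma G^{\top}\left(G \Sigma G^{\top}\right)^{-1} G \Sigma,
\end{align*}
% so the updated mean satisfies the conservation constraint exactly while the
% updated covariance retains uncertainty in the unconstrained directions.
```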

    Mind the spikes: Benign overfitting of kernels and neural networks in fixed dimension

    Full text link
    The success of over-parameterized neural networks trained to near-zero training error has caused great interest in the phenomenon of benign overfitting, where estimators are statistically consistent even though they interpolate noisy training data. While benign overfitting in fixed dimension has been established for some learning methods, current literature suggests that for regression with typical kernel methods and wide neural networks, benign overfitting requires a high-dimensional setting where the dimension grows with the sample size. In this paper, we show that the smoothness of the estimators, and not the dimension, is the key: benign overfitting is possible if and only if the estimator's derivatives are large enough. We generalize existing inconsistency results to non-interpolating models and more kernels to show that benign overfitting with moderate derivatives is impossible in fixed dimension. Conversely, we show that rate-optimal benign overfitting is possible for regression with a sequence of spiky-smooth kernels with large derivatives. Using neural tangent kernels, we translate our results to wide neural networks. We prove that while infinite-width networks do not overfit benignly with the ReLU activation, this can be fixed by adding small high-frequency fluctuations to the activation function. Our experiments verify that such neural networks, while overfitting, can indeed generalize well even on low-dimensional data sets. Comment: We provide Python code to reproduce all of our experimental results at https://github.com/moritzhaas/mind-the-spike
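
    The fix described in the abstract, adding small high-frequency fluctuations to the activation function, can be sketched as follows; the specific form, with amplitude ε and frequency ω as assumed symbols, is illustrative rather than the paper's exact construction.

```latex
% Schematic "spiky-smooth" modification of an activation (illustrative; the
% paper's exact construction may differ): a smooth part plus a small,
% high-frequency term.
\[
  \tilde{\phi}(x) \;=\; \underbrace{\mathrm{ReLU}(x)}_{\text{smooth part}}
  \;+\; \underbrace{\varepsilon \sin(\omega x)}_{\text{small, high-frequency spikes}},
  \qquad \varepsilon \ll 1,\quad \omega \gg 1,
\]
% so the induced (neural tangent) kernel gains large derivatives: the spiky part
% absorbs the noise locally while the smooth part governs generalization.
```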