Generalised latent variable models for location, scale, and shape parameters
Latent Variable Models (LVM) are widely used in the social, behavioural, and educational sciences to uncover underlying associations in multivariate data using a smaller number of latent variables. However, the classical LVM framework rests on assumptions that can be restrictive in empirical applications: in particular, that the observed variables follow distributions from the exponential family and that the latent variables influence only the conditional mean of the observed variables. This thesis addresses these limitations and contributes to the current literature in two ways. First, we propose a novel class of models called Generalised Latent Variable Models for Location, Scale, and Shape parameters (GLVM-LSS). These models use linear functions of latent factors to model the location, scale, and shape parameters of the items' conditional distributions. By doing so, we model higher-order moments such as variance, skewness, and kurtosis in terms of the latent variables, providing a more flexible framework than classical factor models. The model parameters are estimated by maximum likelihood. Second, we address the challenge of interpreting the GLVM-LSS, which can be complex due to its increased number of parameters. We propose a penalised maximum likelihood estimation approach with automatic selection of tuning parameters, extending previous work on penalised estimation in the LVM literature to cases without closed-form solutions. Our findings suggest that modelling the entire distribution of the items, not just the conditional mean, leads to improved model fit and deeper insights into how the items reflect the latent constructs they are intended to measure. To assess the performance of the proposed methods, we conduct extensive simulation studies and apply them to real-world data from educational testing and public opinion research.
The results highlight the efficacy of the GLVM-LSS framework in capturing complex relationships between observed variables and latent factors, providing valuable insights for researchers in various fields.
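The core location-scale idea of the GLVM-LSS can be sketched numerically. In the following toy example (all loadings, sizes, and variable names are ours, not the thesis's) a single latent factor drives both the mean and the log-scale of each Gaussian item, and the resulting log-likelihood beats a constant-variance factor model on such data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sketch of the GLVM-LSS idea: one latent factor z drives BOTH the
# location (mean) and the log-scale of each item's conditional distribution.
n, p = 500, 3
z = rng.normal(size=n)                                   # latent factor scores

a_mu = np.array([0.0, 0.5, -0.2]); b_mu = np.array([1.0, 0.8, 0.6])   # location loadings
a_ls = np.array([-0.3, 0.0, 0.1]); b_ls = np.array([0.4, -0.2, 0.3])  # log-scale loadings

mu = a_mu + np.outer(z, b_mu)                            # location: linear in z
sd = np.exp(a_ls + np.outer(z, b_ls))                    # scale: log-linear in z
y = rng.normal(mu, sd)                                   # observed items (n x p)

def gaussian_loglik(y, mu, sd):
    """Log-likelihood under a Gaussian location-scale item model."""
    return float(np.sum(-0.5 * np.log(2 * np.pi) - np.log(sd)
                        - 0.5 * ((y - mu) / sd) ** 2))

ll_lss = gaussian_loglik(y, mu, sd)                      # scale modelled via z
ll_classical = gaussian_loglik(y, mu, np.full_like(sd, y.std()))  # constant scale
print(ll_lss > ll_classical)                             # modelling scale improves fit
```

The same construction extends to shape parameters (skewness, kurtosis) by making them linear in the latent factor as well.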
Statistical Machine Learning Methodology for Individualized Treatment Rule Estimation in Precision Medicine
Precision medicine aims to deliver optimal, individualized treatments for patients by accounting for their unique characteristics. With a foundation in reinforcement learning, decision theory, and causal inference, the field of precision medicine has seen many advancements in recent years. Significant focus has been placed on creating algorithms to estimate individualized treatment rules (ITRs), which map from patient covariates to the space of available treatments with the goal of maximizing patient outcome. In Chapter 1, we extend ITR estimation methodology in the scenario where variance of the outcome is heterogeneous with respect to treatment and covariates. Accordingly, we propose Stabilized Direct Learning (SD-Learning), which utilizes heteroscedasticity in the error term through a residual reweighting framework that models residual variance via flexible machine learning algorithms such as XGBoost and random forests. We also develop an internal cross-validation scheme which determines the best residual model among competing models. Further, we extend this methodology to multi-arm treatment scenarios. In Chapter 2, we develop ITR estimation methodology for situations where clinical decision-making involves balancing multiple outcomes of interest. Our proposed framework estimates an ITR which maximizes a combination of the multiple clinical outcomes, accounting for the fact that patients may ascribe importance to outcomes differently (utility heterogeneity). This approach employs inverse reinforcement learning (IRL) techniques through an expert-augmentation solution, whereby physicians provide input to guide the utility estimation and ITR learning processes. 
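The stabilisation idea of Chapter 1 can be sketched in a minimal toy version. The setup and names below are ours, and the inverse-variance weights are a plug-in using the known noise form purely for illustration; the chapter models residual variance with flexible learners such as XGBoost or random forests:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy two-arm trial: A in {-1, +1} randomised 50/50, outcome noise whose scale
# depends on the covariate (heteroscedastic). Regressing A*Y on covariates is
# consistent for the interaction delta(x); reweighting stabilises it.
n = 4000
x = rng.uniform(-1, 1, size=n)
a = rng.choice([-1.0, 1.0], size=n)
delta = 1.0 - 2.0 * x                                    # optimal rule: sign(delta)
y = x + a * delta + rng.normal(scale=0.5 + 2.0 * np.abs(x), size=n)

def direct_learn(x, a, y, w=None):
    """(Weighted) least squares of a*y on (1, x); consistent for delta(x)."""
    X = np.column_stack([np.ones_like(x), x])
    if w is None:
        w = np.ones_like(y)
    sw = np.sqrt(w)
    beta, *_ = np.linalg.lstsq(sw[:, None] * X, sw * (a * y), rcond=None)
    return beta

beta_plain = direct_learn(x, a, y)                       # unweighted baseline
w = 1.0 / (0.5 + 2.0 * np.abs(x)) ** 2                   # inverse residual variance
beta_sd = direct_learn(x, a, y, w=w)                     # stabilised estimate
itr = np.sign(beta_sd[0] + beta_sd[1] * x)               # estimated treatment rule
print(np.round(beta_sd, 2))                              # close to the true (1, -2)
```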
In Chapter 3, we apply an end-to-end precision medicine workflow to novel data from older adults with Type 1 Diabetes in order to understand the heterogeneous treatment effects of continuous glucose monitoring (CGM) and develop an interpretable ITR to reveal patients for whom CGM confers a major safety benefit. The results from this analysis elucidate the demographic and clinical markers which moderate CGM's success, provide the basis for using diagnostic CGM to inform therapeutic CGM decisions, and serve to augment clinical decision-making. Finally, in Chapter 4, as a future research direction, we propose a deep autoencoder framework which simultaneously performs feature selection and ITR optimization, contributing to methodology built for direct consumption of unstructured, high-dimensional data in the precision medicine pipeline.
An uncertainty prediction approach for active learning - application to earth observation
Mapping land cover and land-use dynamics is crucial in remote sensing, since farmers
are encouraged to either intensify or extend crop use due to the ongoing rise in the world's
population. A major issue in this area is interpreting and classifying a scene captured in
high-resolution satellite imagery. Several methods have been put forth, including neural
networks, which generate data-dependent models (i.e., the model is biased toward the data),
and static rule-based approaches with thresholds, which are limited in terms of diversity
(i.e., the model lacks diversity in terms of rules). However, the problem of building a machine
learning model that, given a large amount of training data, can classify multiple classes over
Sentinel-2 imagery from different geographic areas and outperform existing approaches remains open.
On the other hand, supervised machine learning has become an essential part of many
areas due to the increasing number of labeled datasets. Examples include classifiers
for applications that recognize images and voices, anticipate traffic, propose products, act
as virtual personal assistants, and detect online fraud, among many more. Since these
classifiers are highly dependent on their training datasets, their performance on unseen
observations is uncertain without human interaction or accurate labels. Researchers have
therefore attempted to evaluate a number of independent models using a statistical distance.
However, the problem of identifying a prediction error from the relation between the training
and test sets, given a train-test split and classifiers modeled over the training set, remains open.
Moreover, while some training data is essential for supervised machine learning, what
happens if there is insufficient labeled data? After all, assigning labels to unlabeled datasets
is a time-consuming process that may require significant expert human involvement. When
there are not enough expert manual labels available for the vast amount of openly accessible
data, active learning becomes crucial. However, given large amounts of training and
unlabeled data, building an active learning model that can reduce the training cost of
the classifier and at the same time assist in labeling new data points remains an open
problem.
From the experimental approaches and findings, the main research contributions, which
concentrate on the issue of optical satellite image scene classification, include: building
labeled Sentinel-2 datasets with surface reflectance values; the proposal of machine learning
models for pixel-based image scene classification; the proposal of a statistical-distance-based
Evidence Function Model (EFM) to detect ML model misclassification; and the proposal of
a generalised sampling approach for active learning that, together with the EFM, enables
determining the most informative examples.
Firstly, using a manually annotated Sentinel-2 dataset, Machine Learning (ML) models
for scene classification were developed and their performance was compared to Sen2Cor,
the reference package from the European Space Agency: the ML model attained a micro-F1
value of 84%, a significant improvement over the corresponding Sen2Cor performance of 59%.
Secondly, to quantify the misclassification of the ML models, the Mahalanobis-distance-based
EFM was devised. For the labeled Sentinel-2 dataset, this model achieved a micro-F1 of
67.89% for misclassification detection. Lastly, the EFM was engineered as a sampling strategy
for active learning, leading to an approach that attains the same level of accuracy with only
0.02% of the total training samples when compared to a classifier trained with the full training set.
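A minimal sketch of the Mahalanobis-distance idea behind the EFM (toy data and names are ours): score each prediction by the distance between the sample's features and the training distribution of its assigned class, and flag large distances as likely misclassifications.

```python
import numpy as np

rng = np.random.default_rng(2)

def fit_class_stats(X, y):
    """Per-class mean and inverse covariance from the training set."""
    stats = {}
    for c in np.unique(y):
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        cov = np.cov(Xc, rowvar=False) + 1e-6 * np.eye(X.shape[1])  # jitter
        stats[c] = (mu, np.linalg.inv(cov))
    return stats

def mahalanobis(x, mu, cov_inv):
    d = x - mu
    return float(np.sqrt(d @ cov_inv @ d))

# Two well-separated toy "classes" of 4-band reflectance-like features.
X0 = rng.normal(0.2, 0.05, size=(200, 4))
X1 = rng.normal(0.7, 0.05, size=(200, 4))
X = np.vstack([X0, X1]); y = np.r_[np.zeros(200), np.ones(200)]
stats = fit_class_stats(X, y)

inlier = rng.normal(0.2, 0.05, size=4)     # consistent with class 0
outlier = rng.normal(0.7, 0.05, size=4)    # a would-be misclassification as class 0
d_in = mahalanobis(inlier, *stats[0.0])
d_out = mahalanobis(outlier, *stats[0.0])
print(d_in < d_out)                        # large distance flags the error
```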
With the help of the above-mentioned research contributions, we were able to provide
an open-source Sentinel-2 image scene classification package, consisting of ready-to-use
Python scripts and an ML model that classifies Sentinel-2 L1C images, generating a
20 m-resolution RGB image with the six studied classes (Cloud, Cirrus, Shadow, Snow,
Water, and Other) and giving academics a straightforward method for rapidly and effectively
classifying Sentinel-2 scene images. Additionally, an active learning approach that uses the
prediction uncertainty given by the EFM as its sampling strategy allows labeling
only the most informative points to be used as input to build classifiers.
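The sampling step itself reduces to ranking unlabeled points by their EFM uncertainty score and labeling the top of the ranking; a trivial sketch with toy scores of our own:

```python
import numpy as np

def select_most_informative(scores, k):
    """Indices of the k unlabeled points with the largest uncertainty score."""
    return np.argsort(scores)[::-1][:k]

# Toy uncertainty scores for five unlabeled points.
scores = np.array([0.1, 0.9, 0.3, 0.8, 0.2])
print(select_most_informative(scores, 2))   # -> [1 3]
```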
Bayesian Parametric Financial Risk Forecasting Employing Multiple High-Frequency Realized Measures
This thesis aims to develop parametric volatility models that utilize multiple high-frequency realized volatility measures to forecast two types of tail risk: Value at Risk (VaR) and Expected Shortfall (ES). An extension of the realized exponential generalized autoregressive conditional heteroskedasticity model (realized EGARCH) is proposed, incorporating standardized Student t and skewed Student t distributions to model return equation errors. The realized EGARCH model employs robust realized volatility measures: subsampled realized variance (RVSS), subsampled realized range (RRSS), and realized kernel (RK). A Bayesian estimation technique outperforms the maximum likelihood (ML) approach in simulation studies. The proposed models are empirically tested on seven market indices, demonstrating improved accuracy in tail risk prediction. Including RRSS, either individually or jointly with RVSS and/or RK, enhances the forecast performance of the model.
A variable selection method based on the Least Absolute Shrinkage and Selection Operator (Lasso) is proposed, employing cross-validation (CV) and Bayesian Lasso (BLasso) approaches. In the empirical study, RVSS, RRSS, RK, and Range variables are incorporated as realized volatility measures. Both CV and BLasso approaches indicate that RRSS and RVSS provide stronger signals about future volatility compared to RK and Range variables, with BLasso resulting in sparser realized EGARCH models.
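The Lasso-based selection can be illustrated with a toy stand-in (all data, coefficients, and names below are ours; the thesis applies this inside the realized EGARCH with CV and Bayesian Lasso): an L1 penalty keeps the strong volatility signals and zeroes out the weak ones, mirroring how BLasso yields sparser models.

```python
import numpy as np

rng = np.random.default_rng(3)

def lasso_cd(X, y, alpha, n_iter=200):
    """Plain coordinate-descent Lasso (roughly standardised X, no intercept)."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            r = y - X @ beta + X[:, j] * beta[j]      # partial residual
            rho = X[:, j] @ r / n
            z = (X[:, j] ** 2).sum() / n
            beta[j] = np.sign(rho) * max(abs(rho) - alpha, 0.0) / z  # soft-threshold
    return beta

# Toy regression of "tomorrow's volatility" on four realized measures:
# two strong signals (think RVSS, RRSS) and two weak ones (think RK, Range).
n = 1000
strong = rng.normal(size=(n, 2))
weak = rng.normal(size=(n, 2))
X = np.column_stack([strong, weak])
y = strong @ np.array([0.8, 0.6]) + 0.05 * weak[:, 0] + rng.normal(scale=0.5, size=n)

beta = lasso_cd(X, y, alpha=0.15)
print(np.round(beta, 2))   # strong measures keep large coefficients; weak shrink to 0
```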
Furthermore, an extension of the realized EGARCH model using a standardized two-sided Weibull distribution for the return error distribution is proposed, with Bayesian estimation producing less biased and more precise parameter estimates than ML estimation. The empirical study demonstrates comparable forecast performance between the realized EGARCH models using standardized two-sided Weibull and standardized skewed Student t distributions, employing RVSS, RRSS, and RK.
Proc. 33. Workshop Computational Intelligence, Berlin, 23.-24.11.2023
The workshop proceedings contain the contributions to the 33rd workshop "Computational Intelligence", which takes place from 23.11. to 24.11.2023 in Berlin. The focus is on methods, applications, and tools for fuzzy systems, artificial neural networks, evolutionary algorithms, and data mining methods, as well as the comparison of methods on the basis of industrial and benchmark problems.
A Novel Embedded Feature Selection Framework for Probabilistic Load Forecasting With Sparse Data via Bayesian Inference
With the modernization of the power industry over recent decades, diverse smart technologies have been introduced into power systems. This transition has brought a significant level of variability and uncertainty to the networks, resulting in less predictable electricity demand. In this regard, load forecasting stands in the breach and is ever more challenging. Urgent needs have been raised from different sectors, especially for probabilistic analysis in industrial applications. Hence, attention has shifted from point load forecasting to probabilistic load forecasting (PLF) in recent years.
This research proposes a novel embedded feature selection method for PLF to deal with sparse features and thus improve PLF performance. Firstly, the proposed method employs quantile regression to connect the predictor variables to each quantile of the load distribution. Thereafter, an embedded feature selection structure is incorporated to identify and select subsets of input features by introducing an inclusion-indicator variable for each feature. Then, Bayesian inference is applied to the model, with a sparseness-favoring prior placed over the inclusion indicators. A Markov chain Monte Carlo (MCMC) approach is adopted to sample the parameters from the posterior. Finally, the samples are used to approximate the posterior distribution, with discrete formulas applied to the samples to approximate the integrals of interest. The proposed approach allows each quantile of the load distribution to be affected by a different set of features, while giving every feature a chance to show its impact on the load. Consequently, this methodology leads to improved estimation of more complex predictive densities. The proposed framework has been successfully applied to a linear model, quantile linear regression, and extended to improve the performance of a nonlinear model.
Three case studies have been designed to validate the effectiveness of the proposed method. The first case study, performed on an open dataset, validates that the proposed feature selection technique can improve the performance of PLF based on quantile linear regression and outperforms the selected comparable benchmarks; it does not consider any recency effect. The second case study further examines the impact of the recency effect using another open dataset, which contains historical load and weather records for 10 different regions. The third case study explores the potential of extending the proposed framework to nonlinear models: the proposed method is used as a wrapper approach and applied to a nonlinear model. The simulation results show that the proposed method has the best overall performance among all the tested methods with and without considering the recency effect, and that it can slightly improve the performance of other models when applied as a wrapper approach.
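The per-quantile modelling idea can be sketched with a simple frequentist stand-in (toy data and names are ours; the thesis uses Bayesian inclusion indicators with MCMC rather than the plain pinball-loss fit below): fitting each quantile separately lets different quantiles of the load respond differently to the same feature.

```python
import numpy as np

rng = np.random.default_rng(4)

def pinball_fit(X, y, tau, lr=0.1, n_iter=5000):
    """Linear quantile regression by subgradient descent on the pinball loss."""
    n, p = X.shape
    beta = np.zeros(p)
    for _ in range(n_iter):
        u = y - X @ beta
        grad = -X.T @ np.where(u > 0, tau, tau - 1.0) / n
        beta -= lr * grad
    return beta

# Toy load data: the spread of the load grows with "temperature", so upper and
# lower quantiles have different slopes (heteroscedasticity).
n = 2000
temp = rng.uniform(0, 1, size=n)
X = np.column_stack([np.ones(n), temp])
load = 10 + 5 * temp + rng.normal(scale=1 + 3 * temp, size=n)

b_lo = pinball_fit(X, load, 0.1)
b_hi = pinball_fit(X, load, 0.9)
print(round(float(b_lo[1]), 2), round(float(b_hi[1]), 2))  # slopes differ by quantile
```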
Generalized Matrix Decomposition Regression: Estimation and Inference for Two-way Structured Data
This paper studies high-dimensional regression with two-way structured data.
To estimate the high-dimensional coefficient vector, we propose the generalized
matrix decomposition regression (GMDR) to efficiently leverage any auxiliary
information on row and column structures. The GMDR extends the principal
component regression (PCR) to two-way structured data, but unlike PCR, the GMDR
selects the components that are most predictive of the outcome, leading to more
accurate prediction. For inference on regression coefficients of individual
variables, we propose the generalized matrix decomposition inference (GMDI), a
general high-dimensional inferential framework for a large family of estimators
that include the proposed GMDR estimator. GMDI provides more flexibility for
modeling relevant auxiliary row and column structures. As a result, GMDI does
not require the true regression coefficients to be sparse; it also allows
dependent and heteroscedastic observations. We study the theoretical properties
of GMDI in terms of both the type-I error rate and power and demonstrate the
effectiveness of GMDR and GMDI on simulation studies and an application to
human microbiome data. (25 pages, 6 figures; accepted by the Annals of Applied Statistics.)
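The contrast with PCR can be seen in a toy construction of ours (not the paper's GMD machinery): PCR keeps the top-variance components, while an outcome-guided choice keeps the components most predictive of y, which need not be the high-variance ones.

```python
import numpy as np

rng = np.random.default_rng(5)

# Features with distinct variances so the sample PCs align with coordinates;
# the outcome ignores the top-variance axis entirely.
n, p = 300, 10
scales = np.array([5.0, 4.0, 3.0, 2.5, 2.0, 1.5, 1.2, 1.0, 0.8, 0.6])
X = rng.normal(size=(n, p)) * scales
y = X[:, 1] + 0.1 * rng.normal(size=n)        # driven by the SECOND-variance axis

Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = U * s                                 # component scores, variance-ordered

corr = np.abs([np.corrcoef(scores[:, k], y)[0, 1] for k in range(p)])
k_pcr = 0                                      # PCR's first pick: largest variance
k_sup = int(np.argmax(corr))                   # outcome-guided pick
print(k_pcr, k_sup, round(float(corr[k_sup]), 2))  # the two choices differ
```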
Methods in machine learning for probabilistic modelling of environment, with applications in meteorology and geology
Earth scientists increasingly deal with ‘big data’. Where once we may have struggled to obtain a handful of relevant measurements, we now often have data being collected from multiple sources, on the ground, in the air, and from space. These observations are accumulating at a rate that far outpaces our ability to make sense of them using traditional methods with limited scalability (e.g., mental modelling, or trial-and-error improvement of process based models). The revolution in machine learning offers a new paradigm for modelling the environment: rather than focusing on tweaking every aspect of models developed from the top down based largely on prior knowledge, we now have the capability to instead set up more abstract machine learning systems that can ‘do the tweaking for us’ in order to learn models from the bottom up that can be considered optimal in terms of how well they agree with our (rapidly increasing number of) observations of reality, while still being guided by our prior beliefs.
In this thesis, with the help of spatial, temporal, and spatio-temporal examples in meteorology and geology, I present methods for probabilistic modelling of environmental variables using machine learning, and explore the considerations involved in developing and adopting these technologies, as well as the potential benefits they stand to bring, which include improved knowledge acquisition and decision-making. In each application, the common theme is that we would like to learn predictive distributions for the variables of interest that are well-calibrated and as sharp as possible (i.e., to provide answers that are as precise as possible while remaining honest about their uncertainty). Achieving this requires the adoption of statistical approaches, but the volume and complexity of the available data mean that scalability is an important factor: we can only realise the value of the available data if they can be successfully incorporated into our models. (Funded by the Engineering and Physical Sciences Research Council, EPSRC.)
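The phrase "well-calibrated and as sharp as possible" can be made concrete with a toy coverage check (all numbers are ours): a nominal 90% predictive interval should cover about 90% of outcomes, and an overconfident (too-sharp) forecaster's interval covers far less.

```python
import numpy as np

rng = np.random.default_rng(6)

# Outcomes truly distributed as N(0, 1); forecasters issue N(0, sd) predictive
# distributions. Coverage of the central 90% interval diagnoses calibration.
truth = rng.normal(0.0, 1.0, size=5000)

def coverage(sd_forecast, z90=1.645):
    """Empirical coverage of the central 90% interval of a N(0, sd) forecast."""
    return float(np.mean(np.abs(truth) < z90 * sd_forecast))

print(coverage(1.0), coverage(0.5))   # ~0.90 (calibrated) vs far below nominal
```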
Learning Physical Models that Can Respect Conservation Laws
Recent work in scientific machine learning (SciML) has focused on
incorporating partial differential equation (PDE) information into the learning
process. Much of this work has focused on relatively ``easy'' PDE operators
(e.g., elliptic and parabolic), with less emphasis on relatively ``hard'' PDE
operators (e.g., hyperbolic). Within numerical PDEs, the latter problem class
requires control of a type of volume element or conservation constraint, which
is known to be challenging. Delivering on the promise of SciML requires
seamlessly incorporating both types of problems into the learning process. To
address this issue, we propose ProbConserv, a framework for incorporating
conservation constraints into a generic SciML architecture. To do so,
ProbConserv combines the integral form of a conservation law with a Bayesian
update. We provide a detailed analysis of ProbConserv on learning with the
Generalized Porous Medium Equation (GPME), a widely-applicable parameterized
family of PDEs that illustrates the qualitative properties of both easier and
harder PDEs. ProbConserv is effective for easy GPME variants, performing well
relative to state-of-the-art competitors, and for harder GPME variants it outperforms
other approaches that do not guarantee volume conservation. ProbConserv
seamlessly enforces physical conservation constraints, maintains probabilistic
uncertainty quantification (UQ), and deals well with shocks and
heteroscedasticities. In each case, it achieves superior predictive performance
on downstream tasks.
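The heart of the approach, combining the integral form of a conservation law with a Bayesian update, is a linear-Gaussian conditioning step. A minimal sketch on a 1-D grid (our own discretisation and toy uncertainty model, not the paper's code): given an unconstrained Gaussian belief (mu, Sigma) over the solution, condition on the integral constraint g @ u = b in closed form.

```python
import numpy as np

rng = np.random.default_rng(7)

m = 50
x = np.linspace(0, 1, m)
mu = np.sin(np.pi * x) + 0.05 * rng.normal(size=m)   # black-box mean estimate
Sigma = 0.02 * np.eye(m)                             # its (toy) uncertainty

dx = x[1] - x[0]
g = np.full(m, dx)                                   # quadrature weights: g @ u ~ integral of u
b = 2.0 / np.pi                                      # the conserved total "mass"

# Linear-Gaussian conditioning on the constraint g @ u = b:
k = Sigma @ g / (g @ Sigma @ g)                      # gain vector
mu_c = mu + k * (b - g @ mu)                         # corrected mean
Sigma_c = Sigma - np.outer(k, g @ Sigma)             # reduced covariance

print(abs(g @ mu_c - b) < 1e-10)                     # constraint now holds exactly
```

The corrected mean satisfies the conservation constraint exactly while the covariance shrinks, which is the mechanism that lets uncertainty quantification survive the constraint.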
Mind the spikes: Benign overfitting of kernels and neural networks in fixed dimension
The success of over-parameterized neural networks trained to near-zero
training error has caused great interest in the phenomenon of benign
overfitting, where estimators are statistically consistent even though they
interpolate noisy training data. While benign overfitting in fixed dimension
has been established for some learning methods, current literature suggests
that for regression with typical kernel methods and wide neural networks,
benign overfitting requires a high-dimensional setting where the dimension
grows with the sample size. In this paper, we show that the smoothness of the
estimators, and not the dimension, is the key: benign overfitting is possible
if and only if the estimator's derivatives are large enough. We generalize
existing inconsistency results to non-interpolating models and more kernels to
show that benign overfitting with moderate derivatives is impossible in fixed
dimension. Conversely, we show that rate-optimal benign overfitting is possible
for regression with a sequence of spiky-smooth kernels with large derivatives.
Using neural tangent kernels, we translate our results to wide neural networks.
We prove that while infinite-width networks do not overfit benignly with the
ReLU activation, this can be fixed by adding small high-frequency fluctuations
to the activation function. Our experiments verify that such neural networks,
while overfitting, can indeed generalize well even on low-dimensional data
sets. (Python code to reproduce all of our experimental results is available at
https://github.com/moritzhaas/mind-the-spike.)
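The spiky-smooth idea can be illustrated with a toy kernel-ridge example (our own kernel choices, not the paper's exact construction): a broad RBF plus a tiny, very narrow spike lets the estimator interpolate noisy labels while the broad part carries the smooth signal, whereas a smooth kernel with ordinary ridge regularisation does not interpolate.

```python
import numpy as np

rng = np.random.default_rng(8)

def rbf(X1, X2, ls):
    """Gaussian RBF kernel on 1-D inputs with lengthscale ls."""
    return np.exp(-(X1[:, None] - X2[None, :]) ** 2 / (2 * ls ** 2))

x = np.linspace(-1, 1, 30)
y = np.sin(3 * x) + 0.3 * rng.normal(size=30)    # noisy labels

K_smooth = rbf(x, x, 0.3)
K_spiky = K_smooth + 0.1 * rbf(x, x, 1e-3)       # spike acts like 0.1 * identity here

alpha = np.linalg.solve(K_spiky + 1e-10 * np.eye(30), y)
interp_err = np.max(np.abs(K_spiky @ alpha - y))  # ~0: noisy labels interpolated

alpha_s = np.linalg.solve(K_smooth + 0.1 * np.eye(30), y)
smooth_err = np.max(np.abs(K_smooth @ alpha_s - y))  # ridge fit leaves residuals

print(interp_err < 1e-6, smooth_err > 0.01)
```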