EMFlow: Data Imputation in Latent Space via EM and Deep Flow Models
High-dimensional incomplete data can be found in a wide range of systems. Because most data mining techniques and machine learning algorithms require complete observations, data imputation is vital for downstream analysis. In this work, we introduce an imputation approach, called EMFlow, that performs imputation in a latent space via an online version of the Expectation-Maximization (EM) algorithm and connects the latent space and the data space via a normalizing flow (NF). The inference of EMFlow is iterative, alternately updating the parameters of the online EM and of the NF. Extensive experimental results on multivariate and image datasets show that the proposed EMFlow outperforms competing methods in terms of both imputation quality and convergence speed.
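To make the alternating scheme concrete, here is a toy sketch of an EMFlow-style loop: a single affine-coupling layer stands in for the NF, and a running latent mean stands in for the online EM statistics. The function names and all simplifications are ours, not the authors' algorithm.

```python
import torch

class AffineCoupling(torch.nn.Module):
    """A single affine coupling layer: an invertible map between data and latent space."""
    def __init__(self, d):
        super().__init__()
        self.d1 = d // 2
        self.net = torch.nn.Sequential(
            torch.nn.Linear(self.d1, 64), torch.nn.Tanh(),
            torch.nn.Linear(64, 2 * (d - self.d1)))

    def forward(self, x):
        # Data -> latent; also return log|det J| for the likelihood.
        x1, x2 = x[:, :self.d1], x[:, self.d1:]
        s, t = self.net(x1).chunk(2, dim=1)
        return torch.cat([x1, x2 * s.exp() + t], dim=1), s.sum(dim=1)

    def inverse(self, z):
        # Latent -> data.
        z1, z2 = z[:, :self.d1], z[:, self.d1:]
        s, t = self.net(z1).chunk(2, dim=1)
        return torch.cat([z1, (z2 - t) * (-s).exp()], dim=1)

def emflow_impute(x, mask, steps=200, gamma=0.1):
    """x: (n, d) with arbitrary values at missing entries; mask: True = observed."""
    flow = AffineCoupling(x.shape[1])
    opt = torch.optim.Adam(flow.parameters(), lr=1e-3)
    x_imp, mu = x.clone(), torch.zeros(x.shape[1])
    for _ in range(steps):
        # "E-step": update the running latent mean, pull the latents toward it,
        # and map back to data space; observed entries are always kept.
        with torch.no_grad():
            z, _ = flow(x_imp)
            mu = (1 - gamma) * mu + gamma * z.mean(dim=0)
            x_imp = torch.where(mask, x, flow.inverse(z + gamma * (mu - z)))
        # "M-step": one maximum-likelihood step on the flow (standard-normal base).
        z, logdet = flow(x_imp)
        loss = (0.5 * (z ** 2).sum(dim=1) - logdet).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return x_imp
```

A real implementation would stack several coupling layers with permutations and track full latent covariances, but the interleaving of latent-space imputation and flow likelihood updates is the alternating pattern described above.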
Missing Data Imputation and Acquisition with Deep Hierarchical Models and Hamiltonian Monte Carlo
Variational Autoencoders (VAEs) have recently been highly successful at
imputing and acquiring heterogeneous missing data. However, within this
specific application domain, existing VAE methods are restricted by using only
one layer of latent variables and strictly Gaussian posterior approximations.
To address these limitations, we present HH-VAEM, a Hierarchical VAE model for
mixed-type incomplete data that uses Hamiltonian Monte Carlo with automatic
hyper-parameter tuning for improved approximate inference. Our experiments show
that HH-VAEM outperforms existing baselines in the tasks of missing data
imputation and supervised learning with missing features. Finally, we also
present a sampling-based approach for efficiently computing the information
gain when missing features are to be acquired with HH-VAEM. Our experiments
show that this sampling-based approach is superior to alternatives based on
Gaussian approximations.
Comment: Accepted at NeurIPS 2022.
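The sampling-based information-gain computation can be illustrated generically: if the model can (i) sample a candidate missing feature conditioned on the observed ones and (ii) produce a categorical predictive distribution over the target, the expected information gain reduces to a difference of entropies estimated by Monte Carlo. The sketch below is our generic stand-in with hypothetical callables, not the HH-VAEM API.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy along the last axis of a categorical distribution."""
    return -(p * np.log(p + eps)).sum(axis=-1)

def expected_info_gain(sample_feature, predict_y, x_obs, n_samples=100):
    """Estimate H[p(y | x_obs)] - E_{x_i}[ H[p(y | x_obs, x_i)] ] by Monte Carlo.

    sample_feature(x_obs) -> one sample of the candidate feature x_i ~ p(x_i | x_obs)
    predict_y(x_obs, x_i) -> categorical predictive distribution p(y | x_obs, x_i)
    """
    preds = np.stack([predict_y(x_obs, sample_feature(x_obs))
                      for _ in range(n_samples)])   # (n_samples, n_classes)
    # Averaging the sampled predictives recovers p(y | x_obs); the gap between
    # its entropy and the mean conditional entropy is the mutual information.
    return entropy(preds.mean(axis=0)) - entropy(preds).mean()
```

Scoring every candidate missing feature this way and acquiring the argmax is the usual greedy acquisition loop the abstract refers to.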
Networked Time Series Imputation via Position-aware Graph Enhanced Variational Autoencoders
Multivariate time series (MTS) imputation has been a widely studied problem in recent years. Existing methods can be divided into two main groups:
(1) deep recurrent or generative models that primarily focus on time series
features, and (2) graph neural networks (GNNs) based models that utilize the
topological information from the inherent graph structure of MTS as relational
inductive bias for imputation. Nevertheless, these methods either neglect
topological information or assume the graph structure is fixed and accurately
known. Thus, they fail to fully utilize the graph dynamics for precise
imputation in more challenging MTS data such as networked time series (NTS),
where the underlying graph is constantly changing and might have missing edges.
In this paper, we propose a novel approach to overcome these limitations.
First, we define the problem of imputation over NTS which contains missing
values in both node time series features and graph structures. Then, we design
a new model named PoGeVon which leverages variational autoencoder (VAE) to
predict missing values over both node time series features and graph
structures. In particular, we propose a new node position embedding based on random walk with restart (RWR) in the encoder, with provably higher expressive power than message-passing based graph neural networks (GNNs). We
further design a decoder with 3-stage predictions from the perspective of
multi-task learning to impute missing values in both time series and graph
structures reciprocally. Experimental results demonstrate the effectiveness of our model over baselines.
Comment: Accepted at KDD 2023.
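For intuition, a node position embedding based on random walk with restart can be computed in closed form against a set of anchor nodes. The sketch below is our illustration of the general RWR idea (the anchor choice and restart constant are hypothetical), not the PoGeVon implementation.

```python
import numpy as np

def rwr_position_embedding(A, anchors, c=0.15):
    """A: (n, n) adjacency matrix; anchors: list of anchor node indices.
    Returns an (n, len(anchors)) embedding; column j holds the RWR scores of
    all nodes for walks restarting at anchors[j]."""
    n = A.shape[0]
    deg = np.maximum(A.sum(axis=1, keepdims=True), 1e-12)
    P = A / deg                                  # row-stochastic transition matrix
    # Closed form of r = (1 - c) * P^T r + c * e_anchor, solved for each anchor.
    Q = np.linalg.inv(np.eye(n) - (1 - c) * P.T)
    E = np.zeros((n, len(anchors)))
    E[anchors, np.arange(len(anchors))] = 1.0    # one-hot restart vectors
    return c * (Q @ E)                           # (n, n_anchors) embedding
```

Because the scores depend on global graph structure rather than local message passing, two nodes with identical neighbourhoods can still receive distinct embeddings, which is the source of the extra expressive power claimed above.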
Posterior Consistency for Missing Data in Variational Autoencoders
We consider the problem of learning Variational Autoencoders (VAEs), i.e., a
type of deep generative model, from data with missing values. Such data is
omnipresent in real-world applications of machine learning because complete
data is often impossible or too costly to obtain. We particularly focus on
improving a VAE's amortized posterior inference, i.e., the encoder, which in
the case of missing data can be susceptible to learning inconsistent posterior
distributions regarding the missingness. To this end, we provide a formal
definition of posterior consistency and propose an approach for regularizing an
encoder's posterior distribution which promotes this consistency. We observe
that the proposed regularization suggests a different training objective than
that typically considered in the literature when facing missing values.
Furthermore, we empirically demonstrate that our regularization leads to
improved performance in missing value settings in terms of reconstruction
quality and downstream tasks utilizing uncertainty in the latent space. This
improved performance can be observed for many classes of VAEs including VAEs
equipped with normalizing flows.
Comment: First published in ECML PKDD 2023, Proceedings, Part II, by Springer
Nature (https://doi.org/10.1007/978-3-031-43415-0_30). This version of the
work has been extended with the addition of an Appendix, which includes
proofs, the derivation of the posterior regularization, additional background
information on technical topics, an extended related work section, and
additional experimental results.
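A minimal sketch of the kind of consistency regulariser described above, assuming a Gaussian encoder: penalise the divergence between the posteriors obtained from the same sample under its actual mask and under a randomly coarsened mask. The zero-filling and the 0.3 drop rate are our illustrative choices, not the paper's.

```python
import torch

def gaussian_kl(mu1, logvar1, mu2, logvar2):
    """KL( N(mu1, var1) || N(mu2, var2) ), summed over latent dimensions."""
    return 0.5 * (logvar2 - logvar1
                  + (logvar1.exp() + (mu1 - mu2) ** 2) / logvar2.exp()
                  - 1.0).sum(dim=-1)

def consistency_penalty(encoder, x, mask):
    """encoder(x_masked) -> (mu, logvar); mask: True = observed."""
    # Posterior given the actually observed entries ...
    mu_a, lv_a = encoder(torch.where(mask, x, torch.zeros_like(x)))
    # ... and given a randomly coarsened mask (extra simulated missingness).
    sub = mask & (torch.rand_like(x) > 0.3)
    mu_b, lv_b = encoder(torch.where(sub, x, torch.zeros_like(x)))
    return gaussian_kl(mu_b, lv_b, mu_a, lv_a).mean()
```

Added to the usual ELBO, such a penalty pushes the encoder toward posteriors that do not contradict each other across missingness patterns of the same underlying sample.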
How can humans leverage machine learning? From Medical Data Wrangling to Learning to Defer to Multiple Experts
International Mention in the doctoral degree.
The irruption of the smartphone into everyone's life, and the ease with which we digitise or record any data, has produced an explosion in the quantity of data. Smartphones, equipped with advanced cameras and sensors, have empowered individuals to capture moments and contribute to the growing pool of data. This data-rich landscape holds great promise for research, decision-making, and personalized applications. By carefully analyzing and interpreting this wealth of information, valuable insights, patterns, and trends can be uncovered.
However, big data is worthless in a vacuum. Its potential value is unlocked only when leveraged to drive decision-making. In recent times we have witnessed the outburst of artificial intelligence: the development of computer systems and algorithms capable of perceiving, reasoning, learning, and problem-solving, emulating certain aspects of human cognitive abilities. Nevertheless, our focus tends to be limited, merely skimming the surface of the problem, while in reality the application of machine learning models to data is usually fraught with difficulties. More specifically, two crucial pitfalls are frequently neglected in the field of machine learning: the quality of the data and the erroneous assumption that machine learning models operate autonomously. These two issues have established the foundation for the motivation driving this thesis, which strives to offer solutions to two major associated challenges: 1) dealing with irregular observations and 2) learning when and whom we should trust.
The first challenge originates from our observation that the majority of machine learning
research primarily concentrates on handling regular observations, neglecting a crucial technological
obstacle encountered in practical big-data scenarios: the aggregation and curation of heterogeneous
streams of information. Before applying machine learning algorithms, it is crucial to establish
robust techniques for handling big data, as this specific aspect presents a notable bottleneck in
the creation of robust algorithms. Data wrangling, which encompasses the extraction, integration,
and cleaning processes necessary for data analysis, plays a crucial role in this regard. Therefore,
the first objective of this thesis is to tackle the frequently disregarded challenge of addressing
irregularities within the context of medical data. We will focus on three specific aspects. Firstly,
we will tackle the issue of missing data by developing a framework that facilitates the imputation
of missing data points using relevant information derived from alternative data sources or past
observations. Secondly, we will move beyond the assumption of homogeneous observations,
where only one statistical data type (such as Gaussian) is considered, and instead, work with
heterogeneous observations. This means that different data sources can be represented by various
statistical likelihoods, such as Gaussian, Bernoulli, categorical, etc. Lastly, considering the
temporal enrichment of today's collected data and our focus on medical data, we will develop a novel algorithm capable of capturing and propagating correlations among different data streams
over time. All these three problems are addressed in our first contribution which involves the
development of a novel method based on Deep Generative Models (DGM) using Variational
Autoencoders (VAE). The proposed model, the Sequential Heterogeneous Incomplete VAE (Shi-VAE), enables the aggregation of multiple heterogeneous data streams in a modular manner,
taking into consideration the presence of potential missing data. To demonstrate the feasibility
of our approach, we present proof-of-concept results obtained from a real database generated
through continuous passive monitoring of psychiatric patients.
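To illustrate the modular handling of heterogeneous streams, the following sketch pairs each data type with its own likelihood head and drops missing entries from the loss. It is our simplified rendering of the idea, not the Shi-VAE code, and it omits the sequential component.

```python
import torch
import torch.distributions as D

class HeterogeneousDecoder(torch.nn.Module):
    """One likelihood head per data type; missing entries are masked out."""
    def __init__(self, z_dim, n_real, n_binary):
        super().__init__()
        self.real_head = torch.nn.Linear(z_dim, 2 * n_real)  # mean, log-variance
        self.bin_head = torch.nn.Linear(z_dim, n_binary)     # Bernoulli logits

    def log_prob(self, z, x_real, x_bin, mask_real, mask_bin):
        mu, logvar = self.real_head(z).chunk(2, dim=-1)
        ll_real = D.Normal(mu, (0.5 * logvar).exp()).log_prob(x_real)
        ll_bin = D.Bernoulli(logits=self.bin_head(z)).log_prob(x_bin)
        # Sum log-likelihoods over observed entries only (masks are 0/1).
        return (ll_real * mask_real).sum(-1) + (ll_bin * mask_bin).sum(-1)
```

Categorical, count, or other likelihoods would be added as further heads in the same way, which is what makes the aggregation of new streams modular.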
Our second challenge relates to the misbelief that machine learning algorithms can perform
independently. However, the notion that AI systems can solely account for automated decision-making,
especially in critical domains such as healthcare, is far from reality. Our focus now shifts
towards a specific scenario where the algorithm has the ability to make predictions independently
or alternatively defer the responsibility to a human expert. The purpose of including the human is not just to obtain better performance, but also to produce more reliable and trustworthy predictions. In reality, however, important decisions are not made by one person but are usually
committed by an ensemble of human experts. With this in mind, two important questions arise:
1) when should the human or the machine bear responsibility, and 2) among the experts, whom should we trust? To answer the first question, we will employ a recent theory known as Learning
to defer (L2D). In L2D we are not only interested in abstaining from prediction but also in understanding the human's confidence in making such a prediction, thus deferring only when the human is more likely to be correct. The second question, whom to defer to among a pool of experts, has not yet been answered in the L2D literature, and this is what our contributions aim to provide. First, we extend the two consistent surrogate losses proposed so far in the L2D literature to the multiple-expert setting. Second, we study the frameworks' ability to estimate the probability that a given expert predicts correctly and assess whether the two surrogate losses
are confidence calibrated. Finally, we propose a conformal inference technique that chooses a
subset of experts to query when the system defers. Ensembling experts based on confidence
levels is vital to optimize human-machine collaboration.
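As a concrete rendering of the multiple-expert extension, the sketch below generalises a softmax-style L2D surrogate: the model emits K class scores plus one deferral score per expert, and each deferral score is trained on whether that expert was correct. Shapes and names are our assumptions, not the thesis code.

```python
import torch
import torch.nn.functional as F

def multi_expert_l2d_loss(logits, y, expert_preds):
    """logits: (batch, K + E); y: (batch,) true labels in 0..K-1;
    expert_preds: (batch, E) labels predicted by each of the E experts."""
    K = logits.shape[1] - expert_preds.shape[1]
    log_p = F.log_softmax(logits, dim=1)
    loss = -log_p[torch.arange(len(y)), y]           # classification term
    for j in range(expert_preds.shape[1]):
        correct = (expert_preds[:, j] == y).float()  # 1 if expert j is right
        loss = loss - correct * log_p[:, K + j]      # deferral term for expert j
    return loss.mean()
```

At test time one would defer whenever the largest deferral score exceeds the largest class score, routing the input to the corresponding expert; a conformal procedure as described above would instead query the calibrated subset of experts whose estimated correctness clears a held-out quantile threshold.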
In conclusion, this doctoral thesis has investigated two cases where humans can leverage the
power of machine learning: first, as a tool to assist in data wrangling and data understanding
problems and second, as a collaborative tool where decision-making can be automated by the
machine or delegated to human experts, fostering more transparent and trustworthy solutions.
Doctoral Programme in Multimedia and Communications, Universidad Carlos III de Madrid and Universidad Rey Juan Carlos. Committee: Joaquín Míguez Arenas (chair), Juan José Murillo Fuentes (secretary), Mélanie Natividad Fernández Pradier (examiner).
Generative modelling: addressing open problems in model misspecification and differential privacy
Generative modelling has become a popular application of artificial intelligence. Model performance can, however, be impacted negatively when the generative model is misspecified, or when the generative model estimator is modified to adhere to a privacy notion such as differential privacy. In this thesis, we approach generative modelling under model misspecification and differential privacy by presenting four different works.
We first present related work on generative modelling. Subsequently, we delve into the reasons that necessitate an examination of generative modelling under the challenges of model misspecification and differential privacy.
As an initial contribution, we consider generative modelling for density estimation. One way to approach model misspecification is to relax model assumptions. We show that this can also help in nonparametric models. In particular, we study a recently proposed nonparametric quasi-Bayesian density estimator and identify its strong model assumptions as a reason for poor performance in finite data sets. We propose an autoregressive extension relaxing model assumptions to allow for a-priori feature dependencies.
Next, we consider generative modelling for missing-data imputation. After categorising current deep generative imputation approaches into the classes of nonignorable missingness models introduced by Rubin [1976], we extend the formulation of variational autoencoders to factorise according to a nonignorable missingness model class that has not previously been studied in the deep generative modelling literature. The resulting models explicitly capture the missingness mechanism, preventing model misspecification when missingness is not at random.
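For reference, the factorisations in question build on Rubin's selection-model view of missing data, in which the missingness mask m is modelled jointly with the data x; the standard taxonomy is summarised below.

```latex
% Selection-model factorisation (Rubin, 1976): the mask m is generated
% conditionally on the complete data x = (x_obs, x_mis).
\begin{align}
  p_{\theta,\phi}(x_{\mathrm{obs}}, m)
    = \int p_\theta(x_{\mathrm{obs}}, x_{\mathrm{mis}})\,
           p_\phi(m \mid x_{\mathrm{obs}}, x_{\mathrm{mis}})\,
      \mathrm{d}x_{\mathrm{mis}}.
\end{align}
% MCAR: p_phi(m | x) = p_phi(m);  MAR: p_phi(m | x) = p_phi(m | x_obs).
% When neither holds, the missingness is nonignorable (MNAR) and the
% mechanism p_phi(m | x) must be modelled explicitly, as done above.
```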
Then, we turn to improving synthetic data generation under differential privacy. For this purpose, we propose differentially private importance sampling of differentially private synthetic data samples. We observe that the better the generative model, the more importance sampling helps. We then focus on increasing data generation quality by considering differentially private diffusion models, and identify training strategies that significantly improve the performance of DP image generators.
We conclude the dissertation with a discussion, including contributions and limitations of the presented work, and propose potential directions for future work.
Variational learning for inverse problems
Machine learning methods for solving inverse problems require uncertainty estimation to be reliable in real settings. While deep variational models offer a computationally tractable way of recovering complex uncertainties, they need large volumes of supervised data to be trained, which in many practical applications requires prohibitively expensive collection with specific instruments. This thesis introduces two novel frameworks to train variational inference models for inverse problems, in semi-supervised and unsupervised settings respectively.

In the former, a realistic scenario is considered, where few experimentally collected supervised data are available, and analytical models from domain expertise and existing unsupervised data sets are additionally leveraged to solve inverse problems in a semi-supervised fashion. This minimises the supervised data collection requirements and allows the training of effective probabilistic recovery models relatively inexpensively. The method is first evaluated in quantitative simulated experiments, testing performance in various controlled settings against alternative techniques. The framework is then applied in several real-world settings, spanning imaging, astronomy, and human-computer interaction. In each setting, the technique makes use of all available information for training, whether simulations, data, or both, depending on the task, and state-of-the-art recovery and uncertainty estimation are demonstrated with reasonably limited experimental collection effort.

The second framework addresses the challenging unsupervised situation, where no examples of ground truths are available; this type of inverse problem is commonly encountered in data pre-processing and information retrieval. A variational framework is designed to capture the solution space of the inverse problem using solely an estimate of the observation process and large ensembles of observation examples. The unsupervised framework is tested on data recovery tasks under the common setting of missing values and noise, demonstrating superior performance to existing variational methods for imputation and de-noising on different real data sets. Furthermore, higher classification accuracy after imputation is shown, proving the advantage of propagating uncertainty to downstream tasks with the new model.
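A minimal sketch of this unsupervised setting, under our own assumptions (a known masking-plus-Gaussian-noise observation process and assumed names, not the thesis code): a variational model is trained against observations alone by pushing decoded candidate solutions through the estimated observation process.

```python
import torch

class UnsupervisedVAE(torch.nn.Module):
    """VAE trained on observations only, via a known observation model."""
    def __init__(self, d, z_dim=16, h=64):
        super().__init__()
        self.enc = torch.nn.Sequential(
            torch.nn.Linear(d, h), torch.nn.ReLU(), torch.nn.Linear(h, 2 * z_dim))
        self.dec = torch.nn.Sequential(
            torch.nn.Linear(z_dim, h), torch.nn.ReLU(), torch.nn.Linear(h, d))

    def loss(self, y, mask, noise_std=0.1):
        """y: observed data (zeros where missing); mask: True = observed."""
        mu, logvar = self.enc(y).chunk(2, dim=-1)
        z = mu + (0.5 * logvar).exp() * torch.randn_like(mu)  # reparameterise
        x_hat = self.dec(z)                                   # candidate ground truth
        # Likelihood only through the observation process: masking + noise.
        recon = ((y - x_hat) ** 2 / (2 * noise_std ** 2)) * mask
        kl = 0.5 * (mu ** 2 + logvar.exp() - logvar - 1).sum(-1)
        return (recon.sum(-1) + kl).mean()
```

Because no ground-truth x is ever needed, the decoder's samples characterise the solution space consistent with the observations, which is what allows uncertainty to be propagated to downstream tasks such as classification after imputation.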