Clinical microbiology with multi-view deep probabilistic models
Clinical microbiology is one of the critical topics of this century. Identification
and discrimination of microorganisms is considered a global public health
threat by the main international health organisations, such as World Health
Organisation (WHO) or the European Centre for Disease Prevention and Control
(ECDC). Rapid spread, high morbidity and mortality, as well as the economic
burden associated with their treatment and control are the main causes of their
impact. Discrimination of microorganisms is crucial for clinical applications, for
instance, Clostridium difficile (C. diff) increases the mortality and morbidity of
healthcare-related infections. Furthermore, in the past two decades, other bacteria,
including Klebsiella pneumoniae (K. pneumoniae), have demonstrated a significant
propensity to acquire antibiotic resistance mechanisms. Consequently, the use of
an ineffective antibiotic may result in mortality. Machine Learning (ML) has the
potential to be applied in the clinical microbiology field to automatise current
methodologies and provide more efficient guided personalised treatments.
However, microbiological data are challenging to exploit owing to the presence
of a heterogeneous mix of data types, such as real-valued high-dimensional data,
categorical indicators, multilabel epidemiological data, binary targets, or even
time-series data representations. This problem, which in the field of ML is known
as multi-view or multi-modal representation learning, has been studied in other
application fields such as mental health monitoring or haematology. Multi-view
learning combines different modalities or views representing the same data to extract
richer insights and improve understanding. Each modality or view corresponds
to a distinct encoding mechanism for the data, and this dissertation specifically
addresses the issue of heterogeneity across multiple views.
In the probabilistic ML field, the exploitation of multi-view learning is also
known as Bayesian Factor Analysis (FA). Current solutions face limitations when
handling high-dimensional data and non-linear associations. Recent research
proposes deep probabilistic methods to learn hierarchical representations of the data,
which can capture intricate non-linear relationships between features. However,
some Deep Learning (DL) techniques rely on complicated representations, which
can hinder the interpretation of the outcomes. In addition, some inference methods
used in DL approaches can be computationally burdensome, which can hinder their
practical application in real-world situations. Therefore, there is a demand for
more interpretable, explainable, and computationally efficient techniques for high-dimensional
data. By combining multiple views representing the same information, such as
genomic, proteomic, and epidemiologic data, multi-modal representation
learning could provide a better understanding of the microbial world. Hence,
this dissertation proposes two deep probabilistic models that can
handle current limitations in the state of the art of clinical microbiology.
Moreover, both models are also tested in two real scenarios regarding antibiotic
resistance prediction in K. pneumoniae and automatic ribotyping of C. diff in
collaboration with the Instituto de Investigación Sanitaria Gregorio Marañón
(IISGM) and the Instituto Ramón y Cajal de Investigación Sanitaria (IRyCIS).
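As a rough illustration of the multi-view factor analysis idea described above, the following sketch samples two views from one shared latent matrix. It assumes a plain linear Gaussian model; the function name `sample_two_view_fa`, the view sizes, and the noise level are hypothetical choices for illustration and are not taken from the thesis.

```python
import numpy as np

def sample_two_view_fa(n_samples, dims=(5, 3), n_factors=2, noise_std=0.1, seed=0):
    """Sample from a linear two-view factor analysis model (illustrative only).

    Each view m is generated as X_m = Z @ W_m.T + noise, where the latent
    matrix Z is shared by all views and W_m is specific to view m.
    """
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal((n_samples, n_factors))      # shared latent factors
    views, loadings = [], []
    for d in dims:
        W = rng.standard_normal((d, n_factors))          # view-specific loading matrix
        X = Z @ W.T + noise_std * rng.standard_normal((n_samples, d))
        views.append(X)
        loadings.append(W)
    return Z, views, loadings

# Two views of the same 200 samples, with 5 and 3 observed features.
Z, (X1, X2), (W1, W2) = sample_two_view_fa(n_samples=200)
print(X1.shape, X2.shape)
```

Because both views are driven by the same `Z`, a model that infers `Z` from one view can in principle predict the other, which is the intuition behind combining spectra with epidemiological data.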
The first presented algorithm is the Kernelised Sparse Semi-supervised Heterogeneous
Interbattery Bayesian Analysis (KSSHIBA). This algorithm uses a kernelised
formulation to handle non-linear data relationships while providing compact representations
through the automatic selection of relevant vectors. Additionally, it
uses Automatic Relevance Determination (ARD) over the kernel to determine
the relevance of the input features. Then, it is tailored and applied to the
microbiological laboratories of the IISGM and IRyCIS to predict antibiotic resistance
in K. pneumoniae. To do so, specific kernels that handle Matrix-Assisted
Laser Desorption Ionization (MALDI)-Time-Of-Flight (TOF) mass spectrometry
of bacteria are used. Moreover, by exploiting the multi-modal learning between
the spectra and epidemiological information, it outperforms other state-of-the-art
algorithms. Presented results demonstrate the importance of heterogeneous models
that can analyse epidemiological information and can automatically be adjusted for
different data distributions. The implementation of this method in microbiological
laboratories could significantly reduce the time required to obtain resistance results
in 24-72 hours and, moreover, improve patient outcomes.
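To make the idea of ARD over a kernel more concrete, here is a minimal sketch of an RBF kernel with one relevance weight per input feature. This is a generic ARD-weighted kernel, not the specific kernels used in the thesis for MALDI-TOF spectra; the function name `ard_rbf_kernel` and the example values are assumptions made for illustration.

```python
import numpy as np

def ard_rbf_kernel(X, Y, relevance):
    """RBF kernel with one non-negative relevance weight per input feature.

    k(x, y) = exp(-sum_d relevance[d] * (x[d] - y[d])**2)
    A weight near zero means the feature barely affects the kernel value.
    """
    X = np.asarray(X, dtype=float)
    Y = np.asarray(Y, dtype=float)
    relevance = np.asarray(relevance, dtype=float)
    diff = X[:, None, :] - Y[None, :, :]                # (n, m, d) pairwise differences
    sq_dist = np.sum(relevance * diff ** 2, axis=-1)    # relevance-weighted squared distances
    return np.exp(-sq_dist)

X = np.array([[0.0, 0.0], [1.0, 5.0]])
# The second feature has zero relevance, so only the first one drives similarity.
K = ard_rbf_kernel(X, X, relevance=[1.0, 0.0])
print(K)
```

In a Bayesian treatment the relevance weights would be learned from data, so features whose weights shrink towards zero can be read as irrelevant to the task.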
The second algorithm is a hierarchical Variational AutoEncoder (VAE) for
heterogeneous data using an explainable FA latent space, called FA-VAE. The
FA-VAE model is built on the foundation of the successful KSSHIBA approach for
dealing with semi-supervised heterogeneous multi-view problems. This approach
further expands the range of data domains it can handle. With the ability to
work with a wide range of data types, including multilabel, continuous, binary,
categorical, and even image data, the FA-VAE model offers a versatile and powerful
solution for real-world data sets, depending on the VAE architecture. Additionally,
this model is adapted and used in the microbiological laboratory of IISGM, resulting
in an innovative technique for automatic ribotyping of C. diff, using MALDI-TOF
data. To the best of our knowledge, this is the first demonstration of using any
kind of ML for C. diff ribotyping. Experiments have been conducted on strains
of Hospital General Universitario Gregorio Marañón (HGUGM) to evaluate the
viability of the proposed approach. The results have demonstrated high accuracy
rates where KSSHIBA even achieved perfect accuracy in the first data collection.
These models have also been tested in a real-life outbreak scenario at the HGUGM,
where successful classification of all outbreak samples has been achieved by
FA-VAE. The presented results have not only shown high accuracy in predicting
each strain's ribotype but also revealed an explainable latent space. Furthermore,
traditional ribotyping methods, which rely on PCR, required 7 days, while FA-VAE
predicted the same results on the same day. This improvement has significantly
reduced the response time, supporting the decision to isolate patients
with hyper-virulent ribotypes of C. diff on the same day of infection. The promising
results, obtained in a real outbreak, have provided a solid foundation for further
advancements in the field. This study has been a crucial stepping stone towards
realising the full potential of MALDI-TOF for bacterial ribotyping and advancing
our ability to tackle bacterial outbreaks.
In conclusion, this doctoral thesis has significantly contributed to the field of
Bayesian FA by addressing its drawbacks in handling various data types through
the creation of novel models, namely KSSHIBA and FA-VAE. Additionally, a
comprehensive analysis of the limitations of automating laboratory procedures in
the microbiology field has been carried out. The effectiveness of the newly
developed models has been demonstrated through their successful implementation in
critical problems, such as predicting antibiotic resistance and automating ribotyping.
As a result, KSSHIBA and FA-VAE, both in terms of their technical and practical
contributions, signify noteworthy progress both in the clinical and the Bayesian
statistics fields. This dissertation opens up possibilities for future advancements in
automating microbiological laboratories.
Doctoral Programme in Multimedia and Communications, Universidad Carlos III de Madrid and Universidad Rey Juan Carlos. Committee: President: Juan José Murillo Fuentes. Secretary: Jerónimo Arenas García. Member: María de las Mercedes Marín Arriaz
Deep transfer learning for drug response prediction
The goal of precision oncology is to make accurate predictions for cancer patients from the omics data of individual patients. The major challenges for computational drug response prediction are that labeled clinical data are very limited, not publicly available, or cover drug response for only one or two drugs. These challenges have been addressed by generating large-scale pre-clinical datasets such as cancer cell lines or patient-derived xenografts (PDX). These pre-clinical datasets provide multi-omics characterization of samples and are often screened with hundreds of drugs, which makes them viable resources for precision oncology. However, they raise new questions: How can we integrate different data types? How can we handle the discrepancy between pre-clinical and clinical datasets that arises from basic biological differences? And how can we make the best use of unlabeled samples in drug response prediction, where labeling is especially challenging? In this thesis, we propose methods based on deep neural networks to answer these questions. First, we propose a method for multi-omics integration. Second, we propose a transfer learning method to address the discrepancy between cell lines, patients, and PDX models in the input and output space. Finally, we propose a semi-supervised method for out-of-distribution generalization to predict drug response using labeled and unlabeled samples. The proposed methods show promising performance compared to the state of the art and may guide precision oncology more accurately.
Variational autoencoders for tissue heterogeneity exploration from (almost) no preprocessed mass spectrometry imaging data
The paper presents the application of Variational Autoencoders (VAEs) to dimensionality reduction and exploratory analysis of mass spectrometry imaging (MSI) data. The results confirm that VAEs are capable of detecting the patterns associated with the different tissue sub-types with better performance than standard approaches.
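For readers unfamiliar with how a VAE produces a low-dimensional representation, the sketch below shows two standard building blocks: the reparameterised sampling of a latent code and the closed-form KL term against a standard normal prior. It is a generic illustration, not the architecture used in the paper; the function names and the toy shapes are assumptions.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) (reparameterisation trick)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), one value per sample."""
    return 0.5 * np.sum(mu ** 2 + np.exp(log_var) - 1.0 - log_var, axis=-1)

rng = np.random.default_rng(0)
mu = np.zeros((4, 2))        # hypothetical encoder means: 4 spectra, 2 latent dimensions
log_var = np.zeros((4, 2))   # hypothetical encoder log-variances
z = reparameterize(mu, log_var, rng)
kl = kl_to_standard_normal(mu, log_var)
print(z.shape, kl)
```

In a full model, `mu` and `log_var` would come from an encoder network applied to each spectrum, and the KL term would be added to a reconstruction loss during training.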
Coarse-grained modeling for molecular discovery: Applications to cardiolipin-selectivity
The development of novel materials is pivotal for addressing global challenges such as achieving sustainability, technological progress, and advancements in medical technology. Traditionally, developing or designing new molecules was a resource-intensive endeavor, often reliant on serendipity. Given the vast space of chemically feasible drug-like molecules, estimated between 10^6 and 10^100 compounds, traditional in vitro techniques fall short. Consequently, in silico tools such as virtual screening and molecular modeling have gained increasing recognition. However, the computational cost and the limited precision of the utilized molecular models still limit computational molecular design. This thesis aimed to enhance the molecular design process by integrating multiscale modeling and free energy calculations. Employing a coarse-grained model allowed us to efficiently traverse a significant portion of chemical space and reduce the sampling time required by molecular dynamics simulations. The physics-informed nature of the applied Martini force field and its level of retained structural detail make the model a suitable starting point for the focused learning of molecular properties. We applied our proposed approach to a cardiolipin bilayer, posing a relevant and challenging problem and facilitating reasonable comparison to experimental measurements. We identified promising molecules with defined properties within the resolution limit of a coarse-grained representation. Furthermore, we were able to bridge the gap from in silico predictions to in vitro and in vivo experiments, supporting the validity of the theoretical concept. The findings underscore the potential of multiscale modeling and free-energy calculations in enhancing molecular discovery and design and offer a promising direction for future research.
The data concept behind the data: From metadata models and labelling schemes towards a generic spectral library
Spectral libraries play a major role in imaging spectroscopy. They are commonly used to store end-member and spectrally pure material spectra, which are primarily used for mapping or unmixing purposes. However, the development of spectral libraries is time-consuming and usually sensor- and site-dependent. Spectral libraries are therefore often developed, used, and tailored for a single case study and a single sensor. Multi-sensor and multi-site use of spectral libraries is difficult and requires technical effort for adaptation, transformation, and data harmonization steps. In particular, the huge number of urban material specifications and their spectral variations hampers the setup of a complete spectral library containing all available urban material spectra. Combining different urban spectral libraries would improve the coverage of spectral inter- and intra-class variability and could also supply missing material spectra for multi-sensor and multi-site use. Publicly available spectral libraries mostly lack the metadata that is essential for describing spectra acquisition and sampling background, and that can serve to some extent as a measure of the quality and reliability of the spectra and of the library itself. In the GenLib project, a concept for a generic spectral library of image spectra with an urban focus, usable across multiple sites and sensors, was developed. This presentation will introduce 1) a unified, easy-to-understand hierarchical labelling scheme combined with 2) a comprehensive metadata concept that is 3) implemented in the SPECCHIO spectral information system to promote the setup and usability of a generic urban spectral library (GUSL). The labelling scheme was developed to ensure the translation of individual spectral libraries, with their own labelling schemes and usually varying levels of detail, into the GUSL framework. It is based on a modified version of the EAGLE classification concept and combines land use, land cover, land characteristics, and spectral characteristics. The metadata concept consists of 59 mandatory and optional attributes that specify the spatial context, spectral library information, references, accessibility, calibration, preprocessing steps, and spectra-specific information describing the library spectra in the GUSL. It was developed on the basis of existing metadata concepts and was the subject of an expert survey. The metadata concept and the labelling scheme are implemented in the spectral information system SPECCHIO, which is used for sharing and holding GUSL spectra. It allows spectra to be added easily, together with the proposed metadata, to extend the GUSL. The proposed data model therefore represents a first fundamental step towards a generically usable and continuously expandable spectral library for urban areas. The metadata concept and the labelling scheme also form the basis for the adaptation and transformation steps needed to use the GUSL, entirely or in excerpts, for further multi-site and multi-sensor applications.
On Improving Generalization of CNN-Based Image Classification with Delineation Maps Using the CORF Push-Pull Inhibition Operator
Deployed image classification pipelines typically depend on images captured in real-world environments. This means that images might be affected by different sources of perturbation (e.g. sensor noise in low-light environments). The main challenge arises from the fact that image quality directly impacts the reliability and consistency of classification tasks. This challenge has, hence, attracted wide interest within the computer vision community. We propose a transformation step that attempts to enhance the generalization ability of CNN models in the presence of unseen noise in the test set. Concretely, the delineation maps of given images are determined using the CORF push-pull inhibition operator. Such an operation transforms an input image into a space that is more robust to noise before it is processed by a CNN. We evaluated our approach on the Fashion MNIST data set with an AlexNet model. The proposed CORF-augmented pipeline achieved results on noise-free images comparable to those of a conventional AlexNet classification model without CORF delineation maps, and it consistently achieved significantly superior performance on test images perturbed with different levels of Gaussian and uniform noise.
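The evaluation protocol described above, testing on images perturbed with several levels of Gaussian and uniform noise, could be set up roughly as follows. This sketch covers only the perturbation step, not the CORF operator or the AlexNet model; the helper names, the noise levels, and the random stand-in images are assumptions rather than the paper's settings.

```python
import numpy as np

def add_gaussian_noise(images, std, rng):
    """Add zero-mean Gaussian noise and clip back to the valid [0, 1] range."""
    noisy = images + rng.normal(0.0, std, size=images.shape)
    return np.clip(noisy, 0.0, 1.0)

def add_uniform_noise(images, width, rng):
    """Add noise drawn uniformly from [-width, width] and clip to [0, 1]."""
    noisy = images + rng.uniform(-width, width, size=images.shape)
    return np.clip(noisy, 0.0, 1.0)

rng = np.random.default_rng(0)
images = rng.random((8, 28, 28))   # stand-in for a batch of 28x28 grayscale test images

# Build one perturbed copy of the test batch per (hypothetical) noise level.
gaussian_sets = {std: add_gaussian_noise(images, std, rng) for std in (0.05, 0.1, 0.2)}
uniform_sets = {w: add_uniform_noise(images, w, rng) for w in (0.05, 0.1, 0.2)}
print(sorted(gaussian_sets), sorted(uniform_sets))
```

Each perturbed set would then be passed through the trained classifier, with and without the delineation-map preprocessing, to compare accuracy at each noise level.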
Applications
Volume 3 describes how resource-aware machine learning methods and techniques are used to successfully solve real-world problems. The book provides numerous specific application examples: in health and medicine, for risk modelling, diagnosis, and treatment selection for diseases; in electronics, steel production, and milling, for quality control during manufacturing processes; in traffic and logistics, for smart cities; and for mobile communications.